
When AI hallucinates reality – Fallacies of language models and the challenges of recognizing them – Part I.
Large Language Models (LLMs) generate impressive answers, but sometimes confidently assert untruths – this is called hallucination. A recent study shows that automatic detection of these errors is theoretically impossible if the models learn only from correct examples. Yet, in practice, LLMs themselves often evaluate each other’s answers, which only exacerbates the problem.
Artificial intelligence-based chatbots have become increasingly common, and with them has come a new problem: hallucination. Imagine a chatbot claiming with complete confidence that Albert Einstein won the Nobel Prize for Literature in 1923 for his poetic explanation of relativity. The sentence is fluent, it might even sound plausible at first, but in fact it is fiction. Einstein received the 1921 Nobel Prize in Physics, which was awarded to him in 1922, and he gave his Nobel lecture in Gothenburg in 1923, entitled ‘Fundamental Ideas and Problems of the Theory of Relativity’.
In the world of LLMs, this phenomenon is called hallucination: linguistically correct but factually false statements that undermine the reliability of the model and carry serious ethical risks.
Recently, a new study was published on the theoretical impossibility of detecting hallucinations under certain conditions. According to the study, if only positive examples (correct statements) are available during training, then the detection of hallucinations is impossible in principle. The reason is that hallucination detection is equivalent to language identification in the limit, which is known to be impossible to perform with certainty from correct examples alone.
This recognition is particularly worrying, since in current practice it is often LLMs themselves that evaluate the responses of other LLMs. When one language model validates another model’s responses, a ‘closed loop’ is in fact created in which errors are easily perpetuated or even amplified. Because the models rely on the same probabilistic patterns and cannot necessarily validate factual knowledge accurately, this approach further increases the risk of propagating hallucinations.
Suppose that there exists a complete set of true statements, K, and let G denote the set of statements the model generates. If every generated statement is an element of K, there is no problem; if any falls outside it, the system has become disconnected from reality. The theoretical framework proposed by the new study, Hallucination Detection in the Limit, investigates precisely whether there exists a procedure that, after observing the output for long enough, can reliably decide whether G ⊆ K holds. The authors conclude that if the detector learns only from positive examples (i.e. correct statements), automatic detection of hallucinations is not merely theoretically problematic but impossible in principle.
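To make the framing concrete, here is a minimal toy sketch in Python (not taken from the study) of a detector that has only ever seen positive examples, i.e. a finite sample of K. The statements, the sample, and the naive_detector function are all illustrative assumptions; the point is only that a true-but-unseen statement and a hallucination look identical to such a detector.

```python
# Toy illustration of the G ⊆ K framing. In reality K (the set of all true
# statements) is never fully observable; a detector trained only on positive
# examples sees just a finite sample of it.

# A finite sample of true statements the detector has been shown (subset of K).
observed_true = {
    "Einstein received the 1921 Nobel Prize in Physics.",
    "Water boils at 100 °C at sea level.",
}

# Statements generated by a model (the set G).
generated = [
    "Einstein received the 1921 Nobel Prize in Physics.",   # in the sample
    "Einstein gave his Nobel lecture in Gothenburg.",        # true, but unseen
    "Einstein won the 1923 Nobel Prize for Literature.",     # hallucination
]

def naive_detector(statement: str) -> str:
    """Flag anything outside the observed positive examples."""
    return "ok" if statement in observed_true else "suspect"

for s in generated:
    print(naive_detector(s), "->", s)

# The detector cannot tell the second and third statements apart: both are
# "suspect", although only one of them actually lies outside K. This is the
# gap that the impossibility result formalizes.
```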
Concrete numbers show why this is so problematic compared with current practice (letting LLMs evaluate one another). According to the OpenAI o3 and o4-mini system card published in April of this year, the o3 model answered only 49% of the questions on the SimpleQA dataset correctly, while 51% of its answers were classified as hallucinations. The smaller o4-mini performed even worse, with 20% accuracy and a 79% hallucination rate. On the PersonQA test, which contains factual questions about people, o3’s accuracy rose to 59%, but one in three of its statements was still wrong, while o4-mini had a 48% hallucination rate. The system card notes that o3 tends to make more claims overall, which increases the number of true and false statements at the same time – meaning that the hallucination rate is influenced not only by the model’s knowledge base, but also by how talkative it is.
In practice, of course, no one sees the set K directly. That is why engineers in industry take a pragmatic approach to the problem. One of the most common approaches is Retrieval-Augmented Generation (RAG), which adds a retrieval layer to the language model. In response to a user query, RAG searches for relevant documents in real time, for example in corporate knowledge bases or in the scientific literature. The language model is then explicitly instructed to generate its text based on these results. This allows it to provide up-to-date, context-sensitive information without the need for continuous retraining.
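As an illustration, the sketch below shows the basic shape of such a pipeline in Python: a toy keyword-overlap retriever over a hypothetical in-memory knowledge base, and a prompt that instructs the model to answer only from the retrieved context. Everything here (KNOWLEDGE_BASE, retrieve, build_prompt) is an illustrative assumption rather than any particular product’s API; in a real system the retriever would be a vector or full-text index and the finished prompt would be sent to an actual LLM.

```python
# Minimal RAG sketch: toy retriever + prompt assembly over a hypothetical
# in-memory knowledge base.

KNOWLEDGE_BASE = [
    "Einstein was awarded the 1921 Nobel Prize in Physics.",
    "The 1921 Physics prize was handed over to Einstein in 1922.",
    "Einstein gave his Nobel lecture in Gothenburg in 1923.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the query and keep the top k."""
    q_words = set(query.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str) -> str:
    """Instruct the model to answer strictly from the retrieved context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return (
        "Answer using ONLY the context below. "
        "If the context is insufficient, say so.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

print(build_prompt("Which Nobel Prize did Einstein receive?"))
# The resulting string would be passed to the language model of your choice;
# keeping the index fresh updates the answers without retraining the model.
```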
Recent studies show that the use of RAG significantly improves the factual accuracy of language models. For example, in one study, the use of RAG increased the accuracy of models by an average of 39.7%, with the Meta Llama3 70B model achieving the best result with 94% accuracy. The use of RAG also reduced factual inaccuracies by 30%, especially in tasks where information is frequently updated, such as news or policy updates.
The biggest advantage of RAG is that it is easy to keep up to date: just update the indexed database, no need to retrain the model. This is particularly useful in applications where information changes rapidly and models must always use the most up-to-date data.
However, RAG only protects against hallucination if (1) the application’s retrieval layer can find relevant sources and (2) the generator (the LLM) can correctly combine those sources. When both conditions are met, RAG proves to be a powerful tool for improving factual accuracy.
This becomes evident when comparing the performance of RAG-based models across different benchmarks. For example, HotPotQA is an open-domain, multi-hop benchmark where answering a question requires logically linking several independent Wikipedia entries. In this complex setting, RAG-based models achieved only 37% accuracy, highlighting the challenges of multi-hop reasoning. In contrast, on the simpler, single-hop SimpleQA task (where information is more directly accessible and well-structured) the same system reached 74% accuracy.
The second popular direction is prompt engineering. The chain-of-thought technique explicitly asks the model to first lay out its reasoning steps and only then give a conclusion (‘Let’s think step by step…’). This increases internal consistency but does not guarantee that the premises described are true.
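A minimal sketch of what such a prompt can look like; the wording and the cot_prompt helper are purely illustrative, not a standard API:

```python
# Chain-of-thought prompt sketch: the instruction to reason step by step is
# appended to the question. The exact wording is a matter of taste.

def cot_prompt(question: str) -> str:
    return (
        f"Question: {question}\n"
        "Let's think step by step, then state the final answer on its own line."
    )

print(cot_prompt("In which year did Einstein give his Nobel lecture?"))
# The model will lay out intermediate steps, which improves internal
# consistency, but nothing forces those steps to be factually correct.
```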
Self-critical LLMs are also becoming more common. The self-refine strategy asks the same model to improve its own text in a second round, while majority voting runs multiple instances and keeps the most common answer. These tricks reduce the number of surface-level glitches, but – as the recent theoretical work has shown – we are still entrusting an LLM with supervising an LLM, so there is no guarantee in principle that the result is correct.
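The majority-vote idea can be sketched in a few lines. Here sample_answer is a hypothetical stand-in for repeatedly querying an LLM at non-zero temperature, and the canned answers only serve to show that the vote ratifies whatever most samples agree on, whether or not it is true.

```python
from collections import Counter

def majority_vote(question: str, sample_answer, n: int = 5) -> str:
    """Sample the same question n times and keep the most frequent answer."""
    answers = [sample_answer(question) for _ in range(n)]
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common

# Toy usage with canned samples standing in for real model calls.
canned = iter(["1921", "1922", "1921", "1921", "1922"])
print(majority_vote(
    "Which year's Nobel Prize in Physics did Einstein receive?",
    lambda q: next(canned),
))
# If the model's blind spot is systematic, all samples can agree on the same
# wrong fact, and the vote simply ratifies the hallucination.
```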
Why is this not enough? First, the long list of methods above rests largely on empirical tricks, without any formal error bound. Second, during training the models see either their own outputs or external texts that are believed to be correct. This positive-example bias is exactly the weak point the recent theoretical result targets: it is in general impossible to filter out all errors from such examples alone. Third, automatic hallucination detectors are often built on the same model family and architecture as the generator, so if they share a blind spot, neither RAG nor prompt tuning will reveal the error.
In 2023, a New York federal judge fined two lawyers $5,000 in Mata v. Avianca after their ChatGPT-drafted filing cited six precedents that never existed. Another study found that 12.7% of ChatGPT 3.5’s drug and treatment recommendations fell into the “potentially harmful” category. Meanwhile, in Europe, the EU AI Act classifies generative language models as high-risk: the regulation, which comes into force gradually from 2025, makes systematic monitoring of hallucinations mandatory and, in the most serious cases, threatens fines of up to 7% of global annual turnover.
István ÜVEGES, PhD is a Computational Linguist researcher and developer at MONTANA Knowledge Management Ltd. and a researcher at the HUN-REN Centre for Social Sciences. His main interests include the social impacts of Artificial Intelligence (Machine Learning), the nature of Legal Language (legalese), the Plain Language Movement, and sentiment- and emotion analysis.