
When AI hallucinates reality – Fallacies of language models and the challenges of recognizing them – Part II.
In the first part, we showed how Large Language Models can easily produce factually incorrect statements without being noticed. We now explore why automatic detection of these errors is difficult in principle and in practice, and why a purely self-checking approach is inadequate.
Our starting point is again the study mentioned in Part I. The authors first model the problem as a decision game. Assume there is a truth set, K, which can be queried at any time and which contains all the correct statements of a given domain. The output of the language model is a potentially infinite sequence of sentences, G. The task of the detector is to decide whether every element of G is contained in K, i.e. whether the relation G⊆K holds; otherwise, we are talking about a hallucination. The authors prove that this decision problem can be reduced to Angluin’s classic task of language identification in the limit, in which the learner must determine, from positive examples alone, which language (here: fact set) it is observing. Most realistic language classes fail Angluin’s requirement that every language be pinned down by a finite, so-called tell-tale set, so detection from purely positive samples is impossible in general.
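To make the setup concrete, the decision game can be written out as follows. This is a paraphrase in our own notation, together with Angluin’s classical tell-tale condition; it is not the authors’ verbatim formulation.

```latex
% Detection game (paraphrased in our own notation, not the paper's verbatim statement)
% K : the set of all correct statements of the domain (queryable ground truth)
% G : the potentially infinite stream of sentences produced by the model
\[
  K \subseteq \Sigma^{*}, \qquad G = (g_1, g_2, g_3, \dots), \quad g_i \in \Sigma^{*}
\]
% The detector must decide, in the limit, between the two cases
\[
  \text{no hallucination: } G \subseteq K
  \qquad \text{vs.} \qquad
  \text{hallucination: } \exists\, i \;\; g_i \notin K .
\]
% Angluin's condition: a class of languages is identifiable in the limit from
% positive data only if every language L in the class has a finite "tell-tale"
% subset T_L of L such that no other member L' of the class satisfies
\[
  T_L \subseteq L' \subsetneq L .
\]
```

When such finite tell-tale sets do not exist, no finite amount of positive evidence can ever rule out that the observed sentences came from a strictly smaller fact set, and this is the formal core of the impossibility discussed below.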
In 2023, Google Cloud engineers identified five overlapping levels of hallucination (word, sentence, document, cross-domain, and contextual), illustrating where the G⊆K condition can be violated. These levels range from simple local errors (word and sentence level) to complex errors spanning multiple documents or domains. Word-level hallucinations typically manifest as inaccurate synonyms or misunderstood entities. Sentence-level hallucinations are more complex: here the model incorrectly connects individual statements or draws non-existent conclusions. In document-level hallucinations, the model can create fiction spanning entire paragraphs or chapters and present it as fact. Cross-domain hallucinations appear when the model incorrectly connects information from different knowledge domains, for example by conflating legal and medical concepts. The highest level is contextual hallucination, where the model generates erroneous information on the basis of earlier context and ignores the continuity of the discourse. The higher we go on this scale, the less sufficient statistical heuristics become, especially when the detector has no external, negative evidence.
OpenAI’s 2025 o3 / o4-mini system card vividly shows where this limit leads: the smaller o4-mini model achieved only 20% accuracy on the SimpleQA test set while hallucinating 79% of the time; the larger o3 improved accuracy to 49%, but roughly half of its answers still remained wrong. The card also notes that the models err precisely when they are most confident in their predictions. This is a classic overconfidence bias, confirmed by a 2025 study: in such cases, ChatGPT-4 sticks to an “I am confident even in uncertain situations” strategy, which often exacerbates the error.
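This overconfidence can be made measurable with a simple calibration check that compares stated confidence with empirical accuracy. The sketch below is purely illustrative: the sample data are invented and the binned gap metric is our own simplification, not the methodology of the system card or of the cited study.

```python
# Illustrative calibration check (toy metric and invented data, not OpenAI's methodology):
# compare the model's average stated confidence with its empirical accuracy per bin.
from collections import defaultdict

def confidence_accuracy_gap(records, n_bins=10):
    """records: list of (confidence in [0, 1], is_correct) pairs."""
    bins = defaultdict(list)
    for conf, correct in records:
        idx = min(int(conf * n_bins), n_bins - 1)   # bin answers by stated confidence
        bins[idx].append((conf, correct))
    report = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_conf = sum(c for c, _ in pairs) / len(pairs)
        accuracy = sum(ok for _, ok in pairs) / len(pairs)
        report.append((mean_conf, accuracy, mean_conf - accuracy))  # positive gap = overconfidence
    return report

# Invented example: answers claimed with ~90% confidence that are right only ~40% of the time.
sample = [(0.90, True), (0.92, False), (0.88, False), (0.91, True), (0.95, False)]
for mean_conf, acc, gap in confidence_accuracy_gap(sample):
    print(f"stated confidence {mean_conf:.2f} vs. accuracy {acc:.2f} (gap {gap:+.2f})")
```

A consistently positive gap is exactly the “confident but wrong” pattern the system card describes.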
This empirical experience is also reflected in the first impossibility theorem of the study on which this post is based. Its essence is that if the detector only sees sentences it takes to be true during training (i.e., positive examples), there is demonstrably no general procedure that can filter out hallucinations. Formally, the detector cannot distinguish actual truths from erroneous conclusions, since it receives no information that would rule out the correctness of the generated sentences. Self-refine and majority-vote techniques therefore only seem useful on the surface: both methods rely solely on the model’s own internal logic and statistical patterns.
In the self-refine procedure, the model tries to improve its previous answers through repeated reflection or regeneration. However, this often leads only to a reinterpretation of the existing incorrect conclusions, since no correction from an external, reliable source enters the process. The majority-vote technique, in turn, assumes that the most frequent of several independently generated answers will be the correct one. But if the model starts from fundamentally incorrect premises, voting merely reinforces those incorrect patterns. Since both procedures rely solely on the model’s own generated output, they cannot reveal the blind spots created by its learning limitations, and hallucination detection remains fundamentally limited.
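The majority-vote failure mode is easy to reproduce in a toy simulation: when the model’s single most likely answer to a question is wrong, drawing more samples and voting makes the wrong answer more likely to win, not less. The answer distribution below is invented purely for illustration.

```python
# Toy simulation with an invented answer distribution: majority voting over samples
# from a model whose most probable answer to a factual question is wrong.
import random
from collections import Counter

random.seed(0)

answers = ["1912", "1915", "1910", "1918"]
probs   = [0.45,   0.30,   0.15,   0.10]   # "1915" is the true answer, "1912" is the model's mode
TRUE_ANSWER = "1915"

def majority_vote(n_samples):
    votes = Counter(random.choices(answers, weights=probs, k=n_samples))
    return votes.most_common(1)[0][0]       # the answer that wins the vote

for k in (1, 5, 25, 101):
    wins = sum(majority_vote(k) == TRUE_ANSWER for _ in range(2000))
    print(f"{k:>3} samples: the vote picks the true answer in {wins / 2000:.0%} of trials")
# With more samples the vote converges on the model's mode ("1912"), so the shared
# error is reinforced rather than corrected, which is the blind spot described above.
```

Majority voting works well when errors are independent noise; it fails precisely when the error is systematic, which is the case discussed here.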
The second theorem, the possibility theorem, paints a more optimistic picture. It states that if the detector sees not only positive examples but also sentences that are not in K (the truth set), then filtering out hallucinations becomes possible. Adding negative examples radically reduces the learning complexity: it becomes clear to the detector that certain generated statements are wrong, which allows more refined decisions. This approach also has significant industrial impact: according to OpenAI’s system card, periodic “red-teaming” rounds, during which the models are bombarded with adversarial legal and healthcare examples, measurably reduce critical errors. Although the exact percentage reduction is not disclosed, industry experience shows that the conscious use of negative examples is an effective tool for increasing model accuracy and suppressing hallucinations.
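In engineering terms, the message of the possibility theorem is that a hallucination detector can be trained as an ordinary binary classifier over both verified statements and labeled counterexamples. The sketch below illustrates the idea with scikit-learn; the toy sentences, the TF-IDF features and the model choice are our own placeholder assumptions, not the construction used in the paper or in OpenAI’s pipeline.

```python
# Minimal sketch of a detector trained on verified statements (label 1, "in K")
# and on known-false statements (label 0, negative examples).
# Placeholder data and features; not the construction from the cited paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_sentences = [
    "Water boils at 100 degrees Celsius at sea level.",   # positive example (in K)
    "The EU AI Act introduces risk-based obligations.",   # positive example (in K)
    "Water boils at 40 degrees Celsius at sea level.",    # negative example (not in K)
    "The EU AI Act was adopted in 1995.",                 # negative example (not in K)
]
labels = [1, 1, 0, 0]

detector = make_pipeline(TfidfVectorizer(), LogisticRegression())
detector.fit(train_sentences, labels)

generated = "The EU AI Act was adopted in 1995."
p_in_k = detector.predict_proba([generated])[0][1]
print(f"estimated probability that the statement is in K: {p_in_k:.2f}")
# Without the two negative examples there would be only one class to learn from,
# and the classifier could not separate truths from hallucinations at all.
```

In this framing, the value of red-teaming rounds is simply that they manufacture exactly the negative labels such a detector needs.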
A recent analysis highlights that, in the future, authorities will not only audit the origin of training data but also introduce stricter control mechanisms regarding the source and quality of data. While the report does not specifically mention the concept of negative labels, it emphasizes the importance of data quality and the reliability of annotations. According to KPMG’s report AI Regulations: Present and Future, tracking the origin of training data and systematically collecting errors is essential to increasing the reliability of models. Regulatory expectations also dictate that erroneous predictions be properly documented and that developers be able to demonstrate that they have been processed and corrected. The analysis suggests that such stricter controls not only improve accuracy but also contribute to the legal compliance of models. With this kind of legislative intervention, processing and correcting erroneous predictions becomes not just an industry recommendation but a legal requirement. Whether this will explicitly extend to negative examples is unclear, but the tightening of error-correction mechanisms points in that direction.
The above results point clearly in one direction: protection against hallucinations cannot be achieved simply by increasing the number of model parameters or by refining prompting techniques. To make hallucinations truly detectable, systematic collection and processing of negative labels is necessary.
Even OpenAI’s most advanced publicly available models clearly demonstrate that statistical errors and blind spots persist as the number of parameters grows. The RAG literature confirms this as well: retrieval-augmented solutions such as RC-RAG only achieve significant accuracy gains on benchmarks like HotPotQA when they are trained with hard negative examples.
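What a “hard negative example” buys the retriever can be seen in a generic contrastive-loss sketch: the query embedding is pulled toward its gold passage and pushed away from passages that look similar but do not support the answer. The NumPy code below is a schematic InfoNCE-style illustration with invented vectors, not the training recipe of any system named above.

```python
# Schematic InfoNCE-style contrastive loss with hard negatives (illustrative only).
import numpy as np

def info_nce_loss(query_vec, positive_vec, negative_vecs, temperature=0.05):
    """Cross-entropy of the query against [gold passage | negatives]; index 0 is the gold passage."""
    candidates = np.vstack([positive_vec, negative_vecs])            # shape: (1 + N, d)
    sims = candidates @ query_vec / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(query_vec)
    )                                                                # cosine similarities
    logits = sims / temperature
    probs = np.exp(logits - logits.max())                            # numerically stable softmax
    probs /= probs.sum()
    return -np.log(probs[0])                                         # small only if the gold passage wins

rng = np.random.default_rng(0)
d = 64
query = rng.normal(size=d)
gold = query + 0.1 * rng.normal(size=d)             # passage that actually answers the query
easy_negs = rng.normal(size=(4, d))                 # random passages: trivial to reject
hard_negs = query + 0.4 * rng.normal(size=(4, d))   # near-miss passages: hard to reject

print(f"loss with easy negatives: {info_nce_loss(query, gold, easy_negs):.3f}")
print(f"loss with hard negatives: {info_nce_loss(query, gold, hard_negs):.3f}")
# The hard negatives produce the larger loss, i.e. the stronger training signal,
# which is why hard negative mining tends to improve retriever accuracy.
```

The same logic carries over to hallucination detection: the most informative negative labels are the statements that look most like the truth.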
The legislative framework, in particular the EU AI Act, and the preparations for ISO/IEC 42001 also indicate that practices such as the systematic use of negative labels will soon be not only a development advantage but also a regulatory expectation. Traceability, auditing of erroneous predictions and documentation of hallucinations are likely to become legal obligations, with severe financial penalties for non-compliance.
Overall, identifying and addressing the hallucinations of LLMs is not only a technological challenge but also a legal and industrial one. Formal results, industry practice and a tightening regulatory environment all indicate that hallucination prevention and filtering is essential for building robust AI applications. Consistent collection and processing of negative examples is essential both to improve model accuracy and to ensure legal compliance. Effectively integrating red-teaming rounds, hard-negative mining and labeling into a sustainable enterprise pipeline not only reduces the risk of hallucinations but also contributes to a more transparent and ethical AI ecosystem.
István ÜVEGES, PhD is a Computational Linguist researcher and developer at MONTANA Knowledge Management Ltd. and a researcher at the HUN-REN Centre for Social Sciences. His main interests include the social impacts of Artificial Intelligence (Machine Learning), the nature of Legal Language (legalese), the Plain Language Movement, and sentiment- and emotion analysis.