AI With Humans or AI Without Humans: Where Does the Deloitte Model Fail? – Part I.
Expensive, government-commissioned reports built on fabricated studies may sound at first like a tabloid-style “AI scandal,” but the story runs much deeper. The recent Canadian and Australian Deloitte cases, both involving uncontrolled Gen AI use without validation, do not simply reveal technical errors. They show how professional responsibility can slip out of our hands when we delegate careful background work to generative AI. The problem is structural: under what conditions can AI be used responsibly in analytical work, and where do we cross into a zone where “innovation” becomes little more than a euphemism for abandoning fact-checking altogether?
In just a few years, Gen AI seems to have moved from experimental “toy” status to playing a central role in the production of government briefing reports worth hundreds of thousands of dollars. That would not necessarily be a problem, provided the rules of the game were clear and everyone involved understood where the line runs between responsible automation and the outsourcing of intellectual due diligence, or worse, of sensitive information. The Deloitte scandals of the past months show, however, that in practice this line is far from clear, and that the use of Gen AI has quickly brought core questions of accountability, evidence-based policymaking and professional ethics to the forefront.
At first glance, the Canadian Deloitte case looks like just another AI headline: a provincial government paid almost 1.6 million Canadian dollars for a several-hundred-page health care report that turned out to contain fabricated citations and supposedly AI-generated “research”. Local investigative journalists revealed that the document cited studies and authors that do not exist, and some of the named researchers publicly stated that they had never carried out the studies attributed to them. The presence of such invented references has long been known as one of the characteristic consequences of using generative models without adequate oversight. The provincial government acknowledged “irregularities,” while Deloitte maintained that the report’s substantive recommendations remained valid but promised to revise the footnotes and bibliography. From the perspective of the public, however, this is not a mere “irregularity” or an “editing” issue: it is the discovery that a costly expert report, one that may be shaping government decisions, is built on foundations that cannot be verified in the real world. It is a question of public trust. Ultimately, it is not the scandal itself that erodes public trust, but the absence of even the most basic due diligence that citizens expect from institutions entrusted with public funds.
What really shows the gravity of the situation is that this is not an isolated case; it appears to be part of a broader pattern. It later emerged that Deloitte had also relied on generative AI in Australia when preparing a report on the integrity of the welfare system, and that this document, too, contained invented court quotations and non-existent academic publications. The legal scholar who initiated the review identified dozens of errors, including sentences attributed to the Federal Court that had never been written, as well as books and articles that cannot be located in any catalogue or database. Under pressure, Deloitte agreed to a partial refund, the report was corrected, and the firm acknowledged after the fact that parts of the text had been generated using a large language model based on Azure OpenAI.
In both cases, the same structural failure emerges: generative AI was used not transparently and not under meaningful human control, but as a private drafting shortcut for documents that went on to shape major public policy decisions—including healthcare reform. When multimillion-dollar policy recommendations rest on unverifiable AI-generated material, the problem is no longer technical. It is, at its core, a human failure: the decision to rely on AI without performing the basic due diligence that professional standards require. This raises serious concerns about the soundness of the policymaking process itself and, inevitably, erodes public trust not only in the institutions commissioning these reports but also in the broader use of generative AI in government.
It is important to stress, however, that the mere fact that a consulting firm, law office or research institute uses generative AI is not a problem. AI tools have become a normal part of professional workflows, but no credible professional workflow permits delegating responsibility to an automated system. It goes without saying that verification and judgment must remain human duties. In preparing professional content, it is now standard practice for experts to rely on search engines, legal databases or text summarization tools, and generative models add a new layer of productivity on top of these. What we are entitled to expect, though, is that whatever ultimately reaches a client as a report, expert opinion or policy proposal should not be just a stylistically polished AI text, but a piece of work that has been checked and is supported by real, verifiable references. In practice, this requires basic safeguards, such as manual citation checks, cross-database verifications, and transparent documentation of how AI systems were used. The problem is therefore not the use of AI as such, but the moment when an organization behaves as if AI outputs were automatically true and, in doing so, effectively abandons its own professional quality control. In practice this is very close to an organization undermining its own added value and reputation as a top consulting firm.
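To make the idea of basic safeguards concrete, here is a minimal sketch, in Python, of what a first-pass automated check on a draft bibliography might look like, using the public Crossref API. The function names, the crude title-matching rule and the example DOI are illustrative assumptions, not a description of how Deloitte or any other firm actually runs quality control.

```python
"""Illustrative first-pass citation check against the Crossref API.

A sketch, not a production QA tool: it only asks whether a cited DOI
resolves and whether the title Crossref returns roughly matches the
title claimed in the draft. A human reviewer still has to read the
source and confirm that it supports the claim it is cited for.
"""

import requests

CROSSREF_WORKS = "https://api.crossref.org/works/"


def check_citation(doi: str, claimed_title: str) -> dict:
    """Return a small report on whether a cited DOI exists and matches."""
    resp = requests.get(CROSSREF_WORKS + doi, timeout=10)
    if resp.status_code != 200:
        # The DOI does not resolve in Crossref: a strong signal that the
        # reference needs manual investigation, or simply does not exist.
        return {"doi": doi, "found": False, "title_match": False}

    record = resp.json()["message"]
    registered_title = (record.get("title") or [""])[0].lower()
    # Crude overlap test; a real check would also compare authors,
    # venue and year, and tolerate formatting differences.
    title_match = (
        claimed_title.lower()[:60] in registered_title
        or registered_title[:60] in claimed_title.lower()
    )
    return {"doi": doi, "found": True, "title_match": title_match}


if __name__ == "__main__":
    # Hypothetical entry pulled from a draft bibliography.
    print(check_citation("10.1038/nature14539", "Deep learning"))
```

A script like this can only flag references that do not exist or do not match; it cannot tell whether an existing source is being misrepresented, which is precisely why the human layer of review remains indispensable.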
At this point it is worth facing what OpenAI and other research groups have recently shown both empirically and theoretically. So-called hallucinations, that is, plausible but false statements, are not just bugs but mathematically unavoidable features of LLM technology. An OpenAI study published in September 2025, for instance, demonstrates within a formal statistical framework that even if a model were trained on perfectly clean, error-free data, there would still be a lower bound on the error rate for certain types of questions. The model behaves like an exam candidate who, when unsure of the answer, prefers to guess, because under current training and evaluation practices (e.g. Reinforcement Learning from Human Feedback, the backbone of today’s language model training setup) an “I don’t know” response is penalized more heavily than a confident but wrong one. Isn’t it very human, in a way?
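The exam-candidate analogy can be made concrete with a toy calculation. The scoring rule below is an assumption chosen for illustration, one point for a correct answer and zero both for a wrong answer and for abstaining, not the formal model of the OpenAI study; but it captures the incentive at work: under such a rule, guessing is never worse than saying “I don’t know” in expectation.

```python
"""Toy illustration of why binary-graded evaluation rewards guessing.

Assumed scoring rule (for illustration only): 1 point for a correct
answer, 0 for a wrong answer, 0 for abstaining ("I don't know").
Under this rule the expected score of guessing never falls below the
expected score of abstaining, so a system tuned to maximize the score
has no incentive ever to admit uncertainty.
"""


def expected_score_guess(p_correct: float) -> float:
    """Expected score if the model commits to its best guess."""
    return 1.0 * p_correct + 0.0 * (1.0 - p_correct)


def expected_score_abstain() -> float:
    """Expected score if the model answers 'I don't know'."""
    return 0.0


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.1, 0.01):
        guess = expected_score_guess(p)
        abstain = expected_score_abstain()
        verdict = "guess wins" if guess > abstain else "tie"
        print(f"confidence={p:.2f}  guess={guess:.2f}  "
              f"abstain={abstain:.2f}  -> {verdict}")
    # Even at 1% confidence the guess scores higher in expectation,
    # which is exactly what shows up as confident hallucination.
```

Change the rule, for instance by deducting points for wrong answers while leaving abstention at zero, and low-confidence guessing stops paying off; as the next paragraph notes, however, this is not how most benchmarks are scored today.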
The study also shows that many industry benchmarks reward guessing without explicitly penalizing unfounded statements, which means that hallucinations are, in effect, structurally built into the system. A different line of theoretical work goes even further and argues that if we want language models to have certain seemingly natural properties, then it is logically impossible to eliminate hallucinations altogether. These so-called impossibility theorems, which apply tools from mechanism design and game theory to language models, show that we cannot freely optimize creativity, efficiency and strict adherence to the truth all at the same time. At some point we must trade something off: if we allow the model room for creative combinations and for drawing inferences from partial information, we also accept a certain level of error built into the design, which will sometimes appear as invented facts or fictional references. The Deloitte case therefore does not rest on a minor technical glitch, but on a fundamental misunderstanding: large language models generate linguistic predictions, not the verified knowledge claims we might wish them to produce.
If hallucinations are mathematically unavoidable, it follows that using generative AI as a direct source of truth in high-risk, regulated domains is essentially a professional failure. Specifically, in law, healthcare or public policy, it is not acceptable for a model’s guess to be written straight into a court filing, a report that underpins a health workforce strategy, or a proposal to redesign a tax system. These sectors operate under strict evidentiary and methodological standards, and incorporating unverified AI output directly contradicts these foundational norms and principles. Indeed, the minimum standard in these settings is to treat model outputs as hypotheses or working material that human experts must verify, supplement or, where necessary, discard entirely.
This perspective also fundamentally changes how we understand AI in the world of work. The question is not, and never really has been, whether it is AI or humans, and certainly not whether machines will replace people. The relevant comparison is between AI plus humans and AI without humans. The flawed reports produced by Deloitte mostly show what AI looks like without human oversight, or when oversight exists only on paper and does not involve any real control. If the model generates text but no one checks the references line by line, no one manually looks up the cited studies, and no one asks the very basic question of whether a particular court decision exists, then we are not looking at AI plus humans but at AI minus humans.

István ÜVEGES, PhD is a Computational Linguist, researcher and developer at MONTANA Knowledge Management Ltd. and a researcher at the HUN-REN Centre for Social Sciences. His main interests include the social impacts of Artificial Intelligence (Machine Learning), the nature of Legal Language (legalese), the Plain Language Movement, and sentiment and emotion analysis.