From Phantom Citations to Prompt Injection: The Crisis of Trust in Science in the Age of Generative AI – Part I.
Generative Language Models have significantly accelerated text production, and academic publishing is no exception. Increasingly, we see texts that appear fluent and polished but are built on shaky internal structures. The peer review process remains a bottleneck, so errors and misuse often surface only with delay and in scattered ways, such as phantom citations, generated figures, or hidden instructions embedded in manuscripts. The real question is not whether generative AI is good or bad, but whether academic writing still retains its core guarantees: verifiability, accountability, and transparency.
In recent years, generative AI has reshaped everyday writing practices. Today’s widely used language models were trained on vast amounts of text, allowing them to generate complex, seemingly coherent and high-quality output with minimal human effort. These tools are now broadly accessible. Developers compete to grow their user base and make generative language technologies available to as many people as possible. Unsurprisingly, this has led to widespread and often uncontrolled use. One of the clearest consequences is a sharp rise in text production. In academic contexts, this increase has been accompanied by a higher volume of low-quality texts produced with minimal human involvement. The scale of the problem is reflected in the emergence of the term “AI slop,” which has gained traction in public and media discourse. These texts often look acceptable and credible at first glance. However, they tend to be shallow, inaccurate, or unverifiable.
In education, workplaces, and the media, the influence of generative tools is immediately noticeable. Machine-generated texts are quickly integrated into everyday communication and decision-making processes, so errors also tend to surface quickly, at least under favorable circumstances. In academic publishing, however, the situation is different. It is not enough for a text to sound convincing; it must also allow readers to trace the foundations of its claims and verify them. This shifts the question from whether generative tools speed up and smooth out academic writing to how they affect the practice of verification.
When manuscripts become faster and cheaper to produce, more texts naturally enter circulation. The problem arises when the time and attention available for review and verification do not grow at the same pace. In such cases, polished form can easily mask content that is uncertain or weak. From this perspective, “scientific slop” is not a matter of style; it is a signal-to-noise problem. Well-worded but poorly supported and difficult-to-check claims can proliferate, placing added strain on both peer review and editorial processes.
The filtering capacity of these processes is limited by design. Peer review is largely voluntary in most fields and typically carried out by researchers in their spare time. Time and attention are therefore finite resources. This limitation becomes especially visible when the volume of submissions rises sharply, which often happens due to the ease and speed of producing AI-generated texts. Under such conditions, detecting problematic cases becomes disproportionately difficult.
The issue is further complicated by the fact that scientific texts are often interpreted and used long after publication. The academic community does not process new content all at once. Instead, readers encounter it at different times, in different places, and often independently of one another. There is usually no direct link between these encounters, so errors do not reveal themselves in a single, dramatic moment. Instead, they emerge gradually. The damage becomes insidious, as small distortions are quietly incorporated into other texts, decisions, and chains of citation.
In this context, the problem is not primarily that the texts are poorly written in terms of style or grammar. In fact, experience shows that these mass-produced texts often exhibit better surface form than earlier articles written entirely by hand. This is precisely what makes quick quality assessment more difficult. For the reasons already mentioned, editorial and peer review capacities cannot scale indefinitely with the rising number of submissions. As a result, the entire quality assurance process comes under sustained pressure. A 2025 review describes this as a capacity crisis in peer review and notes that journals and funding agencies are exploring various ways to increase speed and efficiency. In some cases, they are even experimenting with financial compensation for reviewers.
The rapid spread of generative tools is therefore not just anecdotal. It is a well-documented trend in research literature. A scientometric analysis of publications from 2022 to 2023, based on 171 peer-reviewed articles indexed in Scopus, found that many researchers consider these tools helpful for improving wording and supporting the writing process. This is especially true for authors who are not native speakers of English. However, the same analysis also highlights recurring risks. These include hallucinated references, increased risk of plagiarism, unclear authorship and ethical responsibility, and the limitations of relying on detection as a safety net. This duality reinforces the broader concern that while surface quality may improve, the burden of verification and filtering continues to grow.
One of the first tangible warning signs of generated and unverified text is the phenomenon of phantom citations. In such cases, the bibliography may look acceptable at first glance, but closer inspection reveals that some of the references either do not exist or do not support the claims they are supposedly backing. Perhaps the most striking recent example is the controversy surrounding a Deloitte report prepared for a government client. The document turned out to include nonexistent academic references and other unverifiable sources; according to reports, even a court citation appeared in an incorrect or fabricated form. The incident eventually led to a partial refund of Deloitte’s fee and a forced revision of the document. The case is a reminder that even the most prominent players sometimes forget that citations are not decorative. They serve a structural purpose: in many cases, they are the only means by which a claim can be traced back and its validity assessed.
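That structural purpose can be checked, at least at the crudest level, by machine. Below is a minimal sketch, assuming Python with the requests library and a reference list already reduced to DOIs (the DOIs shown are illustrative), that asks the public Crossref API whether each cited DOI resolves to a registered work.

    import requests

    CROSSREF = "https://api.crossref.org/works/"

    def doi_resolves(doi: str, timeout: float = 10.0) -> bool:
        """Return True if Crossref has a record for the given DOI."""
        return requests.get(CROSSREF + doi, timeout=timeout).status_code == 200

    # Illustrative reference list: one real DOI, one fabricated one.
    references = [
        "10.1038/s41586-020-2649-2",        # Harris et al., "Array programming with NumPy"
        "10.9999/this-doi-does-not-exist",  # the kind of entry a model may invent
    ]
    for doi in references:
        verdict = "resolves" if doi_resolves(doi) else "NOT FOUND: possible phantom citation"
        print(f"{doi}: {verdict}")

Even a trivial filter like this would flag fully invented references of the kind described above. The harder failure mode, a real source cited for a claim it does not actually make, still requires a human reader, so such checks complement rather than replace verification.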
The signs of superficial or insufficiently verified work are not limited to the textual level. In academic writing, it is common for a significant portion of an argument to rely on figures, tables, or visual illustrations in order to be properly supported and clearly understood. A graph, a microscopic image, or a schematic diagram often appears more immediate and “solid” to readers than a paragraph of text. Because of this, visual elements carry particular weight and can easily create the impression that the text is backed by actual measurements or real observations.
This is precisely why visuals generated by AI tools without any underlying experiment, measurement, or empirical basis present a distinct risk. The danger lies largely in the subconscious credibility that such visuals tend to carry, and in the illusion of validation they may convey. Misuse of generative AI tools in this way is no longer science fiction; reputable international journals have already found themselves in awkward situations because of it. A notable example is the case of Frontiers in Cell and Developmental Biology, which retracted an article after readers pointed out its nonsensical AI-generated figures.
István ÜVEGES, PhD, is a computational linguist working as a researcher and developer at GriffSoft Ltd. and as a researcher at the ELTE Centre for Social Sciences. His main interests include the social impacts of Artificial Intelligence (Machine Learning), the nature of legal language (legalese), the Plain Language Movement, and sentiment and emotion analysis.