The Reference Problem: Why You Can't Trust the Citations in AI-Generated Content
Artificial intelligence has transformed how we write. Large language models (LLMs) can generate fluent, authoritative-sounding prose in seconds, complete with what appear to be bona fide academic citations. The trouble is that those references are often entirely made up.
This is not a fringe issue. It is a documented, systemic flaw with real-world consequences, and the evidence is stacking up.
When Governments Publish Fake Science
In May 2025, the White House published its "Make America Healthy Again" report, billed as an example of "radical transparency" and "gold standard" science. Reporting by news outlets (such as Science) told a different story. Several studies cited in the report do not appear to exist, and researchers named as authors confirmed that the papers attributed to them were not real. Epidemiologist Katherine Keyes, listed as first author of a study on anxiety and adolescents, told investigators: "The paper cited is not a real paper that I or my colleagues were involved with." Errors of this kind undermine both the validity and the potential utility of the report.
Just months later, a remarkably similar incident unfolded on the other side of the world. The Australian government paid consultants Deloitte $290,000 for a 237-page report on automated penalties in the country's welfare system (Fortune). Soon after publication, University of Sydney researcher Chris Rudge noticed fundamental flaws: citations that didn't exist and misquoted legal judgments (Vectara). The revised report, quietly republished, removed fabricated quotes attributed to a federal court judge and references to non-existent reports. Australian Senator Barbara Pocock was blunt: Deloitte had done "the kinds of things that a first-year university student would be in deep trouble for" (Above the Law).
Why Do LLMs Fabricate References?
These failures are not accidents; they are an almost inevitable consequence of how LLMs work.
At their core, LLMs are next-token prediction engines. They generate text by calculating which word or phrase is statistically most likely to follow what has come before, based on patterns learned from vast quantities of training data. They do not retrieve information from a verified database, and they have no internal mechanism for distinguishing what they know from what they are guessing. As AI researcher Steven Piantadosi explains, AI models "don't have any way to know what is true", and when prompted to produce academic references to support a specific point, they will often invent something if no exact match exists (PolitiFact).
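To make the mechanism concrete, here is a deliberately tiny sketch. It is not an LLM: it is a bigram counter that always picks the most frequent next word, used as a stand-in for next-token prediction. The three "citations" it learns from are invented for this example, as are the function names. Even at this scale, the generator stitches together a citation that looks right and appears nowhere in its training data.

```python
from collections import Counter, defaultdict

# Made-up toy "training data": three invented citations, not real papers.
TRAINING_CITATIONS = [
    "Keller A (2019) Anxiety in adolescents Journal of Pediatrics",
    "Marsh B (2021) Sleep in adults Journal of Psychiatry",
    "Osei C (2020) Anxiety in adults Journal of Psychiatry",
]


def train_bigram_counts(citations):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for citation in citations:
        tokens = citation.split()
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts


def generate(counts, start, max_tokens=20):
    """Greedily append the most frequent next token until none is known."""
    tokens = [start]
    while len(tokens) < max_tokens and tokens[-1] in counts:
        next_token, _ = counts[tokens[-1]].most_common(1)[0]
        tokens.append(next_token)
    return " ".join(tokens)


counts = train_bigram_counts(TRAINING_CITATIONS)
generated = generate(counts, "Keller")

print(generated)
# Keller A (2019) Anxiety in adults Journal of Psychiatry
print("Seen in training data:", generated in TRAINING_CITATIONS)
# Seen in training data: False
```

Every word-to-word step in the output was seen during training, which is why the result reads naturally. The whole, however, is a paper that never existed in the data. Real models are vastly more sophisticated, but nothing in this kind of generation step checks the finished reference against a record of what actually exists.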
Several factors compound this. When demands exceed a model's knowledge boundaries, LLMs are trained to complete responses rather than express uncertainty, making them more likely to fabricate content than acknowledge a gap (Huang et al., 2025). Training data also has a cutoff date, meaning recent scientific literature is largely absent. A significant portion of valuable knowledge, encapsulated in recent scientific research and proprietary data, remains inaccessible to LLMs, creating knowledge gaps that lead to hallucination when the model attempts to generate information in those domains (Huang et al., 2025). The result is a reference that looks entirely plausible (correct journal name, plausible author list, a realistic DOI) but points to nothing real.
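A short illustration of why "looks plausible" is such a weak test. The sketch below checks only whether a string has the shape of a DOI, using a commonly used pattern for modern DOI syntax (an assumption on our part; it does not cover every DOI ever issued). The DOI itself is invented for this example, and `looks_like_doi` is a hypothetical helper name.

```python
import re

# Shape-only check for modern DOI syntax. This pattern is an assumption for
# this sketch: it tests format, not whether the DOI is registered anywhere.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)


def looks_like_doi(candidate: str) -> bool:
    """Return True if the string is shaped like a DOI."""
    return DOI_PATTERN.match(candidate) is not None


# Invented for illustration; not taken from any real paper.
invented_doi = "10.99999/example.2025.00042"

print(looks_like_doi(invented_doi))  # True: well-formed
print(looks_like_doi("not a doi"))   # False: wrong shape
```

The invented string passes because the check only looks at form. Whether a DOI actually resolves to a paper, and whether that paper says what the citation claims, are separate questions that a format check cannot answer.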
Tangible Evidence from ISMPP 2026 and Beyond
Our research presented at the 2026 EU Annual Meeting of the International Society for Medical Publication Professionals (ISMPP) provides quantitative evidence of the scale of the problem. We found that two of the leading LLMs provided inaccurate references between 11% and 43% of the time.
The academic publishing community is waking up to this. Analysis of papers accepted by NeurIPS 2025, one of the most prestigious machine learning conferences, uncovered hundreds of hallucinated citations across dozens of accepted papers, each of which had been reviewed by three or more expert referees (GPTZero). The problem has become widespread enough to earn its own vocabulary: "vibe citing", i.e. the LLM tendency to derive or amalgamate real sources into uncanny imitations that look accurate at first glance but crumble under scrutiny (GPTZero).
The Stakes Are High
A fabricated reference is not merely an embarrassment. It is the illusion of evidence; it creates the appearance of a claim being grounded in science when it is not. In medical publishing, policy documents, and legal filings, this distinction can have serious consequences. The solution cannot simply be to distrust AI-generated content wholesale; the productivity benefits are real and the technology is here to stay. What is needed is a workflow that preserves the efficiency of AI-assisted writing while guaranteeing that every reference is genuine and verifiable.
In our next post, we'll introduce exactly that.