Regulatory & Conduct

The open-data literature now has a measured contamination rate

Spick et al. put a number on the paper-mill exploitation of FAERS, NHANES, UK Biobank and the rest — 11,577 excess publications in 2025 — just as EMA's concept paper on external controls lands.


The number to remember is 11,577. That is the excess of 2025 publications above the ARIMA-forecast trend across nine canonical open-access health datasets (FAERS, NHANES, UK Biobank, FinnGen, GBD, MIMIC, CHARLS, CDC WONDER and TriNetX), identified by Spick et al. in a medRxiv preprint now reflected in the Journal of Clinical Epidemiology. Combined 2025 output across those nine sources hit 23,005 papers, a 3.0-fold increase over the 7,655 published in 2022. The earlier version of the analysis put the China-authored fold change at 9.5x. These are the same datasets biometrics teams cite for external controls, covariate-selection precedent, and methods-validation benchmarks.
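The arithmetic behind those headline figures can be checked in a few lines. This is a minimal sketch using only the numbers quoted above. The implied forecast assumes the excess is simply observed output minus forecast output summed across the nine datasets, which the preprint may define differently.

```python
# Figures as quoted in the text above.
papers_2022 = 7_655
papers_2025 = 23_005
excess_2025 = 11_577

# Fold change in combined output between 2022 and 2025.
fold_change = papers_2025 / papers_2022

# ASSUMPTION: excess = observed - forecast, aggregated over all nine datasets.
implied_forecast_2025 = papers_2025 - excess_2025

# Share of 2025 output that sits above the forecast trend under that assumption.
excess_share = excess_2025 / papers_2025

print(f"fold change 2022 to 2025: {fold_change:.2f}x")
print(f"implied 2025 forecast: {implied_forecast_2025:,}")
print(f"excess share of 2025 output: {excess_share:.1%}")
```

Under that assumption, roughly half of the 2025 output from these sources lies above trend.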

The detection method matters as much as the count. Spick et al. combined ARIMA trend-break forecasting with textual signals (formulaic titles) and geographic authorship anomalies, which gives the field a replicable surveillance approach rather than an anecdote. The JCE commentary track frames this as the first “sobering quantification” of a phenomenon previously argued without numbers. In their reply to Zil-E-Ali, the authors concede the equity objection, namely that blunt restrictions on open-data submissions disadvantage under-resourced researchers, but argue that technological change is already outpacing the proposed protective framework. Some publishers are restricting open-dataset submissions anyway. So far the tremor is in editorial policy, not yet in the SAP appendix.
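The trend-break idea can be sketched as follows: fit a trend on the pre-surge years, project it forward, and count what sits above the projection. The sketch deliberately swaps in an ordinary least-squares linear trend for ARIMA to stay dependency-free, so it is not the authors' model. The yearly counts are hypothetical and are not taken from Spick et al.

```python
def fit_linear_trend(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def excess_over_trend(years, counts, fit_until):
    """Fit the trend on years <= fit_until; return {year: observed - forecast} afterwards."""
    train = [(y, c) for y, c in zip(years, counts) if y <= fit_until]
    slope, intercept = fit_linear_trend([y for y, _ in train], [c for _, c in train])
    return {
        y: c - (intercept + slope * y)
        for y, c in zip(years, counts)
        if y > fit_until
    }


# HYPOTHETICAL yearly publication counts for a single dataset (illustration only).
years = list(range(2016, 2026))
counts = [100, 110, 120, 130, 140, 150, 160, 400, 700, 1000]

excess = excess_over_trend(years, counts, fit_until=2022)
for year, value in excess.items():
    print(year, round(value))
```

A real surveillance pipeline would replace the linear fit with a proper time-series model and attach forecast intervals, so that only counts above the upper bound are treated as excess.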

Why this hits biometrics specifically

External-control methodology rests on published analyses of exactly these datasets. EMA’s concept paper on external controls is in flight; FDA’s reliance on premarket and postmarket RWE rose steadily through 2024. If 11,577 publications of unknown quality have entered the corpus that informs RWD source selection, covariate plausibility, and historical-control assembly, the regulatory pipeline inherits the noise — quietly, and through citations no submission reviewer is going to chase to ground. Meta-analytic pipelines compound the problem: a separate JCE analysis finds that RCTs reporting large effect sizes for continuous outcomes in their abstracts are systematically less likely to carry transparency markers — exactly the papers most likely to be picked up by guideline panels and over-weighted downstream.
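One way to see how much a pooled estimate leans on low-transparency, large-effect trials is to pool with and without them. The sketch below uses a plain fixed-effect inverse-variance pool. The `Study` fields, the `flagged` marker and the three example trials are hypothetical, and the choice of fixed-effect pooling is an assumption made for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class Study:
    label: str
    effect: float   # effect estimate, e.g. a standardized mean difference
    se: float       # standard error of the effect estimate
    flagged: bool   # HYPOTHETICAL marker: large effect without transparency markers


def pooled_fixed_effect(studies):
    """Inverse-variance fixed-effect pooled estimate and its standard error."""
    weights = [1.0 / (s.se ** 2) for s in studies]
    total = sum(weights)
    estimate = sum(w * s.effect for w, s in zip(weights, studies)) / total
    return estimate, math.sqrt(1.0 / total)


# HYPOTHETICAL trials, for illustration only.
studies = [
    Study("Trial A", 0.10, 0.10, flagged=False),
    Study("Trial B", 0.20, 0.10, flagged=False),
    Study("Trial C", 0.90, 0.10, flagged=True),
]

all_est, all_se = pooled_fixed_effect(studies)
unflagged_est, unflagged_se = pooled_fixed_effect([s for s in studies if not s.flagged])

print(f"all studies:    {all_est:.2f} (SE {all_se:.3f})")
print(f"unflagged only: {unflagged_est:.2f} (SE {unflagged_se:.3f})")
```

In this toy case a single flagged trial moves the pooled estimate from 0.15 to 0.40, which is the kind of shift a documented sensitivity analysis is meant to surface.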

The infrastructure response is visible and losing. arXiv, per Derek Lowe, has stacked three interventions in under six months: a November 2025 ban on CS review articles unless peer-reviewed elsewhere; a January 2026 endorsement requirement after moderator rejection rates tripled across 2025; and a March/April 2026 policy of 1-year bans plus a permanent peer-review-first requirement for authors whose papers show “incontrovertible evidence” of unchecked LLM output — hallucinated references, leftover chatbot meta-commentary. A cited arXiv manuscript estimates >146,000 hallucinated citations in 2025 alone, still rising. medRxiv and bioRxiv have announced no equivalent regime. The clinical-methods preprint pipeline is, for now, unguarded.

The integrity baseline was already poor

Three concurrent JCE items confirm the AI story is layered on top of pre-existing breakdown. A study on stepped-wedge CRTs documents that “data available upon request” routinely fails in practice — a direct problem for anyone trying to validate time-effect or correlation-structure methods for designs whose analytical complexity makes real data essential. A retraction of “Treatment effects in pharmacological clinical randomized controlled trials are mainly due to placebo” (Schmidt/Walach, JCE 179, 2025) ships with boilerplate withdrawal language and no substantive explanation, leaving anyone who cited the 69%-variance claim without grounds to assess whether the methodology or the conclusion failed. And a volunteer copy-paste duplication scan of 600 biomedical Dryad datasets flagged ~3% as containing serious duplications — non-AI integrity failures that predate the LLM era entirely.

The practical implication is calibration, not panic. The contamination is empirically real and measurable; the regulatory weighting of affected literature is not yet adjusted; the equity-versus-integrity governance debate is unresolved.

Protocol read: Citations to published analyses of FAERS, NHANES, TriNetX, UK Biobank and the rest now need explicit due diligence in any submission or HTA dossier. The literature underpinning your external-control rationale has a documented publication excess, and “we cited a peer-reviewed paper” is no longer a sufficient defence.

What to do now:

Audit pending submissions and evidence-synthesis dossiers for citations to secondary analyses of the nine flagged datasets; flag any used to justify covariate sets or historical comparator construction.
Pre-specify provenance criteria for any published RWD analysis cited in support of external controls — author institution stability, dataset version, code availability — before the next EMA external-controls exchange.
Treat large-effect-size RCTs surfaced in meta-analyses as a transparency-flag stratum, not a precision-weighted input; document the sensitivity analysis.
For sponsor-run SW-CRTs, replace “available upon request” language with a named repository deposition plan in the SAP and data-sharing statement.
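The first audit step can be started with a simple keyword screen over a reference list, sketched below. It only finds candidates for manual review: a match says a citation mentions one of the nine datasets, not that the cited paper is unreliable. The function name and the sample references are hypothetical, and matching is case-sensitive so that the ordinary word "mimic" is not picked up.

```python
import re

# The nine datasets named in the analysis discussed above.
FLAGGED_DATASETS = [
    "FAERS", "NHANES", "UK Biobank", "FinnGen", "GBD",
    "MIMIC", "CHARLS", "CDC WONDER", "TriNetX",
]

# Whole-word, case-sensitive match on any dataset name.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in FLAGGED_DATASETS) + r")\b"
)


def flag_citations(references):
    """Return {reference index: sorted dataset names mentioned} for matching references."""
    hits = {}
    for index, reference in enumerate(references):
        found = sorted(set(_PATTERN.findall(reference)))
        if found:
            hits[index] = found
    return hits


# HYPOTHETICAL reference strings, for illustration only.
references = [
    "Hypothetical A. Association of X with Y in NHANES 2011-2018.",
    "Hypothetical B. A randomized trial of drugs that mimic Z.",
    "Hypothetical C. Signal detection in FAERS with UK Biobank linkage.",
]

for index, datasets in flag_citations(references).items():
    print(index, datasets)
```

Each flagged entry would then be checked by hand against whatever provenance criteria the team has pre-specified.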
