Evaluating modern ML models is hard. The strong incentive to report a state-of-the-art result on some metric often leads to questionable research practices (QRPs): bad practices which fall short of outright research fraud. We describe 44 such practices which can undermine reported results, giving examples where possible. Our list emphasises the evaluation of large language models (LLMs) on public benchmarks. We also discuss “irreproducible research practices” — decisions that make it difficult or impossible for other researchers to reproduce, build on, or audit previous research.
Converted from the LaTeX source. Prose is reproduced faithfully with LaTeX markup removed and citation keys rendered as author–year where named in prose, otherwise dropped. Figures omitted; the two summary tables are kept. Canonical source: the arXiv PDF.
Epigraph: “If, like truth, falsehood had but one face, we’d be better off: we could take as true the opposite of what a liar says. But the opposite of the truth has a hundred thousand faces and a limitless field.” — Montaigne
To understand the actual capabilities of models like LLMs and develop reliable systems based on them, it is critical to have trustworthy evaluations comparing different models and approaches on meaningful benchmarks. However, researchers and companies have strong incentives to engage in ‘questionable research practices’ (QRPs) to inflate their reported results: inflated results help researchers publish in high-impact venues and help companies attract investment and users. There is opportunity as well as motive: the complexity of the pre-training, post-training and evaluation procedures leaves ample room for such practices. These opportunities fall into three families. First, contamination, in which test-set information leaks into pre-training, post-training, runtime, or the prompt. Second, cherrypicking, in which researchers ‘hack’ experimental settings (testing multiple times and then selecting the settings under which their model works better) or ‘nerf’ (degrade) baselines. Third, various forms of misreporting, such as making broad claims (e.g. about “reasoning”) based on narrow benchmarks. We additionally consider irreproducible research practices (IRPs), which make it more difficult for others to reproduce, build on, or audit previous research; the most prevalent example is dataset hiding.
We do not claim that most performance is spurious. Nor do we show the general prevalence of these problems. This paper answers the limited question: what could make a model’s reported performance to some extent spurious?
The two lenses we use — QRPs and researcher degrees of freedom (RDOFs) — originate from psychological science, but similar issues have been studied in ML under other names (Lipton & Steinhardt on inflated language and superfluous design; Sculley et al. on the “winner’s curse”; the NeurIPS 2019 Reproducibility programme; Liao et al. on benchmark/real-world mismatch; Biderman et al. on LLM-evaluation problems). Our work is similar in spirit to Huff’s How to Lie with Statistics.
Researcher degrees of freedom. Scientific analyses have many RDOFs — free choices in design and analysis that a researcher can manipulate to give themselves more chances of a (real or spurious) ‘significant’ result, without increasing the process’s ability to detect genuine effects. This is unavoidable: no science has a one-to-one mapping between theories, experiments, and analyses. Each degree of freedom is an opportunity to introduce a QRP, intentionally or not. An ML evaluation usually has a main method (the researchers’ own) and baselines; to publish usually requires showing the method is statistically significantly better, giving an incentive to exploit RDOFs. It is essentially never acceptable to ‘optimise’ any aspect of the evaluation procedure. (By contrast, it is necessary that researchers optimise the hyperparameters of baselines and of their own method a “similar amount” — though operationalising this is fraught.)
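To make the effect of RDOFs concrete, the following is a minimal simulation sketch; it is not taken from the paper. It assumes two models with identical true accuracy and a researcher who tries k evaluation settings and keeps the largest observed gap. All function names and parameter values (`n_items`, `true_acc`, `threshold`, `n_trials`) are illustrative assumptions.

```python
import random


def simulate_gap(n_items, true_acc, rng):
    """Observed accuracy gap between two models that share the same true accuracy."""
    ours = sum(rng.random() < true_acc for _ in range(n_items)) / n_items
    base = sum(rng.random() < true_acc for _ in range(n_items)) / n_items
    return ours - base


def spurious_win_rate(k_settings, n_items=200, true_acc=0.7,
                      threshold=0.08, n_trials=200, seed=0):
    """Fraction of trials in which the best of k settings shows a gap above threshold.

    There is no real difference between the models, so every such 'win' is spurious.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        best_gap = max(simulate_gap(n_items, true_acc, rng)
                       for _ in range(k_settings))
        wins += best_gap > threshold
    return wins / n_trials


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, spurious_win_rate(k))
```

Under these assumptions, the rate for `k_settings=20` should come out well above the rate for `k_settings=1`, even though neither model is actually better: extra settings add chances of a favourable result without adding any power to detect a genuine effect.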
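Summary of QRPs. The Stage column shows where in the development path each practice acts (Design → Collection → Training → Evaluation → Reporting); the final column indicates whether the practice could plausibly occur by accident.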
| Practice | Description | Stage | Accidental? |
|---|---|---|---|
| Contamination | | | |
| Training Contamination | Training on the test set (e.g. in the web corpus) | Training | Plausibly |
| Prompt Contamination | Putting test data into the prompt (few-shot) | Evaluation | Plausibly |
| RAG Contamination | Leaking benchmark data via Retrieval-Augmented Generation | Evaluation | Plausibly |
| Dirty Paraphrases | Rephrasing test data and training on it | Collection | No |
| Contamination Laundering | Contaminated models generating training data | Collection | Plausibly |
| Thieved Test | Obtaining private test labels | Collection | No |
| User Contamination | Post-training on test in user prompts | Training | Plausibly |
| Over-hyping | Tuning hyperparameters further after test | Training | Plausibly |
| Meta-contamination | Reusing contaminated hyperparameters/designs | Training | Plausibly |
| Semantic Duplicates | Train and test set include near-identical points | Collection | Plausibly |
| Cherrypicking | | | |
| Baseline Nerfing | Optimising training parameters of baselines less | Evaluation | Plausibly |
| Baseline Hacking | Choosing weak baselines to compare to | Evaluation | No |
| Runtime Nerfing | Optimising baselines’ inference parameters less | Evaluation | Plausibly |
| Runtime Hacking | Post-hoc best inference parameters or decoding | Evaluation | No |
| Benchmark Hacking | Choosing easier benchmarks | Evaluation | Plausibly |
| Subset Hacking | Subsetting the benchmark until you win | Evaluation | No |
| Harness Hacking | Choosing post-hoc best evaluation harness | Evaluation | No |
| Golden Seed | Training/tuning with many different seeds | Training | No |
| Prompt Nerfing | Undertuning prompts of baseline models | Evaluation | Plausibly |
| Prompt Hacking | Choosing the best prompt strategy post-hoc | Evaluation | Plausibly |
| Misreporting | | | |
| Superfluous Cog | Redundant module added to claim novelty | Design | Plausibly |
| Whack-a-mole | Monitoring for specific failures and fine-tuning them away ad hoc | Training | No |
| Benchmark Decoration | Pretraining on benchmark / instruction data | Training | Plausibly |
| p-hacking | Flawed sampling when bolding SOTA results | Reporting | Plausibly |
| Point Scores | Reporting single-run results (no error bars) | Reporting | No |
| Outright Lies | Fabricating results (included for completeness) | Reporting | No |
| Over/Underclaiming | Misleading claims about model capabilities | Reporting | No |
| Reification | General claims from narrow ML benchmarks | Reporting | No |
| Nonzero-shot | Claiming ‘zero-shot’ while training on examples | Reporting | Plausibly |
| Misarithmetic Mean | Using arithmetic mean on normalised results | Reporting | Plausibly |
| Parameter Smuggling | Under-reporting model size; or substituting in more embedding parameters | Reporting | No |
| File Drawer | Failing to report negative benchmark studies | Reporting | No |
| Amplifiers | | | |
| Inductive Smuggling | Handcrafting inductive bias for a task | Design | No |
| Label Noise | Using benchmarks known to be error-ridden | Collection | Plausibly |
Most ways to mislead others, or delude yourself, in ML evaluation fall into one of three categories: contamination, cherrypicking, and misreporting. A residual fourth category, amplifiers, has indirect effects by enabling other QRPs. Two key terms recur: nerfing is intervening to weaken baselines (e.g. tuning their hyperparameters less); hacking is selecting shared experimental settings post-hoc (after seeing results), then reporting only the favourable ones, often without correcting for multiple comparisons.
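As an illustration of what correcting for multiple comparisons can look like, here is a minimal sketch of Holm's step-down procedure. The paper does not prescribe a particular correction; Holm's method is swapped in here as one standard choice, and the p-values are made up for the example.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses Holm's procedure rejects."""
    m = len(p_values)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # The threshold loosens as hypotheses are rejected: alpha/m, alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # step-down: once one test fails, all larger p-values fail too
    return rejected


# Hypothetical p-values from comparing one method to a baseline under four settings.
p_values = [0.01, 0.04, 0.03, 0.005]
uncorrected = [p < 0.05 for p in p_values]
corrected = holm_bonferroni(p_values)
print(uncorrected)
print(corrected)
```

In this made-up example all four comparisons pass an uncorrected 0.05 threshold, but only two survive the correction, which is the gap that reporting only the favourable settings conceals.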
Contamination (AKA leakage) is any influence of test-set information on model development, from subtle (reusing hyperparameters) to blatant (training on the test set). It can totally invalidate a reported benchmark score. Poorly-filtered web-scale corpora have led to many cases of plausibly accidental contamination; new versions of existing benchmarks generally show substantial performance drops.
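A crude way to look for the blatant end of this spectrum is verbatim n-gram overlap between a test item and training documents. The sketch below is an assumption-laden illustration, not the paper's method: the whitespace tokenisation, the function names, and the choice of `n` are all hypothetical, and an exact-match check of this kind will miss paraphrased or semantically duplicated test data.

```python
def ngrams(text, n):
    """Set of lowercased whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item, corpus_docs, n=8):
    """Fraction of the test item's n-grams that appear verbatim in any corpus document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical test item and two tiny hypothetical corpora.
item = "what is the capital of france and why"
leaky_corpus = ["forum post: what is the capital of france and why is it paris"]
clean_corpus = ["a recipe for bread with flour water salt and yeast"]
print(overlap_fraction(item, leaky_corpus, n=3))
print(overlap_fraction(item, clean_corpus, n=3))
```

A high fraction flags an item for manual inspection; a low fraction is not evidence of cleanliness, since subtler forms of contamination leave no verbatim trace.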
Cherrypicking is running multiple tests and reporting only the best; a subtler form reports the best result in the main table and relegates the variations to the appendix.
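The alternative to reporting the best run is to summarise all of them. The following sketch shows one possible shape for such a summary; the function name and the five seed accuracies are hypothetical.

```python
import statistics


def summarise_runs(scores):
    """Summarise every run of a configuration, not only the best one."""
    return {
        "n_runs": len(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "best": max(scores),
        "worst": min(scores),
    }


# Hypothetical accuracies from five seeds of the same configuration.
seed_scores = [0.71, 0.69, 0.74, 0.70, 0.66]
summary = summarise_runs(seed_scores)
cherrypick_gap = summary["best"] - summary["mean"]
print(summary)
print(f"best minus mean: {cherrypick_gap:.3f}")
```

The difference between `best` and `mean` is the amount by which a best-run report would overstate typical performance for these made-up numbers.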
Misreporting. Most of the above could be salvaged if researchers honestly reported all their work in sufficient and correct detail.
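One misreporting practice from the summary table, the Misarithmetic Mean, can be shown with a small worked example. The scores below are invented, and the geometric mean is used as a commonly suggested alternative rather than something the paper mandates.

```python
import math


def normalise(scores, reference):
    """Express each score as a ratio to the reference system's score."""
    return [s / r for s, r in zip(scores, reference)]


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


# Hypothetical raw scores of two systems on two benchmarks.
system_a = [10.0, 40.0]
system_b = [20.0, 20.0]

b_relative_to_a = normalise(system_b, system_a)  # ratios 2.0 and 0.5
a_relative_to_b = normalise(system_a, system_b)  # ratios 0.5 and 2.0

# With the arithmetic mean, whichever system is NOT the reference appears to win.
print(arithmetic_mean(b_relative_to_a), arithmetic_mean(a_relative_to_b))
# The geometric mean gives the same verdict regardless of the reference.
print(geometric_mean(b_relative_to_a), geometric_mean(a_relative_to_b))
```

Because the arithmetic mean of ratios changes with the choice of reference system, a researcher can pick the normalisation that favours their method; the geometric mean removes that particular degree of freedom.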
Amplifiers. Residual practices that amplify the effects of other QRPs.
Irreproducible research practices (IRPs). Practices that prevent third parties from reproducing ML training or evaluation; these are not classic QRPs, but they enable QRPs by preventing auditing.
| Practice | Description | Accidental? |
|---|---|---|
| Fishing | Conducting confirmatory research without any hypothesis | Plausibly |
| Half-Fishing | Confirmatory analysis without specifying effect direction | Plausibly |
| Dataset Hiding | Not disclosing the sources of the training data | Plausibly |
| Stochastic Runs | e.g. GPU nondeterminism despite fixed seeds | Plausibly |
| No Access | Providing no way to evaluate your model (or no easy API) | Plausibly |
| Closed Evaluation | Using closed-source evaluation data | Plausibly |
| Hidden CoT | Not providing the full completion of your model | No |
| Runtime Hiding | Failing to disclose inference parameters and methodology | No |
| Dummy Code | Uploading empty placeholder files to foil casual inspection | No |
| API Drift | Not reporting behaviour changes of proprietary LLM services over time | Plausibly |
Misreporting is all you need? One could argue the truly fundamental QRP is inadequate reporting — problems of interpretation and even cheating would in theory evaporate if researchers exhaustively reported what they did. But this assumes the vast data from a modern training process can be sensibly analysed by readers (supplementary information can run to hundreds of pages), and dataset hiding means we usually cannot know the training corpus even at the metadata level — so, except in rare cases, we cannot separate memorisation from generalisation.
Defences (some existing, some underused): standardised evaluation harnesses (e.g. EleutherAI Harness, UK AISI’s Inspect); semantic decontamination; contamination databases; private benchmarks; periodic benchmark refresh; gestalt human-preference evals (LMSYS Arena, though hackable via style); canary strings (reusing BIG-Bench’s exact string gives a zero-overhead filter); full logging and log summarisation (Weights & Biases); and preregistration of analyses (evaluation is more preregistrable than chaotic model design).
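The canary-string defence mentioned above amounts to a substring filter over training documents. A minimal sketch follows; the `CANARY` value is a placeholder, not the real BIG-Bench string, and the function name is hypothetical.

```python
# Hypothetical placeholder: substitute the canary string published by the benchmark.
CANARY = "canary GUID 00000000-0000-0000-0000-000000000000"


def drop_canaried(documents, canary=CANARY):
    """Return (kept_documents, dropped_count), removing any document containing the canary."""
    kept = [doc for doc in documents if canary not in doc]
    return kept, len(documents) - len(kept)


# Hypothetical mini-corpus: the second document embeds the canary.
corpus = [
    "an ordinary web page about gardening",
    "benchmark file header " + CANARY + " followed by test questions",
    "another ordinary page",
]
kept, dropped = drop_canaried(corpus)
print(len(kept), dropped)
```

The filter only works if benchmark authors embed the string and corpus builders check for it, and it does nothing for copies of the benchmark from which the string has been stripped.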
Root causes. (1) Researchers must self-certify SOTA: the publication process makes algorithm designers evaluate their own performance and usually blocks negative results, heavily incentivising upward bias. (2) Industrialization of AI research: the industrial era retains the scientific trappings of the academic era, but AI products also pursue “build the best product” and “get the most attention” (marketing). Goals diverge on “does training on the test set serve my ends?” — a scientist says no, a business may say yes. There is nothing inherently wrong with the business perspective; the problem is using unscientific means (e.g. contamination) and then making scientific claims. Departures from the default academic norms should be declared explicitly. The biggest science/business divergences are often on IRPs such as obscuring the training dataset (withheld for competitive and copyright-litigation reasons).
Limitations. The list is not exhaustive. The paper is dual-use, but bad actors already have simpler options (fabrication). QRPs are hard to detect systematically, so the work offers existence proofs, not prevalence or effect-size estimates (planned for follow-up studies).
We reviewed 44 QRPs — most involving some form of contamination, cherrypicking, or misreporting — which affect the internal and external validity of ML (and especially LLM) evaluations, plus IRPs which prevent reproduction, building-on, and auditing. We listed possible mitigations and suggested two explanations for QRPs: researcher incentives and the industrialization of research.