Research results are often not reproducible and/or replicable. Intense self-criticism in psychology over the last several years has shown that only 40–65% of classic results replicate. Call the failures “reversals.” Shockingly, they are not incorporated into research training or undergraduate education. This work rests on a hand-built dataset of such reversals.
Contribution: ~200 hours to the underlying dataset.
This was an AIMOS 2021 workshop presentation, not a journal article, so there is no paper PDF. The text below is compiled from the project’s own materials: the FORRT Replications & Reversals project page (forrt.org/reversals) and Gavin’s original write-up (“Reversals in psychology”, gleech.org/psych), which seeded the dataset. It documents the project rather than reproducing a manuscript. (The later, formal write-up is tracking-replications.md; this entry is the earlier community/educational project it grew out of.)
The FORRT Replications & Reversals project is a crowdsourced, community-driven collection documenting effects in the social and behavioural sciences that failed to replicate — or were outright reversed — under empirical scrutiny. Its three goals: (1) education — let students conduct replications as coursework, giving them real research experience while assessing robustness; (2) scholarship — help researchers stay current with replication evidence in their field; and (3) open-science literacy — make these resources accessible. The data was crowdsourced from researchers across 22+ disciplines, growing from ~150 entries in its initial phase to over 600 effects across social, cognitive, and developmental psychology and many other fields. Each entry records the original study and citation, the critique/replication attempts, original vs replication effect sizes, and an overall status (replicated / not replicated / mixed / reversed). The project has since concluded and transitioned into the broader FORRT Replication Hub (including FReD, with effect sizes, and FLoRA, the Library of Replication Attempts).
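The record structure described above can be sketched as a small data type. This is only an illustration: the field names, the `Entry` and `Status` classes, the `shrinkage` helper, and the example values are all hypothetical and are not taken from the FORRT dataset or its actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    """Overall status labels as listed in the project description."""
    REPLICATED = "replicated"
    NOT_REPLICATED = "not replicated"
    MIXED = "mixed"
    REVERSED = "reversed"


@dataclass
class Entry:
    """One effect in a reversals-style dataset (hypothetical field names)."""
    effect: str
    original_citation: str
    status: Status
    replication_citations: List[str] = field(default_factory=list)
    original_effect_size: Optional[float] = None
    replication_effect_size: Optional[float] = None

    def shrinkage(self) -> Optional[float]:
        """Replication effect as a fraction of the original, if both are known."""
        if self.original_effect_size is None or self.replication_effect_size is None:
            return None
        if self.original_effect_size == 0:
            return None
        return self.replication_effect_size / self.original_effect_size


# Made-up example values, for illustration only.
example = Entry(
    effect="hypothetical priming effect",
    original_citation="Original et al. (placeholder citation)",
    status=Status.REVERSED,
    replication_citations=["Replicator et al. (placeholder citation)"],
    original_effect_size=0.60,
    replication_effect_size=-0.06,
)

print(example.status.value, example.shrinkage())
```

A negative `shrinkage()` value corresponds to a sign flip; a value between 0 and 1 corresponds to a replication effect smaller than the original.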
Intense self-criticism in psychology over recent years showed that only 40–65% of classic social results replicated, with average effects in replications about half the originally reported size. Gavin’s original write-up catalogued more than 40 major reversals (updated through March 2020), organised by subfield.
Recurring methodological causes: publication bias and p-hacking; tiny samples (median n≈30–50 in many originals); researcher degrees of freedom and optional stopping; effect-size shrinkage in honest replications; and statistical artifacts confounding apparent phenomena.
Author’s takeaways: trust replications over originals (replications are more often preregistered and share their data); deflate novel effect sizes by a factor of 2–100 until replication evidence arrives; distinguish “no effect” from “small effect” with a proper power analysis; psychology’s openness enabled self-correction, but the field operated under false pretenses for decades: “98 pieces of very weak evidence cannot sum to strong evidence, whatever the p-value says.”
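To make the power-analysis point concrete, here is a minimal sketch of a sample-size calculation. It assumes a two-sided comparison of two independent groups, an effect measured as Cohen's d, and the standard normal approximation n ≈ 2((z₁₋α/₂ + z_power)/d)² per group. The function name `n_per_group` and the default α and power are illustrative choices, not something taken from the write-up.

```python
import math
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size to detect a standardized effect d.

    Uses the normal approximation for a two-sided, two-sample comparison;
    an exact t-test calculation gives slightly larger numbers.
    """
    if d <= 0:
        raise ValueError("d must be a positive effect size")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Deflating a reported effect by a factor of 2 roughly quadruples
# the sample needed to detect it.
reported_d = 0.5
for deflation in (1, 2, 4):
    d = reported_d / deflation
    print(f"d = {d:.3f}: about {n_per_group(d)} participants per group")
```

Under these assumptions, a study with 30 to 50 participants in total cannot tell a small true effect apart from no effect, which is why a null result from an underpowered replication is weak evidence in either direction.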
Research results are often not reproducible and/or replicable; call the non-replications and sign-flips “reversals”. Shockingly, these are not incorporated into research training or undergraduate education — so the project builds a hand-curated, teachable dataset of reversals and integrates it into curricula (effects serve as ready-made templates for student replication projects), making replications visible, findable, and part of how the next generation is trained.