Abstract

Research results are often not reproducible and/or replicable. Intense self-criticism in psychology over the last several years has shown that only 40–65% of classic results replicate. Call the failures “reversals.” Shockingly, they are not incorporated into research training or undergraduate education. This work is underpinned by a hand-built dataset of such reversals.

Contribution: ~200 hours to the underlying dataset.

Full text

This was an AIMOS 2021 workshop presentation, not a journal article, so there is no paper PDF. The text below is compiled from the project’s own materials: the FORRT Replications & Reversals project page (forrt.org/reversals) and Gavin’s original write-up (“Reversals in psychology”, gleech.org/psych), which seeded the dataset. It documents the project rather than reproducing a manuscript. (The later, formal write-up is tracking-replications.md; this entry is the earlier community/educational project it grew out of.)

The project

The FORRT Replications & Reversals project is a crowdsourced, community-driven collection documenting effects in the social and behavioural sciences that failed to replicate — or were outright reversed — under empirical scrutiny. Its three goals: (1) education — let students conduct replications as coursework, giving them real research experience while assessing robustness; (2) scholarship — help researchers stay current with replication evidence in their field; and (3) open-science literacy — make these resources accessible. The data was crowdsourced from researchers across 22+ disciplines, growing from ~150 entries in its initial phase to over 600 effects across social, cognitive, and developmental psychology and many other fields. Each entry records the original study and citation, the critique/replication attempts, original vs replication effect sizes, and an overall status (replicated / not replicated / mixed / reversed). The project has since concluded and transitioned into the broader FORRT Replication Hub (including FReD, with effect sizes, and FLoRA, the Library of Replication Attempts).
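The per-entry fields described above can be pictured as a small record type. The following is a minimal sketch only: the class and field names (`EffectEntry`, `ReplicationAttempt`, `Status`, `by_status`) are hypothetical and are not taken from the project's actual spreadsheet or from FReD/FLoRA; only the four status labels come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    # The four overall statuses named in the project description.
    REPLICATED = "replicated"
    NOT_REPLICATED = "not replicated"
    MIXED = "mixed"
    REVERSED = "reversed"


@dataclass
class ReplicationAttempt:
    # Hypothetical shape for one critique/replication attempt.
    citation: str
    effect_size: Optional[float] = None  # None when no effect size is reported


@dataclass
class EffectEntry:
    # Hypothetical shape for one effect in the collection.
    name: str
    original_citation: str
    original_effect_size: Optional[float]
    status: Status
    attempts: List[ReplicationAttempt] = field(default_factory=list)


def by_status(entries: List[EffectEntry], status: Status) -> List[EffectEntry]:
    """Return the entries whose overall status matches `status`."""
    return [e for e in entries if e.status is status]


# Placeholder entries for illustration only; these are not real dataset rows.
example_entries = [
    EffectEntry("placeholder effect A", "placeholder citation", 0.5,
                Status.NOT_REPLICATED,
                [ReplicationAttempt("placeholder replication", 0.02)]),
    EffectEntry("placeholder effect B", "placeholder citation", None,
                Status.MIXED),
]
```

A course could, for instance, call `by_status(example_entries, Status.NOT_REPLICATED)` to pull candidate effects for student replication projects.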

Background: reversals in psychology (the seed write-up)

Intense self-criticism in psychology over recent years showed that only 40–65% of classic social results replicated, with average effects in replications about half the originally reported size. Gavin’s original write-up catalogued roughly 40 or more major reversals (updated through March 2020), organised by subfield:

Social psychology: priming effects (elderly, professor, money, moral cleansing); the Stanford prison experiment (demand characteristics); reinterpretation of Milgram; the hurricane gender-naming effect; the Implicit Association Test (weak predictive validity, r≈0.15); stereotype threat; the Pygmalion/teacher-expectation effect (r<0.1).
Positive psychology: power posing (disowned by a co-author; no hormonal effects); facial feedback (d≈0.03 vs original 0.43); mindfulness (suspected publication bias); “Blue Monday”.
Cognitive psychology: ego depletion (d≈0.04 vs original −1.96); the Dunning–Kruger effect (statistical artifact); choice overload; brain-training far-transfer (d≈0.14).
Developmental: growth-mindset interventions (d≈0.08); the marshmallow test (drops to r≈0.05 with SES controls); learning styles (no attainment benefit).
Evolutionary psychology: romantic priming (d≈0.00 vs 0.55; severe publication bias); Dunbar’s number (range 4–520, not a fixed 150); menstrual-cycle mating preferences.
Neuroscience / behavioural: split-brain “two minds” overinterpretation; nudges averaging “six times smaller than billed”.
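Where the list above quotes both an original and a replication effect size, the shrinkage can be computed directly. This is an illustrative sketch: the three pairs are the figures quoted above, while the function name `retained_fraction` is an assumption introduced here. Because the sign of a reported d depends on how conditions were coded, the sketch compares magnitudes only.

```python
# (original d, replication d) pairs as quoted in the list above.
pairs = {
    "facial feedback": (0.43, 0.03),
    "ego depletion": (-1.96, 0.04),
    "romantic priming": (0.55, 0.00),
}


def retained_fraction(original: float, replication: float) -> float:
    """Share of the original effect magnitude that survives in replication."""
    if original == 0:
        raise ValueError("original effect size must be non-zero")
    return abs(replication) / abs(original)


for name, (orig, rep) in pairs.items():
    print(f"{name}: {retained_fraction(orig, rep):.1%} of original magnitude retained")
```

For these three effects the retained share is far below the "about half" average cited earlier, which is what makes them reversals rather than ordinary shrinkage.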

Recurring methodological causes: publication bias and p-hacking; tiny samples (median n≈30–50 in many originals); researcher degrees of freedom and optional stopping; effect-size shrinkage in honest replications; and statistical artifacts confounding apparent phenomena.

Author’s takeaways: trust replications over originals (given their preregistration and data sharing); deflate novel effect sizes by factors of 2–100 while awaiting replication evidence; distinguish “no effect” from “small effect” with proper power analysis; psychology’s openness enabled self-correction, but the field operated under false pretenses for decades. In the write-up’s words: “98 pieces of very weak evidence cannot sum to strong evidence, whatever the p-value says.”
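To make the power-analysis point concrete, here is a rough sketch of the sample size needed to detect a given standardized effect d. It uses the standard normal approximation for a two-group comparison, n per group ≈ 2(z_{1−α/2} + z_{power})² / d². The two-sided α = 0.05 and 80% power are conventional defaults assumed here, not values from the write-up, and `n_per_group` is a name introduced for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants per group for a two-sided, two-sample test.

    Normal approximation; an exact t-test calculation gives slightly
    larger numbers for small samples.
    """
    if d == 0:
        raise ValueError("d must be non-zero")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# Effect sizes quoted earlier: original facial feedback (0.43) and the
# growth-mindset intervention estimate (0.08).
for d in (0.43, 0.08):
    print(f"d = {d}: about {n_per_group(d)} per group")
```

Under these assumptions an effect of d = 0.08 needs on the order of 2,450 participants per group, so a study with 30–50 participants cannot tell a small effect from no effect.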

The argument of the workshop

Research results are often not reproducible and/or replicable; call the non-replications and sign-flips “reversals”. Shockingly, these are not incorporated into research training or undergraduate education — so the project builds a hand-curated, teachable dataset of reversals and integrates it into curricula (effects serve as ready-made templates for student replication projects), making replications visible, findable, and part of how the next generation is trained.