---
title: "Replications and Reversals in the Social Sciences"
authors:
  - Helena Hartmann
  - Shilaan Alzahawi
  - Meng Liu
  - Mahmoud Elsherif
  - Alaa AlDoh
  - Gavin Leech
  - Flavio Azevedo
gleech_role: co-author
year: 2021
venue: AIMOS workshop
type: workshop
url: https://aimos.community/
links:
  video: https://www.youtube.com/watch?v=M1TzjegeEtw
  professionalised: https://forrt.org/reversals/
  original: https://www.gleech.org/psych
contribution_hours: 200
topics: [metascience, replication, reversals, psychology, open-science]
---

## Abstract

Research results are often not reproducible and/or replicable. Intense
self-criticism in psychology over the last several years has shown that only
40–65% of classic results replicate. Call the ones that fail "reversals." Shockingly, they
are not incorporated into research training or undergraduate education. This work
underlies a hand-built dataset of such reversals.

*Contribution: ~200 hours to the underlying dataset.*

## Full text

> This was an AIMOS 2021 workshop presentation, not a journal article, so there is no paper
> PDF. The text below is compiled from the project's own materials: the **FORRT Replications
> & Reversals** project page (forrt.org/reversals) and Gavin's original write-up
> ("Reversals in psychology", gleech.org/psych), which seeded the dataset. It documents the
> project rather than reproducing a manuscript. (The later, formal write-up is
> [tracking-replications.md](tracking-replications.md); this entry is the earlier
> community/educational project it grew out of.)

### The project

The **FORRT Replications & Reversals** project is a crowdsourced, community-driven collection
documenting effects in the social and behavioural sciences that failed to replicate — or were
outright *reversed* — under empirical scrutiny. Its three goals: (1) **education** — let
students conduct replications as coursework, giving them real research experience while
assessing robustness; (2) **scholarship** — help researchers stay current with replication
evidence in their field; and (3) **open-science literacy** — make these resources accessible.
The data was crowdsourced from researchers across 22+ disciplines, growing from ~150 entries
in its initial phase to over 600 effects across social, cognitive, and developmental
psychology and many other fields. Each entry records the original study and citation, the
critique/replication attempts, original vs replication effect sizes, and an overall status
(replicated / not replicated / mixed / reversed). The project has since concluded and
transitioned into the broader **FORRT Replication Hub**, which includes FReD (the FORRT
Replication Database, with effect sizes) and FLoRA (the Library of Replication Attempts).
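
The entry fields listed above can be pictured as a small record type. This is only a sketch of one way to model them: the class name `Entry`, the enum `Status`, and every field name are hypothetical and are not taken from the project's actual data files; only the four status values come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    """The four overall statuses named in the project description."""
    REPLICATED = "replicated"
    NOT_REPLICATED = "not replicated"
    MIXED = "mixed"
    REVERSED = "reversed"


@dataclass
class Entry:
    """One effect in the collection (hypothetical field names)."""
    effect: str                       # short name of the claimed effect
    original_citation: str            # the original study
    status: Status                    # overall replication status
    replication_citations: List[str] = field(default_factory=list)
    original_effect_size: Optional[float] = None     # as reported originally
    replication_effect_size: Optional[float] = None  # as found on replication


# Illustrative record using the facial-feedback figures quoted in this entry.
# The citation is left as a placeholder and the status is chosen for illustration.
example = Entry(
    effect="facial feedback",
    original_citation="(citation omitted)",
    status=Status.NOT_REPLICATED,
    original_effect_size=0.43,
    replication_effect_size=0.03,
)
```

Keeping the status as a closed enumeration rather than free text is one possible design choice; it makes a crowdsourced table easier to filter and validate.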

### Background: reversals in psychology (the seed write-up)

Intense self-criticism in psychology over recent years showed that only **40–65% of classic
social results replicated**, with average effects in replications about **half** the
originally reported size. Gavin's original write-up catalogued more than 40 major reversals
(updated through March 2020), organised by subfield:

- **Social psychology:** priming effects (elderly, professor, money, moral cleansing); the
  Stanford prison experiment (demand characteristics); reinterpretation of Milgram; the
  hurricane gender-naming effect; the Implicit Association Test (weak predictive validity,
  r≈0.15); stereotype threat; the Pygmalion/teacher-expectation effect (r<0.1).
- **Positive psychology:** power posing (disowned by a co-author; no hormonal effects);
  facial feedback (d≈0.03 vs original 0.43); mindfulness (suspected publication bias); "Blue
  Monday".
- **Cognitive psychology:** ego depletion (d≈0.04 vs original −1.96); the Dunning–Kruger
  effect (statistical artifact); choice overload; brain-training far-transfer (d≈0.14).
- **Developmental:** growth-mindset interventions (d≈0.08); the marshmallow test (drops to
  r≈0.05 with SES controls); learning styles (no attainment benefit).
- **Evolutionary psychology:** romantic priming (d≈0.00 vs 0.55; severe publication bias);
  Dunbar's number (range 4–520, not a fixed 150); menstrual-cycle mating preferences.
- **Neuroscience / behavioural:** split-brain "two minds" overinterpretation; nudges
  averaging "six times smaller than billed".

**Recurring methodological causes:** publication bias and p-hacking; tiny samples (median
n≈30–50 in many originals); researcher degrees of freedom and optional stopping; effect-size
shrinkage in honest replications; and statistical artifacts mistaken for real phenomena.
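
To make the shrinkage concrete, the sketch below computes how much of the original effect magnitude survives in the three original/replication pairs quoted in the list above. The helper name `retained_fraction` is hypothetical, and comparing absolute magnitudes is an assumption made here because the sign conventions of the quoted figures may differ.

```python
# Original vs replication effect sizes (Cohen's d), as quoted in the list above.
quoted = {
    "facial feedback": (0.43, 0.03),
    "ego depletion": (-1.96, 0.04),
    "romantic priming": (0.55, 0.00),
}


def retained_fraction(original_d: float, replication_d: float) -> float:
    """Share of the original effect magnitude that remains on replication.

    Assumption: only magnitudes are compared, so a sign flip is not
    distinguished from shrinkage.
    """
    if original_d == 0:
        raise ValueError("original effect size must be non-zero")
    return abs(replication_d) / abs(original_d)


for name, (original_d, replication_d) in quoted.items():
    share = retained_fraction(original_d, replication_d)
    print(f"{name}: {share:.1%} of the original magnitude retained")
```

For these three cases the retained share is well under 10%, far below the average of about one half reported across replication projects; the catalogued reversals are the extreme end of the distribution.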

**Author's takeaways:** trust replications over originals (look for preregistration and data
sharing); deflate novel effect sizes by a factor of 2–100 while waiting for replication
evidence; distinguish "no effect" from "small effect" with a proper power analysis.
Psychology's openness enabled self-correction, but the field operated under false pretenses
for decades: "98 pieces of very weak evidence cannot sum to strong evidence, whatever the
p-value says."
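
The power-analysis point can be illustrated with a rough sample-size calculation. The sketch below uses the standard normal-approximation formula for a two-sided, two-group comparison, n per group ≈ 2·((z₁₋α/₂ + z_power)/d)²; the function names `n_per_group` and `deflate`, and the defaults of α = 0.05 and 80% power, are assumptions chosen for illustration rather than anything specified by the project.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size needed to detect Cohen's d.

    Normal approximation for a two-sided, two-sample comparison; an exact
    t-test calculation gives slightly larger numbers.
    """
    if d == 0:
        raise ValueError("no finite sample can detect an effect of exactly zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / abs(d)) ** 2)


def deflate(reported_d: float, factor: float) -> float:
    """Shrink a reported effect size by a chosen factor (the write-up suggests 2-100)."""
    return reported_d / factor


# Effect sizes quoted earlier in this entry.
print(n_per_group(0.43))               # original facial-feedback effect
print(n_per_group(0.08))               # growth-mindset interventions
print(n_per_group(deflate(0.43, 2)))   # the same original effect, halved
```

Under these assumptions an effect of d = 0.43 needs about 85 participants per group, while d = 0.08 needs roughly 2,450 per group. A study with the median original sample of 30–50 cannot tell a small effect from no effect at all.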

### The argument of the workshop

Research results are often not reproducible and/or replicable; call the non-replications and
sign-flips "reversals". Shockingly, these are not incorporated into research training or
undergraduate education. The project therefore builds a hand-curated, teachable dataset of
reversals and integrates it into curricula, where the effects serve as ready-made templates
for student replication projects. The aim is to make replications visible, findable, and part
of how the next generation is trained.