A medical reversal is when an existing treatment is found to actually be useless or harmful. Psychology has in recent years been racking up reversals: in fact only 40-65% of its classic social results were replicated, in the weakest sense of finding ‘significant’ results in the same direction. (Even in those that replicated, the average effect found was half the originally reported effect.) Such errors are far less costly to society than medical errors, but it’s still pollution, so here’s the cleanup. 1

Psychology is not alone: medicine, cancer biology, and economics all have many irreplicable results. It’d be wrong to write off psychology: we know about most of the problems here because of psychologists, and its subfields differ a lot by replication rate and effect-size shrinkage.

One reason psychology reversals are so prominent is that it’s an unusually ‘open’ field in terms of code and data sharing. A less scientific field would never have caught its own bullshit.

The following are empirical findings about empirical findings; they’re all open to re-reversal. Also it’s not that “we know these claims are false”: failed replications (or proofs of fraud) usually just challenge the evidence for a hypothesis, rather than affirm the opposite hypothesis. I’ve tried to ban myself from saying “successful” or “failed” replication, and to report the best-guess effect size rather than play the bad old Yes/No science game. 2

Figures correct as of March 2020; I will put some effort into keeping this current, but not that much.
Code for converting means to Cohen’s d and Hedges’ g here.
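
(The linked script isn’t reproduced on this page; below is a minimal sketch of the conversion, with a signature chosen to match the call in footnote 4. The parameter names and the Hedges-correction flag are illustrative assumptions, not the linked code.)

    import math

    def cohens_d(mean1, mean2, sd1, sd2, n1, n2, hedges_correction=False):
        # Standardised mean difference from summary statistics, using the pooled SD.
        pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
        d = (mean1 - mean2) / pooled_sd
        if hedges_correction:
            # Hedges' small-sample correction: J = 1 - 3 / (4*df - 1), df = n1 + n2 - 2.
            d *= 1 - 3 / (4 * (n1 + n2 - 2) - 1)
        return d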



Social psychology

No good evidence for many forms of priming, automatic behaviour change from ‘related’ (often only metaphorically related) stimuli. 3

  • Questionable evidence for elderly priming, that hearing about old age makes people walk slower. The p-curve alone argues against the first 20 years of studies.

  • No good evidence for professor priming, improved (“+13%”) performance at trivia after picturing yourself as a professor vs as a thug.

  • No good evidence for the Macbeth effect, that moral aspersions induce literal physical hygiene.

  • No good evidence for money priming, that “images or phrases related to money cause increased faith in capitalism, and the belief that victims deserve their fate”, etc.

Stats

  • Original paper: 'Mere exposure to money increases endorsement of free-market systems and social inequality', Caruso 2013. n between 30 and 168
    (~120 citations).

  • Critiques: Rohrer 2015, n=136. Lodder 2019, a meta-analysis of 246 experiments.
    (total citations: ~70)

  • Original effect size: system justification d=0.8, just world d=0.44, dominance d=0.51

  • Replication effect size: for the 47 preregistered experiments in Lodder:
    g = 0.01 [-0.03, 0.05] for system justification,
    g = 0.11 [-0.08, 0.3] for belief in a just world,
    g = 0.07 [-0.02, 0.15] for fair market ideology.


  • No good evidence of anything from the Stanford prison ‘experiment’. It was not an experiment: the abuse was partly scripted and shaped by ‘demand characteristics’; the experimenters intervened constantly; some participants faked their reactions; and, as Zimbardo concedes, they began with a complete “absence of specific hypotheses”.

Stats

  • Original paper: 'Interpersonal dynamics in a simulated prison', Zimbardo 1973
    (1800 citations, but cited by books with hundreds of thousands of citations).

  • Critiques: convincing method & data inspection - Le Texier 2019
    (total citations: ~8)

  • Original effect size: Key claims were insinuation plus a battery of difference-in-means tests at up to 20% significance(!). n="21" (typo for 24?).

  • Replication effect size: N/A


  • No good evidence that 65% of people will inflict pain if ordered to (Milgram). The experiment was riddled with researcher degrees of freedom and implausible agreement between very different treatments, and “only half of the people who undertook the experiment fully believed it was real and of those, 66% disobeyed the experimenter.” So the upper bound is around 16%, with no strong warrant even for that.

  • Lots of screen-time is not strongly associated with low wellbeing; it explains about as much of the variance in teen sadness as eating potatoes does (0.35%).

Stats

  • Original paper: Media speculation? (millions of 'citations').

  • Critiques: Orben 2019, n=355,358

  • Original effect size: N/A

  • Replication effect size: median association of technology use with adolescent well-being was β=−0.035, s.e.=0.004


  • No good evidence that female-named hurricanes are more deadly than male-named ones. The original effect size was a 176% increase in deaths, driven entirely by four outliers; reanalysis using a greatly expanded historical dataset found a nonsignificant decrease in deaths from female-named storms.

Stats

  • Original paper: 'Female hurricanes are deadlier than male hurricanes', Jung 2014, n=92 hurricanes discarding two important outliers.
    (~76 citations).

  • Critiques: Christensen 2014. Smith 2016, n=420 large storms.
    (total citations: ~15)

  • Original effect size: d=0.65: 176% increase in deaths from flipping names from relatively masculine to relatively feminine

  • Replication effect size: Smith: 264% decrease in deaths (Atlantic); 103% decrease (Pacific).


  • At most weak evidence that implicit-bias testing measures racism usefully. Implicit bias scores poorly predict actual biased behaviour, r = 0.15. The operationalisations used to measure that predictive power are often unrelated to actual discrimination (e.g. ambiguous brain activations). Test-retest reliability is 0.44 for the race IAT, which is usually classed as “unacceptable”. This isn’t news; the original study also found very low test-criterion correlations.

  • The Pygmalion effect, that a teacher’s expectations about a student affect their performance, is at most small, temporary, and inconsistent, r<0.1 with a reset after weeks. Rosenthal’s original claims about massive IQ gains, persisting for years, are straightforwardly false (“The largest gain… 24.8 IQ points in excess of the gain shown by the controls.”), and used an invalid test battery. Jussim: “90%–95% of the time, students are unaffected by teacher expectations”.

  • At most weak evidence for stereotype threat suppressing girls’ maths scores, i.e. for the interaction between gender and the stereotype manipulation.

Stats

  • Original paper: 'Stereotype Threat and Women’s Math Performance', Spencer 1999, n=30 women
    (~3900 citations).

  • Critiques: Stoet & Geary 2012, meta-analysis of 23 studies. Ganley 2013, n=931. Flore 2015, meta-analysis of 47 measurements. Flore 2018, n=2064.
    (total citations: ~500)

  • Original effect size: Not reported properly; Fig. 2 looks like a control-group women’s mean score of 17 (sd=20) vs an experimental-group score of 5 (sd=15), which might mean roughly d = −0.7 (see the arithmetic sketch below).

  • Replication effect size:
    Stoet: d= −0.17 [−0.27, −0.07] for unadjusted scores.
    Ganley: various groups, d= minus 0.27 to 0.17.
    Flore 2015: g= −0.07 [−0.21; 0.06] after accounting for publication bias.
    Flore 2018: d= −0.05 [−0.18, 0.07]
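
A quick arithmetic check on that rough d ≈ −0.7 (the “arithmetic sketch” referenced above), using the values read off Fig. 2; group sizes aren’t reported, so equal groups are assumed purely for illustration:

    import math

    # Rough reconstruction of the Spencer 1999 Fig. 2 estimate (eyeballed values).
    threat_mean, threat_sd = 5, 15      # women under stereotype threat
    control_mean, control_sd = 17, 20   # women in the control condition

    pooled_sd = math.sqrt((threat_sd**2 + control_sd**2) / 2)    # equal-n pooled SD
    print((threat_mean - control_mean) / pooled_sd)              # ≈ -0.68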


  • Questionable evidence for an increase in “narcissism” (leadership, vanity, entitlement) in young people over the last thirty years. The basic counterargument is that they’re misidentifying an age effect as a cohort effect (the narcissism construct apparently decreases by about a standard deviation between adolescence and retirement): “every generation is Generation Me”.
    All such “generational” analyses are at best needlessly noisy approximations of social change, since generations are not discrete natural kinds, and since people at the supposed boundaries are indistinguishable.


Positive psychology

  • No good evidence that taking a “power pose” lowers cortisol, raises testosterone, or increases risk tolerance.

    “That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications.”

    After the initial backlash, the claim retreated to a subjective effect, “increased feelings of power”. Even then: there is weak evidence, and only for decreased feelings of power from the contractive posture. My reanalysis is here.

  • No good evidence for the facial-feedback hypothesis (that smiling causes good mood and pouting causes bad mood).

  • Reason to be cautious about mindfulness as a mental-health treatment. Most studies are low quality and use inconsistent designs; there’s higher heterogeneity than for other mental-health treatments; and there’s strong reason to suspect reporting bias. None of the 36 meta-analyses before 2016 mentioned publication bias. The hammer may fall.

Stats

  • Critiques: Coronado-Montoya 2016

  • Original effect size: prima facie, d=0.3 for anxiety or depression

  • Replication effect size: Not yet.


  • No good evidence for Blue Monday, that the third Monday in January is the peak of depression or low affect ‘as measured by a simple mathematical formula developed on behalf of Sky Travel’. You’d need a huge sample size, in the thousands, to detect the effect reliably, and this has never been done.


Cognitive psychology

  • Good evidence against ego depletion, the claim that making decisions induces muscle-like fatigue in self-control.

Stats

  • Original paper: 'Ego Depletion: Is the Active Self a Limited Resource?', Baumeister 1998, n=67.
    (~5700 citations).

  • Critiques: Hagger 2016, 23 independent conceptual replications
    (total citations: ~640)

  • Original effect size: something like d = −1.96 between control and the worst condition. (I hope I’m calculating that wrong.) 4

  • Replication effect size: d = 0.04 [−0.07, 0.14]. (NB: not testing the construct the same way.)






  • Questionable evidence for the Dunning-Kruger effect. There are three claims involved:

    1) the original: that poor performers (e.g. the bottom quartile) overestimate their performance more than good performers do;
    2) that people in general overestimate their own ability;
    3) the meme: that there’s a u-shaped relationship between perceived and actual ability.

Stats

  • Original paper: 'Unskilled and unaware of it: how difficulties in recognizing one's own incompetence lead to inflated self-assessments', Kruger & Dunning 1999, n=334. This contains claims (1) and (2) but no hint of (3).
    (~5660 citations).

  • Critiques: Nuhfer 2016 and Nuhfer 2017, n=1154; Luu 2015; Greenberg 2018, n=
    (total citations: ~20)

  • Original effect size: 2 of the 4 experiments showed a positive ~linear relationship between score and perceived ability; 2 showed no strong relationship. The original paper concedes that there's a purely statistical explanation for the effect they found: just that it is much easier to overestimate a low number! And if I am a perfect performer, I am unable to overestimate myself. D&K just think there's something left when you subtract this.

  • Replication effect size: 3 of the 4 original studies can be explained by noisy tests, bounded scales, and artefacts in the plotting procedure. Only about 5% of people were very overconfident (more than 30% off) in the Nuhfer data. Some suggestion that, if claim (2) is a big effect, it's a WEIRD effect only.
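
To see how far that purely statistical explanation goes, here is a minimal simulation sketch (scale and noise levels are illustrative assumptions): the test score and the self-estimate are equally noisy, bounded readings of true ability, with no extra overconfidence among poor performers, yet grouping by measured score reproduces the classic pattern.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    ability = rng.uniform(0, 100, n)                                  # true skill, 0-100 scale
    test_score = np.clip(ability + rng.normal(0, 15, n), 0, 100)      # noisy, bounded test
    self_estimate = np.clip(ability + rng.normal(0, 15, n), 0, 100)   # equally noisy self-view

    # Group by *measured* performance quartile, as in the original design.
    quartile = np.digitize(test_score, np.quantile(test_score, [0.25, 0.5, 0.75]))
    for q in range(4):
        gap = (self_estimate - test_score)[quartile == q].mean()
        print(f"quartile {q + 1}: mean (self-estimate - score) = {gap:+.1f}")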



  • Questionable evidence for the “hungry judge” effect, of massively reduced acquittals (d=2) just before lunch. Case order isn’t independent of acquittal probability (“unrepresented prisoners usually go last and are less likely to be granted parole”); favourable cases may take predictably longer and so are pushed until after the recess; the effect size is implausible on priors; and the proposed explanation relied on ego depletion (see above).

  • No good evidence for multiple intelligences (in the sense of statistically independent components of cognition). Gardner, the inventor: “Nor, indeed, have I carried out experiments designed to test the theory… I readily admit that the theory is no longer current. Several fields of knowledge have advanced significantly since the early 1980s.”

  • At most weak evidence for brain training (that is, “far transfer” from daily training games to fluid intelligence) in general, and in particular from the Dual n-Back game. Maybe some effect on the non-Gf skills of the elderly.

Stats

  • Original paper: 'Improving fluid intelligence with training on working memory', Jaeggi 2008, n=70.
    (2200 citations).

  • Critiques: Melby-Lervåg 2013, meta-analysis of 23 studies.
    Gwern 2012, meta-analysis of 45 studies.

  • Original effect size: d=0.4 over control, 1-2 days after training

  • Replication effect size:
    Melby: d=0.19 [0.03, 0.37] nonverbal; d=0.13 [-0.09, 0.34] verbal.
    Gwern: d=0.1397 [-0.0292, 0.3085], among studies using active controls.


  • In general, be highly suspicious of anything that claims a positive permanent effect on adult IQ. Even in children the absolute maximum is 4-15 points for a powerful single intervention (iodine supplementation during pregnancy in deficient populations).

  • See also the hydrocephaly claim under “Neuroscience”.

  • Otherwise, cognitive psychology has a good replication rate.


Developmental psychology

  • Some evidence for a tiny effect of growth mindset (thinking that skill is improvable) on attainment.

Really we should distinguish the correlation of the mindset with attainment from the effect on attainment of a 1-hour class about the importance of growth mindset. I cover the latter, but check out Sisk for evidence against both.


  • Some evidence for a small marshmallow effect, that ability to delay gratification as a 4-year-old predicts life outcomes at 15 (Mischel).
    After controlling for the socioeconomic status of the child’s family, the marshmallow effect is r=0.05, or d=0.1: one-tenth of a standard deviation for an additional minute of delay, with nonsignificant p-values. And since it’s usually easier to get SES data…

Stats

  • Original paper: 'Attention in Delay of Gratification', Mischel 1970, n=68, all children of Stanford academics or their friends.
    (~ 1000 citations).

  • Critiques: Watts et al 2018 (n=918)
    (total citations: ~ 120)

  • Original effect size: r= [0.42, 0.57] for SAT

  • Replication effect size: Watts: r=0.28.
    After controlling for the socioeconomic status of the child's family, the Marshmallow effect is r=0.05 or d=0.1, one-tenth of a standard deviation for an additional minute delay, nonsignificant p-values.
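
(The r and d here are presumably the same estimate in two units; a minimal sketch of the usual conversion from a correlation to a standardised mean difference, assuming that’s how the figures relate:)

    import math

    def r_to_d(r):
        # d = 2r / sqrt(1 - r^2): the standard correlation-to-Cohen's-d conversion.
        return 2 * r / math.sqrt(1 - r ** 2)

    print(round(r_to_d(0.05), 2))   # 0.1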


  • No good evidence that tailoring teaching to students’ preferred learning styles has any effect on objective measures of attainment. There are dozens of these inventories, and really you’d have to look at each. (I won’t.)


Personality psychology


Marketing

  • Brian Wansink accidentally admitted gross malpractice; fatal errors were found in 50 of his lab’s papers. These include flashy results about increased portion size massively reducing satiety.


Neuroscience

  • No good evidence that brains contain one mind per hemisphere. The corpus callosotomy studies which purported to show “two consciousnesses” inhabiting the same brain were badly overinterpreted.

  • Very weak evidence for the existence of high-functioning (IQ ~ 100) hydrocephalic people. The hypothesis begins from extreme prior improbability; the effect of massive brain-volume loss is claimed to be on average positive for cognition; and the case studies are often questionable and involve little detailed study of the brains (e.g. 1970s scanners were not capable of the precision claimed).

  • Readiness potentials seem to be actually causal, not diagnostic. So Libet’s studies also do not show what they purport to. We still don’t have free will (since random circuit noise can tip us when the evidence is weak), but in a different way.

  • No good evidence for left/right hemisphere dominance correlating with personality differences. No clear hemisphere dominance at all in this study.

Stats

  • Original paper: Media speculation?

  • Critiques:
    (total citations: )

  • Original effect size: N/A?

  • Replication effect size: [ ], n=



Psychiatry

  • At most extremely weak evidence that psychiatric hospitals (of the 1970s) could not detect sane patients in the absence of deception (the Rosenhan pseudopatient study).


Parapsychology

  • No good evidence for precognition, i.e. undergraduates improving memory-test performance by studying after the test. This one is fun because Bem’s statistical methods were “impeccable” in the sense that they were what everyone else was using. He is Patient Zero in the replication crisis, and has done us all a great service. (Heavily reliant on a flat / frequentist prior; evidence of optional stopping; forking-paths analysis.)


Evolutionary psychology

  • Questionable evidence for the menstrual cycle version of the dual-mating-strategy hypothesis (that “heterosexual women show stronger preferences for uncommitted sexual relationships… during the high-fertility ovulatory phase of the menstrual cycle, while preferring long-term relationships at other points”).

Stats

  • Original paper: ( citations).

  • Critiques:
    (total citations: )

  • Original effect size: [ ], n=

  • Replication effect size: [ ], n=


  • Note from a professor friend: the dual-mating hypothesis itself is not in trouble:
    the specific menstrual cycle research doesn’t seem to replicate well. However, to my knowledge the basic pattern of short- vs long-term relationship goals predicting [women’s] masculinity preferences is still robust.


  • No good evidence that large parents have more sons (Kanazawa); the original analysis makes several errors, and reanalysis shows a near-zero effect. (Original effect size: 8% more likely.)

Stats

  • Original paper: ( citations).

  • Critiques:
    (total citations: )

  • Original effect size: [ ], n=

  • Replication effect size: [ ], n=



Psychophysiology

  • At most very weak evidence that sympathetic nervous system activity predicts political ideology in a simple fashion. In particular, the measure was subjects’ skin conductance reaction to threatening or disgusting visual prompts - a noisy and questionable one.

Stats

  • Original paper: Oxley et al, n=46 ( citations). p=0.05 on a falsely binarised measure of ideology.

  • Critiques: Six replications so far (Knoll et al; 3 from Bakker et al), five null (nonsignificant), one forking ("holds in US but not Denmark")

    (total citations: )

  • Original effect size: [ ], n=

  • Replication effect size: [ ], n=



Behavioural genetics

  • No good evidence that 5-HTTLPR is strongly linked to depression, insomnia, PTSD, anxiety, and more.

  • Be very suspicious of any such “candidate gene” finding (post-hoc data mining showing large >1% contributions from a single allele). 0/18 replications in candidate genes for depression. 73% of candidates failed to replicate in psychiatry in general. One big journal won’t publish them anymore without several accompanying replications. A huge GWAS, n=1 million: “We find no evidence of enrichment for genes previously hypothesized to relate to risk tolerance.”




[What I propose] is not a reform of significance testing as currently practiced in soft-psych. We are making a more heretical point… We are attacking the whole tradition of null-hypothesis refutation as a way of appraising theories… Most psychology using conventional H_0 refutation in appraising the weak theories of soft psychology… [is] living in a fantasy world of “testing” weak theories by feeble methods.

Paul Meehl (1990)



What now? When the next flashy WEIRD paper out of a world-class university arrives, will we swallow it?

Andrew Gelman and others suggest deflating all single-study effect sizes you encounter in the social sciences, without waiting for the subsequent shrinkage from publication bias, measurement error, data-analytic degrees of freedom, and so on. There is no uniform factor, but it seems sensible to divide novel effect sizes by a number between 2 and 100 (depending on its sample size, method, measurement noise, maybe its p-value if it’s really tiny)…


The melancholy of pareidolia

Let me drop out of my proper "effect size estimation > binary truth or falsity" talk for a moment: There is something unbearably sad about the Wikipedia page for Ego depletion.

It's 3500 words, not including the Criticism section. It is rich with talk of moderators, physiological mechanisms, and practical upshots for the layman. And it is quite possible that the whole lot of it is a phantom, a giant mistake. For small effect sizes, we can't tell the difference. Even people quite a bit smarter than us can't.

If I wander around an old bookshop, I can run my fingers over sophisticated theories of ectoplasm, kundalini, past lives, numerology, clairvoyance, alchemy. Some were written by brilliant people who also discovered real things, whose minds worked, damnit.

We are so good at explaining that we can explain things which aren't there. We have made many whole libraries and entire fields without the slightest correspondence to anything. Except our deadly ingenuity.

Selection criteria

I include a claim if there was at least one of: several failed replications, several good meta-analyses with notably smaller d, very strong publication bias, clear fatal errors in the analysis, a formal retraction, or clear fraud.

I also include cases like growth mindset, where the eventual effect size, though positive, was a tiny fraction of the hyped original claim.

I compare to the original paper's effect size because I am lazy and barely capable of the better alternative, a careful averaging of high-quality supporting papers.

Thousands of bad papers are published every year, and catching them all would be a full-time job even if they were all ripe fruit, already included in systematic replication or reanalysis projects. My rule is that if I hear a spurious effect discussed, or see it in a book, or if it could hurt someone, it's noteworthy.

Why trust replications more than originals?

Near-universal rates of pre-registration and higher rates of code and data sharing. (For "direct" replications, the original target study has in effect pre-registered its hypotheses, methods, and analysis plan.)

But I don't trust any single one of them either. I look for 3+ failed replications from different labs, partly just to save myself lots of rewriting as the garden of forking paths and the mystery of the lefty p-curve unfold.

Mandatory errata

Some popular books with uncritical treatments of the above

  • Outliers by Malcolm Gladwell
    founded on the 10,000 hours for mastery claim.

  • Behave by Robert Sapolsky
    Himmicanes, power pose, facial feedback, ego depletion, Implicit Association, stereotype threat, broken windows theory, Macbeth effect.

  • Thinking, Fast and Slow by Daniel Kahneman
    Entire chapter on all kinds of priming. Facial-feedback, Effects of Head-Movements on Persuasion, Location as Prime, Money Priming, Death Priming, Lady Macbeth Effect. Cognitive disfluency. Ego depletion. Wansink. Hungry judges. Denies the "hot hand".

  • Nudge by Thaler and Sunstein
    Wansink, Baumeister, Dweck.

  • Smarter by Dan Hurley.
    Dual n-Back and all manner of nonsense nootropics.

  • Peter Watts is an Angry Sentient Tumor
    A sadly muddled defence of Bem

TODO

* Heat priming
* Intelligence priming
* Disgust priming
* Honesty priming
* Cleanliness priming
* God priming boosts pro-sociality
* Social distance priming
* Commitment priming reduces forgiveness
* Achievement priming
* Positive mood boost helping effect
* Superiority-of-unconscious decision-making effect
* Behavioral-consequences-of automatic-evaluation
* Self-control relies on glucose effect
* Physical warmth promotes interpersonal warmth
* Power impairs perspective-taking effect
* Fertility facial-preferences effect
* Font disfluency impairs/improves cognitive performance
  https://link.springer.com/article/10.1007/s11409-016-9154-x
* Status-legitimacy effect
* Red-impairs cognitive performance effect
* Reduced pro-sociality of high SES effect
* Moral licensing effect
* Color on approach/avoidance
* Playboy effect
* Self-protective subjective temporal distance effect
* Trait loneliness hot shower effect
* Superstition boosts performance effect
* Red-boosts-attractiveness effect
* American flag priming boosts Republican support
* Unethicality darkens perception of light
* Fertility on voting
* Modulation of 1/f noise on WIT
* Time is money effect
* Embodiment of secrets
* Warmer-hearts-warmer-room effect
* Treating-prejudice-with-imagery effect
* Grammar influences perceived intentionality effect
* Attachment-warmth embodiment effect


  1. A counterexample from the COVID-19 pandemic: the UK's March 2020 policy was based on the idea of behavioural fatigue and Western resentment of restrictions; the idea that a costly prohibition would only last a few weeks before the population revolted against it, and so it had to be delayed until the epidemic's peak.

    Now, this policy was so politically toxic that we know it had to be based on some domain reasoning, and it is in a way heartening that the government tried to go beyond socially naive epidemiology. But it was strongly criticised by hundreds of other behavioural scientists, who noted that the evidence for these ideas was too weak to base policy on. Here's a catalogue of bad psychological takes.

  2. The polite convention in psychology seems to be to not mention the original effect size.

    See also "the link between x and y may be more contingent than previously assumed" or
    "we found strong moderators for the association between x and y".
  3. Semantic priming is still solid, but the effect lasts only seconds.
  4. # Summary stats behind the d ≈ −1.96 above: Baumeister 1998's radish (depletion)
    # condition vs control, persistence time in minutes.
    radishGroupMean = 8.35 #minutes
    controlGroupMean = 20.86
    radishGroupSd = 4.67
    controlGroupSd = 7.30
    radishGroupN = 22
    controlGroupN = 22

    # `cohens_d` is the summary-stats conversion sketched near the top of the page;
    # the final flag presumably requests the Hedges small-sample correction.
    cohens_d(radishGroupMean, controlGroupMean, radishGroupSd, controlGroupSd, radishGroupN, controlGroupN, True)


Deep thanks to Andrew Gelman, Stuart Ritchie, Anne Scheel, Daniël Lakens, Gwern Branwen, and Nick Brown for pointers to effectively all of these.

All honour to the hundreds of ‘data thug’ / ‘methodological terrorist’ psychologists I’ve cited, who in the last decade did the hard work of cleaning up their field.