---
title: "Rationality, superforecasting, and other psychometrics"
authors:
  - Gavin Leech
  - Misha Yagudin
  - Vasco Grilo
gleech_role: first author
year: 2023
type: report
venue: Atlas Fellowship
doi: 10.31234/osf.io/sh7y8
url: https://osf.io/sh7y8
links:
  related_talent_paper: https://psyarxiv.com/gq9r6
  related_iq_paper: https://osf.io/preprints/psyarxiv/fjba7
contribution_hours: 80
topics: [psychometrics, forecasting, talent-identification, rationality, intelligence]
---

## Abstract

Commissioned for Atlas, a fellowship for ambitious young people, this project
addressed three questions: How do you find extremely gifted people? What other
measures could work? What do elite firms and talent programmes do in practice?
It is part of a three-paper project (see the related talent and IQ papers in the
frontmatter).

*Contribution: ~80 hours across all three papers.*

## Full text

> Converted from the OSF/PsyArXiv PDF ("Good judgment: rationality, superforecasting, and
> other psychometrics"). PDF ligature artifacts repaired; the CRT-extensions appendix is
> summarised. Authors: Misha Yagudin, Vasco Grilo, Gavin Leech (Atlas Fellowship, January
> 2023).

**Abstract.** This study summarizes the literature on psychometric constructs other than
IQ. We examine the provocative critique of Stanovich (2009) and look into the measures of
reflectivity and rationality highlighted by Stanovich, West, and Toplak (2014): the
Cognitive Reflection Test (CRT; Frederick, 2005) and the Comprehensive Assessment of
Rational Thinking (CART; Stanovich, 2013), designed to capture shortcomings of human
thinking. From shortcomings, we then turn to success, and examine the Superforecasting
phenomenon (Tetlock & Gardner, 2014) — a positive definition of good judgment,
specifically the ability to reduce one's error about the uncertain future. We conclude with
an assorted review of other psychometric constructs that connect Stanovich's RQ and
Tetlock's Superforecasting.

### Introduction: what's wrong with IQ tests?

Modern IQ tests are powerful psychometric instruments that function well even in the right
tail — but they don't capture everything we care about. Drawing on Stanovich's *What
Intelligence Tests Miss* and *The Rationality Quotient*, the core claim is that intelligence
tests (which Stanovich relabels **MAMBIT** — "the mental abilities measured by intelligence
tests") fail to encompass *rational thinking*. We all know very smart people who act
irrationally (even von Neumann, co-inventor of utility theory, was a famously reckless
driver). A key distinction: all intelligence tests are *optimal-performance* assessments
(participants are told to maximize), whereas rationality is often expressed under *typical-
performance* conditions, where much of our thinking is automatic ("System 1"). Stanovich
argues that many rationality and heuristics-and-biases quantities are uncorrelated or weakly
correlated with IQ — though he honestly caveats that his samples are mostly university
students, so range restriction attenuates the correlations (which makes the near-zero
correlations all the more striking).
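
To make the range-restriction caveat concrete, here is a minimal sketch of the textbook correction for direct range restriction (often called Thorndike's Case II). The report does not apply this correction; the function name, the observed correlations, and the SD ratios below are illustrative assumptions, not figures from Stanovich's samples.

```python
import math


def correct_for_range_restriction(r_restricted: float, sd_ratio: float) -> float:
    """Estimate the unrestricted correlation from a restricted-sample correlation.

    r_restricted: correlation observed in the restricted sample (e.g. students).
    sd_ratio: unrestricted SD divided by restricted SD of the selection variable
              (1.0 means no restriction; larger means a narrower sample).
    """
    u = sd_ratio
    return (r_restricted * u) / math.sqrt(1 + r_restricted**2 * (u**2 - 1))


# Hypothetical observed correlations and restriction levels.
for observed in (0.0, 0.1, 0.2):
    for u in (1.0, 1.5, 2.0):
        corrected = correct_for_range_restriction(observed, u)
        print(f"observed r={observed:.1f}, SD ratio={u:.1f} -> corrected r={corrected:.3f}")
```

Under this formula an observed correlation of exactly zero stays at zero however severe the restriction, while small positive correlations grow as the SD ratio rises.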

### Cognitive Reflection Test (CRT)

Frederick's (2005) CRT has just three items (the bat-and-ball, the widget-machines, and the
lily-pads-on-a-lake problems). It measures the disposition to think reflectively and
override an attractive-but-wrong intuitive answer. Despite having only three items, it is
surprisingly reliable (Cronbach's α ≈ 0.6) and predicts temporal discounting, expected-value
choices, framing effects, probabilistic reasoning, and utilitarian moral judgments. CRT
correlates with IQ measures from ~0.31 to 0.61 (and 0.44 with SAT), yet Toplak et al. (2011)
showed it accounts for substantial *additional* variance in heuristics-and-biases tasks even
after controlling for IQ, executive function, and thinking dispositions (CRT added 11.2%
unique variance). Two limitations: the test is now widely known (about half of new participants
have seen it), and its items are simple enough that an elementary arithmetic check suffices.
The report therefore sketches a harder, more confusing extended version (Appendix B), designed
(per Wason's 4-card logic) so that one must *check whether something is possible*, with
counterintuitive problems hidden among mundane ones.
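
For readers unfamiliar with the reliability statistic quoted above, Cronbach's α can be computed from item-level scores as sketched below. The response matrix is made up for illustration (six hypothetical respondents, three right/wrong items) and is not CRT data; `cronbach_alpha` is an illustrative helper name.

```python
from statistics import pvariance


def cronbach_alpha(scores: list[list[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(scores[0])  # number of items
    item_columns = zip(*scores)  # transpose: one tuple of scores per item
    item_variance_sum = sum(pvariance(col) for col in item_columns)
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical data: rows are respondents, columns are three items scored 0/1.
responses = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

The statistic rises when items covary strongly relative to their individual variances, so a three-item test has little room for a noisy item.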

### Comprehensive Assessment of Rational Thinking (CART)

Stanovich, West, and Toplak's (2016) CART comprehensively covers the Kahneman–Tversky
heuristics-and-biases program across ~20 subtests (Reflection vs Intuition, Syllogistic
Reasoning, Ratio Bias, Disjunctive Reasoning, Framing, Anchoring, Preference Anomalies,
Argument Evaluation, Knowledge Calibration, Temporal Discounting, Probabilistic Numeracy,
Financial Literacy, Sensitivity to Expected Value, Risk Knowledge, Superstitious Thinking,
Antiscience Attitudes, Conspiracy Beliefs, Dysfunctional Personal Beliefs). However, the
full-scale CART correlates **r = 0.695** with an IQ-like "cognitive composite", about as strongly
as IQ correlates with reading comprehension, so almost half its variance is shared with measured
intelligence. After examining all papers citing Stanovich (2016), we found no
further work on CART's psychometric properties, and tend to agree with Ritchie (2017) that
its practical interest is limited: it isn't shown to predict important outcomes *over and
above* IQ.
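
The "almost half" figure follows from squaring the correlation. A short check, using only the r = 0.695 reported above (the variable names are illustrative):

```python
r = 0.695  # full-scale CART vs. cognitive composite, as reported above

shared_variance = r ** 2                  # proportion of variance shared with the composite
remaining_variance = 1 - shared_variance  # everything else, including measurement error

print(f"shared: {shared_variance:.3f}, remaining: {remaining_variance:.3f}")
# prints "shared: 0.483, remaining: 0.517"
```

The remaining share is an upper bound on what CART could add beyond the composite, since part of it is measurement noise.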

### Evidence from Superforecasting

CRT and CART are "double negatives" (rationality defined as *not* falling for errors); good
judgment is a *positive* property. Crucially, good judgment should be grounded in objective
results, not peer reputation (Burgman et al. 2011 found that experts' rankings of each other
are a poor guide to actual performance). The Superforecasters were the top ~2% of thousands
of participants across IARPA tournaments; their accuracy *increased* with practice (avoiding
regression to the mean, suggesting skill rather than luck), and a market of top forecasters beat
a market of US Intelligence Community members by 10% (Brier score). Mellers et al. (2017) attribute
their success to: greater active open-mindedness and fluid/crystallized intelligence; more
motivation (more questions, more updates); task-specific skills like scope sensitivity and
more granular probability judgments; and more stimulating team environments. Tetlock's
"portrait of the modal superforecaster": cautious, humble, nondeterministic; actively
open-minded, intelligent with a need for cognition, reflective, numerate; pragmatic,
analytical, "dragonfly-eyed", probabilistic, thoughtful updaters, good intuitive
psychologists; with a growth mindset and grit.
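
The "10% (Brier score)" comparison above rests on a proper scoring rule. The sketch below shows the binary form of the Brier score and a relative-improvement calculation. The forecasts, outcomes, and helper names are hypothetical, and tournaments may use a multi-category variant that sums over all answer options, which changes the scale but not the ordering for binary questions.

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally long, non-empty lists")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def relative_improvement(score: float, reference: float) -> float:
    """Fraction by which `score` undercuts `reference` (0.10 means 10% lower)."""
    return (reference - score) / reference


# Hypothetical questions: 1 = event happened, 0 = it did not.
outcomes = [1, 0, 1, 1, 0]
sharp = [0.9, 0.1, 0.8, 0.7, 0.2]   # confident and mostly right
hedged = [0.6, 0.4, 0.6, 0.5, 0.5]  # closer to "don't know"

sharp_score = brier_score(sharp, outcomes)
hedged_score = brier_score(hedged, outcomes)
print(f"sharp: {sharp_score:.3f}, hedged: {hedged_score:.3f}")
print(f"relative improvement: {relative_improvement(sharp_score, hedged_score):.1%}")
```

Always answering 50% earns 0.25 on this scale, which is a useful reference point when reading tournament results.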

**Identifying forecasters cheaply.** The standard recipe (recruit thousands into a 6–12 month
tournament, then take the top few percent) is costly. Atanasov & Himmelstein's (2022) "Talent
Spotting in Crowd Prediction" groups skill-identification measures into five families:
(1) accuracy-related (proper scores; *most* predictive); (2) intersubjective (proxy/surrogate/
similarity scores — usable before questions resolve); (3) forecasting behaviours (activity,
belief updating, extremity, coherence); (4) dispositional (fluid intelligence, cognitive
reflection, numeracy, personality, with fluid intelligence highly correlated with accuracy and
self-reported thinking styles only weakly); and (5) expertise measures (calibration, demonstrated
knowledge). The report suggests several quick on-the-spot proxies: calibration games (calibration
inside and outside one's specialization correlates at r ≈ 0.39); mock competitions scored against a
Superforecaster panel; and signals like forecast granularity, precision loss under rounding,
scope sensitivity, and readiness to update on evidence.
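
A calibration game reduces to comparing stated confidence with observed hit rates. The following is a minimal sketch of that bookkeeping, assuming binary questions and 10-point confidence buckets; the data and the `calibration_table` helper are hypothetical rather than taken from the report.

```python
from collections import defaultdict


def calibration_table(forecasts, outcomes, step=0.1):
    """Group forecasts into buckets of width `step`; return {bucket: (count, hit rate)}."""
    buckets = defaultdict(list)
    for p, outcome in zip(forecasts, outcomes):
        # Snap to the nearest bucket centre; the outer round() removes float noise.
        bucket = round(round(p / step) * step, 10)
        buckets[bucket].append(outcome)
    return {b: (len(hits), sum(hits) / len(hits)) for b, hits in sorted(buckets.items())}


# Hypothetical answers from one player: stated probability and whether they were right.
forecasts = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
outcomes = [1, 1, 1, 0, 1, 0]

table = calibration_table(forecasts, outcomes)
for bucket, (n, hit_rate) in table.items():
    print(f"said {bucket:.0%}: right {hit_rate:.0%} of the time (n={n})")
```

A well-calibrated player's hit rate tracks the bucket label; this hypothetical player is overconfident in both buckets. Real use would need many more questions per bucket before the hit rates mean much.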

### Other psychometrics

The CART subtests overlap with the dispositions Tetlock and Mellers used to characterize
Superforecasters — e.g. **Actively Open-Minded Thinking** (CART's 30 items vs the
Superforecaster assessment's 7-item Haran et al. scale). CART also taps **pseudodiagnosticity**
(Doherty & Mynatt) — a probabilistic version of Wason's 4-card task (choosing which of P(H),
P(¬H), P(E|H), P(E|¬H) you'd need; most people fail to pick the critical P(E|¬H)), relevant
to base-rate neglect and confirmation seeking. On a similar task, Superforecasters were
slightly more likely to select a correct diagnostic pair (41% vs 36% for regular forecasters,
32% for undergrads).
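
Why P(E|¬H) is the critical quantity can be seen from Bayes' rule: evidence shifts belief only to the extent that it is more likely under H than under ¬H. A small sketch with made-up probabilities (the function name and numbers are illustrative assumptions):

```python
def posterior(prior_h: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """P(H|E) by Bayes' rule for a binary hypothesis."""
    joint_h = prior_h * p_e_given_h
    joint_not_h = (1 - prior_h) * p_e_given_not_h
    return joint_h / (joint_h + joint_not_h)


prior = 0.5
p_e_given_h = 0.8  # the evidence is common when H is true...

# ...but whether it is diagnostic depends entirely on P(E|not H).
for p_e_given_not_h in (0.8, 0.4, 0.1):
    updated = posterior(prior, p_e_given_h, p_e_given_not_h)
    print(f"P(E|not H)={p_e_given_not_h:.1f} -> P(H|E)={updated:.3f}")
```

With P(E|H) fixed at 0.8, the posterior ranges from no update at all (0.5) to about 0.89, so P(E|H) on its own says nothing about how far to update.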

### Conclusion

A wide range of psychometric constructs beyond IQ is worth exploring; the work of Frederick,
of Stanovich/West/Toplak, and of Tetlock/Gardner offers valuable insight into measuring other
aspects of thinking. By examining both shortcomings (CRT, CART) and successes
(Superforecasting), we can build a more comprehensive understanding of how people make
judgments well. We sketch an extension of CRT aimed at exceptionally talented people.

*(Appendix A surveys six published CRT extensions — Toplak et al. 2014; Primi et al. 2015;
Baron et al. 2015; Thompson & Oppenheimer's four-item CRT-2; Sirota et al.'s nonmathematical
verbal CRT-V; and Young et al.'s child-friendly version — that address recognizability, item
count, and over-reliance on arithmetic. Appendix B presents the authors' own harder, more
confusing candidate items.)*