Commissioned for Atlas, a fellowship for ambitious young people, this project answered three questions: How do you find extremely gifted people? What other measures could work? What do people do in practice in elite firms and talent programmes? Part of a three-paper project (see the related talent and IQ papers in the frontmatter).
Contribution: ~80 hours across all three papers.
Converted from the OSF/PsyArXiv PDF (“Good judgment: rationality, superforecasting, and other psychometrics”). PDF ligature artifacts repaired; the CRT-extensions appendix is summarised. Authors: Misha Yagudin, Vasco Grilo, Gavin Leech (Atlas Fellowship, January 2023).
Abstract. This study summarizes the literature on psychometric constructs other than IQ. We examine the provocative critique of Stanovich (2009) and look into the measures of reflectivity and rationality highlighted by Stanovich, West, and Toplak (2014): the Cognitive Reflection Test (CRT; Frederick, 2005) and the Comprehensive Assessment of Rational Thinking (CART; Stanovich, 2013), designed to capture shortcomings of human thinking. From shortcomings, we then turn to success, and examine the Superforecasting phenomenon (Tetlock & Gardner, 2014) — a positive definition of good judgment, specifically the ability to reduce one’s error about the uncertain future. We conclude with an assorted review of other psychometric constructs that connect Stanovich’s RQ and Tetlock’s Superforecasting.
Modern IQ tests are powerful psychometric instruments that function well even in the right tail, but they don’t capture everything we care about. The core claim, drawn from Stanovich’s What Intelligence Tests Miss and The Rationality Quotient, is that intelligence tests, which measure what Stanovich relabels MAMBIT (“the mental abilities measured by intelligence tests”), fail to encompass rational thinking. We all know very smart people who act irrationally (even von Neumann, co-inventor of utility theory, was a famously reckless driver). A key distinction: all intelligence tests are optimal-performance assessments (participants are told to maximize), whereas rationality is often expressed under typical-performance conditions, where much of our thinking is automatic (“System 1”). Stanovich argues that many rationality and heuristics-and-biases quantities are uncorrelated or only weakly correlated with IQ, though he candidly caveats that his samples are mostly university students, so range restriction attenuates the correlations (meaning the near-zero values may understate the relationship in the general population).
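To make the range-restriction caveat concrete, here is a minimal sketch using the standard formula for direct range restriction on one variable (often called Thorndike's Case 2). The function name and the example numbers are illustrative assumptions, not figures from Stanovich or from the report.

```python
import math


def restricted_correlation(r_full: float, u: float) -> float:
    """Correlation expected in a sample directly selected on one variable.

    r_full: correlation in the unrestricted population.
    u: restricted SD divided by unrestricted SD of the selected
       variable (0 < u <= 1); u = 1 means no restriction.
    """
    if not 0 < u <= 1:
        raise ValueError("u must be in (0, 1]")
    if not -1 < r_full < 1:
        raise ValueError("r_full must be strictly between -1 and 1")
    return r_full * u / math.sqrt(1 - r_full**2 + (r_full * u) ** 2)


# Hypothetical inputs: a population correlation of 0.5, observed in a
# student sample whose IQ spread is half that of the population.
observed = restricted_correlation(0.5, 0.5)
print(round(observed, 3))
```

Under these assumed inputs the observed correlation drops to about 0.28, which is why weak correlations in student samples are hard to read as evidence of weak correlations in general.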
Frederick’s (2005) CRT has just three items (the bat-and-ball, the widget-machines, and the lily-pads-on-a-lake problems). It measures the disposition to think reflectively and override an attractive-but-wrong intuitive answer. Despite only three items it is surprisingly reliable (Cronbach’s α ≈ 0.6) and predicts temporal discounting, expected-value choices, framing effects, probabilistic reasoning, and utilitarian moral judgments. CRT correlates with IQ measures from ~0.31 to 0.61 (and 0.44 with SAT), yet Toplak et al. (2011) showed it accounts for substantial additional variance in heuristics-and-biases tasks even after controlling for IQ, executive function, and thinking dispositions (CRT added 11.2% unique variance). A limitation: the test is now widely known (~half of new participants have seen it), and it’s simple enough that elementary arithmetic checks suffice — so the report sketches a harder, more confusing extended version (Appendix B), designed (per Wason’s 4-card logic) so that one must check whether something is possible, with counterintuitive problems hidden among mundane ones.
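For readers unfamiliar with the reliability figure, the sketch below computes Cronbach's α from a respondents-by-items score table using the usual formula α = k/(k−1) · (1 − Σ item variances / variance of total scores). The response data are made up for illustration and are not taken from Frederick (2005) or the report.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    if any(len(row) != k for row in scores):
        raise ValueError("every respondent needs a score for every item")
    # Variance of each item across respondents.
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of each respondent's total score.
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses to a three-item test (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
print(cronbach_alpha(responses))
```

With only three items, α is bounded by how strongly the items covary, which is why a value near 0.6 counts as respectable for so short a test.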
Stanovich, West, and Toplak’s (2016) CART comprehensively covers the Kahneman–Tversky heuristics-and-biases program across ~20 subtests (Reflection vs Intuition, Syllogistic Reasoning, Ratio Bias, Disjunctive Reasoning, Framing, Anchoring, Preference Anomalies, Argument Evaluation, Knowledge Calibration, Temporal Discounting, Probabilistic Numeracy, Financial Literacy, Sensitivity to Expected Value, Risk Knowledge, Superstitious Thinking, Antiscience Attitudes, Conspiracy Beliefs, Dysfunctional Personal Beliefs). However, the full-scale CART correlates r = 0.695 with an IQ-like “cognitive composite”, so almost half its variance (r² ≈ 0.48) is explained by measured intelligence; the correlation is about as strong as the one between IQ and reading comprehension. After examining all papers citing Stanovich (2016), we found no further work on CART’s psychometric properties, and tend to agree with Ritchie (2017) that its practical interest is limited: it isn’t shown to predict important outcomes over and above IQ.
CRT and CART are “double negatives” (rationality defined as not falling for errors); good judgment is a positive property. Crucially, good judgment should be grounded in objective results, not peer reputation (Burgman et al. 2011 found that experts’ rankings of each other are a poor guide to actual performance). The Superforecasters were the top ~2% of thousands of participants across IARPA tournaments; their accuracy increased with practice rather than regressing to the mean, which suggests skill rather than luck, and a prediction market of top forecasters beat one made up of US Intelligence Community members by 10% in Brier score. Mellers et al. (2017) attribute their success to: greater active open-mindedness and fluid/crystallized intelligence; more motivation (more questions, more updates); task-specific skills like scope sensitivity and more granular probability judgments; and more stimulating team environments. Tetlock’s “portrait of the modal superforecaster” describes them as cautious, humble, nondeterministic; actively open-minded, intelligent with a need for cognition, reflective, numerate; pragmatic, analytical, “dragonfly-eyed”, probabilistic, thoughtful updaters, good intuitive psychologists; with a growth mindset and grit.
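Since the comparison above is stated in Brier scores, a minimal sketch of the binary Brier score and of a relative improvement between two scores may help. The forecasts, outcomes, and the two example scores are hypothetical. Note also that this is the common 0-to-1 binary form; Brier's original formulation sums over all outcome categories, which doubles the value for binary questions.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def relative_improvement(score, baseline):
    """Fraction by which `score` improves on (is lower than) `baseline`."""
    return (baseline - score) / baseline


# Hypothetical forecasts on four resolved yes/no questions.
forecasts = [0.9, 0.2, 0.7, 0.1]
outcomes = [1, 0, 1, 0]
print(brier_score(forecasts, outcomes))

# Hypothetical scores: 0.18 versus a baseline of 0.20 is a 10% improvement.
print(relative_improvement(0.18, 0.20))
```

Always answering 0.5 scores 0.25 under this form, so useful forecasters need to land well below that.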
Identifying forecasters cheaply. The standard recipe (recruit thousands into a 6–12 month tournament, then take the top few percent) is costly. Atanasov & Himmelstein’s (2022) “Talent Spotting in Crowd Prediction” groups skill-identification measures into five families: (1) accuracy-related (proper scores; most predictive); (2) intersubjective (proxy/surrogate/similarity scores, usable before questions resolve); (3) forecasting behaviours (activity, belief updating, extremity, coherence); (4) dispositional (fluid intelligence, cognitive reflection, numeracy, personality; fluid intelligence correlates strongly with accuracy, self-reported thinking styles only weakly); and (5) expertise measures (calibration, demonstrated knowledge). Quick on-the-spot proxies the report suggests: calibration games (calibration inside and outside one’s specialization correlates at r ≈ 0.39); mock competitions scored against a Superforecaster panel; and signals like forecast granularity, precision loss under rounding, scope sensitivity, and readiness to update on evidence.
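One of these signals, precision loss under rounding, can be sketched as follows: coarsen each forecast to a fixed grid and see how much the Brier score worsens. The helper names and the data are hypothetical, and this is only one plausible way to operationalise the idea, not the procedure used by the cited authors.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def round_to_grid(p, n_bins):
    """Round a probability to the nearest multiple of 1/n_bins."""
    return round(p * n_bins) / n_bins


def rounding_loss(forecasts, outcomes, n_bins):
    """Brier score after rounding minus Brier score before.

    Positive values mean the forecaster's extra granularity carried
    information that rounding destroyed.
    """
    rounded = [round_to_grid(p, n_bins) for p in forecasts]
    return brier_score(rounded, outcomes) - brier_score(forecasts, outcomes)


# Hypothetical forecaster who distinguishes 96% from 100%.
forecasts = [0.96, 0.04, 0.96]
outcomes = [1, 0, 0]
print(rounding_loss(forecasts, outcomes, n_bins=10))
```

On a handful of questions the sign of the loss is noisy, since rounding can move a forecast toward the outcome by chance; the signal is only meaningful averaged over many questions.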
The CART subtests overlap with the dispositions Tetlock and Mellers used to characterize Superforecasters — e.g. Actively Open-Minded Thinking (CART’s 30 items vs the Superforecaster assessment’s 7-item Haran et al. scale). CART also taps pseudodiagnosticity (Doherty & Mynatt) — a probabilistic version of Wason’s 4-card task (choosing which of P(H), P(¬H), P(E|H), P(E|¬H) you’d need; most people fail to pick the critical P(E|¬H)), relevant to base-rate neglect and confirmation seeking. On a similar task, Superforecasters were slightly more likely to select a correct diagnostic pair (41% vs 36% for regular forecasters, 32% for undergrads).
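Why P(E|¬H) is the critical quantity follows directly from Bayes' rule, as the short sketch below shows. The function name and the probabilities are illustrative assumptions, not items from the CART or from the Superforecaster studies.

```python
def posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """P(H|E) from the prior P(H) and the two likelihoods, via Bayes' rule."""
    numerator = prior_h * p_e_given_h
    denominator = numerator + (1 - prior_h) * p_e_given_not_h
    if denominator == 0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / denominator


# Hypothetical case: the evidence is likely under H (0.8)...
# ...but equally likely under not-H, so it should not move the prior at all.
print(posterior(0.5, 0.8, 0.8))

# Same P(E|H), but the evidence is rare under not-H, so it is diagnostic.
print(posterior(0.5, 0.8, 0.2))
```

Knowing P(E|H) alone cannot distinguish these two cases; only the ratio P(E|H) / P(E|¬H) determines how far the evidence should shift belief.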
A wide range of psychometric constructs beyond IQ is worth exploring; the work of Frederick, of Stanovich/West/Toplak, and of Tetlock/Gardner offers valuable insight into measuring other aspects of thinking. By examining both shortcomings (CRT, CART) and successes (Superforecasting), we can build a more comprehensive understanding of how people make judgments well. We sketch an extension of CRT aimed at exceptionally talented people.
(Appendix A surveys six published CRT extensions — Toplak et al. 2014; Primi et al. 2015; Baron et al. 2015; Thompson & Oppenheimer’s four-item CRT-2; Sirota et al.’s nonmathematical verbal CRT-V; and Young et al.’s child-friendly version — that address recognizability, item count, and over-reliance on arithmetic. Appendix B presents the authors’ own harder, more confusing candidate items.)