Abstract

The official thesis abstract (verbatim).

Fields work in very different ways — but fail in similar ways. This thesis covers some of my work in epidemiology, psychology, and machine learning with the common thread of shared methodological issues. I identify frameworks which fail to cover actual practice, practices which fail to live up to normative principles, and propose practices which are sometimes able to address some failures at some cost.

In epidemiology I take a Bayesian approach to infectious disease modelling and infer the effect of entire populations wearing face masks during the Covid pandemic — with the key caveat that this is an observational study. I identify a ubiquitous methodological mistake (using mandate timings as a proxy for wearing behaviour, when these are, surprisingly, not strongly correlated).

In psychology I synthesise theories of the replication crisis and report on initiating a large (n = 1931) dataset of replication studies covering the original effect sizes, replication effect sizes, and both raw and recalculated statistics. These nonrandom data still give some insight into post-crisis psychology, confirming past results showing that even the sign of effects is often not replicable. I note a pattern of considerable ‘shrinkage’ in effect sizes between the original study and their replications.

In machine learning I trace recent methodological changes and provide a novel analysis of roughly forty ways that ML evaluations are often misleading.

Each chapter contains a self-contained background section for its respective field. I conclude with lessons for each field from the other two.

“an impressive work of scholarship… The epidemiology has had broad influence, including on policy discussion at a national level… the machine learning chapter is an admirable feat of investigation and synthesis.” — Conor Houghton

Author’s gloss: Broadly, the thesis is about epistemics: why we can’t learn that much from one study, or from many studies. In Newell’s typology the thesis “contradicts existing knowledge; thoroughly explores an area; provides empirical data; and produces a negative result.” It is a Gelmanian work: Gelman is cited 74 times and appears in every chapter. Contribution: ~3500 hours on the whole PhD (226 of them on the writeup).

Full text

The thesis runs to ~165,000 words across four chapters. Its chapters are the substance of several other entries in this folder, so rather than inline it all, this section reproduces only the thesis’s unique synthesis material: the structure, the per-paper contributions, and the cross-field conclusion. Each chapter is cross-linked to the corresponding paper, and the full PDF is linked in the frontmatter.

Table of contents

Contributions

The thesis draws on the following published work. Bullet points denote Gavin’s contributions; * denotes equal authorship.

As first or senior (last) author:

  1. Mask wearing in community settings reduces SARS-CoV-2 transmission (2022, PNAS) — Gavin Leech*, Charlie Rogers-Smith*, et al. → mask-wearing.md
  2. Ten Hard Problems in Artificial Intelligence We Must Get Right (2024, in review at ACM Computing Surveys) → ten-hard-problems.md
  3. Tracking replications in the social, cognitive, and behavioural sciences (2024, in review at Nature Human Behaviour) → tracking-replications.md
  4. Questionable practices in machine learning (2024, in review at JMLR) → questionable-practices-ml.md

As other author:

  1. Inferring the effectiveness of government interventions against COVID-19 (2020, Science) → inferring-npi-effectiveness.md
  2. How Robust are the Estimated Effects of Nonpharmaceutical Interventions against COVID-19? (2020, NeurIPS spotlight) → npi-robustness.md
  3. Seasonal variation in SARS-CoV-2 transmission in temperate climates (2022, PLOS Computational Biology) → seasonal-covid.md
  4. Massively Parallel Reweighted Wake-Sleep (2023, UAI) → parallel-reweighted-wake-sleep.md
  5. The Replication Database: Documenting the Replicability of Psychological Science (2024, Journal of Open Psychology Data) → replication-database.md

Conclusion

The thesis’s de facto research questions and their answers:

Epidemiology.

Psychology.

Machine learning.

As a whole. The thesis’s strength is that it collates work which led somewhere: the Covid papers were used in (at least) UK and Czech policy decisions; the psychology replication collection has some chance of becoming a standard reference with ongoing crowdsourcing; the questionable research practices (QRP) work has been used at Ofcom and as part of an evaluation checklist in at least one industrial lab. Its weakness is that it isn’t theoretically deep: the problems are not common to all three fields (besides underreporting, the ur-problem which hides all other problems). The original plan was to compare the evidential standards of the three fields and unify Bayesian inference, hypothesis testing, and statistical learning.