---
title: "Tracking (failed) replications in social science"
full_title: "Tracking and mainstreaming replications in the social, cognitive, and behavioral sciences"
authors:
  - Helena Hartmann
  - Flavio Azevedo
  - Lukas Röseler
  - "… (et al.)"
  - Gavin Leech
gleech_role: co-author
year: 2025
type: preprint
status: in review at Nature Human Behaviour
doi: 10.31222/osf.io/me2ub_v1
url: https://osf.io/preprints/metaarxiv/me2ub_v1
links:
  database: https://forrt.org/apps/fred_explorer.html
contribution_hours: 150
topics: [metascience, replication, open-science, social-science]
---

## Abstract

Replicability is a cornerstone of scientific progress. Yet, replications are
often undervalued, and are sometimes seen as redundant, unimportant, or lacking
novelty. This impedes their broader adoption in research and beyond. In response,
the credibility revolution calls for slower, more deliberate science and greater
responsiveness to fallibility. In this perspective piece, we argue that (a)
replications are essential for validating scientific claims, (b) replications need
to be made more visible, recognized, and integrated into research and educational
practices, and (c) we can change the way we view and judge replication results. We
propose a framework where replication studies can be systematically tracked and
normalized through the Replication Hub as part of the Framework for Open and
Reproducible Research Training (FORRT) initiative, with the goal of enhancing the
visibility, integration, and cumulative impact of replication research across
disciplines.

**Keywords**: Metascience; Open scholarship; Open science; Replicability;
Replications; Reproducibility; Replication Crisis

## Full text

> Converted from the manuscript (a Word document). Prose reproduced verbatim with
> docx conversion artifacts cleaned; in-text citations kept as the authors wrote
> them. Figures (the FReD Explorer, FReD Annotator, and Replication Hub
> screenshots) are referenced but not reproduced, and the full reference list is
> omitted — see the source.

### Introduction

Replication is the process by which researchers test whether the same claims, in
identical, similar, or varying contexts, lead to conclusions consistent with those
of the original study (Parsons et al., 2022). Replications are a cornerstone of
empirical research, where independent sources contribute cumulative evidence to
support or refute a given claim. However, recent metascientific studies across
scientific fields, particularly those in social, cognitive, and behavioral
sciences, have found that many prominent findings 'fail' to replicate, that is,
their results do not converge with those from the target studies and challenge the
credibility of previous scientific claims (Brodeur et al., 2024; Ioannidis, 2005;
Nosek & Errington, 2020).

Even when studies do replicate, the observed effects are often much smaller (Patil
et al., 2016), averaging half of the originally reported effect sizes (Camerer et
al., 2018; Open Science Collaboration, 2015). Such patterns appear across fields,
including psychology, medicine, biology, economics, and neuroscience. This has been
coined a 'replication crisis', raising concerns regarding the robustness of
scientific knowledge and challenging the validity of decades of research. In
response, this so-called crisis has given rise to a grassroots open science reform
movement and the emergence of the field of metascience, that is, research on how
science is conducted.

Despite this movement and evidence that many studies do not replicate, replication
attempts are still rare. Moreover, journals are often unwilling to publish
replication studies, which compromises our ability to build robust bodies of
evidence to inform policy and practice. It also highlights that replications are
not yet given the recognition they deserve, particularly by journal editors,
funders, policymakers, and even researchers themselves. In contrast, novel results
are often less scrutinized regarding reproducibility, published more readily, and
cited more frequently (Scheel et al., 2021; Serra-Garcia & Gneezy, 2021).

In contrast to this underappreciation of replications, especially those that
challenge long-established original findings, we argue that replications should be
seen as a critical resource for designing, conducting, and interpreting research.
Viewing the
'replication crisis' as a 'credibility revolution' (Korbmacher et al., 2023;
Vazire, 2018) and an opportunity (Munafò et al., 2022), we are not alone in calling
for slower, more deliberate science and greater responsiveness to fallibility. In
the following sections, we discuss (a) the essential role of replications in
science, (b) the need for their increased visibility and recognition through
systematic tracking, (c) necessary changes in the way we judge replication results,
and (d) future directions for replication practices in professional and educational
contexts.

### The need for replications

There is a common and longstanding narrative of science being built on
replications, but recently they have been heralded as a key tool for 'saving
science' (Edlund et al., 2021). Two fundamental aspects of science make
replications indispensable: First, given the probabilistic nature of research and
the myriad contextual and random factors affecting outcomes, no single study can be
conclusive — including in the social, cognitive, and behavioral sciences. Second,
science should be self-correcting, cumulative, and incremental, with progress
building on prior work.

Despite this, current scientific practice often prioritizes novelty over
replication and treats individual findings as definitive rather than part of a
larger evolving picture. The credibility revolution has underscored the dangers of
prioritizing flashy and unexpected results over robustness. For example, research
on social priming appeared so convincing that Nobel laureate Daniel Kahneman
dedicated a chapter to it in *Thinking, Fast and Slow* (Kahneman, 2011); but once
preregistered replications were conducted more systematically, multiple independent
teams failed to replicate the originally reported social priming effects, and the
field became emblematic of concerns surrounding research integrity.

*Direct* replications are a crucial safeguard against the immense resource waste of
building a literature on false positive findings (Zwaan et al., 2018). By recreating
studies with highly similar or identical methods and sample characteristics, direct
replications help to identify which findings are reliable (as opposed to the
previously more common *conceptual* replications, which include differences in
sample, design, measurement and/or analysis). Given the regular occurrence of false
positive results — significantly amplified by publication bias and questionable
research practices — multiple and direct replications are essential. Replications
per se are not 'better' than original studies; each study needs to be judged on its
own merits.

Beyond verifying the existence of an effect, especially when science moves towards
application, it is crucial to estimate accurate effect sizes to determine practical
significance. Precision improves dramatically with larger sample sizes, whereas
biases in the literature (e.g. publication bias) often exaggerate effect sizes. In
addition to corroborating or challenging original claims,
replications also help identify 'boundary' conditions that affect the presence
and/or magnitude of effects — particularly when moving beyond limited contexts
(e.g. WEIRD populations). Direct or close replications ensure core effects hold
under similar circumstances; conceptual replications are a crucial next step,
deliberately varying contextual or methodological features to assess robustness and
generalizability (e.g. Tunç and Tunç's Systematic Replications Framework).
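The relationship between sample size and precision mentioned above can be made concrete with a minimal sketch. It assumes a two-group design with equal group sizes and uses the common large-sample approximation for the standard error of Cohen's d; the function names and the example values (d = 0.4, the chosen group sizes) are illustrative and do not come from the manuscript.

```python
import math

def cohens_d_standard_error(d: float, n_per_group: int) -> float:
    """Large-sample approximation to the standard error of Cohen's d
    for two independent groups of equal size (an assumption of this sketch)."""
    n1 = n2 = n_per_group
    return math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))

def ci_half_width(d: float, n_per_group: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval around d."""
    return z * cohens_d_standard_error(d, n_per_group)

if __name__ == "__main__":
    # Illustrative values only: each fourfold increase in n halves the interval.
    for n in (20, 80, 320):
        print(n, round(ci_half_width(0.4, n), 3))
```

Under this approximation the interval width shrinks with the square root of the sample size, so quadrupling the number of participants per group halves the uncertainty around the estimated effect.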

While many studies replicate main effects before testing interactions, moderators,
or mediators, these tests are rarely labeled as replications and often deviate from
original protocols. This lack of consistency in naming and methods limits the
accumulation of evidence and the tracing of 'failed' replications, which usually
remain unpublished. Importantly, 70% of researchers have reported failing to
replicate findings at least once (Baker, 2016), yet the low publication rate
suggests many of these attempts are left in the metaphorical 'file drawer'
(Rosenthal, 1979), keeping potentially flawed research lines alive.

Taken together, these developments highlight why replicating results and making them
more visible are fundamental to producing reliable, trustworthy science. Fostering a
culture of replication offers benefits beyond merely assessing individual claims:
the expectation of future replication can improve reporting practices, reduce
errors, and potentially even prevent fraud. Despite these promises, existing
estimates suggest that between 0.2% and 5% of published studies in psychology are
replications, with even lower rates in other fields, and there is no standardized
way of indexing them. Developing comprehensive databases of replication studies is
one way to remedy this.

### Tracking replications systematically

Practical solutions are essential to shift replication studies from a niche effort
to a mainstream scientific practice. To make replications more mainstream and
visible, we created a comprehensive database of replications as a resource for
research and teaching. At present, the *FORRT Replication Database* (FReD) contains
a large index of original studies, their replications, and their raw statistics and
effect sizes (n = 1,118 original articles and n = 1,137 replication references from
151 different journals and 167 contributors as of 2025-02-11). With over 160
researchers having contributed since its conception in April 2022, we aim for this
to be a living, community-driven solution for collecting, updating, and
disseminating replications.

This database is embedded within the *FORRT Replication Hub*, a comprehensive and
living resource where authors, reviewers, educators, and editors can log and access
replication studies. FReD hosts: (1) the FReD Explorer (a database of original
studies and their replications); (2) the FReD Reference Annotator (a tool to check
reference lists for replications); and (3) a list of large-scale replication
projects. This centralized resource makes replications easier to find, eases their
integration into scholarly workflows, and facilitates the citation of replications
alongside original studies.
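To illustrate the general idea of checking a reference list against an index of replications, here is a minimal sketch. It is not the FReD Reference Annotator's implementation: the index structure, the function names, and the placeholder DOIs are all hypothetical, and matching on normalized DOIs is simply one plausible approach.

```python
def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common URL or 'doi:' prefixes."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi

# Hypothetical index: original-study DOI -> DOIs of known replications.
REPLICATION_INDEX = {
    "10.0000/example.original.1": [
        "10.0000/example.replication.1a",
        "10.0000/example.replication.1b",
    ],
    "10.0000/example.original.2": ["10.0000/example.replication.2a"],
}

def annotate_references(reference_dois, index):
    """Return the references that have known replications, keyed by the
    reference exactly as it appeared in the input list."""
    normalized_index = {normalize_doi(k): v for k, v in index.items()}
    annotated = {}
    for ref in reference_dois:
        key = normalize_doi(ref)
        if key in normalized_index:
            annotated[ref] = normalized_index[key]
    return annotated

if __name__ == "__main__":
    reading_list = [
        "https://doi.org/10.0000/EXAMPLE.original.1",
        "10.0000/not.in.the.index",
    ]
    print(annotate_references(reading_list, REPLICATION_INDEX))
```

Normalizing before lookup matters because DOIs are case-insensitive and appear in reference lists both as bare identifiers and as URLs.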

Historically, the initial version of the database was created by gathering instances
of replication failures and successes from sources such as scientific mailing lists,
blogs, and social media platforms (see also the *FORRT Replications & Reversals*
project). Subsequently, participating FORRT volunteers contributed information about
replication studies from their subfields over multiple years and at various
hackathons starting in 2018, recording for each study the citation, study design,
sample sizes, and effect sizes of both the original and replication work.

This database has some limitations. Due to the self-selected sample of studies, we
explicitly refrained from presenting simple summaries or inferential statements
about fields or subfields based on the database alone. The resource is not an
exhaustive list of replications (or even 'failed' replications), as the initial
collection process was biased towards famous original works. New evidence is added
weekly (still largely volunteer-driven, with recent financial support from the
Center for Open Science), and we are making efforts to safeguard against such
selection biases. Lastly, our own effort to collate quantitative features of
replications has its own subjectivity and researcher degrees of freedom.

*(Figure 1: the FReD Explorer — automated summary of selected replications and
success rates, with filtering options. Figure 2: the FReD Annotator — checks
reference/reading lists to identify replication studies.)*

### Making replications more visible

Once researchers begin to conduct more replications, the next challenge is ensuring
that replications become a more easily accessible, valued, and normative part of
scientific practice. Key interested parties — researchers, journals, funders, and
policymakers — play critical roles in embedding replication into the research
culture.

The full value of replications can only be realized if they are systematically
incorporated into grant applications, publications, and educational curricula. For
example, educators could include replication studies in their syllabi and let
students conduct their own small-scale replications. Our own bottom-up efforts need
to be reinforced by top-down support from journals and funders: explicit incentives
for replication research, more replication-specific journals, and revised manuscript
evaluation criteria that reduce the emphasis on novelty. Some journals already invite
replications (e.g. *Replication Research*), the *Registered Reports* format reduces
publication bias by reviewing study designs before data collection, and funders like
the Dutch NWO and German DFG offer replication grants. Universities can adapt
curricula to prioritize transparent and robust science (e.g. FORRT's Lesson Plans,
Clusters, and Curated Resources), and communicators should shift away from
"sensational" findings.

### Judging replication results

Replication plays a critical role in ensuring robustness, but it is vital to
acknowledge the complexity behind failed replications, which can arise for many
reasons. Understanding these is essential to a constructive — rather than punitive —
approach. Potential explanations range from questionable research practices and
publication bias to measurement error and the inherent heterogeneity of social and
psychological phenomena.

One significant factor is the historic, widespread issue of low statistical power:
significant results from underpowered studies are more likely to be false positives
and to report inflated effect sizes.
The 'crud factor' — the tendency for almost everything to be weakly correlated —
makes it challenging to distinguish meaningful effects from noise, so large-sample
studies may detect effects lacking real-world significance. Moreover, social,
cognitive, and behavioral effects are not universal and may vary across time,
population, location, or context; heterogeneity can cause genuine effects to fail
under different circumstances without invalidating the original findings. Thus
replication failures can help identify boundary conditions and reveal moderators or
mediators, rather than simply indicating a lack of support for a hypothesis. While
there is no consensus on how to classify replications on a spectrum between
successful and failed, the credibility revolution gives us the chance to drive
reform.
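The link between low power and inflated effect sizes described in this section can be illustrated with a small simulation. The sketch below rests on stated simplifications: two groups with a known standard deviation of 1, a two-sided z-test at the 5% level, and a 'publication filter' that keeps only significant results. The true effect (d = 0.2), the sample sizes, and the function name are illustrative choices, not values from the manuscript.

```python
import math
import random
import statistics

def simulate_significant_effects(true_d, n_per_group, n_studies, seed=1):
    """Simulate two-group studies and return the observed effects of those
    that reach significance (z-test, known SD = 1, two-sided alpha = .05)."""
    rng = random.Random(seed)
    se = math.sqrt(2 / n_per_group)  # SE of the mean difference when SD = 1
    significant = []
    for _ in range(n_studies):
        treatment = [rng.gauss(true_d, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = statistics.fmean(treatment) - statistics.fmean(control)
        if abs(diff) / se > 1.96:
            significant.append(diff)
    return significant

if __name__ == "__main__":
    small = simulate_significant_effects(0.2, n_per_group=20, n_studies=2000)
    large = simulate_significant_effects(0.2, n_per_group=500, n_studies=500)
    print("n=20:  mean significant effect", round(statistics.fmean(small), 2))
    print("n=500: mean significant effect", round(statistics.fmean(large), 2))
```

With 20 participants per group, only estimates far above the true effect clear the significance threshold, so the 'published' average overstates it substantially. With 500 per group, the significant estimates cluster close to the true value. A well-powered replication of a small original study would therefore be expected to find a smaller effect even if the original finding is real.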

### Replications in the future

We propose four key features a scientific ecosystem can adopt to take full advantage
of replication research: (1) findability of replications, (2) widespread adoption of
open science practices, (3) education and training surrounding replications, and (4)
incentivizing replications.

*First*, replication studies should be easy to find. It would be ideal if search
engines could automatically tag replication studies, though this is error-prone and
human, crowd-sourced validation is likely to remain essential to guarantee accuracy
and interpretative nuance — an approach we adopted in developing the *FORRT
Replication Hub*. The hub consolidates human-generated replication projects (the
Replications & Reversals project, FReD, and a handbook for conducting replications)
and includes a dedicated journal, *Replication Research*. Other innovations include
*PubPeer* and tools like Zotero plug-ins and Scite.ai that flag articles with
replication discussions and retraction notices.

*Second*, primary research needs to adopt open science practices across the board:
at a minimum, detailed methods, open materials, open data (when ethically
appropriate), and open analysis code, with preregistration or *Registered Reports*
to clearly label confirmatory vs exploratory analyses. Unfortunately, transparency
is still uncommon, and authors are not very responsive to data requests (of 65
contacted researchers from "available upon request" studies, only 27% actually
shared data). Journals should make transparency the default.

*Third*, researchers should be trained in replication-related methodologies
(equivalence testing, verification of original studies, reproducibility tests,
sample-size planning and power analyses, effect-size and confidence-interval
calculations, preregistration, and replication success criteria). Teaching about
replication research needs to be a major cornerstone of teaching science.
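As one example of the sample-size planning listed above, the following sketch computes the participants needed per group to detect a standardized mean difference in a two-group design. It uses the standard normal-approximation formula rather than an exact t-based calculation, and the function name and default values (two-sided alpha of .05, 80% power) are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size for a two-sided, two-sample
    comparison of means, using the normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

if __name__ == "__main__":
    original_d = 0.5
    # Plan for half the reported effect, in line with the shrinkage
    # observed in large-scale replication projects.
    for d in (original_d, original_d / 2):
        print(d, n_per_group(d))
```

Because the required sample size scales with the inverse square of the effect size, planning for an effect half as large as originally reported roughly quadruples the number of participants needed.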

*Fourth*, replication research needs to be rewarded. Universities and funders should
officially recognize the value of replication studies. Updated journal submission
guidelines could include a Pottery Barn rule ("you break it, you buy it"), which
requires journals to publish replications of studies they previously published (a
policy implemented by *Royal Society Open Science*). As of February 2025, 131
journals have implemented policies supporting replication studies (TOP Factor level
3). A more systematic evaluation process based on cost-benefit analyses could help
determine which studies most urgently need replication.

*(Figure 3: the FORRT Replication Hub. The FORRT tower icon indicates a resource is
available in the Hub; all other projects are currently in development.)*

### Conclusion

Replications are intricate and complex. We recommend that the scientific community
adopts a pluralistic and dynamic approach to replication — one that appreciates the
various reasons why effects may fail to replicate and avoids treating every
replication failure as a definitive refutation. Replications should be valued for
their role in refining theories and improving the cumulative understanding of
scientific phenomena. Initiatives such as the *FORRT Replication Hub* provide a
platform to make replications more visible, accessible, rewarding, and integral to
scientific discourse. Ultimately, replications should not be seen as a final verdict
but as a dynamic part of the scientific process that drives progress through a
continuous and cumulative reassessment of claims and evidence.
