Replications: How Should We Analyze the Results?
Does This Effect Replicate?
It seems almost irresistible to think in terms of such a dichotomous question! We seem to crave an ‘it-did’ or ‘it-didn’t’ answer! However, rarely if ever is a bald yes-no decision the most informative way to think about replication.
One of the first large studies in psychology to grapple with the analysis of replications was the classic RP:P (Reproducibility Project: Psychology) reported by Nosek and many colleagues in Open Science Collaboration (2015). The project identified 100 published studies in social and cognitive psychology, then conducted a high-powered preregistered replication of each, trying hard to make each replication as close as practical to the original.
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716-1 to aac4716-8.
The authors discussed the analysis challenges they faced. They reported 5 assessments of the 100 replications:
- in terms of statistical significance (p<.05) of the replication
- in terms of whether the 95% CI in the replication included the point estimate in the original study
- comparison of the original and replication effect sizes
- meta-analytic combination of the original and replication effect sizes
- subjective assessment by the researchers of each replication study: “Did the effect replicate?”
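To make the first two assessments concrete, here is a minimal Python sketch that applies them to a single original-replication pair. The function names and the numbers are hypothetical, chosen only for illustration (they are not RP:P data), and the sketch assumes a normal approximation for the replication estimate.

```python
import math

def two_sided_p(z):
    """Two-sided p value for a z statistic, using the normal distribution."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def replication_checks(orig_est, rep_est, rep_se, z_crit=1.96):
    """Apply two dichotomous criteria to one original-replication pair."""
    p = two_sided_p(rep_est / rep_se)
    lo = rep_est - z_crit * rep_se
    hi = rep_est + z_crit * rep_se
    return {
        "replication_significant": p < 0.05,       # criterion 1
        "ci_includes_original": lo <= orig_est <= hi,  # criterion 2
        "replication_ci": (lo, hi),
    }

# Hypothetical values: original d = 0.50; replication d = 0.20, SE = 0.10
result = replication_checks(orig_est=0.50, rep_est=0.20, rep_se=0.10)
print(result)
```

With these made-up numbers the replication is statistically significant, yet its 95% CI does not include the original point estimate, so the two criteria give opposite yes-no answers for the same pair of studies.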
Several further approaches to analyzing the 100 pairs of studies have since been published.
Even so, the one-liner that appeared in the media and swept the consciousness of psychologists was that RP:P found that fewer than half of the effects replicated: Dichotomous classification of replications rules! For me, the telling overall finding was that the replication effect sizes were, on average, just half the original effect sizes, with large spreads over the 100 effects. This strongly suggests that reporting bias, p-hacking, or some other selection bias influenced some unknown proportion of the 100 original articles.
OK, how should replications be analyzed? Happily, there has been progress.
Meta-Analytic Approaches to Assessing Replications
Larry Hedges, one of the giants of meta-analysis since the 1980s, and Jacob Schauer recently published a discussion of meta-analytic approaches to analyzing replications:
Hedges, L. V., & Schauer, J. M. (2019). Statistical analyses for studying replication: Meta-analytic perspectives. Psychological Methods, 24, 557-570. http://dx.doi.org/10.1037/met0000189
Formal empirical assessments of replication have recently become more prominent in several areas of science, including psychology. These assessments have used different statistical approaches to determine if a finding has been replicated. The purpose of this article is to provide several alternative conceptual frameworks that lead to different statistical analyses to test hypotheses about replication. All of these analyses are based on statistical methods used in meta-analysis. The differences among the methods described involve whether the burden of proof is placed on replication or nonreplication, whether replication is exact or allows for a small amount of “negligible heterogeneity,” and whether the studies observed are assumed to be fixed (constituting the entire body of relevant evidence) or are a sample from a universe of possibly relevant studies. The statistical power of each of these tests is computed and shown to be low in many cases, raising issues of the interpretability of tests for replication.
The discussion is a bit complex, but here are some issues that struck me:
- We usually wouldn’t expect the underlying effect to be identical in any two studies. What small difference would we regard as not of practical importance? There are conventions, differing across disciplines, but it’s a matter for informed judgment. In other words, how different could underlying effects be, while still justifying a conclusion of successful replication?
- Should we choose fixed-effect or random-effects models? A random-effects model is usually more realistic, and is what ITNS and many other books recommend for routine use. However, H&S use fixed-effect models throughout, to limit the complexity. They report that, for modest amounts of heterogeneity, their results do not differ greatly from what random-effects models would give.
- Meta-analysis and effect size estimation are the focus throughout, but even so the main aim is to carry out a hypothesis test. The researcher needs to choose whether to place the burden of proof on nonreplication or replication. In other words, is the null hypothesis that the effect replicates, or that it doesn’t?
- One main conclusion from the H&S discussion and the application of their methods to psychology examples is that replication projects typically need many studies (often 40+) to achieve adequate power for those hypothesis tests, and that even large psychology examples are under-powered.
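As a generic illustration of the kind of quantity such tests work with, the sketch below computes Cochran's Q, the standard meta-analytic heterogeneity statistic, under a fixed-effect model. It is not a reproduction of H&S's specific procedures; the function name and the data are hypothetical.

```python
def cochran_q(estimates, variances):
    """Return (Q, df) for a set of study estimates and their sampling variances."""
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    mean = sum(w * t for w, t in zip(weights, estimates)) / sum(weights)
    q = sum(w * (t - mean) ** 2 for w, t in zip(weights, estimates))
    return q, len(estimates) - 1

# Hypothetical effect sizes from three studies, each with sampling variance 0.04
q, df = cochran_q([0.1, 0.4, 0.7], [0.04, 0.04, 0.04])
print(q, df)
```

If the underlying effects were identical and the variances known, Q would be approximately chi-square distributed with k - 1 degrees of freedom, so values of Q well above df hint at heterogeneity. With only a handful of studies that comparison has little power, which is in line with the concern about under-powered tests.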
A further sign that the issues are complex is the comment published immediately following H&S with suggestions for an alternative way to think about heterogeneity and replication:
Mathur, M. B., & VanderWeele, T. J. (2019). Challenges and suggestions for defining replication “success” when effects may be heterogeneous: Comment on Hedges and Schauer (2019). Psychological Methods, 24, 571-575. http://dx.doi.org/10.1037/met0000223
H&S gave a brief reply:
Hedges, L. V., & Schauer, J. M. (2019). Consistency of effects is important in replication: Rejoinder to Mathur and VanderWeele (2019). Psychological Methods, 24, 576-577. http://dx.doi.org/10.1037/met0000237
Our Simpler Approach
I welcome the above three articles and would study them in detail before setting out to design or analyze a large replication project.
In the meantime I’m happy to stick with the simpler estimation and meta-analysis approach of ITNS.
Meta-Analysis to Increase Precision
Given two or more studies that you judge to be sufficiently comparable, in particular by addressing more-or-less the same research question, use random-effects meta-analysis to combine the estimates given by the studies. Almost certainly, you’ll obtain a more precise estimate of the effect most relevant for answering your research question.
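A minimal sketch of such a combination follows. It assumes the DerSimonian-Laird estimator of between-study variance, which is one common choice and not necessarily what any particular software uses; the function name and the data are hypothetical.

```python
import math

def random_effects_meta(estimates, variances):
    """DerSimonian-Laird random-effects meta-analysis.

    Needs at least two studies. Returns (pooled mean, its SE, tau squared).
    """
    k = len(estimates)
    w = [1.0 / v for v in variances]  # fixed-effect weights
    fixed_mean = sum(wi * ti for wi, ti in zip(w, estimates)) / sum(w)
    q = sum(wi * (ti - fixed_mean) ** 2 for wi, ti in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-study variance, floored at 0
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    mean = sum(wi * ti for wi, ti in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return mean, se, tau2

# Hypothetical data: three studies, each with SE = 0.20 (variance 0.04)
mean, se, tau2 = random_effects_meta([0.1, 0.4, 0.7], [0.04, 0.04, 0.04])
print(mean, se, tau2)
print("95% CI:", mean - 1.96 * se, mean + 1.96 * se)
```

In this made-up example the pooled SE is smaller than the SE of 0.20 for any single study, even after allowing for heterogeneity, which is the gain in precision described above.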
Estimating a Difference
If you have an original study and a set of replication studies, you could consider (1) meta-analysis to combine evidence from the replication studies, then (2) finding the difference (with CI of course) between the point estimate found by the original study and that given by the meta-analysis. Interpret that difference and CI as you assess the extent to which the replication studies may or may not agree with the original study.
If the original study was possibly subject to publication or other bias, and the replication studies were all preregistered and conducted in accord with Open Science principles, then a substantial difference would provide evidence for such biases in the original study, although other causes couldn’t be ruled out.
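Step (2) can be sketched as below, assuming the meta-analytic estimate of the replications and its SE are already in hand, that the original study is independent of the replications, and that a normal approximation is adequate. The function name and all numbers are hypothetical.

```python
import math

def difference_with_ci(orig_est, orig_se, meta_est, meta_se, z_crit=1.96):
    """Difference (original minus replication meta-analysis) with its 95% CI.

    Assumes the two estimates are independent, so their variances add.
    """
    diff = orig_est - meta_est
    se_diff = math.sqrt(orig_se ** 2 + meta_se ** 2)
    return diff, (diff - z_crit * se_diff, diff + z_crit * se_diff)

# Hypothetical values: original d = 0.60 (SE 0.20); replications pooled d = 0.20 (SE 0.10)
diff, (lo, hi) = difference_with_ci(0.60, 0.20, 0.20, 0.10)
print(diff, lo, hi)
```

The CI on the difference is what gets interpreted: here it is wide and just includes zero, so these made-up data are compatible with anything from no disagreement to a large one.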
Following meta-analysis, consider moderation analysis, especially if DR, the diamond ratio, is more than around 1.3 and if you can identify a likely moderating variable. Below is our example from ITNS in which we assess 6 original studies (in red) and Bob’s two preregistered replications (blue). Lab (red vs. blue) is a possible dichotomous moderator. The difference and its CI suggest publication or other bias may have influenced the red results, although other moderators (perhaps that red studies were conducted in Germany, blue in the U.S.) may help account for the red-blue difference.
Figure 9.8. A subsets analysis of 6 Damisch and 2 Calin studies. The difference between the two subset means is shown on the difference axis at the bottom, with its CI. From d subsets.
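The diamond ratio mentioned above is the length of the random-effects diamond (CI) divided by the length of the fixed-effect diamond. A minimal sketch follows, again assuming the DerSimonian-Laird estimate of between-study variance; the function name and data are hypothetical and are not the studies shown in the figure.

```python
import math

def diamond_ratio(estimates, variances):
    """Ratio of random-effects to fixed-effect CI length (needs 2+ studies)."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed_mean = sum(wi * ti for wi, ti in zip(w, estimates)) / sw
    q = sum(wi * (ti - fixed_mean) ** 2 for wi, ti in zip(w, estimates))
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - (len(estimates) - 1)) / c)  # DerSimonian-Laird
    se_fixed = math.sqrt(1.0 / sw)
    se_random = math.sqrt(1.0 / sum(1.0 / (v + tau2) for v in variances))
    # Both CIs use the same critical value, so the length ratio is the SE ratio
    return se_random / se_fixed

# Hypothetical data: spread-out estimates vs. identical estimates
print(diamond_ratio([0.1, 0.4, 0.7], [0.04, 0.04, 0.04]))
print(diamond_ratio([0.4, 0.4, 0.4], [0.04, 0.04, 0.04]))
```

A ratio of 1 means the two models agree; in the spread-out example the ratio exceeds the rough 1.3 guideline, which would prompt a search for a moderator.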
My overall conclusions are (1) it’s great to see that meta-analytic techniques continue to develop, and (2) our new-statistics approach in ITNS continues to look attractive.