Adventures in Replication: p values and Illusions of Incompatibility
Here’s an idea I run into a lot in peer reviews of replication studies:
If the original study found p < .05 but the replication found p > .05, then the results are incompatible and additional research is needed to explain the difference.
Poor p values. I’m sure they want to be able to tell us when results are incompatible, but they just can’t. The little beasts are too erratic (Cumming, 2008). And just because two p values live on opposite sides of alpha doesn’t mean the results that brought them into this world are notably different (Gelman & Stern, 2006). It’s seductive, but if you compare p values to check consistency, you will end up with illusions of incompatibility.
Here’s an example which I hope makes the issue very clear. The original study found p < .05. Three subsequent replications found p > .05. Comparing statistical significance (falsely) suggests the results are incompatible and that we need to start thinking about what caused the difference. Right? Wrong, as you can see in this forest plot:
The original study isn’t notably incompatible with the replication results because the original study is incredibly uninformative. Even though p < .05 (the CI does exclude the null hypothesis of 0), the CI suggests anywhere from a very large effect down to an incredibly small effect. The replication results all suggest an incredibly small effect. That’s no contradiction, just disappointment! Trying to “explain the difference” is a fool’s errand–the differences in results are not clearly more than would be expected from sampling error.
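To make the logic concrete, here is a minimal Python sketch of comparing two studies directly rather than comparing their p values. The effect sizes and standard errors are hypothetical numbers chosen for illustration, not the data from my studies, and the sketch assumes independent estimates with a normal approximation.

```python
import math


def two_sided_p(z):
    """Two-sided p value for a z statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


def summarize(estimate, se, z_crit=1.96):
    """Return (p value, 95% CI lower bound, 95% CI upper bound)."""
    p = two_sided_p(estimate / se)
    return p, estimate - z_crit * se, estimate + z_crit * se


# Hypothetical values (assumptions for illustration only):
# a small, noisy original study and a larger, more precise replication.
orig_d, orig_se = 0.50, 0.24
rep_d, rep_se = 0.05, 0.10

orig_p, orig_lo, orig_hi = summarize(orig_d, orig_se)
rep_p, rep_lo, rep_hi = summarize(rep_d, rep_se)

# Compare the studies directly: estimate the difference between them.
# For independent estimates, the variances add.
diff = orig_d - rep_d
diff_se = math.sqrt(orig_se**2 + rep_se**2)
diff_p, diff_lo, diff_hi = summarize(diff, diff_se)

print(f"original:    p = {orig_p:.3f}, 95% CI [{orig_lo:.2f}, {orig_hi:.2f}]")
print(f"replication: p = {rep_p:.3f}, 95% CI [{rep_lo:.2f}, {rep_hi:.2f}]")
print(f"difference:  p = {diff_p:.3f}, 95% CI [{diff_lo:.2f}, {diff_hi:.2f}]")
```

With these made-up numbers the original lands below .05 and the replication above it, yet the CI for the difference between the two still includes 0. The p values sit on opposite sides of alpha, but the estimates themselves are not clearly incompatible.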
Unfortunately, most of the reviewers treat differences in statistical significance as a reliable indication of incompatibility. Here’s an extreme example:
First, I would like to understand how one can continue to assume that a replication experiment that leads to significantly different results can still be called a “direct” or “precise” replication in all respects. If you really believe in the informative value of significance tests – which is obviously at the heart of such replication work – should you then not resort to the alternative assumption that original and replication experiments must have been not equivalent? Was this a miracle, or is it simply an index of changing (though perhaps subtle) boundary conditions?
Want your head to explode? The forest plot above presents (most of) the data the reviewer was writing about! The reviewer is so ardently misled by p values that they believe there simply must be a substantive difference between the sets of studies. This is really amazing because this set of studies is the most precisely direct set of replications I’ve been able to complete. The original studies were all done online, and I was able to obtain the same exact survey and use it with the same exact population within 1 year of the original studies being done. So strong is this reviewer’s confidence in the ability of p values to do what they cannot, that they wag the whole dog by the tail.
- I’m not hating on this reviewer. I’d have made the same mistake 5 years ago… maybe even 2 years ago. It takes a methodological awakening. I wish, though, that prominent journals that pretend to welcome replications would select reviewers with just a bit less over-confidence in p values.
- I didn’t include the forest plot in the paper, just tables with CIs. Maybe the figure would have helped.
- There’s a bit more to this paper. I replicated 4 different experimental protocols (and 2 completely worked!). This reviewer was writing about the 2 protocols where replications showed essentially no effect. So it was really 2 forest plots, but both with the same pattern.
- Yes, the reviewer apparently thinks that 2 significant results (with 2 different protocols) not replicating would require a miracle.