Reply to Lakens: The correctly-used p value needs an effect size and CI
Updated 5/21: fixed a typo, added a section on when p > .05 demonstrates a negligible effect, and added a figure at the end.
Daniel Lakens recently posted a pre-print with a rousing defense of the much-maligned p-value:
In essence, the problems we have with how p-values are used is human factors problem. The challenge is to get researchers to improve the way they work. (Lakens, 2019) (p.8)
I agree! In fact, Geoff and I just wrote nearly the same thing:
The fault, we argue, is not in p-values, but in ourselves. Human cognition is subject to many biases and shortcomings. We believe the NHST approach exacerbates some of these failings, making it more likely that researchers will make overconfident and biased conclusions from data. (Calin-Jageman & Cumming, 2019) (p.271)
So we agree that p values are misused… what should we do about it? Lakens states that the solution will be creating educational materials that help researchers avoid misunderstanding p values:
If anyone seriously believes the misunderstanding of p-values lies at the heart of reproducibility issues in science, why are we not investing more effort to make sure misunderstandings of p-values are resolved before young scholars perform their first research project? Where is the innovative coordinated effort to create world class educational materials that can freely be used in statistical training to prevent such misunderstandings? (p.8-9)
But Lakens seems to think this is all work to be done, and that no one else is putting serious effort towards remedying the misinterpretation of p values. That’s just not the case! The “New Statistics” is an effort to teach statistics in a way that will eliminate misinterpretation of p values. Geoff has put enormous effort into software, visualizations, simulations, tutorials, papers, YouTube videos and more–a whole ecosystem that can help researchers better understand what a p-value does and does not tell them.
Let’s take a look.
Non-significant does not always mean no effect. Probably the most common abuse of p values is concluding that p > .05 means “no effect”. How can you avoid this problem? One way is to reflect on the effect size and its uncertainty and to consider the range of effects that remain compatible with the data. This can be especially easy if we use an estimation plot to draw attention specifically to the effect size and uncertainty.
Here’s a conventionally non-significant finding (p = .08), but one look at the CI makes it pretty easy to see that this does not mean there is no effect. In fact, when this experiment was repeated across many different labs there was a small but reliable effect. (This is data from one site of the Many Labs 2 replication of the “tempting fate” experiment; Klein et al., 2018; Risen & Gilovich, 2008.)
It is possible, though, to prove an effect negligible. So non-significant does not always rule out an effect, but sometimes it can show an effect is negligible. How can you tell if you have an informative non-significant result or a non-informative one? Check out the effect size and confidence interval. If the confidence interval is narrow, all within a range that we’d agree has no practical significance, then that non-significant result indicates the effect is negligible. If the CI is wide (as in the tempting fate example above), then there is a range of compatible effect sizes remaining and we can’t yet conclude that the effect is negligible.
For example, in Many Labs 2 the font disfluency effect was tested–this is the idea that a hard-to-read font can activate analytic System 2 style thinking and therefore improve performance on tests of analytic thinking. Whereas the original study found a large improvement in syllogistic reasoning (d = 0.64), the replications found d = -0.03 95% CI[-0.08, 0.01], p = .43 (Klein et al., 2018). This is non-significant but that doesn’t mean we just throw up our hands and discount it as meaningless. The CI is narrow and all within a range of very small effects. Most of the CI is in the negative direction…exactly what you might expect for a hard-to-read font. But we can’t rule out the positive effect that was predicted–effect sizes around d = 0.01 remain compatible with the data (and probably a bit beyond–remember, the selection of a 95% CI and therefore its boundary is arbitrary). Still, it’s hard to imagine effect sizes in this range being meaningful or practical for further study. So, yes – sometimes p > .05 accompanies a result that is really meaningful and which fairly clearly establishes a negligible effect. The way to tell when a non-significant result is or is not very informative is to check the CI. You can also do a formal test for a negligible effect, called an equivalence test. Lakens has great materials on how to do this–though I think it helps a lot to see and understand this “by eye” first.
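The “by eye” check can also be written down as a few lines of Python. This is only a sketch: the function name `classify_ci` and the bound of d = 0.10 for a “negligible” effect are my assumptions for illustration, not values from Many Labs 2 or from Lakens. In practice the bound has to come from domain knowledge. Note also that a formal equivalence test at alpha = .05 is usually run against a 90% CI, so checking the 95% CI as done here is, if anything, a bit more conservative.

```python
def classify_ci(ci_low, ci_high, sesoi):
    """Compare a confidence interval with a symmetric 'negligible' zone.

    sesoi is a hypothetical smallest effect size of interest: effects
    between -sesoi and +sesoi are treated as too small to matter.
    """
    if sesoi <= 0 or ci_low > ci_high:
        raise ValueError("need sesoi > 0 and ci_low <= ci_high")
    if -sesoi < ci_low and ci_high < sesoi:
        return "negligible"      # whole CI sits inside the zone
    if ci_high < -sesoi or ci_low > sesoi:
        return "meaningful"      # whole CI sits outside the zone
    return "inconclusive"        # CI straddles a boundary: more data needed


# Font disfluency CI quoted above; the 0.10 bound is an assumption.
print(classify_ci(-0.08, 0.01, sesoi=0.10))   # negligible
# A made-up wide interval, for contrast.
print(classify_ci(-0.05, 0.60, sesoi=0.10))   # inconclusive
```

The three-way outcome is the point: a non-significant result can land in either “negligible” or “inconclusive”, and only the interval tells you which.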
Significant does not mean meaningful. Another common misconception is the conflation of statistical and practical significance. How can you avoid this error? After obtaining the p value, graph the effect size and CI. Then reflect on the meaningfulness of the range of compatible effect sizes. This is even more useful if you specify in advance what you consider to be a meaningful effect size relative to your theory.
Here’s an example from a meta-analysis I recently published. The graph shows the estimated standardized effect size (and 95% CI) for the effect of red on perceived attractiveness for female raters. In numbers, the estimate is: d = 0.13 95% CI [0.01, 0.25], p = .03, N = 2,739. So this finding is statistically significant, but we then need to consider if it is meaningful. To do that, we should countenance the whole range of the CI (and probably beyond as well)… what would it mean if the real effect is d = .01? d = .25? For d = .01 that would pretty clearly be a non-meaningful finding, far too small to be perceptually noticeable for anyone wearing red. For d = .25…well, perhaps a modest edge is all you need when you’re being scanned by hundreds on Tinder. Interpreting what is and is not meaningful is difficult and takes domain-specific knowledge. What is clear is that looking at the effect size and CI helps focus attention on practical significance as a question distinct from statistical significance.
Nominal differences in statistical significance are not valid for judging an interaction. Another alluring pitfall of p-values is assuming that nominal differences in statistical significance are meaningful. That is, it’s easy to assume that if condition A is statistically significant and condition B is not, that A and B must themselves be different. Unfortunately, the transitive property does not apply to statistical significance. As Gelman and many others have pointed out, the difference between statistically significant and not significant is not, itself, statistically significant (Gelman & Stern, 2006; Nieuwenhuis, Forstmann, & Wagenmakers, 2011). That’s quite a mantra to remember, but one easy way to avoid this problem is to look at an estimation plot of the effect sizes and confidence intervals. You can see the evidence for an interaction ‘by eye’.
Here’s an example from a famous paper by Kosfeld et al. where they found that intranasal oxytocin increases human trust in a trust-related game but not in a similar game involving only risk (Kosfeld, Heinrichs, Zak, Fischbacher, & Fehr, 2005). As shown in A, the effect of oxytocin was statistically significant for the trust game, but not for the risk game. The paper concluded that the effect was specific to trust contexts, was published in Nature and has been cited over 3,000 times…apparently without anyone noticing that a formal test for a context interaction was not reported and would not have been statistically significant. In B, we see the estimation plots from both contexts. You can immediately see that there is huge overlap in the compatible effect sizes in both contexts, so the evidence for an interaction is weak (running the interaction gives p = .23). That’s some good human factors work–letting our eyes immediately tell us what would be hard to discern from the group-level p values alone.
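To see how this trap works in numbers, here is a small Python sketch. The two effects and their standard errors are made up for illustration (they are not the Kosfeld et al. data), and the test uses a simple normal approximation for two independent estimates. The function names are mine.

```python
import math
from statistics import NormalDist


def two_sided_p(estimate, se):
    """Two-sided p value for H0: true value = 0 (normal approximation)."""
    z = abs(estimate / se)
    return 2 * (1 - NormalDist().cdf(z))


def difference_between_effects(est_a, se_a, est_b, se_b):
    """Estimate A - B for two independent effects, with 95% CI and p value."""
    diff = est_a - est_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)   # independent estimates
    half_width = 1.96 * se_diff
    return diff, (diff - half_width, diff + half_width), two_sided_p(diff, se_diff)


# Hypothetical effects: A is "significant", B is not.
p_a = two_sided_p(0.50, 0.20)
p_b = two_sided_p(0.20, 0.20)
diff, ci, p_diff = difference_between_effects(0.50, 0.20, 0.20, 0.20)

print(f"A: p = {p_a:.3f}   B: p = {p_b:.3f}")
print(f"A - B = {diff:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}], p = {p_diff:.3f}")
```

With these made-up numbers A clears p < .05 and B does not, yet the CI on the difference between them comfortably includes zero. The comparison that matters is the last line, not the first.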
If I have obtained statistical significance I have collected enough data. To my mind this is the most pernicious misconception about p-values–that obtaining significance proves you have collected enough data. I hear this a lot in neuroscience. This misconception causes people to think that run-and-check is the right way to obtain a sample size, and it also licenses uncritical copying of previous sample sizes. Overall, I believe this misconception is responsible for the pervasive use of inadequate sample sizes in the life and health sciences. This, in combination with publication bias, can easily produce whole literatures that are nothing more than noise (https://www.theatlantic.com/science/archive/2019/05/waste-1000-studies/589684/). Yikes.
How can you avoid this error? After obtaining your p-value, look at the effect size and an expression of uncertainty. Look at the range of compatible results–if the range is exceedingly long relative to your theoretical interests then your result is uninformative and you have not collected enough data. You actually can use run-and-check to keep checking on your margin of error and running until it is sufficiently narrow to enable a reasonable judgement from your data. How cool is that?
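Here is a rough Python sketch of what run-and-check on the margin of error could look like for a two-group design. Everything specific here is an assumption for illustration: the simulated populations, the batch size of 10 per group, the target margin of error of 0.2, and the use of a normal (z = 1.96) approximation rather than a t-based interval. The stopping rule looks only at the width of the interval, never at whether it excludes zero.

```python
import math
import random
import statistics


def margin_of_error(group_a, group_b, z=1.96):
    """Approximate 95% margin of error for the difference in means."""
    se = math.sqrt(statistics.variance(group_a) / len(group_a)
                   + statistics.variance(group_b) / len(group_b))
    return z * se


def run_and_check(draw_a, draw_b, target_moe, batch=10, max_per_group=5000):
    """Add participants in batches until the margin of error is small enough."""
    a, b = [], []
    while len(a) < max_per_group:
        a.extend(draw_a() for _ in range(batch))
        b.extend(draw_b() for _ in range(batch))
        moe = margin_of_error(a, b)
        if moe <= target_moe:          # stop on precision, not on p
            break
    diff = statistics.fmean(a) - statistics.fmean(b)
    return len(a), diff, moe


# Simulated stand-ins for real data collection (hypothetical populations).
rng = random.Random(1)
n, diff, moe = run_and_check(lambda: rng.gauss(0.3, 1.0),
                             lambda: rng.gauss(0.0, 1.0),
                             target_moe=0.2)
print(f"stopped at n = {n} per group: difference = {diff:.2f} +/- {moe:.2f}")
```

With a standard deviation of 1 in each group, a margin of error of 0.2 takes on the order of 200 participants per group, which is a useful reality check on how much data a precise estimate actually requires.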
Here’s an example. This is data from a paper in Nature Neuroscience that found a statistically significant effect of caffeine on memory (Borota et al., 2014). Was this really enough data? Well, the CI is very long relative to the theoretical interests–it stretches from implausibly large impacts on memory to vanishingly small impacts on memory (in terms of % improvement the finding is: 31% 95% CI[0.2%, 62%]). So the data establishes a range of compatible effect sizes that is somewhere between 0 and ludicrous. That’s not very specific/informative, and that’s a clear sign you need more data. I corresponded with this lab extensively about this… but in all their subsequent work examining effects of different interventions on memory they’ve continued to use sample sizes that provide very little information about the true effect. Somehow, though, they’ve yet to publish a negative result (that I know of). Some people have all the luck.
p-values do not provide good guidance of what to expect in the future. If you find a statistically significant finding, you might expect that someone else repeating the same experiment with the same sample size will also find a statistically significant finding (or at least be highly likely to). Unfortunately, this is wrong–even if your original study had 0 sampling error (the sample means were exactly what they are in the population) the odds of a statistically significant replication may not be very high. Specifically, if your original study has a sample size that is uninformative (aka low powered), odds of replication might be stunningly small. Still, this is a seductive illusion that even well-trained statisticians fall into (Gigerenzer, 2018).
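To put a rough number on “may not be very high”, here is a Python sketch of the chance that a same-sized replication reaches p < .05 when the true effect is exactly what the original study observed. The scenario (d = 0.5, two independent groups) is hypothetical, and the calculation uses a normal approximation, so an exact t-test power calculation would come out slightly lower.

```python
import math
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test for true effect d."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n_per_group / 2)   # expected z statistic
    # Probability the observed z lands beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


for n in (20, 64, 200):
    print(f"true d = 0.5, n = {n} per group: "
          f"chance of a significant replication ~ {approx_power(0.5, n):.2f}")
```

Under these assumptions, a study with 20 per group that nailed the true effect would still see a significant replication only about a third of the time.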
How can you avoid this error? Well, one way is to work through simulations to get a better sense of the sampling distribution of p under different scenarios. Geoff has great simulations to explore; here’s a video of him walking through one of these: https://www.youtube.com/watch?v=OcJImS16jR4. Another strategy is to consider the effect size and a measure of uncertainty, because these actually do provide you with a better sense of what would happen in future replications. Specifically, CIs provide prediction intervals. For a 95% CI for an original finding, there is ~83% chance that a same-sized replication result will fall within the original CI (Cumming, 2008). Why only 83%? Because both the original and the replication study are subject to sampling error. Still, this is useful information about what to expect in a replication study–it can help your lab think critically about whether you have a protocol that is likely to be reliable and can properly calibrate your sense of surprise over a series of replications. To my mind, having accurate expectations about the future is critical for fruitful science, so it is well worth some effort to train yourself out of the misconception that p values alone can give you much information about the future.
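The ~83% figure is easy to check for yourself. The Python sketch below is a simplification rather than Cumming’s own derivation: it assumes a known standard error and normally distributed estimates, and it simulates the original and replication estimates directly instead of raw data. Under those assumptions the difference between the two estimates has a standard deviation of SE × √2, which is where the 83% comes from.

```python
import math
import random
from statistics import NormalDist


def capture_rate_theory(level=0.95):
    """Chance a same-sized replication estimate lands inside the original CI."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    # original - replication has SD = se * sqrt(2), so the CI half-width
    # is only z_crit / sqrt(2) of those standard deviations.
    return 2 * NormalDist().cdf(z_crit / math.sqrt(2)) - 1


def capture_rate_simulated(n_sims=20000, se=1.0, level=0.95, seed=2019):
    """Simulate pairs of studies and count how often the CI captures."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    hits = 0
    for _ in range(n_sims):
        original = rng.gauss(0.0, se)       # original study's estimate
        replication = rng.gauss(0.0, se)    # same-sized replication
        if abs(replication - original) <= z_crit * se:
            hits += 1
    return hits / n_sims


print(f"theory:     {capture_rate_theory():.3f}")
print(f"simulation: {capture_rate_simulated():.3f}")
```

Both numbers should come out close to 0.83, and changing `level` shows how the capture rate moves with the width of the original interval.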
If you want to use a p value well, always report and interpret its corresponding effect size and CI. If you’re following along at home, you’ve probably noticed a pattern: to avoid misinterpretation of p values, reflect critically on the effect size and CI that accompany it. If you do this, it will be much harder to go wrong. If you keep at it for a while, you may begin to notice that the p value step isn’t really doing much for you. I mean, it does tell you the probability of obtaining your data under a specific null hypothesis. But you could also plot the expected distribution of sampling error and get that sense for all null hypotheses, which is pretty useful. And you may be interested in Lakens’s excellent suggestions to not always use a point null and to use equivalence testing. But both of these are applied by considering the effect size and its CI. So you may find that p values become kind of vestigial to the science you’re doing. Or not. Whatever. I’m not going to fall into the “Statistician’s Fallacy” of trying to tell you what you want to know. But I can tell you this: If p values tell you something you want to know, then the best way to avoid some of the cognitive traps surrounding p values is to carefully consider effect sizes and CIs.
There’s even more. This short tour is just a taste of work Geoff and I (mostly Geoff) have done to try to provide tools and tutorials to help people avoid misinterpretation of p values. We have an undergrad textbook that helps students learn what p values are and are not (we include several “red flags” to consider every time p values are reported). And there’s ESCI, which will give you p values if you need them, but always along with helpful visualizations and outputs that we think will nudge you away from erroneous interpretation. And pedagogically, I’m convinced that teaching estimation first is the best way to teach statistics–it is more intuitive, it is more fun, and it helps students develop a much stronger sense of what a p value is (and is not). With all due respect to the wonderful memorization drills Lakens recounts in his pre-print, chanting mantras about p values is not going to avoid mindless statistics–but a teaching approach that takes a correct developmental sequence could.
So how do we disagree? So, I agree with Lakens that the problems of p values are in how they are used, and I agree with Lakens that we need a whole ecosystem to help people avoid using them incorrectly. Where we have a puzzling disagreement then, is that I see that work as largely complete and ready for use. That is, I see the “New Statistics” specifically as a way of teaching and thinking about statistics that will help make the errors common to p values more rare (but will probably make other types of errors more common; no system is perfect). Is the disagreement all down to labels and/or misconceptions? Stay tuned.
Oh and one other thing. Lakens complains that SPSS does not report an effect size for a t-test. But it does! At least as long as I’ve been using it, it provides an effect size and CI. It does not provide standardized effect sizes, but raw score? You bet.
- Borota, D., Murray, E., Keceli, G., Chang, A., Watabe, J. M., Ly, M., … Yassa, M. A. (2014). Post-study caffeine administration enhances memory consolidation in humans. Nature Neuroscience, 201–203. doi: 10.1038/nn.3623
- Calin-Jageman, R. J., & Cumming, G. (2019). The New Statistics for Better Science: Ask How Much, How Uncertain, and What Else Is Known. The American Statistician, 271–280. doi: 10.1080/00031305.2018.1518266
- Cumming, G. (2008). Replication and p Intervals: p Values Predict the Future Only Vaguely, but Confidence Intervals Do Much Better. Perspectives on Psychological Science, 286–300. doi: 10.1111/j.1745-6924.2008.00079.x
- Gelman, A., & Stern, H. (2006). The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant. The American Statistician, 328–331. doi: 10.1198/000313006X152649
- Gigerenzer, G. (2018). Statistical Rituals: The Replication Delusion and How We Got There. Advances in Methods and Practices in Psychological Science, 198–218. doi: 10.1177/2515245918771329
- Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams, R. B., Jr., Alper, S., … Nosek, B. A. (2018). Many Labs 2: Investigating Variation in Replicability Across Samples and Settings. Advances in Methods and Practices in Psychological Science, 443–490. doi: 10.1177/2515245918810225
- Kosfeld, M., Heinrichs, M., Zak, P. J., Fischbacher, U., & Fehr, E. (2005). Oxytocin increases trust in humans. Nature, 673–676. doi: 10.1038/nature03701
- Lakens, D. (2019). The practical alternative to the p-value is the correctly used p-value. doi: 10.31234/osf.io/shm8v
- Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: a problem of significance. Nature Neuroscience, 1105–1107. doi: 10.1038/nn.2886
- Risen, J. L., & Gilovich, T. (2008). Why people are reluctant to tempt fate. Journal of Personality and Social Psychology, 293–307. doi: 10.1037/0022-3514.95.2.293