A Cliff at p=.05? Recent Evidence Suggests Yes
- Two studies provide yet more evidence that researchers interpret results more poorly when they focus on statistical significance testing and its .05 threshold.
- Statisticians make similar errors, although in some situations to a lesser extent than do other researchers.
- Both statisticians and epidemiologists show, in some situations, clear evidence of a steep cliff between p values of .025 and .075. Researchers in psychology and economics show evidence of a large cliff between p values of .01 and .26.
- Students without statistical training in some situations make better judgments than do students trained in statistical significance testing. 🙁
Folks have traditionally rejected H0, or not, based on a hard p = .05 criterion. Does this create a sharp cliff in ratings of strength of evidence, or in confidence that an effect exists, as p moves from .04 to .06? Or as a result moves from just outside to just inside a 95% CI? In this recent post I discussed the original 1963 work by Rosenthal and Gaito that suggested a somewhat steep cliff, as well as more recent work. I concluded that there are large individual differences in how researchers rate different values of p (or the CI equivalents), and that average results typically show only a small-to-moderate cliff effect.
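To make concrete how little separates p = .04 from p = .06, here's a minimal sketch. All the numbers are my own illustrative choices, not from any study, and the p values use a simple normal approximation via `math.erf`: two hypothetical results with the same mean difference and only slightly different standard errors land on opposite sides of .05.

```python
import math

def two_sided_p(z):
    """Two-sided p value for a z statistic, via the standard normal CDF."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical studies: identical mean difference, slightly different SEs.
# The SEs are contrived so the p values straddle the .05 threshold.
for label, diff, se in [("Study A", 2.0, 0.974), ("Study B", 2.0, 1.063)]:
    z = diff / se
    print(f"{label}: z = {z:.3f}, p = {two_sided_p(z):.3f}")
```

Study A comes out at about p = .040 and Study B at about p = .060, yet the evidence they provide is almost indistinguishable; treating one as "significant" and the other as "not" is exactly the dichotomization at issue.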
Blake McShane (see this site of his for links to relevant pdfs) has kindly let me know of two articles of his, with David Gal, that report two studies providing evidence relevant to the cliff effect. They studied p values, but not CI figures.
From the abstract:
… We also present new data showing, perhaps surprisingly, that researchers who are primarily statisticians are also prone to misuse and misinterpret p values thus resulting in similar errors. In particular, we show that statisticians tend to interpret evidence dichotomously based on whether or not a p value crosses the conventional 0.05 threshold for statistical significance. …
Additional results are reported in the following article, which has a particularly nice title, imho:
McShane, B.B., and Gal, D. (2016), “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence.” Management Science, 62(6), 1707-1718. doi: 10.1287/mnsc.2015.2212.
Here’s my brief take on the two studies:
Participants were (i) authors published in JASA, a top statistics journal; (ii) authors published in NEJM, a top medical journal; and (iii) members of the editorial board of Psychological Science. They saw the results of a fictitious two-group study and were asked a descriptive question about the difference between the means of the two particular groups of participants. Each saw two versions of the results, identical except for the p value. In response to the p = .01 version, around 85% of respondents in all three groups answered appropriately. In response to the p = .27 version, only around 48% of the statisticians, and around 17% of the medical and psychological researchers, gave an appropriate response. The respondents’ explanations indicated that inappropriate reliance on statistical significance testing, especially when p = .27, guided very many of the responses.
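As a rough illustration of the vignette's logic, one can contrive two versions of the same observed difference whose p values land near .01 and .27. The means and standard errors below are invented purely for illustration (they are not the study's materials), and the p values again use a normal approximation. The point is that the descriptive answer, which group's sample mean was higher, is identical in both versions, whatever the p value.

```python
import math

def two_sided_p(z):
    # Two-sided p value for a z statistic, via the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical versions of the vignette: identical observed means,
# standard errors contrived so the p values land near .01 and .27.
mean_a, mean_b = 52.0, 44.0  # fictitious group means
for label, se in [("version 1", 3.1), ("version 2", 7.3)]:
    p = two_sided_p((mean_a - mean_b) / se)
    # The descriptive question (which group's mean was higher in this
    # sample?) has the same answer in both versions.
    higher = "A" if mean_a > mean_b else "B"
    print(f"{label}: p = {p:.2f}, higher sample mean: group {higher}")
```

A respondent swayed by the p = .27 label into denying the observed difference is answering a descriptive question with an inferential ritual.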
In the Supplementary Materials to the ‘Blinding us to the obvious’ article, results are reported of a version of the study that compared responses from groups of undergraduates with and without statistical training. Those with such training showed the familiar large difference between .01 and .27 responses; those without statistical training gave the same rate of appropriate responses (around 75%) for the two values of p. Hence the title: statistical training is sufficiently tainted by a focus on statistical significance testing to worsen judgments in some situations. 🙁
Overall these findings add to the large body of evidence that even accomplished researchers are often led astray by reliance on statistical significance testing. I guess statisticians might find a hint of comfort that their results were not as terrible as those of the medicos or psychologists. The possibility of a cliff was not a particular focus of this study. Yes, there’s a big difference between .01 and .27, but we don’t know whether that’s because of a cliff at around .05, or a more gradual change between those two values.
The second study was similar but asked a specifically inferential question, and varied the p values between rather than within respondents. The p values used were .025, .075, .125, and .175, and the respondents were statisticians and epidemiologists. Two questions were asked. The judgment question was: “A person drawn randomly from the same population as the subjects in the study is more likely to recover from the disease if given Drug A than if given Drug B”; the other response options were less likely and equally likely. Both statisticians and epidemiologists responded very differently in the .025 and .075 conditions, whereas results were very similar across the three largest p values. In other words, there was a steep cliff between .025 and .075. The Supplementary Materials reported results from similar studies with researchers in psychology and economics, who showed evidence of a large cliff between .01 and .26.
The second type of question was a choice: “…if you were a patient from the same population as the subjects in the study, what drug would you prefer to take to maximize your chance of recovery?” Response options were: Drug A, Drug B, or indifferent between A and B. The results were more equivocal than for the judgment question. For choice, the statisticians gave more appropriate responses, with only a small, graduated change across p values, whereas the epidemiologists’ results suggested a moderate cliff between .025 and .075. Researchers in psychology and economics showed evidence of a moderate cliff between .01 and .26.
Quotes in Response
Invited responses from distinguished statisticians were published immediately after the article; then came a rejoinder by McShane and Gal. Interesting reading! A couple of quotes:
From Donald Berry, referring to p values and how they are used: “We created a monster. And we keep feeding it, hoping that it will stop doing bad things. It is a forlorn hope. No cage can confine this monster. The only reasonable route forward is to kill it.” (p. 896)
From William M. Briggs: “There are no good reasons nor good ways to use p values. They should be retired forthwith.” (p. 897)
Did anyone mention CIs?
P.S. A big thank you to Blake McShane for comments on a draft of this post.
What about the strange phenomenon of tests with p=0.08 or so which are deemed to be bravely ‘approaching’ or valiantly ‘trending towards’ significance?
Is there a similar phenomenon of tests with p=0.04 slouching away from significance?
Thanks Patrick, well said. There’s a famous list of more than 500 wordings that have been used to magically dress up a p > .05 as, pretty much, significant.
There has even been systematic study of the use of such tricks, with evidence that they have appeared more often over the years.
My question is ‘How do we know that p=.08 is not desperately running AWAY from significance?’
Yet one more reason to consign p values to the dustbin of history.