Abandon Statistical Significance!
That’s the title of a paper accepted for publication in The American Statistician. (I confess that I added the “!”) The paper is here. Scroll down to see the abstract. The paper boasts an interdisciplinary team of authors, including Andrew Gelman of blog fame.
I was, of course, eager to agree with all they wrote. However, while there is much excellent stuff and I do largely agree, they don’t go far enough.
Down With Dichotomous Thinking!
Pages 1-11 discuss numerous reasons why it’s crazy to dichotomise results, whether on the basis of a p value threshold (.05, .005, or some other value) or in some other way–perhaps noting whether a CI includes zero, or whether a Bayes factor exceeds some threshold. I totally agree. Dichotomising results, especially as statistically significant or not, throws away information, is likely to mislead, and is a root cause of selective publication, p-hacking, and other evils.
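To make the arbitrariness concrete, here is a small illustration of my own (the numbers are invented, and a normal approximation stands in for the t test): two studies with near-identical estimated effects land on opposite sides of the .05 threshold, so dichotomous thinking declares them to be different kinds of result.

```python
import math

def p_two_sided(diff, sd=1.0, n=25):
    """Two-sided p value for a mean difference between two groups of size n.
    Normal approximation, for illustration only."""
    se = sd * math.sqrt(2 / n)
    z = abs(diff) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# Two hypothetical studies with almost the same estimated effect:
for diff in (0.545, 0.560):
    p = p_two_sided(diff)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"estimated difference = {diff:.3f}, p = {p:.3f} -> {verdict}")
```

The estimates differ by less than 3%, yet the dichotomy labels one study a “success” and the other a “failure”–exactly the information-destroying move the authors criticise.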
So, What to Do?
The authors state that they don’t want to ban p values. They recommend “that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with currently subordinate factors (e.g., related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain) as just one among many pieces of evidence.” (abstract)
That all seems reasonable. Yes, if p values are mentioned at all they should be considered as a graduated measure. However, the authors “argue that it seldom makes sense to calibrate evidence as a function of the p-value.” (p. 6) Yet in later examples, for example in Appendix B, they interpret p values as indicating strength of evidence that some effect is non-zero. I sense an ambivalence: The authors present strong arguments against using p values, but cannot bring themselves to take the logical next step and not use them at all.
Why not? One reason, I think, is that they don’t discuss in detail any other way that inferential information from the data can be used to guide discussion and interpretation, alongside the ‘subordinate factors’ that they very reasonably emphasise all through the paper. For me, of course, that missing inferential information is, most often, estimation information. Once we have point and interval estimates from the data, p values add nothing and are only likely to mislead.
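As a sketch of the kind of estimation-focused report I have in mind (a hypothetical two-group study with simulated data; stdlib Python only, using an approximate t critical value): report the point estimate with its interval, and the reader has everything a p value would convey, and more.

```python
import random, statistics, math

random.seed(1)
# Hypothetical data: two groups of n = 30; true difference 0.5 (in SD units)
control = [random.gauss(0.0, 1.0) for _ in range(30)]
treatment = [random.gauss(0.5, 1.0) for _ in range(30)]

diff = statistics.mean(treatment) - statistics.mean(control)  # point estimate
se = math.sqrt(statistics.variance(treatment) / 30
               + statistics.variance(control) / 30)
crit = 2.00  # approximate t critical value for df near 58
low, high = diff - crit * se, diff + crit * se  # 95% CI

print(f"estimated difference = {diff:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

The interval tells us both the best estimate of the effect and the precision of that estimate–information a lone p value cannot provide, and exactly what a future meta-analysis needs.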
In the context of a neuroimaging example, the authors state that “Plotting images of estimates and uncertainties makes sense to us” (p. 15). Discussing pure research they state “we would like to see researchers simply report results: estimates, standard errors, confidence intervals, etc., with statistically inconclusive results being relevant for motivating future research.” That’s fine, but a long way short of recommending that point and interval estimates almost always be reported and almost always used as the primary data-derived inferential information to guide interpretation.
There is a recommendation for “increased consideration of models that use informative priors, that feature varying treatment effects, and that are multilevel or meta-analytic in nature” (p. 16). That hint is the only mention of meta-analysis.
For me, the big thing missing in the paper is any sense of meta-analytic thinking–the sense that our study should be considered as a contribution to future meta-analyses, as providing evidence to be integrated with other evidence, past and future. Replications are needed, and we should make it as easy as possible for others to replicate our work. From a meta-analytic perspective, of course we must report point and interval estimates, as well as very full information about all aspects of our study, because that’s what replication and meta-analysis will need. Better still, we should also make our full data and analysis open.
For a fuller discussion of estimation and meta-analysis, see our paper that’s also forthcoming in The American Statistician. It’s here.
Here’s the abstract of McShane et al.:
Thanks Mike. I agree–that’s all very reasonable. However, I would be (even) more negative about p values. Virtually always the point and interval estimates (effect size and CI) are more informative than the p value. In addition, a p value can easily be misleading. I suspect that most folks don’t appreciate just how unreliable p values are, in the sense that a close replication is very likely to give a very different p value.
PS For a demo, search YouTube for ‘dance of the p values’ or ‘significance roulette’.
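The ‘dance’ is easy to reproduce in a few lines. Here is a minimal simulation of my own (not from the videos; a normal approximation stands in for the t test, stdlib only): run the identical experiment 20 times, with a true effect of 0.5 SD and n = 32 per group, and watch the p values leap around.

```python
import random, statistics, math

random.seed(2)

def one_replication(n=32, effect=0.5):
    """Simulate one two-group experiment; return a two-sided p value
    (normal approximation, for a stdlib-only sketch)."""
    a = [random.gauss(0.0, 1.0) for _ in range(n)]
    b = [random.gauss(effect, 1.0) for _ in range(n)]
    se = math.sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
    z = (statistics.mean(b) - statistics.mean(a)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

ps = [one_replication() for _ in range(20)]
print([round(p, 3) for p in ps])  # same experiment, widely varying p values
```

Every run simulates exactly the same experiment, yet the p values typically span two or three orders of magnitude–which is why a single p value is such a poor summary of the evidence.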
Simply, p-values give only a preliminary evaluation of the experimental data and say nothing about the merits of the experimental design or its relevance to the “question” being investigated.
The experimental details and data should be available to all who are interested, and for eventual replication by others. If folks don’t know the limitations of the “p” value, perhaps they are in the wrong business.
I am not a statistician but know enough to invite one into the game during experimental design and evaluation. All publication reviews should include a formal statistical evaluation.
Perhaps results should be accompanied by a “no known independent replications” statement as a reminder.