That’s the title of a paper accepted for publication in The American Statistician. (I confess that I added the “!”) The paper is here. Scroll down below to see the abstract. The paper boasts an interdisciplinary team of authors, including Andrew Gelman of blog fame.
I was, of course, eager to agree with all they wrote. However, while there is much excellent stuff and I do largely agree, they don’t go far enough.
Down With Dichotomous Thinking!
Pages 1-11 discuss numerous reasons why it’s crazy to dichotomise results, whether on the basis of a p value threshold (.05, .005, or some other value) or in some other way–perhaps noting whether or not a CI includes zero, or the Bayes Factor is greater than some threshold. I totally agree. Dichotomising results, especially as statistically significant or not, throws away information, is likely to mislead, and is a root cause of selective publication, p-hacking, and other evils.
So, What to Do?
The authors state that they don’t want to ban p values. They recommend “that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with currently subordinate factors (e.g., related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain) as just one among many pieces of evidence.” (abstract)
That all seems reasonable. Yes, if p values are mentioned at all they should be considered as a graduated measure. However, the authors “argue that it seldom makes sense to calibrate evidence as a function of the p-value.” (p. 6) Yet in later examples, for example in Appendix B, they interpret p values as indicating strength of evidence that some effect is non-zero. I sense an ambivalence: The authors present strong arguments against using p values, but cannot bring themselves to make the logical next step and not use them at all.
Why not? One reason, I think, is that they don’t discuss in detail any other way that inferential information from the data can be used to guide discussion and interpretation, alongside the ‘subordinate factors’ that they very reasonably emphasise all through the paper. For me, of course, that missing inferential information is, most often, estimation information. Once we have point and interval estimates from the data, p values add nothing and are only likely to mislead.
In the context of a neuroimaging example, the authors state that “Plotting images of estimates and uncertainties makes sense to us” (p. 15). Discussing pure research they state “we would like to see researchers simply report results: estimates, standard errors, confidence intervals, etc., with statistically inconclusive results being relevant for motivating future research.” That’s fine, but a long way short of recommending that point and interval estimates almost always be reported and almost always used as the primary data-derived inferential information to guide interpretation.
There is a recommendation for “increased consideration of models that use informative priors, that feature varying treatment effects, and that are multilevel or meta-analytic in nature” (p. 16). That hint is the only mention of meta-analysis.
For me, the big thing missing in the paper is any sense of meta-analytic thinking–the sense that our study should be considered as a contribution to future meta-analyses, as providing evidence to be integrated with other evidence, past and future.. Replications are needed, and we should make it as easy as possible for others to replicate our work. From a meta-analytic perspective, of course we must report point and interval estimates, as well as very full information about all aspects of our study, because that’s what replication and meta-analysis will need. Better still, we should also make our full data and analysis open.
For a fuller discussion of estimation and meta-analysis, see our paper that’s also forthcoming in The American Statistician. It’s here.
Here’s the abstract of McShane et al.: