Cardiac Surgery: Yet One More Research Field Highly Critical of p Values
Replacement heart valves, bypasses, transplants: Cardiac surgery research has given us these life-saving goodies, and more. Now this vital research field has joined many others in appreciating the damage that reliance on p values can bring.
David, originally from Queensland, explains that he became increasingly uneasy about p values during his decades of research and clinical practice. “Sometimes, it seemed that researchers can dial up just about any result they wanted.” In other words, p-hacking, cherry-picking, and other Questionable Research Practices seemed to be all around.
In 2013 he took up his present position at Monash. He started reading about p values, came across some of my work, and got in touch. We quickly discovered that we shared many views. David kindly arranged for the invitation that led to me giving two talks at a conference on cardiothoracic surgery in 2018, as I blogged about here. As usual, I found it great fun.
David started work on a p value article for his research field, then involved colleagues and undertook simulations. Several years and many discussions and revisions later, our review emerged.
In our review, we first discuss the weird backward logic of p values and NHST, and how it often misleads users into the inverse probability fallacy, which underlies many p value misconceptions. Then we have a section on the enormous sampling variability of the p value: Replicate an experiment and you are likely to get a very different p value, so p values are highly unreliable and not to be trusted.
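The sampling variability just described is easy to see in a small simulation. The sketch below is not taken from the review: the effect size, group size, number of replications, and the use of a z test with a known standard deviation are all assumptions chosen to keep the example short and dependent only on the Python standard library.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(sample_a, sample_b, sigma):
    """Two-sided z-test p value for a difference in means (sigma assumed known)."""
    se = sigma * (1 / len(sample_a) + 1 / len(sample_b)) ** 0.5
    z = (mean(sample_b) - mean(sample_a)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def replicate_p_values(n_reps=25, n_per_group=32, true_diff=0.5, sigma=1.0, seed=1):
    """Run the same two-group experiment n_reps times and collect the p values.

    All defaults are hypothetical; the true effect is identical in every replication.
    """
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_reps):
        control = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        treated = [rng.gauss(true_diff, sigma) for _ in range(n_per_group)]
        p_values.append(two_sided_p(control, treated, sigma))
    return p_values


p_values = replicate_p_values()
print(" ".join(f"{p:.3f}" for p in sorted(p_values)))
print(f"smallest p = {min(p_values):.4f}, largest p = {max(p_values):.4f}")
```

Every replication samples from the same populations, so any spread in the printed p values comes from sampling variability alone.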
Lessons from simulations
Then we report three simulation studies run by John Reynolds that used a large database of cardiac surgery cases to explore ways that NHST calculations are typically conducted. We conclude that:
- Assumptions matter. For example, if a measure is distinctly not normally distributed in the population, inference calculations can be highly misleading. In some cases a transformation can help, but it should be specified in advance.
- Failing to reject a null hypothesis is never sufficient justification for accepting that null, and this is especially the case when statistical power is low.
- Inference calculations can be misleading when analysing rates, especially when rates are low and some subgroups are very small.
- CI and p value calculations can suggest differing conclusions for a number of reasons, including different approximate calculation methods for the two, even if both methods are the defaults in a statistical package.
These conclusions are not new, but the simulations provide striking illustrations in a context familiar to cardiac surgery researchers.
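To make the last two points concrete, here is a minimal sketch of how a CI and a p value can point in different directions for a low rate in a small group. The numbers (2 events in 30 cases, compared with a 20% benchmark) are invented for illustration, and the pairing of a Wald interval with a score test is my assumption, not necessarily the pair of methods examined in the review.

```python
from statistics import NormalDist


def wald_ci(events, n, level=0.95):
    """Wald interval for a proportion: standard error based on the observed rate."""
    p_hat = events / n
    se = (p_hat * (1 - p_hat) / n) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return p_hat - z * se, p_hat + z * se


def score_test_p(events, n, p_null):
    """Two-sided score (z) test: standard error based on the null rate."""
    p_hat = events / n
    se_null = (p_null * (1 - p_null) / n) ** 0.5
    z = (p_hat - p_null) / se_null
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical data: 2 events in 30 cases, benchmark rate of 20%
events, n, benchmark = 2, 30, 0.20
lower, upper = wald_ci(events, n)
p_value = score_test_p(events, n, benchmark)

print(f"Observed rate {events / n:.3f}, 95% Wald CI [{lower:.3f}, {upper:.3f}]")
print(f"Score test against {benchmark:.2f}: p = {p_value:.3f}")
print("CI excludes benchmark:", not (lower <= benchmark <= upper))
print("p < .05:", p_value < 0.05)
```

With these invented numbers the interval excludes the benchmark while the test does not reach p < .05, because the two calculations use different standard errors. The Wald lower limit also falls below zero, which is one sign that a simple approximation is struggling with a low rate in a small group.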
After some comments (controversial in my view) about cases where p values may have some value, we describe a range of ways that p values can distort science. We mention HARKing (hypothesising after the results are known) and JARKing (justifying after the results are known). ‘JARK’ was new to me, but is a nice acronym. Another nice expression, particularly familiar to those in the field, is the ‘three-number summary’: the estimated effect size together with the lower and upper limits of the CI on that estimate.
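As a rough illustration of a three-number summary, the sketch below reports a difference in means with its 95% CI. The data are made up, the function name is hypothetical, and the interval uses a normal approximation with unpooled variances; a t-based interval would be somewhat wider for samples this small.

```python
from statistics import NormalDist, mean, stdev


def three_number_summary(group_a, group_b, level=0.95):
    """Return (estimate, lower, upper) for the difference in means, b minus a.

    Normal approximation with unpooled variances; an illustrative assumption only.
    """
    diff = mean(group_b) - mean(group_a)
    se = (stdev(group_a) ** 2 / len(group_a) + stdev(group_b) ** 2 / len(group_b)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical lengths of stay in days under two care protocols
standard = [7, 9, 8, 10, 6, 9, 11, 8, 7, 9]
new_protocol = [6, 7, 8, 6, 5, 8, 7, 9, 6, 7]

estimate, lower, upper = three_number_summary(standard, new_protocol)
print(f"Difference: {estimate:.1f} days, 95% CI [{lower:.1f}, {upper:.1f}]")
```

Reporting all three numbers keeps the focus on how large the effect might plausibly be, not on whether a threshold was crossed.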
Our conclusion is that cardiac surgery researchers should adopt Open Science practices and use estimation, or other improved approaches, perhaps Bayesian.