Reminder: Significance Roulette Still Tells Us a p Value Can’t be Trusted
An appreciative comment on YouTube reminds me I haven’t mentioned Significance Roulette for a while, yet its message that a p value can’t be trusted remains as relevant as ever.

Dance of the p Values
The dance of the p values was my first go at making vivid the amazingly large sampling variability of the p value. The original video is here, or search YouTube for ‘dance of the p values’. A simulation of running the same experiment over and over finds that p values typically leap around wildly: Usually, the next p value can take just about any value.
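The dance is easy to reproduce. The sketch below is not the original simulation; it is a minimal stand-in that assumes two independent groups of n = 32, a true effect of half a standard deviation, and a known population SD of 1, so a z test can be used instead of a t test. The names `one_experiment` and `dance`, and every parameter value, are illustrative choices.

```python
import math
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # standard normal, used to turn z into p


def two_tailed_p(z):
    """Two-tailed p value for a z statistic."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def one_experiment(rng, n=32, delta=0.5):
    """Run one two-group experiment and return its p value.

    Assumptions (illustrative only): normal populations, known SD of 1,
    true difference `delta` in SD units, n observations per group.
    """
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(delta, 1.0) for _ in range(n)]
    se = math.sqrt(2.0 / n)  # standard error of the difference, SD known
    z = (mean(treated) - mean(control)) / se
    return two_tailed_p(z)


def dance(reps=25, seed=1):
    """Repeat the identical experiment `reps` times, new samples each time."""
    rng = random.Random(seed)
    return [one_experiment(rng) for _ in range(reps)]


if __name__ == "__main__":
    for p in dance():
        print(f"p = {p:.3f}")
```

Running it prints one p value per replication. Under these assumptions the power is only a little above .5, so the list should mix very small p values with large ones, even though nothing about the experiment changes from run to run.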
But that’s when we know the population mean. What about a more realistic situation when all we know from the initial experiment is the p value? What is a close replication, just the same but with a new sample, likely to give? In other words, what’s replication p likely to be?
In most cases replication p can take just about any value 🙁
Significance Roulette
For explanation and all the formulas, see this paper (cited 375 times). For the demo, search YouTube for ‘Significance Roulette’ to find two videos. Or they are here and here.
The figure above is a wheel that’s equivalent to the distribution of replication p following an initial experiment that gives p = .05. (All p values are two-tailed.) Amazingly, the wheel applies whatever the N (unless very small), the power, or the population effect size. All we need is that p = .05, then spinning the wheel is equivalent to running a replication, if all we are interested in is the p value given by that replication. It’s also way cheaper and easier.
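For readers who want to see roughly how such a wheel can be computed, here is a hedged sketch. It assumes the simple known-SD (z) case and no prior knowledge of the effect size, in which case the replication z is normally distributed around the initial z with standard deviation √2. The function names and the five segments (cut points at .001, .01, .05 and .10) are illustrative choices for this sketch, not taken from the paper.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def prob_rep_p_below(alpha, p_initial):
    """P(replication two-tailed p < alpha), given the initial two-tailed p.

    Assumption: z_rep ~ Normal(z_initial, sqrt(2)), i.e. known SD and a
    flat prior on the population effect size.
    """
    z_initial = STD_NORMAL.inv_cdf(1.0 - p_initial / 2.0)
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    rep = NormalDist(mu=z_initial, sigma=math.sqrt(2.0))
    # Replication p < alpha when |z_rep| exceeds the critical z, in either tail.
    return (1.0 - rep.cdf(z_crit)) + rep.cdf(-z_crit)


def wheel(p_initial, cuts=(0.001, 0.01, 0.05, 0.10)):
    """Share of the wheel taken by each band of replication p."""
    cumulative = [prob_rep_p_below(c, p_initial) for c in cuts]
    edges = [0.0] + cumulative + [1.0]
    labels = ["p < .001", ".001 to .01", ".01 to .05", ".05 to .10", "p > .10"]
    return {label: edges[i + 1] - edges[i] for i, label in enumerate(labels)}


if __name__ == "__main__":
    for label, share in wheel(0.05).items():
        print(f"{label:>12}: {share:.3f}")
```

Note that neither N nor the effect size appears anywhere in the calculation: only the initial p value goes in. Under this model, an initial p = .05 leaves just over half the wheel at p < .05 and roughly 40% at p > .10. Calling `wheel(0.01)` gives a larger share below .001 and a smaller share above .10, but still a wide spread.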
Of course, if the initial p value is different we’ll need a different wheel. Below is the wheel for initial p = .01. More red (p < .001) and less deep blue (p > .10), but still an alarmingly wide spread of possibilities.

You don’t believe me? It took me ages to accept what the formulas and simulations were telling me. But they are correct. We really, really can’t trust a p value, which seems to promise certainty, a clear outcome. Far, far better to use the confidence interval, whose length makes the degree of uncertainty salient, even though that’s often a depressing message.
Geoff