One More Time: p Values are Scarily Unreliable

Replicate a study and you are highly likely to get a very different p value. Scarily different. The sampling variability of p values is so great that no p value deserves our trust.

Yes, I know, that’s been my mantra for more than a decade, but Steve Lindsay has just given me an excuse to beat the drum again. This is his great article:

It’s clear that Lindsay (2020) (preprint here) was written by a practical, practising researcher. For example, there’s advice about references to help any new arrival in your lab get up to Open Science speed, and a discussion of developing a laboratory manual to help everyone adopt good, systematic OS practices. His seven steps are an operationalisation of many such practices.

Possibly broader and more detailed accounts of what OS needs were given by Asendorpf et al. (2013) and Munafò et al. (2017). Dorothy Bishop (2020) gave a particularly compelling account, with a focus on overcoming the cognitive biases that make adopting OS practices a challenge.

Dance of the p Values

To illustrate his discussion of the variability of ESs over replications, Steve Lindsay uses a figure (below) of the dance of the CIs and the dance of the p values.

Screenshot from ESCI for UTNS. The blue and red distributions depict control and experimental populations, whose means differ by Cohen’s δ = 0.5 (or 10 raw points). The 20 open blue and 20 open red circles were randomly sampled from those populations. Each solid green circle represents the result of a simulated experiment: A raw effect size (experimental mean minus control mean) from random draws from the two populations, with the 95% CI around that ES estimate. The topmost (most recent) solid green circle represents the difference between the solid blue and red circles, the group means. In that random draw, the difference between conditions was not statistically significant (p = .229, as shown at left, corresponding to that CI considerably overlapping the vertical H0 line), a type II error. Of the 25 simulated experiments shown here, three came out in the wrong direction because of random sampling error; seven experiments detected the effect (p < .05, CI fully to the right of the H0 line) and every one of those overestimated the size of the effect.
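If you’d like to see the bare bones of what the figure is doing, here is a minimal Python sketch of the same kind of simulation. It is my own reconstruction, not the ESCI code: two populations differing by δ = 0.5 (SD assumed to be 20, so 10 raw points), n = 20 per group, and each simulated experiment reporting its raw ES, 95% CI, and p value. The control population mean of 50 is an arbitrary choice for illustration.

```python
# Minimal sketch of the dance of the p values (my reconstruction, not ESCI).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, sd, delta = 20, 20, 0.5          # per-group n, population SD, Cohen's delta (10 raw points)

for rep in range(25):
    control = rng.normal(50, sd, n)                      # control population (mean 50, arbitrary)
    experimental = rng.normal(50 + delta * sd, sd, n)    # experimental population, 10 points higher
    es = experimental.mean() - control.mean()            # raw effect size for this experiment
    t, p = stats.ttest_ind(experimental, control)        # two-sample t test, pooled variance
    sp2 = ((n - 1) * control.var(ddof=1) + (n - 1) * experimental.var(ddof=1)) / (2 * n - 2)
    moe = stats.t.ppf(0.975, df=2 * n - 2) * np.sqrt(sp2 * 2 / n)   # 95% CI margin of error
    print(f"rep {rep + 1:2d}: ES = {es:6.2f}, "
          f"95% CI [{es - moe:6.2f}, {es + moe:6.2f}], p = {p:.3f}")
```

Run it a few times with different seeds and you see the same story as the figure: the ESs and CIs bounce around, and the p values bounce around far more wildly.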

CIs Make Uncertainty Salient

In the figure it’s easy, with a bit of practice, to focus on a single CI, note where it falls in relation to the vertical line (H0 of zero difference), and eyeball the p value. (ITNS Chap 6 has handy guidelines.) The two dances are telling us the same story.

But, and it’s a huge ‘but’, any single CI—which is all we know in real research life—gives us information about how wide, how frenetic the dance is. It makes the uncertainty salient. In stark contrast, any single p value tells us virtually nothing about the dancing p values. Its single value, perhaps reported to three decimal places, gives a seductive, but illusory, sense of certainty, even though it could have been very different. Yep, p is not to be trusted!
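For anyone who wants the arithmetic behind that eyeballing, here is a small sketch of the link between a single CI and its p value. It uses the normal approximation (so it assumes df is not small), and it is not the ITNS guidelines themselves, just the underlying relation: the CI half-width tells you the SE, and the ES divided by that SE gives the z from which p follows.

```python
# Sketch: recover the two-tailed p for H0 (zero effect) from a point estimate and its CI.
# Normal approximation assumed (reasonable when df is not small).
from scipy import stats

def p_from_ci(es, lower, upper, conf=0.95):
    z_crit = stats.norm.ppf(1 - (1 - conf) / 2)    # 1.96 for a 95% CI
    se = (upper - lower) / (2 * z_crit)            # half-width divided by the critical value
    return 2 * stats.norm.sf(abs(es) / se)         # two-tailed p

print(p_from_ci(4.9, -3.1, 12.9))   # CI overlapping zero -> p well above .05 (about .23)
print(p_from_ci(10.0, 2.0, 18.0))   # CI clear of zero -> p below .05 (about .014)
```

The values in the example calls are made up for illustration; the point is simply that a CI carries the p value within it, plus much more besides.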

Software for the Dance of the p Values

I used ESCI in Excel 2003 (the best version ever, imho!) for the original dance of the p values video, and also for this later video. It ran beautifully quickly. Unfortunately, Excel 2007 runs very much more slowly; in the 2007 version of ESCI for UTNS, for example, the replications plod down the screen.

Happily, we now have esci web, thanks to Gordon Moore. The first component is dances. Click ‘?’, top right in left panel, to turn on tips. Then have a play. For the dance of the p values, click red ‘9’, bottom sub-panel. Turn on sound and use the slider to adjust the speed, which can be too fast. Enjoy!

p Values Unreliable in Virtually Any Situation

Have I chosen the sample sizes and population ES, δ, to give a particularly dramatic dance of the p values? You can use ESCI for UTNS, or esci web, to investigate. You will discover, for example, that larger sample sizes and/or larger δ give an average p that’s smaller (higher power, so more replications give p < .05), but there is still crazy dancing of the values of p.
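If you prefer a numerical check to watching the simulation, here is a small sketch (again my own, not ESCI) that varies n and δ and summarises the spread of p over many replications. The particular (n, δ) pairs are arbitrary choices for illustration.

```python
# Sketch: how n and delta change the dance. Typical p drops, but p still varies wildly.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def p_spread(n, delta, reps=10_000):
    control = rng.normal(0, 1, (reps, n))
    experimental = rng.normal(delta, 1, (reps, n))
    p = stats.ttest_ind(experimental, control, axis=1).pvalue   # one p per simulated experiment
    q25, q50, q75 = np.percentile(p, [25, 50, 75])
    return q50, q25, q75, (p < .05).mean()

for n, delta in [(20, 0.5), (40, 0.5), (20, 0.8), (80, 0.8)]:
    med, q25, q75, power = p_spread(n, delta)
    print(f"n = {n:3d}, delta = {delta}: median p = {med:.3f}, "
          f"middle 50% of p in [{q25:.3f}, {q75:.3f}], power ~ {power:.2f}")
```

Even in the high-power cases, the middle 50% of p values spans a wide range, and the full range is far wider still.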

For virtually any situation, the replication to replication variation in p is scarily large. Whenever you see a p value reported, remind yourself that the study could easily have given a quite different value, and that an exact replication is likewise likely to give a very different value. Any p value should be regarded as very fuzzy indeed.

What if We Don’t Know the Population ES, But Only the p Value?

The dance of the p values assumes that we know the population ES, δ, and N. However, in real research life we never know δ. In Cumming (2008) I developed the formulas for the sampling distribution of the p value for two situations:

  1. δ and sample size(s) are known, as in the dances of the p values above; and
  2. all we know is the p value given by an initial study.

For 2, we assume a single replication that is exactly the same as the initial study except that a new sample is taken. (This also assumes sample sizes are not small.) I refer to the p value given by such a replication as replication p.

The article includes figures showing the two distributions for a range of cases. For the second case, the figure shows the distribution of replication p following a specified p value found in the initial study.

To my surprise, I found that it’s perfectly possible to find and picture the distribution of replication p: Tell me only p from your study, and I’ll tell you the chance that an exact replication will give, for example, p < .05. Whatever the power, the N (if not small), and the true ES.

Not surprisingly, the distribution of replication p is very wide. Yes, following an initial p = .01 you are likely, on average, to get a smaller replication p than following an initial p = .05. But in both cases there’s vast uncertainty: almost any replication p may occur.
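For readers who like to see the machinery, here is a minimal Python sketch of the logic as I would reconstruct it (it is not the code or notation from Cumming, 2008). If the initial two-tailed p corresponds to z_obs, and we know nothing else about δ, then for N not small the replication’s z is approximately normal around z_obs with SD √2, because the initial and replication effect estimates each carry one unit of sampling error. From that, the chances of landing in the conventional p intervals follow directly.

```python
# Sketch (my reconstruction): distribution of replication p given only an initial two-tailed p.
# Assumption: z_rep | z_obs is approximately N(z_obs, sqrt(2)), N not small, delta otherwise unknown.
import numpy as np
from scipy import stats

def replication_p_chances(p_initial, reps=1_000_000, seed=3):
    rng = np.random.default_rng(seed)
    z_obs = stats.norm.isf(p_initial / 2)             # z corresponding to the initial two-tailed p
    z_rep = rng.normal(z_obs, np.sqrt(2), reps)       # simulated replication z values
    p_rep = 2 * stats.norm.sf(np.abs(z_rep))          # two-tailed replication p values
    cuts = [0, .001, .01, .05, .10, 1]
    labels = ["p < .001", ".001 to .01", ".01 to .05", ".05 to .10", "p > .10"]
    counts, _ = np.histogram(p_rep, bins=cuts)
    return dict(zip(labels, counts / reps))

print(replication_p_chances(.05))   # roughly .17, .16, .17, .09, .41
print(replication_p_chances(.01))   # chance of p < .001 rises to about .31; p > .10 falls to about .25
```

Under this approximation the chances for an initial p = .05 come out close to the 7/38 and 15/38 shown on the wheel below, and those for p = .01 close to the 12/38 and 10/38 on the second wheel.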

Significance Roulette

How could I dramatize the almost unbelievably large amount of uncertainty in replication p? A mere diagram of the sampling distribution lacks punch. I divided the area under the distribution curve into 38 equal areas, because that’s the number of slots on one common roulette wheel. I used the p value at the centre of each area to represent that area. I arranged those 38 p values haphazardly around a wheel, in fact a roulette wheel. This figure is the wheel for initial p = .05:

After an initial study gives p = .05, a replication, just the same but with a new sample, will give replication p as shown by the wheel. The distribution of replication p is illustrated at left in terms of five conventional intervals. All p values are two-tailed.

If you find an initial p = .05, then to find what p an exact replication is likely to give, spin the wheel. Each of the 38 values is equally likely. At left is a summary, in terms of five conventional intervals of p values. You have a 7/38 chance of *** (p < .001, bright red circles) and a 15/38 chance of p > .10 (deep blue circles). A mere tiny flip of the wheel and, instead of p = .39, as pictured, you might have obtained .02, or maybe ** or ***.
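Here is a sketch of how the 38 wheel values could be generated. It is my reconstruction, not the original Excel workbook: slice the distribution of replication p into 38 equal-probability areas and take the p at the centre of each slice, i.e. the quantiles at (i − 0.5)/38, then shuffle them around the wheel. It relies on the same z_rep ≈ N(z_obs, √2) approximation sketched above.

```python
# Sketch (my reconstruction): 38 equally likely representative values of replication p.
import numpy as np
from scipy import stats

def wheel_values(p_initial, slots=38, reps=1_000_000, seed=4):
    rng = np.random.default_rng(seed)
    z_obs = stats.norm.isf(p_initial / 2)
    p_rep = 2 * stats.norm.sf(np.abs(rng.normal(z_obs, np.sqrt(2), reps)))
    centres = (np.arange(slots) + 0.5) / slots * 100    # centre percentile of each equal slice
    values = np.percentile(p_rep, centres)              # one representative p per slice
    rng.shuffle(values)                                 # arrange haphazardly around the wheel
    return values

print(np.round(wheel_values(.05), 3))
```

Each printed value is equally likely, so spinning the wheel is just sampling one of them at random.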

Here’s the wheel for initial p = .01:

As for the previous figure, but now for initial p = .01.

As we expect, replication p values are on average smaller than for the first wheel, but there’s still enormous variation. Chance of *** is now 12/38 and chance of p > .10 is now 10/38. Once again there’s vast uncertainty ☹

Videos for Significance Roulette

For better explanation than my brief sketch above, see two videos, here (tiny.cc/SigRoulette1) and here (tiny.cc/SigRoulette2). Incidentally, the wheel runs in Excel 2003 but, despite great effort, I haven’t been able to get it running in later Excel. If anyone would like to build it in some more modern language, preferably to run on the web, that would be great. Please let me know.

Beyond p Values

I’ve long argued, for example in Cumming (2014), that p values are rarely needed and that almost always we’re better off if we simply don’t report them, and don’t see them in published articles. They should simply be left to wither away. The only reason students should need to learn about them is to make sense of old research literature.

However, NHST and p values seem to be the researcher’s heroin. For most of us, and the published literature, they are deeply embedded and it seems very challenging to overcome the addiction. Rational argument has not proved effective. Can dramatization of unreliability do better?

“To What Extent…?”

For years I have been schooling myself never to ask “I wonder whether…” even when daydreaming. It always has to be “I wonder to what extent…”. Give it a try. That’s the step from dichotomous to estimation thinking. Our world simply isn’t black and white, but a gazillion shades of grey, not to mention colours.

Meanwhile, enjoy seeing the wheel spin, and musing about what it tells us.

May all your confidence intervals be short!

Geoff

Asendorpf, J. B., et al. (2013). Recommendations for increasing replicability in psychology. European Journal of Personality, 27(2), 108–119. https://doi.org/10.1002/per.1919

Bishop, D. V. M. (2020). The psychology of experimental psychologists: Overcoming cognitive constraints to improve research: The 47th Sir Frederic Bartlett Lecture. Quarterly Journal of Experimental Psychology, 73(1), 1–19. https://doi.org/10.1177/1747021819886519

Cumming, G. (2008). Replication and p Intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3(4), 286–300. https://doi.org/10.1111/j.1745-6924.2008.00079.x

Cumming, G. (2014). The New Statistics: Why and how. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966

Lindsay, D. S. (2020). Seven steps toward transparency and replicability in psychological science. Canadian Psychology/Psychologie canadienne, 61(4), 310–317. https://doi.org/10.1037/cap0000222

Munafò, M. R., et al. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, 0021. https://doi.org/10.1038/s41562-016-0021
