p values and outrageous results

If you were researching a muscle-building supplement and read that a test of the supplement produced a 200% increase in muscle mass within a month, you’d be right to be skeptical.  Perhaps randomization had broken down, perhaps there was a problem of measurement, or perhaps differential dropout had skewed the results.  Of course, maybe the results will generalize, but in a way the experiment is too successful: the effect size is just too strong to fit with what we know about human physiology.  Given that extraordinary claims require extraordinary evidence, it is wise to suspect a problem with an experiment that produces an outrageous effect size, at least until additional evidence can be collected.

One of the many (many!) problems with p values is that they can disguise outrageous results.  When researchers fixate on whether p is less than alpha, they tend not to consider the effect size at all, short-circuiting the critical step of judging whether the effect size obtained is at all reasonable.  Even worse, outrageous effect sizes produce small p values, so researchers can become even more confident in results they really ought to know are problematic.  To corrupt the Bard: p values let us obliviously suffer the slings and arrows of outrageous effect sizes.
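To make the "outrageous effect sizes produce small p values" point concrete, here is a minimal sketch.  It assumes a simple two-group design with equal group sizes and uses a large-sample normal approximation rather than an exact t test; the function name and the example values are hypothetical, chosen only for illustration.

```python
from math import erfc, sqrt


def approx_two_sided_p(d, n_per_group):
    """Approximate two-sided p value for a two-group comparison.

    Assumes equal group sizes and a normal (z) approximation to the
    t distribution, so it is only a rough guide for small samples.
    """
    # Test statistic implied by a standardized mean difference d
    z = d * sqrt(n_per_group / 2)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) without losing
    # precision when the tail probability is tiny
    return erfc(abs(z) / sqrt(2))


# Hypothetical effect sizes, 20 participants per group
for d in (0.2, 0.5, 0.8, 3.0):
    print(f"d = {d}: p ~ {approx_two_sided_p(d, 20):.2g}")
```

The bigger the effect size, the smaller the p value, so an implausibly huge d shows up as a reassuringly tiny p.  Nothing in the p value itself signals that the effect is too large to believe.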

This may seem hard to believe: surely p values are not so mesmerizing as to cut off all critical thinking about the data obtained.  If only.  Here are two examples I recently came across where researchers reported outrageous effect sizes without really appreciating that they were making extraordinary claims.  Apparently, peer reviewers also succumbed to p-hypnosis.

  • In one case, a respected lab published a paper showing an enormous effect of watching a violent TV show on children’s behavior.  The outrageous nature of the finding was never commented on until a graduate student reviewing the literature happened to read the paper and ‘do a double-take’ on calculating the effect size.  When the grad student contacted the original lab, it triggered a process culminating in the retraction of the paper.  Here’s the whole story at Retraction Watch: http://retractionwatch.com/2017/03/30/group-whose-findings-support-video-game-violence-link-lose-another-paper/
  • And then here’s a great blog post from Daniel Lakens about an (in)famous study of how Israeli judges alter sentencing across the day.  Lots of folks have had issues with the study, but Lakens points out that the first and primary problem is that the effect size obtained is just far, far too large to be driven purely by time of day.  His post is well worth a read.  http://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html
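The kind of double-take described in the first example can be sketched in a few lines.  This assumes the paper reports an independent-samples t statistic (pooled SD) along with the two group sizes; the function names, the example numbers, and especially the plausibility ceiling are hypothetical placeholders, not values taken from either study.

```python
from math import sqrt


def d_from_t(t, n1, n2):
    """Cohen's d implied by an independent-samples t statistic (pooled SD)."""
    return t * sqrt(1 / n1 + 1 / n2)


def looks_outrageous(d, ceiling=2.0):
    """Flag |d| above a ceiling.

    The default ceiling is an arbitrary placeholder; a sensible value
    depends on what is already known about effects in the field.
    """
    return abs(d) > ceiling


# Hypothetical reported result: t = 10 with 20 participants per group
d = d_from_t(10, 20, 20)
print(f"implied d = {d:.2f}, outrageous? {looks_outrageous(d)}")
```

The check is no substitute for judgment, but converting a reported test statistic back into an effect size is often enough to prompt the question the p value never raises: is an effect this big even possible?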

So, add this to the many (many) reasons to avoid p values.  Sometimes experiments go horribly awry, leading to outrageous (and likely erroneous) results.  If you only monitor p values, you could remain oblivious, or even celebrate, from within the smouldering wreck of an experiment.


p.s. – I (Bob) have been on hiatus from blogging most of the summer… I helped organize a teaching of neuroscience conference which ended up being fairly all-consuming (though well worth it).  I’m hoping to be back to weekly posts now.


I'm a teacher, researcher, and gadfly of neuroscience. My research interests are in the neural basis of learning and memory, the history of neuroscience, computational neuroscience, bibliometrics, and the philosophy of science. I teach courses in neuroscience, statistics, research methods, learning and memory, and happiness. In my spare time I'm usually tinkering with computers, writing programs, or playing ice hockey.

