You’ve got to build your love on a solid foundation– p < .05 does not mean you have enough data to have done good science
Joe Tex sang it well: You’ve got to build your love on a solid foundation.
Applied to science: you should build a research program that is robust, generative, and fruitful… a solid foundation for exploring the hidden mechanisms at work behind the visible world.
It is increasingly clear that much of our science is not built on a solid foundation–too much of what has been published is difficult to replicate, ephemeral, and/or highly uncertain. Lots of ink has been spilt over the past few years diagnosing the problem–flexible research analysis, publication bias, the misuse of p values, HARKING, etc. This has been a vital conversation; it has identified many ways to improve the research process. So far, though, we haven’t properly addressed the number-one problem: insufficient sample size.
Science done with insufficient samples isn’t really science at all. It’s a house built on sand, unlikely to contribute lasting knowledge. Even worse, research published with inadequate samples pollutes the literature with false leads and blind alleys. This isn’t an opinion–it’s the basic maths of how sampling works: small samples have high error! Even though this is obvious, sample sizes in psychology and related fields have long been inadequate (Cohen, 1962; Marszalek, Barber, Kohlhart, & Cooper, 2011; Sedlmeier & Gigerenzer, 1989) and remain completely inadequate (Button et al., 2013; Szucs & Ioannidis, 2016).
Given that the problem is long-standing and well-documented, why does it persist? Because scientists erroneously think that p < .05 inoculates against these problems. That is, even though most scientists will acknowledge in general that inadequate sample sizes are an issue, they don’t have concerns about their own specific research. They take p < .05 to be a magic badge that protects them from issues related to sample size. This is a mass delusion that is incredibly harmful to science. It’d be like having a surgeon who thinks that wearing a special badge means she no longer needs to scrub in for surgery. Badges don’t protect against infection; p values don’t indicate adequate sample size.
I came up against this pernicious misconception in full force this year at the Society for Neuroscience meeting. I gave a talk there about sample-size planning. What was informative (and scary) were the conversations I had leading up to the presentation. Numerous friends and colleagues mentioned they were sending their students to the talk. But most of them were quick to assure me that their labs were doing fine–they always used adequate samples. “How do you know?” I would ask. The answers always related to obtaining statistical significance. I don’t want to embarrass the people I talked to, but here is an example in print. It’s from an editorial by a prominent cell biologist who ridicules journal editors for not sharing their magical thinking about sample sizes:
Somehow, journals have taken to asking how our animal studies were powered − and here’s the point: If the results are statistically significant then, indeed, our study is appropriately powered. Sometimes, I’m not sure that the editors who insist on this information understand this. (Mole, 2017)
This is the rock of ignorance on which 1,000 studies about poor statistical power have broken. Scientists don’t “hear” Cohen, Gigerenzer, and others because they don’t realize that issues of power and sample size relate even to their own studies.
Let’s see if we can dispel this misconception. Here is a result I plucked more-or-less at random from a recent issue of the Journal of Neuroscience. In this study, researchers were trying to find ways to reduce the impact of having a stroke on subsequent brain and behavioral function (Xu et al., 2017). They developed a special ‘knock out’ strain of mouse that lacked function of a specific gene. They then induced strokes in both control mice and the knockout mice, and checked to see the impact of the stroke on a global measure of motor function. Awesomely, the gene knockout was helpful: the knockout mice were less impaired after the stroke than the control mice (p = .017). In fact, in quantitative terms it was a very large effect. From the data provided in Figure 1C it looks like Cohen’s d would be -1.55, meaning that there is only 45% overlap between the two groups in terms of their motor function. In this sample, the treatment had an almost-qualitative difference in stroke outcomes.
Awesome… but is this a solid foundation for future research? Should we start thinking about clinical trials? Should other researchers start exploring the mechanisms by which losing function of this gene can have neuroprotective effects? If another researcher wanted to replicate and extend these effects, would it be likely for them to also find such a staggeringly large effect? Surprisingly: No! This result is intriguing and statistically significant, but it is not a solid foundation for future research. Let’s take a look at why not.
The problem is that the sample size is inadequate, leaving tremendous uncertainty about the size of the effect. In this (small) sample, there was a large effect of d = -1.55. That’s large, potentially a breakthrough, and easy for others to study and explore. Doing a mechanistic followup would take only 8 animals per group to obtain 80% power if the sample is a perfect representation of the population. But that’s just it–samples are not perfect representations of a population even under ideal circumstances. We should expect sampling error, and we should expect that sampling error to be large when samples are small (in this case, 4-8 per group in the original study). The figure shows the 95% CI based on the effect. Yes, it is statistically significant, meaning that we can rule out an effect size of exactly 0. But what can we rule in? It would be plausible for the real effect to be even larger, an enormous breakthrough in stroke prevention (d = -3.05 is at one end of the 95% CI). But it might also be just a modest protection from stroke, improving outcomes by only d = -0.25 –that’s the other end of the 95% CI.
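To get a feel for how wide that interval is, here is a rough sketch in Python. It uses the common large-sample approximation for the standard error of Cohen's d, not whatever method produced the interval quoted above, so it will not reproduce those exact bounds. The group sizes of 4 and 8 are my assumption, chosen to be consistent with "4-8 per group" and 12 animals in total; the function name is made up for illustration.

```python
from statistics import NormalDist


def approx_ci_for_d(d, n1, n2, level=0.95):
    """Approximate CI for Cohen's d using a normal approximation.

    This is a textbook large-sample formula; with groups this small it is
    only a rough guide (an exact interval would use the noncentral t).
    """
    # Approximate standard error of d for two independent groups.
    se = ((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))) ** 0.5
    # Two-sided critical value, e.g. about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se


# ASSUMPTION: group sizes of 4 and 8 (12 animals total); not stated exactly above.
lower, upper = approx_ci_for_d(-1.55, n1=4, n2=8)
print(f"approximate 95% CI for d: {lower:.2f} to {upper:.2f}")
```

Even with this cruder method the story is the same: the interval spans everything from a huge effect to a small one, and its upper end sits only a little below zero.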
Of course, there’s nothing special about the ends of 95% CI. The real effect could be outside these boundaries, though more likely inside. The main point, though, is that we don’t really know enough yet about this treatment to do good science moving forward. If the real effect is weak, like d = -0.25, then it would take about 252 animals/group to effectively do a follow-up study. That’s not financially or ethically feasible for pre-clinical research. In other words, this result, though statistically significant, could be a blind alley. Remember, this is not the only preclinical result related to stroke. Given that we can’t follow up on all leads, it is not enough to know that this *might* be a breakthrough. We need to know with more certainty if this is a strong enough effect to invest further resources into. At this point, not enough data has been collected to make that determination.
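If you want to see where a number like 252 comes from, here is a minimal sketch using the standard normal-approximation formula for a two-group comparison at 80% power and a two-sided alpha of .05. The function name is mine, and this is a back-of-the-envelope calculation, not a substitute for proper power software.

```python
import math
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate animals needed per group to detect effect size d.

    Normal approximation to the two-sample t-test; it runs slightly low
    when the answer is a small n, because it ignores the extra
    uncertainty from estimating the standard deviation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided criterion
    z_power = NormalDist().inv_cdf(power)          # desired power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


n_weak = n_per_group(-0.25)    # the modest end of the interval
n_strong = n_per_group(-1.55)  # the effect as observed in the sample
print(n_weak, n_strong)
```

For the weak effect this gives 252 per group. For the observed effect it gives 7, a touch below the 8 per group you get from a t-based calculation, which is exactly the small-sample shortfall noted in the comment.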
Let’s also think about replication. Suppose a group were to try to replicate this outcome. The original research had 12 animals in total. Let’s say to be on the safe side the researchers go with 16 animals total (8/group)–that’s quite an expense and investment, especially if they want to do an extension that will require more groups. If the original sample is a perfect representation of the population then this new sample will have 80% power to detect the effect. But if the original sample even slightly over-represented the real effect, this replication effort is likely to fail. Suppose, for example, the real effect is d = -1. That’s still a sizeable effect and not too far from what was found in the sample. But even with just this slight over-estimation, the power of a reasonable replication study would be only 0.44; that is, it would be more likely than not to fail to detect the effect, even though it is a real and sizeable effect. Even with p < .05, studies with inadequate samples are unlikely to replicate and are a poor guide for future research.
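You don't have to take the power numbers on faith; you can simulate the replication. The sketch below is my own illustration, not anything from the original paper: it assumes normally distributed scores with equal variance in both groups, uses a pooled-variance t-test, and hard-codes the two-sided .05 critical value for 14 degrees of freedom (2.145, from a standard t table). Because it is a simulation, the estimates wobble a little and will only land in the neighborhood of the figures above.

```python
import math
import random


def simulated_power(d, n_per_group, t_crit, n_sims=20000, seed=1):
    """Estimate power by simulating many two-group replication studies.

    ASSUMPTIONS: normal data, SD of 1 in both groups, true standardized
    effect d, pooled-variance t-test judged against t_crit.
    """
    rng = random.Random(seed)
    n = n_per_group
    significant = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        treated = [rng.gauss(d, 1.0) for _ in range(n)]
        mean_c = math.fsum(control) / n
        mean_t = math.fsum(treated) / n
        var_c = math.fsum((x - mean_c) ** 2 for x in control) / (n - 1)
        var_t = math.fsum((x - mean_t) ** 2 for x in treated) / (n - 1)
        pooled_var = (var_c + var_t) / 2  # equal group sizes
        t_stat = (mean_t - mean_c) / math.sqrt(pooled_var * 2 / n)
        if abs(t_stat) > t_crit:
            significant += 1
    return significant / n_sims


T_CRIT_DF14 = 2.145  # two-sided .05 critical t for 8 + 8 - 2 = 14 df

power_if_exact = simulated_power(-1.55, 8, T_CRIT_DF14)
power_if_overestimated = simulated_power(-1.0, 8, T_CRIT_DF14)
print(f"true d = -1.55: power is about {power_if_exact:.2f}")
print(f"true d = -1.00: power is about {power_if_overestimated:.2f}")
```

When the true effect matches the original estimate, roughly four out of five simulated replications come out significant. Drop the true effect to d = -1 and fewer than half do.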
So — statistical significance is not armor against the injuries caused by inadequate sample sizes. Even when p < .05 you may not have collected enough data to have a solid foundation for future research–it can remain unclear if you’ve found a meaningful or meaningless effect, it can remain unclear how to effectively replicate the research, and it can remain unclear if mechanistic follow-ups are feasible or wise.
If not by p values, how do we achieve a solid scientific foundation? By obtaining sample sizes that give us reasonable precision–that reduce our uncertainty to a range that makes planning for the next steps feasible and tractable. We can do this by increasing sample sizes, decreasing noise, and/or increasing the magnitude of the effect. But in the end, we want a margin of error that is considerably less than the effect size. A statistically significant result tells us only that the margin of error is at least fractionally smaller than the effect size–and that, more often than not, is too uncertain to build your love on.
Amen, Joe Tex!
BTW – not trying to pick on Xu et al. (2017)… basically almost every paper I looked at in J Neurosci had at least 1 figure showing a statistically significant result with an inadequate sample size. Insufficient sample size is an epidemic in science.