Sizing up behavioral neuroscience – a meta-analysis of the fear-conditioning literature
Inadequate sample sizes are kryptonite to good science–they produce waste, spurious results, and inflated effect sizes. Doing science with an inadequate sample is worse than doing nothing.
In the neurosciences, large-scale surveys of the literature show that inadequate sample sizes are pervasive in some subfields, but not in others (Button et al., 2013; Dumas-Mallet, Button, Boraud, Gonon, & Munafò, 2017; Szucs & Ioannidis, 2017). This means we really need to do a case-by-case analysis of different fields/paradigms/assays.
Along these lines, Carneiro et al. recently published a meta-analysis examining effect sizes and statistical power in research using fear conditioning (Carneiro, Moulin, Macleod, & Amaral, 2018). This is a really important analysis because fear conditioning has been an enormously useful paradigm for studying the nexus between learning, memory, and emotion. Study of this paradigm has shed light on the neural circuitry for processing fear memories and their context, on the fundamental mechanisms of neural plasticity, and more. Fear conditioning also turns out to be a great topic for meta-analysis because even though protocols can vary quite a bit (e.g. mice vs rats!), the measurement is basically always expressed as a change in freezing behavior to the conditioned stimulus.
Carneiro et al. (2018) compiled 122 articles reporting 410 experiments with a fear-conditioning paradigm (yeah…it’s a popular technique, and this represents only the studies reported with enough detail for meta-analysis). They focused not on the basic fear-conditioning effect (which is large, I think), but on contrasts between fear conditioning in control animals relative to treatments thought to impair or enhance learning/memory.
At first glance, there seems to be some good news: typical effects reported seem large. Specifically, the typical “statistically significant” enhancement effect is a 1.5 SD increase in freezing while a typical “statistically significant” impairment effect is a 1.5 SD decrease in freezing. Those are not qualitative effects, but they are what most folks would consider to be fairly large, and thus fairly forgiving in terms of sample-size requirements.
That sounds good, but there is a bit of thinking to do now. Not all studies analyzed had high power, even for these large effect sizes. In fact, typical power was 75% relative to the typical statistically significant effect. That’s not terrible, but not amazing. But that means there is probably at least some effect-size inflation going on here, so we now need to ratchet down our estimate. That’s a bit hard to do, but one way is to take only studies that are well-powered relative to the initial effect-size estimates and then calculate *their* typical effect size. This shrinks the estimated effects by about 20%. But *that* means that typical power is actually only 65%. That’s getting a bit worrisome, both in terms of waste and in terms of effect-size inflation. And it also means you might want to again ratchet down your estimated effect size… yikes. Also, though the authors don’t mention it, you do worry that in this field the very well-powered studies might be the ones where folks used run-and-check to hit a significance target (in my experience talking to folks at posters this is *still* standard operating procedure in most labs).
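To get a feel for why shrinking the effect-size estimate drags power down, here is a rough sketch using the textbook normal approximation for a two-group comparison. This is my own illustration, not the authors' method: the group size of 8 is a made-up placeholder, and 1.2 is just 1.5 shrunk by 20%. The numbers it prints won't match the 75% and 65% figures above, which are typical values across many experiments of different sizes.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Uses the normal approximation to the t-test, which slightly
    overstates power when groups are small.
    d is the true standardized effect (difference in SD units).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # critical value for the test
    shift = d * sqrt(n_per_group / 2)   # expected test statistic under d
    # Probability of landing beyond either critical value
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


n = 8  # hypothetical group size, for illustration only
for d in (1.5, 1.2, 1.0):
    print(f"d = {d:.1f}, n = {n}/group -> power ~ {approx_power(d, n):.2f}")
```

The point is just the direction of travel: hold the group size fixed, shrink the assumed effect, and power falls off quickly.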
Based on these considerations, the authors estimate that you could achieve 80% power for what they estimate to be a plausible effect size with 15 animals/group. That wouldn’t be impossible, but it’s a good bit more demanding than current practice–currently only 12% of published experiments are that size or larger.
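If you want to play with the sample-size side of this yourself, the same normal approximation can be turned around to ask how many animals per group you need for a target power. Again, this is a back-of-the-envelope sketch and not the authors' calculation: the effect sizes below are illustrative assumptions, and the approximation runs slightly low compared with an exact t-test calculation.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate animals per group for a two-sided, two-group comparison.

    Normal approximation: n = 2 * ((z_alpha + z_beta) / d) ** 2,
    rounded up. d is the assumed true effect in SD units.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = z.inv_cdf(power)           # quantile for the target power
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


# Illustrative effect sizes (assumptions, not values from the paper)
for d in (1.5, 1.2, 1.0):
    print(f"d = {d:.1f} -> about {n_per_group(d)} animals per group")
```

Under this approximation, group sizes around 15 are what you need once the plausible effect is closer to 1 SD than to 1.5 SD, though the paper's own figure may rest on different assumptions.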
There are some other interesting observations in this paper–it’s well worth a read if you care about these sorts of things. The take-home I get is: a) this is without a doubt a useful assay that can reveal large enhancement and impairment effects, enabling fruitful mechanistic study, but b) the current literature has at least modestly inadequate sample sizes, meaning that a decent proportion of supposedly inactive treatments are probably active and a decent proportion of supposedly impactful treatments are probably negligible. Even with what is often seen as a pillar of neuroscience, the literature is probably a decent bit dreck. That’s incredibly sad and also incredibly unproductive: trying to piece together how the brain works is hard enough; we don’t need to make it harder by salting in pieces from a different puzzle (not sure if that metaphor really works, but I think you’ll know what I mean). Finally, doing good neuroscience is generally going to cost more in terms of animals, time, and resources than we’ve been willing to admit… but it’s better than doing cheap but unreliable work.