The replication crisis isn’t just about sample size and statistical inference. Another key issue is measurement: the process of turning observations into quantitative statements about our sample. It’s tricky. In many cases we’ve run before we learned to walk, adopting methods of measurement without adequately determining sources of error and bias, without optimizing to maximize the signal, and without standardizing details important for making results reproducible.
Here’s a fascinating paper on this topic, entitled “Rigor and reproducibility in rodent behavioral research” (Gulinello et al., 2018). The paper focuses on a “simple” memory test commonly used in rodents: the novel object recognition task. In this paradigm the rodent is placed in a chamber with 2 identical objects and left to explore them both. Then, after a delay, the rodent is placed in a chamber with 1 of the previous objects and 1 new object. *If* the rodent remembers the old object we expect it to explore it less, as it is something it has previously investigated. Thus, researchers use the preference for the novel object as a quantitative measure of how well the old object is remembered. This paradigm was originally developed in the late 1980s (Ennaceur & Delacour, 1988) and has since become a bread-and-butter technique for many researchers. In fact, these tests are so popular that many universities now have “Rodent Core Facilities” (RCFs) that will just run this assay for any lab that wants it. This paper was written primarily by various RCF directors.
Although the novelty preference test seems very straightforward, this belies some truly staggering complexity. The researcher has to make lots of decisions. Here are just a few that the paper mentions:
- what types of objects to use
- how to score ‘exploration’
- how to compute the novelty preference
- what strain of rodent to use, what age, and what genders
- how the rodents should be raised (in isolation; in colonies, etc.)
- how much the rodents should be habituated to the test chambers and/or human contact
- how much time should be left for initial exploration
- how long the delay should be between sessions
- how much time should be given for the second exploration
- how to control for odors left behind during exploration
- what time of the day and year to run the studies
- when to exclude rodents, especially if one of the experimental manipulations interferes with locomotion, olfaction, vision, etc.
- how to blind the researchers conducting the tests
- how many animals to test
What is truly concerning is that these decisions end up mattering. The novelty preference test is not robust across the various reasonable implementations one could use; results can be strongly influenced by each decision. For example, the paper illustrates effects of researcher training (novices have much lower reliability in scoring), how exploration is defined (automated software can do this, but defines a zone that counts as exploring; the size of the zone influences the results), strain effects, rearing effects (rodents raised in isolation perform worse), delay effects (of course), and age effects, to name but a few. Because each variable matters, the authors state that the test must be calibrated pretty much for each experimental question (that is, if you bring a new strain to test, they’ll have to work out afresh how to get stable scores in the control animals before testing your manipulations).
Of all the factors to consider, analysis strategy is disconcertingly influential: researchers can report raw exploration times in both phases, % preferences, difference scores, or pass/fail rates (based on having a preference score above a certain threshold). One can draw different conclusions from the same data depending on how novelty preference is expressed. Frustratingly, there doesn’t seem to be any one approach that is a priori more reasonable or more sensitive than the others; at least, this paper did not seem to make that argument. I didn’t really understand this. Surely RCF directors have examined which analysis method produces the highest test-retest reliability, the best convergent validity with other memory tests, etc. Why wasn’t this discussed?
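To make the point about analysis strategy concrete, here is a minimal sketch in Python. Everything in it is my own illustration, not taken from the paper: the exploration times are invented, the formulas are just the obvious ways to compute a % preference and a difference score, and the 60% pass threshold is an arbitrary choice. The hypothetical treated group explores much less overall (say, because the manipulation reduces locomotion) but splits its time in a similar proportion.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """Test-phase exploration times for one animal, in seconds (invented values)."""
    novel_s: float
    familiar_s: float


def percent_preference(t: Trial) -> float:
    """Share of total exploration spent on the novel object; 50 means no preference."""
    return 100.0 * t.novel_s / (t.novel_s + t.familiar_s)


def difference_score(t: Trial) -> float:
    """Raw difference in seconds; scales with how much the animal explored overall."""
    return t.novel_s - t.familiar_s


def passes(t: Trial, threshold_pct: float = 60.0) -> bool:
    """Pass/fail call; the 60% cutoff is an arbitrary, illustrative assumption."""
    return percent_preference(t) >= threshold_pct


def summarize(group: list[Trial]) -> dict[str, float]:
    """Group-level summary under each of the three scoring schemes."""
    n = len(group)
    return {
        "percent_preference": sum(percent_preference(t) for t in group) / n,
        "difference_score_s": sum(difference_score(t) for t in group) / n,
        "pass_rate": sum(passes(t) for t in group) / n,
    }


# Hypothetical data: controls explore a lot, treated animals explore little.
control = [Trial(31, 19), Trial(28, 22), Trial(33, 17), Trial(29, 21)]
treated = [Trial(9, 5), Trial(8, 4), Trial(10, 6), Trial(7, 5)]

for name, group in [("control", control), ("treated", treated)]:
    print(name, summarize(group))
```

With these made-up numbers the treated group looks about the same as (even slightly better than) controls on mean % preference, clearly worse on mean difference score, and better on pass rate. Same data, three different stories.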
How could the novelty preference test be so influenced by procedural concerns? It seems to me the main problem is that this assay, even in the best of circumstances, produces weak effects. The paper (frustratingly) does not give a strong sense of this, but states “the effect size can be small, as the average preference score of healthy subjects is typically 60-70%” compared to the mean score for a no-memory control of 50%. I wish more info was given but that seems like a really subtle effect. In fact, that’s rather mind-blowing, given the popularity of the assay. Remember that in most cases researchers would not want to just demonstrate a memory-based preference; they’d want to detect a difference in a treated group of rats compared to controls. So if you have a drug that causes a mild memory impairment, your experimental prediction would be: 50% preference in non-memory controls, 60-70% preference in memory controls, and perhaps 55% in the treated animals in the memory protocol. That’s a rather insanely narrow window, especially given how expensive rodent research is (let alone all the fancy manipulations that are being used on them).
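To get a feel for how narrow that window is, here is a rough back-of-the-envelope power calculation in Python. It uses the textbook normal-approximation formula for comparing two group means. The big caveat: the paper (as far as I could tell) does not report the between-animal standard deviation of preference scores, so the SD values below are pure assumptions on my part, as is taking 65% as the midpoint of the 60-70% range for healthy animals.

```python
import math
from statistics import NormalDist


def animals_per_group(mean_a: float, mean_b: float, sd: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n per group for a two-sided, two-group comparison of means.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / delta)^2.
    A t-test would need slightly more animals than this.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    delta = abs(mean_a - mean_b)
    return math.ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


# Healthy controls at 65% vs. mildly impaired treated animals at 55%.
# The SDs (in percentage points) are ASSUMED, not taken from the paper.
for assumed_sd in (10, 15, 20):
    n = animals_per_group(65, 55, sd=assumed_sd)
    print(f"assumed SD = {assumed_sd} points -> about {n} animals per group")
```

The required group size grows with the square of the noise: doubling the assumed SD roughly quadruples the number of animals needed. So every procedural detail that adds variability gets expensive quickly when the expected difference is only about 10 percentage points.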
After reading this paper, I came away puzzled. The authors offered tips on doing reproducible and rigorous rodent research. But the data they presented seemed to clearly show that this simply is not possible with the novelty preference test: the signal is too small and the noise too difficult to adequately control. Maybe I’m missing something. But notably the researchers presented no specific evidence of strong test-retest reliability for specific manipulations (manipulations which have been validated to impair/enhance memory in direct replications), no evidence of cross-lab reliability, and no evidence for convergent validity (manipulations that influence ). Maybe they felt this isn’t necessary for such a well-established assay. But in the end, this paper convinced me to be very skeptical about novelty preference tests, even though I don’t think the authors were trying to make this point. The first tip for rigorous and reproducible research, it seems to me, is to develop assays that produce large and robust effects. Otherwise, you’re likely to be rigorously and reproducibly chasing noise.