Mega Studies Need Inference by Interval (or: Point Nulls are Especially Pointless when N is Large)
I (Bob) love large-scale collaborative projects (Many Labs, Many Babies, Psych Science Accelerator, etc.). I like the teamwork involved. I like the careful deliberation over study materials. I like the large sample sizes and the hopes of investigating how psychological phenomena vary over labs, languages, and culture. Mega studies aren’t always perfect or definitive (no study is), but they are one of the most useful tools to emerge from the credibility revolution.
And yet mega studies can’t do any good if we simply carry forward poor statistical practices to a larger scale. I’m looking at you, point null hypotheses. With large Ns, testing against a null hypothesis of exactly 0 makes almost exactly 0 sense. It’s no test at all.
Why? First, because with larger sample sizes power increases to detect small effects, including effects that are only trivially different from 0. For example, Psych Science Accelerator is about to launch a new mega-study on Covid-19 that will probably recruit over 20,000 participants. One of the included studies is a simple two-group design. That means the nominal power for the study is reasonable (>80%) for effect sizes all the way down to d = 0.04, and still at 30% for d = .02!
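Those power figures can be checked with a quick normal approximation. This is only a sketch: it assumes the 20,000 participants split evenly into two groups of 10,000 and a two-sided alpha of .05, and the function name is mine, not anything from the PSA analysis plan.

```python
from math import sqrt
from statistics import NormalDist


def two_group_power(d, n_per_group, alpha=0.05):
    """Approximate two-sided power to detect a standardized mean difference d.

    Uses the large-sample normal approximation, where the standard error
    of d is roughly sqrt(2 / n_per_group).
    """
    z = NormalDist()
    se = sqrt(2.0 / n_per_group)        # approximate standard error of d
    z_crit = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    shift = d / se                      # expected z statistic if the true effect is d
    # Probability of landing in either rejection tail
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Assumed design: 10,000 participants per group
    for d in (0.04, 0.02):
        print(f"d = {d:.2f}: power = {two_group_power(d, 10_000):.2f}")
```

Under those assumptions the approximation lands just above 80% for d = 0.04 and near 30% for d = 0.02, in line with the numbers above.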
A second problem is that model mis-specification becomes an increasingly big issue with large sample sizes–even slight deviations from the assumptions of the statistical test could produce spurious statistical significance.
Taken together, this means that statistical tests against a point null are overwhelmed by sample size–with real-world data they will emit statistical significance at an unacceptably high rate. Tests against a point null simply do not provide a stringent test in a mega-study (usually not in small studies either, but that’s a different blog post). “Predicting” a non-zero effect in a mega-study is about as impressive as calling a single coin flip.
What’s the solution? Bayes factors? No–these can also be overwhelmed by sample size (at least as implemented by default in JASP). The solution (regardless of statistical philosophy) is to make more meaningful predictions. Specifically, one should be able to predict a range of plausible/meaningful effect sizes, bounded by the smallest effect size of interest. One can then check if the effect size observed is within this more stringent range of predictions. In practice, this is often done in reverse–by specifying the range of non-meaningful effect sizes (aka region of practical equivalence or ROPE) and then checking if the effect size observed is clearly outside of this range (see Figure 1). Either way, this approach (often called inference by interval) not only provides stringency to the test, it also rounds out the test, providing not only a path to support the hypothesis (the data are only compatible with meaningful effects), but also a path to refute the hypothesis (the data are only compatible with non-meaningful effects), and a path to an indeterminate result (the data are compatible with both meaningful and non-meaningful effects).
What if you don’t have a clear range of effect size predictions or can’t easily specify the smallest effect size of interest? Then it’s probably a bit premature to invest mega-study resources into the research question.
How do you implement inference by interval? If you use estimation, it’s simple. You specify your smallest effect size of interest (thereby defining your interval null) and then plot that interval null against your observed effect size and CI (this can be a confidence interval or a Bayesian credible interval). If the whole CI is outside of the interval null, you have clear evidence of a predicted meaningful effect. If the whole CI is inside the interval null, you have clear evidence the effect is not meaningful. If the CI includes both regions, the test is indeterminate. There are some nuances here about setting the CI width based on your desired error level, but the basic idea is pretty simple.
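Here is a minimal sketch of that three-way decision rule. It assumes a symmetric interval null of plus or minus the smallest effect size of interest and a normal-approximation CI. The function name, the 95% level, and the example value of 0.10 for the smallest effect size of interest are all illustrative choices of mine.

```python
from math import sqrt
from statistics import NormalDist


def classify_interval(estimate, se, sesoi, level=0.95):
    """Compare a normal-approximation CI with the interval null [-sesoi, +sesoi]."""
    half_width = NormalDist().inv_cdf(0.5 + level / 2) * se
    lo, hi = estimate - half_width, estimate + half_width
    if lo > sesoi or hi < -sesoi:
        verdict = "meaningful"        # whole CI outside the interval null
    elif lo > -sesoi and hi < sesoi:
        verdict = "not meaningful"    # whole CI inside the interval null
    else:
        verdict = "indeterminate"     # CI covers both regions
    return (lo, hi), verdict


if __name__ == "__main__":
    se = sqrt(2 / 10_000)  # assumed: two groups of 10,000, effect in d units
    for d_obs in (0.15, 0.09, 0.03):
        (lo, hi), verdict = classify_interval(d_obs, se, sesoi=0.10)
        print(f"d = {d_obs:.2f}, CI [{lo:.3f}, {hi:.3f}] -> {verdict}")
```

Note what happens with an observed d of 0.03 in this hypothetical setup: the CI excludes 0, so a point-null test would call it significant, yet the whole CI sits inside the interval null.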
If you want p values, you can get them–in fact, you can get 4 of them. First you conduct a minimal-effect test to determine if the observed effect differs significantly from the interval null. This generates two p values, and the overall test is significant if either p value is lower than the selected alpha level. Then you conduct an equivalence test to see if your effect is fully inside the interval null. This also generates two p values. You take the maximum of these and then compare it to the selected alpha level. To my mind, this is all a bit complicated relative to just seeing the result as a CI, but if you really need all those p values, go for it. Lakens has tutorials and the original source is here: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1954.tb00169.x
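For the curious, the four p values can be sketched with one-sided z tests. This is a large-sample approximation under a symmetric interval null, not the procedure from any particular package. The function and key names are hypothetical, and how you set alpha for each one-sided test is one of the nuances I'm glossing over.

```python
from statistics import NormalDist


def interval_p_values(estimate, se, sesoi):
    """Four one-sided p values against the interval null [-sesoi, +sesoi]."""
    z = NormalDist()
    z_upper = (estimate - sesoi) / se   # distance from the upper bound
    z_lower = (estimate + sesoi) / se   # distance from the lower bound

    # Minimal-effect test: is the effect beyond either bound?
    p_above_upper = 1 - z.cdf(z_upper)  # H0: effect <= +sesoi
    p_below_lower = z.cdf(z_lower)      # H0: effect >= -sesoi

    # Equivalence test: is the effect inside both bounds?
    p_above_lower = 1 - z.cdf(z_lower)  # H0: effect <= -sesoi
    p_below_upper = z.cdf(z_upper)      # H0: effect >= +sesoi

    return {
        "p_above_upper": p_above_upper,
        "p_below_lower": p_below_lower,
        "p_above_lower": p_above_lower,
        "p_below_upper": p_below_upper,
        # Significant if either minimal-effect p value is below alpha
        "minimal_effect_p": min(p_above_upper, p_below_lower),
        # Significant only if both equivalence p values are below alpha
        "equivalence_p": max(p_above_lower, p_below_upper),
    }


if __name__ == "__main__":
    se = (2 / 10_000) ** 0.5  # assumed: two groups of 10,000
    for d_obs in (0.15, 0.03):
        res = interval_p_values(d_obs, se, sesoi=0.10)
        print(d_obs, res["minimal_effect_p"], res["equivalence_p"])
```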