Update 8 June. Some minor tweaks. Addition of the full reference for two papers mentioned.
Of course I would say that, wouldn’t I?! It’s the basis of ITNS and a new-statistics approach. But the latest issue of SERJ adds a little evidence that, maybe, supports my statement above. The article is titled Conceptual Knowledge of Confidence Intervals in Psychology Undergraduate and Graduate Students (Crooks, et al., 2019) and is on open access here.
Reading the paper prompted mixed feelings:
I’m totally delighted to see this review and some new empirical work on the vital topic of how students, at least, understand CIs, and how teaching might be improved.
Most of the previous research that is discussed is quite old, much of it 10 years or more, so dating from well before Open Science and the latest moves to ditch statistical significance. We need more research on statistical cognition and we need it now! Especially on estimation and CIs!
Many of the papers from my research group are mentioned, but one fairly recent one was missed. Read about that one here (Kalinowski et al., 2018).
The reported study is welcome, but the authors acknowledge that it is small (N=21 undergraduates, 19 graduate students). The undergrads had all experienced the same statistics course, and the grad students ditto.
Therefore any conclusion can be only tentative. The authors concluded that, alas, general understanding of CIs was mediocre. In most respects, grad students did a bit better than undergrads (Phew!). Mentioning estimation tended to go with better appreciation of CIs; there was a hint that, perhaps, mentioning NHST went with less good understanding of CIs. But, as I mentioned, the samples were small and hardly representative.
The authors suggest, therefore, that teaching CIs along with an estimation approach is likely to be more successful than via NHST. I totally agree, even if only a little extra evidence is reported in this paper.
I’m delighted there are researchers studying CIs, estimation, teaching, and understanding. We do so need more and better evidence in this space. I look forward to some future version of ITNS being more strongly and completely evidence-based than we have been able to make the first ITNS. I’m confident that our approach will be strongly supported, no doubt with refinements and improvements. Bring that on! Meanwhile, enjoy using ITNS.
Crooks, N. M., Bartel, A. N., & Alibali, M. W. (2019). Conceptual knowledge of confidence intervals in psychology undergraduate and graduate students. Statistics Education Research Journal, 18, 46-62.
Kalinowski, P., Lai, J., & Cumming, G. (2018). A cross-sectional analysis of students’ intuitions when interpreting CIs. Frontiers in Psychology, 16 February. doi: 10.3389/fpsyg.2018.00112
For years I’ve been working on changing my thinking–even when just musing about nothing in particular–from “I wonder whether…” to “I wonder to what extent…”. It has taken a while, but now I usually do find myself thinking in terms of “How big is the effect of…?” rather than “Is there an effect of…?”
I worked on making that change, despite decades of immersion in NHST, because I’ve long felt that overcoming dichotomous thinking has to be at the heart of improving how we do statistics. No more mere black-white, sig-nonsig categorisation of findings!
“Why does dichotomous thinking persist? One reason may be an inherent preference for certainty. Evolutionary biologist Richard Dawkins (2004) argues that humans often seek the reassurance of an either-or classification. He calls this ‘the tyranny of the discontinuous mind’. Computer scientist and philosopher Kees van Deemter (2010) refers to the ‘false clarity’ of a definite decision or classification that humans clutch at, even when the situation is uncertain. To adopt the new statistics we may need to overcome an inbuilt preference for certainty, but our reward could be a better appreciation of the uncertainty inherent in our data.” (pp. 8-9)
Now there’s an article (Fisher & Keil, 2018) reporting evidence that such a binary bias may indeed be widespread:
It’s behind a paywall, unfortunately, but here’s the abstract:
One of the mind’s most fundamental tasks is interpreting incoming data and weighing the value of new evidence. Across a wide variety of contexts, we show that when summarizing evidence, people exhibit a binary bias: a tendency to impose categorical distinctions on continuous data. Evidence is compressed into discrete bins, and the difference between categories forms the summary judgment. The binary bias distorts belief formation—such that when people aggregate conflicting scientific reports, they attend to valence and inaccurately weight the extremity of the evidence. The same effect occurs when people interpret popular forms of data visualization, and it cannot be explained by other statistical features of the stimuli. This effect is not confined to explicit statistical estimates; it also influences how people use data to make health, financial, and public-policy decisions. These studies (N = 1,851) support a new framework for understanding information integration across a wide variety of contexts.
Fisher and Keil reported multiple studies using a variety of tasks. All participants were from the U.S. and recruited via Mechanical Turk. Therefore, an unknown proportion, but perhaps a low proportion, would have had some familiarity with NHST, so the evidence for a binary bias probably does not reflect an influence of NHST–in accord with the authors’ claim that their results support a binary bias as a general human characteristic.
Updated 30 May 2019. A few tweaks to the text below. I have now had a chance to have a squiz at the book itself. Bob has also seen the book and given a quick opinion. I am confirmed in my view below that, alas, the book is not a good intro textbook for what we think of as the new statistics.
I was excited to read about a book titled The new statistics with R: An introduction for biologists (Hector, 2015). I was reading the book reviews in Significance magazine. (It’s a great magazine. Pity about the title!)
But, very unfortunately, the brief review was not positive, describing the book as “scattergun and superficial” and as “containing typos and errors that may mislead”, and saying that it “misses the mark”. The review is behind a paywall, but the info about it is here. (It, like all the magazine’s contents, will become open access 12 months after publication.)
I confess that I haven’t seen the book itself, but you can read some pages here on Google books, or here on Amazon (click ‘look inside’). There are a few reader reviews on Amazon, which can be summarised, politely, as ‘mixed’.
Of course, I was eager to know what the author meant by ‘the new statistics’. Here’s the answer, on p. 5:
The last sentence looked promising. Also on p. 5, the author writes “I have tried to take an estimation-based approach that focuses on estimates and confidence intervals wherever possible. … I have also tried to emphasize the use of graphs for exploring data and presenting results. I have tried to encourage the use of a priori contrasts…” Excellent!
From the Contents, I could see that in Chapter 3 (Comparing groups. The Student’s t-test) there is a section titled Confidence intervals, likewise in Chapter 4 (Linear Regression). Chapter 5 is titled Comparisons Using Estimates and Intervals. There seems to be no mention of meta-analysis.
Then the book quickly moves on to analysis of covariance, maximum likelihood, the general linear model, mixed-effects models, and even the general linear mixed-effects model.
As I say, I haven’t read the book itself. With what I can see, I’m not hopeful that it’s a good intro textbook, with a focus on estimation and Open Science, suitable for psychology and other social and behavioural science beginning students. Please let me know if you have used it in teaching. Thanks.
Hector, A. (2015). The new statistics with R: An introduction for biologists. Oxford: OUP.
In psychology, there are a few studies so famous and influential that they have proper names: The Good Samaritan Study, the Milgram Obedience Study, the Marshmallow test, etc, etc.
Approaching this echelon is the “Cookie Monster Study”, an increasingly-famous study of social power. If you don’t already know it, here’s a quick summary:
Ward and Keltner (1998) examined whether power would produce socially inappropriate styles of eating. In same-sex groups of 3 individuals, 1 randomly chosen individual (the high-power person) was given the role of assigning experimental points to the other 2 on the basis of their contributions to written policy recommendations concerning contentious social issues. After group members discussed a long and rather tedious list of social issues for 30 min, the experimenter arrived with a plate of five cookies. This procedure allowed each participant to take one cookie and provided an opportunity for at least 1 participant to comfortably take a second cookie, thus leaving one cookie on the plate. Consistent with the prediction, high-power individuals were more likely to take a second cookie (see Figure 6). Coding of the videotaped interactions also revealed that high-power individuals were more likely to chew with their mouths open and to get crumbs on their faces and on the table. Male participants ate in more disinhibited ways as well, lending further support to our power-based hypothesis, to the extent that gender is equated with power. (Keltner, Gruenfeld, & Anderson, 2003) (p. 277)
This study has become influential in the public sphere. It has been covered extensively in the national news media for over a decade (here and here and here and here and of course by NPR here and here and here…and, well, lots more places). There’s a cute YouTube video explaining the study which has been viewed over 100,000 times. It’s been lifted up into life advice, via a commencement address that went viral by Moneyball author Michael Lewis (“Don’t Eat Fortune’s Cookie“, delivered at Princeton in 2012). Finally, one of the authors of the study, Dacher Keltner, has written and discussed the study extensively, penning both thought-provoking essays and a pop-science book which feature the Cookie-Monster Study prominently.
It’s easy to appreciate the appeal of this study–it neatly encapsulates and supports our worry that power changes us, shaping us into less moral and sensitive creatures. As Shankar Vedantam breathlessly put it during a recent NPR interview with Keltner:
So it’s fascinating because, of course, what you see in these lab experiments is often reflected on much, much bigger stages, where you see people in power abusing that power – you know, having affairs, cheating and, you know, falsifying financial returns. And, you know, at one level, the conventional view, I think, is sort of to say, these are just people who were bad people who rose to the top. But what you’re suggesting is actually something more complex and, in some ways, much sadder, which is that these might not be bad people who rose to the top, but these might be good people who rose to the top, and power has made them bad. — Shankar Vedantam, NPR
Given the intrinsic interest in the Cookie Monster Study, it is surprising to find that it hasn’t been extremely influential in the academic world: Google Scholar says it’s only been cited 26 times. Why is that? Because the original study is an unpublished manuscript… so there’s nothing to cite. The study was introduced to the world via a 2003 review paper (quoted above) by Keltner and colleagues (Keltner et al., 2003). The review summarized the study and provided a summary figure, but cited only an unpublished manuscript:
Ward, G., & Keltner, D. (1998). Power and the consumption of resources. Unpublished manuscript.
Ok, but as the study became more famous surely journals would have been lining up to publish it. Why is it still unpublished? Well, according to Keltner when he moved from the University of Wisconsin to UC Berkeley he lost contact with the lead author (Ward), and with him all access to the data, methods, and materials, leaving him unsure about even the original sample sizes. A screen-shot of his email to me (Bob) about this is below. According to Wikipedia, Keltner moved to UC in 1996, so the data has long been gone and unavailable.
That’s strange, right? That means the Cookie-Monster Study isn’t an influential paper… it’s an influential memory of a study, which 20 years after the fact is still providing subject matter for books, essays, and breathless popular-press coverage.
And then it gets even more strange. The review paper by Keltner and colleagues that introduced the Cookie-Monster study to the world has this figure summarizing the results:
Look carefully… notice that men ate slightly fewer cookies in the high-power condition? If there is an effect, it would have been for women only. So the figure matches neither the summary given in the review paper nor the interviews and summaries given since then by Keltner. Apparently none of the review paper’s authors noticed this, nor did the reviewers. The review itself has been cited over 2700 times, and I can’t see anyone mentioning the discrepancy. When I pointed out the error to Keltner, he still didn’t seem to notice, stating that the effect was “more pronounced in women” but “observed for both men and women”. But that’s not in the least what the published figure shows. So not only is the Cookie-Monster study based on a memory, it is likely based on a faulty memory. There are other possible indications of this. For example, Keltner seems to be inconsistent about whether there were 4 or 5 cookies served to the participants. In the YouTube video about the study he specifically mentions 4 cookies, whereas in other venues he insists that pilot testing showed 5 cookies were needed to license the powerful person to take another cookie. Maybe this is just a detail lost to editing in these pieces, but the fact remains that there’s no way left to check which account is correct.
None of this means that the Cookie Monster study is wrong–just that the sum of its influence is based on one researcher’s potentially faulty memory of what one of his students did 20+ years ago. The evidentiary value, at this point, is probably nil. It might not be the most suitable study to lift up for public discussion. When it is discussed, it would probably be best to make it clear that it is a recollection of data long since lost.
Interestingly, the effect could potentially be reliable. A 2008 study attempted to replicate Ward and Keltner. They used dyads with a confederate, so there are some concerns about demand effects. Yet they did find that those assigned to a high-power situation ate more cookies (Smith, Jost, & Vijay, 2008). It’d be great to do an RRR on this.
Keltner, D., Gruenfeld, D. H., & Anderson, C. (2003). Power, approach, and inhibition. Psychological Review, 265–284. doi: 10.1037/0033-295x.110.2.265
Smith, P. K., Jost, J. T., & Vijay, R. (2008). Legitimacy Crisis? Behavioral Approach and Inhibition When Power Differences are Left Unexplained. Social Justice Research, 358–376. doi: 10.1007/s11211-008-0077-9
Updated 5/21: fixed a typo, added a section on when p > .05 demonstrates a negligible effect, and added a figure at the end.
Daniel Lakens recently posted a pre-print with a rousing defense of the much-maligned p-value:
In essence, the problems we have with how p-values are used is a human factors problem. The challenge is to get researchers to improve the way they work. (Lakens, 2019) (p.8)
I agree! In fact, Geoff and I just wrote nearly the same thing:
The fault, we argue, is not in p-values, but in ourselves. Human cognition is subject to many biases and shortcomings. We believe the NHST approach exacerbates some of these failings, making it more likely that researchers will make overconfident and biased conclusions from data. (Calin-Jageman & Cumming, 2019) (p.271)
So we agree that p values are misused… what should we do about it? Lakens states that the solution will be creating educational materials that help researchers avoid misunderstanding p values:
If anyone seriously believes the misunderstanding of p-values lies at the heart of reproducibility issues in science, why are we not investing more effort to make sure misunderstandings of p-values are resolved before young scholars perform their first research project? Where is the innovative coordinated effort to create world class educational materials that can freely be used in statistical training to prevent such misunderstandings? (p.8-9)
But Lakens seems to think this is all work to be done, and that no one else is putting serious effort towards remedying the misinterpretation of p values. That’s just not the case! The “New Statistics” is an effort to teach statistics in a way that will eliminate misinterpretation of p values. Geoff has put enormous effort into software, visualizations, simulations, tutorials, papers, youtube videos and more–a whole eco-system that can help researchers better understand what a p-value does and does not tell them.
Let’s take a look.
Non-significant does not always mean no effect. Probably the most common abuse of p values is concluding that p > .05 means “no effect”. How can you avoid this problem? One way is to reflect on the effect size and its uncertainty and to consider the range of effects that remain compatible with the data. This can be especially easy if we use an estimation plot to draw attention specifically to the effect size and uncertainty.
Here’s a conventionally non-significant finding (p = .08), but one look at the CI and it’s pretty easy to understand that this does not mean that there is no effect. In fact, when this experiment was repeated across many different labs there was a small but reliable effect. (This is data from one site of the Many Labs 2 replication of the “tempting fate” experiment; Klein et al., 2018; Risen & Gilovich, 2008.)
It is possible, though, to show an effect is negligible. So non-significant does not always rule out an effect, but sometimes it can show an effect is negligible. How can you tell if you have an informative non-significant result or a non-informative one? Check out the effect size and confidence interval. If the confidence interval is narrow, all within a range that we’d agree has no practical significance, then that non-significant result indicates the effect is negligible. If the CI is wide (as in the tempting fate example above), then there is a range of compatible effect sizes remaining and we can’t yet conclude that the effect is negligible.
For example, in Many Labs 2 the font disfluency effect was tested–this is the idea that a hard-to-read font can activate analytic System 2 style thinking and therefore improve performance on tests of analytic thinking. Whereas the original study found a large improvement in syllogistic reasoning (d = 0.64), the replications found d = -0.03 95% CI [-0.08, 0.01], p = .43 (Klein et al., 2018). This is non-significant but that doesn’t mean we just throw up our hands and discount it as meaningless. The CI is narrow and all within a range of very small effects. Most of the CI is in the negative direction… exactly what you might expect for a hard-to-read font. But we can’t rule out the positive effect that was predicted–effect sizes around d = 0.01 remain compatible with the data (and probably a bit beyond–remember, the selection of a 95% CI and therefore its boundary is arbitrary). Still, it’s hard to imagine effect sizes in this range being meaningful or practical for further study. So, yes – sometimes p > .05 accompanies a result that is really meaningful and which fairly clearly establishes a negligible effect. The way to tell when a non-significant result is or is not very informative is to check the CI. You can also do a formal test for a negligible effect, called an equivalence test. Lakens has great materials on how to do this–though I think it helps a lot to see and understand this “by eye” first.
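The "by eye" check described above can be written down as a tiny rule. This is only a sketch: the function name and the negligible bound of d = 0.10 are assumptions chosen for illustration (what counts as negligible is a domain judgement), the second interval is made up, and this is not a formal equivalence test.

```python
def classify_interval(ci_low, ci_high, negligible_bound):
    """Compare a CI with a symmetric 'negligible' zone [-bound, +bound].

    negligible_bound is a judgement call, not something the data supply.
    """
    if -negligible_bound < ci_low and ci_high < negligible_bound:
        # The whole interval sits inside the zone we agreed is too small to matter.
        return "negligible"
    if ci_low > negligible_bound or ci_high < -negligible_bound:
        # The whole interval sits outside the zone, on one side.
        return "meaningful"
    # The interval straddles a boundary: the data cannot yet tell us which.
    return "inconclusive"


# Font disfluency replication (CI quoted in the text), with an assumed bound of 0.10.
disfluency = classify_interval(-0.08, 0.01, negligible_bound=0.10)

# A hypothetical wide interval, standing in for an uninformative result.
wide = classify_interval(-0.05, 0.60, negligible_bound=0.10)

print(disfluency, wide)
```

With these inputs the narrow interval is classified as negligible and the wide one as inconclusive, which is the distinction the paragraph draws between informative and non-informative non-significant results.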
Significant does not mean meaningful. Another common misconception is the conflation of statistical and practical significance. How can you avoid this error? After obtaining the p-value, graph the effect size and CI. Then reflect on the meaningfulness of the range of compatible effect sizes. This is even more useful if you specify in advance what you consider to be a meaningful effect size relative to your theory.
Here’s an example from a meta-analysis I recently published. The graph shows the estimated standardized effect size (and 95% CI) for the effect of red on perceived attractiveness for female raters. In numbers, the estimate is: d = 0.13 95% CI [0.01, 0.25], p = .03, N = 2,739. So this finding is statistically significant, but we then need to consider if it is meaningful. To do that, we should countenance the whole range of the CI (and probably beyond as well)… what would it mean if the real effect is d = .01? d = .25? For d = .01 that would pretty clearly be a non-meaningful finding, far too small to be perceptually noticeable for anyone wearing red. For d = .25…well, perhaps a modest edge is all you need when you’re being scanned by hundreds on Tinder. Interpreting what is and is not meaningful is difficult and takes domain-specific knowledge. What is clear is that looking at the effect size and CI helps focus attention on practical significance as a question distinct from statistical significance.
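To see how the reported p value and CI relate, here is a rough sketch that recovers the standard error and p value from the estimate and interval quoted above. It assumes a symmetric interval built from a normal approximation, which may not be exactly how the meta-analysis computed its interval; the function name is made up for this example.

```python
from statistics import NormalDist


def p_from_ci(estimate, ci_low, ci_high, level=0.95):
    """Back out SE, z, and a two-sided p value from an estimate and its CI.

    Assumes the CI is estimate +/- z_crit * SE (normal approximation).
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    se = (ci_high - ci_low) / (2 * z_crit)           # half-width divided by z_crit
    z = estimate / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return se, z, p


se, z, p = p_from_ci(0.13, 0.01, 0.25)
print(f"SE = {se:.3f}, z = {z:.2f}, p = {p:.3f}")
```

Under these assumptions the recovered p value agrees with the reported p = .03. The point is that the CI already contains what the p value tells you, plus the range of compatible effect sizes that the p value leaves out.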
Nominal differences in statistical significance are not valid for judging an interaction. Another alluring pitfall of p-values is assuming that nominal differences in statistical significance are meaningful. That is, it’s easy to assume that if condition A is statistically significant and condition B is not, that A and B must themselves be different. Unfortunately, the transitive property does not apply to statistical significance. As Gelman and many others have pointed out, the difference between statistically significant and not significant is not, itself, statistically significant (Gelman & Stern, 2006; Nieuwenhuis, Forstmann, & Wagenmakers, 2011). That’s quite a mantra to remember, but one easy way to avoid this problem is to look at an estimation plot of the effect sizes and confidence intervals. You can see the evidence for an interaction ‘by eye’.
Here’s an example from a famous paper by Kosfeld et al. where they found that intranasal oxytocin increases human trust in a trust-related game but not in a similar game involving only risk (Kosfeld, Heinrichs, Zak, Fischbacher, & Fehr, 2005). As shown in A, the effect of oxytocin was statistically significant for the trust game, but not for the risk game. The paper concluded an effect specific to trust contexts, was published in Nature and has been cited over 3,000 times… apparently without anyone noticing that a formal test for a context interaction was not reported and would not have been statistically significant. In B, we see the estimation plots from both contexts. You can immediately see that there is huge overlap in the compatible effect sizes in both contexts, so the evidence for an interaction is weak (running the interaction gives p = .23). That’s some good human factors work–letting our eyes immediately tell us what would be hard to discern from the group-level p values alone.
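The logic can be sketched with made-up numbers (these are not the Kosfeld et al. data). Assume two independent effects, each with its own standard error, and a normal approximation; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(estimate, se):
    """Two-sided p value for estimate / se under a normal approximation."""
    return 2 * (1 - NormalDist().cdf(abs(estimate / se)))


def difference_between_effects(est_a, se_a, est_b, se_b, level=0.95):
    """Estimate, CI, and p value for (effect A - effect B), assuming independence."""
    diff = est_a - est_b
    se_diff = sqrt(se_a ** 2 + se_b ** 2)   # variances add for independent estimates
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    ci = (diff - z_crit * se_diff, diff + z_crit * se_diff)
    return diff, ci, two_sided_p(diff, se_diff)


# Illustrative values only: effect A = 25 (SE 10), effect B = 10 (SE 10).
p_a = two_sided_p(25, 10)   # "significant" on its own
p_b = two_sided_p(10, 10)   # "non-significant" on its own
diff, ci, p_diff = difference_between_effects(25, 10, 10, 10)

print(f"p_A = {p_a:.3f}, p_B = {p_b:.3f}")
print(f"A - B = {diff}, 95% CI [{ci[0]:.1f}, {ci[1]:.1f}], p = {p_diff:.3f}")
```

Here A clears the .05 line and B does not, yet the CI on the difference comfortably includes zero. The question about an interaction is answered by the difference and its interval, not by comparing the two significance labels.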
If I have obtained statistical significance I have collected enough data. To my mind this is the most pernicious misconception about p-values–that obtaining significance proves you have collected enough data. I hear this a lot in neuroscience. This misconception causes people to think that run-and-check is the right way to obtain a sample size, and it also licenses uncritical copying of previous sample sizes. Overall, I believe this misconception is responsible for the pervasive use of inadequate sample sizes in the life and health sciences. This, in combination with publication bias, can easily produce whole literatures that are nothing more than noise ( https://www.theatlantic.com/science/archive/2019/05/waste-1000-studies/589684/ ). Yikes.
How can you avoid this error? After obtaining your p-value, look at the effect size and an expression of uncertainty. Look at the range of compatible results–if the range is exceedingly long relative to your theoretical interests then your result is uninformative and you have not collected enough data. You actually can use run-and-check to keep checking on your margin of error and running until it is sufficiently narrow to enable a reasonable judgement from your data. How cool is that?
Here’s an example. This is data from a paper in Nature Neuroscience that found a statistically significant effect of caffeine on memory (Borota et al., 2014). Was this really enough data? Well, the CI is very long relative to the theoretical interests–it stretches from implausibly large impacts on memory to vanishingly small impacts on memory (in terms of % improvement the finding is: 31% 95% CI [0.2%, 62%]). So the data establishes a range of compatible effect sizes that is somewhere between 0 and ludicrous. That’s not very specific/informative, and that’s a clear sign you need more data. I corresponded with this lab extensively about this… but in all their subsequent work examining effects of different interventions on memory they’ve continued to use sample sizes that provide very little information about the true effect. Somehow, though, they’ve yet to publish a negative result (that I know of). Some people have all the luck.
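Run-and-check on the margin of error might look something like the following sketch. Everything here is an assumption for illustration: the function name, the simulated population (mean 0.3, SD 1), the target margin of error of 0.2, the batch size, and the use of a z-based rather than t-based margin.

```python
import random
from statistics import NormalDist, mean, stdev


def run_until_precise(draw, target_moe, start_n=10, step=10,
                      max_n=10_000, level=0.95):
    """Keep sampling in batches until the CI half-width is <= target_moe.

    draw is any zero-argument function returning one new observation.
    Stops at max_n regardless, so the loop always terminates.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    data = [draw() for _ in range(start_n)]
    while True:
        moe = z_crit * stdev(data) / len(data) ** 0.5   # approximate margin of error
        if moe <= target_moe or len(data) >= max_n:
            return len(data), mean(data), moe
        data.extend(draw() for _ in range(step))


rng = random.Random(1)   # seeded so the example is repeatable
n, estimate, moe = run_until_precise(lambda: rng.gauss(0.3, 1.0), target_moe=0.2)
print(f"stopped at n = {n}, estimate = {estimate:.2f}, margin of error = {moe:.2f}")
```

The stopping rule looks only at how wide the interval is, never at whether it excludes zero, which is what separates this from the run-and-check on p values criticised above.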
p-values do not provide good guidance of what to expect in the future. If you find a statistically significant finding, you might expect that someone else repeating the same experiment with the same sample size will also find a statistically-significant finding (or at least be highly likely to). Unfortunately, this is wrong–even if your original study had 0 sampling error (the sample means were exactly what they are in the population) the odds of a statistically-significant replication may not be very high. Specifically, if your original study has a sample size that is uninformative (aka low powered), odds of replication might be stunningly small. Still, this is a seductive illusion that even well-trained statisticians fall into (Gigerenzer, 2018).
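The zero-sampling-error scenario can be worked out directly. The sketch below assumes a two-sided z test, a replication with the same sample size, and a true effect exactly equal to the originally observed one; the function name is made up for this example.

```python
from statistics import NormalDist


def replication_power(original_p, alpha=0.05):
    """Chance that a same-sized replication reaches p < alpha, if the true
    effect equals the originally observed effect (two-sided z test)."""
    nd = NormalDist()
    z_obs = nd.inv_cdf(1 - original_p / 2)   # z score implied by the original p
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # Significant in the original direction, plus the tiny chance of being
    # significant in the opposite direction.
    return (1 - nd.cdf(z_crit - z_obs)) + nd.cdf(-z_crit - z_obs)


for original_p in (0.05, 0.01, 0.001):
    print(original_p, round(replication_power(original_p), 2))
```

An original result sitting right at p = .05 gives a replication only about a coin-flip chance of reaching significance, even in this best case where the original estimate was exactly right.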
How can you avoid this error? Well, one way is to work through simulations to get a better sense of the sampling distribution of p under different scenarios. Geoff has great simulations to explore; here’s a video of him walking through one of these: https://www.youtube.com/watch?v=OcJImS16jR4. Another strategy is to consider the effect size and a measure of uncertainty, because these actually do provide you with a better sense of what would happen in future replications. Specifically, CIs provide prediction intervals. For a 95% CI for an original finding, there is ~83% chance that a same-sized replication result will fall within the original CI (Cumming, 2008). Why only 83%? Because both the original and the replication study are subject to sampling error. Still, this is useful information about what to expect in a replication study–it can help your lab think critically about whether you have a protocol that is likely to be reliable and can properly calibrate your sense of surprise over a series of replications. To my mind, having accurate expectations about the future is critical for fruitful science, so it is well worth some effort to train yourself out of the misconception that p values alone can give you much information about the future.
If you want to use a p value well, always report and interpret its corresponding effect size and CI. If you’re following along at home, you’ve probably noticed a pattern: to avoid misinterpretation of p values, reflect critically on the effect size and CI that accompany it. If you do this, it will be much harder to go wrong. If you keep at it for a while, you may begin to notice that the p value step isn’t really doing much for you. I mean, it does tell you the probability of obtaining your data under a specific null hypothesis. But you could also plot the expected distribution of sampling error and get that sense for all null hypotheses, which is pretty useful. And you may be interested in Lakens’s excellent suggestions to not always use a point null and to use equivalence testing. But both of these are applied by considering the effect size and its CI. So you may find that p values become kind of vestigial to the science you’re doing. Or not. Whatever. I’m not going to fall into the “Statistician’s Fallacy” of trying to tell you what you want to know. But I can tell you this: If p values tell you something you want to know, then the best way to avoid some of the cognitive traps surrounding p values is to carefully consider effect sizes and CIs.
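Where the ~83% comes from can be sketched in a few lines. This assumes a known population SD, equal sample sizes, and normally distributed sample means; the simulation draws sample means directly rather than raw data, a shortcut that is only valid under those assumptions. The function names are made up for this example.

```python
import random
from math import sqrt
from statistics import NormalDist


def capture_probability(level=0.95):
    """Chance a same-sized replication mean lands inside the original CI.

    The gap between two independent sample means has SD sqrt(2) * SE, while
    the original CI reaches only z_crit * SE either side of the original mean.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(0.5 + level / 2)
    return 2 * nd.cdf(z_crit / sqrt(2)) - 1


def simulate_capture(n=30, reps=20_000, seed=0, level=0.95):
    """Monte Carlo check of the same quantity (population mean 0, SD 1)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    se = 1 / sqrt(n)
    hits = 0
    for _ in range(reps):
        original_mean = rng.gauss(0, se)      # sample mean of the original study
        replication_mean = rng.gauss(0, se)   # sample mean of the replication
        if abs(replication_mean - original_mean) <= z_crit * se:
            hits += 1
    return hits / reps


print(round(capture_probability(), 3))   # analytic value
print(round(simulate_capture(), 3))      # simulated value, close to the analytic one
```

The analytic value comes out at roughly 0.83, in line with the figure from Cumming (2008): the replication is aiming at a target (the original mean) that is itself off by sampling error, so the interval catches fewer than 95% of replications.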
There’s even more. This short tour is just a taste of work Geoff and I (mostly Geoff) have done to try to provide tools and tutorials to help people avoid misinterpretation of p values. We have an undergrad textbook that helps students learn what p values are and are not (we include several “red flags” to consider every time p values are reported). And there’s ESCI, which will give you p values if you need them, but always along with helpful visualizations and outputs that we think will nudge you away from erroneous interpretation. And pedagogically, I’m convinced that teaching estimation first is the best way to teach statistics–it is more intuitive, it is more fun, and it helps students develop a much stronger sense of what a p value is (and is not). With all due respect to the wonderful memorization drills Lakens recounts in his pre-print, chanting mantras about p values is not going to avoid mindless statistics–but a teaching approach that takes a correct developmental sequence could.
So how do we disagree? So, I agree with Lakens that the problems of p values are in how they are used, and I agree with Lakens that we need a whole ecosystem to help people avoid using them incorrectly. Where we have a puzzling disagreement then, is that I see that work as largely complete and ready for use. That is, I see the “New Statistics” specifically as a way of teaching and thinking about statistics that will help make the errors common to p values more rare (but will probably make other types of errors more common; no system is perfect). Is the disagreement all down to labels and/or misconceptions? Stay tuned.
Oh and one other thing. Lakens complains that SPSS does not report an effect size for a t-test. But it does! At least as long as I’ve been using it, it provides an effect size and CI. It does not provide standardized effect sizes, but raw score? You bet.
Borota, D., Murray, E., Keceli, G., Chang, A., Watabe, J. M., Ly, M., … Yassa, M. A. (2014). Post-study caffeine administration enhances memory consolidation in humans. Nature Neuroscience, 201–203. doi: 10.1038/nn.3623
Calin-Jageman, R. J., & Cumming, G. (2019). The New Statistics for Better Science: Ask How Much, How Uncertain, and What Else Is Known. The American Statistician, 271–280. doi: 10.1080/00031305.2018.1518266
Cumming, G. (2008). Replication and p Intervals: p Values Predict the Future Only Vaguely, but Confidence Intervals Do Much Better. Perspectives on Psychological Science, 286–300. doi: 10.1111/j.1745-6924.2008.00079.x
Gelman, A., & Stern, H. (2006). The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant. The American Statistician, 328–331. doi: 10.1198/000313006X152649
Gigerenzer, G. (2018). Statistical Rituals: The Replication Delusion and How We Got There. Advances in Methods and Practices in Psychological Science, 198–218. doi: 10.1177/2515245918771329
Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams, R. B., Jr., Alper, S., … Nosek, B. A. (2018). Many Labs 2: Investigating Variation in Replicability Across Samples and Settings. Advances in Methods and Practices in Psychological Science, 443–490. doi: 10.1177/2515245918810225
Kosfeld, M., Heinrichs, M., Zak, P. J., Fischbacher, U., & Fehr, E. (2005). Oxytocin increases trust in humans. Nature, 673–676. doi: 10.1038/nature03701
Lakens, D. (2019). The practical alternative to the p-value is the correctly used p-value. doi: 10.31234/osf.io/shm8v
Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: a problem of significance. Nature Neuroscience, 1105–1107. doi: 10.1038/nn.2886
Risen, J. L., & Gilovich, T. (2008). Why people are reluctant to tempt fate. Journal of Personality and Social Psychology, 293–307. doi: 10.1037/0022-3514.95.2.293
Yay! I salute the editors and everyone else who toiled for more than a year to create this wonderful collection of TAS articles. Yes, let’s move to a “post p<.05 world” as quickly as we can.
Much to applaud
Numerous good practices are identified in multiple articles. For example:
Recognise that there’s much more to statistical analysis and interpretation of results than any mere inference calculation.
Promote Open Science practices across many disciplines.
Move beyond dichotomous decision making, whatever ‘bright line’ criterion it’s based on.
Examine the assumptions of any statistical model, and be aware of limitations.
Work to change journal policies, incentive structures for researchers, and statistical education.
And much, much more. I won’t belabour all this valuable advice.
The editorial is a good summary. I said that in this blog post. The slogan is ATOM: “Accept uncertainty, and be Thoughtful, Open, and Modest.” Scan the second part of the editorial for a brief dot point summary of each of the 43 articles.
Estimation to the fore. I still think our (Bob’s) article sets out the most-likely-achievable and practical way forward, based on the new statistics (estimation and meta-analysis) and Open Science. Anderson provides another good discussion of estimation, with a succinct explanation of confidence intervals (CIs) and their interpretation.
What’s (largely) missing: Bayes and bootstrapping
The core issue, imho, is moving beyond dichotomous decision making to estimation. Bob and I, and many others, advocate CI approaches, but Bayesian estimation and bootstrapping are also valuable techniques, likely to become more widely used. It’s a great shame these are not strongly represented.
There are articles that advocate a role for Bayesian ideas, but I can’t see any article that focuses on explaining and advocating the Bayesian new statistics, based on credible intervals. The closest is probably Ruberg et al., but their discussion is complicated and technical, and focussed specifically on decision making for drug approval.
I suspect Bayesian estimation is likely to prove an effective and widely-applicable way to move beyond NHST. In my view, the main limitation at the moment is the lack of good materials and tools, especially for introducing the techniques to beginners. Advocacy and a beginners’ guide would have been a valuable addition to the TAS collection.
Bootstrapping to generate interval estimates can avoid some assumptions, and thus increase robustness and expand the scope of estimation. An article focussing on explaining and advocating bootstrapping for estimation would have been another valuable addition.
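To make the idea concrete, here is a minimal sketch of a percentile bootstrap interval for a mean, using only the Python standard library. The function name `bootstrap_ci`, the number of resamples, and the data values are all made up for illustration; the percentile method is just one of several bootstrap interval methods, and not necessarily the best choice for small samples.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=5000, level=0.95, seed=1):
    """Percentile bootstrap interval for stat(data) (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    n = len(data)
    # Resample the data with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    # Cut off an equal share of the sorted estimates in each tail.
    tail = round((1 - level) / 2 * n_resamples)
    return estimates[tail], estimates[n_resamples - 1 - tail]


# Hypothetical sample, invented for this example only.
sample = [12.1, 9.8, 11.4, 10.9, 13.2, 8.7, 10.1, 12.6, 11.0, 9.5]
low, high = bootstrap_ci(sample)
print(f"mean = {statistics.mean(sample):.2f}, 95% bootstrap CI [{low:.2f}, {high:.2f}]")
```

Passing `stat=statistics.median` gives an interval for the median in the same way, which hints at how bootstrapping can expand the scope of estimation beyond cases with a convenient formula.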
The big delusion: Neo-p approaches
I and many others have long argued that we should simply not use NHST or p values at all. Or should use them only in rare situations where they are necessary—if these ever occur. For me, the biggest disappointment with the TAS collection is that a considerable number of articles present some version of the following argument: “Yes, there are problems with p values as they have been used, but what we should do is:
Use .005 rather than .05 as the criterion for statistical significance, or
teach about them better, or
think about p values in the following different way, or
replace them with this modified version of p, or
supplement them in the following way.”
There seems to be an assumption that p values should—or at least will—be retained in some way. Why? I suspect that none of the proposed neo-p approaches is likely to become very widely used. However, they blunt the core message that it’s perfectly possible to move on from any form of dichotomous decision making, and simply not use NHST or p values at all. To this extent they are an unfortunate distraction.
p as a decaf CI. One example of neo-p as a needless distraction is the contribution of Betensky. She argues correctly and cogently that (1) merely changing a p threshold, for example from .05 to .005, is a poor strategy, and (2) interpretation of any p value needs to consider the context, in particular N and the estimated effect size. Knowing all that, she correctly explains, permits calculation of the CI, which provides a sound basis for interpretation. Therefore, she concludes, a p value, when considered in context in this way, does provide information about the strength of evidence. That’s true, but why not simply calculate and interpret the CI? Once we have the CI, a p value adds nothing, and is likely to mislead by encouraging dichotomisation.
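The link between a p value in context and the CI can be sketched in a few lines. This is an illustration only, not Betensky's own calculation: it assumes a simple normal (z) test of a zero null hypothesis, and the function name `ci_from_p` and the numbers are invented for the example.

```python
from statistics import NormalDist


def ci_from_p(estimate, p_two_sided, level=0.95):
    """Approximate CI implied by a point estimate and its two-sided p value.

    Assumption (for illustration): p comes from a z test of a zero null.
    """
    std_normal = NormalDist()
    z_obs = std_normal.inv_cdf(1 - p_two_sided / 2)   # |z| implied by the p value
    se = abs(estimate) / z_obs                        # standard error implied by estimate and z
    z_crit = std_normal.inv_cdf(1 - (1 - level) / 2)  # 1.96 for a 95% CI
    moe = z_crit * se                                 # margin of error
    return estimate - moe, estimate + moe


# Same hypothetical estimate, two different p values.
for p in (0.05, 0.005):
    low, high = ci_from_p(estimate=5.0, p_two_sided=p)
    print(f"p = {p}: 95% CI [{low:.2f}, {high:.2f}]")
```

With p = .05 the interval just touches zero, running from essentially 0 to 10; with p = .005 it is shorter and sits clear of zero. Everything the p value conveys is already visible in the interval.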
I’ll mention just one further article that caught my attention. Billheimer contends that “observables are fundamental, and that the goal of statistical modeling should be to predict future observations, given the current data and other relevant information” (abstract, p.291). Rather than estimating a population parameter, we should calculate from the data a prediction interval for a data point, or sample mean, likely to be given by a replication. This strategy keeps the focus on observables and replicability, and facilitates comparisons of competing theories, in terms of the predictions they make.
This strikes me as an interesting approach, although Billheimer gives a fairly technical analysis to support his argument. A simpler approach to using predictions would be to calculate the 95% CI, then interpret this as being, in many situations, on average, approximately an 83% prediction interval. That’s one of the several ways to think about a CI that we explain in ITNS.
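The 83% figure can be checked with a small simulation. The sketch below makes simplifying assumptions that are labelled in the code: a normal population, a known population SD (so the CI is z-based), and a replication with the same N as the original study. The function name and parameter values are invented for illustration.

```python
import random
from statistics import NormalDist


def capture_rate(n=30, mu=50.0, sigma=10.0, n_sims=20000, seed=1):
    """Proportion of original 95% CIs that capture the replication mean."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)
    moe = z_crit * sigma / n ** 0.5  # assumption: sigma is treated as known
    captured = 0
    for _ in range(n_sims):
        # Original study and replication: same population, same N.
        m_orig = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        m_rep = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if abs(m_rep - m_orig) <= moe:
            captured += 1
    return captured / n_sims


rate = capture_rate()
# Under these assumptions the difference between two sample means has SD
# sqrt(2) times the standard error, which gives the expected capture rate.
expected = 2 * NormalDist().cdf(NormalDist().inv_cdf(0.975) / 2 ** 0.5) - 1
print(f"simulated capture rate = {rate:.3f}, expected = {expected:.3f}")
```

The capture rate falls short of 95% because the replication mean has its own sampling variability, on top of that of the original mean on which the CI is centred.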
I haven’t read every article in detail. I could easily be mistaken, or have missed things. Please let me know.
I suggest (maybe slightly tongue-in-cheek):
Read the editorial, and skim the 43 brief summaries.
Steve Lindsay usefully posted a comment to draw our attention to the Preregistration Workshop on offer at the APS Convention coming up shortly in D.C. You can scan the full list of Workshops here–there are lots of goodies on offer, including Tamara and Bob on Teaching the New Statistics. Steve is leading the prereg workshop and is supported by an all-star cast:
Prereg: Some Doubts
Richard Shiffrin is a highly distinguished psychological scientist. He has been arguing at a number of conventions that some Open Science practices are often unnecessary, and may even impede scientific progress. He and colleagues published this article in PNAS:
It’s a great article, but the authors raised doubts about, specifically, preregistration.
The Case for Prereg
In a very nice blog post, Steve Lindsay made complimentary remarks about the Shiffrin et al. article, then gave what seems to me a wonderfully wide-ranging yet succinct account of preregistration: the rationale and justification, and the limitations. He also countered a number of doubts that have been raised about prereg.
Steve’s post gives us lots of reasons WHY prereg. I’m sure the Workshop will also give lots of advice on HOW. Enjoy!
Whenever we read a research article we almost certainly form a judgment of its believability. To what extent is it plausible? To what extent could it be replicated? What are the chances that the findings are true?
What features of the article drive our judgment? Likely influences include:
A priori plausibility. To what extent is it unsurprising, reasonable?
How large are the effects? How meaningful?
The reputation of the authors, the standing of their lab and institution.
The reputation of the journal.
Interpretations, judgments, claims and conclusions made by the authors.
Open Science ideas will emphasise our attention to features of the research itself, including:
Are there multiple lines of converging evidence? Replications?
Was the research, including the data analysis strategy, preregistered?
Sample sizes? Quality of manipulations (IVs) and measures (DVs)?
Results of statistical inference, especially precision (CI lengths)?
Are we told that the research was reported in full detail? Are we assured that all relevant studies and results have been reported?
Any signs of cherry-picking or p hacking?
How prone is the general research area to publish non-replicable results?
Automating the Assessment of Replicability
The SCORE project is a large DARPA attempt to find automated ways to assess the replicability of social and behavioural science research. As I understand it, teams around the world are just beginning on:
1. Running replications of a large number of published studies, to provide empirical evidence of replicability, and a reference database of studies.
2. Studying how human experts judge replicability of reported research–how well do they do, and what features (as in the lists above) guide their judgments?
3. Building AI systems to take the results from 2. and make automated assessments of the replicability of published research.
Brian Nosek, of COS, is leading a project on 1. above. Fiona Fidler is leading one of the projects tackling 2. above: the repliCATS project.
It’s big, maybe up to US$6.5M. It’s ambitious. And Fiona has multiple teams working on various aspects, all with impossibly tight time lines.
A month or so ago I spent a fascinating 3 days down at the University of Melbourne, for research meetings, seminars, and more, as the teams worked on their plans. Brian Nosek was in town, giving great presentations, and consulting as to how his team and Fiona’s could best work together. Here’s the outline of repliCATS, from the project website:
The repliCATS project aims to develop more accurate, better-calibrated techniques to elicit expert assessment of the reliability of social science research.
Our approach is to adapt and deploy the IDEA protocol developed here at the University of Melbourne to elicit group judgements for the likely replicability of 3,000 research claims.
The research we will undertake as part of the repliCATS project will include the largest ever empirical study on how scientists reason about other scientists’ work, and what factors make them trust it.
We are building a custom online platform to deploy the IDEA protocol. This platform will have a life beyond the repliCATS project: it will be able to be used in the future to enhance expert group judgements on a wide range of topics, in a number of disciplines.
If you are interested in repliCATS, subscribe here for updates. You can also follow Fiona @fidlerfm
I’m agog to follow how it goes, and to see the insights I’m sure they will find into researchers’ judgments about replicability.
Suddenly it’s more than 5 years since APS made some important and very early steps to promote Open Science and new-statistics practices. Back in Jan 2014, then Editor-in-Chief of Psychological Science, Eric Eich, explained in an editorial a whole set of new policies: Offering of OS badges, requirements for more complete reporting of methods and data, encouragement for use of the new statistics, discouragement of NHST, and more. Steve Lindsay, who followed Eric, has introduced further policies to encourage replication and improve the trustworthiness of research published in Psychological Science. Various of the new policies have spread to other APS journals.
APS Conventions have hosted many symposia and workshops on Open Science, and on the new statistics and other modern statistical practices. You can rush to sign up (click at left then scroll down) for Tamara and Bob’s great workshop at this year’s Convention in D.C.:
All strength to the APS in its long-standing and continuing support for Open Science and better statistical methods.
John Ioannidis published this criticism of the Comment, with the subtitle Do Not Abandon Significance. Much of what he writes is sensible, and in agreement with the Comment, but in my opinion he doesn’t make a concerted or convincing case for not abandoning statistical significance. He hardly seems to attempt to assemble such a case, despite that subtitle.
The authors of the Comment, joined by Andrew Gelman, replied. Their reply, titled Abandoning statistical significance is both sensible and practical, strikes me as succinct, clear, and convincing.
In the meantime I continue to be so happy that statistical significance may at last be receiving its comeuppance. The battalion of scholars who have published swingeing critiques of NHST since around 1950 may at last be vindicated!
Now we just need all this great progress to filter through to instructors of the intro stats course, so that they can feel emboldened to adopt ITNS. Then we’ll know that things have really changed for the better!
Our latest work now out in @naturemethods! "Moving Beyond P Values: data analysis with estimation graphics". We show how estimation plots can improve statistical reasoning with elegant and informative estimation plots.