Reply to Lakens: The correctly-used p value needs an effect size and CI

Updated 5/21: fixed a typo, added a section on when p > .05 demonstrates a negligible effect, and added a figure at the end.

Daniel Lakens recently posted a pre-print with a rousing defense of the much-maligned p-value:

In essence, the problems we have with how p-values are used is a human factors problem. The challenge is to get researchers to improve the way they work. (Lakens, 2019) (p.8)

I agree!  In fact, Geoff and I just wrote nearly the same thing:

The fault, we argue, is not in p-values, but in ourselves.  Human cognition is subject to many biases and shortcomings. We believe the NHST approach exacerbates some of these failings, making it more likely that researchers will make overconfident and biased conclusions from data.   ​(Calin-Jageman & Cumming, 2019)​ (p.271)

So we agree that p values are misused… what should we do about it? Lakens states that the solution will be creating educational materials that help researchers avoid misunderstanding p values:

If anyone seriously believes the misunderstanding of p-values lies at the heart of reproducibility issues in science, why are we not investing more effort to make sure misunderstandings of p-values are resolved before young scholars perform their first research project? Where is the innovative coordinated effort to create world class educational materials that can freely be used in statistical training to prevent such misunderstandings? (p.8-9)

Agreed, again.

But Lakens seems to think this is all work to be done, and that no one else is putting serious effort towards remedying the misinterpretation of p values. That’s just not the case! The “New Statistics” is an effort to teach statistics in a way that will eliminate misinterpretation of p values. Geoff has put enormous effort into software, visualizations, simulations, tutorials, papers, YouTube videos, and more–a whole ecosystem that can help researchers better understand what a p-value does and does not tell them.

Let’s take a look.

Non-significant does not always mean no effect. Probably the most common abuse of p values is concluding that p > .05 means “no effect”. How can you avoid this problem? One way is to reflect on the effect size and its uncertainty and to consider the range of effects that remain compatible with the data. This can be especially easy if we use an estimation plot to draw attention specifically to the effect size and uncertainty.

Here’s a conventionally non-significant finding (p = .08), but one look at the CI and it’s pretty easy to understand that this does not mean there is no effect. In fact, when this experiment was repeated across many different labs there was a small but reliable effect (this is data from one site of the ManyLabs 2 replication of the “tempting fate” experiment) (Klein et al., 2018; Risen & Gilovich, 2008).

It is possible, though, to prove an effect negligible. So non-significant does not always rule out an effect, but sometimes it can show an effect is negligible. How can you tell if you have an informative non-significant result or a non-informative one? Check out the effect size and confidence interval. If the confidence interval is narrow, all within a range that we’d agree has no practical significance, then that non-significant result indicates the effect is negligible. If the CI is wide (as in the tempting fate example above), then there are a range of compatible effect sizes remaining and we can’t yet conclude that the effect is negligible.

For example, in ManyLabs 2 the font disfluency effect was tested–this is the idea that a hard-to-read font can activate analytic System 2 style thinking and therefore improve performance on tests of analytic thinking. Whereas the original study found a large improvement in syllogistic reasoning (d = 0.64), the replications found d = -0.03, 95% CI [-0.08, 0.01], p = .43 (Klein et al., 2018). This is non-significant, but that doesn’t mean we just throw up our hands and discount it as meaningless. The CI is narrow and all within a range of very small effects. Most of the CI is in the negative direction…exactly what you might expect for a hard-to-read font. But we can’t rule out the positive effect that was predicted–effect sizes around d = 0.01 remain compatible with the data (and probably a bit beyond–remember, the selection of a 95% CI and therefore its boundary is arbitrary). Still, it’s hard to imagine effect sizes in this range being meaningful or practical for further study. So, yes – sometimes p > .05 accompanies a result that is really meaningful and which fairly clearly establishes a negligible effect. The way to tell whether a non-significant result is or is not very informative is to check the CI. You can also do a formal test for a negligible effect, called an equivalence test. Lakens has great materials on how to do this–though I think it helps a lot to see and understand this “by eye” first.
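To make that “check the CI” logic concrete, here is a minimal sketch in Python. The equivalence bounds of ±0.1 in d units are an arbitrary illustration, not a recommendation–in practice you would declare your smallest effect size of interest in advance. The font-disfluency numbers are the ones quoted above; the wide interval in the second call is made up for contrast.

```python
# A minimal sketch of the "is the whole CI inside the negligible zone?" check.
# The +/-0.1 bounds (in d units) are an arbitrary illustration; declare your
# own smallest effect size of interest (SESOI) before looking at the data.

def negligible_by_ci(ci_low, ci_high, sesoi=0.1):
    """True if every effect size compatible with the data (the whole CI)
    is smaller in magnitude than the smallest effect size of interest."""
    return -sesoi < ci_low and ci_high < sesoi

# Font-disfluency replication: d = -0.03, 95% CI [-0.08, 0.01] -> negligible
print(negligible_by_ci(-0.08, 0.01))   # True

# A wide CI (hypothetical numbers) -> can't yet call the effect negligible
print(negligible_by_ci(-0.05, 0.60))   # False
```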

Significant does not mean meaningful. Another common misconception is the conflation of statistical and practical significance. How can you avoid this error? After obtaining the p value, graph the effect size and CI. Then reflect on the meaningfulness of the range of compatible effect sizes. This is even more useful if you specify in advance what you consider to be a meaningful effect size relative to your theory.

Here’s an example from a meta-analysis I recently published. The graph shows the estimated standardized effect size (and 95% CI) for the effect of red on perceived attractiveness for female raters. In numbers, the estimate is: d = 0.13 95% CI [0.01, 0.25], p = .03, N = 2,739. So this finding is statistically significant, but we then need to consider if it is meaningful. To do that, we should countenance the whole range of the CI (and probably beyond as well)… what would it mean if the real effect is d = .01? d = .25? For d = .01 that would pretty clearly be a non-meaningful finding, far too small to be perceptually noticeable for anyone wearing red. For d = .25…well, perhaps a modest edge is all you need when you’re being scanned by hundreds on Tinder. Interpreting what is and is not meaningful is difficult and takes domain-specific knowledge. What is clear is that looking at the effect size and CI helps focus attention on practical significance as a question distinct from statistical significance.

Nominal differences in statistical significance are not valid for judging an interaction. Another alluring pitfall of p-values is assuming that nominal differences in statistical significance are meaningful. That is, it’s easy to assume that if condition A is statistically significant and condition B is not, that A and B must themselves be different. Unfortunately, the transitive property does not apply to statistical significance. As Gelman and many others have pointed out, the difference between statistically significant and not significant is not, itself, statistically significant ​(Gelman & Stern, 2006; Nieuwenhuis, Forstmann, & Wagenmakers, 2011)​. That’s quite a mantra to remember, but one easy way to avoid this problem is to look at an estimation plot of the effect sizes and confidence intervals. You can see the evidence for an interaction ‘by eye’.

Here’s an example from a famous paper by Kosfeld et al. where they found that intranasal oxytocin increases human trust in a trust-related game but not in a similar game involving only risk (Kosfeld, Heinrichs, Zak, Fischbacher, & Fehr, 2005). As shown in A, the effect of oxytocin was statistically significant for the trust game, but not for the risk game. The paper concluded that the effect is specific to trust contexts; it was published in Nature and has been cited over 3,000 times…apparently without anyone noticing that a formal test for a context interaction was not reported and would not have been statistically significant. In B, we see the estimation plots from both contexts. You can immediately see that there is huge overlap in the compatible effect sizes in both contexts, so the evidence for an interaction is weak (running the interaction gives p = .23). That’s some good human factors work–letting our eyes immediately tell us what would be hard to discern from the group-level p values alone.
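If you want the formal version of “test the difference directly,” here is a rough sketch. It works from summary statistics–two independent effect estimates and their standard errors–and uses a normal approximation; the numbers are placeholders, not the Kosfeld et al. data.

```python
# A rough sketch of testing the interaction/contrast itself, rather than
# comparing one significant and one non-significant result. Assumes two
# independent effect estimates with known standard errors (normal approx.).
# The numbers below are placeholders, not the Kosfeld et al. data.
from math import sqrt
from scipy import stats

def contrast_test(est_a, se_a, est_b, se_b, conf=0.95):
    diff = est_a - est_b
    se_diff = sqrt(se_a**2 + se_b**2)      # SE of a difference of independent estimates
    z = diff / se_diff
    p = 2 * stats.norm.sf(abs(z))          # two-sided p for the contrast
    crit = stats.norm.ppf(1 - (1 - conf) / 2)
    ci = (diff - crit * se_diff, diff + crit * se_diff)
    return diff, ci, p

# One effect "significant," the other not, yet the contrast itself is weak:
print(contrast_test(est_a=0.20, se_a=0.09, est_b=0.08, se_b=0.10))
```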

If I have obtained statistical significance I have collected enough data. To my mind this is the most pernicious misconception about p-values–that obtaining significance proves you have collected enough data. I hear this a lot in neuroscience. This misconception causes people to think that run-and-check is the right way to obtain a sample size, and it also licenses uncritical copying of previous sample sizes. Overall, I believe this misconception is responsible for the pervasive use of inadequate sample sizes in the life and health sciences. This, in combination with publication bias, can easily produce whole literatures that are nothing more than noise (https://www.theatlantic.com/science/archive/2019/05/waste-1000-studies/589684/). Yikes.

How can you avoid this error? After obtaining your p-value, look at the effect size and an expression of uncertainty. Look at the range of compatible results–if the range is exceedingly long relative to your theoretical interests then your result is uninformative and you have not collected enough data. You actually can use run-and-check to keep checking on your margin of error and running until it is sufficiently narrow to enable a reasonable judgement from your data. How cool is that?
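Here is a rough sketch of that planning-for-precision idea: compute the current margin of error (half-width of the 95% CI) and, for planning, the approximate per-group n needed to reach a target margin of error in d units. It assumes two independent groups and a large-sample approximation; all the numbers are illustrative.

```python
# A rough sketch of planning for precision: track the margin of error
# (half-width of the 95% CI) rather than chasing p < .05. Assumes two
# independent groups; the target MoE and the numbers below are illustrative.
import numpy as np
from scipy import stats

def margin_of_error(sd1, n1, sd2, n2, conf=0.95):
    se = np.sqrt(sd1**2 / n1 + sd2**2 / n2)
    df = n1 + n2 - 2                          # simple pooled-df approximation
    return stats.t.ppf(1 - (1 - conf) / 2, df) * se

def n_per_group_for_moe(target_moe_d, conf=0.95):
    """Approximate n per group so the CI on a standardized mean difference
    has half-width ~ target_moe_d (large-sample z approximation)."""
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    return int(np.ceil(2 * (z / target_moe_d) ** 2))

print(margin_of_error(sd1=1.0, n1=20, sd2=1.0, n2=20))  # ~0.64 SD units: very imprecise
print(n_per_group_for_moe(0.2))                         # ~193 per group for MoE of d = 0.2
```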

Here’s an example. This is data from a paper in Nature Neuroscience that found a statistically significant effect of caffeine on memory (Borota et al., 2014). Was this really enough data? Well, the CI is very long relative to the theoretical interests–it stretches from implausibly large impacts on memory to vanishingly small impacts on memory (in terms of % improvement the finding is 31%, 95% CI [0.2%, 62%]). So the data establishes a range of compatible effect sizes that is somewhere between 0 and ludicrous. That’s not very specific/informative, and that’s a clear sign you need more data. I corresponded with this lab extensively about this… but in all their subsequent work examining effects of different interventions on memory they’ve continued to use sample sizes that provide very little information about the true effect. Somehow, though, they’ve yet to publish a negative result (that I know of). Some people have all the luck.

p-values do not provide good guidance of what to expect in the future. If you find a statistically significant finding, you might expect that someone else repeating the same experiment with the same sample size will also find a statistically-significant finding (or at least be highly likely to). Unfortunately, this is wrong–even if your original study had 0 sampling error (the sample means were exactly what they are in the population) the odds of a statistically-significant replication may not be very high. Specifically, if your original study has a sample size that is uninformative (aka low powered), odds of replication might be stunningly small. Still, this is a seductive illusion that even well-trained statisticians fall into ​(Gigerenzer, 2018)​ .

How can you avoid this error? Well, one way is to work through simulations to get a better sense of the sampling distribution of p under different scenarios. Geoff has great simulations to explore; here’s a video of him walking through one of these: https://www.youtube.com/watch?v=OcJImS16jR4. Another strategy is to consider the effect size and a measure of uncertainty, because these actually do provide you with a better sense of what would happen in future replications. Specifically, CIs provide prediction intervals. For a 95% CI for an original finding, there is an ~83% chance that a same-sized replication result will fall within the original CI (Cumming, 2008). Why only 83%? Because both the original and the replication study are subject to sampling error. Still, this is useful information about what to expect in a replication study–it can help your lab think critically about whether you have a protocol that is likely to be reliable and can properly calibrate your sense of surprise over a series of replications. To my mind, having accurate expectations about the future is critical for fruitful science, so it is well worth some effort to train yourself out of the misconception that p values alone can give you much information about the future.
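A quick simulation makes the ~83% figure easy to check for yourself. This is not from Cumming (2008); it just assumes an original study and a same-sized replication drawn from the same normal population.

```python
# A quick simulation of the claim that an original 95% CI captures the mean
# of a same-sized replication roughly 83% of the time. Assumes both studies
# sample the same normal population; all settings are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
n, mu, sigma, reps = 30, 0.5, 1.0, 50_000
captured = 0
for _ in range(reps):
    original = rng.normal(mu, sigma, n)
    replication = rng.normal(mu, sigma, n)
    moe = 1.96 * original.std(ddof=1) / np.sqrt(n)   # approx. 95% CI half-width
    if abs(replication.mean() - original.mean()) < moe:
        captured += 1
print(captured / reps)   # ~0.83, in line with Cumming (2008)
```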

If you want to use a p value well, always report and interpret its corresponding effect size and CI. If you’re following along at home, you’ve probably noticed a pattern: to avoid misinterpretation of p values, reflect critically on the effect size and CI that accompany it. If you do this, it will be much harder to go wrong. If you keep at it for a while, you may begin to notice that the p value step isn’t really doing much for you. I mean, it does tell you the probability of obtaining data at least as extreme as yours under a specific null hypothesis. But you could also plot the expected distribution of sampling error and get that sense for all null hypotheses, which is pretty useful. And you may be interested in Lakens’s excellent suggestions to not always use a point null and to use equivalence testing. But both of these are applied by considering the effect size and its CI. So you may find that p values become kind of vestigial to the science you’re doing. Or not. Whatever. I’m not going to fall into the “Statistician’s Fallacy” of trying to tell you what you want to know. But I can tell you this: If p values tell you something you want to know, then the best way to avoid some of the cognitive traps surrounding p values is to carefully consider effect sizes and CIs.

There’s even more. This short tour is just a taste of work Geoff and I (mostly Geoff) have done to try to provide tools and tutorials to help people avoid misinterpretation of p values. We have an undergrad textbook that helps students learn what p values are and are not (we include several “red flags” to consider every time p values are reported). And there’s ESCI, which will give you p values if you need them, but always along with helpful visualizations and outputs that we think will nudge you away from erroneous interpretation. And pedagogically, I’m convinced that teaching estimation first is the best way to teach statistics–it is more intuitive, it is more fun, and it helps students develop a much stronger sense of what a p value is (and is not). With all due respect to the wonderful memorization drills Lakens recounts in his pre-print, chanting mantras about p values is not going to avoid mindless statistics–but a teaching approach that takes a correct developmental sequence could.

So how do we disagree? I agree with Lakens that the problems of p values are in how they are used, and I agree with Lakens that we need a whole ecosystem to help people avoid using them incorrectly. Where we have a puzzling disagreement, then, is that I see that work as largely complete and ready for use. That is, I see the “New Statistics” specifically as a way of teaching and thinking about statistics that will help make the errors common to p values more rare (but will probably make other types of errors more common; no system is perfect). Is the disagreement all down to labels and/or misconceptions? Stay tuned.

Oh and one other thing. Lakens complains that SPSS does not report an effect size for a t-test. But it does! At least as long as I’ve been using it, it provides an effect size and CI. It does not provide standardized effect sizes, but raw score? You bet.

“The widely used statistical software package SPSS is 40 years old, but in none of its 25 editions did it occur to the creators that it might be a good idea to provide researchers with the option to compute an effect size when performing a t-test.” (Lakens, 2019) (Except that it did, and it does).

  1. Borota, D., Murray, E., Keceli, G., Chang, A., Watabe, J. M., Ly, M., … Yassa, M. A. (2014). Post-study caffeine administration enhances memory consolidation in humans. Nature Neuroscience, 201–203. doi: 10.1038/nn.3623
  2. Calin-Jageman, R. J., & Cumming, G. (2019). The New Statistics for Better Science: Ask How Much, How Uncertain, and What Else Is Known. The American Statistician, 271–280. doi: 10.1080/00031305.2018.1518266
  3. Cumming, G. (2008). Replication and p Intervals: p Values Predict the Future Only Vaguely, but Confidence Intervals Do Much Better. Perspectives on Psychological Science, 286–300. doi: 10.1111/j.1745-6924.2008.00079.x
  4. Gelman, A., & Stern, H. (2006). The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant. The American Statistician, 328–331. doi: 10.1198/000313006X152649
  5. Gigerenzer, G. (2018). Statistical Rituals: The Replication Delusion and How We Got There. Advances in Methods and Practices in Psychological Science, 198–218. doi: 10.1177/2515245918771329
  6. Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams, R. B., Jr., Alper, S., … Nosek, B. A. (2018). Many Labs 2: Investigating Variation in Replicability Across Samples and Settings. Advances in Methods and Practices in Psychological Science, 443–490. doi: 10.1177/2515245918810225
  7. Kosfeld, M., Heinrichs, M., Zak, P. J., Fischbacher, U., & Fehr, E. (2005). Oxytocin increases trust in humans. Nature, 673–676. doi: 10.1038/nature03701
  8. Lakens, D. (2019). The practical alternative to the p-value is the correctly used p-value. doi: 10.31234/osf.io/shm8v
  9. Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: a problem of significance. Nature Neuroscience, 1105–1107. doi: 10.1038/nn.2886
  10. Risen, J. L., & Gilovich, T. (2008). Why people are reluctant to tempt fate. Journal of Personality and Social Psychology, 293–307. doi: 10.1037/0022-3514.95.2.293

The TAS Articles: Geoff’s Take

Yay!  I salute the editors and everyone else who toiled for more than a year to create this wonderful collection of TAS articles. Yes, let’s move to a “post p<.05 world” as quickly as we can.

Much to applaud 

Numerous good practices are identified in multiple articles. For example:

  • Recognise that there’s much more to statistical analysis and interpretation of results than any mere inference calculation.
  • Promote Open Science practices across many disciplines.
  • Move beyond dichotomous decision making, whatever ‘bright line’ criterion it’s based on.
  • Examine the assumptions of any statistical model, and be aware of limitations.
  • Work to change journal policies, incentive structures for researchers, and statistical education.

And much, much more. I won’t belabour all this valuable advice.

The editorial is a good summary. I said that in this blog post. The slogan is ATOM: “Accept uncertainty, and be Thoughtful, Open, and Modest.” Scan the second part of the editorial for a brief dot point summary of each of the 43 articles.

Estimation to the fore. I still think our (Bob’s) article sets out the most-likely-achievable and practical way forward, based on the new statistics (estimation and meta-analysis) and Open Science. Anderson provides another good discussion of estimation, with a succinct explanation of confidence intervals (CIs) and their interpretation.

What’s (largely) missing: Bayes and bootstrapping 

The core issue, imho, is moving beyond dichotomous decision making to estimation. Bob and I, and many others, advocate CI approaches, but Bayesian estimation and bootstrapping are also valuable techniques, likely to become more widely used. It’s a great shame these are not strongly represented.

There are articles that advocate a role for Bayesian ideas, but I can’t see any article that focuses on explaining and advocating the Bayesian new statistics, based on credible intervals. The closest is probably Ruberg et al., but their discussion is complicated and technical, and focussed specifically on decision making for drug approval.

I suspect Bayesian estimation is likely to prove an effective and widely-applicable way to move beyond NHST. In my view, the main limitation at the moment is the lack of good materials and tools, especially for introducing the techniques to beginners. Advocacy and a beginners’ guide would have been a valuable addition to the TAS collection.

Bootstrapping to generate interval estimates can avoid some assumptions, and thus increase robustness and expand the scope of estimation. An article focussing on explaining and advocating bootstrapping for estimation would have been another valuable addition.

The big delusion: Neo-p approaches 

I and many others have long argued that we should simply not use NHST or p values at all. Or should use them only in rare situations where they are necessary—if these ever occur. For me, the biggest disappointment with the TAS collection is that a considerable number of articles present some version of the following argument: “Yes, there are problems with p values as they have been used, but what we should do is:

  • use .005 rather than .05 as the criterion for statistical significance, or
  • teach about them better, or
  • think about p values in the following different way, or
  • replace them with this modified version of p, or
  • supplement them in the following way, or
  • …”

There seems to be an assumption that p values should—or at least will—be retained in some way. Why? I suspect that none of the proposed neo-p approaches is likely to become very widely used. However, they blunt the core message that it’s perfectly possible to move on from any form of dichotomous decision making, and simply not use NHST or p values at all. To this extent they are an unfortunate distraction.

p as a decaf CI. One example of neo-p as a needless distraction is the contribution of Betensky. She argues correctly and cogently that (1) merely changing a p threshold, for example from .05 to .005, is a poor strategy, and (2) interpretation of any p value needs to consider the context, in particular N and the estimated effect size. Knowing all that, she correctly explains, permits calculation of the CI, which provides a sound basis for interpretation. Therefore, she concludes, a p value, when considered in context in this way, does provide information about the strength of evidence. That’s true, but why not simply calculate and interpret the CI? Once we have the CI, a p value adds nothing, and is likely to mislead by encouraging dichotomisation.
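Betensky’s back-calculation is easy to demonstrate. Here is a sketch for a two-sided one-sample (or paired) t test: given p, N, and the estimated effect, you can recover the standard error and hence the CI. The numbers are made up for illustration.

```python
# A sketch of Betensky's point for a two-sided one-sample (or paired) t test:
# p, N, and the estimate together pin down the CI, so one might as well
# report and interpret the CI directly. The numbers are made up.
from scipy import stats

def ci_from_p(p, n, estimate, conf=0.95):
    df = n - 1
    t_obs = stats.t.ppf(1 - p / 2, df)      # |t| implied by the two-sided p
    se = abs(estimate) / t_obs              # back out the standard error
    crit = stats.t.ppf(1 - (1 - conf) / 2, df)
    return estimate - crit * se, estimate + crit * se

print(ci_from_p(p=0.03, n=40, estimate=1.2))   # approx. (0.12, 2.28)
```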

Using predictions

I’ll mention just one further article that caught my attention. Billheimer contends that “observables are fundamental, and that the goal of statistical modeling should be to predict future observations, given the current data and other relevant information” (abstract, p.291). Rather than estimating a population parameter, we should calculate from the data a prediction interval for a data point, or sample mean, likely to be given by a replication. This strategy keeps the focus on observables and replicability, and facilitates comparisons of competing theories, in terms of the predictions they make.

This strikes me as an interesting approach, although Billheimer gives a fairly technical analysis to support his argument. A simpler approach to using predictions would be to calculate the 95% CI, then interpret this as being, in many situations, on average, approximately an 83% prediction interval. That’s one of the several ways to think about a CI that we explain in ITNS.
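For what it’s worth, here is a one-line sketch of where the ~83% comes from, assuming equal sample sizes and (for simplicity) known variance: the difference between the original and replication means has standard error √2 times the SE of a single mean, so

```latex
P\bigl(|\bar{X}_{\mathrm{rep}} - \bar{X}_{\mathrm{orig}}| < 1.96\,\mathrm{SE}\bigr)
  = P\!\left(|Z| < \tfrac{1.96}{\sqrt{2}}\right) \approx 0.83
```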

Finally

I haven’t read every article in detail. I could easily be mistaken, or have missed things. Please let me know.

I suggest (maybe slightly tongue-in-cheek):

Geoff

Preregistration: Why and How

Prereg Workshop at APS

Steve Lindsay usefully posted a comment to draw our attention to the Preregistration Workshop on offer at the APS Convention coming up shortly in D.C. You can scan the full list of Workshops here–there are lots of goodies on offer, including Tamara and Bob on Teaching the New Statistics. Steve is leading the prereg workshop and is supported by an all-star cast:

Prereg: Some Doubts

Richard Shiffrin is a highly distinguished psychological scientist. He has been arguing at a number of conventions that some Open Science practices are often unnecessary, and may even impede scientific progress. He and colleagues published this article in PNAS:

It’s a great article, but the authors raised doubts about, specifically, preregistration.

The Case for Prereg

In a very nice blog post, Steve Lindsay made complimentary remarks about the Shiffrin et al. article, then gave what seems to me a wonderfully wide-ranging yet succinct account of preregistration: the rationale and justification, and the limitations. He also countered a number of doubts that have been raised about prereg.

Steve’s post gives us lots of reasons WHY prereg. I’m sure the Workshop will also give lots of advice on HOW. Enjoy!

Geoff

Judging Replicability: Fiona’s repliCATS Project

Judging Replicability

Whenever we read a research article we almost certainly form a judgment of its believability. To what extent is it plausible? To what extent could it be replicated? What are the chances that the findings are true?

What features of the article drive our judgment? Likely influences include:

  • A priori plausibility. To what extent is it unsurprising, reasonable?
  • How large are the effects? How meaningful?
  • The reputation of the authors, the standing of their lab and institution.
  • The reputation of the journal.
  • Interpretations, judgments, claims and conclusions made by the authors.

Open Science ideas focus our attention on features of the research itself, including:

  • Are there multiple lines of converging evidence? Replications?
  • Was the research, including the data analysis strategy, preregistered?
  • Sample sizes? Quality of manipulations (IVs) and measures (DVs)?
  • Results of statistical inference, especially precision (CI lengths)?
  • Are we told that the research was reported in full detail? Are we assured that all relevant studies and results have been reported?
  • Any signs of cherry-picking or p hacking?
  • How prone is the general research area to publish non-replicable results?

Automating the Assessment of Replicability

The SCORE project is a large DARPA attempt to find automated ways to assess the replicability of social and behavioural science research. As I understand it, teams around the world are just beginning on:

  1. Running replications of a large number of published studies, to provide empirical evidence of replicability, and a reference database of studies.
  2. Studying how human experts judge replicability of reported research–how well do they do, and what features (as in the lists above) guide their judgments?
  3. Building AI systems to take the results from (2) and make automated assessments of the replicability of published research.

Brian Nosek, of COS, is leading a project on 1. above. Fiona Fidler is leading one of the projects tackling 2. above: the repliCATS project.

repliCATS

It’s big, maybe up to US$6.5M. It’s ambitious. And Fiona has multiple teams working on various aspects, all with impossibly tight time lines.

A month or so ago I spent a fascinating 3 days down at the University of Melbourne, for research meetings, seminars, and more, as the teams worked on their plans. Brian Nosek was in town, giving great presentations, and consulting as to how his team and Fiona’s could best work together. Here’s the outline of repliCATS, from the project website:

  1. The repliCATS project aims to develop more accurate, better-calibrated techniques to elicit expert assessment of the reliability of social science research.
  2. Our approach is to adapt and deploy the IDEA protocol developed here at the University of Melbourne to elicit group judgements for the likely replicability of 3,000 research claims.
  3. The research we will undertake as part of the repliCATS project will include the largest ever empirical study on how scientists reason about other scientists’ work, and what factors makes them trust it.
  4. We are building a custom online platform to deploy the IDEA protocol. This platform will have a life beyond the repliCATS project: it will be able to be used in the future to enhance expert group judgements on a wide range of topics, in a number of disciplines.

If you are interested in repliCATS, subscribe here for updates. You can also follow Fiona @fidlerfm

I’m agog to follow how it goes, and to see the insights I’m sure they will find into researchers’ judgments about replicability.

Geoff

APS Publicises the TAS Articles

The APS has just released this Research Spotlight item:

Suddenly it’s more than 5 years since APS made some important and very early steps to promote Open Science and new-statistics practices. Back in Jan 2014, then Editor-in-Chief of Psychological Science, Eric Eich, explained in an editorial a whole set of new policies: Offering of OS badges, requirements for more complete reporting of methods and data, encouragement for use of the new statistics, discouragement of NHST, and more. Steve Lindsay, who followed Eric, has introduced further policies to encourage replication and improve the trustworthiness of research published in Psychological Science. Various of the new policies have spread to other APS journals.

The Research Spotlight item also notes some of the other steps APS has taken in support of OS and statistical improvements, including publication of this tutorial article, and this set of six videos.

APS Conventions have hosted many symposia and workshops on Open Science, and on the new statistics and other modern statistical practices. You can rush to sign up (click at left then scroll down) for Tamara and Bob’s great workshop at this year’s Convention in D.C.:

All strength to the APS in its long-standing and continuing support for Open Science and better statistical methods.

Geoff

Ditching Statistical Significance: The Most Talked-About Paper Ever?

Well, that might be a stretch, but in relation to the Nature Comment that Bob and I signed to support, Altmetric tweeted:

John Ioannidis published this criticism of the Comment, with the subtitle Do Not Abandon Significance. Much of what he writes is sensible, and in agreement with the Comment, but in my opinion he doesn’t make a concerted or convincing case for not abandoning statistical significance. He hardly seems to attempt to assemble such a case, despite that subtitle.

The authors of the Comment, joined by Andrew Gelman, replied. Their reply, titled Abandoning statistical significance is both sensible and practical, strikes me as succinct, clear, and convincing.

I’m still working through the 43 TAS articles that prompted the Nature Comment. I’ll report.

In the meantime I continue to be so happy that statistical significance may at last be receiving its comeuppance. The battalion of scholars who have published swingeing critiques of NHST since around 1950 may at last be vindicated!

Now we just need all this great progress to filter through to instructors of the intro stats course, so that they can feel emboldened to adopt ITNS. Then we’ll know that things have really changed for the better!

Geoff

Moving Beyond p < .05: The Latest

A couple of days ago, the three authors of the Nature paper accompanying the special issue of TAS on moving beyond p < .05 sent the update below. (See below for lots of links.)

We are writing with a brief update on the Nature comment “Retire statistical significance” that you signed. The comment has so far been subject to spirited discussion in both traditional and social media (see below).

We believe keeping this discussion going by approaching colleagues and sharing links on social media is the only way a reform in statistical practice will come about. Your continued support in this is most welcome and appreciated! Together, our chance to make this happen has perhaps never been better.

With kind regards and many thanks,
Valentin, Sander, Blake

Links:
Retire statistical significance: https://www.nature.com/articles/d41586-019-00857-9
The American Statistician – Statistical inference in the 21st century: a world beyond p < 0.05: https://www.tandfonline.com/toc/utas20/73/sup1

Media mentions:
Retraction Watch: https://retractionwatch.com/2019/03/21/time-to-say-goodbye-to-statistically-significant-and-embrace-uncertainty-say-statisticians

Andrew Gelman’s blog (discussion with 400 comments): https://statmodeling.stat.columbia.edu/2019/03/20/retire-statistical-significance-the-discussion

Bloomberg: https://www.bloomberg.com/opinion/articles/2019-03-29/dump-statistical-significance-then-teach-scientists-statistics

The Guardian: https://www.theguardian.com/commentisfree/2019/mar/24/the-guardian-view-on-statistics-in-sciences-gaming-the-unknown

Vox: https://www.vox.com/latest-news/2019/3/22/18275913/statistical-significance-p-values-explained

NPR: https://www.npr.org/sections/health-shots/2019/03/20/705191851/statisticians-call-to-arms-reject-significance-and-embrace-uncertainty

……………………………….

Browse Gelman’s blog and its comments to get an idea of the range of views held by, especially, statisticians. There may be a danger that the imperative to move forward from p values gets lost in the noise. We really do need to keep focus! (Yay for the new statistics! Estimation, CIs, meta-analysis to the fore! Let p values wither and die a natural death!)

Geoff

Moving to a World Beyond “p < 0.05”

The 43 articles in The American Statistician discussing what researchers should do in a “post p<.05” world are now online. See here for a list of them all, with links to each article.

The collection starts with an editorial:

Go here to get the full editorial as a pdf.

Bob and I commented on earlier drafts of the editorial, as did some other authors of articles in the collection. I think the published version is great, even if, as usual, I’d like it to have gone a bit further towards virtually always ending the use of p values. But there are very welcome strong recommendations for the use of estimation, as well as many other wise words.

We’re pleased that the authors of the editorial elected to refer to our article (more on that below) in 5 places, and to quote our words in 4 of those (see pp. 3 (twice), 9, and 10 in the pdf). A very strong theme of the editorial is that researchers should always ’embrace uncertainty’, which is a major theme also in our article, among others.

Go here for the pdf of our article.

Yes, my name is listed as a co-author, but Bob wrote the article, and an excellent job he did of it too! If all researchers followed our (his) advice, the world would be a much better place–says he modestly. But have a squiz and see what you think.

The editorial includes (pp. 10-18 in the pdf) a brief dot point summary of each article. The summary we contributed of ours is:

1. Ask quantitative questions and give quantitative answers.

2. Countenance uncertainty in all statistical conclusions, seeking ways to quantify, visualize, and interpret the potential for error.

3. Seek replication, and use quantitative methods to synthesize across data sets as a matter of course.

4. Use Open Science practices to enhance the trustworthiness of research results.

5. Avoid, wherever possible, any use of p values or NHST.

Here’s to the onward march of Open Science and better statistical practice!

Geoff

Ditching Statistical Significance?!

Nature (!) has just published an editorial discussing and advocating that statistical significance should be ditched. For me, that’s the stuff of dreams, but I have lived to see it happen! I’m so happy!

Here’s one para from the editorial:

The ‘call for scientists to abandon statistical significance’, by Valentin Amrhein, Sander Greenland, and Blake McShane, is the Comment that is also in this week’s Nature. Its title is Scientists rise up against statistical significance, and it is supported by 854 scientists from 52 countries. See a list of these good folks here.

The Comment has been developed over past weeks, with revisions in response to suggestions (and exhortations) from many, including Bob and me. Of course I’d like the final version to be stronger, and to call for an end to any use of p values, or their use only in rare cases when we don’t (yet) have good alternatives. But it is pretty strong, and has much wise advice that would lead to much improved practice if widely followed.

The 854 signatures were collected in a very few days, during which the Comment itself was strictly embargoed. Bob and I were very happy to sign.

The ‘series of related articles‘ mentioned above, and published by the American Statistical Association, has not yet appeared, but should do so any moment. That should also be a massive game changer. Let’s hope.

Sometimes, ‘ditching’ can be great progress!

Geoff

Microworlds

Last month I (Bob) visited a local elementary school for a “Science Alliance” visit. This is a program in our community to bring local scientists into the classroom. I brought the Cartoon Network simulator I have been developing (Calin-Jageman, 2017, 2018). This simulator is simple enough that kids can use it, but complex enough to generate some really cool network behaviors (reflex chains, oscillations, etc.). The simulation can be hooked up to a cheap USB robot, so kids can design the ‘brains’ of the robot, giving it the characteristics they want (fearful–to run away from being touched; aggressive–to track light), etc.

The kids *loved* the activity–the basic ideas were easy to grasp and they were quickly exploring, trying things out, and sharing results with each other. They made their Finches chirp and dance, and in the process discovered recurrent loops and the importance of inhibition.

In developing Cartoon Network, my inspiration was Logo, the programming language developed by Seymour Papert and colleagues at MIT. I was a “logo kid”–it was basically the only thing you *could* do in the computer lab my elementary school installed when I was in second grade. Logo was *fun*–you could draw things, make animations…it was a world I wanted to explore. But Logo didn’t make it terribly easy–as you went along you would need/want key programming concepts. I clearly remember sitting in the classroom writing a program to draw my name and being frustrated at having to re-write the commands to make a B at the end of my name when I had already typed them out for the B at the beginning of my name. The teacher came by and introduced me to functions, and I remember being so happy about the idea of a “to b” function. I immediately grasped that I could write functions for every letter once and then be able to have the turtle type anything I wanted in no time at all. Pretty soon I had a “logo typewriter” that I was soooo proud of. I could viscerally appreciate the time I had saved, as I could quickly make messages to print out that would have taken me the whole class-period to code ‘by letter’.

Years later I read Mindstorms, Papert’s explanation of the philosophy behind Logo. This remains, to my mind, one of the most important books on pedagogy, teaching, and technology. Papert applied Piaget’s model of children as scientists (he had trained with Piaget). He believed that if you can make a microworld that is fun to explore, children will naturally need, discover, and understand deep concepts embedded in that world. That’s what I was experiencing back in 2nd grade–I desperately needed functions, and so the idea of them stuck with me in a way that they never would in an artificial “hello world” type of programming exercise. Having grown up a “logo kid”, reading Mindstorms was especially exciting–I could recognize myself in the examples, and connect my childhood experiences to the deeper ideas about learning Papert used to structure my experience.

Papert warned that microworlds must be playful and open-ended. Most importantly, a microworld should not be reduced to a ‘drill and skill’ environment where kids have to come up with *the* answer. Sadly, he saw computers and technologies being used that way–to program the kids rather than having the kids program the computers. Even more sadly, almost all the “kids can code” initiatives out there have lost this open-ended sense of exploration–they are mostly a series of specific challenges, each with one right answer. They do not inspire much joy or excitement; their success is measured in the number of kids pushed through. (Yes, there are some exceptions, like Minecraft coding, etc… but most of the kids-can-code initiatives are just terrible, nothing like what Papert had in mind).

So, what does all this have to do with statistics? Well, the idea of a microworld still makes a lot of sense and is also applicable to statistics education. Geoff’s dance of the means has become rightly famous, I would suggest, because it is a microworld users can explore to sharpen their intuitions about sampling, p values, CIs, and the like. Richard Morey and colleagues recently ran a study where you could sample from a null distribution to help make a judgement about a possible effect. And, in general, the use of simulations is burgeoning in helping researchers explore and better understand analyses (Dorothy Bishop has some great blog posts about this). Thinking of these examples makes me wonder, though–can we do even better? Can we produce a fun and engaging microworld for the exploration of inferential statistics, one that would help scientists of all ages gain deep insight into the concepts at play? I have a couple of ideas… but nothing very firm yet, and even less time to start working on them. But still, coming up with a Logo of inference is definitely on my list of projects to take on.

I’m going to end with 3 examples of thank-you cards I received from the 3rd grade class I visited. All the cards were amazing–they genuinely made my week. I posted these to Twitter but thought I’d archive them here as well.

This kid has some great ideas for the future of AI

“I never knew neurons were a thing at all”–the joy of discovery
“Your job seems awesome and you are the best at it”—please put this kid on my next grant review panel.
  1. Calin-Jageman, R. (2017). Cartoon Network: A tool for open-ended exploration of neural circuits. Journal of Undergraduate Neuroscience Education, 16(1), A41–A45. Retrieved from https://www.ncbi.nlm.nih.gov/pubmed/29371840
  2. Calin-Jageman, R. (2018). Cartoon Network Update: New Features for Exploring of Neural Circuits. Journal of Undergraduate Neuroscience Education, 16(3), A195–A196. Retrieved from https://www.ncbi.nlm.nih.gov/pubmed/30254530