Statistical Cognition: An Invitation

Statistical Cognition (SC) is the study of how people understand–or, quite often, misunderstand–statistical concepts or presentations. Is it better to report results using numbers, or graphs? Are confidence intervals (CIs) appreciated better if shown as error bars in a graph or as numerical values?

And so on. These are all SC questions. For statistical practice to be evidence-based, we need answers to SC questions, and these should inform how we teach, discuss, and practise statistics. Of course.

An SC Experiment

This is a note to invite you–and any of your colleagues and students who may be interested–to participate in an interesting SC study. It is being run by Lonni Besançon and Jouni Helske. It’s an online survey that asks questions about CIs and other displays. It’s easy, takes around 15 minutes, and, as usual, is anonymous. To start, click here.

Feel free to pass this invitation on to anyone who might be interested. I suggest that we should all feel some obligation to encourage participation in SC research, because it has the potential to enhance research. Here’s to evidence-based practice! With cognitive evidence front and centre.

Statistical Cognition, Some Background

Ruth Beyth-Marom, Fiona Fidler, and I wrote about SC some years back. The full paper is here. The citation is:

Beyth-Marom, R., Fidler, F., & Cumming, G. (2008). Statistical cognition: Towards evidence-based practice in statistics and statistics education. Statistics Education Research Journal, 7, 20-39.

Some of Our SC Research

Here are a few examples:

Lai, J., Fidler, F., & Cumming, G. (2012). Subjective p intervals: Researchers underestimate the variability of p values over replication. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 8, 51-62. Abstract is here.

Coulson, M., Healey, M., Fidler, F., & Cumming, G. (2010). Confidence intervals permit, but do not guarantee, better inference than statistical significance testing. Frontiers in Quantitative Psychology and Measurement, 1:26. Full paper is here.

Belia, S., Fidler, F., Williams, J., & Cumming, G. (2005). Researchers misunderstand confidence intervals and standard error bars. Psychological Methods, 10, 389-396. Abstract is here.

I confess that all those studies date from pre-Open-Science times, so there was no preregistration, and little or no replication. Opportunity!


A Second Edition of ITNS? Here’s the Latest

Our first blog post about a possible second edition of ITNS is here. All the comments I made there, and the questions I asked, remain relevant. We’ve had some very useful feedback and suggestions, but we’d love more. You could even tell us about aspects of ITNS that you think work well. Thanks!

Meanwhile Routledge has been collecting survey responses from users, adopters, and potential adopters of ITNS. We are promised the collated responses in the next few weeks.

Back to that first post. Scroll down below the post to see some valuable comments, including two substantial contributions from Peter Baumgartner.

Peter has also been in touch with us to discuss various possibilities and, most notably, to send a link to some goodies he has sketched. One of Peter’s interests is effective didactic methods, and he has made positive comments about some of the teaching strategies and exercises in ITNS–especially those that prompt students to take initiative, explore, and discover. He’s keen to see development of these and so, as I mentioned, he has sketched some possibilities.

Peter’s Goodies

Peter’s sketches start here. Then he has a whole sequence of goodies that start here. He asks me to emphasise that he doesn’t claim to be an expert programmer and that he’s largely using tools (including shiny on top of R, and the R tutorial package learnr) that are rather new to him. (So one of the lessons for us is that it may (!) be comparatively easy to use such tools to build new components for ITNS.)

The figure below is from a Shiny app, one of the last of Peter’s goodies. It shows how a stacked dot plot and a histogram can be linked together, with both updating as the user changes the number of bins. ESCI has a version of this, which also allows clicking to display the mean, median, and various other aspects of the data set. Peter emphasises that it should be comparatively easy to build such displays in Shiny, with the advantage of the full power and flexibility of R.

Thanks Peter! For sure, finding a good way to do more in R is a priority for us. Should we try to replace ESCI entirely? Retain ESCI for some simulations that are useful for learners, while transferring more of the data analysis to R? Is jamovi the way to go, perhaps alongside Shiny apps? What’s the best way in R to build detailed, appealing graphics–with onscreen user controls–such as ESCI attempts to provide?

Please, let’s have your thoughts, either below, or via email to Bob or me. Thanks!



Play, Wonder, Empathy – Latest Educational Trends, Says The Open University

My long-time friend and colleague Mike Sharples told me about the recently released Innovating Pedagogy 2019 report from The Open University (U.K.). It’s the seventh in an annual series initiated by Mike. Each report aims to describe a number of promising trends in learning and teaching. There’s not much by way of formal evaluation of effectiveness and outcomes, but there are illuminating examples, and leads and links to resources to help adoption and further development.

The 2019 report describes 10 trends, as listed below. At this website you can click for brief summaries of any of the 10 that takes your fancy. There are also links to the previous six reports.

It strikes me that several of the 10 deserve thought, from the point of view of improving how we teach intro statistics. The one that immediately caught my eye was wonder.

I’ve always found randomness, and random sampling variability, to be a source of wonder. People typically don’t appreciate the wonder of randomness: in the short term it is totally unpredictable and often surprising, even astonishing, yet in the long term the laws of probability dictate that the observed proportions of particular outcomes will be very close to what we expect.
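
That short-term/long-term contrast is easy to see in a quick simulation (a minimal Python sketch of my own, not from any of the materials mentioned here): proportions over a handful of coin flips bounce all over the place, yet over a million flips they hug .5.

```python
import random

random.seed(1)  # for reproducibility

def prop_heads(n):
    """Proportion of heads in n flips of a fair coin."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

# Short runs are wild; long runs settle near the expected .5
for n in (10, 100, 10_000, 1_000_000):
    print(f"{n:>9} flips: proportion of heads = {prop_heads(n):.4f}")
```

Run it a few times with different seeds: the 10-flip proportions are all over the place, while the million-flip proportions barely move.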

Prompted by the examples and brief discussion in the report of wonder, I can think of my years of work with the dances (of the means, of the CIs, of the p values, and more) as aiming to bring the wonder of randomness to students. Often we’ve discussed patterns and predictions and the hopelessness of making short-term predictions. We’ve compared the dances we see on screen–dancing before our eyes–with physical processes in the world that we might regard as random. (To see the dances, use ESCI, or go to YouTube and search for ‘dance of the p values’ and ‘significance roulette’.)
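
The dance of the p values is easy to reproduce yourself. Here's a minimal sketch (my own, using a z-test with the SD assumed known, to stay within the Python standard library): replications of the very same experiment, with the very same true effect, give wildly different p values.

```python
import math
import random

random.seed(2)

def replication_p(n=32, true_effect=0.5):
    """One replication: two groups of n, true difference 0.5 SD, SD known = 1.
    Returns the two-sided p value from a z-test."""
    g1 = [random.gauss(0, 1) for _ in range(n)]
    g2 = [random.gauss(true_effect, 1) for _ in range(n)]
    diff = sum(g2) / n - sum(g1) / n
    z = diff / math.sqrt(2 / n)
    return 1 - math.erf(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

# 20 replications of the same experiment: watch p dance
print(sorted(round(replication_p(), 3) for _ in range(20)))
```

Typically the list spans everything from well below .01 to well above .5, which is exactly the point of the dance.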

I suggest it’s worth poking about in this latest report, and in the earlier reports, for trends that might spark your own thinking about statistics teaching and learning.


The ten 2019 trends:

  • Playful learning – Evoke creativity, imagination and happiness

  • Learning with robots – Use software assistants and robots as partners for conversation

  • Decolonising learning – Recognize, understand, and challenge the ways in which our world is shaped by colonialism

  • Drone-based learning – Develop new skills, including planning routes and interpreting visual clues in the landscape

  • Learning through wonder – Spark curiosity, investigation, and discovery

  • Action learning – Team-based professional development that addresses real and immediate problems

  • Virtual studios – Hubs of activity where learners develop creative processes together

  • Place-based learning – Look for learning opportunities within a local community and using the natural environment

  • Making thinking visible – Help students visualize their thinking and progress

  • Roots of empathy – Develop children’s social and emotional understanding

Sizing up behavioral neuroscience – a meta-analysis of the fear-conditioning literature

Inadequate sample sizes are kryptonite to good science–they produce waste, spurious results, and inflated effect sizes.  Doing science with an inadequate sample is worse than doing nothing. 

In the neurosciences, large-scale surveys of the literature show that inadequate sample sizes are pervasive in some subfields, but not in others (Button et al., 2013; Dumas-Mallet, Button, Boraud, Gonon, & Munafò, 2017; Szucs & Ioannidis, 2017).  This means we really need to do a case-by-case analysis of different fields/paradigms/assays. 

Along these lines, Carneiro et al. recently published a meta-analysis examining effect sizes and statistical power in research using fear conditioning (Carneiro, Moulin, Macleod, & Amaral, 2018).  This is a really important analysis because fear conditioning has been an enormously useful paradigm for studying the nexus between learning, memory, and emotion.  Study of this paradigm has shed light on the neural circuitry for processing fear memories and their context, on the fundamental mechanisms of neural plasticity, and more.  Fear conditioning also turns out to be a great topic for meta-analysis because even though protocols can vary quite a bit (e.g. mice vs rats!), the measurement is basically always expressed as a change in freezing behavior to the conditioned stimulus. 

Carneiro et al. (2018) compiled 122 articles reporting 410 experiments with a fear-conditioning paradigm (yeah…it’s a popular technique, and this represents only the studies reported with enough detail for meta-analysis).   They focused not on the basic fear-conditioning effect (which is large, I think), but on contrasts between fear conditioning in control animals relative to treatments thought to impair or enhance learning/memory. 

On first glance, there seems to be some good news: typical effects reported seem large.  Specifically, the typical “statistically significant” enhancement effect is a 1.5 SD increase in freezing while a typical “statistically significant” impairment effect is a 1.5 SD decrease in freezing.  Those are not qualitative effects, but they are what most folks would consider to be fairly large, and thus fairly forgiving in terms of sample-size requirements.  
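
For concreteness, "a 1.5 SD increase in freezing" is a standardized mean difference (Cohen's d). Here's how that's computed, with made-up freezing scores chosen to give d ≈ 1.5 (an illustration of my own, not data from the paper):

```python
import math
import statistics

# Hypothetical freezing scores (% time) for control vs memory-enhanced groups
control = [42, 35, 50, 38, 45, 40, 48, 36]
treated = [50, 44, 55, 47, 52, 45, 54, 46]

def cohens_d(a, b):
    """Standardized mean difference: mean difference divided by the pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(b) - statistics.mean(a)) / math.sqrt(pooled_var)

print(round(cohens_d(control, treated), 2))  # → 1.5
```

So "1.5 SD" means the group means differ by one-and-a-half pooled standard deviations, which is why it reads as a large effect.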

That sounds good, but there is a bit of thinking to do now.  Not all studies analyzed had high power, even for these large effect sizes.  In fact, typical power was 75% relative to the typical statistically significant effect.  That’s not terrible, but not amazing.  It also means there is probably at least some effect-size inflation going on here, so we need to ratchet down our estimate.  That’s a bit hard to do, but one way is to take only studies that are well-powered relative to the initial effect-size estimates and then calculate *their* typical effect size.  This shrinks the estimated effects by about 20%.  But *that* means that typical power is actually only 65%.  That’s getting a bit worrisome–in terms of waste and in terms of effect-size inflation.  And it also means you might want to again ratchet down your estimated effect size… yikes.  Also, though the authors don’t mention it, you do worry that in this field the very well-powered studies might be the ones where folks used run-and-check to hit a significance target (in my experience talking to folks at posters, this is *still* standard operating procedure in most labs).
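
That effect-size inflation is easy to simulate. In this sketch (my own; the true effect and group size are illustrative choices, not the paper's figures), many studies of a true 0.8 SD effect are run with 10 animals per group, and we compare the average observed effect across all studies with the average among only the "statistically significant" ones:

```python
import math
import random

random.seed(3)

TRUE_D, N = 0.8, 10  # hypothetical: true effect 0.8 SD, 10 animals per group

def one_study():
    """Observed standardized effect and whether it reached p < .05 (z-test, SD known = 1)."""
    control = [random.gauss(0, 1) for _ in range(N)]
    treated = [random.gauss(TRUE_D, 1) for _ in range(N)]
    d_obs = sum(treated) / N - sum(control) / N
    sig = abs(d_obs / math.sqrt(2 / N)) > 1.96
    return d_obs, sig

studies = [one_study() for _ in range(20_000)]
all_d = sum(d for d, _ in studies) / len(studies)
sig_d = [d for d, s in studies if s]
print(f"mean d, all studies: {all_d:.2f}")            # close to the true 0.8
print(f"mean d, significant only: {sum(sig_d)/len(sig_d):.2f}")  # clearly inflated
```

Filtering on significance is what inflates the published estimates, which is why the ratchet-down-and-recheck logic above is needed.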

Based on these considerations, the authors estimate that you could achieve 80% power for what they estimate to be a plausible effect size with 15 animals/group.  That wouldn’t be impossible, but it’s a good bit more demanding than current practice–currently only 12% of published experiments are that size or larger. 
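
To get a feel for these numbers, here's a quick simulation-based power calculation (a sketch of my own; I've assumed a plausible effect of about 1 SD purely for illustration, which may not be the paper's exact figure):

```python
import math
import random

random.seed(4)

def power_sim(d, n, reps=20_000, z_crit=1.96):
    """Simulated power of a two-group comparison (z-test, SD known = 1).
    Draws the mean difference directly from its sampling distribution."""
    se = math.sqrt(2 / n)
    hits = sum(abs(random.gauss(d, se) / se) > z_crit for _ in range(reps))
    return hits / reps

# Hypothetical plausible effect of ~1 SD; power at various animals-per-group
for n in (8, 10, 15, 20):
    print(n, round(power_sim(1.0, n), 2))
```

Under that assumption, power climbs from roughly 50-60% at the more typical group sizes of 8-10 to around 80% at 15 per group, which matches the shape of the authors' argument.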

There are some other interesting observations in this paper–it’s well worth a read if you care about these sorts of things.  The take-home I get is: a) this is without a doubt a useful assay that can reveal large enhancement and impairment effects, enabling fruitful mechanistic study, but b) the current literature has at least modestly inadequate sample sizes, meaning that a decent proportion of supposedly inactive treatments are probably active and a decent proportion of supposedly impactful treatments are probably negligible.  Even with what is often seen as a pillar of neuroscience, the literature is probably a decent bit dreck.  That’s incredibly sad and also incredibly unproductive: trying to piece together how the brain works is hard enough; we don’t need to make it harder by salting in pieces from a different puzzle (not sure if that metaphor really works, but I think you’ll know what I mean).  Finally, doing good neuroscience is generally going to cost more in terms of animals, time, and resources than we’ve been willing to admit… but it’s better than doing cheap but unreliable work.




Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376. doi:10.1038/nrn3475
Carneiro, C. F. D., Moulin, T. C., Macleod, M. R., & Amaral, O. B. (2018). Effect size and statistical power in the rodent fear conditioning literature – A systematic review. PLOS ONE, 13(4), e0196258. doi:10.1371/journal.pone.0196258
Dumas-Mallet, E., Button, K., Boraud, T., Gonon, F., & Munafò, M. (2017). Low statistical power in biomedical science: a review of three human research domains. Royal Society Open Science, 4(2), 160254. doi:10.1098/rsos.160254
Szucs, D., & Ioannidis, J. P. A. (2017). Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLOS Biology, 15(3), e2000797. doi:10.1371/journal.pbio.2000797

Sadly, Dichotomous Thinking Persists in HCI Research

A few words about the latest from Pierre Dragicevic. He’s an HCI researcher in Paris who totally gets the need for the new statistics. I’ve written about his work before, here and here. Now, with colleague Lonni Besançon, he reports a study of how HCI researchers have reported statistical inference over the period 2010 – 2018. It’s a discouraging picture, but with glimmers of hope.

The study is:

Besançon, L., & Dragicevic, P. (2019). The continued prevalence of dichotomous inferences at CHI. It’s here.

Lonni and Pierre took the 4234 articles in the CHI conference proceedings from 2010 to 2018–these proceedings are one of the top outlets for HCI research. They used clever software to scan the text of all the articles, searching for words and symbols indicating how statistical inference was carried out and reported.

I recall the many journal studies that I carried out with students, 10 to 20 years ago. We did it all ‘manually’: we scanned articles and filled in complicated coding sheets as we found signs of our target statistical or reporting practices. Analysing a couple of hundred articles was a huge task, as we trained coders, double-coded, and checked for coding consistency. Computer analysis of text has its limitations, but a sample size of 4,000+ articles is impressive!

Here’s a pic summarising how NHST appeared:

About 50% of papers reported p values and/or included language suggesting interpretation in terms of statistical significance. This percentage was pretty much constant over 2010 to 2018, with only some small tendency for increased reporting of exact, rather than relative, p values over time. Sadly, dichotomous decision making seems just as prevalent in HCI research now as a decade ago. 🙁

If you are wondering why only 50% of papers, note that in HCI many papers are user studies, with 1 or very few users providing data. Qualitative methods and descriptive statistics are common. The 50% is probably pretty much all papers that reported statistical inference.

What about CIs? Here’s the picture:

Comparatively few papers reported CIs and, of those, a big majority also reported p values and/or used significance language. Only about 1% of papers (40 of 4234) reported CIs without p values or any mention of statistical significance. The encouraging finding, however, was that the proportion of papers reporting CIs increased from around 6% in 2010 to 15% in 2018. Yay! But still a long way to go.

For years, Pierre has been advocating change in statistical practices in HCI research–he can probably take credit for some of the improvement in CI reporting. But, somehow, deeply entrenched tradition and the seductive lure of saying ‘statistically significant’ and concluding (alas!) ‘big, important, publishable, next grant, maybe Nobel Prize!’ persists. In HCI as in many other disciplines.

Roll on, new statistics and Open Science!


Internal Meta-Analysis: The Latest

I recently wrote in favour of internal meta-analysis, which refers to meta-analysis that integrates evidence from two or more studies on more-or-less the same question, all coming from the same lab and perhaps reported in a single article. The post is here.

This month’s issue of Significance magazine carries an article that also argues in favour of internal meta-analysis, which it refers to as single paper meta-analysis.

McShane, B. B. & Böckenholt, U. (2018). Want to make behavioural research more replicable? Promote single paper meta-analysis. Significance, December, 38-40. (The article is behind a paywall, so I can’t give a link to the full paper.)

The article provides a link to software that, the authors claim, makes it easy to carry out meta-analysis, using their recommended hierarchical (or multilevel) model fit to the individual-level observations.


P.S. Note that Blakely McShane, the first author, is also first author of the article Abandon Statistical Significance that I recently wrote about–see here.

Abandon Statistical Significance!

That’s the title of a paper accepted for publication in The American Statistician. (I confess that I added the “!”) The paper is here. Scroll down below to see the abstract. The paper boasts an interdisciplinary team of authors, including Andrew Gelman of blog fame.

I was, of course, eager to agree with all they wrote. However, while there is much excellent stuff and I do largely agree, they don’t go far enough.

Down With Dichotomous Thinking!

Pages 1-11 discuss numerous reasons why it’s crazy to dichotomise results, whether on the basis of a p value threshold (.05, .005, or some other value) or in some other way–perhaps noting whether or not a CI includes zero, or the Bayes Factor is greater than some threshold. I totally agree. Dichotomising results, especially as statistically significant or not, throws away information, is likely to mislead, and is a root cause of selective publication, p-hacking, and other evils.

So, What to Do?

The authors state that they don’t want to ban p values. They recommend “that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with currently subordinate factors (e.g., related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain) as just one among many pieces of evidence.” (abstract)

That all seems reasonable. Yes, if p values are mentioned at all they should be considered as a graduated measure. However, the authors “argue that it seldom makes sense to calibrate evidence as a function of the p-value.” (p. 6) Yet in later examples, for example in Appendix B, they interpret p values as indicating strength of evidence that some effect is non-zero. I sense an ambivalence: The authors present strong arguments against using p values, but cannot bring themselves to make the logical next step and not use them at all.

Why not? One reason, I think, is that they don’t discuss in detail any other way that inferential information from the data can be used to guide discussion and interpretation, alongside the ‘subordinate factors’ that they very reasonably emphasise all through the paper. For me, of course, that missing inferential information is, most often, estimation information. Once we have point and interval estimates from the data, p values add nothing and are only likely to mislead.

In the context of a neuroimaging example, the authors state that “Plotting images of estimates and uncertainties makes sense to us” (p. 15). Discussing pure research they state “we would like to see researchers simply report results: estimates, standard errors, confidence intervals, etc., with statistically inconclusive results being relevant for motivating future research.” That’s fine, but a long way short of recommending that point and interval estimates almost always be reported and almost always used as the primary data-derived inferential information to guide interpretation.

Meta-analytic Thinking

There is a recommendation for “increased consideration of models that use informative priors, that feature varying treatment effects, and that are multilevel or meta-analytic in nature” (p. 16). That hint is the only mention of meta-analysis.

For me, the big thing missing in the paper is any sense of meta-analytic thinking–the sense that our study should be considered as a contribution to future meta-analyses, as providing evidence to be integrated with other evidence, past and future. Replications are needed, and we should make it as easy as possible for others to replicate our work. From a meta-analytic perspective, of course we must report point and interval estimates, as well as very full information about all aspects of our study, because that’s what replication and meta-analysis will need. Better still, we should also make our full data and analysis open.

For a fuller discussion of estimation and meta-analysis, see our paper that’s also forthcoming in The American Statistician. It’s here.


Here’s the abstract of McShane et al.:

Open Science DownUnder: Simine Comes to Town

A week or two ago Simine Vazire was in town. Fiona Fidler organised a great Open Science jamboree to celebrate. The program is here and a few of the sets of slides are here.

Simine on the credibility revolution

First up was Simine, speaking to the title THE CREDIBILITY REVOLUTION IN PSYCHOLOGICAL SCIENCE. Her slides are here. She reminded us of the basics then explained the problems very well. Enjoy her pithy quotes and spot-on graphics.

My main issue with her talk, as I said at the time, was the p value and NHST framework that she used. I’d love to see the parallel presentation of the problems and OS solutions, all set out in terms of estimation. Of course it’s easy to cherry-pick and do other naughty things when using CIs, but, as we discuss in ITNS, there should be less pressure to p-hack, and the lengths of the CIs give additional insight into what’s going on. Switching to estimation doesn’t solve all problems, but should be a massive step forward.

A vast breadth of disciplines

Kristian Camilleri described the last few decades of progress in history and philosophy of science. Happily, there’s now much HPS interest in the practices of human scientists. So there’s lots of overlap with the concerns of all of us interested in developing OS practices.

Then came speakers from psychology (naturally), but also evolutionary biology, law, statistics, ecology, oncology, and more. I mentioned the diversity of audiences I’ve been invited to address this year on statistics and OS issues–from Antarctic research scientists to cardiothoracic surgeons.

Mainly we noted the commonality of problems of research credibility across disciplines. To some extent core OS offers solutions; to some extent situation-specific variations are needed. A good understanding of the problems (selective publication, lack of replication, misleading statistics, lack of transparency…) is vital, in any discipline.


Fiona’s own research group at The University of Melbourne is IMeRG (Interdisciplinary MetaResearch Group). It is, as its title asserts, strongly interdisciplinary in focus. Researchers and students in the group outlined their current research progress. See the IMeRG site for topics and contact info.

Predicting the outcome of replications

Bob may be the world champion at selecting articles that won’t replicate: I’m not sure of the latest count, but I believe only 1 or 2 of the dozen or so articles that he and his students have very carefully replicated have withstood the challenge. Only 1 or 2 of their replications have found effects of anything like the original effect sizes. Most have found effect sizes close to zero. 

Several projects have attempted to predict the outcome of replications, then assessed the accuracy of the predictions. Fiona is becoming increasingly interested in such research, and ran a Replication Prediction Workshop as part of the jamboree. I couldn’t stay for that, but she introduced it as practice for larger prediction projects she has planned.

You may know that cyberspace has been abuzz this last week or so with the findings of Many Labs 2, a giant replication project in psychology. Predictions of replication outcomes were collected in advance: Many were quite accurate. A summary of the prediction results is here, along with links to earlier studies of replication prediction.

It would be great to know what characteristics of a study are the best predictors of successful replication. Short CIs and large effects no doubt help. What else? Let’s hope research on prediction helps guide development of OS practices that can increase the trustworthiness of research.


P.S. The Australasian Meta-Research and Open Science Meeting 2019 will be held at The University of Melbourne, Nov 7-8 2019.

Cabbage? Open Science and cardiothoracic surgery

“The best thing about being a statistician is that you get to play in everyone’s backyard.” –a well-known quote from John Tukey.

Cabbage? That’s CABG–see below.

A week or so ago Lindy and I spent a very enjoyable 5 days of sun, surf, and sand at Noosa Heads in Queensland. I spoke at the Statistics Day of the Annual Scientific Meeting of ANZSCTS (Australian and New Zealand Society of Cardiothoracic Surgeons). The program is here (scroll down to p. 18).

My first talk, to open the day, was “Setting the scene–problems with current design, analysis and reporting of medical research”. The slides are here.

In the afternoon I spoke on “‘Open science’–the answer to the problem?”. The slides are here.

Once again, I learned that:

  • The problems of selective publication, lack of reproducibility, and lack of full access to data and materials are, largely, common across numerous disciplines. And many researchers have increasing awareness of such problems.
  • Familiar Open Science practices (preregistration, open materials and data, publishing whatever the results, …) have wide applicability. However, each discipline and research field needs to develop its own best strategies for achieving, as well as it can, Open Science goals.

Technology races on…

I referred to a 2018 meta-analysis (pic below) that combined the results of 7 RCTs that compared two ways to rejoin the two halves of the sternum (breast bone) after open-chest surgery. The conclusion was that there’s not much to choose between wires and traditional suturing.

That was a 2018 article, but two commercial exhibitors were touting the advantages of devices that they claimed were better than either procedure assessed in the Pinotti et al. review. One was a metal clamp that has, apparently, been used for thousands of patients in China and has just been approved for use in Australia, on the basis of one RCT. The second looked like up-market plastic cable ties.

Open Science may set out ideal practice for researchers, but meanwhile regulators and practitioners must constantly make judgments on the basis of possibly less than ideal amounts of evidence, and less than desirable levels of precision of estimates.

PCI or CABG? Just run a replication!

PCI is percutaneous coronary intervention, usually the insertion of a stent in a diseased section of coronary artery. The stent is typically inserted via a major blood vessel, for example the femoral artery from the groin.

CABG (“Cabbage”) is the much more invasive coronary artery bypass grafting, which requires open-chest surgery.

How do they compare? Arie Pieter Kappetein told us the fascinating story of  research on that question. He described the SYNTAX study, a massive comparison of PCI and CABG that involved 85 centres across the U.S. and Europe. At the 5-year follow-up stage, little overall difference was found between the two very different techniques. Some clinical advice could be given. There were many valuable subgroup analyses, some of which gave only tentative conclusions.

Replication was needed! More than 5 years and $80M later, he could describe results from the even larger EXCEL study. Again, there were many valuable insights and little overall difference, and the researchers are now seeking funding to follow the patients beyond 5 years. Recently his team has published a patient-level meta-analysis of results from 11 randomised trials involving 11,518 patients. Some valuable differences were identified and recommendations for clinical practice were made but, again, there was little overall difference in several of the most important outcomes–such as death.

So, in some fields, replication, if possible at all, is rather more challenging than simply running another hundred or so participants on your simple cognitive task!


Some of the most interesting papers I attended were retrospective studies of cases sourced from large patient databases. Such databases, as large and detailed as possible, are a highly valuable research resource. One seminar was devoted to the practicalities of setting up a major thoracic database, alongside the existing Australian cardiac database. The vast range of practicalities to be considered made clear how challenging it is to set up and keep running such databases.

Coincidentally, The New Yorker that week published a wonderful article by Atul Gawande–one of my favourite writers–with the title Why Doctors Hate Their Computers. It seemed to me so relevant to that day’s cardiothoracic database discussions.

I hope you never have to worry about whether to prefer PCI or cabbage!


Internal Meta-Analysis: Useful or Disastrous?

A recent powerful blog post (see below) against internal meta-analysis prompts me to ask the question above. (Actually, Steve Lindsay prompted me to write this post; thanks Steve.)

In ITNS we say, on p. 243: “To carry out a meta-analysis you need a minimum of two studies, and it can often be very useful to combine just a few studies. Don’t hesitate to carry out a small-scale meta-analysis whenever you have studies it would be reasonable to combine.”

Internal meta-analysis
The small number to be combined could be published studies, or your own studies (perhaps to appear in a single journal article) or, of course, some of each. Combining your own studies is referred to as internal meta-analysis. It can be an insightful part of presenting, discussing, and interpreting a set of closely related studies. In Lai et al. (2012), for example, we used it to combine the results from three studies that used three different question wordings to investigate the intuitions of published researchers about the sampling variability of the p value. (Those studies are from the days before preregistration, but I’m confident that our analysis and reporting was straightforward and complete.)
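
Mechanically, a small-scale meta-analysis can be as simple as an inverse-variance weighted average of the study effect sizes. Here's a minimal fixed-effect sketch (the effect sizes and standard errors below are made up for illustration, not taken from Lai et al.):

```python
import math

# Hypothetical effect sizes and standard errors from three closely related studies
studies = [(0.42, 0.18), (0.30, 0.15), (0.55, 0.22)]

# Fixed-effect pooling: weight each study by the inverse of its variance
weights = [1 / se**2 for _, se in studies]
pooled = sum(w * es for (es, _), w in zip(studies, weights)) / sum(weights)
se_pooled = math.sqrt(1 / sum(weights))

print(f"pooled estimate: {pooled:.2f}")
print(f"95% CI: [{pooled - 1.96*se_pooled:.2f}, {pooled + 1.96*se_pooled:.2f}]")
```

Note how the pooled SE comes out smaller than any single study's SE: that shorter CI is the payoff of combining.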

The case against
The blog post is from the p-hacking gurus and is here. The main message is summarised in this pic:

The authors argue that even a tiny amount of p-hacking of each included study, and/or a tiny amount of selection of which studies to include, can have a dramatically large biasing effect on the result of the meta-analysis. They are absolutely correct. They frame their argument largely in terms of p values and whether or not a study, or the whole meta-analysis, gives a statistically significant result.
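
Their point is easy to verify with a toy simulation (my own sketch, not theirs): suppose every included study has a true effect of exactly zero, and the only QRP is that each lab reports whichever of two outcome measures happens to show the larger effect. That tiny per-study distortion balloons at the meta-analytic level:

```python
import math
import random

random.seed(6)

N, K = 20, 5  # hypothetical: 20 per group, 5 studies per meta-analysis
SE = math.sqrt(2 / N)  # SE of each study's standardized effect (SD known = 1)

def study(hack):
    """One null study (true effect 0). If hacked, report the better of two outcomes."""
    e1 = random.gauss(0, SE)
    if not hack:
        return e1
    e2 = random.gauss(0, SE)
    return max(e1, e2)  # tiny QRP: pick the outcome with the larger effect

def meta_significant(hack):
    """Does the fixed-effect pooled estimate come out 'significantly' positive?"""
    pooled = sum(study(hack) for _ in range(K)) / K
    return pooled / (SE / math.sqrt(K)) > 1.96

for hack in (False, True):
    rate = sum(meta_significant(hack) for _ in range(10_000)) / 10_000
    print(f"hacked={hack}: meta-analysis 'significant' in {rate:.1%} of runs")
```

With no hacking the spurious-positive rate sits at the nominal 2.5%; with the one tiny QRP per study it jumps several-fold, even though each individual study looks almost innocent.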

Of course, I’d prefer to see no p values at all, and the whole argument made in terms of point and interval estimates–effect sizes and CIs. Using estimation should decrease the temptation to p-hack, although estimation is of course still open to QRPs: results are distorted if choices are made in order to obtain shorter CIs. Do that for every study and the CI on the result of the meta-analysis is likely to be greatly and misleadingly shortened. Bad!

Using estimation throughout should not only reduce the temptation to p-hack, but also assist understanding of each study, and the whole meta-analysis, so may reduce the chance that an internal meta-analysis will be as misleading as the authors illustrate.

Why internal?
I can’t see why the authors focus on internal meta-analysis. In any meta-analysis, a small amount of p-hacking in even a handful of the included studies can easily lead to substantial bias. At least with an internal meta-analysis, which brings together our own studies, we have full knowledge of the included studies. Of course we need to be scrupulous to avoid p-hacking any study, and any biased selection of studies, but if we do that we can proceed to carry out, interpret, and report our internal meta-analysis with confidence.

The big meta-analysis bias problem
It’s likely to be long into the future before many meta-analyses can include only carefully preregistered and non-selected studies. For the foreseeable future, many or most of the studies we need to include in a large meta-analysis carry risks of bias. This is a big problem, probably without any convincing solution short of abandoning just about all research published earlier than a few years ago. Cochrane attempts to tackle the problem by having authors of any systematic review estimate the extent of various types of biases in each included study, but such estimates are often little more than guesses.

Our ITNS statement
I stand by our statement in favour of internal meta-analysis. Note that it is made in the context of a book that introduces Open Science ideas in Chapter 1, and discusses and emphasises them in many places, including in the meta-analysis chapter. Yes, Open Science practices are vital, especially for meta-analysis! Yes, bias can compound alarmingly in meta-analysis! However, the problem may be less for a carefully conducted internal meta-analysis, rather than more.


Lai, J., Fidler, F., & Cumming, G. (2012). Subjective p intervals: Researchers underestimate the variability of p values over replication. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 8, 51-62. doi:10.1027/1614-2241/a000037