Pre-Print – The New Statistics for Better Science

We have a new preprint on how the New Statistics can save the world (sort of).

It’s for a special issue of The American Statistician on the theme of “Beyond p values”.

We welcome your feedback via email, Twitter (@TheNewStats), or a comment on this blog.

Open Science Goes to the Antarctic–Well, Nearly

Have you ever met a Professor of Seaweed? No, nor had I, but now I have: Catriona Hurd. More about her in a moment.

I’m just back from two highly enjoyable days visiting IMAS, the Institute of Marine and Antarctic Studies, part of the University of Tasmania. IMAS occupies an impressive new building right on the harbour in Hobart. Inside on the walls are stunning panorama pictures from Antarctica. There are various giant whale skulls and other specimens scattered about, along with piles of heavy duty crates with quarantine stickers, no doubt used to bring back samples from down south.

As an Antarctic tragic since childhood–see my Grade 4 project ‘Nature in the Frozen South’–I was delighted to be invited by Jon Havenhand to give a statistics talk and workshop at IMAS. Jon is at IMAS on sabbatical from Sweden. He, Catriona, and some colleagues have been worried about some statistical practices in their research fields, and keen to know more about the new statistics and other recent developments.

My talk and workshop seemed to be well received, with lots of comments along the lines that many IMAS researchers need to be thinking seriously about how the issues I raised should be addressed in their own fields. They were certainly receptive to the notion that many Open Science practices may be just as relevant–and needed–in their disciplines as in psychology. My impression was that they may find it less difficult to move on from p values to effect sizes and estimation than do many researchers in psychology.

The slides for my talk are here, and for my workshop are here.

I was lucky enough 10 years ago to go with 8 friends for 7 weeks in a small motor yacht, Australis, for a trip to the Antarctic Peninsula, South Georgia, and the Falklands. Magic. Tho’ that just makes me more envious of those IMAS scientists whose research takes them down south for months at a time.

Two conclusions from my IMAS visit:

1. John Tukey was spot on when he famously said “The best thing about being a statistician is that you get to play in everyone’s backyard.”

2. Open Science (and the new statistics) is relevant, and needed, across numerous disciplines, some as far afield as the Antarctic.

Thanks Jon, Catriona, and others for the invitation and hosting.

P.S. Will you join the gentoo penguin in trying to hassle the slumbering crabeater seal?!

Not replicable, but citable

What happens to the reputation of a paper when the results reported cannot be replicated?

Here’s a graph of citations/year for two studies–an original and a replication study that found little to no effect. It’s just one example, but it doesn’t seem like the replication study has had much impact on citations to the original article. There is a brief fall-off (which is actually normal as a paper gets older), but there has lately been a rebound (which is rare; most papers don’t have much longevity in terms of citation history). Most interestingly, you can see that the majority of those citing the original research are not citing the replication study, so most aren’t even raising questions about the results.

I suppose this isn’t surprising, given that even retracted papers earn citations, and it is estimated that retraction only cuts citation yields by about half (Grieneisen & Zhang, 2012) (plus a bunch of other references).  It would be interesting and useful to look more systematically at this question as it relates to replications (and to see if a successful replication helps).

The details: The original study (Damisch, Stoberock, & Mussweiler, 2010) was published in Psych Science and suggested that belief in luck can produce substantial improvements in motor skill (e.g. having a lucky golf ball increases golf performance). The replication paper is Calin-Jageman & Caldwell (2014), a study I helped complete. We conducted 2 direct and well-powered replications of one of the key studies in the original article (the lucky golf ball study). The replications indicated little to no effect of superstition on motor skill. There’s no such thing as Truth with a capital T in science, but I don’t think there’s a lot of ambiguity on this one. Not only are the replications convincing, but there is good reason to believe that placebo effects have boundary conditions: believing you will do well at golf shouldn’t magically make you substantially better.
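To make the effect-size language concrete, here’s a minimal sketch (mine, not the authors’; the putting scores below are invented, not data from either paper) of the standardized mean difference, Cohen’s d, that such a replication estimates:

```python
from statistics import mean, stdev

def cohens_d(g1, g2):
    """Standardized mean difference, using the pooled standard deviation."""
    n1, n2 = len(g1), len(g2)
    pooled_var = ((n1 - 1) * stdev(g1) ** 2 + (n2 - 1) * stdev(g2) ** 2) / (n1 + n2 - 2)
    return (mean(g1) - mean(g2)) / pooled_var ** 0.5

# Invented putting scores (successful putts out of 10), for illustration only
lucky_ball = [6, 8, 7, 5, 9, 7]
neutral_ball = [5, 7, 6, 6, 8, 6]
print(f"d = {cohens_d(lucky_ball, neutral_ball):.2f}")
```

A point estimate like this is only informative alongside its confidence interval, which is what a well-powered replication can narrow down toward (or away from) zero.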



Calin-Jageman, R. J., & Caldwell, T. L. (2014). Replication of the Superstition and Performance Study by Damisch, Stoberock, and Mussweiler (2010). Social Psychology, 45(3), 239–245. doi:10.1027/1864-9335/a000190
Damisch, L., Stoberock, B., & Mussweiler, T. (2010). Keep Your Fingers Crossed! Psychological Science, 21(7), 1014–1020. doi:10.1177/0956797610372631
Grieneisen, M. L., & Zhang, M. (2012). A Comprehensive Survey of Retracted Articles from the Scholarly Literature. PLoS ONE, 7(10), e44118. doi:10.1371/journal.pone.0044118

It’s not all bad news

Here’s a cool pre-print examining the quality of evidence in studies of the genetics of short-term memory in fruit flies (Tumkaya, Ott, & Claridge-Chang, 2018). The paper conducts a meta-analysis of different genes that have been linked to altered olfactory memory. There’s lots of good news. Most genes were identified via studies with internal direct replications and large sample sizes. No hint of publication bias was detected. Effect sizes across replications were largely stable–no decline effects were observed. Yeah!

The only glimmer of bad news is that most of the genes have not been independently replicated, suggesting either that no one is interested in them or that external replications have been conducted but not published (in fact, not a single disputing paper was identified).

It seems clear that when researchers use adequate sample sizes, the resulting evidence base comes out much more reliable and interpretable.   Hooray for fly labs.



Tumkaya, T., Ott, S., & Claridge-Chang, A. (2018, January 13). A systematic review and meta-analysis of Drosophila short-term-memory genetics: robust reproducibility, but little independent replication. Cold Spring Harbor Laboratory. doi: 10.1101/247650

Banning p values? The journal ‘Political Analysis’ does it

Back in the 1980s, epidemiologist Kenneth Rothman was a leader of those trying to persuade researchers across medicine and the biosciences to use CIs routinely. The campaign was successful to the extent that the International Committee of Medical Journal Editors stated that CIs, or equivalent, should always be reported and that researchers should not rely solely on p values. Since then, the great majority of empirical articles in medicine have reported CIs, although often the intervals are not discussed or used as the basis for interpretation, and p values remain close to universal.

Rothman went further, always arguing that p values should simply not be used, at all. He founded a new journal, Epidemiology, in 1990 and was chief editor for close to a decade. He announced at the start that the journal would not publish any p values. We reported an evaluation of his bold experiment in Fidler et al. (2004). We found that he succeeded–virtually no p values appeared in the journal during his tenure as editor. Epidemiology demonstrated that good science can flourish totally without p values; CIs were usually the basis for inference. Wonderful!
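What does inference by CI rather than p value look like in practice? Here’s a minimal sketch (my illustration, with invented data, not anything from Epidemiology): report the effect size with an interval and interpret that, rather than a significance verdict. The normal-approximation interval below is adequate for moderate-to-large samples; a t-based interval would be slightly wider at these sample sizes.

```python
from statistics import NormalDist, mean, stdev

def mean_diff_ci(g1, g2, confidence=0.95):
    """Difference in group means with a normal-approximation CI."""
    diff = mean(g1) - mean(g2)
    se = (stdev(g1) ** 2 / len(g1) + stdev(g2) ** 2 / len(g2)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se

# Invented exposed/control measurements, for illustration only
exposed = [5.1, 4.8, 5.5, 5.0, 4.9, 5.3, 5.2, 4.7]
control = [4.6, 4.9, 4.4, 4.8, 4.5, 4.7, 5.0, 4.3]
diff, low, high = mean_diff_ci(exposed, control)
print(f"Mean difference = {diff:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

The interval carries the information a p value compresses away: both the size of the effect and the precision with which it has been estimated.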

By contrast, in other cases enterprising editors ‘strongly encouraged’ the use of CIs instead of p values, but did not ban p values outright. For an example in psychology, see Finch et al. (2004). Researchers made more use of CIs, which was an improvement, but p values were still usually reported and used.

Very recently, the incoming editor of the political science journal Political Analysis announced a ban on p values. The editorial announcing the new policy is here.

Here is the key para from the editorial:
“In addition, Political Analysis will no longer be reporting p-values in regression tables or elsewhere. There are many principled reasons for this change—most notably that in isolation a p-value simply does not give adequate evidence in support of a given model or the associated hypotheses. There is an extremely large, and at times self-reflective, literature in support of that statement dating back to 1962. I could fill all of the pages of this issue with citations. Readers of Political Analysis have surely read the recent American Statistical Association report on the use and misuse of p-values, and are aware of the resulting public discussion. The key problem from a journal’s perspective is that p-values are often used as an acceptance threshold leading to publication bias. This in turn promotes the poisonous practice of model mining by researchers. Furthermore, there is evidence that a large number of social scientists misunderstand p-values in general and consider them a key form of scientific reasoning. I hope other respected journals in the field follow our lead.”

Imho that’s a fabulous development. Yes, progress is happening across many disciplines. I’ll be eager to watch how things go in Political Analysis. Of course I join the new editor in the hope expressed in the final sentence above.

P.S. Thanks to Fiona Fidler for the breaking news.

Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2004). Editors can lead researchers to confidence intervals, but can’t make them think: Statistical reform lessons from medicine. Psychological Science, 15, 119-126.

Finch, S., Cumming, G., Williams, J., Palmer, L., Griffith, E., Alders, C., Anderson, J., & Goodman, O. (2004). Reform of statistical inference in psychology: The case of Memory & Cognition. Behavior Research Methods, Instruments & Computers, 36, 312-324.

Pre-registration challenge met!

I (Bob) have met the pre-registration challenge!  I pre-registered a set of replication studies (Calin-Jageman, 2018), and now that they are published, I’ve received confirmation from the Center for Open Science that I have met the challenge–a check for $1,000 will arrive in my mail around July 1st.   What a great little bonus to incentivize good scientific practices!

The project started when I read the original research in late 2015. I developed a replication study, completed a pre-registration template in January 2016, and sent my work to the Center for Open Science to submit for the challenge. I had comments back within days, and after a couple of tweaks my pre-registration was accepted. I ended up doing a lot more than I thought on the project–lots of studies and little issues to track down, and then a very lengthy review process. All in all, it’s been about 2.5 years from idea to publication. Nothing about pre-registration slowed me down–it was easy to pre-register each step in the journey.

You can do it, too!  The process is easy and will help make your science better:


Calin-Jageman, R. J. (2018). Direct replications of Ottati et al. (2015): The earned dogmatism effect occurs only with some manipulations of expertise. Journal of Experimental Social Psychology.

A bracing call for better science when linking genes to brain function

There’s a fantastic editorial out in the European Journal of Neuroscience (Mitchell, 2018) arguing that standards need to be much higher in the field of neurogenomics–that’s the study of how genes relate to differences in brain structure and function.

The editorial is spot on–it concisely reviews the issues of low power, publication bias, lack of replication, etc. Although specifically about neurogenomics, the points made could be applied to many areas of investigation in the neurosciences. It’s well worth a read, but if you don’t have time, here’s the most striking quote:

Underpowered, exploratory studies with high degrees of freedom and without replication samples simply generate noise, waste everyone’s time and resources, and pollute the literature with false positives. They are worse than doing nothing. — Mitchell, 2018



Mitchell, K. J. (2018). Neurogenomics – towards a more rigorous science. European Journal of Neuroscience, 47(2), 109–114. doi: 10.1111/ejn.13801

Open Science Leaders: Dan and Steve Tell Their Stories

BTW, have you noticed that Bob has set up NewStatistics on Twitter–scroll down and see the right-hand sidebar. Do follow us and help spread the word. Thanks!

Dan Simons may be best known as the co-author of the best-seller The Invisible Gorilla. He’s also the founding editor of the new APS statistics and methods journal AMPPS.

Steve Lindsay is an accomplished cognitive psychologist and current Editor-in-Chief of Psychological Science. Steve has developed Open Science policies at this journal, see for example here.

Dan and Steve were interviewed at the recent SIPS meeting. A podcast of the interview has recently been released and is here.

In my opinion it’s a fascinating 47 min of story-telling about Open Science, and how two highly successful researchers came to understand that what they–and pretty much everyone else–had been doing for decades was fundamentally flawed. Only over the last 4-6 years have they come to deeply appreciate the problems of p-hacking, optional stopping, cherry-picking, dichotomous thinking…

To their great credit, they have embraced the new ways of Open Science and have indeed become leaders in the development and wide understanding of Open Science practices.

Here’s an approximate sketch of the podcast timing:

0 to 9.20: Dan and Steve describe how they came to be researchers in psychology

9.20 to 18.40: (The most interesting and important bit, imho.) How they came to understand the basic Open Science issues and their importance.

Includes, from around 13.50, Steve recalling a talk I gave at Victoria University, Wellington, New Zealand in 2012. He says that the dance of the p values, in particular, was for him ‘revelatory’–and a prompt for his journey to Open Science.

18.45 to 30.20: The need to improve statistics education (hey, we hope ITNS can help!), and stories about stats ed and other signs that the shift to Open Science has a way to go.

30.20 to 38.50: Bayesian possibilities

38.50 to end: Current developments, priorities for further change. Wrap.

Thanks Dan and Steve for describing your own journeys to OS, stumbles and all. I hope your stories encourage many others in their own journeys.


P.S. Thanks to Pierre for the heads up about the podcast.

Gaining expertise doesn’t have to close your mind–another adventure in replication

You may have seen it on the news: being an expert makes you closed-minded. This was circa 2015, and the news reports were about this paper (Ottati, Price, Wilson, & Sumaktoyo, 2015) by Victor Ottati’s group, published in JESP. The paper showed an ‘earned dogmatism effect’–finding that “situations that engender self-perceptions of high self-expertise elicit a more closed-minded cognitive style”. I think the extensive news coverage was related to the zeitgeist that still pervades–the anxious sense that there is no rationality and that even those who we hoped would know better do not, etc. Except for just one thing…the research that helped fuel our collective epistemic dread was not, itself, entirely trustworthy.

You could see the warning signs right away. There were two types of experiments in Ottati et al. (2015). In one type, participants were asked to imagine being experts in a social scenario and then to predict how open-minded they would be. These studies were conducted within-subjects with lots of participants, yielding very precise effect-size estimates. But they were based solely on participants guessing how they might behave in an imagined scenario. The second type of experiment is the one that garnered the press attention–participants were given an easy or difficult task, with the easy task being used to generate a sense of expertise (because you were so good at the task). Then participants said how open-minded they actually felt. In these studies, those given the easy task felt *much* less open-minded—BUT the studies were very small, the effect-size estimates were very broad, and there were serious procedural issues (such as differential dropout in the difficult condition). Moreover, there were none of the new best practices that might help instill some confidence in the findings–across the multiple studies none of the between-subjects ones were directly replicated/extended, there was no sample-size planning, no assurance of full reporting, no data sharing…it all felt so “pre-awakening”. In fact, by my reading, the paper violated several of the tenets of the JESP guide to authorship, which had been published earlier that year, prior to the reported submission date of the paper. That really tore me up–weren’t we making any progress?

This question began a 2-year Odyssey of replicating Ottati et al (2015).  I’m pleased that my replication paper is now published, and that it is, in fact, published in JESP, the same journal that published the original (Calin-Jageman, 2018).  What I found is pretty much what I might have predicted when I first read the paper.  The well-powered within-subjects experiments replicated beautifully.  The under-powered between-subjects experiments did not replicate well at all–across multiple attempts with different subject pools I obtained overall effect sizes very close to 0 with narrow confidence intervals.  Participants do predict they will be close-minded in a situation of expertise, but the current best evidence indicates this does not happen in practice (though, who knows–maybe some other way of operationalizing the variables will yield results).

Here are some things I learned during this replication adventure:

  • Ottati and his team are not close-minded.  They were incredibly gracious and cooperative.  I think they’ll be writing a commentary.
  • Absence of evidence is not the same as evidence of absence.  When I read the paper and it had none of the best practices I have come to expect in modern research (sample-size planning, pre-registration, etc.) I thought for sure there was some funny-business going on.  But in emailing back-and-forth it became clear that the researchers had fully reported their design, had not used run-and-check, had not buried unsuccessful research, etc.  They could have helped themselves by making all this clear, but it was good to be reminded that just because researchers haven’t stated the “21-word solution” doesn’t mean they are gaming the system.
  • Having tried really hard to diagnose where things went wrong with the original research, I’m down to two points: inadequate sample size (duh!) and differential dropout. I hadn’t even thought about differential dropout while working on the replications, but then I found this paper about how common and problematic this is with MTurk samples (Zhou & Fishbach, 2016). Sure enough, it opened my eyes–the original Ottati paper always had more MTurk participants in the easy condition than in the difficult condition. I don’t know for sure, but I think that’s what did the original research in. In my case I used much larger samples and I also drew not only on MTurk participants but also other pools (e.g. Psych 101 students) that do not drop out as readily when there is a bit of work to do in an experiment. I need to write another post on why MTurk should be dead to social psychologists.
  • I had to work really hard to convince my section editor at JESP that this paper warranted publication. It was actually rejected initially, and part of the reasoning was that I hadn’t justified why the replications were done or why JESP should publish them. I’m glad they re-considered, but I still consider it axiomatic: if your journal published a paper, you should automatically consider replications of it interesting to your readers, perhaps especially when they revise the purported knowledge already published.
  • I had to cut from the paper a discussion of JESP’s publishing guidelines. Part of my reason for doing this set of replications was to point out that they are either not being enforced or lack the teeth to prevent major Type M errors. But the editor suggested I not discuss this. The reasoning was strange–apparently even though the updated guidelines had been published before Ottati et al. (2015) had been submitted, the editor didn’t seem convinced that anything had actually changed at that point. Interesting! It’s anecdotal, but I keep hearing about journals rolling out impressively strict new guidelines…but not really lifting a finger to train section editors or reviewers to be sure they are implemented. Boo.
  • Ottati and his team pointed out lots of ways the Earned Dogmatism Effect could still be true and are also not nearly as down on the imagined scenarios as I am.  Fair enough.  It will be interesting to see if something solid can be developed from this.  If so, I’d be thrilled.  As it stands, I am happy to have finally have a paper with some good news–the within-subjects imagined/scenario designs replicate very, very well across multiple types of participants.
  • The interactions with Ottati were excellent, but the review process at JESP was not inspiring–it took a long time, one of the reviewers didn’t know anything about replication research or the new statistics, and the third had only very superficial comments. I don’t think my responses went back to any reviewers. Of all the social psych work I’ve now done, this was the least useful and rigorous review process… oh wait, no… 2nd worst, behind a PLoS ONE paper.
  • I keep running into the misconception that only Bayesians can support the null hypothesis. Even in a manuscript that reports effect sizes with confidence intervals and interprets them very explicitly throughout in ways that make clear there is good support for the null or something reasonably close to it. That’s a stubborn misconception. Fortunately I was able to get some quick help from EJ Wagenmakers (thanks!) and reported a replication Bayes factor (Ly, Etz, Marsman, & Wagenmakers, 2017). I still don’t believe it adds anything beyond the CIs I had reported, but nothing wrong with another way of summarizing the results.
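The “inadequate sample size (duh!)” point can be made concrete with a standard back-of-envelope power calculation (my sketch, using a normal approximation to the two-group t test, not a calculation from either paper):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group to detect a standardized effect d
    in a two-independent-groups design (normal approximation)."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)

print(n_per_group(0.5))  # 'medium' effect: roughly 63 per group
print(n_per_group(0.2))  # 'small' effect: roughly 393 per group
```

Small between-subjects studies are therefore well powered only for effects large enough to be implausible in most social psychology contexts.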


    Calin-Jageman, R. J. (2018). Direct replications of Ottati et al. (2015): The earned dogmatism effect occurs only with some manipulations of expertise. Journal of Experimental Social Psychology. doi: 10.1016/j.jesp.2017.12.008
    Ly, A., Etz, A., Marsman, M., & Wagenmakers, E.-J. (2017). Replication Bayes Factors from Evidence Updating. PsyArXiv. doi: 10.17605/osf.io/u8m2s
    Ottati, V., Price, E. D., Wilson, C., & Sumaktoyo, N. (2015). When self-perceptions of expertise increase closed-minded cognition: The earned dogmatism effect. Journal of Experimental Social Psychology, 61, 131–138. doi: 10.1016/j.jesp.2015.08.003
    Zhou, H., & Fishbach, A. (2016). The pitfall of experimenting on the web: How unattended selective attrition leads to surprising (yet false) research conclusions. Journal of Personality and Social Psychology, 111(4), 493–504. doi: 10.1037/pspa0000056

Say It in Song: Go Forth and Replicate!

Jon Grahe, of Pacific Lutheran University, is an enthusiastic advocate for Open Science and, especially, for student participation in doing Open Science as a key part of education. The Collaborative Replication and Education Project (CREP, pronounced “krape”) is a great project of his, which we discuss on pp. 263 and 475 in ITNS.

Jon’s also a musician, and uses music as yet another way to spread OS messages. The Second Stringers recently released an updated recording of their potential world smash hit number Go Forth & Replicate. (While you are at that OSF site you may care to check out one or two other goodies.)

I particularly liked the creative involvement of the cat’s eye, which becomes not just a graph showing how likelihood or plausibility varies across a CI, but a sort of all-seeing eye on what’s happening in a replication project. Great work, Jon and team!

“…when the cat’s eye blinks…”


And now a pic to admire, the cat’s eye (from Fig 5.1 in ITNS):