Effect Sizes for Open Science

For the last 20 years or so, many journals have emphasised the reporting of effect sizes. The new statistics also emphasised the reporting of CIs on those effect sizes. Now Open Science places effect sizes, CIs, and their interpretation centre stage.

Here’s a recent article with interesting discussion and much good advice about using effect sizes in an Open Science world:
Pek, J., & Flora, D. B. (2018). Reporting effect sizes in original psychological research: A discussion and tutorial. Psychological Methods, 23(2), 208-225. http://dx.doi.org/10.1037/met0000126

The translational (simplified) abstract is down the bottom.

Unfortunately, Psychological Methods is behind the APA paywall, so you will need to find the article via a library. (Sometimes it’s worth searching for the title of an article, in case someone has secreted the pdf somewhere online. Not in this case at the moment, I think.)

A couple of the article’s main points align with what we say in ITNS:

Give a thoughtful interpretation of effect sizes, in context
Choose effect sizes that best answer the research questions and that make most sense in the particular situation. Often interpretation is best done in the original units of measurement, assuming these have meaning–especially to likely readers. Use judgment, compare with past values found in similar contexts, give practical interpretations where possible. Where relevant, consider theoretical and possibly other aspects or implications. (And, we add in ITNS, consider the full extent of the CI in discussing and interpreting any effect size.)

Use standardised effect sizes with great care
Standardised (or units-free) effect sizes, often Cohen’s d or Pearson’s r, can be invaluable for meta-analysis, but it’s often more difficult to give them practical meaning in context. Beware glib resort to Cohen’s (or anyone else’s) benchmarks. Be very conscious of the measurement unit–for d, the standardiser. If, as usual, that’s a standard deviation estimated from the data, it has sampling variability and will be different in a replication. (In my first book, UTNS, I introduced the idea of the rubber ruler. Imagine the measuring stick to be a rubber cord, with knots at regular intervals to mark the units. Every replication results in the cord being stretched to a different extent, so the knots are further or less far apart. The Cohen’s d value is measured in units of the varying distance between knots.)
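A small simulation can make the rubber ruler concrete. The sketch below is only an illustration under stated assumptions: the population values (a true difference of 5 units, a true SD of 10, 20 participants per group) and the function names are hypothetical, chosen for the example rather than taken from UTNS or from Pek and Flora. It draws several replications from the same population and prints the raw difference, the standardiser (pooled SD), and the resulting Cohen's d for each.

```python
import random
import statistics


def cohens_d(group_a, group_b):
    """Return (raw difference, pooled SD, Cohen's d) for two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance, n - 1 denominator
    var_b = statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    raw_diff = statistics.mean(group_b) - statistics.mean(group_a)
    return raw_diff, pooled_sd, raw_diff / pooled_sd


def replicate(n_reps=5, n_per_group=20, true_diff=5.0, true_sd=10.0, seed=1):
    """Simulate replications of the same two-group study (hypothetical values)."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_reps):
        control = [rng.gauss(50.0, true_sd) for _ in range(n_per_group)]
        treated = [rng.gauss(50.0 + true_diff, true_sd) for _ in range(n_per_group)]
        results.append(cohens_d(control, treated))
    return results


# Each replication estimates its own standardiser, so the "knots" move each time.
for raw_diff, pooled_sd, d in replicate():
    print(f"raw diff = {raw_diff:6.2f}   pooled SD = {pooled_sd:5.2f}   d = {d:5.2f}")
```

The pooled SD changes from one replication to the next even though the population SD is fixed, so d varies both because the raw difference varies and because the ruler itself stretches.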

There’s also lots more of interest in this article, imho.

Translational Abstract
We present general principles of good research reporting, elucidate common misconceptions about standardized effect sizes, and provide recommendations for good research reporting. Effect sizes should directly answer their motivating research questions, be comprehensible to the average reader, and be based on meaningful metrics of their constituent variables. We illustrate our recommendations with four different empirical examples involving popular statistical methods such as ANOVA, categorical variable analysis, multiple linear regression, and simple mediation; these examples serve as a tutorial to enhance practice in the research reporting of effect sizes.

APS in San Fran 3: Workshop on Teaching the New Stats

Tamarah Smith and Bob presented a workshop on Teaching the New Stats to an almost sold-out crowd. I wasn’t there, but by all reports it went extremely well. Such a workshop seems to me a terrific way to help interested stats teachers introduce the new stats into their own teaching.

After taking that first step, it may all get easier, because, in my experience, teaching the new stats brings its own reward, in that students understand better and therefore feel better. So we the teachers will also feel the joy.

Tamarah and Bob’s slides are here. It strikes me as a wonderful collection, with numerous links to useful resources, and great advice about presenting an appealing and up-to-date course to beginning students. Also, indeed, to more advanced students. It’s well worth browsing those slides. Here are a few points that struck me as I browsed:

**You can download the workshop files, and follow along.

**You may know that jamovi and JASP are open source applications for statistical analysis designed to supersede commercial applications, notably SPSS. They are more user friendly, as well as being extensible by anyone. Already, add-on modules written in R are beginning to appear. (These are exciting developments, worth trying out.)

**Bob is developing add-ons for jamovi for the new statistics. (Eventually, jamovi augmented by Bob’s modules may replace ESCI–with the great advantages of providing data file management and a gateway to the power and scope of a full data analysis application.)

**The workshop discussed three simple examples (comparison of two independent means, comparison of two independent proportions, and estimation of interactions).

**The first example (independent means) was discussed in most detail, with a comparison of traditional NHST analysis, and new-stats analysis using ESCI then jamovi (with Bob’s add-on); then a Bayesian credible-interval approach. Then meta-analysis, to emphasise that estimation thinking and meta-analytic thinking are essential frameworks for the new stats.

**The GAISE guidelines were used to frame the discussion of the pedagogical approach. There is lots on encouraging students to think and judge in context–which should warm the heart of any insightful stats teacher.

**There are examples of students’ responses to illustrate the presenters’ conviction that conceptual understanding is better when using the new stats.

**There’s a highly useful discussion of a range of statistical software for new-stats learning and data analysis.

There’s lots more, but I’ll close with the neat summary below of the new-stats approach, which is now best ethical practice for conducting research and inferential data analysis.


APS in San Fran 2: Symposium on Teaching the New Stats

Our symposium was titled Open Science and Its Statistics: What We Need to Teach Now. The room wasn’t large, but it was packed, standing room only. I thought the energy was terrific. There were four presentations, as below.

Bob and Tamarah Smith have set up an OSF page on Getting started teaching the New Statistics. It holds all sorts of goodies, including the slides for our symposium.

At that site, expand OSF Storage, then Open Science and its Statistics–2018 APS Symposium slides and see 4 files for the 4 presentations:

Bob Calin-Jageman (Chair)
Open Science and Its Statistics: What We Need to Teach Now
Examples of students being stumped by traditional NHST analysis and presentation of a result, but readily (and happily) understanding the same result presented using the new stats. In addition we should teach and advocate the new statistics to improve statistical understanding and communication across all of science that uses statistical inference.

Geoff Cumming
Open Science is Best Practice Science
Being ethical as a teacher or researcher requires that we use current best practice, and for statistical inference that is the new statistics. The forest plot is, in my experience, a highly effective picture for introducing students to estimation and meta-analysis. I gave a paper advocating its use in the intro stats course back in 2006. In my experience, the new stats is, in contrast to NHST, a joy to teach. Students saying ‘It just all makes sense…’ is one of the most heart-warming things any teacher can hear.

Susan Nolan & Kelly Goedert
Transitioning a Traditional Department: Roadblocks and Opportunities for Incorporating the New Statistics and Open Science into Teaching Materials
Roadblocks include NHST appearing everywhere, common software (SPSS) not making the new stats easy, colleagues who are steeped in the old ways, widely-used textbooks taking traditional approaches, … But there are new textbooks and open source software emerging, and there are strategies for spreading the word and bringing colleagues on board. (See the slides for numerous practical suggestions and links to many useful resources to support teaching and using the new statistics.)

Tamarah Smith
Feeling Good about the New Statistics: How the New Statistics Improves the Way Researchers and Students Feel about Statistics
Statistics anxiety is a problem for many students, and impedes their learning. The new statistics opens up great opportunities for teaching so that anxiety is much reduced and students’ attitudes are more positive. The new statistics helps teachers meet the GAISE guidelines for assessment and instruction in stats education. Students feel better and more engaged and their learning is grounded. As a result their teachers also feel better. Let’s do it.

Personally, I found it wonderful to hear so many examples and reasons all converging on the conclusion that teaching the new statistics (1) is what’s needed for ethical science, (2) helps students understand much better and feel good about their learning, and (3) is great for teachers also. A triple win!


P.S. The pic below is from Bob’s slides and is adapted from Kruschke and Liddell (2018). The crucial thing for Open Science is the shift to estimation and meta-analysis, and away from the damaging dichotomous decision making of NHST. The estimation and meta-analysis can be frequentist (conventional confidence intervals) or Bayesian (credible intervals)–either is fine. In other words, there is a Bayesian new statistics, alongside the new statistics of ITNS. Maybe the Bayesian version will come to be the more widely used? I believe the biggest hurdle to overcome for that to happen is the arrival of good teaching materials that make Bayesian estimation easily accessible to beginning students, as well as to researchers steeped in NHST.

But the main message is that either of the cells in the bottom row is just what we need.

Kruschke, J. K. & Liddell, T. M. (2018). The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25, 178–206.

APS in San Fran 1: Open Science is Maturing

It was, as ever, a great pleasure to catch up with Bob last weekend. We were in San Francisco for the APS Convention. That convention has been for the last five years or so a hotbed of discussion about Open Science–many times I felt I was witnessing decisions about what best practice science now should be. Open Science practices were being identified and refined, as I listened and watched!

This year the convention was back in the same place as four years ago, when my workshop on the new statistics and OS was filmed. That resulted in the six videos available here.

My sense last year and especially this year is that OS practices are maturing. A few years ago we discussed evidence of the replication crisis and possible strategies for improvement. But now OS badges have been in the field for a few years, preregistration is widely recognised, and the emphasis has shifted to actually doing these good things, and providing encouragement and facilities for researchers generally to adopt OS practices.

As an example, it’s no longer a matter of explaining why preregistration helps and encouraging folks to use it; now we discuss the best ways to ensure that preregistration is fully detailed and includes a detailed plan for the statistical analysis.

Again this year there were a number of lively symposia in which OS issues were debated, but now with some data about how we’re tracking. As an example of a great symposium contribution I’ll mention 19 Ways Journal Editors Can Promote Transparency and Replicability by Steve Lindsay, editor-in-chief of Psychological Science. The 19 encompass a very broad range of good practices, from encouraging preregistration to suggesting several strategies for improving statistical practices in articles that make the grade and are published. See a pdf of Steve’s slides here.

Below is a pic from those slides.


The Perils of MTurk, Part 1: Fuel to the Publication Bias Fire?

It’s not going to be a popular opinion, but I think MTurk has become a danger to sound psychological science.  This breaks my heart.  MTurk has helped transform my career for the better.  Moreover, MTurk participants are amazing: they are primarily diligent and honest folks with a genuine interest in the science they are participating in.  Still, I’ve come around to the conclusion that MTurk + current publication practices = very bad science (in many cases).  That’s a worrisome conclusion, given that reliance on MTurk is already fairly overwhelming in some subfields of psychology; 40-50% of all manuscripts in top social psych journals include an MTurk sample (Zhou & Fishbach, 2016).

The Theory: MTurk as an accelerant to noise mining

Here’s the problem: MTurk makes running studies so easy that it exacerbates the publication bias problem.  There are *so* many researchers running *so many* studies.  Yes, you know that–problems with non-naivete are already well-documented with MTurk (Chandler, Mueller, & Paolacci, 2013).  But think about this: it is the nature of science that within a particular field its diverse workforce often works on a relatively small set of problems.  It is this relatively narrow focus within each field that explains why the history of science is so full of instances of parallel/multiple discovery: https://en.wikipedia.org/wiki/Multiple_discovery.

So it’s not just that lots of people are running lots of studies–it’s that often there are large cohorts of researchers who are unwittingly running essentially the same studies.   Then, given our current publication practices, the studies reaching ‘statistical significance’ are published while those that are ‘not significant’ are shelved.   That’s a recipe for disaster.  And, as Leif Nelson recently pointed out, mining noise in this way is not nearly as inefficient as one might expect (http://datacolada.org/71).

So… my theory is basically just a re-telling of the publication bias story, an old saw (Sterling, 1959), though still an incredibly pernicious one. What is new, I think, is the way MTurk has made the opportunity cost of conducting a study so negligible: it’s like fuel being poured on the publication dumpster fire. MTurk dramatically increases the number of people running studies and the number of studies run by each researcher. Moreover, the low opportunity cost means it is less painful for researchers to simply move on if results don’t pan out. With MTurk it costs very little to fill your file drawer while mining noise for publication gold.
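The mechanism can be sketched with a toy simulation. Everything here is an assumption made for illustration: the true effect is set to exactly zero, each study has 50 participants per group, 2,000 such studies are run, and only results that are "significant" in the predicted direction get published. The function names are hypothetical, and the significance check compares the two-sample test statistic with 1.96 (a large-sample approximation, not an exact t test).

```python
import random
import statistics


def run_study(rng, n_per_group=50, true_d=0.0):
    """Simulate one two-group study; return (observed d, significant?)."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    # Equal group sizes, so the pooled variance is the average of the two variances.
    pooled_sd = ((statistics.variance(control) + statistics.variance(treated)) / 2) ** 0.5
    d = (statistics.mean(treated) - statistics.mean(control)) / pooled_sd
    # Two-sample test statistic, compared with the normal critical value 1.96.
    test_stat = d / (2.0 / n_per_group) ** 0.5
    return d, abs(test_stat) > 1.96


def simulate_literature(n_studies=2000, seed=2018):
    """Run many studies of a null effect and 'publish' only the significant hits."""
    rng = random.Random(seed)
    studies = [run_study(rng) for _ in range(n_studies)]
    all_d = [d for d, _ in studies]
    # Assumed publication filter: significant AND in the predicted direction.
    published = [d for d, significant in studies if significant and d > 0]
    return {
        "n_run": len(all_d),
        "n_published": len(published),
        "mean_d_all": statistics.mean(all_d),
        "mean_d_published": statistics.mean(published) if published else float("nan"),
    }


print(simulate_literature())
```

Under these assumptions the mean d across all studies sits near zero, while every published study shows a d of roughly 0.4 or more, because nothing smaller can pass the filter at this sample size. The cheaper it is to run the studies, the faster such a filter fills both the journals and the file drawer.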

The Semi-Anecdotal Data

I’ve developed these concerns based on my own experiences with MTurk.  This is semi-anecdotal.  I mean, the data are real, but none of it was collected specifically to probe publication bias with MTurk… so it may not be representative of psychology, or of any particular subfield.  That’s part of why I wrote this blog post–to see if others have had experiences like mine and to try to think about how the size/scope of the problem might be estimated in an unbiased way.  Anyway, here are the experiences I’ve had which suggest MTurk accentuates publication bias:

Online studies might have a higher file drawer rate

I recently collaborated on a meta-analysis on the effect of red on romantic attraction (paper is under review; the OSF page with all data and code is here: https://osf.io/xy47p/ ). For experiments in which incidental red (in the border, clothing, etc.) was contrasted with another color we ended up with data summarizing 8,007 participants. Incredibly, only 45% of this was published. The other 55% (4,436 participants) was data shared with us from the file drawer. That, on its own, is crazy! And yes, the published literature is distorted: the unpublished data yield much weaker effect sizes than the published data.

Beyond the troubling scope of the file drawer problem in the red/romance literature, we found that data source is related to publication status. For in-person studies, 52% was in the file drawer (2,484 of 4,763 participants); for studies conducted online, 60% was in the file drawer (1,952 of 3,244 participants). This comparison is partly confounded by time: in-person studies have been conducted from 2008 to present whereas online studies only since 2013. Also, online studies are typically much larger, so even if experiments drift into the file drawer at the same rate, the overall participant rate will be higher for online studies. Still, these data seem worrisome to me and potentially indicative that MTurk/online studies fill the file drawer at a faster rate than in-person studies.
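For readers who want to check the arithmetic, here is a short sketch that recomputes the file-drawer shares from the participant counts quoted above. The variable and function names are hypothetical; only the counts come from the text.

```python
# Participant counts as quoted in the text above.
file_drawer = {"in-person": 2484, "online": 1952}
total = {"in-person": 4763, "online": 3244}


def share(part, whole):
    """Percentage of `whole` made up by `part`."""
    return 100.0 * part / whole


for source in total:
    pct = share(file_drawer[source], total[source])
    print(f"{source}: {pct:.1f}% of participants in the file drawer")

# Overall share across both data sources.
overall = share(sum(file_drawer.values()), sum(total.values()))
print(f"overall: {overall:.1f}% of {sum(total.values())} participants")
```

These are participant-level shares, so they reflect both how many studies were shelved and how large those studies were.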

Those who do online studies sometimes seem to have little invested in them

For the red-romance meta-analysis, one lab sent us 6 unpublished online studies representing 956 participants (all conducted in 2013). The lab actually sent us the data in a chain of emails because digging up data from one study reminded them of the next and so on. It had been 5 years, but the lab leader reported having completely forgotten about the experiments. That, to me, indicates the incredibly low opportunity cost of MTurk. If I had churned through 956 in-person participants I would remember it, and I would have had so many sunk costs that I’m sure I would have wanted to try to find some outlet for publishing the result. I guess that’s its own problem, because the enormous costs of in-person studies provide incentives to try to p-hack them into a ‘publishable’ state. But still, when you can launch a study and see 300 responses roll in within an hour or so, your investment in eventually writing it all up seems fairly weak.

MTurk participants say they’ve taken part in experiments which have not been reported

In 2015 I was running replication studies on power and performance (Cusack, Vezenkova, Gottschalk, & Calin-Jageman, 2015).  None of the in-person studies were showing the predicted strong effects, so I turned to MTurk.

I knew from the paper by Chandler et al. (2013) that non-naivete is a huge issue and that common manipulations of power were played out on MTurk. So I went looking for a manipulation that had not yet been used online. I settled on a word-search task (participants find either power-related or neutral words). I selected this task because adapting it for online use requires some decent coding skills and because I could not find any published articles reporting a word-search manipulation (of any type) with MTurk participants. I figured that by developing an online word search I could be assured of an online study with naive participants.

Even though the word search task I painstakingly coded was novel (for an online context), when I launched the study on MTurk I included an item at the end where I asked participants to rate their familiarity with the study materials.  In this last section, it was made clear that payment was confirmed and not contingent on their answers.  Participants were asked “How familiar to you was the word-search task you answered?” and responded on a 1-4 scale.

The results were astonishing. Of 442 participants who responded, 19 (4%) rated their familiarity as a 4: “Very familiar – I’ve completed an online word search before using this exact same word list”; another 46 (9.7%) rated their familiarity as a 3: “Familiar – I’ve completed word searches like this online before, and some of the words were probably the same”.

Yikes! That means 16% of MTurk participants reported having taken part in a manipulation that had not ever been reported in published form (as far as I can tell). If there are, say, 7,300 MTurk survey takers (Stewart et al., 2015), that means about 1,100 MTurkers had already been through an experiment with this manipulation. Given that there is constant turnover of MTurk, the real number of unpublished power/word-search studies is likely considerably higher.

Materials I know are being used aren’t being reported

In March 2015 I wrote a blog post about my word search task (https://calin-jageman.net/lab/word_search/ ) and offered the code for free for anyone who wants to use it. Since then, I’ve been contacted by >30 labs seeking the code and sample Qualtrics survey. I’ve happily shared the materials each time, but asked the recipients to please cite my paper. So far, no one has cited the paper for this reason. I’m sure some are still in progress and that others decided against using the task… but I’ve done a lot of tech support on this task and feel confident at least 1/2 have actually tried to collect data with it. If I were to guess that only 1/2 of those are mature enough to have been published by now, that still gives me a guess of about 7 studies which have used the task but which are lurking in the file drawer. If each study has 300 participants, that’s easily 2,100 participants in the file drawer. So how representative, really, is the published literature on power in the MTurk era? Surely, you can now find lots of published effects of power that use MTurk or online participants… but it may be that these are just the tip of the iceberg.
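The back-of-envelope estimate above can be written out as a tiny sketch. The inputs are the guesses stated in the text (30 labs as a floor for ">30", half of them collecting data, half of those mature enough to have been published, 300 participants per study); the function name and the choice to round the study count down are assumptions of this illustration.

```python
def estimate_file_drawer(labs_requesting=30, share_collected_data=0.5,
                         share_mature=0.5, participants_per_study=300):
    """Rough guess at unpublished studies and participants, from the stated guesses."""
    # 30 * 0.5 * 0.5 = 7.5 studies; round down to stay conservative.
    studies = int(labs_requesting * share_collected_data * share_mature)
    participants = studies * participants_per_study
    return studies, participants


studies, participants = estimate_file_drawer()
print(f"about {studies} studies and {participants} participants in the file drawer")
```

Changing any of the guessed inputs shows how sensitive the estimate is; for example, a higher share of labs that actually collected data scales the participant count proportionally.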

Multiple discovery is real

In 2014 I went to the EASP meeting in Amsterdam and saw a talk by Pascal Burgmer. He reported studies in which he had adapted a perspective-taking task by Tversky & Hard for use online (Tversky & Hard, 2009). As the talk was happening, bells were going off in my head: I could use this online perspective-taking measure to conduct a conceptual replication of the famous finding by Galinsky and colleagues that power decreases perspective-taking (Galinsky, Magee, Inesi, & Gruenfeld, 2006). (You know the one… it’s in all the Psych 101 textbooks.)

I wrote to Pascal Burgmer; he graciously shared the materials, and then I set to work with a student to design a replication (https://osf.io/wch5r/).  We collected the data in early 2015 on both MTurk and Prolific Academic.  For the MTurk study, 36% of the participants reported strong familiarity with the task.  Clearly, others were investigating power and perspective-taking on MTurk at or around the same time.  For comparison, only 3% of Prolific Academic participants reported strong familiarity with the study materials.

Eventually, we’ve been able to identify some of the other labs that were interested in this question. In 2016 Blader and colleagues reported an experiment almost exactly like the one my student and I had run (Study 4 of Blader, Shirako, & Chen, 2016): power was manipulated and the effect was measured using the same online adaptation of Tversky and Hard’s visual perspective-taking task (even the same images!). Also, it turns out that in late 2014 one of the Many Labs projects included a study on power and perspective-taking (though with a somewhat different paradigm) and that 773 MTurk participants were collected (Ebersole et al., 2016).

So – that means at least 3 labs were collecting data on power and perspective-taking at or around the same time on MTurk.  We were using materials that are spoiled with repeated use (once you’ve been debriefed on the visual-perspective-taking task it is ruined), but as far as I can tell I was the only one to probe for non-naivete.  Moreover, non-naivete makes a measurable difference: in our MTurk data including the non-naive participants yields a statistically significant effect, but excluding the non-naive participants does not (though it is a moderate effect in the predicted direction).

Not only were we spoiling each other’s experiments, it is also clear that the published record on the topic is distorted. The Blader et al. (2016) paper was published, but the authors were (of course) unaware of two other large studies using essentially the same materials… synthesizing all these studies would suggest a weak, potentially 0 effect.

I made this scatterplot for an APS poster in March of 2017 that has the original study, my student’s direct replication, our MTurk and Prolific Academic samples, the Blader study, and a student replication we were able to find online… the balance of evidence suggests a very weak effect in the predicted direction, but 0 effect cannot be ruled out. I haven’t had time to work on the project since then… so there may be more relevant data to synthesize. And, of course, one of the Many Labs projects failed to find an effect of power on perspective-taking in another paradigm.

Caveats and Discussion

I keep saying “MTurk” when really I mean online… I would guess the problem of low opportunity cost exacerbates the file drawer problem with any online platform in which obtaining large samples becomes very low cost in terms of time, effort, and actual money.

I don’t think MTurk is completely ruined.  If you have truly novel research materials or materials that don’t spoil with repeated use then MTurk seems much more promising.  But think about this–if you do find an effect with new materials, how long before they are spoiled on MTurk?  What will we do when folks can report original research on MTurk but then claim all subsequent replications are invalid due to non-naivete?

Overall, I’m still a bit equivocal about the dangers of MTurk.  I love the platform and seeing the data roll in.  On the other hand, it is this very love of MTurk that is so dangerous.  There are so many researchers with so few distinct research questions–it’s a simple tragedy of the commons.

All of the data I presented above is semi-anecdotal.  The projects I’ve been working on keep bumping me up against the idea that the file drawer problem is much worse with online studies.  But maybe my experiences are unusual or unique?  Maybe it depends on the subfield of psychology or the topic?  I’m happy for feedback, sharing of experiences, etc.  I’d also love to hear from folks interested in collaborating to try to measure the extent of the file drawer problem on MTurk.



Blader, S. L., Shirako, A., & Chen, Y.-R. (2016). Looking Out From the Top. Personality and Social Psychology Bulletin, 42(6), 723–737. doi:10.1177/0146167216636628
Chandler, J., Mueller, P., & Paolacci, G. (2013). Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers. Behavior Research Methods, 46(1), 112–130. doi:10.3758/s13428-013-0365-7
Cusack, M., Vezenkova, N., Gottschalk, C., & Calin-Jageman, R. J. (2015). Direct and Conceptual Replications of Burgmer & Englich (2012): Power May Have Little to No Effect on Motor Performance. PLOS ONE, 10(11), e0140806. doi:10.1371/journal.pone.0140806
Ebersole, C. R., Atherton, O. E., Belanger, A. L., Skulborstad, H. M., Allen, J. M., Banks, J. B., … Nosek, B. A. (2016). Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68–82. doi:10.1016/j.jesp.2015.10.012
Galinsky, A. D., Magee, J. C., Inesi, M. E., & Gruenfeld, D. H. (2006). Power and Perspectives Not Taken. Psychological Science, 17(12), 1068–1074. doi:10.1111/j.1467-9280.2006.01824.x
Sterling, T. D. (1959). Publication Decisions and their Possible Effects on Inferences Drawn from Tests of Significance—or Vice Versa. Journal of the American Statistical Association, 54(285), 30–34. doi:10.1080/01621459.1959.10501497
Stewart, N., Ungemach, C., Harris, A., Bartels, D., Newell, B., Paolacci, G., & Chandler, J. (2015). The average laboratory samples a population of 7,300 Amazon Mechanical Turk workers. Judgment and Decision Making, 10(5). Retrieved from https://repub.eur.nl/pub/82837/
Tversky, B., & Hard, B. M. (2009). Embodied and disembodied cognition: Spatial perspective-taking. Cognition, 110(1), 124–129. doi:10.1016/j.cognition.2008.10.008
Zhou, H., & Fishbach, A. (2016). The pitfall of experimenting on the web: How unattended selective attrition leads to surprising (yet false) research conclusions. Journal of Personality and Social Psychology, 111(4), 493–504. doi:10.1037/pspa0000056

Tony Hak 1950-2018: Champion of Better Methods, Better Statistics

It was a shock to receive the very sad news that Tony Hak died last week, unexpectedly. Too young! And only 3 years into an active retirement.

Tony was an Emeritus Associate Professor, having retired in 2015 from the Department of Technology and Operations Management of RSM, the Rotterdam School of Management, Erasmus University, The Netherlands.

Tony was indefatigable in teaching and advocating better research practices and, in particular, use of statistical methods not based on NHST and p values. He had worked in a number of different disciplines during his career, and maintained a great breadth of interest–just about all disciplines need to lift their statistical game. Tony recognised many of the issues that led to Open Science well before that term became established.

I’ll mention just two of his recent pieces of work:

This is an Excel-based set of tools for meta-analysis that was developed with two of Tony’s graduate students. It goes well beyond ESCI intro meta-analysis. It is a free download from here. We recommend it in Chapter 9 of ITNS; it’s well worth checking out.

ICOTS9, 2014
ICOTS9 was the 9th International Conference on Teaching Statistics, held in Flagstaff, Arizona. Tony’s provocative paper was titled AFTER STATISTICS REFORM: SHOULD WE STILL TEACH SIGNIFICANCE TESTING? Download it here.

Tony argued that we should seriously consider no longer teaching NHST. His paper caused quite a stir at ICOTS, and, of course, few agreed to go as far as he was advocating. But he made some strong arguments for his position. His paper is very much worth reading–a real thought-provoker.

Australia in 2012, Netherlands in 2013
Tony visited my research group at La Trobe in 2012–we had excellent discussions. Then in 2013 he was a generous academic host to me at RSM for a couple of weeks. I had a wonderful time, giving talks and/or workshops in Groningen, Utrecht, Amsterdam and, especially, Rotterdam. I met and had lively discussions with many of the Netherlands-based folks who are still leaders in Open Science and statistical debates.

One day Tony and I were chatting in his office when a delivery arrived–a stack of heavy cartons. Tony happily explained that the boxes contained my first book, UTNS–sufficient copies for Tony’s incoming methods class. Enough to warm the heart of any author!

Tony was a fine man, wonderful with students, and energetic, persistent, and innovative in pursuit of his academic goals–which were shrewdly chosen to improve how research can be done.


TONY HAK 1950 – 2018

Badges, Badges, Badges: Open Science on the March

Here are two screen pics from today’s notice about the latest issue of Psychological Science.

Four of the first five articles earned all three badges, including Pre-reg! Gold! (OK, by showing just those five I’m cherry picking, but other articles also have lots of badges and, in any case, what a lovely juicy cherry!)

If you would like to know about Open Science, or you would like your students to learn how science is now being done, you could try ITNS, our intro textbook that includes lots of Open Science. And/or you could find out more at the APS Convention in San Fran next month:


ITNS–A New Review on Amazon

The ITNS page on Amazon (U.S.) is here. Scroll down to see the 4 reviews by readers.

Recently a five-star review was added by Edoardo Zamuner. Here it is:

“I am an experimental psychologist with training in NHST. Cumming’s book has helped me realise that my understanding of key statistical tests was largely mistaken. As a result, Cumming’s discussion of confidence intervals, effect sizes, meta-analysis and open science has become essential to my work. I found the section on effect sizes especially important, since many academic journals now require authors to report effect sizes with error estimations — with or without p-values. Cumming’s New Statistics is at the center of Psychological Science’s statistical reforms (Eich, 2013, Psychological Science, “No More business as Usual—Embracing the New Statistics”).

“In addition to being an excellent tool for researchers, Cumming’s book is relevant to teachers interested in switching from NHST to effect sizes, and who want their students to learn about the replication crisis, meta-analysis, and open science. Students using this book will still gain the necessary skills to understand the literature published prior to the statistical reform.

“The book comes with a number of excellent online resources, including a companion website hosted by the publisher (google: routledge textbooks introduction new statistics). From this website, teachers and students can download slides with images from lecture notes, the Exploratory Software for Confidence Intervals (ESCI), and guides to SPSS and R. More online material is available on the APS website (google: aps new statistics estimation cumming), where readers can watch six excellent videos in which Cumming expands on topics from the book.”

Thank you Edoardo. I was delighted to see those judgments.

If I could just add one remark: My co-author Bob Calin-Jageman deserves a big mention for an immense and ongoing contribution to the book and all the materials.

P.S. If you have read or worked with ITNS and/or any of the accompanying materials, please consider writing a review for Amazon. It’s actually easy to do.

It’s not just Psychology: Questionable Research Practices in Ecology

Today’s fine article from The Conversation is:

Our survey found ‘questionable research practices’ by ecologists and biologists – here’s what that means

The authors are Fiona Fidler and Hannah Fraser, of The University of Melbourne.

Fidler and Fraser surveyed 807 researchers (494 ecologists and 313 evolutionary biologists) about their use of Questionable Research Practices (QRPs), including cherry picking statistically significant results, p-hacking, and hypothesising after the results are known (HARKing). The authors also asked them to estimate the proportion of their colleagues who use each of these QRPs. For each QRP, roughly half the respondents stated that they had used that practice at least once. For some practices, they estimated higher rates among their research colleagues. These results are confronting, but the proportions are similar to those previously reported for psychology.

The preprint that gives more details of their survey and the results is here.

So QRPs have been endemic in Psychology, and now Ecology and Evolutionary Biology. And in even more disciplines, we’d have to guess. Open Science has, of course, developed to improve research practices, in particular by reducing QRPs markedly.

One of the problems is that anti-science forces can attempt to exploit these sorts of findings, not to mention the equally confronting findings of the replication crisis. The specific focus of Fidler and Fraser’s article is to respond to this problem. They pose and then reply to a number of the accusations that might be prompted by their results:

It’s fraud!
NO, it’s not! Scientific fraud does occur, and is extremely serious, but the evidence is that, thankfully, it’s very rare.

Scientists lack integrity and we shouldn’t trust them
The authors present evidence and several reasons why this is not true. The rapid rise and spread of Open Science may be the strongest indicator that researchers are responding with great integrity, energy, and conviction as they develop and adopt the better ways of Open Science.

We can’t base important decisions on current scientific evidence
On the contrary, in numerous important cases, including climate change and the effectiveness of vaccination, the evidence is multi-pronged, massive, and much replicated.

Scientists are human and we need safeguards
Yes indeed, and perhaps one of the biggest challenges of Open Science is to achieve change in the incentive systems that scientists are subjected to, and that so easily lead to QRPs.

But read the article itself–it’s short and very well-written.


We’ve Been Here Before: The Replication Crisis over the Pygmalion Effect

[UPDATE: Thanks to Twitter I came across this marvelous book (Jussim, 2012) that does a great job explaining the Pygmalion effect, the controversy around it, and the overall state of research on expectancy effects. I’ve amended parts of this post based on what I’ve learned from Chapter 3 of the book…looking forward to reading the rest]

Some studies stick with you; they provide a lens that transforms the way you see the world.   My list of ‘sticky’ studies includes Superstition in the Pigeon (Skinner, 1948), the Good Samaritan study (Darley & Batson, 1973), Milgram’s Obedience studies (Milgram, 1963), and the Pygmalion Effect by Rosenthal and Jacobson.

Today I’m taking the Pygmalion Effect off my list.  It turns out that it is much less robust than my Psych 101 textbook led me to believe (back in 1994).   Expectancy effects do occur, but it is unlikely that teacher expectations can dramatically shape IQ as claimed by Rosenthal & Jacobson.

This is news to me…though maybe not to you. Since I first read about the Pygmalion effect as a first-year college student I’ve bored countless friends and acquaintances with this study. It was a conversational lodestone; I could find expectancy effects everywhere and so talked about them frequently. No more, or at least not nearly so simplistically. The original Pygmalion Effect is seductive baloney. [Update: I mean this in terms of teacher expectancy being able to have a strong impact on IQ. Fair point by Jussim that expectancy effects matter a lot even if IQ isn’t directly affected.]

What has really crushed my spirit today is the history of the Pygmalion Effect.  It turns out that when it was published it set off a wave of debate that very closely mirrors the current replication crisis.  Details are below, but here’s the gist:

  • The original study was shrewdly popularized and had an enormous impact on policy well before sufficient data had been collected to demonstrate it is a reliable and robust result.
  • Critics raged about poor measurement, flexible statistical analysis, and cherry-picking of data.
  • That criticism was shrugged off.
  • Replications were conducted.
  • The point of replication studies was disputed.
  • Direct replications that showed no effect were discounted for a variety of post-hoc reasons.
  • Any shred of remotely supportive evidence was claimed as a supportive replication.  This stretched the Pygmalion effect from something specific (an impact on actual IQ) to basically any type of expectancy effect in any situation…. which makes it trivially true but not really what was originally claimed.  Rosenthal didn’t seem to notice or mind as he elided the details with constant promotion of the effect.
  • Those criticizing the effect were accused of trying to promote their own careers, bias, animus, etc.
  • The whole thing continued on and on for decades without satisfactory resolution.
  • Multiple rounds of meta-analysis were conducted to try to ferret out the real effect, though these were always contested by those on opposing sides of this issue. [Update – on the plus side, Rosenthal helped pioneer meta-analysis and bring it into the mainstream…so that’s a good thing!]
  • Even though the best evidence suggests that expectation effects are small and cannot impact IQ directly, the Pygmalion Effect continues to be taught and cited uncritically.  The criticisms and failed replications are largely forgotten.
  • The truth seems to be that there *are* expectancy effects–but:
    • there are important boundary conditions (like not producing real effects on IQ)
    • the effects are often small
    • and there are important moderators (Jussim & Harber, 2005).
  • Appreciation of how real expectancy effects work has likely been held back by tons of attention and research on this one particular research claim, which was never very reasonable or well supported in the first place.

So: The famous Pygmalion Effect is likely an illusion and the bad science that produced it was accompanied by a small-scale precursor of the current replication crisis.  Surely this is a story that has been repeating itself many times across many decades of psychology research:

Toutes choses sont dites déjà; mais comme personne n’écoute, il faut toujours recommencer / Everything has been said already; but as no one listens, we must always begin again.

(I just learned about this quote today in a Slate article; it is from André Gide and was recently quoted in a footnote by Supreme Court Justice Sonia Sotomayor)

The Details

I’ve based this brief blog post on two papers that summarize the academic history of the Pygmalion effect: Spitz (1999) and Jussim & Harber (2005). If you are interested in this topic, I strongly recommend them along with this book by Jussim (Jussim, 2012). There’s no way I could match any of these sources for their breadth and depth of coverage of this topic. So below are the Cliff Notes:

The Pygmalion Effect

The original Pygmalion Effect was an experiment by Rosenthal & Jacobson in which teachers at an elementary school were told that some of their students were ready to exhibit remarkable growth (based on the “Harvard Test of Inflected Acquisition”). In reality the students designated as “about to bloom” were selected at random (about 5 per classroom in 18 classrooms spanning 5 grades). IQ was measured before this manipulation and again at several time points after the study began. At the 8-month time point, the 255 control students showed an average gain of 4 IQ points whereas the 65 children designated bloomers gained an average of 12 IQ points. Most of this was due to much higher growth in the first and second grade classes. The IQ tests were somewhat standardized, so supposedly the DV was *not* subject to expectancy effects by the teacher who administered it. Thus, the original Pygmalion Effect was the notion that teacher expectancy could literally increase IQ.
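To make those numbers concrete, here is a minimal sketch of the headline effect size, first in original units (IQ points) and then as a very rough standardised value. The standardiser of 15 is purely an assumption (the conventional SD of IQ scales); it is not an SD from the study's data, which isn't reported in this post, so the standardised figure is illustrative only and no CI can be calculated from what is given here.

```python
# Mean IQ gains at the 8-month time point, as reported above.
control_gain = 4.0    # 255 control students
bloomer_gain = 12.0   # 65 students designated "bloomers"

# Effect size in original units: difference in mean IQ gain.
raw_difference = bloomer_gain - control_gain

# ASSUMPTION: standardiser of 15 IQ points (the conventional SD of IQ scales),
# NOT an SD estimated from the study's own data.
ASSUMED_SD = 15.0
rough_d = raw_difference / ASSUMED_SD

print(f"Difference in mean gain: {raw_difference:.0f} IQ points")
print(f"Rough standardised difference (assumed SD = {ASSUMED_SD:.0f}): {rough_d:.2f}")
```

A different standardiser would give a different value, which is one reason the original-units figure (8 IQ points) is the easier one to interpret.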

The results were reported across several publications: they were presented (briefly) in a book by Rosenthal (1966), then more fully in a journal article (Rosenthal & Jacobson, 1966), then in a Scientific American article (1968), a book chapter (also 1968), and then in a full-length book (Rosenthal & Jacobson, 1968). According to Google Scholar, the book version has been cited over 5,000 times since publication (though Google Scholar links to a summary of the book published in Urban Review).

The experiment caused a sensation, garnering tremendous public attention and almost immediately influencing public policy and even legal decisions (Spitz, 1999).

The Problems

Not all the reaction to the Pygmalion Effect was positive.  Doubters emerged.  Some pointed out that the teachers could not recall who had been designated a student of great potential…meaning the manipulation should not have been effective (the teachers received a list of students at the beginning of the semester; few could recall the names of those on the list and many reported it to have been ‘just another memo’ in a sea of back-to-school business).  Questions were also raised about the quality of the measurement: the scores seemed to indicate that the incoming students were mentally disabled, and the IQ test used may not have been valid with children in the younger grades (the ones who drove all the gains).  Spitz (1999) has a great historical overview.

Here are a few juicy tidbits from a ferociously bad review of the book by Thorndike (Thorndike, 1968):

  • “In spite of anything I can say, I am sure it will become a classic — widely referred to and rarely examined critically”
  • “Alas, it is so defective technically that one can only regret that it ever got beyond the eyes of the original investigators!”
  • “When the clock strikes thirteen, doubt is not only cast on the last stroke but also on all that have come before….When the clock strikes 14, we throw away the clock.”

The endless back and forth

I can’t even begin to summarize the long-standing back-and-forth over the Pygmalion effect. Spitz (1999) does a good job summarizing from a primarily critical point of view.  It’s worth reading.  Equally worthy is a more sympathetic review by Jussim & Harber (Jussim & Harber, 2005).

One theme that emerges from the Spitz summary is that as more data rolled in the concept of what the Pygmalion Effect is became a point of contention. Critics were eager to focus on IQ and to show that there is no way a specific and large effect on IQ could be reliable. Rosenthal, on the other hand, seemed comfortable with a very flexible definition of the Pygmalion Effect, accepting nearly any type of expectancy effect in a school setting as confirmation while discounting or eliding negative results. Overall, the impression one gets is that Rosenthal was eager for a simple story (expectancy effects are real) and didn’t want to get caught up in the nuances. The critics were eager to show that at least parts of the story were questionable. In my reading, this ended up being a colossal waste of time–it led to many resources being poured into direct replications and endless argument, but not much that was productive in terms of fleshing out and rigorously testing theories about how/when expectancy effects would occur.

The Jussim & Harber paper does a great job at trying to move things forward–acknowledging the weak evidence specifically for IQ but pushing the field to think more critically about moderators, effect sizes, boundary conditions, and the like.  They end up with a much more nuanced take–that even if IQ effects might not be reliable, expectancy effects are likely real.

Deja Vu

If you bother to read any of these sources, I’m guessing that you’ll join me in feeling an eerie and worrying sense of deja vu related to the current replication crisis.  The Pygmalion Effect stirred up many of the same debates we’re currently hashing out (measurement quality, rigor of prediction, value of meta-analysis, standards of evidence, utility of replication, etc.).  There are also a lot of similarities in terms of tone and the way folks on opposing sides treated each other.  Rosenthal seems to shrug off criticism, and be very inventive at post-hoc reasoning.  It must have driven his critics mad.  I’ll let him have the last word, which I think those pushing for better science will find frustratingly familiar.  This is from a paper he wrote in 1980 celebrating the Pygmalion Effect reaching the status of “citation classic”:

There were also some wonderfully inept statistical critiques of Pygmalion research. This got lots of publications for the critics of our research including one whole book aimed at devastating the Pygmalion results, which only showed that the results were even more significant than Lenore Jacobson and I had claimed.

Yes, that’s the “what doesn’t kill my statistical significance makes it stronger” fallacy Gelman has been blogging about.  And, yes, it’s that same mocking dismissal of cogent criticism in favor of simplistic but almost certainly wrong stories that frustrates those trying to raise standards today.  And yes, this was 38 years ago… so things haven’t changed much and Rosenthal is still highly and uncritically cited.

We’ve got to do better this time around.



Darley, J. M., & Batson, C. D. (1973). “From Jerusalem to Jericho”: A study of situational and dispositional variables in helping behavior. Journal of Personality and Social Psychology, 27(1), 100–108. https://doi.org/10.1037/h0034449
Jussim, L. (2012). Social Perception and Social Reality. OUP USA.
Jussim, L., & Harber, K. D. (2005). Teacher Expectations and Self-Fulfilling Prophecies: Knowns and Unknowns, Resolved and Unresolved Controversies. Personality and Social Psychology Review, 9(2), 131–155. https://doi.org/10.1207/s15327957pspr0902_3
Milgram, S. (1963). Behavioral Study of obedience. The Journal of Abnormal and Social Psychology, 67(4), 371–378. https://doi.org/10.1037/h0040525
Rosenthal, R., & Jacobson, L. (1966). Teachers’ Expectancies: Determinants of Pupils’ IQ Gains. Psychological Reports, 19(1), 115–118. https://doi.org/10.2466/pr0.1966.19.1.115
Rosenthal, R., & Jacobson, L. (1968). Pygmalion in the classroom. The Urban Review, 3(1), 16–20. https://doi.org/10.1007/bf02322211
Skinner, B. F. (1948). “Superstition” in the pigeon. Journal of Experimental Psychology, 38(2), 168–172. https://doi.org/10.1037/h0055873
Spitz, H. H. (1999). Beleaguered Pygmalion: A History of the Controversy Over Claims that Teacher Expectancy Raises Intelligence. Intelligence, 27(3), 199–234. https://doi.org/10.1016/s0160-2896(99)00026-4
Thorndike, R. L. (1968). Reviews: Rosenthal, Robert, and Jacobson, Lenore. Pygmalion in the Classroom. New York: Holt, Rinehart and Winston, 1968. 240 + xi pp. $3.95. American Educational Research Journal, 5(4), 708–711. https://doi.org/10.3102/00028312005004708