Positive Controls for Psychology – My pitch for a SIPS project

Positive controls are one of the most useful tools for ensuring interpretable and fruitful research.  Strangely, though, positive controls are rarely used in psychological research.  That’s a shame, but also an opportunity–it would be an easy but substantial improvement for psychological researchers to start regularly using positive controls.  I (Bob) am currently at SIPS 2018; I’ll be giving a lightning talk about positive controls and hopefully developing some resources to encourage the use of positive controls.  To kick things off, I’ve started this OSF page on positive controls: https://osf.io/n5yx9/

What is a positive control?  A positive control (aka an active control) is a research condition that has a known effect in the research domain; it’s a research condition that ought to work if the research is conducted properly.  For example, a researcher might be studying how much a new drug affects alertness.  She will administer either the new drug or placebo and then measure alertness with an odd-ball task.  A positive control would be adding a third group that receives caffeine (blinded, of course), a drug well known to produce a modest increase in alertness on this task.

Why use positive controls?  There are several potential benefits:

  • Positive controls help indicate the sensitivity and integrity of the experiment.  If the experiment is conducted properly and with sufficient data then the positive control ought to show the expected effect.  If it does not, then the researcher will know that something may have gone wrong and will be able to investigate.  Positive controls are especially useful for interpreting “negative” results.  From the example above, if the researcher finds that the new drug does not influence alertness, she may wonder about the result: was enough data collected and was the procedure administered correctly?  Checking that the positive control came out as expected gives reassurance that the research was conducted properly and was sensitive to the desired range of effects.
  • Positive controls can be used as a training tool–new researchers can run positive controls to ensure procedural proficiency before collecting real data (and while collecting real data).
  • Screening for outliers and/or non-compliant responding – for some positive controls there is a clear range of valid responses even at the individual level.  In these cases, positive controls provide an additional way to screen for outliers and unusual responses.

How to select a positive control?  To aid interpretation, a positive control should be well-matched to the experimental question.  The ideal positive control:

  • Is from the same research domain
  • Has a well-characterized effect size that is similar to what is expected for the research question (or a set of positive controls can be used to test sensitivity to small, medium, and large effects)
  • Is sensitive to the factors that could ruin the effect of interest
  • Is easy/short to administer

Can positive controls really help in psychology? Yes.  I (Bob) have been using positive controls extensively in my replication research.  These have been essential in demonstrating the quality and sensitivity of the replication research.  For some examples, see:

So how do I get started?  I (Bob) have started an OSF page on positive controls.  I’m hoping to use some of my time at SIPS 2018 to populate the page and start some research to show they are worth using.   Here’s the page (still in development): https://osf.io/n5yx9/



Precision for Planning: Great New Developments

–updated with a link from Ken Kelley to access the functions in the paper, 6/28/2018–

In a new-statistics world, the best way to choose N for a study is to use precision for planning (PfP), also known as accuracy in parameter estimation (AIPE). Both our new-statistics books explain PfP and why it is better than a power analysis–which is the way to choose N in an NHST world. ESCI allows you to use PfP, but only for the two independent groups and paired designs.

The idea of PfP, as you may know, is to choose a target MoE; in other words, choose a CI length that you do not wish to exceed. Then PfP tells you the N that will deliver that MoE (or shorter)–either on average, or with 99% assurance.
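
As a rough illustration of the idea, here is a minimal Python sketch for two independent groups, with target MoE expressed in units of the population standard deviation. It assumes a large-sample normal (z) approximation, so it only approximates the "on average" case and does not handle assurance; ESCI's own calculations use t and may give a slightly different N. The function name n_per_group_for_moe is made up for this example.

```python
import math
from statistics import NormalDist


def n_per_group_for_moe(target_moe_sd, conf_level=0.95):
    """Approximate N per group so that the CI on the difference between two
    independent means has MoE no larger than target_moe_sd (in units of the
    population SD). Uses a z approximation, not ESCI's method."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    # MoE of the difference is roughly z * sigma * sqrt(2 / N); solve for N.
    return math.ceil(2 * (z / target_moe_sd) ** 2)


print(n_per_group_for_moe(0.4))  # 49
```

Halving the target MoE roughly quadruples the required N, which is why the choice of target MoE deserves careful thought.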

PfP is a highly valuable approach, hampered to date by the lack of software to make using it easy for a full range of measures and designs. Indeed the PfP techniques required to build such software have been developed only comparatively recently; many have been developed by Ken Kelley and colleagues. Further developments are needed and now Ken and colleagues have published a new article with important advances:

Kelley, K., Darku, F. B., & Chattopadhyay, B. (2018). Accuracy in parameter estimation for a general class of effect sizes: A sequential approach. Psychological Methods, Vol 23(2), 226-243. http://dx.doi.org/10.1037/met0000127

The translational (simplified) abstract is below, at the bottom.

The article may be available here, or you may need to get it via your library.

Traditional PfP, as described in our two books and implemented in ESCI, has some severe limitations:
1. The population distribution is assumed known–usually a normal distribution.
2. A particular effect size measure is used, for example the mean or Cohen’s d.
3. A value needs to be assumed for one or more population parameters, even though these are usually unknown. For example, our books and ESCI support PfP when target MoE is expressed in units of population standard deviation, even though this is usually unknown.
4. Traditional PfP gives a single fixed value of N for the target MoE to be achieved (on average, or with 99% assurance).

Very remarkably, the Kelley et al. article improves on ALL 4 of these aspects of traditional PfP! Imho, this is a wonderful contribution to our understanding of PfP and to the range of situations in which PfP can be used. It will, I hope, contribute to the much wider use of PfP for sample-size planning.

Much of the article is necessarily quite technical, but here is my understanding of the approach, in relation to the 4 points above.
1. A non-parametric approach is taken, meaning that no particular form of the population distribution is assumed. Using the central limit theorem makes the analysis tractable.
2. A very general form of effect size measure is assumed (in fact, the ratio of two functions of the population parameters). A large number of familiar effect size measures, including the mean, mean difference, and Cohen’s d, are special cases of this general measure, so the PfP technique that Kelley et al. develop can be applied to any of these familiar measures, as well as many others.
3. The sequential approach they take–see 4 below–allows them to estimate the relevant population parameters, and to update that estimate as the process proceeds. No dubious assumption of parameter values is required.
4. Conventional approaches to statistical inference rely on N being specified in advance. Open Science has emphasised that data peeking invalidates p values and other conventional approaches to inference. (In data peeking, you run a study, analyse, then decide whether to run some more participants, for example until statistical significance is achieved.) Avoiding data peeking is one reason for preregistration–which includes stating N in advance, or at least the stopping rule, which must not depend on the results obtained.

However, sequential analysis was developed about 75 years ago in the NHST world. It is seldom used in the behavioral sciences, but allows you to analyse data collected to date and then decide whether to continue, or to stop and declare in favour of the null hypothesis, or the specified alternative hypothesis. The stopping rule is designed so the procedure gives Type I and Type II error rates that are as selected for the NHST. Yes, sequential analysis is more complex to use, and you don’t know in advance how many participants you will need, but it can on average lead to smaller N being required than for conventional fixed-N approaches.

Kelley et al. have very cleverly used the sequential approach to PfP and, at the same time, have solved 3 above. The idea is that you take a pilot sample of size N1, then use the results from that to estimate relevant parameters and to calculate the MoE on your effect size estimate. If that MoE is not sufficiently short to provide the precision you are seeking, you test a further N0 participants (N0 is generally small, and may be 1). Then again estimate the parameters and calculate MoE. If that MoE is not sufficiently small, test a further N0 participants, and so on, until you achieve the desired precision.

Then interpret the final effect size estimate and its CI. Yes, the method may be complex, but it is very general and should on average give a smaller N than conventional PfP would require.
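
The loop described above can be sketched in a few lines of Python. This is only a simplified illustration for a single mean, using a normal-approximation MoE; it is not the stopping rule or the general effect size from Kelley et al. The names sequential_n_for_mean, pilot_n (standing in for N1), step_n (standing in for N0), and max_n are all made up for this example.

```python
import math
import random
from statistics import NormalDist, stdev


def sequential_n_for_mean(draw, target_moe, pilot_n=10, step_n=1,
                          conf_level=0.95, max_n=100_000):
    """Keep sampling until the MoE of the CI on the mean is <= target_moe.

    draw() returns one new observation. Returns (data, moe)."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    data = [draw() for _ in range(pilot_n)]  # pilot sample of size N1
    while True:
        # The SD is re-estimated from all the data collected so far.
        moe = z * stdev(data) / math.sqrt(len(data))
        if moe <= target_moe or len(data) >= max_n:
            return data, moe
        data.extend(draw() for _ in range(step_n))  # test N0 more participants


# Simulated "participants": scores from a population with mean 100, SD 15.
rng = random.Random(1)
data, moe = sequential_n_for_mean(lambda: rng.gauss(100, 15), target_moe=3.0)
print(len(data), round(moe, 2))
```

Note that no value for the population SD is assumed in advance: the estimate is simply updated as the data arrive, which is the point of 3 above.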

I find the generality and potential of the method stunning, and I can’t wait to see it made available within full-function data analysis applications. That will give a great boost to the highly desirable shift from power analysis to PfP, and more generally from NHST to the new statistics. Hooray!


—UPDATE — Ken Kelley writes:

On my web site is a link with instructions and code for a few specific instances of the method. The link is here:


For each of the effect sizes, there are several functions that need to be run first. But, after getting those into one's workspace, the actual function is easy to use. The functions available at the above link are for the coefficient of variation, for a regression coefficient in simple regression, and for the standardized mean difference. 

My co-authors and I have plans to develop an R package for more general applications. In fact, we already have made progress on the package, which will focus on sequential methods.

Translational Abstract
Accurately estimating effect sizes is an important goal in many studies. A wide confidence interval at the specified level of confidence (e.g., .95%) illustrates that the population value of the effect size of interest (i.e., the parameter) has not been accurately estimated. An approach to planning sample size in which the objective is to obtain a narrow confidence interval has been termed accuracy in parameter estimation. In our article, we first define a general class of effect size in which special cases are several commonly used effect sizes in practice. Using the general effect size we develop, we use a sequential estimation approach so that the width of the confidence interval will be sufficiently narrow. Sequential estimation is a well-recognized approach to inference in which the sample size for a study is not specified at the start of the study, and instead study outcomes are used to evaluate a predefined stopping rule, which evaluates if sampling should continue or stop. We introduce this method for study design in the context of the general effect size and call it “sequential accuracy in parameter estimation.” Sequential accuracy in parameter estimation avoids the difficult task of using supposed values (e.g., unknown parameter values) to plan sample size before the start of a study. We make these developments in a distribution-free environment, which means that our methods are not restricted to the situations of assumed distribution forms (e.g., we do not assume data follow a normal distribution). Additionally, we provide freely available software so that readers can immediately implement the methods.

P.S. I haven’t yet located the software mentioned in the final sentence above. Ken’s great software for PfP (and other things) is MBESS, so that may be where to look.

Effect Sizes for Open Science

For the last 20 years or so, many journals have emphasised the reporting of effect sizes. The new statistics emphasised also the reporting of CIs on those effect sizes. Now Open Science places effect sizes, CIs, and their interpretation centre stage.

Here’s a recent article with interesting discussion and much good advice about using effect sizes in an Open Science world:
Pek, J., & Flora, D. B. (2018). Reporting effect sizes in original psychological research: A discussion and tutorial. Psychological Methods, 23(2), 208-225. http://dx.doi.org/10.1037/met0000126

The translational (simplified) abstract is down the bottom.

Unfortunately, Psychological Methods is behind the APA paywall, so you will need to find the article via a library. (Sometimes it’s worth searching for the title of an article, in case someone has secreted the pdf somewhere online. Not in this case at the moment, I think.)

A couple of the article’s main points align with what we say in ITNS:

Give a thoughtful interpretation of effect sizes, in context
Choose effect sizes that best answer the research questions and that make most sense in the particular situation. Often interpretation is best done in the original units of measurement, assuming these have meaning–especially to likely readers. Use judgment, compare with past values found in similar contexts, give practical interpretations where possible. Where relevant, consider theoretical and possibly other aspects or implications. (And, we add in ITNS, consider the full extent of the CI in discussing and interpreting any effect size.)

Use standardised effect sizes with great care
Standardised (or units-free) effect sizes, often Cohen’s d or Pearson’s r, can be invaluable for meta-analysis, but it’s often more difficult to give them practical meaning in context. Beware glib resort to Cohen’s (or anyone else’s) benchmarks. Be very conscious of the measurement unit–for d, the standardiser. If, as usual, that’s a standard deviation estimated from the data, it has sampling variability and will be different in a replication. (In my first book, UTNS, I introduced the idea of the rubber ruler. Imagine the measuring stick to be a rubber cord, with knots at regular intervals to mark the units. Every replication results in the cord being stretched to a different extent, so the knots are further or less far apart. The Cohen’s d value is measured in units of the varying distance between knots.)
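
To see the rubber ruler in action, here is a small Python simulation sketch. The population values (a true difference of 5 and SD of 10, so population d = 0.5) and the group size of 20 are arbitrary assumptions chosen for illustration, and replicate_d is a made-up helper name.

```python
import random
from statistics import mean, stdev


def replicate_d(rng, n=20, true_diff=5.0, sigma=10.0):
    """Simulate one two-group study; return (pooled SD, Cohen's d)."""
    control = [rng.gauss(0.0, sigma) for _ in range(n)]
    treated = [rng.gauss(true_diff, sigma) for _ in range(n)]
    # Equal group sizes, so the pooled variance is the mean of the two variances.
    pooled_sd = ((stdev(control) ** 2 + stdev(treated) ** 2) / 2) ** 0.5
    return pooled_sd, (mean(treated) - mean(control)) / pooled_sd


rng = random.Random(2018)
for _ in range(5):
    sd, d = replicate_d(rng)
    print(f"standardiser = {sd:5.2f}   d = {d:5.2f}")
```

Each simulated replication prints a different standardiser, so d varies both because the mean difference varies and because the ruler itself stretches.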

There’s also lots more of interest in this article, imho.

Translational Abstract
We present general principles of good research reporting, elucidate common misconceptions about standardized effect sizes, and provide recommendations for good research reporting. Effect sizes should directly answer their motivating research questions, be comprehensible to the average reader, and be based on meaningful metrics of their constituent variables. We illustrate our recommendations with four different empirical examples involving popular statistical methods such as ANOVA, categorical variable analysis, multiple linear regression, and simple mediation; these examples serve as a tutorial to enhance practice in the research reporting of effect sizes.

APS in San Fran 3: Workshop on Teaching the New Stats

Tamarah Smith and Bob presented a workshop on Teaching the New Stats to an almost sold-out crowd. I wasn’t there, but by all reports it went extremely well. Such a workshop seems to me a terrific way to help interested stats teachers introduce the new stats into their own teaching.

After taking that first step, it may all get easier, because, in my experience, teaching the new stats brings its own reward, in that students understand better and therefore feel better. So we the teachers will also feel the joy.

Tamarah and Bob’s slides are here. It strikes me as a wonderful collection, with numerous links to useful resources, and great advice about presenting an appealing and up-to-date course to beginning students. Also, indeed, to more advanced students. It’s well worth browsing those slides. Here are a few points that struck me as I browsed:

**You can download the workshop files, and follow along.

**You may know that jamovi and JASP are open source applications for statistical analysis designed to supersede commercial applications, notably SPSS. They are more user friendly, as well as being extensible by anyone. Already, add-on modules written in R are beginning to appear. (These are exciting developments, worth trying out.)

**Bob is developing add-ons for jamovi for the new statistics. (Eventually, jamovi augmented by Bob’s modules may replace ESCI–with the great advantages of providing data file management and a gateway to the power and scope of a full data analysis application.)

**The workshop discussed three simple examples (comparison of two independent means, comparison of two independent proportions, and estimation of interactions).

**The first example (independent means) was discussed in most detail, with a comparison of traditional NHST analysis, and new-stats analysis using ESCI then jamovi (with Bob’s add-on); then a Bayesian credible-interval approach. Then meta-analysis, to emphasise that estimation thinking and meta-analytic thinking are essential frameworks for the new stats.

**The GAISE guidelines were used to frame the discussion of the pedagogical approach. There is lots on encouraging students to think and judge in context–which should warm the heart of any insightful stats teacher.

**There are examples of students’ responses to illustrate the presenters’ conviction that conceptual understanding is better when using the new stats.

**There’s a highly useful discussion of a range of statistical software for new-stats learning and data analysis.

There’s lots more, but I’ll close with the neat summary below of the new-stats approach, which is now best ethical practice for conducting research and inferential data analysis.


APS in San Fran 2: Symposium on Teaching the New Stats

Our symposium was titled Open Science and Its Statistics: What We Need to Teach Now. The room wasn’t large, but it was packed, standing room only. I thought the energy was terrific. There were four presentations, as below.

Bob and Tamarah Smith have set up an OSF page on Getting started teaching the New Statistics. It holds all sorts of goodies, including the slides for our symposium.

At that site, expand OSF Storage, then Open Science and its Statistics–2018 APS Symposium slides and see 4 files for the 4 presentations:

Bob Calin-Jageman (Chair)
Open Science and Its Statistics: What We Need to Teach Now
Examples of students being stumped by traditional NHST analysis and presentation of a result, but readily (and happily) understanding the same result presented using the new stats. In addition we should teach and advocate the new statistics to improve statistical understanding and communication across all of science that uses statistical inference.

Geoff Cumming
Open Science is Best Practice Science
Being ethical as a teacher or researcher requires that we use current best practice, and for statistical inference that is the new statistics. The forest plot is, in my experience, a highly effective picture for introducing students to estimation and meta-analysis. I gave a paper advocating its use in the intro stats course back in 2006. In my experience, the new stats is, in contrast to NHST, a joy to teach. Students saying ‘It just all makes sense…’ is one of the most heart-warming things any teacher can hear.

Susan Nolan & Kelly Goedert
Transitioning a Traditional Department: Roadblocks and Opportunities for Incorporating the New Statistics and Open Science into Teaching Materials
Roadblocks include NHST appearing everywhere, common software (SPSS) not making the new stats easy, colleagues who are steeped in the old ways, widely-used textbooks taking traditional approaches, … But there are new textbooks and open source software emerging, and there are strategies for spreading the word and bringing colleagues on board. (See the slides for numerous practical suggestions and links to many useful resources to support teaching and using the new statistics.)

Tamarah Smith
Feeling Good about the New Statistics: How the New Statistics Improves the Way Researchers and Students Feel about Statistics
Statistics anxiety is a problem for many students, and impedes their learning. The new statistics opens up great opportunities for teaching so that anxiety is much reduced and students’ attitudes are more positive. The new statistics helps teachers meet the GAISE guidelines for assessment and instruction in stats education. Students feel better and more engaged and their learning is grounded. As a result their teachers also feel better. Let’s do it.

Personally, I found it wonderful to hear so many examples and reasons all converging on the conclusion that teaching the new statistics (1) is what’s needed for ethical science, (2) helps students understand much better and feel good about their learning, and (3) is great for teachers also. A triple win!


P.S. The pic below is from Bob’s slides and is adapted from Kruschke and Liddell (2018). The crucial thing for Open Science is the shift to estimation and meta-analysis, and away from the damaging dichotomous decision making of NHST. The estimation and meta-analysis can be frequentist (conventional confidence intervals) or Bayesian (credible intervals)–either is fine. In other words, there is a Bayesian new statistics, alongside the new statistics of ITNS. Maybe the Bayesian version will come to be the more widely used? I believe the biggest hurdle to overcome for that to happen is the arrival of good teaching materials that make Bayesian estimation easily accessible to beginning students, as well as to researchers steeped in NHST.

But the main message is that either of the cells in the bottom row is just what we need.

Kruschke, J. K. & Liddell, T. M. (2018). The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25, 178–206.

APS in San Fran 1: Open Science is Maturing

It was, as ever, a great pleasure to catch up with Bob last weekend. We were in San Francisco for the APS Convention. That convention has been for the last five years or so a hotbed of discussion about Open Science–many times I felt I was witnessing decisions about what best practice science now should be. Open Science practices were being identified and refined, as I listened and watched!

This year the convention was back in the same place as four years ago, when my workshop on the new statistics and OS was filmed. That resulted in the six videos available here.

My sense last year and especially this year is that OS practices are maturing. A few years ago we discussed evidence of the replication crisis and possible strategies for improvement. But now OS badges have been in the field for a few years, preregistration is widely recognised, and the emphasis has shifted to actually doing these good things, and providing encouragement and facilities for researchers generally to adopt OS practices.

As an example, it’s no longer a matter of explaining why preregistration helps and encouraging folks to use it, but now we discuss the best ways to ensure that preregistration is fully detailed and includes a detailed plan for the statistical analysis.

Again this year there were a number of lively symposia in which OS issues were debated, but now with some data about how we’re tracking. As an example of a great symposium contribution I’ll mention 19 Ways Journal Editors Can Promote Transparency and Replicability by Steve Lindsay, editor-in-chief of Psychological Science. The 19 encompass a very broad range of good practices, from encouraging preregistration to suggesting several strategies for improving statistical practices in articles that make the grade and are published. See a pdf of Steve’s slides here.

Below is a pic from those slides.


The Perils of MTurk, Part 1: Fuel to the Publication Bias Fire?

It’s not going to be a popular opinion, but I think MTurk has become a danger to sound psychological science.  This breaks my heart.  MTurk has helped transform my career for the better.  Moreover, MTurk participants are amazing: they are primarily diligent and honest folks with a genuine interest in the science they are participating in.  Still, I’ve come around to the conclusion that MTurk + current publication practices = very bad science (in many cases).  That’s a worrisome conclusion, given that reliance on MTurk is already fairly overwhelming in some subfields of psychology; 40-50% of all manuscripts in top social psych journals include an MTurk sample (Zhou & Fishbach, 2016).

The Theory: MTurk as an accelerant to noise mining

Here’s the problem: MTurk makes running studies so easy that it exacerbates the publication bias problem.  There are *so* many researchers running *so many* studies.  Yes, you know that–problems with non-naivete are already well-documented with MTurk (Chandler, Mueller, & Paolacci, 2013).  But think about this: it is the nature of science that within a particular field its diverse workforce often works on a relatively small set of problems.  It is this relatively narrow focus within each field that explains why the history of science is so full of instances of parallel/multiple discovery: https://en.wikipedia.org/wiki/Multiple_discovery.

So it’s not just that lots of people are running lots of studies–it’s that often there are large cohorts of researchers who are unwittingly running essentially the same studies.   Then, given our current publication practices, the studies reaching ‘statistical significance’ are published while those that are ‘not significant’ are shelved.   That’s a recipe for disaster.  And, as Leif Nelson recently pointed out, mining noise in this way is not nearly as inefficient as one might expect (http://datacolada.org/71).

So… my theory is basically just a re-telling of the publication bias story, an old saw (Sterling, 1959), though still an incredibly pernicious one.   What is new, I think, is the way MTurk has made the opportunity cost of conducting a study so negligible: it’s like fuel being poured on the publication dumpster fire.  MTurk dramatically increases the number of people running studies and the number of studies run by each researcher.  Moreover, the low opportunity cost means it is less painful for researchers to simply move on if results didn’t pan out.  With MTurk it costs very little to fill your file drawer while mining noise for publication gold.

The Semi-Anecdotal Data

I’ve developed these concerns based on my own experiences with MTurk.  This is semi-anecdotal.  I mean, the data are real, but none of it was collected specifically to probe publication bias with MTurk… so it may not be representative of psychology, or of any particular subfield.  That’s part of why I wrote this blog post–to see if others have had experiences like mine and to try to think about how the size/scope of the problem might be estimated in an unbiased way.  Anyway, here are the experiences I’ve had which suggest MTurk accentuates publication bias:

Online studies might have a higher file drawer rate

I recently collaborated on a meta-analysis on the effect of red on romantic attraction (paper is under review; the OSF page with all data and code is here: https://osf.io/xy47p/ ).   For experiments in which incidental red (in the border, clothing, etc.) was contrasted with another color we ended up with data summarizing 8,007 participants.  Incredibly, only 45% of this was published.  The other 55% (4,436 participants) was data shared with us from the file drawer.  That, on its own, is crazy!  And yes, the published literature is distorted: the unpublished data yield much weaker effect sizes than the published data.

Beyond the troubling scope of the file drawer problem in the red/romance literature, we found that data source is related to publication status.  For in-person studies, 52% was in the file drawer (2,484 of 4,763 participants); for studies conducted online, 60% was in the file drawer (1,952 of 3,244 participants).  This comparison is partly confounded by time: in-person studies have been conducted from 2008 to present whereas online studies only since 2013.  Also, online studies are typically much larger, so even if there is the same experimental rate of drifting into the file drawer, the overall participant rate will be higher for online studies.  Still, this data seems worrisome to me and potentially indicative that MTurk/Online studies fill the file drawer at a faster rate than in-person studies.
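
For anyone who wants to check the arithmetic, here is a short Python sketch that recomputes the file drawer rates from the participant counts given above. These are participant-level rates only; as noted, they do not tell us the rate at which studies end up in the file drawer.

```python
# Participant counts as reported above: (in file drawer, total).
counts = {
    "in-person": (2484, 4763),
    "online": (1952, 3244),
}

for source, (unpublished, total) in counts.items():
    print(f"{source}: {unpublished / total:.0%} of participants unpublished")

all_unpublished = sum(u for u, _ in counts.values())
all_total = sum(t for _, t in counts.values())
print(f"overall: {all_unpublished} of {all_total} ({all_unpublished / all_total:.0%})")
```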

Those who do online studies sometimes seem to have little invested in them

For the red-romance meta-analysis, one lab sent us 6 unpublished online studies representing 956 participants (all conducted in 2013).  The lab actually sent us the data in a chain of emails because digging up data from one study reminded them of the next and so on.  It had been 5 years, but the lab leader reported having completely forgotten about the experiments.  That, to me, indicates the incredibly low opportunity cost of MTurk.  If I had churned through 956 in-person participants I would remember it, and I would have had such large sunk costs that I’m sure I would have wanted to try to find some outlet for publishing the result.  I guess that’s its own problem, because the enormous costs of in-person studies provide incentives to try to p-hack them into a ‘publishable’ state.  But still, when you can launch a study and see 300 responses roll in within an hour or so, your investment in eventually writing it all up seems fairly weak.

MTurk participants say they’ve taken part in experiments which have not been reported

In 2015 I was running replication studies on power and performance (Cusack, Vezenkova, Gottschalk, & Calin-Jageman, 2015).  None of the in-person studies were showing the predicted strong effects, so I turned to MTurk.

I knew from the paper by Chandler et al. (2013) that non-naivete is a huge issue and that common manipulations of power were played out on MTurk.  So I went looking for a manipulation that had not yet been used online.  I settled on a word-search task (participants find either power-related or neutral words). I selected this task because adapting it for online use requires some decent coding skills and because I could not find any published articles reporting a word-search manipulation (of any type) with MTurk participants.  I figured that by developing an online word search I could be assured of an online study with naive participants.

Even though the word search task I painstakingly coded was novel (for an online context), when I launched the study on MTurk I included an item at the end where I asked participants to rate their familiarity with the study materials.  In this last section, it was made clear that payment was confirmed and not contingent on their answers.  Participants were asked “How familiar to you was the word-search task you answered?” and responded on a 1-4 scale.

The results were astonishing.  Of 442 participants who responded, 19 (4%) rated their familiarity as a 4: “Very familiar – I’ve completed an online word search before using this exact same word list”; another 46 (9.7%) rated their familiarity as a 3: “Familiar – I’ve completed word searches like this online before, and some of the words were probably the same”.

Yikes! That means 16% of MTurk participants reported having taken part in a manipulation that had not ever been reported in published form (as far as I can tell).  If there are, say, 7,300 MTurk survey takers (Stewart et al., 2015), that means about 1,100 MTurkers had already been through an experiment with this manipulation.  Given that there is constant turnover of MTurk, the real number of unpublished power/word-search studies is likely considerably higher.

Materials I know are being used aren’t being reported

In March 2015 I wrote a blog post about my word search task (https://calin-jageman.net/lab/word_search/ ) and offered the code for free for anyone who wants to use it.  Since then, I’ve been contacted by >30 labs seeking the code and sample Qualtrics survey.  I’ve happily shared the materials each time, but asked the recipients to please cite my paper.  So far, no one has cited the paper for this reason.  I’m sure some are still in progress and that others decided against using the task… but I’ve done a lot of tech support on this task and feel confident at least 1/2 have actually tried to collect data with it.  If I were to guess that only 1/2 of those are mature enough to have been published by now, that still gives me a guess of about 7 studies which have used the task but which are lurking in the file drawer.  If each study has 300 participants, that’s easily 2100 participants in the file drawer.  So how representative, really, is the published literature on power in the MTurk era?  Surely, you can now find lots of published effects of power that use MTurk or online participants… but it may be that these are just the tip of the iceberg.
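
The back-of-the-envelope estimate above can be written out as a short Python sketch. Every input is one of the guesses stated in the paragraph, not measured data, and the variable names are made up for this example.

```python
labs_requesting = 30        # ">30 labs", so this is a lower bound
share_collected_data = 0.5  # guess: at least half tried to collect data
share_mature = 0.5          # guess: half of those could have been published by now
n_per_study = 300           # assumed typical online sample size

# 30 * 0.5 * 0.5 = 7.5, rounded down to 7 studies
studies = int(labs_requesting * share_collected_data * share_mature)
print(studies, studies * n_per_study)  # 7 2100
```

Changing any of the guesses changes the answer proportionally, so the result is best read as an order-of-magnitude figure.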

Multiple discovery is real

In 2014 I went to the EASP meeting in Amsterdam and saw a talk by Pascal Burgmer.  He reported studies in which he had adapted a perspective-taking task by Tversky & Hard for use online (Tversky & Hard, 2009).  As the talk was happening, bells were going off in my head: I could use this online perspective-taking measure to conduct a conceptual replication of the famous finding by Galinsky and colleagues that power decreases perspective-taking (Galinsky, Magee, Inesi, & Gruenfeld, 2006).  You know the one… it’s in all the Psych 101 textbooks.

I wrote to Pascal Burgmer; he graciously shared the materials, and then I set to work with a student to design a replication (https://osf.io/wch5r/).  We collected the data in early 2015 on both MTurk and Prolific Academic.  For the MTurk study, 36% of the participants reported strong familiarity with the task.  Clearly, others were investigating power and perspective-taking on MTurk at or around the same time.  For comparison, only 3% of Prolific Academic participants reported strong familiarity with the study materials.

Eventually, we were able to identify some of the other labs that were interested in this question.  In 2016 Blader and colleagues reported an experiment almost exactly like the one my student and I had run (Study 4 of Blader, Shirako, & Chen, 2016): power was manipulated and the effect was measured using the same online adaptation of Tversky and Hard’s visual perspective-taking task (even the same images!).  Also, it turns out that in late 2014 one of the Many Labs projects included a study on power and perspective-taking (though with a somewhat different paradigm), with data collected from 773 MTurk participants (Ebersole et al., 2016).

So – that means at least 3 labs were collecting data on power and perspective-taking at or around the same time on MTurk.  We were using materials that are spoiled with repeated use (once you’ve been debriefed on the visual-perspective-taking task it is ruined), but as far as I can tell I was the only one to probe for non-naivete.  Moreover, non-naivete makes a measurable difference: in our MTurk data, including the non-naive participants yields a statistically significant effect, but excluding them does not (though the estimate is still a moderate effect in the predicted direction).

Not only were we spoiling each other’s experiments, it is also clear that the published record on the topic is distorted.  The Blader et al. (2016) paper was published, but the authors were (of course) unaware of two other large studies using essentially the same materials… synthesizing all these studies would suggest a weak effect, potentially 0.

I made this scatterplot for an APS poster in March of 2017 that has the original study, my student’s direct replication, our MTurk and Prolific Academic samples, the Blader study, and a student replication we were able to find online… the balance of evidence suggests a very weak effect in the predicted direction, but 0 effect cannot be ruled out.  I haven’t had time to work on the project since then, so there may be more relevant data to synthesize.  And, of course, one of the Many Labs projects failed to find an effect of power on perspective-taking in another paradigm.
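To make "synthesizing all these studies" concrete, here is a minimal sketch of one common way to pool effect sizes: a fixed-effect, inverse-variance weighted average with a 95% confidence interval. It is not necessarily the analysis behind the poster, and the effect sizes and standard errors below are made-up placeholders, NOT the values from the studies discussed above; swap in real estimates before drawing any conclusion.

```python
import math

# PLACEHOLDER inputs for illustration only: (label, effect size d, standard error).
# These are NOT the estimates from the original study, the replications,
# or Blader et al. (2016).
placeholder_studies = [
    ("small original study", 0.40, 0.25),
    ("online replication 1", 0.05, 0.15),
    ("online replication 2", 0.10, 0.10),
]


def fixed_effect_pool(effects):
    """Inverse-variance weighted mean effect, its SE, and a 95% CI."""
    weights = [1.0 / se ** 2 for _, _, se in effects]   # precise studies weigh more
    total_weight = sum(weights)
    pooled = sum(w * d for w, (_, d, _) in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci


pooled, pooled_se, ci = fixed_effect_pool(placeholder_studies)
print(f"pooled d = {pooled:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
```

With these placeholder numbers the pooled estimate is small and positive, and its interval spans 0, which is the general pattern described above. A random-effects model would usually be the more cautious choice when studies differ in platform and sample, at the cost of a wider interval.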

Caveats and Discussion

I keep saying “MTurk” when really I mean online… I would guess the problem of low opportunity cost exacerbates the file drawer problem with any online platform in which obtaining large samples becomes very low cost in terms of time, effort, and actual money.

I don’t think MTurk is completely ruined.  If you have truly novel research materials or materials that don’t spoil with repeated use then MTurk seems much more promising.  But think about this–if you do find an effect with new materials, how long before they are spoiled on MTurk?  What will we do when folks can report original research on MTurk but then claim all subsequent replications are invalid due to non-naivete?

Overall, I’m still a bit equivocal about the dangers of MTurk.  I love the platform and seeing the data roll in.  On the other hand, it is this very love of MTurk that is so dangerous.  There are so many researchers with so few distinct research questions–it’s a simple tragedy of the commons.

All of the data I presented above is semi-anecdotal.  The projects I’ve been working on keep bumping me up against the idea that the file drawer problem is much worse with online studies.  But maybe my experiences are unusual or unique?  Maybe it depends on the subfield of psychology or the topic?  I’m happy for feedback, sharing of experiences, etc.  I’d also love to hear from folks interested in collaborating to try to measure the extent of the file drawer problem on MTurk.



Blader, S. L., Shirako, A., & Chen, Y.-R. (2016). Looking Out From the Top. Personality and Social Psychology Bulletin, 42(6), 723–737. doi:10.1177/0146167216636628
Chandler, J., Mueller, P., & Paolacci, G. (2013). Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers. Behavior Research Methods, 46(1), 112–130. doi:10.3758/s13428-013-0365-7
Cusack, M., Vezenkova, N., Gottschalk, C., & Calin-Jageman, R. J. (2015). Direct and Conceptual Replications of Burgmer & Englich (2012): Power May Have Little to No Effect on Motor Performance. PLOS ONE, 10(11), e0140806. doi:10.1371/journal.pone.0140806
Ebersole, C. R., Atherton, O. E., Belanger, A. L., Skulborstad, H. M., Allen, J. M., Banks, J. B., … Nosek, B. A. (2016). Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68–82. doi:10.1016/j.jesp.2015.10.012
Galinsky, A. D., Magee, J. C., Inesi, M. E., & Gruenfeld, D. H. (2006). Power and Perspectives Not Taken. Psychological Science, 17(12), 1068–1074. doi:10.1111/j.1467-9280.2006.01824.x
Sterling, T. D. (1959). Publication Decisions and their Possible Effects on Inferences Drawn from Tests of Significance—or Vice Versa. Journal of the American Statistical Association, 54(285), 30–34. doi:10.1080/01621459.1959.10501497
Stewart, N., Ungemach, C., Harris, A., Bartels, D., Newell, B., Paolacci, G., & Chandler, J. (2015). The average laboratory samples a population of 7,300 Amazon Mechanical Turk workers. Judgment and Decision Making, 10(5). Retrieved from https://repub.eur.nl/pub/82837/
Tversky, B., & Hard, B. M. (2009). Embodied and disembodied cognition: Spatial perspective-taking. Cognition, 110(1), 124–129. doi:10.1016/j.cognition.2008.10.008
Zhou, H., & Fishbach, A. (2016). The pitfall of experimenting on the web: How unattended selective attrition leads to surprising (yet false) research conclusions. Journal of Personality and Social Psychology, 111(4), 493–504. doi:10.1037/pspa0000056

Tony Hak 1950-2018: Champion of Better Methods, Better Statistics

It was a shock to receive the very sad news that Tony Hak died last week, unexpectedly. Too young! And only 3 years into an active retirement.

Tony was an Emeritus Associate Professor, having retired in 2015 from the Department of Technology and Operations Management of RSM, the Rotterdam School of Management, Erasmus University, The Netherlands.

Tony was indefatigable in teaching and advocating better research practices and, in particular, use of statistical methods not based on NHST and p values. He had worked in a number of different disciplines during his career, and maintained a great breadth of interest–just about all disciplines need to lift their statistical game. Tony recognised many of the issues that led to Open Science well before that term became established.

I’ll mention just two of his recent pieces of work:

Meta-Essentials
This is an Excel-based set of tools for meta-analysis that was developed with two of Tony’s graduate students. It goes well beyond ESCI intro meta-analysis. It is a free download from here. We recommend it in Chapter 9 of ITNS; it’s well worth checking out.

ICOTS9, 2014
ICOTS9 was the 9th International Conference on Teaching Statistics, held in Flagstaff, Arizona. Tony’s provocative paper was titled AFTER STATISTICS REFORM: SHOULD WE STILL TEACH SIGNIFICANCE TESTING? Download it here.

Tony argued that we should seriously consider no longer teaching NHST. His paper caused quite a stir at ICOTS, and, of course, few agreed to go as far as he was advocating. But he made some strong arguments for his position. His paper is very much worth reading–a real thought-provoker.

Australia in 2012, Netherlands in 2013
Tony visited my research group at La Trobe in 2012–we had excellent discussions. Then in 2013 he was a generous academic host to me at RSM for a couple of weeks. I had a wonderful time, giving talks and/or workshops in Groningen, Utrecht, Amsterdam and, especially, Rotterdam. I met and had lively discussions with many of the Netherlands-based folks who are still leaders in Open Science and statistical debates.

One day Tony and I were chatting in his office when a delivery arrived–a stack of heavy cartons. Tony happily explained that the boxes contained my first book, UTNS–sufficient copies for Tony’s incoming methods class. Enough to warm the heart of any author!

Tony was a fine man, wonderful with students, and energetic, persistent, and innovative in pursuit of his academic goals–which were shrewdly chosen to improve how research can be done.


TONY HAK 1950 – 2018

Badges, Badges, Badges: Open Science on the March

Here are two screen pics from today’s notice about the latest issue of Psychological Science.

Four of the first five articles earned all three badges, including Pre-reg! Gold! (OK, by showing just those five I’m cherry picking, but other articles also have lots of badges and, in any case, what a lovely juicy cherry!)

If you would like to know about Open Science, or you would like your students to learn how science is now being done, you could try ITNS, our intro textbook that includes lots of Open Science. And/or you could find out more at the APS Convention in San Fran next month.


ITNS–A New Review on Amazon

The ITNS page on Amazon (U.S.) is here. Scroll down to see the 4 reviews by readers.

Recently a five-star review was added by Edoardo Zamuner. Here it is:

“I am an experimental psychologist with training in NHST. Cumming’s book has helped me realise that my understanding of key statistical tests was largely mistaken. As a result, Cumming’s discussion of confidence intervals, effect sizes, meta-analysis and open science has become essential to my work. I found the section on effect sizes especially important, since many academic journals now require authors to report effect sizes with error estimations — with or without p-values. Cumming’s New Statistics is at the center of Psychological Science’s statistical reforms (Eich, 2013, Psychological Science, “No More business as Usual—Embracing the New Statistics”).

“In addition to being an excellent tool for researchers, Cumming’s book is relevant to teachers interested in switching from NHST to effect sizes, and who want their students to learn about the replication crisis, meta-analysis, and open science. Students using this book will still gain the necessary skills to understand the literature published prior to the statistical reform.

“The book comes with a number of excellent online resources, including a companion website hosted by the publisher (google: routledge textbooks introduction new statistics). From this website, teachers and students can download slides with images from lecture notes, the Exploratory Software for Confidence Intervals (ESCI), and guides to SPSS and R. More online material is available on the APS website (google: aps new statistics estimation cumming), where readers can watch six excellent videos in which Cumming expands on topics from the book.”

Thank you Edoardo. I was delighted to see those judgments.

If I could just add one remark: My co-author Bob Calin-Jageman deserves a big mention for an immense and ongoing contribution to the book and all the materials.

P.S. If you have read or worked with ITNS and/or any of the accompanying materials, please consider writing a review for Amazon. It’s actually easy to do.