It’s not going to be a popular opinion, but I think MTurk has become a danger to sound psychological science. This breaks my heart. MTurk has helped transform my career for the better. Moreover, MTurk participants are amazing: they are primarily diligent and honest folks with a genuine interest in the science they are participating in. Still, I’ve come around to the conclusion that MTurk + current publication practices = very bad science (in many cases). That’s a worrisome conclusion, given that reliance on MTurk is already fairly overwhelming in some subfields of psychology; 40-50% of all manuscripts in top social psych journals include an MTurk sample (Zhou & Fishbach, 2016).
The Theory: MTurk as an accelerant to noise mining
Here’s the problem: MTurk makes running studies so easy that it exacerbates the publication bias problem. There are *so* many researchers running *so many* studies. Yes, you know that–problems with non-naivete are already well-documented with MTurk (Chandler, Mueller, & Paolacci, 2013). But think about this: it is the nature of science that within a particular field its diverse workforce often works on a relatively small set of problems. It is this relatively narrow focus within each field that explains why the history of science is so full of instances of parallel/multiple discovery: https://en.wikipedia.org/wiki/Multiple_discovery.
So it’s not just that lots of people are running lots of studies–it’s that often there are large cohorts of researchers who are unwittingly running essentially the same studies. Then, given our current publication practices, the studies reaching ‘statistical significance’ are published while those that are ‘not significant’ are shelved. That’s a recipe for disaster. And, as Leif Nelson recently pointed out, mining noise in this way is not nearly as inefficient as one might expect (http://datacolada.org/71).
So… my theory is basically just a re-telling of the publication bias story, an old saw (Sterling, 1959), though still an incredibly pernicious one. What is new, I think, is the way MTurk has made the opportunity costs for conducting a study so negligible: it’s like fuel being poured on the publication dumpster fire. MTurk dramatically increases the number of people running studies and the number of studies run by each researcher. Moreover, the low opportunity cost means it is less painful for researchers to simply move on if results didn’t pan out. With MTurk it costs very little to fill your file drawer while mining noise for publication gold.
The Semi-Anecdotal Data
I’ve developed these concerns based on my own experiences with MTurk. This is semi-anecdotal. I mean, the data are real, but none of it was collected specifically to probe publication bias with MTurk… so it may not be representative of psychology, or of any particular subfield. That’s part of why I wrote this blog post–to see if others have had experiences like mine and to try to think about how the size/scope of the problem might be estimated in an unbiased way. Anyway, here are the experiences I’ve had which suggest MTurk accentuates publication bias:
Online studies might have a higher file drawer rate
I recently collaborated on a meta-analysis on the effect of red on romantic attraction (paper is under review; OSF page with all data and code is here: https://osf.io/xy47p/ ). For experiments in which incidental red (in the border, clothing, etc.) was contrasted with another color, we ended up with data summarizing 8,007 participants. Incredibly, only 45% of this was published. The other 55% (4,436 participants) was data shared with us from the file drawer. That, on its own, is crazy! And yes, the published literature is distorted: the unpublished data yield much weaker effect sizes than the published data.
Beyond the troubling scope of the file drawer problem in the red/romance literature, we found that data source is related to publication status. For in-person studies, 52% of the data was in the file drawer (2,484 of 4,763 participants); for studies conducted online, 60% was in the file drawer (1,952 of 3,244 participants). This comparison is partly confounded by time: in-person studies have been conducted from 2008 to the present, whereas online studies only since 2013. Also, online studies are typically much larger, so even if studies drift into the file drawer at the same rate, the participant-level rate will be higher for online studies. Still, these data seem worrisome to me and potentially indicative that MTurk/online studies fill the file drawer at a faster rate than in-person studies.
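The two file-drawer rates come straight from the participant counts reported above; here is a quick sketch of the arithmetic, using only the numbers from this post:

```python
# File-drawer rates from the red/romance meta-analysis (counts from this post).
in_person = {"file_drawer": 2484, "total": 4763}
online = {"file_drawer": 1952, "total": 3244}

def file_drawer_rate(counts):
    """Proportion of participants whose data never reached publication."""
    return counts["file_drawer"] / counts["total"]

print(f"In-person: {file_drawer_rate(in_person):.0%}")  # 52%
print(f"Online:    {file_drawer_rate(online):.0%}")     # 60%
```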
Those who do online studies sometimes seem to have little invested in them
For the red-romance meta-analysis, one lab sent us 6 unpublished online studies representing 956 participants (all conducted in 2013). The lab actually sent us the data in a chain of emails, because digging up data from one study reminded them of the next, and so on. It had been 5 years, but the lab leader reported having completely forgotten about the experiments. That, to me, indicates the incredibly low opportunity cost of MTurk. If I had churned through 956 in-person participants I would remember it, and I would have had such large sunk costs that I’m sure I would have wanted to try to find some outlet for publishing the result. I guess that’s its own problem, because the enormous costs of in-person studies provide incentives to try to p-hack them into a ‘publishable’ state. But still, when you can launch a study and see 300 responses roll in within an hour or so, your investment in eventually writing it all up seems fairly weak.
MTurk participants say they’ve taken part in experiments which have not been reported
In 2015 I was running replication studies on power and performance (Cusack, Vezenkova, Gottschalk, & Calin-Jageman, 2015). None of the in-person studies were showing the predicted strong effects, so I turned to MTurk.
I knew from the paper by Chandler (2013) that non-naivete is a huge issue and that common manipulations of power were played out on MTurk. So I went looking for a manipulation that had not yet been used online. I settled on a word-search task (participants find either power-related or neutral words). I selected this task because adapting it for online use requires some decent coding skills and because I could not find any published articles reporting a word-search manipulation (of any type) with MTurk participants. I figured that by developing an online word search I could be assured of an online study with naive participants.
Even though the word search task I painstakingly coded was novel (for an online context), when I launched the study on MTurk I included an item at the end where I asked participants to rate their familiarity with the study materials. In this last section, it was made clear that payment was confirmed and not contingent on their answers. Participants were asked “How familiar to you was the word-search task you answered?” and responded on a 1-4 scale.
The results were astonishing. Of 442 participants who responded, 19 (4.3%) rated their familiarity as a 4: “Very familiar – I’ve completed an online word search before using this exact same word list”; another 46 (10.4%) rated their familiarity as a 3: “Familiar – I’ve completed word searches like this online before, and some of the words were probably the same”.
Yikes! That means about 15% of MTurk participants reported having taken part in a manipulation that had not ever been reported in published form (as far as I can tell). If there are, say, 7,300 MTurk survey takers (Stewart et al., 2015), that means about 1,100 MTurkers had already been through an experiment with this manipulation. Given that there is constant turnover on MTurk, the real number of unpublished power/word-search studies is likely considerably higher.
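To make the back-of-envelope estimate explicit (the 7,300 figure is the average lab’s reachable population from Stewart et al., 2015; the rest are the familiarity counts from my study above):

```python
# Rough estimate of how many MTurk workers had already seen an
# (apparently unreported) power/word-search manipulation.
familiar = 19 + 46       # rated familiarity 4 or 3
respondents = 442
prop = familiar / respondents      # about 0.15
population = 7300                  # Stewart et al. (2015) estimate

print(f"{prop:.1%} familiar")      # 14.7% familiar
print(round(prop * population))    # about 1,074 workers
```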
Materials I know are being used aren’t being reported
In March 2015 I wrote a blog post about my word search task (https://calin-jageman.net/lab/word_search/ ) and offered the code for free to anyone who wants to use it. Since then, I’ve been contacted by >30 labs seeking the code and sample Qualtrics survey. I’ve happily shared the materials each time, but asked the recipients to please cite my paper. So far, no one has cited the paper for this reason. I’m sure some are still in progress and that others decided against using the task… but I’ve done a lot of tech support on this task and feel confident at least half have actually tried to collect data with it. If I were to guess that only half of those are mature enough to have been published by now, that still gives me a guess of about 7 studies which have used the task but which are lurking in the file drawer. If each study has 300 participants, that’s easily 2,100 participants in the file drawer. So how representative, really, is the published literature on power in the MTurk era? Surely, you can now find lots of published effects of power that use MTurk or online participants… but it may be that these are just the tip of the iceberg.
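The chain of guesses above can be written out; every number here is an admitted guess from this post, not a measured quantity:

```python
# Guessing at the word-search task's file drawer (all inputs are guesses).
labs_contacted = 30
ran_the_task = labs_contacted // 2      # guess: half actually collected data
mature_studies = ran_the_task // 2      # guess: half are mature enough to publish
per_study_n = 300                       # guess: typical online sample size

print(mature_studies)                   # about 7 unreported studies
print(mature_studies * per_study_n)     # about 2,100 participants
```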
Multiple discovery is real
In 2014 I went to the EASP meeting in Amsterdam and saw a talk by Pascal Burgmer. He reported studies in which he had adapted a perspective-taking task by Tversky & Hard for use online (Tversky & Hard, 2009). As the talk was happening, bells were going off in my head: I could use this online perspective-taking measure to conduct a conceptual replication of the famous finding by Galinsky and colleagues that power decreases perspective-taking (Galinsky, Magee, Inesi, & Gruenfeld, 2006). (you know the one… it’s in all the Psych 101 textbooks).
I wrote to Pascal Burgmer; he graciously shared the materials, and then I set to work with a student to design a replication (https://osf.io/wch5r/). We collected the data in early 2015 on both MTurk and Prolific Academic. For the MTurk study, 36% of the participants reported strong familiarity with the task. Clearly, others were investigating power and perspective-taking on MTurk at or around the same time. For comparison, only 3% of Prolific Academic participants reported strong familiarity with the study materials.
Eventually, we were able to identify some of the other labs that were interested in this question. In 2016 Blader and colleagues reported an experiment almost exactly like the one my student and I had run (Study 4 of Blader, Shirako, & Chen, 2016): power was manipulated and the effect was measured using the same online adaptation of Tversky and Hard’s visual perspective-taking task (even the same images!). Also, it turns out that in late 2014 one of the Many Labs projects included a study on power and perspective-taking (though with a somewhat different paradigm) and that 773 MTurk participants were collected (Ebersole et al., 2016).
So – that means at least 3 labs were collecting data on power and perspective-taking at or around the same time on MTurk. We were using materials that are spoiled with repeated use (once you’ve been debriefed on the visual-perspective-taking task it is ruined), but as far as I can tell I was the only one to probe for non-naivete. Moreover, non-naivete makes a measurable difference: in our MTurk data including the non-naive participants yields a statistically significant effect, but excluding the non-naive participants does not (though it is a moderate effect in the predicted direction).
Not only were we spoiling each other’s experiments, it is also clear that the published record on the topic is distorted. The Blader et al. (2016) paper was published, but the authors were (of course) unaware of two other large studies using essentially the same materials… synthesizing all these studies would suggest a weak, possibly zero, effect.
I made a scatterplot for an APS poster in March of 2017 that has the original study, my student’s direct replication, our MTurk and Prolific Academic samples, the Blader study, and a student replication we were able to find online… the balance of evidence suggests a very weak effect in the predicted direction, but a zero effect cannot be ruled out. I haven’t had time to work on the project since then, so there may be more relevant data to synthesize. And, of course, one of the Many Labs projects failed to find an effect of power on perspective-taking in another paradigm.
Caveats and Discussion
I keep saying “MTurk” when really I mean online… I would guess the problem of low opportunity cost exacerbates the file drawer problem with any online platform in which obtaining large samples becomes very low cost in terms of time, effort, and actual money.
I don’t think MTurk is completely ruined. If you have truly novel research materials or materials that don’t spoil with repeated use then MTurk seems much more promising. But think about this–if you do find an effect with new materials, how long before they are spoiled on MTurk? What will we do when folks can report original research on MTurk but then claim all subsequent replications are invalid due to non-naivete?
Overall, I’m still a bit equivocal about the dangers of MTurk. I love the platform and seeing the data roll in. On the other hand, it is this very love of MTurk that is so dangerous. There are so many researchers with so few distinct research questions–it’s a simple tragedy of the commons.
All of the data I presented above is semi-anecdotal. The projects I’ve been working on keep bumping me up against the idea that the file drawer problem is much worse with online studies. But maybe my experiences are unusual or unique? Maybe it depends on the subfield of psychology or the topic? I’m happy for feedback, sharing of experiences, etc. I’d also love to hear from folks interested in collaborating to try to measure the extent of the file drawer problem on MTurk.
Blader, S. L., Shirako, A., & Chen, Y.-R. (2016). Looking Out From the Top. Personality and Social Psychology Bulletin, 42(6), 723–737. doi:10.1177/0146167216636628

Chandler, J., Mueller, P., & Paolacci, G. (2013). Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers. Behavior Research Methods, 46(1), 112–130. doi:10.3758/s13428-013-0365-7

Cusack, M., Vezenkova, N., Gottschalk, C., & Calin-Jageman, R. J. (2015). Direct and Conceptual Replications of Burgmer & Englich (2012): Power May Have Little to No Effect on Motor Performance. PLOS ONE, 10(11), e0140806. doi:10.1371/journal.pone.0140806

Ebersole, C. R., Atherton, O. E., Belanger, A. L., Skulborstad, H. M., Allen, J. M., Banks, J. B., … Nosek, B. A. (2016). Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68–82. doi:10.1016/j.jesp.2015.10.012

Galinsky, A. D., Magee, J. C., Inesi, M. E., & Gruenfeld, D. H. (2006). Power and Perspectives Not Taken. Psychological Science, 17(12), 1068–1074. doi:10.1111/j.1467-9280.2006.01824.x

Sterling, T. D. (1959). Publication Decisions and their Possible Effects on Inferences Drawn from Tests of Significance—or Vice Versa. Journal of the American Statistical Association, 54(285), 30–34. doi:10.1080/01621459.1959.10501497

Stewart, N., Ungemach, C., Harris, A., Bartels, D., Newell, B., Paolacci, G., & Chandler, J. (2015). The average laboratory samples a population of 7,300 Amazon Mechanical Turk workers. Judgment and Decision Making, 10(5). Retrieved from https://repub.eur.nl/pub/82837/

Tversky, B., & Hard, B. M. (2009). Embodied and disembodied cognition: Spatial perspective-taking. Cognition, 110(1), 124–129. doi:10.1016/j.cognition.2008.10.008

Zhou, H., & Fishbach, A. (2016). The pitfall of experimenting on the web: How unattended selective attrition leads to surprising (yet false) research conclusions. Journal of Personality and Social Psychology, 111(4), 493–504. doi:10.1037/pspa0000056