We’ve Been Here Before: The Replication Crisis over the Pygmalion Effect
[UPDATE: Thanks to Twitter I came across this marvelous book (Jussim, 2012) that does a great job explaining the Pygmalion effect, the controversy around it, and the overall state of research on expectancy effects. I’ve amended parts of this post based on what I’ve learned from Chapter 3 of the book…looking forward to reading the rest.]
Some studies stick with you; they provide a lens that transforms the way you see the world. My list of ‘sticky’ studies includes Superstition in the Pigeon (Skinner, 1948), the Good Samaritan study (Darley & Batson, 1973), Milgram’s Obedience studies (Milgram, 1963), and the Pygmalion Effect by Rosenthal and Jacobson.
Today I’m taking the Pygmalion Effect off my list. It turns out that it is much less robust than my Psych 101 textbook led me to believe (back in 1994). Expectancy effects do occur, but it is unlikely that teacher expectations can dramatically shape IQ as claimed by Rosenthal & Jacobson.
This is news to me…though maybe not to you. Since I first read about the Pygmalion effect as a first-year college student I’ve bored countless friends and acquaintances with this study. It was a conversational lodestone; I could find expectancy effects everywhere and so talked about them frequently. No more, or at least not nearly so simplistically. The original Pygmalion Effect is seductive baloney. [Update: I mean this in terms of teacher expectancy being able to have a strong impact on IQ. Fair point by Jussim that expectancy effects matter a lot even if IQ isn’t directly affected.]
What has really crushed my spirit today is the history of the Pygmalion Effect. It turns out that when it was published it set off a wave of debate that very closely mirrors the current replication crisis. Details are below, but here’s the gist:
- The original study was shrewdly popularized and had an enormous impact on policy well before sufficient data had been collected to demonstrate it is a reliable and robust result.
- Critics raged about poor measurement, flexible statistical analysis, and cherry-picking of data.
- That criticism was shrugged off.
- Replications were conducted.
- The point of replication studies was disputed.
- Direct replications that showed no effect were discounted for a variety of post-hoc reasons.
- Any shred of remotely supportive evidence was claimed as a supportive replication. This stretched the Pygmalion effect from something specific (an impact on actual IQ) to basically any type of expectancy effect in any situation, which makes it trivially true but not really what was originally claimed. Rosenthal didn’t seem to notice or mind as he elided the details with constant promotion of the effect.
- Those criticizing the effect were accused of trying to promote their own careers, bias, animus, etc.
- The whole thing continued on and on for decades without satisfactory resolution.
- Multiple rounds of meta-analysis were conducted to try to ferret out the real effect, though these were always contested by those on opposing sides of the issue. [Update – on the plus side, Rosenthal helped pioneer meta-analysis and bring it into the mainstream…so that’s a good thing!]
- Even though the best evidence suggests that expectation effects are small and cannot impact IQ directly, the Pygmalion Effect continues to be taught and cited uncritically. The criticisms and failed replications are largely forgotten.
- The truth seems to be that there *are* expectancy effects–but:
  - there are important boundary conditions (like not producing real effects on IQ)
  - they are often small
  - there are important moderators (Jussim & Harber, 2005).
- Appreciation of how real expectancy effects work has likely been held back by tons of attention and research on this one particular research claim, which was never very reasonable or well supported in the first place.
So: The famous Pygmalion Effect is likely an illusion and the bad science that produced it was accompanied by a small-scale precursor of the current replication crisis. Surely this is a story that has been repeating itself many times across many decades of psychology research:
Toutes choses sont dites déjà; mais comme personne n’écoute, il faut toujours recommencer / Everything has been said already; but as no one listens, we must always begin again.
(I just learned about this quote today in a Slate article; it is from André Gide and was recently quoted in a footnote by Supreme Court Justice Sonia Sotomayor.)
I’ve based this brief blog post on two papers that summarize the academic history of the Pygmalion effect: Spitz (1999) and Jussim & Harber (2005). If you are interested in this topic, I strongly recommend them along with this book by Jussim (Jussim, 2012). There’s no way I could match any of these sources for their breadth and depth of coverage of this topic. So here are the cliff notes:
The Pygmalion Effect
The original Pygmalion Effect was an experiment by Rosenthal & Jacobson in which teachers at an elementary school were told that some of their students were ready to exhibit remarkable growth (based on the “Harvard Test of Inflected Acquisition”). In reality the students designated as “about to bloom” were selected at random (about 5 per classroom in 18 classrooms spanning 5 grades). IQ was measured before this manipulation and again at several time points after the study began. At the 8 month time point, the 255 control students showed growth of 4
The results were reported across several publications: they were presented (briefly) in a book by Rosenthal (1966), then more fully in a journal article (Rosenthal & Jacobson, 1966), then in a Scientific American article (1968), a book chapter (also 1968), and then in a full-length book (Rosenthal & Jacobson, 1968). According to Google Scholar, the book version has been cited over 5,000 times since publication (though Google Scholar links to a summary of the book published in Urban Review).
The experiment caused a sensation, garnering tremendous public attention and almost immediately influencing public policy and even legal decisions (Spitz, 1999).
Not all the reaction to the Pygmalion Effect was positive. Doubters emerged. Some pointed out that the teachers could not recall who had been designated a student of great potential…meaning the manipulation should not have been effective (the teachers received a list of students at the beginning of the semester; few could recall the names of those on the list and many reported it to have been ‘just another memo’ in a sea of back-to-school business). Questions were also raised about the quality of the measurement: the scores seemed to indicate that the incoming students were mentally disabled, and the IQ test used may not have been valid with children in the younger grades (the ones who drove all the gains). Spitz (1999) has a great historical overview.
Here are a few juicy tidbits from a ferociously bad review of the book by Thorndike (Thorndike, 1968):
- “In spite of anything I can say, I am sure it will become a classic — widely referred to and rarely examined critically”
- “Alas, it is so defective technically that one can only regret that it ever got beyond the eyes of the original investigators!”
- “When the clock strikes thirteen, doubt is not only cast on the last stroke but also on all that have come before….When the clock strikes 14, we throw away the clock.”
The endless back and forth
I can’t even begin to summarize the long-standing back-and-forth over the Pygmalion effect. Spitz (1999) does a good job summarizing from a primarily critical point of view. It’s worth reading. Equally worthy is a more sympathetic review by Jussim & Harber (Jussim & Harber, 2005).
One theme that emerges from the Spitz summary is that as more data rolled in, the concept of what the Pygmalion Effect is became a point of contention. Critics were eager to focus on IQ and to show that there is no way a specific and large effect on IQ could be reliable. Rosenthal, on the other hand, seemed comfortable with a very flexible definition of the Pygmalion Effect, accepting nearly any type of expectancy effect in a school setting as confirmation while discounting or eliding negative results. Overall, the impression one gets is that Rosenthal was eager for a simple story (expectancy effects are real) and didn’t want to get caught up in the nuances. The critics were eager to show that at least parts of the story were questionable. In my reading, this ended up being a colossal waste of time–it led to many resources poured into direct replications and endless argument but not much that was productive in terms of fleshing out and rigorously testing theories about how/when expectancy effects would occur.
The Jussim & Harber paper does a great job at trying to move things forward–acknowledging the weak evidence specifically for IQ but pushing the field to think more critically about moderators, effect sizes, boundary conditions, and the like. They end up with a much more nuanced take–that even if IQ effects might not be reliable, expectancy effects are likely real.
If you bother to read any of these sources, I’m guessing that you’ll join me in feeling an eerie and worrying sense of déjà vu related to the current replication crisis. The Pygmalion Effect stirred up many of the same debates we’re currently hashing out (measurement quality, rigor of prediction, value of meta-analysis, standards of evidence, utility of replication, etc.). There are also a lot of similarities in terms of tone and the way folks on opposing sides treated each other. Rosenthal seems to have shrugged off criticism and to have been very inventive at post-hoc reasoning. It must have driven his critics mad. I’ll let him have the last word, which I think those pushing for better science will find frustratingly familiar. This is from a paper he wrote in 1980 celebrating the Pygmalion Effect reaching the status of “citation classic”:
There were also some wonderfully inept statistical critiques of Pygmalion research. This got lots of publications for the critics of our research including one whole book aimed at devastating the Pygmalion results, which only showed that the results were even more significant than Lenore Jacobson and I had claimed.
Yes, that’s the “what doesn’t kill my statistical significance makes it stronger” fallacy Gelman has been blogging about. And, yes, it’s that same mocking dismissal of cogent criticism in favor of simplistic but almost certainly wrong stories that frustrates those trying to raise standards today. And yes, this was 38 years ago… so things haven’t changed much and Rosenthal is still highly and uncritically cited.
We’ve got to do better this time around.