Badges, Badges, Badges: Open Science on the March

Here are two screen pics from today’s notice about the latest issue of Psychological Science.

Four of the first five articles earned all three badges, including Pre-reg! Gold! (OK, by showing just those five I’m cherry picking, but other articles also have lots of badges and, in any case, what a lovely juicy cherry!)

If you would like to know about Open Science, or you would like your students to learn how science is now being done, you could try ITNS, our intro textbook that includes lots of Open Science. And/or you could find out more at the APS Convention in San Fran next month:


ITNS–A New Review on Amazon

The ITNS page on Amazon (U.S.) is here. Scroll down to see the 4 reviews by readers.

Recently a five-star review was added by Edoardo Zamuner. Here it is:

“I am an experimental psychologist with training in NHST. Cumming’s book has helped me realise that my understanding of key statistical tests was largely mistaken. As a result, Cumming’s discussion of confidence intervals, effect sizes, meta-analysis and open science has become essential to my work. I found the section on effect sizes especially important, since many academic journals now require authors to report effect sizes with error estimations — with or without p-values. Cumming’s New Statistics is at the center of Psychological Science’s statistical reforms (Eich, 2013, Psychological Science, “No More Business as Usual—Embracing the New Statistics”).

“In addition to being an excellent tool for researchers, Cumming’s book is relevant to teachers interested in switching from NHST to effect sizes, and who want their students to learn about the replication crisis, meta-analysis, and open science. Students using this book will still gain the necessary skills to understand the literature published prior to the statistical reform.

“The book comes with a number of excellent online resources, including a companion website hosted by the publisher (google: routledge textbooks introduction new statistics). From this website, teachers and students can download slides with images from lecture notes, the Exploratory Software for Confidence Intervals (ESCI), and guides to SPSS and R. More online material is available on the APS website (google: aps new statistics estimation cumming), where readers can watch six excellent videos in which Cumming expands on topics from the book.”

Thank you Edoardo. I was delighted to see those judgments.

If I could just add one remark: My co-author Bob Calin-Jageman deserves a big mention for an immense and ongoing contribution to the book and all the materials.

P.S. If you have read or worked with ITNS and/or any of the accompanying materials, please consider writing a review for Amazon. It’s actually easy to do.

It’s not just Psychology: Questionable Research Practices in Ecology

Today’s fine article from The Conversation is:

Our survey found ‘questionable research practices’ by ecologists and biologists – here’s what that means

The authors are Fiona Fidler and Hannah Fraser, of The University of Melbourne.

Fidler and Fraser surveyed 807 researchers (494 ecologists and 313 evolutionary biologists) about their use of Questionable Research Practices (QRPs), including cherry picking statistically significant results, p-hacking, and hypothesising after the results are known (HARKing). The authors also asked them to estimate the proportion of their colleagues who use each of these QRPs. For each QRP, roughly half the respondents stated that they had used that practice at least once. For some practices, respondents estimated even higher rates among their research colleagues. These results are confronting, but the proportions are similar to those previously reported for psychology.

The preprint that gives more details of their survey and the results is here.

So QRPs have been endemic in Psychology, and now, it seems, in Ecology and Evolutionary Biology–and, we’d have to guess, in even more disciplines. Open Science has, of course, developed to improve research practices, in particular by reducing QRPs markedly.

One of the problems is that anti-science forces can attempt to exploit these sorts of findings, not to mention the also-confronting findings of the replication crisis. The specific focus of Fidler and Fraser’s article is to respond to this problem. They pose, and then reply to, a number of the accusations that might be prompted by their results:

It’s fraud!
NO, it’s not! Scientific fraud does occur, and is extremely serious, but the evidence is that, thankfully, it’s very rare.

Scientists lack integrity and we shouldn’t trust them
The authors present evidence and several reasons why this is not true. The rapid rise and spread of Open Science may be the strongest indicator that researchers are responding with great integrity, energy, and conviction as they develop and adopt the better ways of Open Science.

We can’t base important decisions on current scientific evidence
On the contrary, in numerous important cases, including climate change and the effectiveness of vaccination, the evidence is multi-pronged, massive, and much replicated.

Scientists are human and we need safeguards
Yes indeed, and perhaps one of the biggest challenges of Open Science is to achieve change in the incentive systems that scientists are subjected to, and that so easily lead to QRPs.

But read the article itself–it’s short and very well-written.


We’ve Been Here Before: The Replication Crisis over the Pygmalion Effect

[UPDATE: Thanks to Twitter I came across this marvelous book (Jussim, 2012) that does a great job explaining the Pygmalion effect, the controversy around it, and the overall state of research on expectancy effects. I’ve amended parts of this post based on what I’ve learned from Chapter 3 of the book… looking forward to reading the rest.]

Some studies stick with you; they provide a lens that transforms the way you see the world.   My list of ‘sticky’ studies includes Superstition in the Pigeon (Skinner, 1948), the Good Samaritan study (Darley & Batson, 1973), Milgram’s Obedience studies (Milgram, 1963), and the Pygmalion Effect by Rosenthal and Jacobson.

Today I’m taking the Pygmalion Effect off my list.  It turns out that it is much less robust than my Psych 101 textbook led me to believe (back in 1994).   Expectancy effects do occur, but it is unlikely that teacher expectations can dramatically shape IQ as claimed by Rosenthal & Jacobson.

This is news to me… though maybe not to you.  Since I first read about the Pygmalion effect as a first-year college student I’ve bored countless friends and acquaintances with this study.  It was a conversational lodestone; I could find expectancy effects everywhere and so talked about them frequently.  No more, or at least not nearly so simplistically.  The original Pygmalion Effect is seductive baloney.  [Update: I mean this in terms of teacher expectancy being able to have a strong impact on IQ.  Fair point by Jussim that expectancy effects matter a lot even if IQ isn’t directly affected.]

What has really crushed my spirit today is the history of the Pygmalion Effect.  It turns out that when it was published it set off a wave of debate that very closely mirrors the current replication crisis.  Details are below, but here’s the gist:

  • The original study was shrewdly popularized and had an enormous impact on policy well before sufficient data had been collected to demonstrate it is a reliable and robust result.
  • Critics raged about poor measurement, flexible statistical analysis, and cherry-picking of data.
  • That criticism was shrugged off.
  • Replications were conducted.
  • The point of replication studies was disputed.
  • Direct replications that showed no effect were discounted for a variety of post-hoc reasons.
  • Any shred of remotely supportive evidence was claimed as a supportive replication.  This stretched the Pygmalion effect from something specific (an impact on actual IQ) to basically any type of expectancy effect in any situation… which makes it trivially true, but not really what was originally claimed.  Rosenthal didn’t seem to notice or mind as he elided the details with constant promotion of the effect.
  • Those criticizing the effect were accused of trying to promote their own careers, bias, animus, etc.
  • The whole thing continued on and on for decades without satisfactory resolution.
  • Multiple rounds of meta-analysis were conducted to try to ferret out the real effect; though these were always contested by those on opposing sides of this issue.  [Update – on the plus side, Rosenthal helped pioneer meta-analysis and bring it into the mainstream…so that’s a good thing!]
  • Even though the best evidence suggests that expectation effects are small and cannot impact IQ directly, the Pygmalion Effect continues to be taught and cited uncritically.  The criticisms and failed replications are largely forgotten.
  • The truth seems to be that there *are* expectancy effects–but:
    • that there are important boundary conditions (like not producing real effects on IQ)
    • they are often small
    • and there are important moderators (Jussim & Harber, 2005).
  • Appreciation of how real expectancy effects work has likely been held back by the tons of attention and research devoted to this one particular research claim, which was never very reasonable or well supported in the first place.

So: The famous Pygmalion Effect is likely an illusion and the bad science that produced it was accompanied by a small-scale precursor of the current replication crisis.  Surely this is a story that has been repeating itself many times across many decades of psychology research:

Toutes choses sont dites déjà; mais comme personne n’écoute, il faut toujours recommencer / Everything has been said already; but as no one listens, we must always begin again.

(I just learned about this quote today in a Slate article; it is from André Gide and was recently quoted in a footnote by Supreme Court Justice Sonia Sotomayor.)

The Details

I’ve based this brief blog post on two papers that summarize the academic history of the Pygmalion effect: Spitz (1999) and Jussim & Harber (2005).  If you are interested in this topic, I strongly recommend them, along with the book by Jussim (2012).  There’s no way I could match these sources for their breadth and depth of coverage of this topic.  So below are the Cliffs Notes:

The Pygmalion Effect

The original Pygmalion Effect was an experiment by Rosenthal & Jacobson in which teachers at an elementary school were told that some of their students were ready to exhibit remarkable growth (based on the “Harvard Test of Inflected Acquisition”).  In reality the students designated as “about to bloom” were selected at random (about 5 per classroom in 18 classrooms spanning 5 grades).  IQ was measured before this manipulation and again at several time points after the study began.   At the 8 month time point, the 255 control students showed growth of 4 IQ points whereas the 65 children designated bloomers gained an average of 12 IQ points.  Most of this was due to much higher growth in the first and second grade classes.   The IQ tests were somewhat standardized, so supposedly the DV was *not* subject to expectancy effects by the teacher who administered it.  Thus, the original Pygmalion Effect was the notion that teacher expectancy could literally increase IQ.

The results were reported across several publications: they were presented (briefly) in a book by Rosenthal (1966), then more fully in a journal article (Rosenthal & Jacobson, 1966), then in a Scientific American article (1968), a book chapter (also 1968), and then in a full-length book (Rosenthal & Jacobson, 1968).  According to Google Scholar, the book version has been cited over 5,000 times since publication (though Google Scholar links to a summary of the book published in The Urban Review).

The experiment caused a sensation, garnering tremendous public attention and almost immediately influencing public policy and even legal decisions (Spitz, 1999).

The Problems

Not all the reaction to the Pygmalion Effect was positive.  Doubters emerged.  Some pointed out that the teachers could not recall who had been designated a student of great potential…meaning the manipulation should not have been effective (the teachers received a list of students at the beginning of the semester; few could recall the names of those on the list and many reported it to have been ‘just another memo’ in a sea of back-to-school business).  Questions were also raised about the quality of the measurement: the scores seemed to indicate that the incoming students were mentally disabled, and the IQ test used may not have been valid with children in the younger grades (the ones who drove all the gains).  Spitz (1999) has a great historical overview.

Here are a few juicy tidbits from a ferociously critical review of the book by Thorndike (1968):

  • “In spite of anything I can say, I am sure it will become a classic — widely referred to and rarely examined critically”
  • “Alas, it is so defective technically that one can only regret that it ever got beyond the eyes of the original investigators!”
  • “When the clock strikes thirteen, doubt is not only cast on the last stroke but also on all that have come before…. When the clock strikes fourteen, we throw away the clock.”

The endless back and forth

I can’t even begin to summarize the long-standing back-and-forth over the Pygmalion effect. Spitz (1999) does a good job summarizing from a primarily critical point of view.  It’s worth reading.  Equally worthy is a more sympathetic review by Jussim & Harber (Jussim & Harber, 2005).

One theme that emerges from the Spitz summary is that, as more data rolled in, the very concept of the Pygmalion Effect became a point of contention.  Critics were eager to focus on IQ and to show that there is no way a specific and large effect on IQ could be reliable.  Rosenthal, on the other hand, seemed comfortable with a very flexible definition of the Pygmalion Effect, accepting nearly any type of expectancy effect in a school setting as confirmation while discounting or eliding negative results.  Overall, the impression one gets is that Rosenthal was eager for a simple story (expectancy effects are real) and didn’t want to get caught up in the nuances.  The critics were eager to show that at least parts of the story were questionable.  In my reading, this ended up being a colossal waste of time–it led to many resources being poured into direct replications and endless argument, but not much productive work fleshing out and rigorously testing theories about how and when expectancy effects would occur.

The Jussim & Harber paper does a great job at trying to move things forward–acknowledging the weak evidence specifically for IQ but pushing the field to think more critically about moderators, effect sizes, boundary conditions, and the like.  They end up with a much more nuanced take–that even if IQ effects might not be reliable, expectancy effects are likely real.

Deja Vu

If you bother to read any of these sources, I’m guessing that you’ll join me in feeling an eerie and worrying sense of deja vu related to the current replication crisis.  The Pygmalion Effect stirred up many of the same debates we’re currently hashing out (measurement quality, rigor of prediction, value of meta-analysis, standards of evidence, utility of replication, etc.).  There are also a lot of similarities in terms of tone and the way folks on opposing sides treated each other.  Rosenthal seems to shrug off criticism, and be very inventive at post-hoc reasoning.  It must have driven his critics mad.  I’ll let him have the last word, which I think those pushing for better science will find frustratingly familiar.  This is from a paper he wrote in 1980 celebrating the Pygmalion Effect reaching the status of “citation classic”:

There were also some wonderfully inept statistical critiques of Pygmalion research.  This got lots of publications for the critics of our research, including one whole book aimed at devastating the Pygmalion results, which only showed that the results were even more significant than Lenore Jacobson and I had claimed.

Yes, that’s the “what doesn’t kill my statistical significance makes it stronger” fallacy Gelman has been blogging about.  And, yes, it’s that same mocking dismissal of cogent criticism in favor of simplistic but almost certainly wrong stories that frustrates those trying to raise standards today.  And yes, this was 38 years ago… so things haven’t changed much and Rosenthal is still highly and uncritically cited.

We’ve got to do better this time around.



Darley, J. M., & Batson, C. D. (1973). “From Jerusalem to Jericho”: A study of situational and dispositional variables in helping behavior. Journal of Personality and Social Psychology, 27(1), 100–108. https://doi.org/10.1037/h0034449
Jussim, L. (2012). Social Perception and Social Reality. OUP USA.
Jussim, L., & Harber, K. D. (2005). Teacher expectations and self-fulfilling prophecies: Knowns and unknowns, resolved and unresolved controversies. Personality and Social Psychology Review, 9(2), 131–155. https://doi.org/10.1207/s15327957pspr0902_3
Milgram, S. (1963). Behavioral study of obedience. The Journal of Abnormal and Social Psychology, 67(4), 371–378. https://doi.org/10.1037/h0040525
Rosenthal, R., & Jacobson, L. (1966). Teachers’ expectancies: Determinants of pupils’ IQ gains. Psychological Reports, 19(1), 115–118. https://doi.org/10.2466/pr0.1966.19.1.115
Rosenthal, R., & Jacobson, L. (1968). Pygmalion in the classroom. The Urban Review, 3(1), 16–20. https://doi.org/10.1007/bf02322211
Skinner, B. F. (1948). “Superstition” in the pigeon. Journal of Experimental Psychology, 38(2), 168–172. https://doi.org/10.1037/h0055873
Spitz, H. H. (1999). Beleaguered Pygmalion: A history of the controversy over claims that teacher expectancy raises intelligence. Intelligence, 27(3), 199–234. https://doi.org/10.1016/s0160-2896(99)00026-4
Thorndike, R. L. (1968). Review of Rosenthal, R., & Jacobson, L., Pygmalion in the Classroom. American Educational Research Journal, 5(4), 708–711. https://doi.org/10.3102/00028312005004708

The National Association of Scholars Weighs in on ‘The Irreproducibility Crisis’

STOP PRESS: Since first writing this post I’ve discovered that all may not be as it seems–especially to me at the other end of the Earth. As ever, we need to be vigilant for any dark forces wishing to use the replication crisis as an excuse to discredit science. Science has, overwhelmingly, been highly successful and effective, even if the knowledge it has provided has not always been used for the benefit of humanity. Open Science is the positive and creative response of scientists to the discovery that some findings in some fields can’t be replicated. Science can only become even more successful and effective as OS practices become more widespread. Developing OS practices further and, above all, working to see them adopted more widely–these should be the aims of everyone, and all organisations, concerned to enhance scientific research and its use to advance human well-being. To the extent that the NAS report does this, and is used to do this, it deserves our support.

My original post, slightly revised:
I’ve just received notice of the launch of a report titled The Irreproducibility Crisis of Modern Science: Causes, Consequences, and the Road to Reform. The public launch will be in Washington, D.C., on April 17.

The report is by the (U.S.) National Association of Scholars, which is described in Wikipedia as “an American non-profit politically conservative advocacy group, with a particular interest in education”.

I was sent a two-page summary of the report; this summary is here. [Sorry, I’ve now (6 April) taken down the summary, at the request of the NAS–they wish it to remain embargoed until the launch on 17 April. I’ll re-post it, and/or the whole report, after the launch.]

Judging by that brief summary, the full report, to be released on April 17, *seems* like a reasonable report that recognizes some of the problems that Open Science seeks to address. However, there is a lot to be concerned about as well: the report will apparently be launched at an event chaired by a noted climate-change denier… so there is some strong concern here that the science reforms being advocated may be cover for an overall anti-science program.

It’s hard to tell… but here are brief paraphrases of a few of the report’s 40 recommendations:

Statistical Standards. Researchers should define statistical significance as p < .01 rather than as p < .05.

Research Practices. Researchers should pre-register their research protocols.

Schools. High school and colleges should teach basic statistics, including discussion of uncertainty.

In my view, the first is weak and largely irrelevant, while the other two are fine. Further recommendations emphasise the importance of requiring evidence from reproducible research to guide U.S. government policies and research practice.


The Beautiful Face of a Confidence Interval: The Cat’s Eye Picture

Pawel (Pav) Kalinowski and Jerry Lai completed their PhDs a few years back. A recently published Frontiers article (citation below) reports what was primarily Pav’s research on how people understand confidence intervals (CIs). The short version is “for many people, not very well, but there is hope”.

Kalinowski, P., Lai, J., & Cumming, G. (2018). A cross-sectional analysis of students’ intuitions when interpreting CIs. Frontiers in Psychology: Quantitative Psychology and Measurement, 16 February. [free download at that link]

Random sampling: The dance of the means
I’ll talk about Pav’s work shortly, but first a few words about random sampling and the cat’s eye picture of a CI. The figure below shows the dance of the means–sample means (green dots) generated by simulation of repeated sampling from a normally distributed population. (The population has mean μ = 50, as marked by the central vertical line. Population SD = 46 and the size of each sample is N = 20, although those values are not important.) The pile of means at the bottom is the mean heap–all the sample means from previous samples. The curve is the sampling distribution of the sample means–what we expect theoretically if an infinitely large number of samples is taken. The figure is from the CIjumping page of ESCI intro chapters 3-8, which is a free download from here–click on the ESCI download tab.

Most sample means fall close to μ
The figure illustrates how most sample means fall close to μ (the population mean, marked by the central vertical line). Progressively fewer fall further from μ, and in the long run just 5% fall outside the two outer vertical lines. The curve is a summary of this pattern. The SD of this curve is given the name standard error (SE). The outer vertical lines mark the central 95% of the area under the curve, and lie approximately 2 x SE either side of μ.
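The long-run pattern described above is easy to check for yourself. Here is a minimal Python simulation of the same set-up (a sketch only, not ESCI itself; it uses the values from the figure, μ = 50, SD = 46, N = 20):

```python
import random
from statistics import mean

MU, SIGMA, N = 50, 46, 20      # population mean and SD, sample size (as in the figure)
SE = SIGMA / N ** 0.5          # standard error of the sample mean

random.seed(1)                 # make this 'dance of the means' reproducible
sample_means = [mean(random.gauss(MU, SIGMA) for _ in range(N))
                for _ in range(10_000)]

# In the long run, about 95% of sample means fall within 1.96 x SE of mu
inside = sum(abs(m - MU) < 1.96 * SE for m in sample_means) / len(sample_means)
print(f"proportion of means within 1.96 SE of mu: {inside:.3f}")
```

Run it and the printed proportion comes out very close to .95, just as the outer vertical lines in the figure suggest.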

In real life we don’t, of course, know μ, and we have only a single mean. Given our mean, where is μ? That’s the core challenge of statistical inference. The figure tells us that our best bet is that our sample mean has fallen fairly close to μ, tho’ it could easily have fallen a little distance from μ and, just possibly, rather further still.

The 95% CI on our sample mean has length equal to the distance between the outer verticals, which is approximately 4 x SE. Place a line of this length so it is centred on our sample mean and we have the CI. Now the critical step: In addition, centre the curve from the figure also on our sample mean–not on μ as in the figure above. Just for neat symmetry, also centre an upside down version of the curve on our mean. We get the upper picture in the figure below, which comes from the Frontiers article.

That upper picture is the cat’s eye picture on the 95% CI, with, for additional emphasis, the area spanned by the CI shown shaded. This is 95% of the total area between the two curves. The lower picture is the same for the 50% CI.

The cat’s eye picture of a CI
The cat’s eye picture makes salient what is implicit in a CI represented, conventionally, by a mere line. It tells us that our mean has most likely fallen quite close to the unknown μ, with the ‘fatness’, or vertical extent, of the cat’s eye telling us the relative likelihood, or plausibility, that various points along the CI are where μ is located. Most likely, μ is fairly close to the mean, but it could easily be out towards either end of the CI, or even, just possibly, a little beyond the end of the CI.

Note that nothing special happens exactly at the end of the CI. Just inside or just outside–virtually no difference in chances that μ lies there.
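That smooth behaviour at the interval’s ends can be quantified with the standard normal density, which gives the cat’s eye its shape. A quick sketch using Python’s standard library:

```python
from statistics import NormalDist

z = NormalDist()   # standard normal distribution

# Relative plausibility (cat's-eye 'fatness') at points along a 95% CI,
# measured in SE units from the sample mean; the CI limit is at 1.96 SE.
for x in (0.0, 1.0, 1.9, 1.96, 2.0):
    print(f"{x:4.2f} SE from mean: {z.pdf(x) / z.pdf(0):.3f}")
```

The relative density at 1.9 SE (just inside the limit) is about 0.16, and at 2.0 SE (just outside) about 0.14: virtually no difference, exactly as the cat’s eye suggests.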

Note also that the 50% CI is about one-third the length of the 95% CI, which tells us that there is approximately a 50-50 chance that μ lies in the middle third of the 95% CI. A handy little fact to remember.
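That handy fact comes straight from the normal quantiles, and is easy to verify with stdlib Python:

```python
from statistics import NormalDist

z = NormalDist()
half_95 = z.inv_cdf(0.975)   # ~1.96: half-length of a 95% CI, in SE units
half_50 = z.inv_cdf(0.75)    # ~0.674: half-length of a 50% CI

print(f"50% CI length / 95% CI length = {half_50 / half_95:.3f}")
```

The ratio is about 0.34, so the 50% CI is indeed roughly the middle third of the 95% CI.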

So, when you see a 95% CI, visualise the cat’s eye, to help your intuitions about where the population effect size that you are trying to estimate is most likely to lie. All the above is a long-winded way to say that the cat’s eye is the beautiful, but usually hidden, face of a CI.

Pav’s research
But do people understand that the chances of where μ lies vary across (and beyond) a CI? Traditional strict Frequentist dogma distinguishes only inside and outside a CI, and doesn’t permit any distinctions of different points within the interval. That, however, flies in the face of the dance of the means and how sampling actually behaves. The cat’s eye tells true. To what extent do people appreciate that?

Pav defined the Subjective Likelihood Distribution (SLD) as the “cognitive representation of the relative likelihood of each point across and beyond a CI in landing on the population parameter. For example, a uniform SLD reflects the (incorrect) belief that every point inside a CI is equally likely to have landed on μ.” He used several empirical approaches to estimate the shape of the SLD of senior undergraduate and graduate students.

In very brief summary, Pav found that students’ SLD curves varied widely in shape. Some were close to flat, although many were (correctly) higher close to the sample mean than further away. Pav also identified a number of basic misconceptions about CIs held by some students. Some did not understand, for example, that a 99% CI must be longer than a 95% CI–because it encompasses a larger percentage of the area of the cat’s eye.

Pav then interviewed some of the students at length. He identified in finer detail the correct and wrong conceptions each student held about CIs. Then he introduced the cat’s eye picture. He found encouraging initial evidence that learning about that picture helped many of the students to a better understanding of CIs.

There is enormous scope for Pav’s work to be followed up in more detail, and for the teaching implications to be explored. We believe that the cat’s eye can be extremely useful to help students, including beginners, develop better intuitions about CIs. Researchers as well! So ITNS does illustrate and discuss cat’s eye pictures.

I’d be very interested to hear of experiences from the classroom of anyone who has used cat’s eyes as part of their teaching of CIs. Thanks!


P.S. And now the most important bit, the abstract:
We explored how students interpret the relative likelihood of capturing a population parameter at various points of a CI in two studies. First, an online survey of 101 students found that students’ beliefs about the probability curve within a CI take a variety of shapes, and that in fixed choice tasks, 39%, 95% CI [30, 48], of students’ responses deviated from true distributions. For open ended tasks, this proportion rose to 85%, 95% CI [76, 90]. We interpret this as evidence that, for many students, intuitions about CI distributions are ill-formed, and their responses are highly susceptible to question format. Many students also falsely believed that there is substantial change in likelihood at the upper and lower limits of the CI, resembling a cliff effect (Rosenthal and Gaito, 1963; Nelson et al., 1986). In a follow-up study, a subset of 24 post-graduate students participated in a 45-min semi-structured interview discussing the students’ responses to the survey. Analysis of interview transcripts identified several competing intuitions about CIs, and several new CI misconceptions. During the interview, we also introduced an interactive teaching program displaying a cat’s eye CI, that is, a CI that uses normal distributions to depict the correct likelihood distribution. Cat’s eye CIs were designed to help students understand likelihood distributions and the relationship between interval length, C% level and sample size. Observed changes in students’ intuitions following this teaching program suggest that a brief intervention using cat’s eyes can reduce CI misconceptions and increase accurate CI intuitions.

Measuring Heterogeneity in Meta-Analysis: The Diamond Ratio (DR)

This is a post about the Diamond Ratio (DR), a simple measure of the extent of heterogeneity in a meta-analysis. We introduced the DR in ITNS. But first, some background.

Fixed Effect (FE) model for meta-analysis
The diamond at the bottom of the forest plot picturing a meta-analysis reports the overall point estimate and its 95%CI. If there’s not too much study-to-study variation in the results–in other words if the forest plot looks rather like a dance of the CIs arising from exact replications–then we’re done. The simple Fixed Effect (FE) model, which assumes the studies are all estimating the same population effect size, is probably reasonable. If so, we say the studies are homogeneous.

Below is Figure 9.2 of ITNS, showing a FE meta-analysis of 10 studies. (The ESCI Meta-Analysis file used to make this figure is a free download by following the links from here.)

Random Effects (RE) model
However, it’s usually unrealistic to assume the studies are homogeneous. If there looks to be rather more study-to-study variation in the forest plot, then the studies may be heterogeneous and we need the Random Effects (RE) model, which assumes that the different studies may be estimating somewhat differing population effect sizes.

Below is Figure 9.3 of ITNS, showing the same studies as in Figure 9.2 above, but this time a RE meta-analysis. In ESCI, click between the two radio buttons at bottom left to switch between FE and RE models. The most obvious change is that the RE diamond is longer than the FE diamond. In fact about 40% longer. The Diamond Ratio (DR) is the ratio of the RE diamond length to the FE diamond length. Here, the value of the DR is 1.40, as reported centre bottom in both figures.

Measures of heterogeneity
The conventional measures of heterogeneity in a meta-analysis are Q, I-squared, and tau-squared. In Chapter 8 of my first book, UTNS, I give the formulas, explain how the three inter-relate, and try to explain what they mean. It’s all a bit complicated, and I’ve always felt that these measures don’t really give a good intuition of heterogeneity. For ITNS, the introductory book, we needed something much simpler and more intuitive.

The radio buttons in ESCI make it easy to click between FE and RE models, and to watch as the diamond jumps between shorter and longer–when there is heterogeneity. The ratio of these two lengths seemed a simple way to express the amount of difference between the two models, and so ESCI reports the value of the DR. If DR = 1.0 the two models give the same result, suggesting there is little or no heterogeneity–the studies could easily be homogeneous. The larger the DR, the greater the heterogeneity. Values above around 1.5, and especially values approaching or exceeding 2.0, suggest considerable heterogeneity.
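For readers who want to see the arithmetic behind the DR, here is a minimal Python sketch. It assumes the common DerSimonian-Laird estimator of between-study variance for the RE model, and the data are hypothetical; ESCI’s exact implementation may differ in detail:

```python
import numpy as np

def diamond_ratio(y, v):
    """FE and RE (DerSimonian-Laird) 95% diamond lengths, and their ratio.

    y: study effect-size estimates; v: their within-study variances."""
    w = 1.0 / v                           # fixed-effect weights
    mu_fe = np.sum(w * y) / np.sum(w)
    se_fe = np.sqrt(1.0 / np.sum(w))

    # DerSimonian-Laird estimate of the between-study variance tau^2
    k = len(y)
    Q = np.sum(w * (y - mu_fe) ** 2)
    C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - (k - 1)) / C)

    w_re = 1.0 / (v + tau2)               # random-effects weights
    se_re = np.sqrt(1.0 / np.sum(w_re))

    len_fe, len_re = 2 * 1.96 * se_fe, 2 * 1.96 * se_re
    return len_fe, len_re, len_re / len_fe

# Hypothetical data: five studies with rather spread-out estimates
y = np.array([0.2, 0.5, 0.8, 0.1, 0.9])
v = np.full(5, 0.01)
_, _, dr = diamond_ratio(y, v)
print(f"Diamond Ratio: {dr:.2f}")        # well above 1: considerable heterogeneity
```

When the study estimates are homogeneous, Q is small, tau-squared is truncated to zero, the two diamonds coincide, and DR = 1; the more the estimates spread relative to their within-study variances, the longer the RE diamond and the larger the DR.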

Heterogeneity and Moderators
Heterogeneity need not be a nuisance. If we can find a moderator that accounts for a usefully large part of the heterogeneity, then we may have made a discovery–we may be able to answer a research question that no single one of the separate studies in the meta-analysis can address. To take a simple example, maybe some of the studies used only females, and others only males. If gender can account for a large part of the heterogeneity in the results of the separate studies, we may have made a useful discovery.

Moderator analysis can only identify correlation, not causality. But identifying a moderator can lead to theoretical advances, and help guide research fruitfully. Even beginners should be able to appreciate the value of heterogeneity and moderators in meta-analysis. We attempt to explain all that, using simple examples with forest plots, in Chapter 9 of ITNS.
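The gender example above can be sketched as a simple subgroup analysis (hypothetical numbers; a real moderator analysis would typically use the subgroup or meta-regression tools of a meta-analysis package): pool each subgroup separately, then ask whether the subgroup means differ by more than sampling error would allow.

```python
# Hypothetical data: female-only studies cluster low, male-only studies high.
effects   = [0.20, 0.25, 0.15, 0.60, 0.70, 0.55]
variances = [0.05, 0.04, 0.06, 0.05, 0.04, 0.06]
gender    = ["F", "F", "F", "M", "M", "M"]   # the candidate moderator

def fe_pool(es, vs):
    """Fixed-effect pooled estimate and its variance."""
    w = [1 / v for v in vs]
    return sum(wi * ei for wi, ei in zip(w, es)) / sum(w), 1 / sum(w)

subgroups = {}
for g in sorted(set(gender)):
    es = [e for e, gi in zip(effects, gender) if gi == g]
    vs = [v for v, gi in zip(variances, gender) if gi == g]
    subgroups[g] = fe_pool(es, vs)

overall, _ = fe_pool(effects, variances)

# Q-between: do the subgroup means deviate from the overall mean by more
# than their standard errors suggest? Referred to chi-square on
# (number of subgroups - 1) degrees of freedom.
q_between = sum((est - overall) ** 2 / var for est, var in subgroups.values())

for g, (est, var) in subgroups.items():
    print(f"{g}: {est:.3f} (SE {var ** 0.5:.3f})")
print(f"Q-between = {q_between:.2f}")
```

With these invented numbers the two subgroup diamonds sit well apart and Q-between is large, which is the pattern that would suggest gender accounts for much of the heterogeneity.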

Studying the DR
The DR seemed an appropriately simple and intuitive measure of heterogeneity for ITNS, but what are its properties? I did a few simple simulations, and found the DR seemed to behave sensibly, but I couldn’t get far when I tried to investigate the underlying maths.

Fortunately, my savvy statistics colleague at La Trobe University, Luke Prendergast, was interested in the DR. He has been supervising Max Cairns, a PhD student whose project is an investigation of the DR.

A CI on the DR!
They recently invited me to La Trobe to discuss progress. It turns out that the DR, under another name, has received some (favourable) attention in the technical research literature. Max has now taken that previous work notably further. He reports that the DR does seem to behave well, and to be related in a sensible way to the conventional measures of heterogeneity. The big breakthrough is that Max has found a way to calculate a CI on the DR. That’s great news, and a big advance! His simulations suggest that his CI behaves well across a wide variety of situations.

There is more work to be done, then Luke and Max plan to write up their DR findings. We may together prepare a version for a psychology audience. They may also develop an online tool, to make it easy for researchers to enter their data, then get not only the DR but also the CI on that DR. Which would be wonderful!

It will take a while, and the outcomes are not guaranteed, but progress so far is highly promising. Well done Max (and Luke)! Best of luck with the next stages. I shall report further in due course.

Meanwhile we can all read Chapter 9 of ITNS and appreciate that the DR is a legitimate indicator of the extent of heterogeneity. Yay!


Randomistas: Dare we hope for evidence-based decisions in public life?

I’ve just listened to a great 20-min podcast, published by The Conversation. The podcast is here. It’s an interview by my colleague Fiona Fidler with Andrew Leigh, about his recently released book:

Randomistas: How Radical Researchers Changed Our World. Published by Black Inc. and La Trobe University Press.

Andrew Leigh is a Harvard-trained economist who was formerly a professor of economics at the Australian National University in Canberra. In Randomistas he argues that we should be using randomised trials much more often to guide public policy choices. He describes numerous examples of randomised trials, in a wide variety of fields. He’s well aware of the replication crisis and the Open Science practices needed to ensure trustworthy research.

So far, so good. But the really great thing is that Leigh is not just any ex-professor. He’s also an elected member of the House of Representatives, which is Australia’s lower house of Parliament, roughly equivalent to the U.S. House of Representatives or the U.K. House of Commons. Furthermore, he’s the Shadow Assistant Treasurer. If, as current polls suggest is likely, there is a change of government at the next Federal Election, due probably in early 2019, then he could easily be Australia’s Assistant Treasurer, and thus in a position to practise what he’s preaching in Randomistas.

Of course, it’s much easier to express good intentions when in Opposition than to put them into practice when in Government. But it’s a great start that someone in his position knows enough, and cares enough, about randomised trials and evidence-based policy-making to write so impressively about them.

Australia has had more than its share of atrocious political decisions that fly in the face of science and evidence. Dare we hope that a change of government might lead to an improvement?


Sample-size planning – a short video

Here’s a short talk on sample-size planning that I gave at the 2017 Society for Neuroscience meeting. The talk discusses:

  • Why you should plan your sample sizes in advance
  • What not to do (how some common approaches can lead you astray)
  • What an effect size is and why it’s important to have a sense of effect sizes
  • How to plan a sample for power
  • Even better, how to plan a sample for precision
  • Resources to get you started (the talk, script, and resources are posted here)

All in 25 minutes.  Whew.  SFN taped the talk and made a fancy video.  Unfortunately, they made the video available only to members.  This is the copy I liberated from the SFN website.  Viva la revolucion.
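The power-versus-precision distinction can be put in back-of-the-envelope terms. Here is my own normal-approximation sketch for a two-group comparison with standardized effect size d (not code from the talk; exact t-based answers run slightly larger):

```python
from math import ceil
from statistics import NormalDist

z = NormalDist().inv_cdf  # standard normal quantile function

def n_per_group_for_power(d, power=0.80, alpha=0.05):
    """n per group so a two-sided test of effect size d has the
    requested power (normal approximation, no t correction)."""
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)

def n_per_group_for_precision(moe, alpha=0.05):
    """n per group so the CI on d has half-width of about moe, in d
    units (normal approximation; ignores the extra variance term in d)."""
    return ceil(2 * (z(1 - alpha / 2) / moe) ** 2)

print(n_per_group_for_power(0.5))       # 80% power to detect d = 0.5
print(n_per_group_for_precision(0.2))   # 95% CI half-width of about 0.2
```

Note how precision drives n: halving the target margin of error roughly quadruples the required sample size, which is one reason planning for precision forces more honest budgeting than quoting a bare power figure.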

Open Science: This Time in Orthodontics

Last month it was the Antarctic Scientists, this month the Orthodontists, and once again I had a most enjoyable time. My wife Lindsay and I are just back from 5 days in Sydney. I was speaking at the 26th Australian Orthodontic Congress, which is the biennial meeting of the Australian Society of Orthodontists.

My first presentation was a 90 min workshop for post-graduate students: Open Science and The New Statistics: Doing Research in the Post-p<.05 World. My slides are here. There were about 100 grad students in the group, and I felt they were very much on the ball. It was a pleasure to discuss with them Open Science, and better ways to do statistics.

My second presentation was 30 min in the Doctors’ Program, meaning the main Congress. Considering that most were practitioners, and thus mainly consumers of research, my title was: Open Science and The New Statistics: What Research Looks Like in This Post-p<.05 World. My shorter set of slides is here.

That night was party night, so I could chat to numerous people about Open Science, orthodontics, statistics, and lots more, as our party boat slowly cruised around Sydney Harbour and fine food and drink were served. People were clearly getting the Open Science message, and I heard many stories about selective publication, problems with getting research published, and the need for replication. The next morning I had a 60 min slot for follow-up discussion. I was impressed that more than 100 people came. I could by then say a bit more about Open Science and the new statistics in the context of orthodontics. Some excellent questions were raised from the floor.

It was clear that many smart and highly accomplished people were joining the discussion. It was also clear that many appreciated that orthodontics must lift its research game. I was asked what, concretely, the Australian journal should do, and what the ASO Foundation–which sponsors much research and post-grad student project work–should do. More generally, what can ASO do to lift standards, in accord with Open Science practices?

I asked Brian Nosek, Executive Director of the Center for Open Science, how best the ASO could obtain COS advice, and access to the services offered by COS. He kindly replied immediately, nominating Matt Spitzer, leader of the COS Community team. So Matt should be able to provide advice and contacts for ASO leaders interested in adopting Open Science practices. Maybe even contacts with other groups of orthodontists, or other medical professionals, who are also working towards Open Science?

It’s great to know that Open Science awareness is spreading to yet more disciplines and professions. I could even point out that, already, three orthodontic journals have signed up to the TOP guidelines. So things are already changing in orthodontics. Well done to those who are pushing this ahead–and to those who are about to take things even further.

P.S. Thanks to the ASO, especially Mark Cordato and Ali Darendeliler, for the invitation and generous hospitality.
P.P.S. Lindsay and I enjoyed our comfortable hotel room, a corner room on the 22nd floor with amazing views over the Harbour. One day we took the ferry to the Sydney Opera House and saw an excellent performance of Carmen in the Joan Sutherland Theatre, then took the ferry back ‘home’ again. One of life’s good experiences!
P.P.P.S. Just near the International Conference Centre, on Darling Harbour, we happened on Experiment Street. A sign that science and evidence matter, at least to some folks around there?