Which Standardised Effect Size Measure Is Best When Variances Are Unequal?
A great new preprint by Marie Delacre (Université Libre de Bruxelles) and colleagues (Daniel Lakens, Christophe Ley, Limin Liu, & Christophe Leys) throws valuable light on this question.
The title is: Why Hedges’ gs* based on the non-pooled standard deviation should be reported with Welch’s t-test
The issue is important for Bob and me as we work on ITNS2 and esci in jamovi, so I was an avid reader. I sent comments and questions and have had a quick and generously detailed response from Marie. She intends to revise the paper around September. I suspect she would be happy to have further comments.
Below is my take on the preprint. In brief, the authors report numerous simulations to investigate the properties of 8 (!) standardised ES measures, focussing on unequal variances and departures from normality.
When variances are equal: Two familiar ES estimates
With two independent groups, and assuming homogeneity of variance, we usually use Cohen's d: the difference between the sample means divided by the pooled SD, s_p. The pooled SD is the standardiser, the unit of measurement for d. A simple adjustment then gives d_unbiased, also called Hedges' g, an unbiased estimate of δ, the population effect size (ES). Cohen's d and Hedges' g are the first two of the ES measures investigated.
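For concreteness, here's a minimal sketch of these two estimates in Python (function names are mine; the g correction uses the common 1 − 3/(4df − 1) approximation to the exact gamma-function factor):

```python
import numpy as np

def cohens_d(x1, x2):
    # Cohen's d: mean difference divided by the pooled SD, s_p
    n1, n2 = len(x1), len(x2)
    sp = np.sqrt(((n1 - 1) * np.var(x1, ddof=1)
                  + (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2))
    return (np.mean(x1) - np.mean(x2)) / sp

def hedges_g(x1, x2):
    # Hedges' g (d_unbiased): d times the small-sample correction factor
    df = len(x1) + len(x2) - 2
    return (1 - 3 / (4 * df - 1)) * cohens_d(x1, x2)
```

The approximate correction factor is accurate even for quite small samples.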
(In the preprint, the 8 ES measures are indicated as Cohen's ds and Hedges' gs, etc., with an 's' subscript. These subscripts seem redundant, and I understand they may be removed in the revised version.)
When variances are not equal: Six further ES estimates
Sometimes it's unjustified, or questionable, to assume population variances are equal. For example, a treatment often increases the variance as well as the mean, compared with the Control condition. It may then make sense to use s_C, the SD of the Control group, as the standardiser, giving Glass's d, which becomes Glass's g when debiased. These are the third and fourth of the ESs studied.
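A sketch of Glass's d and its debiased version (function names are mine; I'm assuming the usual convention that the correction df is n_C − 1, because only the Control SD is estimated):

```python
import numpy as np

def glass_d(treat, control):
    # Glass's d: standardise by the Control-group SD, s_C, only
    return (np.mean(treat) - np.mean(control)) / np.std(control, ddof=1)

def glass_g(treat, control):
    # Debiased version; df is n_C - 1 since only the Control SD is used
    df = len(control) - 1
    return (1 - 3 / (4 * df - 1)) * glass_d(treat, control)
```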
When variances are unequal, we use Welch's t test:

t' = (M1 - M2) / √(s1²/n1 + s2²/n2)
The denominator is an estimate that weights the two sample variances by sample size, with the larger group receiving the smaller weight. For inference, as with a t test, that's correct: think of the formula for the SE.
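As a sketch, Welch's t' and its Welch–Satterthwaite df might be computed like this (function name is mine):

```python
import numpy as np

def welch_t(x1, x2):
    # Welch's t': mean difference over the non-pooled SE,
    # with the Welch-Satterthwaite approximate df
    n1, n2 = len(x1), len(x2)
    v1, v2 = np.var(x1, ddof=1), np.var(x2, ddof=1)
    se2 = v1 / n1 + v2 / n2
    t = (np.mean(x1) - np.mean(x2)) / np.sqrt(se2)
    df = se2**2 / ((v1 / n1)**2 / (n1 - 1) + (v2 / n2)**2 / (n2 - 1))
    return t, df
```

Note how the df drops below n1 + n2 − 2 as the variances and sample sizes diverge.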
Shieh (2013) proposed a standardised ES measure based on a standardiser closely related to the denominator in the equation for t'. In a comment (Cumming, 2013) I argued that Shieh's d was pretty much uninterpretable: Among other problems, it didn't estimate an ES in any existing population, and its value depended strongly on nothing more than the relative sizes of the two samples. I recommended against using it.
Delacre and colleagues cited my comment, but included Shieh's d and Shieh's g (the unbiased version) for completeness, in line with their earlier work on inference (e.g., Delacre et al., 2017) advocating use of Welch's t.
However, inference should not dictate choice of standardiser: We sometimes need a standardiser not based on the SE appropriate for inference, e.g. in the simple paired design, as discussed in ITNS, pp. 207-208.
Preferable, the authors argue, is Cohen's d*:

d* = (M1 - M2) / √((s1² + s2²)/2)

which bases the standardiser on the average of the two sample variances, whatever the sample sizes. Again, it's challenging to interpret because it doesn't estimate a population ES for any existing population, but at least it's not dependent on relative sample sizes. Cohen's d* and its unbiased version, Hedges' g*, complete the 8 ES estimates investigated by Delacre and colleagues.
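A sketch of Cohen's d* and Hedges' g* (function names are mine; using the Welch–Satterthwaite df for the g* bias correction is my assumption, so check the preprint for the authors' exact choice):

```python
import numpy as np

def cohens_d_star(x1, x2):
    # Cohen's d*: standardiser is the root of the average of the two variances
    num = np.mean(x1) - np.mean(x2)
    return num / np.sqrt((np.var(x1, ddof=1) + np.var(x2, ddof=1)) / 2)

def hedges_g_star(x1, x2):
    # Hedges' g*: d* with a bias correction; the Welch-Satterthwaite df
    # used here is my assumption, not necessarily the preprint's choice
    n1, n2 = len(x1), len(x2)
    v1, v2 = np.var(x1, ddof=1), np.var(x2, ddof=1)
    se2 = v1 / n1 + v2 / n2
    df = se2**2 / ((v1 / n1)**2 / (n1 - 1) + (v2 / n2)**2 / (n2 - 1))
    return (1 - 3 / (4 * df - 1)) * cohens_d_star(x1, x2)
```

Unlike Welch's denominator, the d* standardiser ignores the sample sizes entirely.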
Results and recommendations
The simulations explored bias and variance of the 8 ES measures for a range of pairs of population variances, pairs of sample sizes, and normal and 3 distinctly non-normal population distributions: a massive project giving a rich trove of information about the robustness of 8 measures. There are numerous tables and figures of estimates of bias and variance to pore over.
The authors’ conclusions:
- “Because the assumption of equal variances… is rarely realistic… both Cohen’s d and Hedges’ g should be abandoned.” (p. 10)
That’s arguable. I’m not convinced the assumption is rarely realistic. (It’s also very often made, even if sometimes it shouldn’t be.) The emphasis should be on informed judgment in context rather than simply abandoning these two most familiar estimates. In addition, when population variances are equal, Hedges’ g performs very well. It’s also familiar and readily interpretable.
- Shieh’s d and Shieh’s g generally perform poorly and are not recommended.
That’s a relief and what I expected. Let’s not consider them further.
- “We do not recommend using [Glass’s d or g].” (p. 28)
I suggest that Glass vs something else is the choice that most clearly should be based on the context. Does it make sense to use the SD of one group, often the Control group, as the standardiser? If so, we should do so, unless there are very strong reasons against. We should use choice of sample sizes and perhaps other strategies (transform the DV to reduce departure from normality?) to minimise any disadvantage of Glass's g. The simulation results give valuable guidance on when we might be concerned and what strategies might help.
- “The measure … we believe performs best across scenarios is Hedges’ g*.” (p. 28).
This conclusion is expressed in the preprint’s title: Why Hedges’ gs* based on the non-pooled standard deviation should be reported with Welch’s t-test. The authors draw this conclusion despite having noted the wide criticism of Cohen’s d* (and by implication Hedges’ g*) because the standardiser is not the SD of an existing relevant population, so may be difficult to interpret.
Interpretability as the primary requirement for a standardised ES
When should we transform from an original to a standardised measure? What’s the purpose? As the authors note (pp. 3-4), a standardised measure can assist (i) interpretation of results in context and (ii) comparison of results for DVs with different original measures, for example using meta-analysis. It’s also (iii) useful when planning studies, whether using precision for planning or statistical power.
Above all, I’d argue, we need to be able to make sense of any point estimate—what is it estimating, what’s the unit of measurement, what does its magnitude tell us in the context? We also need an interval estimate to tell us the precision.
Hence my above comments that I would consider Hedges’ g and perhaps Glass’s g first, and contemplate Hedges’ g* only if those first two seemed seriously problematic and I couldn’t find a way to make them acceptable in context.
Estimation and assessing robustness
I'm looking for quantitative guidance about the likely bias in the point estimate, and the likely error in CI length, for, say, my favourite, Hedges' g, in some context. If the likely bias is 1–2%, or a nominally 95% CI is likely to have 92% or 96% coverage in the context, then I may stick with Hedges' g. I'd have in mind the dance of the means and dance of the CIs: Replicate, and we'd most likely get a quite different point and interval estimate, so let's not fuss too much about tiny biases. Within limits!
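To illustrate the kind of question I mean, here's a quick Monte Carlo sketch; the scenario and the estimand definition are my choices, with the estimand taken as the mean difference over the root of the average population variance, one definition among several:

```python
import numpy as np

rng = np.random.default_rng(1)

def hedges_g(x1, x2):
    # Hedges' g: Cohen's d with the usual small-sample bias correction
    n1, n2 = len(x1), len(x2)
    sp = np.sqrt(((n1 - 1) * np.var(x1, ddof=1)
                  + (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2))
    d = (np.mean(x1) - np.mean(x2)) / sp
    return (1 - 3 / (4 * (n1 + n2 - 2) - 1)) * d

# Scenario: modest heterogeneity (SDs 1 and 2), sample sizes within a factor of 2
n1, n2, mu_diff, sd1, sd2 = 40, 20, 0.5, 1.0, 2.0
# One possible estimand: mean difference over root-average population variance
delta = mu_diff / np.sqrt((sd1**2 + sd2**2) / 2)

gs = [hedges_g(rng.normal(mu_diff, sd1, n1), rng.normal(0.0, sd2, n2))
      for _ in range(20000)]
print(f"estimand {delta:.3f}, mean g {np.mean(gs):.3f}, "
      f"relative bias {(np.mean(gs) - delta) / delta:+.1%}")
```

The printed relative bias is with respect to that particular estimand; against a pooled, n-weighted population SD it would look quite different, which is exactly why the choice of standardiser, and not only small-sample bias, is at stake.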
The Delacre simulations explore an admirably wide but realistic range of differences in sample sizes and variances, and departures from normality that are fairly extreme. I suspect the authors' main strong conclusion in favour of Hedges' g* is driven largely by the big bias and variance problems found in the more extreme cases, although I'm not sure to what extent that's true.
However, if I’m dealing with g values less than 1 or 1.5, as often in psychology, and the sample sizes are within a factor of 2, how large is the likely bias? How close to 95% is the likely coverage of CIs? The robustness results are gold, and can answer many such questions, but will be most useful when re-expressed with such questions in mind. Further analysis and perhaps further simulations may be needed to give a full picture in terms of CI lengths and coverages. Then we’d have a wonderfully usable and valuable resource.
Currently the title is: Why Hedges' gs* based on the non-pooled standard deviation should be reported with Welch's t-test. If we want a p value, then Delacre et al. (2017) make a strong case for routinely preferring Welch's t test over the conventional t test, which requires homogeneity of variance: Little to lose if variances are equal and much to gain if not.
However, choice of standardised ES measure is a quite different question. Also, the formula for Welch's t (the formula above for t') bears no relation to that for Hedges' gs*, so I see no reason to link the two in the title, especially since Welch's t test is scarcely considered in the preprint.
My preference would be to use the title of this blog post, or something like: Cohen’s d and related effect size estimators: Interpretability, bias, precision, and robustness.
Marie Delacre has kindly indicated that she's open to discussion as she and colleagues work on revisions. There may be future projects, perhaps focussed on CIs. Please add comments below, or send them to her or me. Thanks.
Cumming, G. (2013). Cohen’s d needs to be readily interpretable: Comment on Shieh (2013). Behavior Research Methods, 45, 968–971. https://doi.org/10.3758/s13428-013-0392-4
Delacre, M., Lakens, D., & Leys, C. (2017). Why psychologists should by default use Welch's t-test instead of Student's t-test. International Review of Social Psychology, 30(1), 92–101. https://doi.org/10.5334/irsp.82
Shieh, G. (2013). Confidence intervals and sample size calculations for the standardized mean difference effect size between two normal populations under heteroscedasticity. Behavior Research Methods, 45, 955–967. https://doi.org/10.3758/s13428-012-0228-7