Which Standardised Effect Size Measure Is Best When Variances Are Unequal?
A great new preprint by Marie Delacre (at Université Libre de Bruxelles, firstname.lastname@example.org) and colleagues (Daniel Lakens, Christophe Ley, Limin Liu, & Christophe Leys) throws valuable light on this question.
The title is: Why Hedges’ gs* based on the non-pooled standard deviation should be reported with Welch’s t-test
The issue is important for Bob and me as we work on ITNS2 and esci in jamovi, so I was an avid reader. I sent comments and questions and have had a quick and generously detailed response from Marie. She intends to revise the paper around September. I suspect she would be happy to have further comments.
Below is my take on the preprint. In brief, the authors report numerous simulations to investigate the properties of 8 (!) standardised ES measures, focussing on unequal variances and departures from normality.
When variances are equal: Two familiar ES estimates
With two independent groups and assuming homogeneity of variance we usually use Cohen’s d, being the difference between sample means divided by the pooled SD, sp. The pooled SD is the standardiser, the unit of measurement for d. Then a simple adjustment gives us dunbiased, also called Hedges’ g, as an unbiased estimate of δ, the population effect size (ES). Cohen’s d and Hedges’ g are the first of the ES measures investigated.
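As a rough illustration of the two estimates just described, here is a minimal Python sketch using only the standard library. The function names and the scores are made up for the example, and the correction factor is the usual gamma-function form (approximately 1 − 3/(4df − 1)); treat it as a sketch rather than the preprint's own code.

```python
import math
from statistics import mean, variance


def pooled_sd(x1, x2):
    """Pooled SD: each sample variance is weighted by its degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return math.sqrt(pooled_var)


def hedges_correction(df):
    """Small-sample correction factor; roughly 1 - 3 / (4 * df - 1)."""
    log_ratio = math.lgamma(df / 2) - math.lgamma((df - 1) / 2)
    return math.exp(log_ratio) / math.sqrt(df / 2)


def cohens_d(x1, x2):
    """Difference between sample means, in units of the pooled SD."""
    return (mean(x1) - mean(x2)) / pooled_sd(x1, x2)


def hedges_g(x1, x2):
    """Cohen's d multiplied by the correction factor for df = n1 + n2 - 2."""
    df = len(x1) + len(x2) - 2
    return hedges_correction(df) * cohens_d(x1, x2)


# Hypothetical scores, invented for illustration only.
treatment = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0]
control = [10.0, 12.0, 9.0, 11.0, 13.0, 11.0]

print("Cohen's d:", round(cohens_d(treatment, control), 3))
print("Hedges' g:", round(hedges_g(treatment, control), 3))
```

With small samples the correction shrinks d noticeably; with large samples the two values are almost identical.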
(In the preprint, the 8 ES measures are indicated as Cohen’s ds and Hedges’ gs, etc, with ‘s’ subscript. These seem redundant and I understand may be removed in the revised version.)
When variances are not equal: Six further ES estimates
Sometimes it’s unjustified, or questionable, to assume population variances are equal. For example, a treatment often increases the variance as well as the mean, compared with the Control condition. It may then make sense to use sC, the SD of the Control group, as standardiser, to get Glass’s d, which becomes Glass’s g when debiased. These are the third and fourth of the ESs studied.
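A small sketch may make the Glass's d idea concrete. The data are invented, and the debiasing step assumes the usual Hedges-style correction with df equal to the Control group's n − 1; that choice of df is an assumption here, not something taken from the preprint.

```python
import math
from statistics import mean, stdev


def glass_d(treatment, control):
    """Mean difference in units of the Control group's SD."""
    return (mean(treatment) - mean(control)) / stdev(control)


def glass_g(treatment, control):
    """Glass's d with a small-sample correction (assumed df = n_control - 1)."""
    df = len(control) - 1
    log_ratio = math.lgamma(df / 2) - math.lgamma((df - 1) / 2)
    correction = math.exp(log_ratio) / math.sqrt(df / 2)
    return correction * glass_d(treatment, control)


# Hypothetical scores: the two treatment samples have the same mean (14),
# but the second has twice the spread of the first.
control = [10.0, 12.0, 9.0, 11.0, 13.0, 11.0]
treatment = [8.0, 14.0, 20.0, 11.0, 17.0, 14.0]
treatment_wider = [2.0, 14.0, 26.0, 8.0, 20.0, 14.0]

print(glass_d(treatment, control), glass_d(treatment_wider, control))
print(glass_g(treatment, control))
```

Because only the Control SD enters the standardiser, extra spread in the Treatment group leaves the unit of measurement unchanged.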
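When variances are unequal, we use Welch’s t test:

$$t' = \frac{M_1 - M_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$

where M1 and M2 are the sample means, s1 and s2 the sample SDs, and n1 and n2 the sample sizes.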
The denominator is an estimate that weights the two sample variances by sample size, with the larger group receiving the smaller weight. For inference, as with a t test, that’s correct—think of the formula for the SE.
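To see how the denominator treats the two groups, here is a minimal sketch of Welch's t with the Welch-Satterthwaite degrees of freedom. The data and function name are hypothetical; in practice a statistics package would do this for you.

```python
import math
from statistics import mean, variance


def welch_t(x1, x2):
    """Return Welch's t and the Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    # Each sample variance is divided by its own n, so the larger group
    # contributes its variance with the smaller weight.
    a, b = variance(x1) / n1, variance(x2) / n2
    t = (mean(x1) - mean(x2)) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


# Hypothetical scores: a small treatment group and a larger control group.
treatment = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0]
control = [10.0, 12.0, 9.0, 11.0, 13.0, 11.0] * 2

t, df = welch_t(treatment, control)
print("t' =", round(t, 3), "df =", round(df, 2))
```

The df always falls between the smaller group's n − 1 and n1 + n2 − 2, which is a handy sanity check.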
Shieh (2013) proposed using a standardised ES measure based on a standardiser closely related to the denominator in the equation for t′. In a comment (Cumming, 2013) I argued that Shieh’s d was pretty much uninterpretable: Among other problems, it didn’t estimate an ES in any existing population, and its value was greatly dependent merely on the relative sizes of the two samples. I recommended against using it.
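The dependence on relative sample sizes can be shown with a few lines of Python. This sketch assumes Shieh's d takes the form (M1 − M2) / √(s1²/q1 + s2²/q2) with qi = ni/N, which is my reading of Shieh (2013) and equals Welch's t divided by √N; the summary statistics are invented.

```python
import math


def shieh_d(m1, m2, s1, s2, n1, n2):
    """Shieh's d from summary statistics (assumed form; see lead-in)."""
    total = n1 + n2
    q1, q2 = n1 / total, n2 / total  # proportion of the total N in each group
    return (m1 - m2) / math.sqrt(s1 ** 2 / q1 + s2 ** 2 / q2)


# Same means and SDs, same total N = 100; only the allocation changes.
balanced = shieh_d(0.5, 0.0, 1.0, 1.0, 50, 50)
unbalanced = shieh_d(0.5, 0.0, 1.0, 1.0, 10, 90)

print("balanced 50/50:  ", round(balanced, 3))
print("unbalanced 10/90:", round(unbalanced, 3))
```

Nothing about the populations differs between the two calls, yet the estimate changes, and even in the balanced case it is half the familiar Cohen's d of 0.5.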
Delacre and colleagues cited my comment, but did include Shieh’s d and Shieh’s g (the unbiased version) for completeness and in line with earlier work of theirs on inference (e.g. Delacre et al., 2017) that advocated use of Welch’s t.
However, inference should not dictate choice of standardiser: We sometimes need a standardiser not based on the SE appropriate for inference, e.g. in the simple paired design, as discussed in ITNS, pp. 207-208.
Finally, Cohen’s d* is

$$d^* = \frac{M_1 - M_2}{\sqrt{(s_1^2 + s_2^2)/2}}$$

which bases the standardiser on the average of the two sample variances, whatever the sample sizes. Again, it’s challenging to interpret because it doesn’t estimate a population ES for any existing population, but at least it’s not dependent on relative sample sizes. Cohen’s d* and its unbiased version, Hedges’ g*, complete the 8 ES estimates investigated by Delacre and colleagues.
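Here is a minimal sketch of Cohen's d* and Hedges' g* from summary statistics. The standardiser is the square root of the simple average of the two sample variances. For the correction I assume Satterthwaite-type degrees of freedom for s1² + s2², which reduces to n1 + n2 − 2 when SDs and sample sizes are equal; check the preprint for the exact form the authors use. All names and numbers are hypothetical.

```python
import math


def cohens_d_star(m1, m2, s1, s2):
    """Mean difference in units of the root of the average sample variance."""
    return (m1 - m2) / math.sqrt((s1 ** 2 + s2 ** 2) / 2)


def df_star(s1, s2, n1, n2):
    """Satterthwaite-type df for s1^2 + s2^2 (an assumption; see lead-in)."""
    numerator = (n1 - 1) * (n2 - 1) * (s1 ** 2 + s2 ** 2) ** 2
    denominator = (n2 - 1) * s1 ** 4 + (n1 - 1) * s2 ** 4
    return numerator / denominator


def hedges_g_star(m1, m2, s1, s2, n1, n2):
    """Cohen's d* multiplied by the small-sample correction for df_star."""
    df = df_star(s1, s2, n1, n2)
    log_ratio = math.lgamma(df / 2) - math.lgamma((df - 1) / 2)
    return math.exp(log_ratio) / math.sqrt(df / 2) * cohens_d_star(m1, m2, s1, s2)


# Invented summary statistics: SDs of 1 and 2, unequal sample sizes.
print(cohens_d_star(1.0, 0.0, 1.0, 2.0))
print(hedges_g_star(1.0, 0.0, 1.0, 2.0, 20, 40))
```

Note that the sample sizes enter only through the correction factor, never through the standardiser itself.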
Results and recommendations
The simulations explored bias and variance of the 8 ES measures for a range of pairs of population variances, pairs of sample sizes, and normal and 3 distinctly non-normal population distributions: a massive project giving a rich trove of information about the robustness of 8 measures. There are numerous tables and figures of estimates of bias and variance to pore over.
The authors’ conclusions:
- “Because the assumption of equal variances… is rarely realistic… both Cohen’s d and Hedges’ g should be abandoned.” (p. 10)
That’s arguable. I’m not convinced the assumption is rarely realistic. (It’s also very often made, even if sometimes it shouldn’t be.) The emphasis should be on informed judgment in context rather than simply abandoning these two most familiar estimates. In addition, when population variances are equal, Hedges’ g performs very well. It’s also familiar and readily interpretable.
- Shieh’s d and Shieh’s g generally perform poorly and are not recommended.
That’s a relief and what I expected. Let’s not consider them further.
- “We do not recommend using [Glass’s d or g].” (p. 28)
I suggest that Glass vs something else is the choice that most clearly should be based on the context. Does it make sense to use the SD of one group, often the Control group, as the standardiser? If so, we should do so, unless there are very strong reasons against. We should use choice of sample sizes and perhaps other strategies (transform the DV to reduce departure from normality?) to minimise any disadvantage of the Glass’s g estimate. The simulation results give valuable guidance on when we might be concerned and what strategies might help.
- “The measure … we believe performs best across scenarios is Hedges’ g*.” (p. 28).
This conclusion is expressed in the preprint’s title: Why Hedges’ gs* based on the non-pooled standard deviation should be reported with Welch’s t-test. The authors draw this conclusion despite having noted the wide criticism of Cohen’s d* (and by implication Hedges’ g*) because the standardiser is not the SD of an existing relevant population, so may be difficult to interpret.
Interpretability as the primary requirement for a standardised ES
When should we transform from an original to a standardised measure? What’s the purpose? As the authors note (pp. 3-4), a standardised measure can assist (i) interpretation of results in context and (ii) comparison of results for DVs with different original measures, for example using meta-analysis. It’s also (iii) useful when planning studies, whether using precision for planning or statistical power.
Above all, I’d argue, we need to be able to make sense of any point estimate—what is it estimating, what’s the unit of measurement, what does its magnitude tell us in the context? We also need an interval estimate to tell us the precision.
Hence my above comments that I would consider Hedges’ g and perhaps Glass’s g first, and contemplate Hedges’ g* only if those first two seemed seriously problematic and I couldn’t find a way to make them acceptable in context.
Estimation and assessing robustness
I’m looking for quantitative guidance about the likely bias in the point estimate, and the likely error in CI length, for my favourite, Hedges’ g, say, in some context. If bias is likely to be 1-2%, or a nominally 95% CI is likely to have 92% or 96% coverage in the context, then I may stick with Hedges’ g. I’d have in mind the dance of the means and dance of the CIs: Replicate and most likely get a quite different point and interval estimate, so let’s not fuss too much about tiny biases. Within limits!
The Delacre simulations explore an admirably wide but realistic range of differences in sample sizes and variances, and departures from normality that are fairly extreme. I suspect the authors’ main strong conclusion in favour of Hedges’ g* is driven largely by big bias and variance problems found with the more extreme cases, although I’m not sure to what extent that’s true.
However, if I’m dealing with g values less than 1 or 1.5, as often in psychology, and the sample sizes are within a factor of 2, how large is the likely bias? How close to 95% is the likely coverage of CIs? The robustness results are gold, and can answer many such questions, but will be most useful when re-expressed with such questions in mind. Further analysis and perhaps further simulations may be needed to give a full picture in terms of CI lengths and coverages. Then we’d have a wonderfully usable and valuable resource.
The preprint’s current title is ‘Why Hedges’ gs* based on the non-pooled standard deviation should be reported with Welch’s t-test’. If we want a p value, then Delacre et al. (2017) make a strong case for routinely preferring Welch’s t test over the conventional t test that requires homogeneity of variance: Little to lose if variances are equal and much to gain if not.
However, choice of standardised ES measure is a quite different question. Also, the formula for Welch’s t (the formula for t′ above) bears no relation to that for Hedges’ gs*, so I see no reason to link the two in the title, especially since Welch’s t test is scarcely considered in the preprint.
My preference would be to use the title of this blog post, or something like: Cohen’s d and related effect size estimators: Interpretability, bias, precision, and robustness.
Marie Delacre has kindly indicated that she’s open to discussion as she and colleagues work on revisions. There may be future projects, perhaps focussed on CIs. Please add comments below, or send to her (email@example.com) or me. Thanks.
Cumming, G. (2013). Cohen’s d needs to be readily interpretable: Comment on Shieh (2013). Behavior Research Methods, 45, 968–971. https://doi.org/10.3758/s13428-013-0392-4
Delacre, M., Lakens, D., & Leys, C. (2017). Why psychologists should by default use Welch’s t-test instead of Student’s t-test. International Review of Social Psychology, 30 (1), 92–101. https://doi.org/10.5334/irsp.82
Shieh, G. (2013). Confidence intervals and sample size calculations for the standardized mean difference effect size between two normal populations under heteroscedasticity. Behavior Research Methods, 45, 955–967. https://doi.org/10.3758/s13428-012-0228-7
Dear Dr Cumming,
Thank you again for this amazing feedback and your blog post!
About the credibility of the assumption of equal population variances: many authors have argued, as we do, that the assumption of homogeneity of variances often does not hold (see for example Erceg-Hurn & Mirosevich, 2008; Zumbo & Coulombe, 1997). In a previous paper (Delacre et al., 2017), we set out many reasons why we think equal population variances are very rare in practice. Moreover, it’s very hard to check the homogeneity of variances assumption, because:
– the assumption is about population parameters that we don’t know (σ1 and σ2);
– tests of the homogeneity of variances assumption often lack the power to detect violations.
Finally, when we look at the figures in our preprint, we notice that when variances are equal across groups, Hedges’ g and Hedges’ g* are either identical (Figure 2) or very close (Figure 3). The only exception is when both skewness and kurtosis are very large. Most of the time, there is therefore little cost in choosing Hedges’ g* by default. By contrast, Hedges’ g cannot be used when variances are heterogeneous.
About Shieh’s d and Shieh’s g: it’s indeed very interesting to note that they are not recommended, either for interpretation or for inferential purposes.
About Glass’s d and Glass’s g: what makes them very complicated to use is that their bias and variance depend on parameters that we cannot control. For example, when distributions are skewed (which is very common, according to Micceri, 1989), the bias and variance of Glass’s d and Glass’s g will depend on the chosen standardizer (either S1 or S2, the estimates of the first and second population SDs), even when both samples have the same size and are drawn from populations with equal variances! This is due only to non-null correlations of opposite sign between the mean difference and S1 and S2 respectively, as explained in this appendix: https://github.com/mdelacre/Effect-sizes/blob/master/Supplemental%20Material%203/Correlation.pdf (see p. 4, row 41: “When equal population variances are estimated based on equal sample sizes (condition a)”). This problem “disappears” when we compute a standardizer that takes both S1 and S2 into account (as in Cohen’s d* or Hedges’ g*), because the resulting standardizer is uncorrelated with the mean difference (see Figures 3 and 5 in the aforementioned appendix. PS: there is a typo in the figure captions: the plots show the correlation between S_Cohen’s ds, S_Shieh’s ds and S_Cohen’s d∗s as a function of the mean difference when samples are extracted from skewed distributions). As you mention, we can minimise the disadvantages of the Glass’s g estimate with appropriate sample sizes. However, even under the normality assumption, the effect of the sample sizes ratio depends on other parameters that we cannot control, such as the SD-ratio (i.e. the ratio between the two population SDs) and the population effect size.
Depending on these unknown parameters, sometimes it is more interesting to maximize the control group, to maximize the experimental group or to uniformly add subjects in both groups (as we explain here : https://github.com/mdelacre/Effect-sizes/blob/master/Supplemental%20Material%201/Theoretical%20Variance%20of%20all%20estimators%20as%20a%20function%20of%20population%20parameters.pdf ).
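The opposite-sign correlations described above for skewed distributions are easy to check with a small simulation. This is only a sketch under assumed settings (both groups drawn from the same exponential distribution, n = 20 per group, 2000 replications, a fixed seed); it is not the code behind the appendix.

```python
import math
import random
from statistics import fmean


def sample_sd(xs):
    """Sample SD with the usual n - 1 denominator."""
    m = fmean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(reps=2000, n=20, seed=1):
    """Correlate M1 - M2 with S1 and with S2 across simulated studies."""
    rng = random.Random(seed)
    diffs, sd1, sd2 = [], [], []
    for _ in range(reps):
        # Two groups from the same right-skewed (exponential) population.
        g1 = [rng.expovariate(1.0) for _ in range(n)]
        g2 = [rng.expovariate(1.0) for _ in range(n)]
        diffs.append(fmean(g1) - fmean(g2))
        sd1.append(sample_sd(g1))
        sd2.append(sample_sd(g2))
    return pearson_r(diffs, sd1), pearson_r(diffs, sd2)


r_s1, r_s2 = simulate()
print("corr(M1 - M2, S1):", round(r_s1, 2))
print("corr(M1 - M2, S2):", round(r_s2, 2))
```

In a right-skewed population a sample with a large mean tends also to have a large SD, so the mean difference correlates positively with S1 and negatively with S2, even with equal variances and equal sample sizes.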
As a consequence, even if Glass’s g is, on the face of it, easier to interpret, I can hardly see how a measure can be very informative if we cannot control its bias and variance. I realize, as we mentioned in the preprint, that the standardizer of Hedges’ g* is not easy to interpret per se. I think the easiest way to do so is by comparison with Hedges’ g. One limitation is that Cohen’s d is very often interpreted using Cohen’s benchmarks and, like many of us (I think), I don’t really like them because they are too arbitrary and don’t take the context into account. Note, however, that some authors have proposed more appropriate benchmarks (see for example Funder & Ozer, 2019), describing an effect as small, medium or large *in comparison with* commonly published effects (with a correction to compensate for publication bias). I wish I could provide a more satisfactory solution, and I hope that your blog post will be an opportunity to open debates and to have new insights.
Delacre, M., Lakens, D., & Leys, C. (2017). Why psychologists should by default use Welch’s t-test instead of Student’s t-test. International Review of Social Psychology, 30 (1), 92–101. https://doi.org/10.5334/irsp.82
Erceg-Hurn, D. M. & Mirosevich, V. M. (2008). Modern robust statistical methods: An easy way to maximize the accuracy and power of your research. American Psychologist, 63(7), 591. DOI: https://doi.org/10.1037/0003-066X.63.7.591
Funder, D. C., & Ozer, D. J. (2019). Evaluating effect size in psychological research: Sense and nonsense. Advances in Methods and Practices in Psychological Science, 2(2), 156-168. https://doi.org/10.1177/2515245919847202
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105(1), 156–166. DOI: https://doi.org/10.1037/0033-2909.105.1.156
Zumbo, B. D. & Coulombe, D. (1997). Investigation of the robust rank-order test for non-normal populations with unequal variances: The case of reaction time. Canadian Journal of Experimental Psychology/Revue Canadienne de Psychologie Expérimentale, 51(2), 139. DOI: https://doi.org/10.1037/1196-1961.51.2.139
Thank you for your comment.
I confess I’m gradually coming around. Partly your results and arguments, partly discussion with Bob. If there is an underlying single σ, the pooled s estimate is weighted by sample size, as it should be. But if there isn’t, this pooled s lies somewhere between our best estimates of the two different σi, and comparative sample sizes influence where between, which makes interpretation crazy. Using g* solves that problem, even if (my main worry) we’re estimating a σ value that doesn’t correspond with any relevant existing population.
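The point about the pooled s can be checked with simple arithmetic. In this sketch the SDs (1 and 2) and sample sizes are invented; the two functions are just the textbook pooled SD and the root of the average variance used by d* and g*.

```python
import math


def pooled_sd(s1, s2, n1, n2):
    """Pooled SD: variances weighted by n - 1, so allocation matters."""
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))


def average_variance_sd(s1, s2):
    """Standardiser of d* and g*: ignores the sample sizes entirely."""
    return math.sqrt((s1 ** 2 + s2 ** 2) / 2)


s1, s2 = 1.0, 2.0  # hypothetical sample SDs for the two groups
for n1, n2 in [(80, 20), (50, 50), (20, 80)]:
    print(n1, n2, round(pooled_sd(s1, s2, n1, n2), 3))
print("d* standardiser:", round(average_variance_sd(s1, s2), 3))
```

The pooled value slides between the two SDs as the allocation changes, while the d* standardiser stays put and coincides with the pooled value only when the groups are the same size.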
Your main argument is strong: If variances are equal or close, g* hardly differs from g, and if not, then using g is a problem. With equal variances and very unbalanced sample sizes, we’d expect g to do a bit better than g*, and your Fig 3 shows that, for bias. Otherwise, g* does seem good.
I take all your points about weird, irrelevant stuff influencing Glass’s g, but I’m thinking of cases where the SD of Control seems strongly the natural unit to use. If the Treatment changes SD (probably as well as mean), the unit shouldn’t change. Increasing the level of Treatment might progressively increase SD and mean, but shouldn’t progressively change the unit; only Glass’s g avoids that. I’d consider a transformation, if that made sense, but I’d still like some idea of the consequences of choosing Glass’s g, in some particular context, in terms of bias, and CI coverage and length.