**Golf.com** recently ran a **story** titled **‘Lucky’ golf items might actually work, according to study**. The story told of Tiger Woods sinking a very long putt to send the U.S. Open to a playoff. “That day, Tiger had two lucky charms in-play: His Tiger headcover, and his legendary red shirt.”

The story cited **Damisch et al. (2010)**, published in *Psychological Science*, as evidence that the lucky charms may have contributed to the miraculous putt success.

Laudably, the APS highlights public mentions of research published in its journals. It posted **this summary** of the Golf.com story, and included it (‘Our science in the news’) in the latest weekly email to members.

However, this was a misfire, because the Damisch results have failed to replicate, and the pattern of results has prompted criticism of the work. Read on…

**The original Lucky Golf Ball study**

**Damisch et al.** reported a study in which students in the experimental group were told—with some ceremony—that they were using the lucky golf ball; those in the control group were not. Mean performance was 6.4 of 10 putts holed for the experimental group, and 4.8 for controls—a remarkable difference of *d* = 0.81 [0.05, 1.60]. (See ITNS, p. 171.) Two further studies using different luck manipulations gave similar results.

**The replications**

**Bob** and colleague **Tracy Caldwell** (**Calin-Jageman & Caldwell, 2014**) carried out two large preregistered close replications of the Damisch golf ball study. Lysann Damisch kindly helped them make the replications as similar as possible to the original study. Both replications found effects close to zero.

**Meta-analysis of original and replication studies**

Here’s Figure 9.8 from ITNS, a forest plot that shows results from six studies from the Damisch group (red), and the two Calin-Jageman & Caldwell replications (blue). The first of the red studies and both of the blue used the lucky golf ball task.

The clear difference between the overall red mean and overall blue mean is shown at the bottom on a Difference axis; it’s -0.77 [-1.15, -0.38], so a clear failure to replicate.

**Not replicable, but citable**

That’s the title of a 2018 **post** by Bob lamenting the common pattern of a striking original finding that continues to make waves, even while strong counter evidence from replications languishes in the shadows.

Below I’ve updated his figure showing citation counts for the original Damisch article and the Bob & Tracy replication article. The pattern has not improved these last three years!

**The pattern of Damisch results**

The six red CIs in the forest plot are astonishingly consistent, all with *p* values a little below .05. Greg Francis, in **this 2016 post**, summarised several analyses of the patterns of results in the original Damisch article. All, including ***p*-curve analysis**, provided evidence that the reported results most likely had been *p*-hacked or selected in some way.

**Another failure to replicate**

Dickhäuser et al. (2020) reported two high-powered preregistered replications of a different one of the original Damisch studies, in which participants solved anagrams. Both found effects close to zero.

All in all, there’s little evidence for the lucky golf ball. APS should skip any mention of the effect.

**What next?**

Open Science practices will help. Perhaps high quality replication articles can be marked with big badges and trumpet fanfares? With everything online, it should be possible to add annotations to original articles and provide links to later replications, no doubt with original authors having a right of reply. Meanwhile, we need to keep up the skepticism and eternal vigilance.

Geoff

Calin-Jageman, R. J., & Caldwell, T. L. (2014). Replication of the Superstition and Performance Study by Damisch, Stoberock, and Mussweiler (2010). *Social Psychology*, *45*(3), 239–245. https://doi.org/10.1027/1864-9335/a000190

Damisch, L., Stoberock, B., & Mussweiler, T. (2010). Keep Your Fingers Crossed! How Superstition Improves Performance. *Psychological Science*, *21*(7), 1014–1020. https://doi.org/10.1177/0956797610372631

Dickhäuser, O., Heinze, A., Hamm, M. L., Bales, A. S., Bellmann, S. A., Böger, D., et al. (2020). Zwei teststarke, präregistrierte Replikationsstudien zum Einfluss von Glück auf kognitive Leistung (Two high-powered preregistered replication studies on effects of superstition on cognitive performance). *Zeitschrift für Pädagogische Psychologie, 34,* 51-60. https://doi.org/10.1024/1010-0652/a000263

Cohen’s *d* for two independent groups, of size *n*_{1} and *n*_{2}, with means *M*_{1} and *M*_{2} and SDs of *s*_{1} and *s*_{2} is

*d* = (*M*_{1} – *M*_{2}) / (standardizer)

where ‘standardizer’ is some SD we choose as an appropriate unit of measurement for *d*. The numerator (difference between the means) is the effect size of research interest in original units and *d* is that ES re-expressed as a number of SDs; it’s a kind of *z* score.

Choice of standardizer is critical: *d* needs to be *interpretable* in the context. If our data are IQ scores on a well-established test, we might choose as standardizer σ = 15, the SD in the test’s reference population. But usually we’ll need to choose an estimate, calculated from the data, as standardizer. For two groups, it’s common to assume homogeneity of variance and use *s*_{p}, the pooled estimate of the population SD. If one group is a control group, we might choose the SD of that group as standardizer, thus avoiding the assumption. Other choices are possible.

Unfortunately, *d* is a biased estimate: it overestimates δ, especially for small samples. A simple calculation debiases *d*. Of course, to interpret a value of *d* we need to know what standardizer was used and whether the reported value has been debiased. My question: What symbol should we use for debiased *d*?
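To make those calculations concrete, here’s a minimal Python sketch of *d* standardized by *s*_{p}, plus the debiasing step. The summary statistics in the last two lines are hypothetical; the exact correction factor (via the gamma function) is shown alongside a comment noting the familiar approximation.

```python
import math

def cohens_d_pooled(m1, s1, n1, m2, s2, n2):
    """Cohen's d for two independent groups, standardized by the pooled
    SD s_p (assumes homogeneity of variance)."""
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / sp

def debias(d, df):
    """Multiply d by the exact small-sample correction factor,
    J = gamma(df/2) / (sqrt(df/2) * gamma((df-1)/2)),
    which is close to the familiar approximation 1 - 3/(4*df - 1)."""
    j = math.exp(math.lgamma(df / 2) - math.lgamma((df - 1) / 2)) / math.sqrt(df / 2)
    return j * d

# Hypothetical summary statistics: two groups of n = 10
d = cohens_d_pooled(6.4, 2.0, 10, 4.8, 2.0, 10)   # 0.8
d_unb = debias(d, 10 + 10 - 2)                    # about 0.77
```

Note how, even at *n* = 10 per group, the correction shrinks *d* by only about 4%; by *N* around 50 the bias is negligible.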

In 2011, in **UTNS** (p. 295), I wrote:

You’d think something as basic as *d*_{unb} would have a well-established name and symbol, but it has neither. … In the early days the two independent groups *d* calculated using *s*_{p} [pooled SD for the two groups, assuming homogeneity of variance] as standardizer was referred to as Hedges’ *g*. For example the important book by Hedges and Olkin (1985), which is still often cited, used *g* in that way, and used *d* for what they explained as *g* adjusted to remove bias. So their *d* is my *d*_{unb}. By contrast, leading scholars Borenstein et al. (2009) *swapped* the usage of *d* and *g*, so now their *d* is the version *with* bias, and Hedges’ *g* refers to my *d*_{unb}. Maybe hard to believe, but true. The CMA [meta-analysis] software also uses *g* to refer to *d*_{unb}. In further contrast, Rosnow and Rosenthal (2009) is a recent example of other leading scholars explaining and using Hedges’ *g* with the traditional meaning of *d* standardized by *s*_{p} and *not* adjusted to remove bias. Yes, that’s all surprising, confusing, and unfortunate.

Larry Hedges is one of the authors of Borenstein, et al. (2009), so presumably he supported the swapping of the labels. I asked him about these issues and he kindly replied with an account of the history. In his foundational articles of 1980-82 he used *g* for the biased estimate, to honour meta-analysis pioneer Gene Glass, and *g*^{U} for the unbiased version. Then from around 1985 he started using *d* for the unbiased estimate, to correspond with δ (delta, Greek ‘d’). He reports that he doesn’t know who started using *g* for the unbiased estimate but that, by 2009, his co-authors felt that they should go with what seemed to have become standard practice.

My informal impression—I could be wrong—is that ‘**Hedges’ g**’ is increasingly being used for debiased *d*.

Bob and I need to decide what we’ll do in ITNS2 and in **esci**. Specifically, should we stick with ‘**d_{unbiased}**’, or switch to using ‘**Hedges’ g**’?

Despite the possible messiness of a long word as subscript, I’m currently leaning towards sticking with **d_{unbiased}**. My thoughts:

- ‘Cohen’s *d*’, or simply ‘*d*’, is overwhelmingly the term used to denote the standardized ES. It’s used to introduce and explain the idea, and in journal articles—sometimes even if debiased values are reported. Further, *d*_{unbiased} signals a particular variant of *d*, and even explains its key property—being an unbiased estimate. Guessing would probably give reasonable understanding.
- I suspect most researchers have heard of *d*, have interpreted values and perhaps used *d* in their own research, even if they don’t know about debiasing—which anyway isn’t an issue for *N* more than, say, 50. Many fewer would have heard of Hedges’ *g*, or be able to link it to *d*, let alone say *how* it relates to *d*. Both those links would need to be explained, and taught. The change of letter symbol seems arbitrary; there is no way to guess.
- It’s common (and useful) to refer to ‘the *d* family’ of standardized effect size measures. How strange that the most commonly needed member of that family is labeled ‘*g*’.
- It’s a great convention that a Roman letter estimates the corresponding Greek letter. So *M*, *s*, and *r* estimate µ, σ, and ρ respectively. Therefore it’s great that δ is widely used for the population value of Cohen’s *d*. Using *g* for the sample value suggests we’re estimating γ, which is never used. How weird to have to explain that the best estimate of δ is *g*.
- A mathematical statistician would sidestep all the above by using “delta hat” for the estimate, but I don’t think that’s a good universal solution for psychology, or many other research fields.
- In medicine, SMD, for “standardised mean difference”, is widely used and a reasonable acronym. However, it can refer to a population value or sample estimate, and very often we’re left to wonder whether bias has been removed.
- On the other hand, it’s useful for a subscript to signal how *d* is calculated, perhaps *d*_{s} when *s*_{p} is the standardizer and we assume homogeneity of variance, and *d*_{C} when the SD of the Control group is standardizer and we avoid that assumption. Using *d* and *g* permits subscripts to tell us about the standardizer. However, I don’t think any strong conventions have emerged as to which subscripts tell us what.

Given all that, I’m currently preferring *d*_{unbiased}. However, has *g* become unstoppable? If so, the complexity of *d*, *g*, and δ is just one more baffling inconsistency we have to explain to bemused students.

Please let me have your thoughts.

Geoff

Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). *Introduction to meta-analysis*. Chichester, UK: Wiley.

Cohen, J. (1969). *Statistical power analysis for the behavioral sciences*. New York: Academic Press.

Hedges, L. V., & Olkin, I. (1985). *Statistical methods for meta-analysis*. Orlando, FL: Academic Press.

The title is: **Why Hedges’ g_{s}* based on the non-pooled standard deviation should be reported with Welch’s t-test**

The issue is important for Bob and me as we work on ITNS2 and **esci in jamovi**, so I was an avid reader. I sent comments and questions and have had a quick and generously detailed response from Marie. She intends to revise the paper around September. I suspect she would be happy to have further comments.

Below is my take on the preprint. **In brief**, the authors report numerous simulations to investigate the properties of 8 (!) standardised ES measures, focussing on unequal variances and departures from normality.

With two independent groups and assuming homogeneity of variance we usually use **Cohen’s d**, being the difference between sample means divided by the pooled SD: *d* = (*M*_{1} – *M*_{2}) / *s*_{p}.

(In the preprint, the 8 ES measures are indicated as **Cohen’s d_{s}**, **Hedges’ g_{s}**, and so on, with subscripts; below I use the simpler names.)

Sometimes it’s unjustified, or questionable, to assume population variances are equal. For example, a treatment often increases the variance as well as the mean, compared with the Control condition. It may then make sense to use *s*_{C}, the SD of the Control group, as standardiser, to get **Glass’s d**, which becomes **Glass’s g** when debiased.

When variances are unequal, we use Welch’s *t* test: *t*′ = (*M*_{1} – *M*_{2}) / √(*s*_{1}^{2}/*n*_{1} + *s*_{2}^{2}/*n*_{2}).

The denominator is an estimate that weights the two sample variances by sample size, with the larger group receiving the smaller weight. For inference, as with a *t* test, that’s correct—think of the formula for the SE.
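As a concrete illustration (with hypothetical summary statistics), here’s a short Python sketch of Welch’s *t*′ together with the Welch–Satterthwaite df:

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t' and its Welch-Satterthwaite df for two independent
    groups, without assuming equal population variances."""
    se_sq = s1**2 / n1 + s2**2 / n2              # squared SE of the difference
    t_prime = (m1 - m2) / math.sqrt(se_sq)
    df = se_sq**2 / ((s1**2 / n1)**2 / (n1 - 1) + (s2**2 / n2)**2 / (n2 - 1))
    return t_prime, df

# Hypothetical summary statistics; note the larger group (n = 25)
# contributes the smaller weight (1/25) to its variance term.
t_prime, df = welch_t(6.4, 2.0, 20, 4.8, 3.0, 25)
```

The df typically falls somewhere between min(*n*_{1}, *n*_{2}) – 1 and *n*_{1} + *n*_{2} – 2, equalling the latter only when variances and sample sizes match up neatly.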

Shieh (2013) proposed using a standardised ES measure based on a standardiser closely related to the denominator in the equation for *t*′. In a comment (Cumming, 2013) I argued that Shieh’s *d* was pretty much uninterpretable: Among other problems, it didn’t estimate an ES in any existing population, and its value was greatly dependent merely on the relative sizes of the two samples. I recommended against using it.

Delacre and colleagues cited my comment, but did include **Shieh’s d** and **Shieh’s g** among the 8 measures, presumably because that standardiser matches the SE used for inference with Welch’s test.

However, inference should not dictate choice of standardiser: We sometimes need a standardiser not based on the SE appropriate for inference, e.g. in the simple paired design, as discussed in ITNS, pp. 207-208.

Finally, consider **Cohen’s d***, with standardiser √((*s*_{1}^{2} + *s*_{2}^{2})/2), which bases the standardiser on the average of the two sample variances, whatever the sample sizes. Again, it’s challenging to interpret because it doesn’t estimate a population ES for any existing population, but at least it’s not dependent on relative sample sizes. **Cohen’s d*** and its unbiased version, **Hedges’ g***, complete the set of 8 measures.
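To see how the choice of standardiser changes the estimate, here’s a small Python sketch contrasting the pooled-SD *d*, Glass’s *d*, and *d** on the same (hypothetical) summary statistics, where the Treatment group has the larger SD:

```python
import math

def d_pooled(m1, s1, n1, m2, s2, n2):
    """d with the pooled SD as standardiser (assumes equal variances)."""
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / sp

def d_glass(m1, m2, s_control):
    """Glass's d: the SD of the Control group as standardiser."""
    return (m1 - m2) / s_control

def d_star(m1, s1, m2, s2):
    """Cohen's d*: standardiser is the root of the plain average of the
    two sample variances, whatever the sample sizes."""
    return (m1 - m2) / math.sqrt((s1**2 + s2**2) / 2)

# Hypothetical groups: Treatment (M = 6.4, s = 3, n = 30) vs
# Control (M = 4.8, s = 2, n = 10); same mean difference throughout.
estimates = (
    d_pooled(6.4, 3.0, 30, 4.8, 2.0, 10),
    d_glass(6.4, 4.8, 2.0),
    d_star(6.4, 3.0, 4.8, 2.0),
)
```

Note that with equal group sizes *s*_{p} equals √((*s*_{1}^{2} + *s*_{2}^{2})/2), so the pooled *d* and *d** coincide; they diverge as the sample sizes differ, while Glass’s *d* ignores the Treatment SD entirely.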

The simulations explored bias and variance of the 8 ES measures for a range of pairs of population variances, pairs of sample sizes, and normal and 3 distinctly non-normal population distributions: a massive project giving a rich trove of information about the robustness of 8 measures. There are numerous tables and figures of estimates of bias and variance to pore over.

The authors’ conclusions:

- “Because the assumption of equal variances… is rarely realistic… both Cohen’s *d* and Hedges’ *g* should be abandoned.” (p. 10)

That’s arguable. I’m not convinced the assumption is rarely realistic. (It’s also very often made, even if sometimes it shouldn’t be.) The emphasis should be on informed judgment in context rather than simply abandoning these two most familiar estimates. In addition, when population variances *are* equal, Hedges’ *g* performs very well. It’s also familiar and readily interpretable.

- **Shieh’s d** and **Shieh’s g** generally perform poorly and are not recommended.

That’s a relief and what I expected. Let’s not consider them further.

- “We do not recommend using [**Glass’s d** or **Glass’s g**].” (p. 28)

I suggest that Glass vs something else is the choice that most clearly should be based on the context. Does it make sense to use the SD of one group, often the Control group, as the standardiser? If so, we should do so, unless there are very strong reasons against. We should use choice of sample sizes and perhaps other strategies (transform the DV to reduce departure from normality?) to minimise any disadvantage of the **Glass’s g** estimate. The simulation results give valuable guidance on when we might be concerned and what strategies might help.

- “The measure … we believe performs best across scenarios is Hedges’ *g**.” (p. 28)

This conclusion is expressed in the preprint’s title: **Why Hedges’ g_{s}* based on the non-pooled standard deviation should be reported with Welch’s t-test**. The authors draw this conclusion despite having noted the wide criticism of **Cohen’s d*** (and thus **Hedges’ g***) as not estimating an ES in any existing population.

When should we transform from an original to a standardised measure? What’s the purpose? As the authors note (pp. 3-4), a standardised measure can assist (i) **interpretation** of results in context and (ii) comparison of results for DVs with different original measures, for example using **meta-analysis**. It’s also (iii) useful when **planning** studies, whether using precision for planning or statistical power.

Above all, I’d argue, we need to be able to make sense of any point estimate—what is it estimating, what’s the unit of measurement, what does its magnitude tell us in the context? We also need an interval estimate to tell us the precision.

Hence my above comments that I would consider **Hedges’ g** and perhaps **Glass’s g**, with the choice guided by context.

I’m looking for quantitative guidance about the likely bias in the point estimate, and error in the CI length of, for example, my favourite, **Hedges’ g**, in some context. If bias is likely to be 1-2% or a nominally 95% CI to have 92% or 96% coverage in the context, then I may stick with **Hedges’ g**.

The Delacre simulations explore an admirably wide but realistic range of differences in sample sizes and variances, and departures from normality that are fairly extreme. I suspect the authors’ main strong conclusion in favour of **Hedges’ g*** is driven largely by big bias and variance problems found with the more extreme cases, although I’m not sure to what extent that’s true.

However, if I’m dealing with *g* values less than 1 or 1.5, as often in psychology, and the sample sizes are within a factor of 2, how large is the likely bias? How close to 95% is the likely coverage of CIs? The robustness results are gold, and can answer many such questions, but will be most useful when re-expressed with such questions in mind. Further analysis and perhaps further simulations may be needed to give a full picture in terms of CI lengths and coverages. Then we’d have a wonderfully usable and valuable resource.

Currently this is **Why Hedges’ g_{s}* based on the non-pooled standard deviation should be reported with Welch’s t-test**. If we want a *t* test that’s robust to unequal variances, Welch’s is indeed the one to use (Delacre et al., 2017).

However, choice of standardised ES measure is a quite different question. Also, the formula for Welch’s *t* (formula above for *t*′) bears no relation to that for **Hedges’ g_{s}***, so I see no reason to link the two in the title, especially since Welch’s *t* can be used whichever ES measure is reported.

My preference would be to use the title of this blog post, or something like: **Cohen’s d and related effect size estimators: Interpretability, bias, precision, and robustness**.

Marie Delacre has kindly indicated that she’s open to discussion as she and colleagues work on revisions. There may be future projects, perhaps focussed on CIs. Please add comments below, or send to her (marie.delacre@ulb.be) or me. Thanks.

Geoff

Cumming, G. (2013). Cohen’s *d* needs to be readily interpretable: Comment on Shieh (2013). *Behavior Research Methods, 45,* 968–971. https://doi.org/10.3758/s13428-013-0392-4

Delacre, M., Lakens, D., & Leys, C. (2017). Why psychologists should by default use Welch’s *t*-test instead of Student’s *t*-test. *International Review of Social Psychology, 30* (1), 92–101. https://doi.org/10.5334/irsp.82

Shieh, G. (2013). Confidence intervals and sample size calculations for the standardized mean difference effect size between two normal populations under heteroscedasticity. *Behavior Research Methods, 45,* 955–967. https://doi.org/10.3758/s13428-012-0228-7

You can see **here** the list of speakers, the abstracts, **video recordings** of most of the talks, and links to **slides** and other resources provided by many of the speakers.

*Teaching the New Statistics, Now With Better Software*

Bob’s and my talk was the last—at traditional conferences, the graveyard slot, when many have already fled to the airport. It was at 6am in the winter dark for me, but that was OK. At **the site**, scroll to the end to see our links, then click the down arrow to see our abstract.

My job was to demonstrate the great new software: **esci web** by Gordon Moore, and **esci in jamovi** by Bob. Both are freely accessible from the **esci** menu at **our site**. For **esci web**, search at our site for ‘Gordon’ to find four blog posts.

Some starting points in **the video of our talk**:

10.47 **esci web**

25.35 **dance of the p values**, in esci web

35.30 variability of *p* values with replication: **Significance Roulette**

40.00 **esci in jamovi**

41.40 **two independent groups** example

45.50 **single group** example

47.41 **meta-analysis**

50.02 the **diamond ratio** (heterogeneity in a meta-analysis)

50.07 **interaction** in a two-way design

The conference day ran from my 11 pm to 7 am, so I managed to catch only a few other talks. These included:

*Teaching statistics: Damnation and deliverance*

Lively and engaging, this talk included a dazzling array of videos in various punk and gothic styles (I may have those terms a bit wrong…) designed to highlight statistical ideas. Definitely different, even if, I suspect, not to everyone’s taste. However, given the wild success of his statistics textbooks, Andy must be doing many things right.

*The Replication Crisis: What should we teach to undergraduates, and when?*

Interesting discussion of Open Science—why it’s needed and what we should teach about it. Starting at 21.55 are some arguments for *not* teaching undergraduates about meta-analysis. I’m not persuaded, while fully agreeing that selective publication (thank you NHST) is an enormous problem for meta-analysis—and science.

*Tips and tricks for teaching Bayesian statistics*

E.J. was in good form, as lively and compelling as ever. I took my first course on Bayesian statistics in 1965 and have read books and attended many workshops since. The logic, and the match with how human cognition works have always appealed. I’ve kept looking out for simple and practical ways to introduce Bayesian methods in the intro statistics course for psychology students. Ideally, these should also help seasoned researchers brought up in the NHST tradition (ugh!) to understand and adopt Bayesian methods. I’m still looking, which is why I’ve focused on traditional frequentist CIs as the practical way forward, at least at first. I’d be more than happy to see Bayesian estimation, based on credible intervals, and Bayesian modelling much more widely used.

At 5.30 see E.J.’s book—a free **download**—*Bayesian thinking for toddlers*, which presents an ingenious and extremely simple way to introduce the core Bayesian idea of using evidence to update belief.

From 20.42 E.J. is talking about JASP, the wonderful SPSS-killing open-source statistics application that his team has been developing for some time. It supports frequentist as well as Bayesian methods. Bob is working with the team towards having **esci** available within JASP before too long.

At 45.40 E.J. recommends ** Bayesian Statistics the Fun Way** by Will Kurt. I’ve sent off for it—perhaps this will give me the easy way in that I’ve long been seeking?

Bob and I have reports that some instructors are successfully using ITNS and its online resources for a flipped-classroom approach. That’s been great to hear—flipping may be becoming widespread, especially after 2020, and we hope that ITNS2 and its materials will be even more flip-friendly. So I was especially interested to see these three talks from flipsters at McMaster and York Universities.

*Flipping inferential statistics*

*The flipped classroom improves performance in introductory statistics: Early evidence from a systematic review and meta-analysis*

This is the one talk of the three for which a video is available, at least at present. The meta-analysis included 11 studies comparing flipped with lecture formats. The overall mean effect size advantage for flipped was *g* = 0.40, with part of that attributable to use of regular quizzes.

*No tutorials, no problem: The inspiration, planning, and execution of flipping a Statistics II course*

*Teaching Reproducibility and Replicability in Statistics*

This talk was just before mine; I caught the last part. Great enthusiasm and engagement. Lots on Open Science. What seemed like excellent advice on using R from the start, via R Markdown. Pointers to resources.

I highly recommend browsing the abstracts and at least dipping in to the videos. Please make a comment below on any you find especially interesting or useful. Thanks.

Geoff

P.S. **Congratulations** and **thanks** to Kevin, Fegal, and Rob for their initiative and vast amount of work. There are already discussions about when another such conference might be organized.

*The Journal of Physiology* is embracing a number of Open Science practices and has just published an **editorial** highly critical of the *p* value. Yay!

**Simon Gandevia**, of the **NeuRA** research institute in Sydney, has been awarded Honorary Membership of **The Century Citation Club** of *JPhysiol*, having published more than 100 articles (phew!) in that journal. He was invited to write an editorial, which was published in January. Simon reflected on changes in the Journal and the discipline—massive advances in techniques, larger teams, longer articles, stronger links to clinical applications—and closed by explaining how unreliable and misleading *p* can be, especially in relation to replications.

Simon described how *JPhysiol* had, since around 2010, encouraged authors to adopt better statistical techniques and reporting practices, and had published how-to guides to help. However, little changed, and so in 2018 the Journal **mandated** more complete reporting of research—to facilitate replication—and especially of data and statistical analyses. In other words, adoption of key Open Science practices.

Simon then used his lovely figure, above, to demonstrate that an **Initial p value** (two-tailed) tells us remarkably little about the *p* value an exact replication is likely to give.

The vertical grey bars (refer to right axis) are the 80% prediction intervals for replication *p*. These come from my *p* intervals article (Cumming, 2008, **here** or **here**). If initial *p* = .05, there’s an 80% chance that *p* in an exact replication lies in (.0002, .65), and a 10% chance it lies to the left of that interval, and 10% to the right. For initial *p* = .01, the prediction interval is (.00001, .41). The intervals are so long! A replication can, alas, give just about any *p* value, so no *p* is to be trusted.
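Those interval endpoints can be checked with a small Monte Carlo sketch, under a model consistent with the quoted values: sampling variability in both studies is the only source of difference, so replication *z* is distributed N(initial *z*, √2). This is my paraphrase of the model, not code from the article.

```python
import random
from statistics import NormalDist

def p_interval(p_initial, coverage=0.80, n_sim=200_000, seed=1):
    """Prediction interval for two-tailed replication p, given the initial
    two-tailed p.  Model: z_rep ~ N(z_init, sqrt(2)), reflecting sampling
    variability in both the initial and the replication study."""
    nd = NormalDist()
    z_init = nd.inv_cdf(1 - p_initial / 2)
    rng = random.Random(seed)
    p_reps = sorted(
        2 * (1 - nd.cdf(abs(rng.gauss(z_init, 2 ** 0.5))))
        for _ in range(n_sim)
    )
    lo = p_reps[int(n_sim * (1 - coverage) / 2)]   # 10th percentile
    hi = p_reps[int(n_sim * (1 + coverage) / 2)]   # 90th percentile
    return lo, hi

lo, hi = p_interval(0.05)   # expect roughly (.0002, .65)
```

Running it for initial *p* = .01 similarly lands near the quoted (.00001, .41)—a quick way to feel just how long these intervals are.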

Simon also discusses the advantages of replacing NHST “with presentation of effect sizes and confidence intervals (ESCI). This… would avoid the phoney dichotomy of significant vs. insignificant…”. Indeed! He mentioned a recent **article** of his that’s one of a number in *JPhysiol* in which authors had taken such an estimation approach. That’s progress!

Simon’s editorial is definitely worth a read.

In **reply**, **Brent Raiteri** made a number of points and, in particular, argued that there are typically so many unknowns about a replication that it’s not possible to calculate any value for replication *p*. Simon kindly invited three of us to join him in a **reply** to Raiteri, which has just come online. We clarified a couple of points and Simon prepared an amended version of his figure that makes clear that two-tailed *p* values are used throughout. (It’s this amended figure that I include above.) We also noted that the values in the figure don’t rely on the assumption—unrealistic, but often made—that the initial study estimated exactly the size of the effect in the population.

We noted that the figure assumes that sampling variability is the only cause for differences between initial and replication *p*, so the probabilities and prediction intervals depicted represent a best case—replication *p* may, in practice, be even more unreliable. We agreed with Raiteri that usually there are further differences between initial and replication studies, perhaps sometimes sufficient to justify Raiteri’s claim that it’s not possible to calculate replication *p* values.

You probably know that **JPhysiol** is one of the longest-running and most highly regarded journals in the biological sciences. Founded in 1878, it has published classic research from numerous eminent physiologists.

Within his institute, Simon has long championed improved research practices. The **Research Quality page** of NeuRA introduces the Reproducibility & Quality Sub-Committee that Simon convenes. It’s active in promoting Open Science practices by NeuRA’s researchers.

At that page, scroll down to see the video of a March 2021 talk by Simon:

**Research Quality and Reproducibility: Why You Should Be Worried**

- At about 8.45, note a nice story about Sir John Eccles being acutely aware of having published incorrect findings, then finding a way to do better—which led to his Nobel Prize.
- At about 22.30, see the Quality Output Checklist and Content Assessment (QuOCCA), an instrument for assessing the transparency, data analysis, and reporting practices of a draft journal manuscript.

A little lower on that page is a video of a talk that I, at Simon’s invitation, gave at NeuRA in December 2019:

**Improving the Trustworthiness of Neuroscience Research**

I included two demonstrations of the unreliability of replication *p* values:

- At about 11.00 see the **dance of the *p* values** (or search YouTube for ‘dance of the p values’).
- At about 13.00 I move on to explain then demonstrate **significance roulette** (or search YouTube for ‘significance roulette’ to find two videos).

The slides of that talk are **here** and a blog post about my visit to NeuRA is **here**.

A warm salute to Simon and the editors at *JPhysiol* for bringing that august journal into the world of Open Science and better statistics.

Geoff

**55,000** – the approx. number of times **jamovi** was downloaded in March

**2,500** – the approx. number of times **esci** was added to **jamovi** in March

Each of these is about **double** the number for three months earlier! At this rate, everyone on Earth will have their own copy within a year or two—roughly speaking.

Of the 38 modules available in the **jamovi** library, **esci** is currently the **5^{th} most popular**—demonstrating that it’s fully usable despite still being in development. Hats off to Bob!

In case it’s new to you, **jamovi** is the free, open-source stats software that crushes SPSS. It’s even better with added **esci**—which is designed to go with the second edition of ITNS, currently in preparation.

To get started, see **this post**.

Even simpler than downloading jamovi—tho’ this is quick and easy—just click on the **big green button** at **the jamovi home page** to open jamovi directly in any browser. Then play. (The online version is experimental, and modules can’t yet be added.)

Enjoy, and please let’s have your comments and suggestions,

Geoff

The **editorial** is short, and a great read. **Christophe Bernard**, the editor-in-chief, includes links to his **2019 editorial** that announced the initiative, **our article** explaining estimation that was published in eNeuro at the same time, and a recent **blog post** in which eNeuro authors reflect on their experiences of figuring out how to include estimation in their analyses.

Christophe also includes a brief intro to estimation, with links to the **dance of the p values** and Gordon’s **esci web**.

That’s all great to hear, and I salute Christophe for his initiative and persistence. Take note, other journal editors (*Journal of Neuroscience*?), it can be done! Judging by author comments in that eNeuro blog post, researchers who have taken the plunge can see the benefits and are generally keen to continue with estimation.

This may have all started back in November 2018 at the giant SfN conference in San Diego, where Bob moderated a **PD Workshop** he had organized: **Improving Your Science: Better Inference, Reproducible Analyses, and the New Publication Landscape**. Christophe was one of the speakers and may have become an enthusiast that day. Shortly after, he started working towards the 2019 announcement and editorial. Bob’s workshop was the acorn…

The highly encouraging eNeuro story prompts me to think back to past efforts by enterprising journal editors to move statistical practices beyond *p* values. Here’s a brief word about a few.

More than 40 years ago, Ken Rothman published articles advocating confidence intervals and explaining how to calculate them in various situations. He was influential in persuading the **International Council of Medical Journal Editors** (ICMJE) to include in their 1988 revision of their **Uniform Requirements for Manuscripts Submitted to Biomedical Journals** the following:

“When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals). Avoid sole reliance on statistical hypothesis testing, such as the use of *p* values, which fail to convey important quantitative information. . . .”

Rothman, an assistant editor during 1984-87 at the *American Journal of Public Health*, insisted that authors of manuscripts he assessed remove all references to statistical significance, NHST, and *p* values. We, in **Fidler et al. (2004)**, examined articles published in various years from 1982 to 2000 and found that CI reporting increased from 10% to 54% during the Rothman years, then remained at a similar level through to 2000—as was becoming standard in other medical journals, following the ICMJE policy of 1988.

In 1990 Rothman founded the journal *Epidemiology* and declared that this journal did not publish NHST or *p* values. For the 10 years of his editorship it basically didn’t, while CI reporting reached more than 90%.

BUT, even when CIs were reported—often merely as numbers in tables—they were rarely referred to, or used to inform interpretation. We suspected that researchers needed way more explanations, examples, and guidance to appreciate what estimation can offer.

Geoffrey Loftus, Editor of *Memory & Cognition* from 1994 to 1997, strongly encouraged presentation of figures with error bars and avoidance of NHST. He even calculated error bars for numerous authors who claimed it was too difficult for them. We, in **Finch et al. (2004)**, reported that use of figures with bars increased to 47% under Loftus’s editorship and then declined. However, bars were rarely used for interpretation, and NHST remained almost universal. It seemed that even strong editorial encouragement, and assistance with analyses, was not sufficient to bring about substantial and lasting improvement in psychologists’ statistical practices.

Eric Eich, as editor-in-chief of *Psychological Science*, initiated perhaps the most important and successful journal transformation, at least in psychology. At the start of 2014 he published his famous editorial **Business Not as Usual**, which introduced Open Science badges, encouragement to use the new statistics, and other important advances. He published **Cumming (2014)**, the tutorial article on the new statistics that he’d invited me to write.

When Steve Lindsay took over as editor-in-chief he introduced further advances, including Preregistered Direct Replications. His **Swan Song Editorial** recounts the Open Science advances from 2014 to 2019, with evidence of sweeping changes in authors’ practices and what the journal has published. (I posted about that editorial **here**.)

Now editor-in-chief Patricia Bauer is continuing Open Science policies. For example, the **Submission Guidelines** still state that “*Psychological Science* recommends the use of the **“new statistics”**—effect sizes, confidence intervals, and meta-analysis—to avoid problems associated with null-hypothesis significance testing…”. They include links to **our site**, my **tutorial article**, and **my videos** introducing the new statistics that were recorded at the 2014 APS Convention.

I’d like to think that Rothman, Loftus, and other editors who, decades ago, tried so hard to encourage better practices did help bring about the advent of Open Science, which shook things up sufficiently to give later enterprising editors a better chance of getting their wonderful initiatives to stick.

Christophe has continued and broadened the crusade to great effect.

I’m delighted to see the evidence that so many of these positive changes look like they will persist and spread further. Bring that on!

And Bob and I hope, of course, that ITNS2 can help students understand why Open Science and the new statistics are the natural, better, and more easily understood way to do things.

Geoff

Cumming, G. (2014). The new statistics: Why and how. *Psychological Science, 25,* 7-29. https://doi.org/10.1177/0956797613504966

Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2004). Editors can lead researchers to confidence intervals, but can’t make them think: Statistical reform lessons from medicine. *Psychological Science, 15,* 119-126. https://doi.org/10.1111/j.0963-7214.2004.01502008.x

Finch, S., Cumming, G., Williams, J., Palmer, L., Griffith, E., Alders, C., Anderson, J., & Goodman, O. (2004). Reform of statistical inference in psychology: The case of *Memory & Cognition*. *Behavior Research Methods, Instruments & Computers, 36,* 312-324. https://doi.org/10.3758/BF03195577

**Cohen’s d** is the ratio of an effect size (often a mean, or difference between means) to a standard deviation. Typically both are estimates from the data, so it’s hardly surprising that the distribution of *d* is not simple: it involves the noncentral *t* distribution.

Back then we used a very early version of ESCI to illustrate how sliding two ever-changing noncentral *t* curves along the *d* axis (**the pivot method**) allowed us, for those two designs, to find the lower and upper bounds of the CI on the *d* calculated from sample data. The figure below uses the version of ESCI that goes with UTNS to illustrate the pivot method.
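In symbols, the pivot idea for two independent groups goes roughly like this (a sketch of the general approach, not ESCI’s exact code):

```latex
% Observed t and df for two independent groups of sizes n_1 and n_2:
t_{\mathrm{obs}} = d \, \sqrt{\frac{n_1 n_2}{n_1 + n_2}}, \qquad
\nu = n_1 + n_2 - 2 .
% Slide the noncentral t curve along the axis: find noncentrality
% parameters \lambda_L and \lambda_U such that
\Pr\!\left( T_{\nu,\lambda_L} \ge t_{\mathrm{obs}} \right) = \alpha/2, \qquad
\Pr\!\left( T_{\nu,\lambda_U} \le t_{\mathrm{obs}} \right) = \alpha/2 .
% Then the 100(1-\alpha)\% CI on \delta is
\left[ \; \lambda_L \Big/ \sqrt{\tfrac{n_1 n_2}{n_1 + n_2}} \; , \;\;
\lambda_U \Big/ \sqrt{\tfrac{n_1 n_2}{n_1 + n_2}} \; \right] .
```

The two bounds come from the two ever-changing noncentral *t* curves mentioned above: each is slid until its tail just touches the observed *t*.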

For the paired design we couldn’t, alas, find even a good approximate way to calculate a CI on *d*.

Happily, by the time I was writing UTNS, Algina & Keselman (2003) had proposed an approximate solution to the problem of the paired case. They reported simulations that showed their method was pretty good, for a limited range of situations. In UTNS, pp. 306-307, I described my efforts to use simulations to assess their method. I found I could broaden the range of cases for which the approximation did well. Even so, there were limits, as stated in ESCI. For example, *N* had to be at least 6, and *d*_{unbiased} could not be greater than 2. But at least ESCI could provide a quite good approximate CI on *d* for the paired design.

The usual *d* = [(effect size)/SD] overestimates δ. A simple correction factor, which depends on the *df* of the SD, gives us *d*_{unbiased}, which is what we should routinely use. In UTNS, for the paired case, I followed Borenstein et al. (2009) and used *df* = (*N* – 1), even though this seemed a little strange, given that the SD is estimated from the standard deviations of both measures (e.g., the pre-scores and the post-scores).
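As an illustration of that debiasing step, here’s a minimal sketch using the standard gamma-function correction factor (the values of *d* and *df* below are just examples, and this is not ESCI’s code):

```python
import math

def correction_factor(df):
    """Exact small-sample correction: Gamma(df/2) / (sqrt(df/2) * Gamma((df-1)/2))."""
    return math.gamma(df / 2) / (math.sqrt(df / 2) * math.gamma((df - 1) / 2))

def d_unbiased(d, df):
    """Debiased Cohen's d: multiply d by the correction factor for the given df."""
    return d * correction_factor(df)

# Example values: d = 0.81 with df = 18 (e.g., two independent groups of 10)
print(round(d_unbiased(0.81, 18), 3))               # 0.776
# The familiar approximation 1 - 3/(4*df - 1) is very close to the exact factor:
print(round(0.81 * (1 - 3 / (4 * 18 - 1)), 3))      # 0.776
```

Note the correction always shrinks *d* a little, and matters most when *df* is small.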

Goulet-Pelletier & Cousineau (2018 **here**, and erratum 2020 **here**) report a wide-ranging review of *d* and its CI. Their simulations suggest that in the paired case debiasing should use *df* = 2(*N* – 1), not (*N* – 1) as I used in UTNS and ESCI. They refer to *d*_{unbiased} as *g*.

Then Fitts (2020 **here**) investigated this issue and found by simulation that the debiasing *df* needs to reflect ρ, the population correlation between the two measures. When ρ = 0, *df* = 2(*N* – 1), as for independent groups. If ρ = 1, then *df* = (*N* – 1). Intermediate values of ρ need intermediate values of *df*.

Cousineau (2020 **here**) took a major step forward by finding a good approximation to the distribution for *d* in the paired design, and a formula for the *df* that includes ρ.

Now, hot off the press, Cousineau & Goulet-Pelletier (2021 **here**) report a massive set of simulations that assess eight (!!) ways to calculate an approximate CI on *d*, five of them being their new proposals. The Algina-Keselman method that I used in UTNS turns out to be reasonable, but isn’t the best. The best is the ‘Adjusted Λ’ (“lambda-prime”) method, which is one of their new proposals. This gives CIs that have very close to 95% coverage, and some other desirable properties, for a wide range of values of *N*, *d*, and ρ.

See **their paper** for a description of the method, and on p. 58 the R code. It’s probably what we’ll use in **esci jamovi**.

This progress makes me very happy. Maybe you too?

Geoff

Algina, J., & Keselman, H. J. (2003). Approximate confidence intervals for effect sizes. *Educational and Psychological Measurement*, *63*, 537–553. https://doi.org/10.1177/0013164403256358

Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). *Introduction to meta-analysis*. New York, NY: John Wiley & Sons.

Cousineau, D. (2020). Approximating the distribution of Cohen’s *d*_{p} in within-subject designs.

Cumming, G., & Finch, S. (2001). A primer on the understanding, use and calculation of confidence intervals that are based on central and noncentral distributions. *Educational and Psychological Measurement, 61*, 532-574. https://doi.org/10.1177/0013164401614002

Fitts, D. (2020). Commentary on “a review of effect sizes and their confidence intervals, part I: The Cohen’s *d* family”: The degrees of freedom for paired samples designs. *The Quantitative Methods for Psychology*, *16*(4), 281–294. https://doi.org/10.20982/tqmp.16.4.p281

Goulet-Pelletier, J.-C., & Cousineau, D. (2018). A review of effect sizes and their confidence intervals, Part I: The Cohen’s *d* family. *The Quantitative Methods for Psychology*, *14*(4), 242–265. https://doi.org/10.20982/tqmp.14.4.p242

Goulet-Pelletier, J.-C., & Cousineau, D. (2020). Erratum to Appendix C of “A review of effect sizes and their confidence intervals, Part I: The Cohen’s *d* family”. *The Quantitative Methods for Psychology*, *16*(4), 422–423. https://doi.org/10.20982/tqmp.16.4.p422

Cousineau, D., & Goulet-Pelletier, J.-C. (2021). A study of confidence intervals for Cohen’s *d*_{p} in within-subject designs with new proposals.

**Precision for Planning** tells us what *N* we need to achieve the precision we’d like. It’s a much better way to plan than the traditional use of statistical power, which works only within an NHST framework. Far better to adopt an **estimation framework** (the new statistics) and use PfP.

For an intro to PfP, see Chapter 10 in **ITNS**. For more detail, see Chapter 13 in **UTNS**.

For a **two independent groups study**, with two groups of size *N*, below is the PfP picture. Recall that **MoE** is the **margin of error**, which is half the length of a CI. I’ve set the slider at the bottom to **target MoE** = 0.50, meaning that I want to estimate the difference between the group means with a 95% CI having MoE of 0.50. In other words, each arm of the CI should be 0.50 in length.

The lower axis is marked in units of population SD, which we can think of as units of **Cohen’s d**. The cursor marks a target MoE of 0.50 in those units.

The **black curve** shows how required *N* increases dramatically as we aim for smaller values of MoE–in other words, greater precision and a shorter CI. Use this curve to investigate how *N* trades off against likely precision.

The small curve at the bottom shows how MoE varies for *N* = 32. It’s usually close to 0.50, but can be as short as 0.40 or as long as 0.60, and occasionally even a little outside that range. Use the large slider to move the cursor and see the **MoE distribution** for other values of target MoE and *N*.

The figure gives us a **handy benchmark**, worth remembering: Any study with two independent groups of size 32 will estimate the difference between the group means with a 95% CI that has MoE of 0.50, on average.
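Here’s a minimal sketch of where that benchmark comes from, assuming average MoE in population-SD units is approximately *t*_crit × √(2/*N*) for two groups of size *N* (the *t* critical value for *df* = 62 is hard-coded below; this is an illustration, not the app’s code):

```python
import math

# Two-tailed 95% t critical value for df = 2N - 2 = 62 (N = 32 per group),
# hard-coded for this sketch since the stdlib has no t inverse CDF.
T_CRIT_975_DF62 = 1.999

def average_moe(n, t_crit):
    """Approximate average MoE, in population-SD (Cohen's d) units,
    for the difference between two independent groups of size n each."""
    return t_crit * math.sqrt(2 / n)

print(round(average_moe(32, T_CRIT_975_DF62), 2))  # 0.5 -- the handy benchmark
```

With *N* = 32, √(2/32) = 0.25, and 0.25 × 2.0 ≈ 0.50: that’s the benchmark.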

The black curve can only give us *N* for MoE that’s sufficiently small **on average**. But we can do better. The **red curve**, below, tells us the *N* we need to achieve target MoE with **assurance of 99%**. This is the *N* that gives MoE smaller than target MoE on at least 99% of occasions. The grey curve reminds us of the ‘on average’ curve–the black curve in the figure above.

**precision for planning** supports PfP for what are probably the two most common designs: **two independent groups** and the **paired design**. The paired design, with a single repeated measure (for example, Pretest-Posttest), has the advantage, where it is possible and appropriate, of usually giving higher precision. The critical feature is the correlation in the population between the two measures, such as Pretest and Posttest. Higher correlation gives a shorter CI on the paired difference and therefore higher precision.

To use PfP we need to specify a value for **ρ** (Greek rho), the **population correlation**. Ideally, previous research gives us a reasonable estimate we can use; otherwise we might have to guess. For research with human participants, typical values are often around .6 to .9.

Here’s a PfP picture for the **Paired Design**, with **ρ set to .70**.

The red curve shows us that a single group of *N* = 21 suffices for target MoE = 0.50 with assurance, when ρ = .70. Compare with two groups of *N* = 44 for the independent groups design. Great news!

However, as you might guess, *N* is highly sensitive to ρ. For ρ = .60 we need *N* = 25, but for ρ = .80 we need only *N* = 16 (or *N* = 9, on average).
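To see why ρ matters so much, here’s a sketch of the ‘on average’ case, assuming average MoE for the paired design is approximately *t*_crit × √(2(1 − ρ)/*N*) in SD units (*t* critical values hard-coded; the assurance calculation in the app is more involved than this):

```python
import math

# Two-tailed 95% t critical values, hard-coded for this sketch.
T_CRIT_975 = {7: 2.365, 8: 2.306}

def paired_moe(n, rho, t_crit):
    """Approximate average MoE, in SD units, for the paired design:
    higher rho shrinks the CI on the paired difference."""
    return t_crit * math.sqrt(2 * (1 - rho) / n)

# For rho = .80, N = 9 is the smallest N whose average MoE is below 0.50:
print(round(paired_moe(9, 0.80, T_CRIT_975[8]), 3))  # about 0.486, under target
print(round(paired_moe(8, 0.80, T_CRIT_975[7]), 3))  # about 0.529, over target
```

That reproduces the ‘*N* = 9, on average’ figure for ρ = .80; achieving target MoE with 99% assurance requires a somewhat larger *N*.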

It’s wonderful that **precision for planning** makes it easy to explore how *N*, target MoE, choice of design, and–for the Paired Design–ρ, all co-vary. Be fully informed before you choose a design and *N*!

Go to **esci web** and see all six components as here:

Search the blog for ‘**Gordon**’ to find three posts introducing the previous five components.

Please explore any and all of the **six components**. Send your bouquets to **Gordon Moore**, and your comments and suggestions to any of us.

Enjoy!

Geoff

As you may recall, **ITNS2** will be accompanied by Bob’s data analysis software, **esci**, in R, and Gordon’s web-based simulations and tools, all of which are based on, and go beyond, my Excel-based **ESCI**. Together the web-based goodies, now including **dance r**, comprise **esci web**.

**dance r** takes random samples from a bivariate population with a chosen *ρ*, and displays the resulting *r* values and their CIs.

Playing yourself is *way* better than seeing the pic. A few things to try:

- Watch the **population cloud** change for different *ρ* values
- Explore the changing length and **asymmetry of CIs** for different *r* values
- Watch the sampling distribution of correlations (the ***r* heap**) build
- See how its **skew** changes with *ρ*
- Investigate the capture percentage of **95% CIs**
- Study what changes, and how fast, as you change ***N***

A key challenge for students–and researchers–is to build good intuitions about the extent of **uncertainty**, including the extent of sampling variability. **dance r** is a great arena in which to build those intuitions about *r*.
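The core of what **dance r** animates can be sketched in a few lines of stdlib Python (an illustration only, not Gordon’s code; the sample size, *ρ*, and number of repetitions below are arbitrary choices):

```python
import random
import statistics

def sample_r(rho, n, rng):
    """Draw one sample of size n from a bivariate normal population with
    correlation rho, and return the sample Pearson r."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = rho * x + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

# Build the 'r heap': 2000 sample r values for rho = .9, N = 20
rng = random.Random(1)
heap = sorted(sample_r(0.9, 20, rng) for _ in range(2000))

# With high rho the heap is negatively skewed: the lower tail is longer
median = heap[len(heap) // 2]
print("lower tail:", round(median - heap[100], 2))
print("upper tail:", round(heap[-101] - median, 2))
```

Run it with different *ρ* and *N* values and you’ll see in numbers what the animation shows in motion: near *ρ* = 0 the heap is roughly symmetric, while near ±1 it’s strongly skewed.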

As I say, we’d love to have your feedback.

Enjoy.

Geoff
