*[Update 7/4/2020 – Added reference to preprint on Cohen’s d for paired designs and put code in an actual code block]*

Lots of research questions boil down to estimating the difference between two means (*M*_{diff} = *M*_{group_of_interest} – *M*_{reference_group}). This is the ‘raw score’ effect size–it reports the difference between groups on the scale on which they were measured. Usually, that’s all you need (along with an estimate of uncertainty). Sometimes, though, it’s nice to also obtain a standardized effect size, one that does not depend on the scale of measurement. In these cases, Cohen’s d is the go-to measure:

Cohen’s *d* = *M*_{diff} / *sd*_{but_which_sd?}

Cohen’s d turns out to be freaking complicated. First, there are issues with how to standardize the mean difference (which sd do you use?). This bumps up against the thorny issue of whether it is reasonable to assume equal variance. Then there’s the fact that Cohen’s d from a sample is slightly upwardly biased, so it needs to be corrected for bias, which causes some people to relabel it as Hedges’ g. And in case that wasn’t confusing enough, there’s an additional issue of how best to estimate the confidence interval of d. There are lots of solutions (some good, some bad), and most stats tools aren’t very clear on which approach they are using. That’s a surprising amount of complexity for what one would have hoped would be an easy standardization of effect size.
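To make those moving parts concrete, here’s a minimal base-R sketch (my own illustration, not esci code) of the most common choice: standardizing by the pooled SD and then applying the approximate bias-correction factor J = 1 – 3/(4·df – 1):

```r
# Sketch (base R, not esci): pooled-SD Cohen's d from summary data,
# plus the approximate Hedges bias correction J = 1 - 3/(4*df - 1).
cohens_d_pooled <- function(m1, m2, s1, s2, n1, n2) {
  sd_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
  d  <- (m2 - m1) / sd_pooled
  df <- n1 + n2 - 2
  j  <- 1 - 3 / (4 * df - 1)   # approximate correction factor
  c(d = d, d_unbiased = j * d)
}

cohens_d_pooled(m1 = 10, m2 = 15, s1 = 2, s2 = 2, n1 = 20, n2 = 20)
# d = 2.5; d_unbiased is slightly smaller, about 2.45
```

With samples of 20 per group the correction is small–but it grows as samples shrink, which is exactly why the bias matters most in small studies.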

In this blog post I am not going to wade through all these complexities. Instead, I will demonstrate three different ways you can easily obtain Cohen’s d and its CI. Each of these approaches is very transparent about the all-important choice of denominator (Lakens, 2013). Each uses the technique of Goulet-Pelletier & Cousineau (2018), which simulation studies suggest is generally the best approach (though perhaps not for paired designs–see the section on “Approaches” at the end for details). In all cases, we are going to assume equality of variance between groups/measures–it turns out that without this assumption the CI on d becomes problematic.

All three of the approaches I’ll explain are based around the esci package for R that I (Bob) am currently working on (https://github.com/rcalinjageman/esci). As of 7/3/2020 this package is a rough draft–I’m now working through it to make the code beautiful (to the extent I can). You can use it as-is with some confidence–but be warned that I am tinkering and may yet make breaking updates to the package. I don’t have much documentation yet (does this page count?), but you can find a basic walkthrough of the package here: https://osf.io/d89xg/wiki/tools:%20esci%20for%20R/

**Method 1 – Use esci in jamovi**

Let’s start with the easiest option: using a GUI. jamovi is a delightful point-and-click program for statistical analysis (https://www.jamovi.org/). It’s free, it’s open source, it runs on any platform (even Chromebooks), and it’s extensible with modules. I’d call it an SPSS replacement, but it is so much better than that. jamovi is built on R, so you can obtain R syntax for everything you do in jamovi (just turn on “Syntax mode”). Seriously, jamovi is great.

The esci package I’ve developed for R runs as a module in jamovi. Just: 1) run jamovi, 2) click the modules button near the top-right corner, 3) access the jamovi library, and 4) scroll down to esci and click install. You’ll now have an esci menu in your jamovi program (and it will stay there until you remove it–you only need to install a module once per machine). There are screen-by-screen instructions here: https://thenewstatistics.com/itns/esci/jesci/

Once done, you can obtain Cohen’s d for both independent and paired designs, and you can do so from raw data or from just summary data. The commands to use are:

- esci -> estimate independent mean difference (the estimation version of an independent t-test), or
- esci -> estimate paired mean difference (the estimation version of a paired t-test)

For example, here I’ve selected “estimate independent mean difference”. In the analysis page that appears I’ve selected the toggle-box for “summary data”. I then typed in the means, standard deviations, and sample sizes for my two groups. In an instant, I get output that includes Cohen’s d and its CI.

Here’s a close-up of the output for Cohen’s d:

*d*_{unbiased} = 0.91, 95% CI [0.30, 1.63]

Note that the standardized effect size is *d*_{unbiased} because the denominator used was *SD*_{pooled}, which had a value of 2.15. The standardized effect size has been corrected for bias. The bias-corrected version of Cohen's d is sometimes also (confusingly) called Hedges' g.

esci explains to you what denominator was used and its value (important) and it clarifies that correction for bias has been applied. One thing missing (for now) is a reference for the approach to obtaining the CI, which really matters… I’ll fix that soon. Maybe there is additional information that would be useful? If so, let me know.

**Method 2 – Obtain Cohen’s d in R from summary data with estimateStandardizedMeanDifference**

Maybe you are an R power user and you just can’t even when it comes to using a GUI for data analysis. No problem. esci is a package in R. It’s not on CRAN (and probably won’t be for some time), but you can obtain it directly from github using the devtools package. Then you can use the function estimateStandardizedMeanDifference. Here’s a detailed code example that includes everything you would need to download and install the package:

```
# Setup -------------------------------------------
# First, make sure required packages are installed.
if (!is.element("devtools", installed.packages()[, 1])) {
  install.packages("devtools", dep = TRUE)
}
if (!is.element("esci", installed.packages()[, 1])) {
  devtools::install_github(repo = "rcalinjageman/esci")
}

# Second, load the required libraries
library("esci")

# Third, get some Cohen's d
# Get d directly from summary data for a two-group design
estimate <- estimateStandardizedMeanDifference(m1 = 10,
                                               m2 = 15,
                                               s1 = 2,
                                               s2 = 2,
                                               n1 = 20,
                                               n2 = 20,
                                               conf.level = .95)
estimate

# Get d directly from summary data for a paired design
estimate <- estimateStandardizedMeanDifference(m1 = 10,
                                               m2 = 15,
                                               s1 = 2,
                                               s2 = 2,
                                               n1 = 20,
                                               n2 = 20,
                                               r = 0.80,
                                               paired = TRUE,
                                               conf.level = .95)
estimate

# Or, use raw data to estimate the raw mean difference with CI *and* d with CI
# This boring example uses mtcars
data <- mtcars
data$am <- as.factor(data$am)
levels(data$am) <- c("automatic", "manual")
estimate <- estimateMeanDifference(data, am, mpg,
                                   paired = FALSE,
                                   var.equal = TRUE,
                                   conf.level = .95,
                                   reference.group = 1)
estimate
plotEstimatedDifference(estimate)
```

Note that for the paired data I passed a flag (paired = TRUE) and *also* an r value–that’s the correlation between the paired measures. If you don’t have it, you can often calculate it from summary data and the t-test results.
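For example, here’s one way to back out r in base R (my own sketch, not an esci function), using the fact that for paired data *sd*_{diff}² = *s*₁² + *s*₂² – 2·r·*s*₁·*s*₂, while the paired t value pins down *sd*_{diff} via t = *M*_{diff} / (*sd*_{diff} / √n):

```r
# Sketch (base R, not esci): recover the paired-measure correlation r
# from summary data plus a reported paired t value.
# sd_diff^2 = s1^2 + s2^2 - 2*r*s1*s2, and t = M_diff / (sd_diff / sqrt(n)).
r_from_paired_t <- function(m1, m2, s1, s2, n, t) {
  sd_diff <- abs(m2 - m1) * sqrt(n) / abs(t)   # recovered SD of differences
  (s1^2 + s2^2 - sd_diff^2) / (2 * s1 * s2)
}

# With the summary values used above and a reported paired t of 17.68,
# we recover r of about 0.80
r_from_paired_t(m1 = 10, m2 = 15, s1 = 2, s2 = 2, n = 20, t = 17.68)
```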

I have omitted the output here because it follows the exact same format as for jamovi above (after all, it’s running the same code under the hood).

**Method 3 – Obtain Cohen’s d and its CI from raw data with estimateMeanDifference**

Finally, let’s obtain Cohen’s d from raw data–and in the process obtain the raw-score mean difference and a nice plot that emphasizes the raw data and the effect size and its uncertainty.

Here’s a very uninspired example using the mtcars dataset–sorry it’s not very exciting, but there aren’t a lot of fun built-in datasets in R. We’ll compare the miles per gallon (mpg) for automatic vs. manual cars. The type of transmission is in the column “am” and it is coded as a numeric 0 (automatic) or 1 (manual). In this example I will make it into a factor (esci requires that a grouping variable be a factor) and relabel it to make the output easier to understand.

Again I’ve made the code complete, including everything needed to ensure esci is installed.

```
# Setup -------------------------------------------
# First, make sure required packages are installed.
if (!is.element("devtools", installed.packages()[, 1])) {
  install.packages("devtools", dep = TRUE)
}
if (!is.element("esci", installed.packages()[, 1])) {
  devtools::install_github(repo = "rcalinjageman/esci")
}

# Second, load the required libraries
library("esci")

# Now make a copy of mtcars and convert am to a labelled factor
data <- mtcars
data$am <- as.factor(data$am)
levels(data$am) <- c("automatic", "manual")

# Prepare yourself for some Cohen's d (and a nice plot)
estimate <- estimateMeanDifference(data, am, mpg,
                                   paired = FALSE,
                                   var.equal = TRUE,
                                   conf.level = .95,
                                   reference.group = 1)
estimate
plotEstimatedDifference(estimate)
```

As you can see above, we use this function by passing the dataframe (data), the grouping variable (am) and the outcome variable (mpg). The reference.group parameter is optional–it specifies which level of your grouping variable should serve as the reference group when calculating the effect size (*M*_{diff} = *M*_{group_of_interest} – *M*_{reference_group}). If you leave this parameter out, esci will use the first level of your grouping variable.

Again, the output for Cohen’s d follows the same format as above, so I’m not going to go through it. But check out the cool plot:

**Approaches**

There are a number of different ways to estimate the CI on Cohen’s d. esci uses the method explained by Goulet-Pelletier & Cousineau (2018). I’ll expand this blog post at some point to explain it, but for now I strongly suggest reading the actual paper–it not only clearly explains the approach but also compares it against many other options, including the ones used in popular R packages (see the appendix)… it turns out not all R packages produce good-quality CIs for d!

I’m deeply indebted to these authors–I was able to adapt the code they provided into esci and they have repeatedly helped answer questions to improve the function.

One big issue, though — a recent preprint I found on ResearchGate suggests that **all** approaches to obtaining a CI for d will fail with paired designs (Fitts, 2020). I’m still digesting this, and waiting to see the peer-reviewed version. But it is probably worth some extra caution with CIs for a paired design–the preprint shows they can have poor capture rates when r is very strong.

**To Read**

- Goulet-Pelletier, J.-C., & Cousineau, D. (2018). A review of effect sizes and their confidence intervals, Part I: The Cohen’s d family. *The Quantitative Methods for Psychology*, *14*(4), 242–265. doi: 10.20982/tqmp.14.4.p242
- Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. *Frontiers in Psychology*, *4*, 863. doi: 10.3389/fpsyg.2013.00863

Materials, including slides, are at https://osf.io/d89xg/

See the first slide for links to the data, and for help (it’s dead easy) to get esci running in jamovi. Also available from our site https://thenewstatistics.com/itns/esci/jesci/

Bob walked us through how **esci** (now in R) makes it easy to analyse data using estimation, for several different measures and designs. Extra data sets allowed us all to do it ourselves, and enjoy the great esci **pictures with CIs**.

He started by explaining why **teaching **this way is more fun, more successful, and leads to happier students. esci will be the cornerstone advance in the second edition of ITNS, which we’re working on right now.

Enjoy,

Geoff

I posted about the project, **repliCATS**, **here**. In short: “…the largest ever empirical study on how scientists reason about other scientists’ work, and what factors makes them trust it.” It’s right at the core of understanding and advancing Open Science.

At workshops around the world, more recently all online, the team has collected a big database of judgments about claims made in reports of research. Would the claim replicate? The team seeks judgments by anyone from undergraduates to seasoned researchers, in any of a wide range of social and behavioural science disciplines.

**Please do consider signing up.** Details are **here**.

**Latest news**: The project has just been expanded to include assessment of claims being made in the social and behavioural sciences about **COVID-19**. Work to start in August. You can sign up for that also.

Back in August Bob posted (**here**) about **eNeuro**‘s great initiative to encourage authors to use **estimation**. The latest eNeuro email update reports how the journal is keeping up the good work. See three paras below:

*eNeuro* **encourages authors to add estimation statistics** to their analyses when appropriate. Below, we will feature papers that did so, along with an author’s response to the query: What value was added to your analysis or perspective through the addition of estimation statistics? For more information read **Estimation for Better Inference in Neuroscience.**

**Circuit-specific dendritic development in the piriform cortex**

Laura Moreno-Velasquez, Hung Lo, Stephen Lenzi, Malte Kaehne, Jörg Breustedt, Dietmar Schmitz, Sten Rüdiger, and Friedrich W. Johenning

*“It was satisfying to see how switching to estimation stats based data display increased the transparency of our data display in a reader-friendly format. By openly displaying the mean effect sizes and confidence intervals, we felt relieved from the pressure to solely rely on p-value based true/false statements about the data. Estimation statistics also makes it obvious to us and our readers where replications and further experiments are most needed in the future.” — Friedrich W. Johenning*

—that’s all great to see. Below are a couple of small parts of the figures that the author refers to:

The lower means and CIs, with plausibility curves, illustrate the estimated differences between the means of the blue dots (L2B cells, whatever they are) and red dots (L2A cells). The author mentioned ‘transparency’; indeed, the pictures do make it easy to appreciate the differences, and the precision with which each was estimated.

I especially love the author’s final sentence:

**Estimation statistics also makes it obvious to us and our readers where replications and further experiments are most needed in the future.**

Science as a cumulative, progressive enterprise–powered by estimation. Yay!

Geoff

Ho, J., Tumkaya, T., Aryal, S. *et al.* Moving beyond *P* values: data analysis with estimation graphics. *Nat Methods*, **16**, 565–566 (2019). https://doi.org/10.1038/s41592-019-0470-3

This post is just to organize links and resources that might be helpful to those who watched the Q&A (thanks! hope it was helpful). Before listing those, I (Bob) should say that it was a great honor to have been invited to be part of an event with a giant of meta-science, Katherine Button, and a brave advocate for better science, *eNeuro* editor Christophe Bernard. It is also heartening for this to be an SFN initiative. I still feel like the organization doesn’t do quite enough on these issues (especially at the annual meetings), but this was another positive step forward.

First, here are the handful of slides I put together for the Q&A session. In these slides I show data from Manouze et al. (2019). You can grab this data file yourself in csv format here; this is just the data on anxiety for the socially isolated mice. Email Christophe for the complete data set.

Second, Christophe mentioned the “Dance of the p values”. These are simulations Geoff created to help researchers explore the realities of sampling variation. These now exist in a variety of formats:

- Here’s the Dance of the p values simulation I used in today’s session. It’s in Excel. You’ll need to enable macros, and it doesn’t run great on all versions of Excel. All credit to Geoff–this is the version that accompanied his book *Understanding the New Statistics*. If you want a guided tour of this sim, check out Geoff’s short video on YouTube: https://www.youtube.com/watch?v=5OL1RqHrZQ8
- Want an online version? Here are two versions of the Dance of the Means (a bit simpler than Dance of the p values):
  - A really sharp-looking one created by Kristoffer Magnusson: https://rpsychologist.com/d3/ci/
  - A new version (still in development) by Gordon Moore: http://212.159.76.205/ESCI-JS/
- Want a version for students? Here’s the Dance of the Means as an Excel spreadsheet with an accompanying set of activities for undergraduate education.

Finally, here’s a list of resources relevant to the questions that were submitted.

**Software for Estimation**:

- For confidence intervals, a great option is our new esci module for jamovi.
  - Download and install a **current** version of jamovi (>1.2.21)–it’s free and open source
  - Open jamovi, and in the modules tab, access the jamovi library. Under “available modules” scroll through until you find esci. Then click “install” to add the module into jamovi.
  - A new esci menu will appear that allows you to generate estimates for most common study designs. You can bring raw data into jamovi, or you can often just type in summary data (useful if you’d like to get an estimate from a published paper)
  - More details are here, tutorials and examples are in the works, and **please feel free to send feedback and/or feature requests.**
- For bootstrapped intervals, DABEST is fantastic. It is available as an R package, a Python package, and as an easy-to-use web app.
- For Bayesian estimation:
  - The best way to get started is to buy Kruschke’s *Doing Bayesian Data Analysis*. It is remarkably clear and has extensive examples.
  - The BEST package in R is great. In addition to R, you’ll need to install JAGS (but this is pretty easy). This tutorial is especially helpful. There is also a package in development called Bayesian First Aid which makes BEST easier to apply to a bunch of common designs–it doesn’t seem to have been updated for a while, but I find it still useful.
  - JASP can help you get some Bayesian estimates and is very easy to use. You can’t use diffuse priors—which I think the developers would tell you is a feature rather than a bug.

**Alternatives to planning for Power – a few sources to get started**

- Planning for precision
  - Rothman, K. J., & Greenland, S. (2018). Planning Study Size Based on Precision Rather Than Power. *Epidemiology*, *29*(5), 599–603. https://doi.org/10.1097/EDE.0000000000000876
- Planning for evidence
  - Schönbrodt, F. D., & Wagenmakers, E.-J. (2017). Bayes factor design analysis: Planning for compelling evidence. *Psychonomic Bulletin & Review*, 1–16. https://doi.org/10.3758/s13423-017-1230-y
- And stay tuned as I (Bob) will hopefully soon have an online course out with lots of resources on sample-size determination.

**Getting the lay of the land in statistical inference:**

- Frequentist Estimation
  - Cumming, G., & Calin-Jageman, R. J. (2017). *Introduction to the new statistics: Estimation, open science, and beyond*. New York: Routledge.
- Bootstrap Estimation
  - Hesterberg, T. C. (2015). What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum. *American Statistician*, *69*(4), 371–386. https://doi.org/10.1080/00031305.2015.1089789
- Bayesian Estimation
  - Kruschke, J. K., & Liddell, T. M. (2018). The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. *Psychonomic Bulletin & Review*, *25*(1), 178–206. https://doi.org/10.3758/s13423-016-1221-4
- When testing, do better with inference by interval
  - Lakens, D. (2017). Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses. *Social Psychological and Personality Science*, *8*(4), 355–362. https://doi.org/10.1177/1948550617697177

For this site, the main page with details on the modules is here: https://thenewstatistics.com/itns/esci/jesci/

——– Post for the jamovi blog—————

Today there’s a new module available in jamovi: **esci** (effect sizes and confidence intervals), developed by Bob Calin-Jageman and Geoff Cumming (@TheNewStats and TheNewStatistics.com). As a newer module you will need a recent version of jamovi to install esci–probably 1.2.19 or above. You can refresh your install of jamovi here.

esci provides an easy step into estimation statistics (aka the “new statistics”), an approach that emphasizes effect sizes, interval estimates, and meta-analysis. esci can provide estimates and confidence intervals for most of the analyses you would learn in an undergraduate statistics course **and** meta-analysis (which really should be part of a good undergraduate statistics course). Most analyses can be run from raw data or from summary data (enabling you to generate estimates from journal articles that only reported hypothesis tests). All analyses generate nice visualizations that emphasize effect sizes and uncertainty. esci is for everyone, but was developed especially with students in mind–it provides step-by-step instructions, clear feedback, and tries to prevent rookie mistakes (like calculating a mean on a nominal variable).

**What is estimation statistics**?

Inferential statistics has two major traditions: testing and estimation. The testing approach is focused on decision-making. In this approach we propose a null hypothesis, collect data, generate a test-statistic and p-value measuring the degree to which the null hypothesis is compatible with the data, and then make a decision about the hypothesis. For example, we might test the null hypothesis that a drug has exactly 0 effect on depression. We collect data from those randomly assigned to take the drug or placebo. We run a t-test comparing these groups and find *p *= .01. We then make a decision: because *p *< .05 we reject the null hypothesis, deciding that an effect of exactly 0 is not compatible with the data. Huzzah.

The testing approach has its uses, but note two important issues that have not been addressed: 1) *How much does the drug work?* and 2) *How wrong might we be?* That’s where estimation comes in. From the same data and assumptions that underlie the testing approach we can generate an *estimate* and a *confidence interval*. So, for example, we might find that the drug improved depression by 10% with a 95% CI of [1%, 19%]. This is some very useful information. It tells us how well the drug worked in this one study (10% benefit). It also gives us an expression of uncertainty about this estimate. Specifically, the CI gives the entire range of benefits that are compatible with the data collected— benefits around 1% are compatible and so are benefits around 19%.

Focusing on estimates can be really helpful:

- It helps us weigh practical significance rather than just statistical significance.
- It helps us calibrate our conclusions to the uncertainty of the study
- It fosters meta-analytic thinking, where we combine data from multiple studies to refine our estimates (like the poll aggregators on fivethirtyeight.com)
- It calibrates expectations for replications
- It helps us think critically about optimizing procedures to maximize effect sizes and minimize noise
- And much more

Estimates and tests are linked. A null hypothesis is rejected at the alpha = .05 level if it is outside a 95% CI and not rejected if it is inside. To put it a different way, a 95% CI is all the null hypotheses you would not reject at alpha .05 (and a 99% CI all those for alpha .01, etc.). This means that if you have an estimate, you can still conduct a test–in fact you can test *any *null hypothesis just by checking for it in the CI. The converse is not true, though: knowing that a test is statistically significant does not easily let you know the magnitude of the effect or the uncertainty around it. So when you focus on estimation you gain some benefits, but you don’t lose anything. That makes it rather bizarre that some fields have come to use only testing. esci is part of an effort to change this around, and to make estimation the typical or default approach to inference.
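A quick base-R sketch of this duality (my own illustration, not esci code): a null value outside the 95% CI is exactly a null a t-test would reject at alpha = .05, and a value inside the CI is one it would not reject.

```r
# Sketch of the CI/test duality: values outside the 95% CI are
# exactly the nulls rejected at alpha = .05 (and vice versa).
set.seed(1)
x  <- rnorm(30, mean = 10, sd = 2)            # simulated sample
ci <- t.test(x, conf.level = .95)$conf.int

# A null value just outside the CI is rejected...
p_out <- t.test(x, mu = ci[2] + 0.1)$p.value  # p < .05
# ...while a null value inside the CI is not.
p_in  <- t.test(x, mu = mean(ci))$p.value     # p > .05
c(p_out = p_out, p_in = p_in)
```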

Want to know more about estimation? Here are some sources:

- Undergraduate textbook: Cumming, G., & Calin-Jageman, R. J. (2017). *Introduction to the new statistics: Estimation, open science, and beyond*. New York: Routledge. On Amazon.
- Calin-Jageman, R. J., & Cumming, G. (2019). The New Statistics for Better Science: Ask How Much, How Uncertain, and What Else Is Known. *The American Statistician*, *73*(sup1), 271–280. https://doi.org/10.1080/00031305.2018.1518266

**An example with esci**

Let’s use esci to re-analyze data from a famous paper about the “trust drug” oxytocin. Oxytocin is a neurohormone best known for its role in human reproduction. But in 2005, Kosfeld et al. followed up on some interesting work in rodents to examine if oxytocin might influence trust in humans. The researchers randomly assigned participants to receive oxytocin (squirted up the nose) or placebo (also squirted up the nose) before playing an investment game that depended on trusting an anonymous partner. The average amount invested by each participant was used as a measure of trust. Kosfeld et al. found that oxytocin produced a statistically significant increase in trust* (*t*(56) = 1.82, *p* = .037, one-tailed).

That sounds pretty convincing, right? It must have been, as the paper was published in *Nature *and has now been cited over 4,000 times. Right from the start, citations made the effect seem established and unequivocal. But how much did oxytocin improve trust and how wrong might this study be?

Let’s take a look. The original data is available in .csv format here**. Opening it in jamovi you can conduct a standard t-test to confirm that the difference is statistically significant (for a directional test). Now let’s generate the estimate and CI in esci using “Estimate Independent Mean Difference”.

In the analysis options, we’ll enter Trust as the dependent variable and Condition as the grouping variable (placebo was coded as a 0; oxytocin as a 1). We’ll also set the confidence level to 90% to match the stringency of a directional test.

Our output emphasizes the *effect size*, which in this case is the difference in means, and reports this as both a raw difference (with a CI) and as a standardized difference (also with a CI):

esci also generates a *difference plot*. This shows the oxytocin data (all participants and the group mean with CI) and the placebo data (all participants and the group mean with CI). Most importantly, the graph emphasizes the *difference *between them: we draw a line from the placebo group, considering that our benchmark, and then we measure the space between the groups, marking the difference (delta) with a triangle on a right-side axis anchored at 0 to the placebo group. It sounds a bit complicated to write it out, but just take a look.

The graph shows that the difference in trust was fairly large–a $1.41 increase in investment in a context where a typical investment was $8-9. The change, though, is highly uncertain, with a 95% CI that runs from $0.11 up to $2.71. This means the data is compatible with a very large range of effect sizes–from the vanishingly small to the dazzlingly large. In other words, this study doesn’t really tell us much about how much oxytocin might influence trust. Perhaps not 0, but basically almost any other positive effect size is on the table, including ones (around $0.11) that would be very difficult to replicate.

Looked at with these eyes, it might not surprise you much to find out that the benefit of oxytocin in human trust has *not *replicated well–and that the consensus is that oxytocin probably does not have a *practically significant *effect on trust. Unfortunately this was not obvious to researchers wedded to the testing approach, and so much faith was put in these results that clinical trials were launched to try to use oxytocin as a therapy for social processing deficits (such as with autism-spectrum disorder). None of these clinical trials have shown much benefit, but they’ve cost a ton and produced a decent handful of (thankfully mild) adverse reactions. If you’re curious about the way the oxytocin story imploded at great cost and hardship, check out the article here.

This is just one example of how you can gain important insight into your data by using estimation thinking in place of or as a supplement to testing. esci should make it easy to get started with this approach.

* — In the original study the researchers didn’t actually use a t-test; they compared median trust using a non-parametric test. This nuance doesn’t alter the patterns in the data presented in this post.

** — This data was extracted by Bob and Geoff from a figure in Kosfeld et al. (2005). The OSF page where it is posted has all the details.

**I’m used to running this test… what would I use in esci?**

Glad you asked. Here’s how the most common statistical tests map on to the estimates generated by esci:

| Traditional hypothesis test | esci in jamovi command |
| --- | --- |
| One-sample t-test | Estimate Mean |
| Independent samples t-test | Estimate Independent Mean Difference |
| Paired samples t-test | Estimate Paired Mean Difference |
| One-Way ANOVA | Estimate Ind. Group Contrasts |
| 2×2 ANOVA | Estimate Ind. 2×2 |
| 2×2 Chi Squared | Estimate Proportion Difference |
| Correlation test | Estimate Correlation |
| Correlation test with categorical moderator | Estimate Correlation Difference |

**This module would be better if…**

The esci module is still in alpha. Geoff and Bob have made this initial release to help gather feedback as they continue to work on the module in conjunction with a new edition of their statistics textbook. They welcome your feedback, feature requests, and/or bug reports. Please especially consider esci through the eyes of your students:

- What other analyses would you like to see?
- Anything in the output that is hard to understand? That should be labelled better? That should be added or could be removed?
- Would it be helpful to add the option to see all assumptions for an analysis? Should we provide more guidance on interpreting output?
- Any options missing from analyses?

The best way to provide feedback would be on the github page for this module, which is here: https://github.com/rcalinjageman/esci. If that’s a hassle, then by all means just email Bob directly or tweet at them @TheNewStats

**Haven’t I heard of this before? **

Yes – Geoff Cumming has been developing versions of esci for some time. The original versions were designed as worksheets in Excel. And in addition to analyses, the older version of esci has some great simulations and sample-size planning tools. You can still check these out here: https://thenewstatistics.com/itns/esci/


A common argument given in defence of NHST is that (1) in the real world we need decisions, (2) NHST and *p*<.05 is a way to make such clear decisions that is (3) based on the evidence and (4) objective. Yes, 1 and 2 are true, but 3 is only partly true, and 4 may be true if all details of the data analysis and decision procedure are preregistered. However, a *p* value reflects *N* as much as anything, and, most importantly, other vital factors need to be considered in making a real-world decision, beyond the current data. Decisions need to be informed by data, of course, preferably via confidence intervals and meta-analysis. But they also should reflect consideration of alternatives, costs and benefits, the values of relevant parties, and so on. ‘Conclusions’ above acknowledges all this. Hooray for JAMA!

That’s all fine, but the title and main consideration of the viewpoint are surprising: *What about non-statistically significant results?* Given the recommended approach to decision making, the viewpoint argues that there may even be cases in which such a non-sig result might justify a change in clinical practice. Phew–it seems a very forced argument to me: Take some very weak evidence and dream up a case in which that might just tip the balance and lead us to change practice? Perhaps, but …

The examples discussed, however, emphasise the role of prior evidence (meta-analysis is not mentioned explicitly) and of costs and benefits. In addition, the extent of uncertainty, even if a clinical decision has to be made, needs emphasis. So that’s all good.

The article includes **this link** to an interview by **Howard Bauchner**, JAMA’s Editor in Chief, with **Paul Young**, author of the viewpoint. OK, it’s quicker to skim the article than to listen to the interview, but the interview makes clear that Bauchner takes seriously the need to move on from p<.05. That, for me, is the main point. Hooray for JAMA!

Geoff

P.S. Many thanks to Anoop Balachandran for the heads up.


And yet….

And yet mega studies can’t do any good if we simply carry forward poor statistical practices to a larger scale. I’m looking at you, point null hypotheses. **With large Ns, testing against a null hypothesis of exactly 0 makes almost exactly 0 sense. It’s no test at all.**

Why? First, because with larger sample sizes power increases to detect small effects, including effects that are only trivially different from 0. For example, Psych Science Accelerator is about to launch a new mega-study on Covid-19 that will probably recruit over 20,000 participants. One of the included studies is a simple two-group design. That means the nominal power for the study is reasonable (>80%) for effect sizes all the way down to *d* = 0.04, and still at 30% for *d* = .02!
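As a rough check on those numbers, here’s a normal-approximation power sketch in Python. It assumes two equal groups of 10,000 (i.e. ~20,000 total) and a two-sided α of .05; the exact group sizes are my assumption, not a detail from the study.

```python
from math import sqrt
from scipy.stats import norm

def two_group_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test of Cohen's d,
    using the normal approximation (accurate for large n)."""
    ncp = d * sqrt(n_per_group / 2)        # noncentrality, equal group sizes
    z_crit = norm.ppf(1 - alpha / 2)
    # Probability the test statistic lands beyond either critical value
    return norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)

print(round(two_group_power(0.04, 10_000), 2))  # ~0.81
print(round(two_group_power(0.02, 10_000), 2))  # ~0.29
```

With 10,000 per group, power is about 81% at *d* = 0.04 and about 29% at *d* = 0.02, in line with the figures above.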

A second problem is that model mis-specification becomes an increasingly big issue with large sample sizes–even slight deviations from the assumptions of the statistical test could produce spurious statistical significance.

Taken together, this means that statistical tests against a point null are overwhelmed by sample size–with real-world data they will emit statistical significance at an unacceptably high rate. Tests against a point null simply do not provide a stringent test in a mega-study (usually not in small studies either, but that’s a different blog post). “Predicting” a non-zero effect in a mega-study is about as impressive as calling a single coin flip.

What’s the solution? Bayes factors? No–these can also be overwhelmed by sample size (at least as implemented by default in JASP). The solution (regardless of statistical philosophy) is to make more meaningful predictions. Specifically, one should be able to predict a range of plausible/meaningful effect sizes (aka the smallest effect size of interest). One can then check if the effect size observed is within this more stringent range of predictions. In practice, this is often done in reverse–by specifying the range of non-meaningful effect sizes (aka region of practical equivalence or ROPE) and then checking if the effect size observed is clearly outside of this range of predictions (see Figure 1). Either way, this approach (often called *inference by interval*) not only provides stringency to the test, it also rounds out the test, providing not only a path to support the hypothesis (the data are only compatible with meaningful effects), but also a path to refute the hypothesis (the data are only compatible with non-meaningful effects), and a path to an indeterminate result (the data are compatible with both meaningful and non-meaningful effects).

What if you don’t have a clear range of effect size predictions or can’t easily specify the smallest effect size of interest? Then it’s probably a bit premature to invest mega-study resources into the research question.

How do you implement inference by interval? With estimation, it’s simple. You specify your smallest effect size of interest (thereby defining your interval null) and then plot that against your observed effect size and CI (this can be a confidence interval or a Bayesian credible interval). If the whole CI is outside of the interval null, you have clear evidence of a predicted meaningful effect. If the whole CI is inside the interval null, you have clear evidence the effect is *not* meaningful. If the CI includes both regions, the test is indeterminate. There are some nuances here about setting the CI width based on your desired error level, but the basic idea is pretty simple.
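That decision rule can be sketched in a few lines of Python. The ROPE bounds here are arbitrary illustrations, not recommendations–the smallest effect size of interest has to come from your own research question.

```python
def interval_inference(ci_low, ci_high, rope=(-0.1, 0.1)):
    """Classify a result by comparing its CI to an interval null (ROPE).
    The default ROPE of (-0.1, 0.1) is purely illustrative."""
    rope_low, rope_high = rope
    if ci_low > rope_high or ci_high < rope_low:
        return "meaningful effect"   # whole CI outside the ROPE
    if rope_low <= ci_low and ci_high <= rope_high:
        return "not meaningful"      # whole CI inside the ROPE
    return "indeterminate"           # CI spans both regions

print(interval_inference(0.15, 0.40))   # meaningful effect
print(interval_inference(-0.05, 0.08))  # not meaningful
print(interval_inference(-0.02, 0.20))  # indeterminate
```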

If you want *p* values, you can get them–in fact, you can get 4 of them. First you conduct a *minimal-effect* test to determine if the observed effect is statistically significantly different from the interval null. This generates two *p* values, and the overall test is significant if either *p* value is lower than the selected alpha level. Then you conduct an *equivalence* test to see if your effect is fully inside the interval null. This also generates two *p* values. You take the maximum of these and compare it to the selected alpha level. To my mind, this is all a bit complicated relative to just seeing the result as a CI, but if you really need all those *p* values, go for it. Lakens has tutorials and the original source is here: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1954.tb00169.x
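Here is a sketch of those four *p* values for an estimate with standard error `se` and an interval null of (−bound, +bound), using a large-sample normal approximation. The function name and numbers are mine, for illustration only; for real analyses use dedicated tooling such as Lakens’ TOSTER.

```python
from scipy.stats import norm

def interval_null_pvalues(est, se, bound):
    """One-sided p values for an interval null (-bound, +bound),
    via a large-sample normal approximation (illustrative sketch).
    Minimal-effect test: significant if either p in p_me < alpha.
    Equivalence (TOST): significant if max(p_eq) < alpha."""
    z_lo = (est + bound) / se   # distance from the lower edge, in SEs
    z_hi = (est - bound) / se   # distance from the upper edge, in SEs
    p_me = (norm.cdf(z_lo), norm.sf(z_hi))  # effect < -bound; effect > +bound
    p_eq = (norm.sf(z_lo), norm.cdf(z_hi))  # effect > -bound; effect < +bound
    return p_me, p_eq

# Estimate far above the interval null: minimal effect, not equivalence
p_me, p_eq = interval_null_pvalues(est=0.5, se=0.1, bound=0.1)
print(min(p_me) < 0.05, max(p_eq) < 0.05)  # True False
```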

- Two studies provide yet more evidence that researchers make **poorer interpretations** when they focus on **statistical significance testing** and its .05 threshold. **Statisticians** make similar errors, although in some situations to a lesser extent than do other researchers.
- Both **statisticians** and **epidemiologists** show, in some situations, clear evidence of a **steep cliff** between *p* values of .025 and .075. Researchers in psychology and economics show evidence of a large cliff between *p* values of .01 and .26.
- Students without **statistical training** in some situations make better judgments than do students trained in statistical significance testing.

Folks have traditionally rejected H0 or not, based on a hard *p* = .05 criterion. Does this mean a sharp **cliff** in ratings of strength of evidence, or confidence that an effect exists, from .04 to .06? Or from just outside to just inside a 95% CI? In **this recent post** I discussed the original work from 1963 by Rosenthal and Gaito that suggested a somewhat steep cliff, and also more recent work. I concluded that there are lots of individual differences in how researchers rate the different values of *p* (or the CI equivalents) and that average results typically show only a **small-to-moderate cliff** effect.

**Blake McShane** (see **this site** of his for **links** to relevant pdfs) has kindly let me know of two articles of his, with David Gal, that report two studies providing evidence relevant to cliff. They studied *p* values, but not CI figures.

From the **abstract**:

… We also present new data showing, perhaps surprisingly, that researchers who are primarily statisticians are also prone to misuse and misinterpret *p* values thus resulting in similar errors. In particular, we show that statisticians tend to interpret evidence dichotomously based on whether or not a *p* value crosses the conventional 0.05 threshold for statistical significance. …

Additional results are reported in the following article, which has a particularly nice title, imho:

McShane, B.B., and Gal, D. (2016), “**Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence**.” *Management Science*,* *62(6), 1707-1718. doi: 10.1287/mnsc.2015.2212.

Here’s my brief take on the two studies:

Participants were (i) authors published in *JASA*, a top statistical journal; (ii) authors published in *NEJM*, a top medical journal; and (iii) members of the editorial board of *Psychological Science*. They saw the results of a fictitious two-group study and were asked about the difference between the means of the two particular groups of participants. They saw two versions of the results, identical except for the *p* value. In response to the *p* = .01 version, around 85% of all groups responded appropriately. In response to the *p* = .27 version, only around 48% of the statisticians and 17% of the medical and psychological researchers gave an appropriate response. The respondents’ explanations for their responses indicated that inappropriate use of statistical significance testing, especially when *p* = .27, was important in guiding very many of the responses.

In the **Supplementary Materials** to the ‘Blinding us to the obvious’ article, results are reported of a version of the study that compared responses from groups of undergraduates with and without statistical training. Those **with** such training showed the familiar large difference between .01 and .27 responses; those **without** statistical training gave the same rate of appropriate responses (around 75%) for the two values of *p*. Hence the title: statistical training is sufficiently tainted by a focus on statistical significance testing to **worsen** judgments in some situations.

Overall these findings add to the large body of evidence that even accomplished researchers are often led astray by reliance on statistical significance testing. I guess statisticians might find a hint of comfort that their results were **not as terrible** as those of the medicos or psychologists. The possibility of cliff was not a particular focus of this study. Yes, there’s a big difference between .01 and .27, but we don’t know whether that’s because of a cliff at around .05, or a more gradual change between those two values.

This study was similar but asked a specifically inferential question, and varied the *p* values between rather than within respondents. The *p* values used were .025, .075, .125, .175. Respondents were **statisticians** and **epidemiologists**. Two questions were asked, the **judgment** question being “A person drawn randomly from the same population as the subjects in the study is *more likely* to recover from the disease if given Drug A than if given Drug B”. Other response options included

The second type of question was a **choice**: “…if you were a patient from the same population as the subjects in the study, what drug would you prefer to take to maximize your chance of recovery?” Response options were: A, B, indifferent between A and B. The results were more equivocal than for the judgment question. For **choice**, the statisticians gave more appropriate responses, with small and graduated change with *p* value, whereas epidemiologists gave results suggesting a **moderate cliff** between .025 and .075. Researchers in psychology and economics showed evidence of a moderate cliff between .01 and .26.

Invited responses from distinguished statisticians were published immediately after the article; then came a rejoinder by McShane and Gal. Interesting reading! A couple of quotes:

From **Donald Berry**, referring to *p* values and how they are used: “**We created a monster. And we keep feeding it, hoping that it will stop doing bad things. It is a forlorn hope. No cage can ****confine this monster. The only reasonable route forward is to kill it.**” (p. 896)

From **William M. Briggs**: “**There are no good reasons nor good ways to use p values. They should be retired forthwith.**” (p. 897)

Did anyone mention CIs?

Geoff

P.S. A big thankyou to Blake McShane for comments on a draft of this post.

- Curves picture how **likelihood** varies across and beyond a CI. Which is better: one curve (**plausibility picture**) or two (**cat’s eye**)? Which should we use in ITNS2?
- Curves can discourage **dichotomous decision making** based on a belief that there’s a **cliff** in strength of evidence at each limit of a 95% CI, i.e. at *p* = .05.
- **Explanation** and **familiarity** are probably needed for curves on a CI to encourage **estimation**, rather than mere dichotomous interpretation.

A **CI** is most likely to land so that some point near the centre is at the unknown but fixed μ we wish to estimate. Less likely is that a point towards a limit is at μ. Of course there’s a 5% chance that a 95% CI lands so that μ is outside the interval. This pattern of **relative likelihood** is illustrated in Figure 1.2 from **ITNS**:

The curve illustrates the relative **plausibility** that various values along the axis are μ. The higher the curve, the better the bet that μ lies here. Keep in mind that our interval is one from the dance and that it’s the *interval* that varies over replication, while μ is assumed fixed but unknown. This single curve on a CI is the **plausibility picture**.

In **this paper** back in 2007 I played around with the black and white images at left, among others, as ways to picture how plausibility varies over and beyond the interval. The black bulge became **the cat’s eye picture** of a CI, as illustrated by the blue images in Figure 5.9 (below) from **ITNS**.

The 95% interval, in Figure 1.2 (above, at the top) and the middle of the blue figures, extends to include 95% of the area under the curve, or between the two curves. Similarly for the 99% and 80% CIs.
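One way to see why the interval captures 95% of the area: for a simple z-based CI, the plausibility curve can be sketched as a normal density centred on the point estimate with SD equal to the standard error. The numbers below are illustrative, and this is only a sketch of the idea, not how the ITNS figures were produced.

```python
from scipy.stats import norm

# Illustrative point estimate and standard error
m, se = 10.0, 2.0
z95 = norm.ppf(0.975)                       # ~1.96
ci = (m - z95 * se, m + z95 * se)           # the 95% CI

# Area under the plausibility curve between the CI limits
# is, by construction, 95%; likewise 99% and 80% CIs enclose
# 99% and 80% of the area.
area = norm.cdf(ci[1], m, se) - norm.cdf(ci[0], m, se)
print(round(area, 3))  # 0.95
```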

I don’t say that every graph with CIs needs to picture the cat’s eye, but do suggest that students and researchers would benefit from familiarity with the idea of plausibility changing smoothly across and beyond a CI. See any CI and, in your mind’s eye, see the cat’s eye bulge.

To what extent do researchers and students appreciate that pattern of variation across a CI? Pav Kalinowski and Jerry Lai, who worked with me years ago, investigated this question. This blog post (**The Beautiful Face of a Confidence Interval: The Cat’s Eye Picture**) describes their findings, with a link to the published results. In short, people’s intuitions are mostly inaccurate and highly diverse, but a bit of training and familiarity with the cat’s eye is encouragingly effective in improving their intuitions about the variation in plausibility. These results prompted use of the cat’s eye in **UTNS** (pp. 95-102) and ITNS.

More recently, the single curve, rather than the two mirror-image curves, has found favour: the **plausibility picture**, rather than the **cat’s eye**. Below is an example that Bob made in R, which appears in **our eNeuro paper** (Bob’s blog post is **here**). The CIs are 90% CIs.

The **plausibility picture** is shown here only on the CI on the difference, not on the two CIs to the left. It may be better to restrict the shading under the curve to the extent of the CI, and perhaps the curve could be half the height, so as not to be so visually dominant. Perhaps. Below is a variation on the same idea, from **this preprint** that Bob recently tweeted about.

The lower figure plots the mean and 95% CI, with plausibility picture, for the differences between the three rightmost conditions and the WT-EGFP condition at left. The **dabestR** package was used to make the figure. (The authors generously describe such a figure as a ‘**Cumming estimation plot**’. I’m happy if UTNS and ITNS have popularised the use of a **difference axis** to picture a difference with its CI, but I later discovered that the idea goes back a while. The earliest examples I know of are in this **1986 BMJ article by Martin Gardner and Doug Altman**, which includes two figures showing a difference with its CI on a difference axis, without any curve on that CI.) Let us know of any earlier examples.

Yes, **plausibility picture** or **cat’s eye**? I don’t know of any empirical study of which is more readily understood, or more effective in carrying the message of variation over the extent of a CI and beyond. There’s probably not much in it, so it comes down to a matter of taste. I’m sentimentally attached to the cat’s eye, but admit that the single curve is more visually parsimonious. Simplest would be a single fine line depicting the curve, with no shading. It would need to extend beyond both ends of the CI, but perhaps not by much. Perhaps such a curve is as good as anything. It would be great to have some evidence relevant to these questions. Meanwhile, **I’d love to hear your views.**

If a 95% CI is used merely to note whether or not the interval includes the null hypothesised value, we’re throwing away much of the information it offers and descending to mere dichotomous decision making. Undermining such ideas was one of my main motivations for playing with curves on a CI. In fact, **no sharp change** occurs exactly at a limit of a CI, as curves should make clear. Dropping the little crossbars at the end of the CI graphic (UTNS included crossbars, ITNS does not) was another attempt to de-emphasise CI limits.

To what extent do people think that a result falling just below *p*=.05 rather than just above it makes a difference? What about just inside or just outside a CI? Back in 1963 Rosenthal and Gaito, in **this article** (image below), asked psychology faculty and graduate students about their degree of confidence in a result, on a scale from 0 to 5, for various different *p* values. They identified a relatively steep drop in confidence either side of .05, and described this result as a **cliff**.

Here are their averaged results, **degree of confidence** plotted against ** p value**:

Yes, the steepest part of the curves looks to be from .05 to .10, and results for the graduate students (top 3 curves) show a kink at .05, but the **cliff** is hardly precipitous. If we were not so indoctrinated about .05, perhaps we’d see these curves as suggesting a relatively steady drop, rather than a sudden cliff. I wonder whether any tendency towards cliff has increased since 1963?

**Jerry Lai** conducted an online version of this study, with published researchers from **psychology** and **medicine** as respondents. One version of his task asked about *p* values, another about **CIs**.

Jerry’s chosen title for his article was: **Dichotomous Thinking: A Problem Beyond NHST**. In other words, CIs can easily be used merely to carry out NHST. A 2010 article from my group titled **Confidence intervals permit, but do not guarantee, better inference than statistical significance testing** reported evidence that researchers in **psychology**, **behavioural neuroscience**, and **medicine** tended to make much better interpretations of results shown with CIs if they avoided thinking about the CIs in terms of NHST.

Jerry wondered how the curves for R & G’s **individual participants** may have varied from the average curves in the figure above. He wrote a very polite letter to **Prof Rosenthal**. By return of post came a charming and encouraging note to Jerry, enclosing several photocopied sheets of handwritten notes, which neatly set out full details of the experiment and the data for individuals. Yes, there was quite a diversity of curve shapes. Some 50 years on, the original data were still available! Bob Rosenthal was putting to shame many subsequent researchers who could not maintain data beyond the life of a particular computer and/or were not willing to share it with other researchers.

I was delighted to see this **preprint**:

Bob tweeted about it a couple of weeks ago. It reports the results of online **statistical cognition** surveys. A blog post of ours a year ago, **here**, invited participation.

The authors asked participants to rate their confidence that an effect was non-zero, given CI figures corresponding to *p* values ranging from .001 through .04, .05, and .06, and up to .8. The figures included the **standard 95% CI graphic**, with little crossbars at the ends, and a **violin plot** as at left. The researchers found that they needed to give some explanation of the violin plot, especially considering that a violin plot usually represents the spread of data points, rather than a CI: it’s usually a descriptive rather than an inferential picture, as here. I suspect that would have been clearer if the violin plot had included the standard CI graphic–as the cat’s eye and plausibility picture do.

Overall, there was a small-to-moderate cliff effect between the .04 and .06 figures. The cliff was rather **smaller** for the violin plot than for the standard CI graphic.

- The studies mentioned above don’t give us strong or definitive conclusions; we need **replications**.
- Curves probably help **CI interpretation**, especially by discouraging mere dichotomous decision making.
- The **plausibility picture**, perhaps without shading, may be the simplest and most parsimonious choice.
- Some training and **familiarity** with any picture that includes one or more curves may be needed for full effectiveness.
- There’s lots of scope for valuable **empirical studies**, perhaps especially of the **plausibility picture**.

A final question: ‘**plausibility picture**‘, ‘**plausibility curve**‘, ‘**likelihood curve**‘, ‘**relative likelihood curve**‘, or what? What’s your preference and why?

I’d love to have **comments** on these issues, and, especially, suggestions for our **CI strategies **in **ITNS2**, the second edition we’re currently working on.

Geoff
