**Precision for Planning** tells us what *N* we need to achieve the precision we’d like. It’s a much better way to plan than the traditional use of statistical power, which works only within an NHST framework. Far better to adopt an **estimation framework** (the new statistics) and use PfP.

For an intro to PfP, see Chapter 10 in **ITNS**. For more detail, see Chapter 13 in **UTNS**.

For a **two independent groups study**, with two groups of size *N*, below is the PfP picture. Recall that **MoE** is the **margin of error**, which is half the length of a CI. I’ve set the slider at the bottom to **target MoE** = 0.50, meaning that I want to estimate the difference between the group means with a 95% CI having MoE of 0.50. In other words, each arm of the CI should be 0.50 in length.

The lower axis is marked in units of population SD, which we can think of as units of **Cohen’s d**. The cursor marks a target MoE of 0.50 in those units.

The **black curve** shows how required *N* increases dramatically as we aim for smaller values of MoE–in other words, greater precision and a shorter CI. Use this curve to investigate how *N* trades off against likely precision.

The small curve at the bottom shows how MoE varies for *N* = 32. It’s usually close to 0.50, but can be as short as 0.40 or as long as 0.60, and occasionally even a little outside that range. Use the large slider to move the cursor and see the **MoE distribution** for other values of target MoE and *N*.

The figure gives us a **handy benchmark**, worth remembering: Any study with two independent groups of size 32 will estimate the difference between the group means with a 95% CI that has MoE of 0.50, on average.
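That benchmark is easy to check from the underlying arithmetic: in units of population SD, the average MoE for the difference between two independent group means is approximately the critical *t* value times √(2/*N*). Here’s a quick sketch in Python (my own illustration, not esci code; it approximates the average sample SD by the population SD):

```python
from scipy.stats import t

def expected_moe_two_groups(n, conf=0.95):
    """Approximate average MoE, in units of population SD (Cohen's d
    units), of the CI on the difference between two independent group
    means, each group of size n. The average sample SD is approximated
    by the population SD."""
    df = 2 * n - 2
    t_crit = t.ppf(1 - (1 - conf) / 2, df)
    return t_crit * (2 / n) ** 0.5

print(round(expected_moe_two_groups(32), 2))  # the benchmark: 0.5
```

With *N* = 32 per group the result is about 0.50, the benchmark above.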

The black curve can only give us *N* for MoE that’s sufficiently small **on average**. But we can do better. The **red curve**, below, tells us the *N* we need to achieve target MoE with **assurance of 99%**. This is the *N* that gives MoE smaller than target MoE on at least 99% of occasions. The grey curve reminds us of the ‘on average’ curve–the black curve in the figure above.
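Assurance can be checked by brute-force simulation under the usual normal model. The sketch below (mine, not the esci algorithm) draws many pooled sample SDs and counts how often MoE lands at or below target; with two groups of *N* = 44 the assurance comes out at about 99%:

```python
import numpy as np
from scipy.stats import t

def assurance_two_groups(n, target_moe=0.50, conf=0.95, reps=100_000, seed=1):
    """Probability that the CI on the difference between two independent
    group means (groups of size n, MoE in population-SD units) has MoE
    no larger than target_moe. The pooled sample variance, in those
    units, is distributed as chi-squared(df) / df."""
    rng = np.random.default_rng(seed)
    df = 2 * n - 2
    t_crit = t.ppf(1 - (1 - conf) / 2, df)
    sp = np.sqrt(rng.chisquare(df, reps) / df)   # simulated pooled SDs
    moe = t_crit * sp * np.sqrt(2 / n)
    return float(np.mean(moe <= target_moe))

print(round(assurance_two_groups(44), 2))  # about 0.99
```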

**precision for planning** supports PfP for what are probably the two most common designs: **two independent groups**, and the **paired design**. The paired design, with a single repeated measure (for example Pretest-Posttest) has the advantage, where it is possible and appropriate, of usually giving higher precision. The critical feature is the correlation in the population between the two measures, such as Pretest and Posttest. Higher correlation gives a shorter CI on the paired difference and therefore higher precision.

To use PfP we need to specify a value for **ρ** (Greek rho), the **population correlation**. Ideally, previous research gives us a reasonable estimate we can use; otherwise we might have to guess. For research with human participants, typical values are often around .6 to .9.

Here’s a PfP picture for the **Paired Design**, with **ρ set to .70**.

The red curve shows us that a single group of *N* = 21 suffices for target MoE = 0.50 with assurance, when ρ = .70. Compare with two groups of *N* = 44 for the independent groups design. Great news!

However, as you might guess, *N* is highly sensitive to ρ. For ρ = .60 we need *N* = 25, but for ρ = .80 we need only *N* = 16 (or *N* = 9, on average).
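The ρ sensitivity follows from the SD of the paired differences, which is σ√(2(1 − ρ)): MoE shrinks as ρ grows. A rough formula-based sketch in Python (mine, not esci; it ignores the small-sample bias of the sample SD, so it gives the ‘on average’ MoE, not the with-assurance value):

```python
from scipy.stats import t

def expected_moe_paired(n, rho, conf=0.95):
    """Approximate average MoE, in population-SD units, of the CI on the
    mean paired difference for n pairs. With both measures having SD
    sigma and population correlation rho, the SD of the differences is
    sigma * sqrt(2 * (1 - rho))."""
    t_crit = t.ppf(1 - (1 - conf) / 2, n - 1)
    return t_crit * (2 * (1 - rho) / n) ** 0.5

for rho in (0.6, 0.7, 0.8):
    print(rho, round(expected_moe_paired(21, rho), 2))
```

For *N* = 21, average MoE drops from about 0.41 at ρ = .6 to about 0.29 at ρ = .8.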

It’s wonderful that **precision for planning** makes it easy to explore how *N*, target MoE, choice of design, and–for the Paired Design–ρ, all co-vary. Be fully informed before you choose a design and *N*!

Go to **esci web** and see all six components as here:

Search the blog for ‘**Gordon**’ to find three posts introducing the previous five components.

Please explore any and all of the **six components**. Send your bouquets to **Gordon Moore**, and your comments and suggestions to any of us.

Enjoy!

Geoff

As you may recall, **ITNS2** will be accompanied by Bob’s data analysis software, **esci**, in R, and Gordon’s web-based simulations and tools, all of which are based on, and go beyond, my Excel-based **ESCI**. Together the web-based goodies, now including **dance r**, comprise **esci web**.

**dance r** takes random samples from a bivariate normal population, with a correlation you choose.

Playing for yourself is *way* better than seeing the pic. A few things to try:

- Watch the **population cloud** change for different *ρ* values
- Explore the changing length and **asymmetry of CIs** for different *r* values
- Watch the sampling distribution of correlations (the *r* heap) build
- See how its **skew** changes with *ρ*
- Investigate the capture percentage of **95% CIs**
- Study what changes, and how fast, as you change *N*

A key challenge for students–and researchers–is to build good intuitions about the extent of **uncertainty**, including the extent of sampling variability. **dance r** is a great arena in which to build those intuitions about correlations.

As I say, we’d love to have your feedback.

Enjoy.

Geoff

Science is under attack around the world, and vital data are being ignored–or totally rejected. Time for a **good news** statistics story. My bedside reading is a recent issue of **Significance** (unfortunate title!) magazine, which goes to members of both the Royal Statistical Society (U.K.) and the American Statistical Association.

It’s mostly behind a paywall for 12 months, but, happily, this article is a free **download**: **Science after Covid-19: Faster, better, stronger?** Dare we hope?!

Simon Schwab and Leonhard Held, of the Centre for Reproducible Science, University of Zurich, describe how this year:

- 30 publishers agreed to make Covid-19 research papers and data **freely available**–no paywalls
- Uploading of Covid-related **preprints** exploded, as the figure above shows
- Quick action is encouraging rapid and open reviews of preprints, e.g. via **Outbreak Science**

Schwab and Held also discuss:

- The value of peer review *before* studies are conducted. Some journals offer **registered reports**, and aim to review study plans within 7 days.
- Ways that fast and high-quality **peer reviewing** can be supported.
- The need for rigour and **best-practice methods**, as well as speed, and prompt **systematic reviews**. Then presentation of evidence-based advice for public policy and practice.

They conclude “courses in good research practice should be widely adopted to address highly relevant topics such as study design, open science, statistics and reproducibility … and preparation must also include the training of teams for rapid synthesis of relevant evidence. We cannot be prepared enough for the next global health crisis.”

In other words, **Open Science**! Bring it on–on World Statistics Day, and every day.

Geoff

As I explained, **ITNS2** will be accompanied by Bob’s data analysis software, **esci**, in R, and Gordon’s web-based simulations and tools, all of which are based on, and go beyond, my Excel-based **ESCI**. Together the web-based goodies comprise **esci web**, which you can open in your browser **here**. (Or use the ESCI menu above and choose **esci web** from the dropdown.) From today, **esci web** has four components, with perhaps two yet to come.

**distributions, d picture, and correlation** are visual statistical tools, developed in JavaScript. We’d love to have your feedback.

See the curves, explore *z* scores, find areas, find critical values.

What does *d* = 0.2 look like? How much overlap of distributions? What about *d* = 0.5, 1.0, 1.5, …?
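There’s a handy closed form for that overlap question: two normal distributions with equal SDs and means *d* apart overlap in proportion 2Φ(−*d*/2). A quick check in Python (mine, not the **d picture** code), using only the standard library:

```python
from statistics import NormalDist

def overlap(d):
    """Overlapping proportion of two normal distributions with equal SDs
    whose means differ by Cohen's d: OVL = 2 * Phi(-|d| / 2)."""
    return 2 * NormalDist().cdf(-abs(d) / 2)

for d in (0.2, 0.5, 1.0, 1.5):
    print(d, round(overlap(d), 2))
```

So *d* = 0.5 means about 80% overlap, and even *d* = 1.5 leaves about 45%.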

What do you think is the *r* value in each of these scatterplots?

——— Don’t read on just yet. Have an eyeball of the scatterplots. What is each *r*?

——— Last chance… look back up…

OK, the correlation is .3 in all cases. True, if possibly strange. (All the data sets come from a bivariate normal distribution, and in all cases the data set correlation is .3.)

Pro tip: Eyeball, or turn on, a cross through the means, as in lower right. Then eyeball the approximate comparative number of dots in the (top right + lower left) quadrants and the (top left + lower right) quadrants. Correlation is a tussle between the first two (the *matched* quadrants) and the second two (the *unmatched*).
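The tussle is easy to simulate. This Python sketch (mine; the sample size and seed are arbitrary choices) draws bivariate normal data with ρ = .3, centres on the means, and counts dots in the matched versus unmatched quadrants:

```python
import numpy as np

rng = np.random.default_rng(42)
n, rho = 1000, 0.3
# Draw n points from a bivariate normal with correlation rho
x, y = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], n).T

# Centre on the sample means, then count dots per pair of quadrants
dx, dy = x - x.mean(), y - y.mean()
matched = int(np.sum(dx * dy > 0))    # top right + lower left
unmatched = int(np.sum(dx * dy < 0))  # top left + lower right
print(matched, unmatched)             # matched wins for positive r
```

For any positive correlation the matched quadrants win the tussle, and the bigger ρ, the more lopsided the count.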

Investigate that and other cool things in **correlation**.

As I say, access **esci web** **here**, and please let us have your comments.

Enjoy,

Geoff

…as I was asked recently. A question every author loves to hear. The short answer is **ITNS**, preferably to be followed by **ITNS2**, coming in 2021 we hope. Here’s an overview:

Main changes from the first edition: **fabulous new software**:

**esci** (in R) for data analysis and great graphs with CIs, by Bob, and

**esci web** (in JavaScript) for dance simulations and tools, by Gordon Moore

There’s even more about **Open Science**, and some new examples–timely studies that have used Open Science practices.

The first introductory textbook to combine **the new statistics** (CIs, estimation, meta-analysis) with **Open Science** practices, from the start and all through. Starts at the very beginning, but goes far enough to include meta-analysis, regression, and simple two-way designs. Basic formulas only, many pictures and interactive demos. Lots of examples. Lots of online resources to support teachers and students.

A streamlined version of **ESCI**, software that runs under Excel, is used throughout the book. More information **here**. Read Contents and Chapter 1 **here**. Publisher’s website **here**. Support materials are **here**. **ESCI intro** is **here**. Amazon page **here**.

The original book, aimed at upper year undergraduates through to researchers. Explains in detail why the dichotomous thinking of NHST is damaging and should be replaced by **the new statistics**. **Estimation **and **meta-analysis** are introduced from the start. Some is just a little technical, for example three chapters on meta-analysis. Predates Open Science. No regression. Accompanied by original **ESCI**, running under Excel. More information **here**. Publisher’s website **here**. **ESCI **is **here**. Amazon page **here**.

Whichever you choose, I hope the book, software and all the materials serve you well. Together let’s change the world, towards better research and statistical practices.

Geoff

I’m delighted to report that they have now posted a **preprint** of their results **here**. We’d love to have **your comments and suggestions**.

Max explored six approaches to calculating a CI for the DR. He used simulation to investigate their properties, especially coverage, and identified two that give excellent CIs. He provides (**here**) R code to allow any researcher to calculate the CI on the DR for their own data, for a range of measures. All Max’s simulation materials are available on OSF **here**, so anyone can recreate or extend Max’s work.

Below is Figure 1 from the preprint, as an example of how the DR and its CI may be reported in a forest plot.

In the figure, DR = 1.40 is reported along with three conventional measures of heterogeneity, all with CIs. Both the RE (Random Effects) and FE (Fixed Effect) diamonds are shown in the forest plot, so it’s easy to eyeball DR, which is simply the length of the RE diamond divided by that of the FE diamond. DR = 1 suggests little or no heterogeneity, and increasing values of DR suggest increasing heterogeneity. One vital message is given by the CI on the DR, which is [0, 3.09], so this meta-analysis, which integrates only 10 studies, can give us only a very imprecise estimate of heterogeneity.

Along with the DR, the figure reports the 95% prediction interval (PI) for true effect sizes as a further estimate of heterogeneity. Borenstein et al. (2017) advocated use of the PI, which is reported here to be 0.285. The red line segment just under the RE diamond pictures that length. Informally, that segment illustrates the likely extent of spread of true effect sizes. The PI is 4 × *T*, where *T* is the estimated population SD of true effect sizes. The very long CI reported for *T* indicates once again a very imprecise estimate of heterogeneity.
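To make the DR concrete, here is a bare-bones Python sketch (my illustration, not Max’s R code from the preprint): each diamond’s length is proportional to the SE of the corresponding combined estimate, with the RE model using a DerSimonian-Laird estimate of τ², the between-study variance:

```python
import numpy as np

def diamond_ratio(effects, variances):
    """Diamond ratio: length of the random-effects (RE) CI on the
    combined effect divided by the length of the fixed-effect (FE) CI.
    RE uses a DerSimonian-Laird estimate of tau^2, the between-study
    variance. A bare-bones sketch, not the code from the preprint."""
    es = np.asarray(effects, float)
    v = np.asarray(variances, float)
    w = 1 / v                                   # FE weights
    fe_var = 1 / w.sum()                        # variance of the FE estimate
    q = np.sum(w * (es - np.sum(w * es) / w.sum()) ** 2)  # Cochran's Q
    c = w.sum() - np.sum(w ** 2) / w.sum()
    tau2 = max(0.0, (q - (len(es) - 1)) / c)    # DerSimonian-Laird tau^2
    re_var = 1 / np.sum(1 / (v + tau2))         # variance of the RE estimate
    return float(np.sqrt(re_var / fe_var))      # ratio of diamond lengths

print(diamond_ratio([0.5, 0.5, 0.5], [0.04, 0.04, 0.04]))      # 1.0: homogeneous
print(diamond_ratio([0.1, 0.5, 0.9], [0.01, 0.01, 0.01]) > 1)  # heterogeneous
```

With identical studies τ² = 0 and DR = 1; the more the study effects spread beyond their sampling error, the longer the RE diamond and the larger the DR.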

In the preprint we conclude that the DR, and its CI, can be valuable for students as they learn about meta-analysis, and for researchers as they interpret and communicate their meta-analyses.

**It would be great to have any comments about Max’s work and the preprint. Thanks!**

Geoff

Max: mrcairns994@gmail.com Geoff: g.cumming@latrobe.edu.au

Borenstein, M., Higgins, J. P., Hedges, L. V., & Rothstein, H. R. (2017). Basics of meta-analysis: *I*² is not an absolute measure of heterogeneity. *Research Synthesis Methods, 8*, 5–18. doi:10.1002/jrsm.1230

We are now releasing Gordon’s **dances** in beta, and seek your feedback. Developed in JavaScript, **dances** opens in your browser via **this link**. ITNS2 will be accompanied by Bob’s data analysis software, **esci**, in R, and Gordon’s web-based simulations, all of which are based on, and go beyond, my Excel-based **ESCI**. The first and most important of Gordon’s simulations is **dances**, which replaces and goes beyond **CIjumping** in ESCI.

Below are four examples of **dances** bringing key statistical ideas alive. These are frozen images: It’s *way* more convincing watching the simulations dancing down the screen.

Getting started with **dances**:

- Open **dances** in a browser
- Click on the ‘**?**’ at top right in the control panel (left side of screen) to turn on popout tips, which give brief explanations when the mouse hovers over labels or controls.
- Use the three big buttons. Play as you wish. Click ‘Clear’ to start again.

Take repeated samples of size *N* = 20 from the pictured normally distributed population. Watch the pattern of values (blue open circles) jump around from sample to sample. Watch the means (green dots) from successive samples dance down the screen: So much variation, even with samples of size 20! This is the **dance of the means**.

Place 95% CIs on each of the dancing means, again with samples of *N* = 20. CIs that don’t capture the population mean, mu (blue line), are red. In the short term, red CIs seem to come very haphazardly, sometimes rarely, sometimes in clumps. In the long term, however, very very close to 95.0% of CIs will capture mu and 5.0% will be red.

This happens when CIs are all the same length, being based on the population SD, sigma, assumed known. Remarkably, it also happens when, as in the picture below, CIs vary in length because they are based on sample SDs, when sigma is assumed not known. Either way, we are seeing the **dance of the CIs**.
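The long-run 95% capture rate is easy to verify by simulation. A Python sketch (mine, not Gordon’s JavaScript; population and sample-size values match the examples in this post):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(7)
mu, sigma, n, reps = 60, 20, 20, 10_000
t_crit = t.ppf(0.975, n - 1)      # critical t for a 95% CI, sigma not known

samples = rng.normal(mu, sigma, (reps, n))
means = samples.mean(axis=1)
moes = t_crit * samples.std(axis=1, ddof=1) / np.sqrt(n)
capture = float(np.mean(np.abs(means - mu) <= moes))
print(round(capture, 3))          # very close to 0.95 in the long run
```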

The falling means pile up to form the **mean heap**; means in the heap keep their colour, red or green. In the long run, the mean heap shape will closely match the theoretically expected, normally distributed, sampling distribution curve.

The **central limit theorem** states that, almost whatever the shape of the population distribution, the sampling distribution of sample means will be approximately normal. Furthermore, the larger the samples, the closer the sampling distribution will be to normal.

In **dances** you can draw whatever weird shape of population distribution you choose, then take samples of some chosen size, *N*, and compare the mean heap with the normal curve.

The figure below shows that, even with my hand-drawn, highly skewed population, and samples as tiny as *N* = 3, the mean heap is much less skewed than the population, and surprisingly close in shape to the symmetric normal curve.
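You can reproduce the effect with any skewed population. The sketch below (in Python; an exponential population stands in for a hand-drawn skewed one) compares the skewness of the population with that of the mean heap for *N* = 3:

```python
import numpy as np

def skewness(x):
    """Sample skewness: mean cubed standardised deviation."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(3)
# A heavily right-skewed population (exponential, skewness = 2)
population = rng.exponential(1.0, 100_000)
# The 'mean heap': means of samples as tiny as N = 3
means = rng.exponential(1.0, (100_000, 3)).mean(axis=1)

print(round(skewness(population), 2), round(skewness(means), 2))
```

The exponential population has skewness about 2; means of just three values already drop that to about 1.15, on the way to the normal curve’s 0.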

Run a replication, exactly the same as the original experiment but with a new sample, and find that the *p* value is likely to be very different. The sampling variability of the *p* value is surprisingly large: Alas, we simply shouldn’t trust any *p* value.

The figure below shows the **dance of the CIs** and the corresponding *p* values—which vary from <.001 to more than .8! Deep blue patches mark *p*>.10, through to bright red patches for *p*<.001. This is the **dance of the p values**!

Population mean, mu, is 60, and SD, sigma, is 20. The null hypothesis is H0: mu0 = 50, so the effect size in the population is half of sigma, or Cohen’s delta = 0.50, conventionally considered to be a medium-sized effect. With *N* = 16, the power is about .50, which is typical for many research fields in psychology and some other disciplines.

The running simulation is way more vivid than any picture, especially when sounds are turned on, ranging from a bright trumpet for *p*<.001 down to a deep trombone for *p*>.10.

Change *N*, or population effect size, and see generally lower or higher *p* values but, most surprisingly, in every case the values of *p* still jump around dramatically.
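Here’s a minimal simulation of the dance in Python (mine, not **dances** itself), using the same set-up as above: μ = 60, σ = 20, H0: μ0 = 50, *N* = 16:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(11)
mu, sigma, n, mu0, reps = 60, 20, 16, 50, 10_000

samples = rng.normal(mu, sigma, (reps, n))
t_stat = (samples.mean(axis=1) - mu0) / (samples.std(axis=1, ddof=1) / np.sqrt(n))
p_values = 2 * t.sf(np.abs(t_stat), n - 1)   # two-sided p for each replication

power = float(np.mean(p_values < .05))
print(round(power, 2))                       # roughly .5, as the text says
print(float(p_values.min()), float(p_values.max()))
```

About half the *p* values fall below .05, yet in the same run some are below .001 and others above .8.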

For videos of such dances, search YouTube for ‘**dance of the p values**’ and ‘**significance roulette**’.

Figures and dances like those shown here will come in Chapters 4, 5, and 6 in ITNS2.

Meanwhile, please have a play with Gordon’s wonderful **dances** and let us have your thoughts and suggestions. Thanks.

Geoff

The **dance of the p values** was my first go at making vivid the amazingly large sampling variability of the *p* value.

But that’s when we know the population mean. What about a more realistic situation, when all we know from the initial experiment is the *p* value? What is a close replication, just the same but with a new sample, likely to give? In other words, what’s **replication p**?

In most cases replication *p* can take just about any value.

For explanation and all the formulas, see **this paper** (cited 375 times). For the demo, search YouTube for ‘Significance Roulette’ to find two videos. Or they are **here** and **here**.

The figure above is a wheel that’s equivalent to **the distribution of replication p** following an initial experiment that gives a particular *p* value.

Of course, if the initial *p* value is different we’ll need a different wheel. Below is the wheel for initial *p* = .01. More red (*p* < .001) and less deep blue (*p* > .10), but still an alarmingly wide spread of possibilities.
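A toy simulation conveys the idea, though it is my simplification rather than the paper’s formulas: suppose the true effect happens to equal the effect observed initially, so a replication *z* is normal with mean 1.96 (the *z* for two-sided *p* = .05) and SD 1:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2008)
z_obs = norm.ppf(1 - .05 / 2)            # z for two-sided initial p = .05
# Toy assumption: the true effect equals the observed effect, so a
# replication z is normal with mean z_obs and SD 1
z_rep = rng.normal(z_obs, 1, 100_000)
p_rep = 2 * norm.sf(np.abs(z_rep))       # replication p values

print(round(float(np.percentile(p_rep, 10)), 4),
      round(float(np.percentile(p_rep, 90)), 2))
```

Even under this optimistic assumption, the middle 80% of replication *p* values spans roughly .001 to .5; allowing for uncertainty in the true effect spreads them further still.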

You don’t believe me? It took me ages to accept what the formulas and simulations were telling me. But they are correct. We really, really can’t trust a *p* value–which seems to promise certainty, a clear outcome. Far, far better to use the **confidence interval**, whose length makes the degree of uncertainty salient, even tho’ that’s often a depressing message.

Geoff

From Fiona’s review:

“Together with his overview of the replication crisis, this introduction would be useful for undergraduates or general readers.

“Fraud, bias, negligence and hype are the themes of *Science Fictions*. Some of the cases Ritchie presents… are intriguing and disturbing combinations of all four.

“This comprehensive collection of mishaps, misdeeds and tales of caution is the great strength of Ritchie’s offering.”

Then come the **really interesting bits**, including discussion of Ritchie’s interpretation of the problem and his view of what science should be and what he sees it as having become. Fiona can expertly set that all in perspective, as a scholar of the history and philosophy of science, and now metascience. Definitely worth the read. She’d recommend the book also, despite some flaws.

Geoff

Ritchie, S. (2020). *Science fictions: How fraud, bias, incompetence, and hype undermine the search for truth*. Metropolitan Books.

This post describes the approach I’m considering–if you have any feedback, I’d be very happy to have it.

**The goals**

I’m writing functions that will provide estimates with CIs for different effect sizes (means, mean differences, R, proportions, etc.)

**Goal 1** – I’d like all functions to work with both summary and raw data. For example, I want users to be able to do:

`estimate_mean(mtcars, mpg, conf.level = 0.99)`

As well as

`estimate_mean(m = 10, s = 2, n = 5)`

**Goal 2** – I’d like to try to make the functions compatible with tidyverse style and tidy evaluation (to the extent I understand either of these). So, for tidy style my goal is to use snake_case for function names. For tidy evaluation, it should be possible to send in column names unquoted, and to arbitrarily expand the number of columns to be processed. As in:

`estimate_mean(data = mtcars, y = mpg, cyl, wt, gear)`

**Goal 3** – I’d like the package to be clean and easy to use. I don’t want different functions for slightly different styles of input (raw data vs. summary data). I **do** want the auto-fill for a function to be informative.

**Goal 4** – I’d like the code to be efficient and easy to maintain. My approach has been to write a basic function that deals with summary data only, and then to use that as the basis for processing raw data, hopefully avoiding code duplication.

**Current Implementation**

To meet these goals, I’ve been tinkering with R’s S3 classes and the UseMethod dispatcher, which lets you route function calls to different implementations based on the types of the objects passed. I’ve found this to be quite a journey. I’ve devised a working approach (code listing below), but I’m eager for feedback, as it feels a bit icky–I am not confident that I’m dealing with unquoted arguments well/properly.

**Detecting unquoted arguments… is there a better way?** To dispatch properly, I’d like to detect if the user has passed in a column name as an unquoted argument (rather than, say, a vector). However, there doesn’t seem to be a way to dispatch on an unquoted argument (they don’t have a natural class, and checking the class before quoting causes an error). You can quote first, but literally anything can be quoted, so you don’t end up with a substantively different object if the original argument was unquoted vs. not. There does not seem to be an ‘is.unquoted’ or the like to check. So I’ve hit upon testing by using:

`qpassed <- try(eval(y), silent = TRUE)`

If y is an unquoted column name, this eval throws an error, and qpassed obtains the class “try-error”. This feels a bit icky. And it doesn’t distinguish an unquoted column name from a typo in an otherwise valid expression.

**Is it OK to dispatch based on a calculated switch rather than what is actually passed?** In the set of functions I envision, the real key to how to dispatch is the format of y. If y is a vector, I want to dispatch to a function that estimates from a vector. If y is an unquoted column name, I want to dispatch to a function that handles a column from a data frame. If y is the beginning of a list of unquoted column names, I want to dispatch to a function that handles that list. And finally, if y is null it means summary data has been passed in, so I want to dispatch accordingly. The thing is, though, you **can’t** dispatch on y because it might be an unquoted column name, and these don’t have a class, can’t be evaluated before quoting, and therefore won’t be dispatched properly. I was stuck, and **then** it dawned on me that I could code the class of y in another variable (I called it switch) and ask UseMethod to use this calculated variable even though it wasn’t passed–and it does! And yet it does not actually pass the dispatching variable to the function that gets called. That’s not like any other class-dispatch mechanism I’ve ever seen… it seems to work, but feels kind of wrong.

**Is there no way to tell a vector from a single number?** I was surprised when writing the dispatch function to pass in a vector and have the class show up as numeric. It looks like R doesn’t assign a vector class to vectors. Is there some way to distinguish single numbers from vectors? (length?)

**as_label rather than quo_name** – it looks like quo_name is being retired in favor of as_label… Not really a question, just a note to myself to update my code for this.

**Current Implementation:**

Here’s the dispatch function I’ve generated for estimating a single mean. It dispatches to one of four functions: 1) one that works with summary data (.numeric), 2) one that works with a vector (.vector), 3) one that works with a single column name (.character), or 4) one that works with a list of column names (.list). Although the dispatch function itself feels icky to me, I do like that these four functions are stacked on top of each other (the list calls the single column, which calls the vector, which calls the numeric). I *think* that’s the way to go to avoid code duplication and make maintenance easy–but happy for feedback on this as well.

This is just a skeleton that I will use/elaborate for other functions. I am still planning on adding assertive-based input checking, and providing a rich, well-named, and consistent output object.

```
estimate_mean <- function(data = NULL,
                          y = NULL,
                          ...,
                          m = NULL,
                          s = NULL,
                          n = NULL,
                          conf.level = 0.95) {
  switch <- 1  # class "numeric" by default: dispatch to the summary-data method
  if (!is.null(m)) {
    # Summary data passed: raw-data arguments must not also be passed
    if (!is.null(data)) stop("You have passed summary statistics, so don't pass the 'data' parameter used for raw data.")
    if (!is.null(y)) stop("You have passed summary statistics, so don't pass the 'y' parameter used for raw data.")
  } else {
    # Raw data passed: summary arguments must not also be passed
    if (!is.null(s)) stop("You have passed raw data, so don't pass the 's' parameter used for summary data.")
    if (!is.null(n)) stop("You have passed raw data, so don't pass the 'n' parameter used for summary data.")
    # Evaluating an unquoted column name throws an error; a vector does not
    qpassed <- try(eval(y), silent = TRUE)
    if (!inherits(qpassed, "try-error") && is.numeric(qpassed)) {
      if (!is.null(data)) stop("You have passed y as a vector of data, so don't pass the 'data' parameter used for data frames.")
      class(switch) <- "vector"
    } else {
      dotlist <- rlang::quos(...)
      if (length(dotlist) != 0) {
        switch <- list()        # class "list": several column names passed
      } else {
        switch <- "character"   # class "character": a single column name
      }
    }
  }
  UseMethod("estimate_mean", switch)
}

#' @export
estimate_mean.numeric <- function(m, s, n, conf.level = 0.95) {
  sem <- s / sqrt(n)
  # MoE is the critical t value *times* the SEM
  moe <- stats::qt(1 - (1 - conf.level) / 2, n - 1) * sem
  ci.low <- m - moe
  ci.high <- m + moe
  res <- list(m = m,
              sem = sem,
              moe = moe,
              ci.low = ci.low,
              ci.high = ci.high,
              formatted = stringr::str_interp(
                "mean = $[.2f]{m} ${conf.level*100}% CI [$[.2f]{ci.low}, $[.2f]{ci.high}]"
              ))
  class(res) <- "estimate"
  return(res)
}

#' @export
estimate_mean.vector <- function(y, conf.level = 0.95) {
  y <- stats::na.omit(y)
  estimate_mean.numeric(mean(y), stats::sd(y), length(y), conf.level = conf.level)
}

#' @export
estimate_mean.character <- function(data, y, conf.level = 0.95) {
  y_quoname <- rlang::as_label(rlang::enquo(y))
  estimate_mean.vector(data[[y_quoname]], conf.level = conf.level)
}

#' @export
estimate_mean.list <- function(data, ..., conf.level = 0.95) {
  res <- list()
  for (y_var in rlang::quos(...)) {
    y_name <- rlang::as_label(y_var)
    # Splice the column name back in as a bare symbol; a plain !! in an
    # ordinary function call would not do the splicing
    y_call <- rlang::expr(estimate_mean.character(data, !!rlang::sym(y_name), conf.level = conf.level))
    res[[y_name]] <- eval(y_call)
  }
  class(res) <- "estimate_list"
  return(res)
}
```

So… if any R package developers have thoughts or suggestions, I’d be glad to have them. Thanks in advance.
