## The Shape of a Confidence Interval: Cat’s Eye or Plausibility Picture, and What About Cliff?

In brief:

• Curves picture how likelihood varies across and beyond a CI. Which is better: One curve (plausibility picture) or two (cat’s eye)? Which should we use in ITNS2?
• Curves can discourage dichotomous decision making based on a belief that there’s a cliff in strength of evidence at each limit of a 95% CI, i.e. at p=.05.
• Explanation and familiarity are probably needed for curves on a CI to encourage estimation, rather than mere dichotomous interpretation.

## Variation Across and Beyond a CI

A CI is most likely to land so that some point near the centre is at the unknown but fixed μ we wish to estimate. Less likely is that a point towards a limit is at μ. Of course there’s a 5% chance that a 95% CI lands so that μ is outside the interval. This pattern of relative likelihood is illustrated in Figure 1.2 from ITNS:

The curve illustrates the relative plausibility that various values along the axis are μ. The higher the curve, the better the bet that μ lies here. Keep in mind that our interval is one from the dance and that it’s the interval that varies over replication, while μ is assumed fixed but unknown. This single curve on a CI is the plausibility picture.

## Cat’s Eyes

In this paper back in 2007 I played around with the black and white images at left, among others, as ways to picture how plausibility varies over and beyond the interval. The black bulge became the cat’s eye picture of a CI, as illustrated by the blue images in Figure 5.9 (below) from ITNS.

The 95% interval, in Figure 1.2 (above, at the top) and the middle of the blue figures, extends to include 95% of the area under the curve, or between the two curves. Similarly for the 99% and 80% CIs.
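To make the link between the curve and the interval concrete, here is a minimal Python sketch. It assumes a normal sampling distribution with known standard error (a z-based simplification; the ITNS figures may well be based on t), and the function names are invented for illustration only.

```python
from statistics import NormalDist

def ci_limits(mean, se, level=0.95):
    """Lower and upper limits of a z-based CI (assumes a known SE)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * se, mean + z * se

def relative_plausibility(x, mean, se):
    """Height of the curve at x, scaled so the height at the mean is 1."""
    curve = NormalDist(mean, se)
    return curve.pdf(x) / curve.pdf(mean)

def area_within(lower, upper, mean, se):
    """Proportion of the area under the curve between two values."""
    curve = NormalDist(mean, se)
    return curve.cdf(upper) - curve.cdf(lower)

# Illustrative values only: mean 0, SE 1
lower, upper = ci_limits(0.0, 1.0, level=0.95)
height_at_limit = relative_plausibility(upper, 0.0, 1.0)
area = area_within(lower, upper, 0.0, 1.0)
```

Under these assumptions the 95% limits enclose 95% of the area, and the curve at a limit is still about 15% as high as at the centre. The height falls away smoothly on either side of the limit, with no step.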

I don’t say that every graph with CIs needs to picture the cat’s eye, but do suggest that students and researchers would benefit from familiarity with the idea of plausibility changing smoothly across and beyond a CI. See any CI and, in your mind’s eye, see the cat’s eye bulge.

To what extent do researchers and students appreciate that pattern of variation across a CI? Pav Kalinowski and Jerry Lai, who worked with me years ago, investigated this question. This blog post (The Beautiful Face of a Confidence Interval: The Cat’s Eye Picture) describes their findings, with a link to the published results. In short, people’s intuitions are mostly inaccurate and highly diverse, but a bit of training and familiarity with the cat’s eye is encouragingly effective in improving their intuitions about the variation in plausibility. These results prompted use of the cat’s eye in UTNS (pp. 95-102) and ITNS.

## Plausibility Pictures

More recently, the single curve, rather than the two mirror-image curves, has found favour: the plausibility picture, rather than the cat’s eye. Below is an example that Bob made in R, which appears in our eNeuro paper (Bob’s blog post is here). The CIs are 90% CIs.

The plausibility picture is shown here only on the CI on the difference, not on the two CIs to the left. It may be better to restrict the shading under the curve to the extent of the CI, and perhaps the curve could be half the height, so as not to be so visually dominant. Perhaps. Below is a variation on the same idea, from this preprint that Bob recently tweeted about.

The lower figure plots the mean and 95% CI, with plausibility picture, for the differences between the three rightmost conditions and the WT-EGFP condition at left. The dabestR package was used to make the figure. (The authors generously describe such a figure as a ‘Cumming estimation plot’. I’m happy if UTNS and ITNS have popularised the use of a difference axis to picture a difference with its CI, but I later discovered that the idea goes back a while. The earliest examples I know of are in this 1986 BMJ article by Martin Gardner and Doug Altman, which includes two figures showing a difference with its CI on a difference axis, without any curve on that CI.) Let us know of any earlier examples.

## One or Two Curves? Plausibility Picture or Cat’s Eye?

Yes, plausibility picture or cat’s eye? I don’t know of any empirical study of which is more readily understood, or more effective in carrying the message of variation over the extent of a CI and beyond. There’s probably not much in it, so it comes down to a matter of taste. I’m sentimentally attached to the cat’s eye, but admit that the single curve is more visually parsimonious. Simplest would be a single fine line depicting the curve, with no shading. It would need to extend beyond both ends of the CI, but perhaps not by much. Perhaps such a curve is as good as anything. It would be great to have some evidence relevant to these questions. Meanwhile, I’d love to hear your views.

## A Cliff at p=.05? At the End of a CI?

If a 95% CI is used merely to note whether or not the interval includes the null hypothesised value, we’re throwing away much of the information it offers and descending to mere dichotomous decision making. Undermining such ideas was one of my main motivations for playing with curves on a CI. In fact, no sharp change occurs exactly at a limit of a CI, as curves should make clear. Dropping the little crossbars at the end of the CI graphic (UTNS included crossbars, ITNS does not) was another attempt to de-emphasise CI limits.

To what extent do people think that a result falling just below p=.05 rather than just above it makes a difference? What about just inside or just outside a CI? Back in 1963 Rosenthal and Gaito, in this article (image below), asked psychology faculty and graduate students about their degree of confidence in a result, on a scale from 0 to 5, for various different p values. They identified a relatively steep drop in confidence either side of .05, and described this result as a cliff.

Here are their averaged results, degree of confidence plotted against p value:

Yes, the steepest part of the curves looks to be from .05 to .10, and results for the graduate students (top 3 curves) show a kink at .05, but the cliff is hardly precipitous. If we were not so indoctrinated about .05, perhaps we’d see these curves as suggesting a relatively steady drop, rather than a sudden cliff. I wonder whether any tendency towards cliff has increased since 1963?

Jerry Lai conducted an online version of this study, with published researchers from psychology and medicine as respondents. One version of his task asked about p values, another about CI figures–which showed intervals overlapping zero to varying extents corresponding to the various p values. His results are summarised here, along with brief mention of other similar studies since 1963. He found a diversity of shapes of curves: Only a few showed a steep cliff, many showed a weak cliff, as in the R & G average results above, some a more-or-less linear decline, and others some other shape. Psychology and medical researchers gave a similar diversity of curves. Results for CIs showed, if anything, more evidence of cliff than did those for p values. Alas!

Jerry’s chosen title for his article was: Dichotomous Thinking: A Problem Beyond NHST. In other words, CIs can easily be used merely to carry out NHST. A 2010 article from my group titled Confidence intervals permit, but do not guarantee, better inference than statistical significance testing reported evidence that researchers in psychology, behavioural neuroscience, and medicine tended to make much better interpretations of results shown with CIs if they avoided thinking about the CIs in terms of NHST.

### A remarkable Open Science story

Jerry wondered how the curves for R & G’s individual participants may have varied from the average curves in the figure above. He wrote a very polite letter to Prof Rosenthal. By return of post came a charming and encouraging note to Jerry, enclosing several photocopied sheets of handwritten notes, which neatly set out full details of the experiment and the data for individuals. Yes, there was quite a diversity of curve shapes. Some 50 years on, the original data were still available! Bob Rosenthal was putting to shame many subsequent researchers who could not maintain data beyond the life of a particular computer and/or were not willing to share it with other researchers.

## What About a Violin Plot?

I was delighted to see this preprint:

Bob tweeted about it a couple of weeks ago. It reports the results of online statistical cognition surveys. A blog post of ours a year ago, here, invited participation.

The authors asked participants to rate their confidence that an effect was non-zero, given CI figures corresponding to p values ranging from .001 through .04, .05, and .06, and up to .8. The figures included the standard 95% CI graphic, with little crossbars at the ends, and a violin plot as at left. The researchers found that they needed to give some explanation of the violin plot, especially considering that a violin plot usually represents the spread of data points, rather than a CI: it’s usually a descriptive rather than an inferential picture, as here. I suspect that would have been clearer if the violin plot had included the standard CI graphic–as the cat’s eye and plausibility picture do.

Overall, there was a small-to-moderate cliff effect between the .04 and .06 figures. The cliff was rather smaller for the violin plot than for the standard CI graphic.
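How different do the .04 and .06 figures actually look? The rough Python sketch below converts a two-sided p value into the distance between the point estimate and the null value, expressed in units of the CI's margin of error (MoE). It assumes a z-based test and CI. The function name is hypothetical and this is not the code used in any of the studies mentioned.

```python
from statistics import NormalDist

def null_position_in_moe(p, level=0.95):
    """Distance from the point estimate to the null value, in MoE units.

    A value of 1.0 means the null value sits exactly on a CI limit;
    above 1.0 it lies outside the interval, below 1.0 inside it.
    Assumes a two-sided z test and a z-based CI.
    """
    std = NormalDist()
    z_for_p = std.inv_cdf(1 - p / 2)          # z matching the two-sided p value
    z_for_ci = std.inv_cdf(0.5 + level / 2)   # z that defines the CI's MoE
    return z_for_p / z_for_ci

positions = {p: null_position_in_moe(p) for p in (0.001, 0.04, 0.05, 0.06, 0.8)}
```

With these assumptions, p = .05 puts zero exactly at the limit, p = .04 puts it about 1.05 MoE from the mean, and p = .06 about 0.96 MoE. The two figures on either side of the supposed cliff are nearly identical.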

## Conclusions

• The studies mentioned above don’t give us strong or definitive conclusions; we need replications.
• Curves probably help CI interpretation, especially by discouraging mere dichotomous decision making.
• The plausibility picture, perhaps without shading, may be the simplest and most parsimonious choice.
• Some training and familiarity with any picture that includes one or more curves may be needed for full effectiveness.
• There’s lots of scope for valuable empirical studies, perhaps especially of the plausibility picture.

A final question: ‘plausibility picture’, ‘plausibility curve’, ‘likelihood curve’, ‘relative likelihood curve’, or what? What’s your preference and why?

I’d love to have comments on these issues, and, especially, suggestions for our CI strategies in ITNS2, the second edition we’re currently working on.

Geoff

## The ASA and p Values: Here We Go Again

The above announcement is from the February ASA (American Statistical Association) newsletter. (See p. 7 for the announcement and the list of 15 members of the Task Force.)

Why won’t statistical significance simply wither and die, taking p<.05 and maybe even p values with it? The ASA needs a Task Force on Statistical Inference and Open Science, not one that has its eye firmly in the rear view mirror, gazing back at .05 and significance and other such relics.

I shouldn’t be so negative: I definitely am glad that ‘reproducibility’ is a focus, even if ‘Open Science’ may suggest a wider view.

To welcome the new Task Force, Andrew Gelman posted an invitation to discussion. His post, which is here, is sensible. Stuart Hurlbert makes some useful early contributions, but most of the 152 (as of now) comments make me tired and depressed as I skim through.

You may recall the background, including:

I suggest that much of what’s needed can be summarised by this basic logic:

• We need Open Science (replication, preregistration, open data and materials, fully detailed publication whatever the results, …).
• Open Science requires replication.
• Replication requires meta-analysis.
• Meta-analysis requires estimation. (Frequentist, Bayesian, or…)

Note that there is NO necessary role for NHST or p values in any of the above. Historically, p values have caused much damage, especially by prompting selective reporting, which biases meta-analysis, perhaps drastically. Simply let them wither and fade into the background.

Beyond all that, we’d love to be moving to more quantitative modelling, which makes estimation even more necessary.

Watch for a call for submissions to the Task Force.

Bring on the revolution…

Geoff

## Not so Difficult? This Parrot ‘Gets’ Statistical Inference

If you have tramped or climbed in New Zealand’s high country, as I did for a couple of months many moons ago, you’ve probably spent hours watching kea exploring or ‘playing’. Kea are large parrots with wicked-looking beaks that are highly social, and notorious among mountaineers for their ability to find food, no matter how hard you try to hide it.

This article in Nature Communications now presents evidence that kea can make probability judgments. Moreover, they can combine different types of evidence in forming a probability judgment. In other words, they seem to understand statistical inference.

Here’s the abstract:

(Photo from here.) In the experiments, kea first learned that black, but not orange, sticks would bring a food reward. Then they were offered choices between an unseen stick taken from, for example, the left pot in the photo, and one taken from the right pot. Seeing the pot gave information about the relative probabilities of drawing black and orange from each pot.

I was alerted to this research by this article in The Conversation, which gives a nice summary, and includes some great photos, which are not for reproduction.

Yes, only 6 animals were tested, and I can’t see any sign that the experiments and analyses were preregistered. But full data, and method and analysis scripts are available online, and several experiments providing converging evidence are reported. Maybe replications are planned.

The authors used Bayesian analyses, most basically to assess the numbers of correct trials in blocks of 20 trials, where 10/20 would be expected by chance. For example, in Table 1 (here and scroll down) the proportion correct is given for each of 6 animals in 6 different conditions. Proportions with a Bayes Factor (BF) of 3.0 or more in favour of the ‘better than chance’ hypothesis are shown in bold; 27 of the 36 are bold. This illustrates a Bayesian dichotomous decision making strategy analogous to marking individual results with ‘*’ if their individual p value is less than .05. As usual, I’d hope for estimation results, whether using Bayesian credible intervals or frequentist confidence intervals–rather than either Bayesian or NHST dichotomous classification of results.
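To give a feel for the contrast between dichotomous classification and estimation, here is a small Python sketch for a single block of 20 trials. The prior is my assumption, not the authors': a uniform prior on the true proportion under the alternative hypothesis, which is a simple textbook choice and very likely differs from what the paper used. The Wilson interval is likewise just one reasonable frequentist choice, and the function names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

def bf10_uniform_prior(k, n):
    """Bayes factor for H1 (proportion ~ Uniform(0, 1)) vs H0 (proportion = 0.5).

    Under a uniform prior the marginal likelihood of k successes
    in n trials is exactly 1 / (n + 1).
    """
    marginal_h1 = 1 / (n + 1)
    likelihood_h0 = comb(n, k) * 0.5 ** n
    return marginal_h1 / likelihood_h0

def wilson_interval(k, n, level=0.95):
    """Wilson score confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = k / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# Illustrative block: 15 correct out of 20
bf = bf10_uniform_prior(15, 20)
lower, upper = wilson_interval(15, 20)
```

Under these assumptions, 15/20 gives a Bayes factor just over 3, so it would earn bold type, whereas 14/20 would not. The interval for 15/20 runs from a little above .5 to nearly .9, which conveys how imprecise a single block of 20 trials is. That is information the bold/not-bold split hides.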

So, are kea smarter than some of our students, who seem to have such difficulty grasping the basics of statistical inference? Or is the problem that, at least in the past, we’ve insisted on burying inference under a weird superstructure of NHST and p values?

Who’s up for some experiments to check out how kea can go with confidence intervals?

Geoff

P.S. It really is worth going high in New Zealand, and not only for the kea watching.

## What Psychology Teachers Should Know About Open Science and the New Statistics

Morling, B., & Calin-Jageman, R. J. (2020). What Psychology Teachers Should Know About Open Science and the New Statistics. Teaching of Psychology, 47(2), 169-179. doi: 10.1177/0098628320901372

First, here’s the overview diagram, a great teaching explainer in itself:

I agree with just about all they say, and note in particular:

• The title refers to ‘psychology teachers’, not only statistics and methods teachers. In the journal it’s placed in The Generalist’s Corner. This is important: Every teacher of psychology (yep, and lots of other disciplines too) needs to take up Open Science issues when presenting and discussing any research findings. Beth and Bob give lots of advice on good ways to do this. Authors of *any* psychology textbook take note.
• “Psychological science is experiencing a period of rapid methodological change” (p. 169). That’s a restrained way to put it–arguably the advent of Open Science is the most exciting and important advance in how science is done for a long time. Bring it on.
• Three questions provide a framework and mnemonic for the new statistics–the three simple questions to the right in the diagram above. They are on point, though I’d consider “How precise?” as an alternative to the second, even if it’s not as straightforward and pithy as “How wrong?”. The three also appear in the title of Bob’s and my TAS article.
• There’s so much more gold: links to great teaching resources for Open Science, simulations, suggestions for classroom dialogue, and more. (Discuss preregistration by playing it out in the classroom: students make predictions, record these, then analyse data and discuss results in the light of their prior expectations.)
• The authors’ passion for teaching, and for the essential changes they are discussing, shines through. They could make more of what I’m sure is their conviction–that the new ways are way more satisfying to teach, and way more readily understood by students. Happier students *and* teachers: What’s not to like!

Points I’m pondering:

### ‘Registration’ or ‘preregistration’?

I posted about this question a couple of months ago. ‘Registration’ is long established in medicine. Why does psychology persist with ‘prereg…’, a longer term, with its internal redundancy? It’s not a big deal, and maybe we’re stuck with the messiness and possible ambiguity of using both terms. Beth and Bob stick with current psychology practice by using ‘prereg…’ throughout, but explain ‘registered reports’–which are simply reports based on preregistered (and refereed) plans.

### Do we have the full story?

I do like the three questions (dot point 3 above), but I also like very much our beginning Open Science question, introduced on p. 9 in ITNS. ‘Do we have the full story?’ can easily be explained as prompting scrutiny of numerous aspects of research, from preregistration through informative statistics (ESs and CIs, of course); full reporting of the method, data, and analyses; to consideration of other relevant studies that may not have been made available.

### Confirmatory or Planned?

My main disagreement with the authors is over their use of confirmatory/exploratory as the distinction between analyses that have been planned, and preferably preregistered, and those that are exploratory. It’s a vital distinction, of course, but ‘confirmatory’, while a traditional and widely-used term, does not capture well the intended meaning. Confirmatory vs exploratory probably originates with the two approaches to using factor analysis. It could make sense to follow an exploratory FA that identified a promising factor structure with a test of that now-prespecified structure with a new set of data. That second test might reasonably be labelled ‘confirmatory’ of that structure, although the data could of course cast doubt on rather than confirm the FA model under test.

By contrast, a typical preregistered investigation, in which the research questions and the corresponding data analysis are fully planned, asks questions about the sizes of effects. It estimates effect sizes rather than seeks to ‘confirm’ anything. Even an evaluation of a quantitative model against data typically focuses on estimating parameters and perhaps estimating goodness of fit, rather than confirming, in some yes/no way, the model. Therefore I regard ‘planned’, rather than ‘confirmatory’, as a more accurate and appropriate term to use in opposition to ‘exploratory’. I’d vote for planned/exploratory as the terms to describe the vital distinction in question.

### Bottom line

It’s a great article, well worth reading and discussing with colleagues.

Geoff

## A Tribute to Wayne Velicer

Wayne Velicer was a giant among quantitative psychologists and health researchers, among other groups. I was very fortunate to be able to call him a colleague and good friend. Sadly, he died too young, in October 2017.

The journal Multivariate Behavioral Research has just published online-before-print a tribute to Wayne. Here’s the reference:

Lisa L. Harlow, Leona Aiken, A. Nayena Blankson, Gwyneth M. Boodoo, Leslie Ann D. Brick, Linda M. Collins, Geoff Cumming, Joseph L. Fava, Matthew S. Goodwin, Bettina B. Hoeppner, David P. MacKinnon, Peter C. M. Molenaar, Joseph Lee Rodgers, Joseph S. Rossi, Allie Scott, James H. Steiger & Stephen G. West (2020) A Tribute to the Mind, Methodology and Mentoring of Wayne Velicer, Multivariate Behavioral Research,

Here’s the abstract:

### Abstract

Wayne Velicer is remembered for a mind where mathematical concepts and calculations intrigued him, behavioral science beckoned him, and people fascinated him. Born in Green Bay, Wisconsin on March 4, 1944, he was raised on a farm, although early influences extended far beyond that beginning. His Mathematics BS and Psychology minor at Wisconsin State University in Oshkosh, and his PhD in Quantitative Psychology from Purdue led him to a fruitful and far-reaching career. He was honored several times as a high-impact author, was a renowned scholar in quantitative and health psychology, and had more than 300 scholarly publications and 54,000+ citations of his work, advancing the arenas of quantitative methodology and behavioral health. In his methodological work, Velicer sought out ways to measure, synthesize, categorize, and assess people and constructs across behaviors and time, largely through principal components analysis, time series, and cluster analysis. Further, he and several colleagues developed a method called Testing Theory-based Quantitative Predictions, successfully applied to predicting outcomes and effect sizes in smoking cessation, diet behavior, and sun protection, with the potential for wider applications. With \$60,000,000 in external funding, Velicer also helped engage a large cadre of students and other colleagues to study methodological models for a myriad of health behaviors in a widely applied Transtheoretical Model of Change. Unwittingly, he has engendered indelible memories and gratitude to all who crossed his path. Although Wayne Velicer left this world on October 15, 2017 after battling an aggressive cancer, he is still very present among us.

I’m one of 17 of his colleagues who contributed a short tribute. Here’s mine:

### Geoff Cumming wrote:

Way back, Wayne invited me to visit his lab. I gave a talk about the iniquities of NHST and p values, and the benefits of confidence intervals. At once it was clear that we shared many views. Wayne enthusiastically brainstormed about what estimation could do in his research. On the spot, he invited me to help develop a paper using confidence intervals to evaluate one of his multi-variable quantitative models. Over several years we exchanged drafts; it was exciting for me to work with such a creative, energetic, and distinguished colleague. The paper appeared in Applied Psychology: An International Review, and has provoked much comment.

My first experience of Wayne as food and wine buff, and wonderfully generous host, occurred when my wife and I visited R.I. while driving an old RV around the U.S. Thereafter, memorable meals—at venues selected by Wayne the expert—became, for me, highlights of American Psychological Association Conventions.

Wayne often mentioned that he loved visiting Australia, where he had close research colleagues, although we managed to meet up here only once. Besides his enduring friendship and expansive hospitality, I remember most warmly his ability to ruffle scientific feathers to such creative and positive effect.

### Using CIs to Assess the Quantitative Predictions Made by a Multivariable Model

The paper I mentioned above (reference below) includes this figure:

This figure also appears in my first book, where I discussed this example on pp. 426-7.

Velicer and colleagues chose ω² (vertical axis above) as the main ES, an estimate of the proportion of total variance in smoking status–a measure of a person’s position on the spectrum from regular smoker to successful quitter–attributable to each of a number of predictor variables (horizontal axis) of their Transtheoretical Model of behaviour change. The grey dots mark the model’s quantitative predictions; the short horizontal lines mark estimates from a large data set, with 95% CIs. The predictions fall within the CIs for 11 of the 15 variables, which we interpreted as strong support for most aspects of the model. Because the discrepancies between predictions and data are quantitative, we could examine each and decide whether to modify an aspect of the model, or await further empirical testing. We discussed our test of the Transtheoretical Model more broadly as an illustration of the value of CIs for model fitting, and looked forward to the development of many more quantitative models in psychology. A focus on CIs then allows, as above, the evaluation of such models against new sets of data.

Velicer, W. F., Cumming, G., Fava, J. L., Rossi, J. S., Prochaska, J. O., & Johnson, J. (2008). Theory testing using quantitative predictions of effect size. Applied Psychology: An International Review, 57, 589-608. doi: 10.1111/j.1464-0597.2008.00348.x

I salute the memory and intellectual legacy of Wayne.

Geoff

## Teaching in the New Era of Psychological Science

A great collection of articles in the latest issue of PLAT. Contents page here. At that page, click to see the abstract of any article.

A big shout out to the wonderful editorial team that assembled this special issue: Susan A. Nolan (Seton Hall University, USA), Tamarah Smith (Gwynedd Mercy University, USA), Kelly M. Goedert (Seton Hall University, USA), Robert Calin-Jageman (Dominican University, USA)

The intro (as above) gives a good overview and summary of all the articles. It’s on open access here. As the editors say, the issue includes two reviews, three articles, and three reports, which together cover a wide range of Open Science issues, all from a teaching perspective.

Here are my haphazard thoughts:

• Anglin and Edlund report a survey of psychology instructors. Overall, most don’t teach much about OS practices, but believe that more should be done. They identify the current incentive structure for researchers as a big problem and high priority for change. Indeed!
• Three articles include discussion of Bayesian approaches to data analysis and interpretation. It’s certainly a big issue how statistical inference practice should and will develop in psychology. Can beginners be introduced successfully to Bayesian methods? (Play with JASP and read the van Doorn et al. article for an inkling of how things are developing.) Should students be exposed to both conventional and Bayesian approaches to statistical thinking, or will this merely create confusion? If so, at which stage in a student’s statistical education? If first one then the other, which should come first?
• I’m struck by the extent of student involvement and activity in many of the courses and approaches described. Flipped classroom, student-led discussions and classes, expectations of student initiative, and more. Very heartening! Such approaches tie in well with projects that follow the full sequence of a research project, from question conception through all the steps to full reporting and contemplation of further replication. I wish I, way back as a student, had experienced such courses!
• Several articles describe courses and projects in which students collaborate, perhaps across several universities, to carry out a replication of a published study. Involving groups of students means that the replication can be usefully large. Following OS practices means that the results should be publishable, and a useful contribution to the literature. Such projects can also provide an excellent educational experience for upper year undergraduates and perhaps masters students.
• Back in 2015 at the WPA Annual Convention in Las Vegas I was part of a symposium on collaborative replication projects involving students. Many of those attending the symposium were faculty in liberal arts schools, many of whom had little scope to conduct research, but who were expected to train their students in research methods. I felt a palpable enthusiasm in the room for this very new (at that time) idea that their students could participate in worthwhile large-scale collaborative replication projects that could provide excellent training, while being practically achievable within the limited resources available in many cases. Several of the PLAT articles describe how this approach has now developed considerably.
• Taking a broad perspective, the editors “wonder if the approach these authors outlined might also be a way to combat an anti-science climate. If a greater number of students are actually engaged in science (e.g., in replications) rather than just class projects, they may view themselves as part of a larger scientific community, and, in turn, be less likely to have an overall distrust of science.” Bring this on!

Enjoy,

Geoff

## Bushfires and Open Science: A View From Australia

### Our Family Summer in a Time of Fires

We’re just back home after a couple of weeks at the big old family beach house. We had one stinking hot day, 40+ degrees, but, strangely, other days were cool to cold, usually with strong swirly winds. So different from the long spells of searing heat further out to the East. We had only a few beach visits but lots of indoor games and gang self-entertainment by the kids. People came and went, but usually it was a pleasantly chaotic mob of 15 or so people. We watched the Test cricket–beating the Kiwis, yay! And, like the rest of the world, were aghast to see the pictures and hear the reports of the enormous fires up and down the East and South-East coasts.

We were at Anglesea, one of a string of small towns along the Great Ocean Road to the west of Melbourne. There have been big fires down that way in the past, but this year, so far, nothing major, although the peak months of Feb and March are still ahead. We not only enjoyed a family holiday, but–new these last couple of years–also kept a careful eye on the sky to the West, and on the excellent emergency app that pings when any warning is issued for any ‘watch zone’ you care to nominate. So we kept our phones charged and didn’t forget to check the car had a full tank, and drinking water and blankets on board–and we reminded ourselves where the two evacuation areas in the town are, and how we can most quickly get there. And what we mustn’t forget if we have to load and leave quickly. Note to self: Find the little old battery radio, to keep nearby to hear emergency warnings if the electricity and phone reception both die.

We drove home through smoke haze, two hours, headlights on, never quite adapting to the stinging smell of smoke. The smoke came, we were told, not from the Victorian fires a few hundred km to the East, but from fires in Tasmania a thousand km to the South. Some of the fires have been burning for months, especially up north, and Sydney has had days of dangerous smoke at various times these last several months. Most ominously, it seems that these fires are on track to increase Australia’s already scandalously high carbon emissions by more than 50% over this July-to-June period, and very likely by 100% or more. More than double! A tipping point, anyone?

Yes, life on this planet is changing and we’re all in it together. But what has this to do with Open Science? I think we can learn by listening to climate scientists–not only about the carbon emission reductions we should have been doing 30 years ago and must, with great urgency, be doing now, but also about what good research practice looks like, under pressures that most of us can only imagine.

BTW, I’ll mention The Conversation (‘Academic rigour, journalistic flair’), which is running what seems to me an outstanding series of pieces on the fires. This week, see the Mon, Tues, Wed, and today’s offerings.

### Climate Scientists at Work–Implications for Open Science

The best I’ve read recently is Gergis, J. (2018) Sunburnt Country. Carlton, Vic, Australia: Melbourne University Press.

In the acknowledgements, the author first thanks the teachers in the professional writing course she undertook before writing the book. The training shows–her writing is beautiful and compelling. Gergis tells the fascinating story of her research on the history of climate in Australia, going back centuries and more. She draws on data series from ice cores, tree growth rings, and coral growth rings, as well as evidence from early indigenous people and European explorers. She assembles overwhelming evidence that Australia is getting hotter and drier and, what’s more, is now experiencing weather extremes far beyond natural variability. And that human activities over the last century and more are by far the most important cause of global heating.

Comparable analyses, bringing together a wide range of data and applying a number of climate models, have been carried out for the Northern Hemisphere, but the work of Gergis and her many colleagues is a first for the Southern Hemisphere. The stories in the two hemispheres are roughly parallel, yet different, most notably by showing evidence that the toxic effect of industrialisation started later down south, not until about the mid-19th-century.

I’ll mention several aspects of this–and much other–climate research:

#### Integration of evidence–similar measures

We are becoming used to using meta-analysis to integrate results based on similar studies using the same or similar measures. Climate scientists do this routinely, sometimes on a very large scale as when different time series of e.g. CO2 concentration are integrated.
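To make ‘integrate results’ concrete, here is a minimal Python sketch of fixed-effect, inverse-variance pooling, which is one common way to combine estimates from similar studies. It is an illustration only: the function name is hypothetical, the numbers are made up, and it is not a description of how any climate research group actually combines its series.

```python
from math import sqrt


def fixed_effect_meta(estimates, standard_errors):
    """Pool study estimates with inverse-variance weights (fixed-effect model).

    Returns the pooled estimate, its standard error, and an approximate
    95% CI based on the normal distribution.
    """
    # More precise studies (smaller SE) get more weight.
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = sqrt(1 / total_weight)
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci


if __name__ == "__main__":
    # Hypothetical effect sizes and standard errors, invented for illustration.
    made_up_estimates = [0.42, 0.31, 0.55]
    made_up_ses = [0.10, 0.15, 0.20]
    pooled, se, (lo, hi) = fixed_effect_meta(made_up_estimates, made_up_ses)
    print(f"pooled estimate = {pooled:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The pooled CI is shorter than any single study’s CI, which is the payoff of integration. A random-effects model would be the more cautious choice when studies are likely to differ in more than sampling error.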

#### Integration of evidence–diverse measures

We sometimes talk about converging approaches as providing a powerful strategy for establishing a robust finding. See for example pp. 414-416 in ITNS. Climate scientists do this on a massive scale, as when multiple data series of quite different types are brought together to build an overall picture of change over time.

#### Using multiple quantitative models

Some fields in psychology use quantitative modelling, and its use is spreading slowly across the discipline. In stark contrast, the development of highly complex quantitative models is core business for climate researchers. Psychologists might feel that human behaviour, cognition, and feelings are about as complex as anything can get, but climate scientists are attempting to model and understand a planet’s biosphere, whose complexity is approaching comparable, I think. One of our arguments for the new statistics is that estimation is essential for quantitative modelling, so using estimation is an essential step towards a quantitative discipline. Gergis and her team apply multiple climate models to their diverse data sets to account for what has happened in the past and provide believable predictions for the future. You can guess that these predictions are scary beyond belief, unless our practices change drastically and very soon.

#### Open data and analysis code

It seems to be taken for granted that data sets are openly available, at least to all other researchers. The same for the models themselves, and the analyses carried out by any research group.

#### Reproducibility and peer scrutiny

Gergis describes how, again and again, any analysis or evaluation of a model by her group is subjected to repetition and intense scrutiny, at first by others within the group, then by other researchers. Only after all issues have been dealt with, and everything repeated and re-examined once more, is a manuscript submitted for publication. Which leads to further exacting peer scrutiny, and possible revision, before publication. It’s exhausting and sobering to read about such a painstaking process.

#### The Dark Side

You probably know about the hideous harassment meted out to climate researchers. Any aspect of their work can be attacked, with little or no justification. Vitriolic attacks can be personal. They may have to spend vast time and emotional energy responding to meaningless legal or other challenges. This toxic environment is no doubt one reason for the intense scrutiny I described above. We know all that, but even so it’s moving and enraging to read the trials that Gergis and her group have endured.

#### Large is Good? Small is Good?

Discussions in psychology of p-hacking and cherry-picking almost always assume that large is good: Researchers yearn for effects sufficiently large to be interesting and to gain publication. Open Science stipulates multiple strategies to counter such bias. Climate science offers an interesting contrast: Climate scientists tend to scrutinise their analyses and results, dearly hoping that they have slipped up somewhere and that their estimates of effects are too large. Surely things can’t be this bad? So back they go, checking and double-checking, if anything making their analyses and conclusions more conservative, rather than–as perhaps in psychology–exaggerated.

#### Researchers Have Emotions too

Of course many psychologists are emotionally committed to their research–we can feel disappointment, frustration, and–with luck–elation. I suspect, however, that rarely are these emotions likely to match the strength of emotions that climate scientists sometimes experience. Besides the personal cost of the vitriolic personal attacks, finding results that map out a hideous future for humankind–including our own children and grandchildren–can be devastating. In a forceful opinion piece a few months ago (The terrible truth of climate change), Gergis wrote:

“Increasingly after my speaking events, I catch myself unexpectedly weeping in my hotel room or on flights home. Every now and then, the reality of what the science is saying manages to thaw the emotionally frozen part of myself I need to maintain to do my job. In those moments, what surfaces is pure grief.”

#### And so…

All the above may make our Open Science efforts pale. But, as we try to figure out how to improve the trustworthiness of our own research, I think it’s worth pondering the strategies this other discipline has adopted in its own effort to give us results that we simply must believe.

Geoff

P.S. A warm thank you to those friends and colleagues near and far who have sent enquiries, and messages of concern and support. Very much appreciated. Yes, we are all in this together.

## Banishing “Black/White Thinking”

### eNeuro publishes some teaching guidance

You may recall that eNeuro published a great editorial and a supporting paper by Bob and me–mainly Bob. Info is here.

It has now published a lovely article giving teaching advice about ways to undermine students’ natural tendency to think dichotomously. If I could wave a magic wand and change a single thing about researchers’ thinking and approach to data analysis, I’d ask the magic fairy to replace Yes/No research questions with ‘How much…?’ and ‘To what extent…?’ questions. Then maybe we could at last move beyond blinkered significant/nonsignificant, yes/no thinking to estimation thinking. The article (pic below) is here.

Abstract
Literally hundreds of statisticians have rightly called for an end to statistical significance testing (Amrhein et al., 2019; Wasserstein et al., 2019). But the practice of arbitrarily thresholding p values is not only deeply embedded in statistical practice, it is also congenial to the human mind. It is thus not sufficient to tell our students, “Don’t do this.” We must vividly show them why the practice is wrong and its effects detrimental to scientific progress. I offer three teaching examples I have found to be useful in prompting students to think more deeply about the problem and to begin to interpret the results of statistical procedures as measures of how evidence should change our beliefs, and not as bright lines separating truth from falsehood.

In the abstract (above), I love ‘congenial to the human mind’. Yes, we seem to have an inbuilt tendency to think in a black-white way. Overcoming this is the challenge, especially when it has been endlessly reinforced during more than half a century of obeisance to p < .05. I also love the ‘vividly’–surely the best way to get our message across. That’s why I keep banging on about the dance of the p values, and significance roulette. (Search for these at YouTube.)
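For readers who would rather see the dance in a console than in a video, here is a minimal Python sketch. It is not the code behind the YouTube demos. It assumes a simple z test with known σ, and the function name, sample size, and true effect are all hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist, mean


def dance_of_p_values(n_replications=25, n=32, true_effect=0.5, sigma=1.0, seed=1):
    """Replicate the same experiment many times; return each two-sided p value.

    Each replication draws n observations from a normal population with mean
    true_effect and SD sigma, then tests H0: mu = 0 with a z test (sigma known).
    """
    rng = random.Random(seed)  # seeded so the dance is repeatable
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_replications):
        sample = [rng.gauss(true_effect, sigma) for _ in range(n)]
        z = mean(sample) / (sigma / n ** 0.5)
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


if __name__ == "__main__":
    for i, p in enumerate(dance_of_p_values(), start=1):
        verdict = "p < .05" if p < 0.05 else "p >= .05"
        print(f"replication {i:2d}: p = {p:.3f}  ({verdict})")
```

Nothing changes between replications except sampling variability, yet the p values typically jump around a great deal. Running it with a few different seeds makes the point vividly, and makes a bright line at .05 look rather arbitrary.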

### Scroll down for the interesting bits

Before you click on the pdf link, scroll right down to see the reviewing history of the ms. Bob was the reviewer. The story is a nice example of constructive peer reviewing. Bob and the editor liked the original and made a number of suggestions for strengthening it. The author adopted many of these, but in some cases explained why they were not adopted.

Note also that there are some PowerPoint slides for download, to help the busy teacher.

Enjoy,

Geoff

## Farewell and Thanks Steve Lindsay

Psychological Science, the journal, has for years pushed hard for publication of better, more trustworthy research. First there was the leadership of Eric Eich, then Steve Lindsay energetically took the baton. Steve is about to finish, no doubt to his great relief. His ‘swan song’ editorial has just come online, with open access:

Steve starts with a generous reference to a talk of mine at Victoria University in Wellington N.Z. No doubt I talked mainly about the huge variability of p values, and ran the ‘dance of the p values’ demo. (Search for ‘dance of the p values’ at YouTube to find maybe 3 versions.)

Then he gives a brief and modest account of the great strides the journal took under Eric’s and then his own leadership. Indeed, Psychological Science has been vitally important in advancing our discipline! I’m sure, also, that it has had beneficial influence well beyond psychology.

All eyes are now on the incoming editor, Patricia Bauer. To what extent will she keep up the Open Science pressure, the further development of journal policies and practices to keep our field moving towards ever more reproducible and open–and therefore trustworthy and valuable–research?

I join Steve in wholeheartedly wishing her well.

Geoff

## NeuRA Ahead of the Open Science Curve

I had great fun yesterday visiting NeuRA (Neuroscience Research Australia), a large research institute in Sydney. I was hosted by Simon Gandevia, Deputy Director, who has been a long-time proponent of Open Science and The New Statistics.

NeuRA’s Research Quality page describes the quality goals they have adopted, at the initiative of Simon and the Reproducibility & Quality Sub-Committee, which he leads. Not only goals, but strategies to bring their research colleagues on board–and to improve the reproducibility of NeuRA’s output. My day started with a discussion with this group. They described a whole range of projects they are working on to strengthen research at NeuRA–and to assess how quality is (they hope!) increasing.

For example, Martin Heroux described the Quality Output Checklist and Content Assessment (QuOCCA) tool that they have developed, and are now applying to recent past research publications from NeuRA. In coming years they plan to assess future publications similarly–so they can document the rapid improvement!

I should mention that Martin and Joanna Diong run a wonderful blog, titled Scientifically Sound–Reproducible Research in the Digital Age.

It was clear that the R&Q folks, at least, were very familiar with Open Science issues. Would my talk be of sufficient interest for them? Its title was Improving the trustworthiness of neuroscience research (it should have been given by Bob!), and the slides are here. The quality of the questions and discussion reassured me that at least many of the folks in the audience were (a) on board, but also (b) very interested in the challenges of Open Science.

After lunch my ‘workshop’ was actually a lively roundtable discussion, in which I sometimes managed to explain a bit more about Significance Roulette (videos are here and here), demo a bit of Bob’s new esci in R, or join in brainstorming strategies for researchers determined to do better. My slides are here.

Yes, great fun for me, and NeuRA impresses as working hard to achieve reproducible research. Exactly what the world needs.

Geoff
