Last month I (Bob) visited a local elementary school for a “Science Alliance” visit, a program in our community to bring local scientists into the classroom. I brought the Cartoon Network simulator I have been developing (Calin-Jageman, 2017, 2018). This simulator is simple enough that kids can use it, but complex enough to generate some really cool network behaviors (reflex chains, oscillations, etc.). The simulation can be hooked up to a cheap USB robot, so kids can design the ‘brains’ of the robot, giving it the characteristics they want (fearful–to run away from being touched; aggressive–to track light; etc.).

The kids *loved* the activity–the basic ideas were easy to grasp and they were quickly exploring, trying things out, and sharing results with each other. They made their Finches chirp and dance, and in the process discovered recurrent loops and the importance of inhibition.

In developing Cartoon Network, my inspiration was Logo, the programming language developed by Seymour Papert and colleagues at MIT. I was a “Logo kid”–it was basically the only thing you *could* do in the computer lab my elementary school installed when I was in second grade. Logo was *fun*–you could draw things, make animations…it was a world I wanted to explore. But Logo didn’t hand you everything up front–as you went along you would find yourself needing key programming concepts. I clearly remember sitting in the classroom writing a program to draw my name and being frustrated at having to re-type the commands to make a B at the end of my name when I had already typed them out for the B at the beginning of my name. The teacher came by and introduced me to functions, and I remember being so happy about the idea of a “to b” function. I immediately grasped that I could write a function for every letter once and then have the turtle type anything I wanted in no time at all. Pretty soon I had a “Logo typewriter” that I was soooo proud of. I could viscerally appreciate the time I had saved, as I could quickly make messages to print out that would have taken me a whole class period to code ‘by letter’.

Years later I read Mindstorms, Papert’s explanation of the philosophy behind Logo. This remains, to my mind, one of the most important books on pedagogy, teaching, and technology. Papert applied Piaget’s model of children as scientists (he had trained with Piaget). He believed that if you can make a microworld that is fun to explore, children will naturally need, discover, and understand deep concepts embedded in that world. That’s what I was experiencing back in 2nd grade–I desperately needed functions, and so the idea of them stuck with me in a way that they never would in an artificial “hello world” type of programming exercise. Having grown up a “logo kid”, reading Mindstorms was especially exciting–I could recognize myself in the examples, and connect my childhood experiences to the deeper ideas about learning Papert used to structure my experience.

Papert warned that microworlds must be playful and open-ended. Most importantly, a microworld should not be reduced to a ‘drill and skill’ environment where kids have to come up with *the* answer. Sadly, he saw computers and technologies being used exactly that way–to program the kids rather than having the kids program the computers. Sadder still, almost all the “kids can code” initiatives out there have lost this open-ended sense of exploration–they are mostly a series of specific challenges, each with one right answer. They do not inspire much joy or excitement; their success is measured in the number of kids pushed through. (Yes, there are some exceptions, like Minecraft coding, etc.… but most of the “kids can code” initiatives are just terrible, nothing like what Papert had in mind.)

So, what does all this have to do with statistics? Well, the idea of a microworld still makes a lot of sense and is also applicable to statistics education. Geoff’s dance of the means has become rightly famous, I would suggest, because it is a microworld users can explore to sharpen their intuitions about sampling, p values, CIs, and the like. Richard Morey and colleagues recently ran a study where you could sample from a null distribution to help make a judgement about a possible effect. And, in general, the use of simulations is burgeoning in helping researchers explore and better understand analyses (Dorothy Bishop has some great blog posts about this). Thinking of these examples makes me wonder, though–can we do even better? Can we produce a fun and engaging microworld for the exploration of inferential statistics, one that would help scientists of all ages gain deep insight into the concepts at play? I have a couple of ideas… but nothing very firm yet, and even less time to start working on them. But still, coming up with a Logo of inference is definitely on my list of projects to take on.

I’m going to end with 3 examples of thank-you cards I received from the 3rd grade class I visited. All the cards were amazing–they genuinely made my week. I posted these to Twitter but thought I’d archive them here as well.

This kid has some great ideas for the future of AI

“I never knew neurons were a thing at all”–the joy of discovery
“Your job seems awesome and you are the best at it”—please put this kid on my next grant review panel.
  1. Calin-Jageman, R. (2017). Cartoon Network: A tool for open-ended exploration of neural circuits. Journal of Undergraduate Neuroscience Education, 16(1), A41–A45. https://www.ncbi.nlm.nih.gov/pubmed/29371840
  2. Calin-Jageman, R. (2018). Cartoon Network Update: New Features for Exploring of Neural Circuits. Journal of Undergraduate Neuroscience Education, 16(3), A195–A196. https://www.ncbi.nlm.nih.gov/pubmed/30254530

The Multiverse! Dances, and More, From Pierre in Paris

Our Open Science superego tells us that we must preregister our data analysis plan, follow that plan exactly, then emphasise just those results as most believable. Death to cherry-picking! Yay!

The Multiverse

But one of the advantages of open data is that other folks can apply different analyses to our data, perhaps uncovering interesting things. What if we’d like to explore systematically a whole space of analysis possibilities ourselves, to give a fully rounded picture of what our research might be revealing?

The figure below shows (a) traditional cherry-picking–boo!, (b) OCD following of fine Open Science practice–hooray!, and (c) off-the-wall anything goes–hmmm.

That figure is from a recent (in fact, forthcoming) paper by Pierre in Paris and colleagues. The paper is here, and the reference is below at the end (Dragicevic et al., 2019). The abstract below outlines the story.

The Multiverse, Live

Pierre and colleagues not only discuss the multiverse idea in that paper, but also, here, give neat interactive tools that allow any reader of several example papers to do the exploration themselves. Hover the mouse, or click, to explore the outcome of different analyses.


I suggest Sections 4 and 5 in the paper are especially worth reading. Section 4 discusses what’s called explorable multiverse analysis reports (EMARs), with a focus on mapping out just what a rich range of possibilities there often are for alternative analyses.

Then Section 5 grapples with the (large) practical difficulties of building, reviewing, and using EMARs, with the aim of increasing insight into research results. Cherry-picking risks always need to be at the forefront of our thinking. Preregistering certain proposed uses of an EMAR may be possible, which could somewhat reduce those risks.

Play Multiverse on Twitter

Matthew Kay, one of the team, gave a great overview in 8 posts to Twitter. See the posts, and a bunch of GIFs in action here.



Dragicevic, P., Jansen, Y., Sarma, A., Kay, M., & Chevalier, F. (2019). Increasing the transparency of research papers with explorable multiverse analyses. CHI 2019 – The ACM CHI Conference on Human Factors in Computing Systems, May 2019, Glasgow, United Kingdom. doi:10.1145/3290605.3300295

Joining the fractious debate over how to do science best

At the end of the month (March 2019) the American Statistical Association will publish a special issue on statistical inference “after p values”. The goal of the issue is to focus on the statistical “dos” rather than statistical “don’ts”. Across these articles there are some common themes, but also some pretty sharp disagreements about how best to proceed. Moreover, there is some very strong disagreement about the whole notion of bashing p values and the wisdom of the ASA putting together this special issue (see here, for example).

Fractious argument is the norm in the world of statistical inference, hence the old joke that the plural of “statistician” is a “quarrel”. And why not? Debates about statistical inference get to the heart of epistemology and the philosophy of science–they represent the ongoing normative struggle to articulate how to do science best. Sharp disagreement over the nature of science is the norm–it has always been part of the scientific enterprise and it always will be. It is this intense conversation that has helped define and defend the boundaries of science.

Geoff has long been involved in debates over statistical inference and how to do science best, but this is new to me (Bob). I’m proud of the contribution we submitted to the ASA–I think it’s the best piece I’ve ever written. But I have to say that I go into the debate over inference (and science in general) with some trepidation. First, it is intrinsically gutsy to think you have something to say about how to do science best. Second, I’m the smallest of small fry in the world of neuroscience–so it’s not like I have notable success at doing science to point to in support of my claims. Finally, this ongoing debate has a long history and is populated by giants I look up to, most of whom (unlike me) have specialized in studying these topics. In my case, I’ve been learning on the go for the past ten years or so, starting from a foundation that involved plenty of graduate-level stats, but which didn’t even equip me to properly understand the difference between Bayesian and frequentist approaches to statistics.

As I wade into this fraught debate, I thought it might help me to reflect a bit on my own meta-epistemology–to articulate some basic premises that I hold to in terms of thinking about how to fruitfully engage in debate over inference and the philosophy of science. These premises are not only my operating rules, but also my philosophical courage–they explain why I think a noob like me can and should be part of the debate, and why I encourage more of my colleagues in the neurosciences and psychological sciences to tune in and jump in.

There are no knock-out punches in philosophy. This comes from one of my amazing philosophy mentors, Gene Cline. It has taken me a long time to both understand and embrace what he meant. As a young undergrad philosophy major I was eager to demolish–to embarrass Descartes’ naive dualism, to rain hell on Chalmers’s supposedly hard problem of consciousness, and to expose the circular bloviation of Kant’s claims about the categorical imperative. Gene (gradually) helped me understand, though, that if you can’t see any sense in someone’s philosophical position then you’re probably not engaging thoughtfully with their ideas, concerns, or premises (cf. Eco’s Island of the Day Before). It’s easy to dismiss straw-person or exaggerated versions of someone’s position, but if you interpret generously and take seriously their best arguments, you’ll find that no deep philosophical debate is easily settled. I initially found this infuriating, but I’ve come to embrace it. So I now look with healthy skepticism at those who offer knock-out punches (e.g. Morey, Hoekstra, Rouder, Lee, & Wagenmakers, 2015). I hope that in discussing my ideas with others I can a) take their claims and concerns seriously, engaging with the best possible argument for their position, and b) not offer my criticisms as sure and damning refutations… these only seem to exist when we’re not really listening to each other.

Inference works great until it doesn’t. As Hume pointed out long ago, there is no logical glue holding together the inference engine. Inference assumes that the past will be a good guide to the future, but there is no external basis for this premise, nor could there be (a rare knockout punch in philosophy? Well, even this is still debated). Even if we don’t mind the circularity of induction, we still have to respect the fact that the past is not always prelude: inference works great, until it doesn’t (cf. Mark Twain’s amazing discussion in Life on the Mississippi). So whatever system of inference we want to support, we should be clear-eyed that it will be imperfect and subject to error, and that when/how it breaks down will not always be predictable. This is really important in terms of how we evaluate different approaches to statistical inference–none will be perfect under all circumstances, so evaluations must proceed in terms of strengths/weaknesses and boundary conditions. The fact that an approach works poorly in one circumstance is not always a reason to condemn it. We can thoughtfully make use of tools that under some circumstances are dangerous.

We don’t all want the same things. Science is diverse and we’re not all playing the game in exactly the same way or for the same ends. I see this every year on the floor of the Society for Neuroscience conference, where over 30,000 neuroscientists meet to discuss their latest research. The scope of the enterprise is hard to imagine, and the diversity in terms of what people are trying to do is staggering. That’s ok. We can still have boundaries between science and pseudoscience without having complete homogeneity of statistical, inferential, and scientific approaches. So beware of people telling you what you, as a scientist, want to know. Beware of someone condemning all use of a statistical approach because it doesn’t tell them what they want to know. That’s my take on a good blog post by Daniel Lakens.

Nullius in verba* Ok – so we have to tread cautiously. But that caution does not mean we devolve into sophomoric inferential relativism (everyone’s right in some way; trophies for all!). We can still make distinctions and recognize differences. How? Well, to the extent that there is any “ground truth” in science, it is the ability to establish procedures for reliably observing an effect. We could be wrong about what the effect means. But we’re not doing science if we can’t produce procedures that others can use to verify our observations. This is embodied in the founding of the Royal Society, which selected the motto Nullius in verba, which means “take no one’s word for it” or “see for yourself” (hat tip to a fantastic presentation by Cristobal Young on this). We can evaluate scientific fields for their ability to be generative in this way–to establish effects that can be reliably observed and then dissected (not so fast, Psi research). We can also evaluate systems of inference in this way–for their ability (predicted or actual) to help scientists develop procedures to reliably observe effects. By this yardstick some methods of inference will be demonstrably bad (conducting noisy studies and then publishing the statistically significant results as fact while discarding the rest–bad!). But we should expect there to be multiple reasonable approaches to inference, as well as constant space for potential improvement (though usually with other tradeoffs). Oh yeah–this is a very slippery yardstick. It is not easy to discern or predict the fruitfulness of an inferential approach, and there can be strong disagreement about what counts as reliably establishing an effect.

This emphasis on replicability as essential to science cuts a tiny bit against my above point that not all scientists want the same thing. Moreover, in the negative reaction to the replication crisis, I’ve seen some commentaries where there seems to be little concern or regard for the standard of establishing verifiable effects. This, to my mind, stretches scientific pluralism past the breaking point: if you’re not bothered by a lack of replicability of your research, you’re not interested in science.

Authority will only get you so far. The debate over inference has a long history. It’s important not to ignore that history. But it is equally important not to use historical knowledge as a cudgel; appeals to authority are not a substitute for good argument. Maybe it is my outsider’s perception, but I feel like quotes from Fisher or Jeffreys or Meehl are sometimes weaponized to end discussion rather than contribute to it.

Ok – so those are my current ideas for how to approach arguments about science and statistical inference: a) embrace real statistical pluralism without letting go of norms and evaluation; b) ground evaluation (as much as possible) in what we think can best foster generative (reproducible) research; c) listen and take the best of what others have to offer; and d) try not to lean too heavily on the Fisher quotes.

At the moment, I’ve landed on estimation as the best approach for the statistical issues I face. I’m confident enough in that choice that I feel good advocating for the use of estimation by other scientists with similar goals. In advocating for estimation, I’m not going to claim a knock-out punch against p values or other approaches, or claim that the goals estimation can help with are the only legitimate goals to have. Moreover, in advocating for estimation, my goal is not hegemony. The misuse of p values is the current hegemony, and we don’t need to replace one imperial rule with another. I am helping a journal re-orient its author guidelines towards estimation (with or in place of p values)–but my goal is a diverse landscape of publication options in neuroscience, one where there are outlets for different but fruitful approaches to inference.

Ok – those are my thoughts for now on how to fruitfully debate about statistical inference.  I’m sure I have a lot to learn.  I’m looking forward to the special issue that will soon be out from the ASA and the debate that will surely ensue. 

*Thanks to Boris Barbour for pointing out I misquoted the Royal Society Motto in the original post.

  1. Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E.-J. (2015). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 103–123. doi:10.3758/s13423-015-0947-8

Journal Articles Without p Values

Once we have a CI, a p value adds nothing, and is likely to mislead and to tempt the writer or readers to fall back into mere dichotomous decision making (boo!). So let’s simply use estimation and never report p values, right? Of course, that’s what we advocate in UTNS and ITNS–almost always p values are simply not needed and may be dangerous.
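To make the “adds nothing” claim concrete, here is a minimal Python sketch (the data values and the function name are invented for illustration): a 95% CI for a mean excludes zero exactly when the corresponding two-sided p value is below .05, so the CI already carries the dichotomous verdict, plus information about precision.

```python
from statistics import NormalDist, mean, stdev
from math import sqrt

def ci_95(sample):
    """Large-sample 95% CI for a mean (z-based; a t-based interval
    would be slightly wider for small samples)."""
    z = NormalDist().inv_cdf(0.975)  # two-sided 5% critical value, ~1.96
    m = mean(sample)
    se = stdev(sample) / sqrt(len(sample))
    return m - z * se, m + z * se

# Whenever the 95% CI excludes zero, the two-sided p value is below .05,
# and vice versa -- so the CI contains the p value's verdict,
# plus an interval conveying the precision of the estimate.
data = [0.8, 1.2, 0.3, 1.5, 0.9, 1.1, 0.7, 1.4, 0.6, 1.0]  # made-up scores
lo, hi = ci_95(data)
print(f"95% CI [{lo:.2f}, {hi:.2f}]; excludes 0: {lo > 0 or hi < 0}")
```

For small samples a t critical value should replace the z value, but the equivalence between the interval and the significance verdict holds either way.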

Bob and I are occasionally asked for examples of published articles that use CIs, but don’t publish any p values. Good question. I know there are such articles out there, but–silly me–I haven’t been keeping a list.

I’d love to have a list–please let me know of any that you notice (or, even better, that you have published). (Make a comment below, or email g.cumming@latrobe.edu.au )

Here are a few notes from recent emails on the topic:

From Bob

First, if we look at the big picture, it is clear that estimation is on the rise.  CIs are now reported in the majority of papers both at Psych Science (which enjoins their use) *and* at JEP:General (which encourages them) (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0175583#pone-0175583-g001 ).  That suggests some good momentum.  However, you’ll see that there is no corresponding decline in NHST, so that means folks are primarily reporting CIs alongside p values.  It’s unclear, then, if this change in reporting is producing the desired change in thinking (Fidler et al. 2004 found that in medical journals CIs are reported but essentially ignored). 

As for specific examples…

- Adam Claridge-Chang’s lab developed cool software for making difference plots, and their work now reports both p values and CIs but focuses primarily on the CIs. The work is in neurobiology… so probably not the best example for psych graduate students.

  - Eriksson, A., Anand, P., Gorson, J., Grijuc, C., Hadelia, E., Stewart, J. C., Holford, M., & Claridge-Chang, A. (2018). Using Drosophila behavioral assays to characterize terebrid venom-peptide bioactivity. Scientific Reports, 8, 1–13. https://doi.org/10.1038/s41598-018-33215-2

- In my lab, we first started reporting CIs along with p values but focusing discussion on the CIs. For our latest paper we finally omitted the p values altogether, and yet the sky didn’t fall (Perez et al., 2018). This is behavioral neuroscience, so a bit closer to psych, but still probably not a great example for psych students.

  - Perez, L., Patel, U., Rivota, M., Calin-Jageman, I. E., & Calin-Jageman, R. J. (2018). Savings memory is accompanied by transcriptional changes that persist beyond the decay of recall. Learning & Memory, 25, 1–5. https://doi.org/10.1101/lm.046250.117

- I don’t read much in Psych Science, but here’s one paper that caught my eye that seems sensitive primarily to effect-size issues:

  - Hirsh-Pasek, K., Adamson, L. B., Bakeman, R., Owen, M. T., Golinkoff, R. M., Pace, A., Yust, P. K. S., & Suma, K. (2015). The contribution of early communication quality to low-income children’s language success. Psychological Science, 26, 1071–1083. https://doi.org/10.1177/0956797615581493

- Overall, where I see the most progress is with the continued rise of meta-analysis and/or large data sets that are pushing effect-size estimates to the forefront of the discussion. For example:

  - This recent paper examining screen time and mental health in teens. It uses a huge data set, so the question is not “is it significant” but “how strong could the relationship be”. They do a cool multiverse analysis, too.

    - Orben, A., & Przybylski, A. K. (2019). The association between adolescent well-being and digital technology use. Nature Human Behaviour. https://doi.org/10.1038/s41562-018-0506-1

  - Or the big discussion on Twitter on whether a significant finding of ego depletion of d = .10 means anything:

    - https://twitter.com/hardsci/status/970015349499465729

More from Bob

I just came across this interesting article in Psych Science:

Nave, G., Jung, W. H., Karlsson Linnér, R., Kable, J. W., and Koellinger, P. D. (2018), “Are Bigger Brains Smarter? Evidence From a Large-Scale Preregistered Study,” Psychological Science, 095679761880847. https://doi.org/10.1177/0956797618808470.

This paper has p values alongside confidence intervals.  But the sample size is enormous (13,000 brain scans) so basically everything is significant and the real focus is on the effect sizes. 

It strikes me that this would be a great paper for a debate about interpreting effect sizes. After controlling for other factors, the researchers find a relationship between fluid intelligence and total brain volume, but it is very weak: r = .19, 95% CI [.17, .22], with just 2% added variance in the regression analysis. The researchers describe this as “solid”. I think it would be interesting to have students debate: does this mean anything, and why or why not?

There are also some good measurement points to make in this paper–they compared their total brain volume measure to one extracted by the group that collected the scans and found r = .91… which seems astonishingly low for the exact same data set. If about 20% of the variance in this measure is error, it makes me wonder if a relationship accounting for 2% of the variance could be thought of as meaningful.
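As a back-of-envelope check on these numbers, squaring a correlation gives the proportion of variance shared between the two measures. A quick sketch (the labels are mine, and the 2% figure from the paper reflects added variance after controls, so it is smaller than the raw r-squared):

```python
# Squaring r gives the proportion of variance shared between two measures.
correlations = {
    "total brain volume vs. fluid intelligence": 0.19,
    "volume measure vs. independently extracted measure": 0.91,
}
for label, r in correlations.items():
    print(f"{label}: r = {r}, shared variance = {r ** 2:.1%}")

# r = .19 corresponds to about 3.6% shared variance before controls.
# r = .91 between two extractions of the same scans leaves roughly
# 17% of the measure's variance unexplained -- close to the
# "about 20%" figure discussed above.
```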

Oh, and the authors also break total brain volume down into white matter, gray matter, and CSF. They find the corrected correlations with fluid intelligence to be r = 0.13, r = 0.06, and r = 0.05, with all 3 being statistically significant. This, to me, shows how little statistical significance means. It also makes me even more worried about interpreting these weak relationships… I can’t think of any good reason why having more CSF would be associated with higher intelligence. I suspect that their corrections for confounding variables were not perfect (how could they be?) and that the r = .05 represents remaining bias in the analysis. If so, that means we should be subtracting an additional chunk of variance from the 2% estimate.

Oh yeah, and Figure 1 shows that their model doesn’t really fit very well to the extremes of brain volume. 

From Geoff

In this post, I note that an estimated 1% of a set of HCI research papers report CIs without apparent signs of NHST or dichotomous decision making, but I don’t have any citations.

So, between us (ha, almost all Bob), Bob and I can paint a somewhat encouraging picture of progress in adoption of the new statistics, while not being able to pinpoint many articles that don’t include p values at all. As I say, let us know of any you come across. Thanks!


Teaching The New Statistics: The Action’s in D.C.

The Academy Awards are out of the way, so we can focus on what’s really important: the APS Convention, May 23-26, 2019, in Washington D.C.

For the first time in many years I won’t be there, but new-statistics action continues at the top level. After Tamarah and Bob’s great success last year, APS has invited them back. They will give an update of their workshop on teaching the new statistics–and, of course, Open Science.

The Convention website is here. You can register here. Note that workshops require additional registration–which is not expensive, and even less so for students. Register by 15 April for early-bird rates.

The workshops site is here, with lots of juicy goodies. But Tamarah and Bob’s will undoubtedly be the highlight!


Statistical Cognition: An Invitation

Statistical Cognition (SC) is the study of how people understand–or, quite often, misunderstand–statistical concepts or presentations. Is it better to report results using numbers, or graphs? Are confidence intervals (CIs) appreciated better if shown as error bars in a graph or as numerical values?

And so on. These are all SC questions. For statistical practice to be evidence-based, we need answers to SC questions, and these should inform how we teach, discuss, and practise statistics. Of course.

An SC Experiment

This is a note to invite you–and any of your colleagues and students who may be interested–to participate in an interesting SC study. It is being run by Lonni Besançon and Jouni Helske. It’s an online survey that asks questions about CIs and other displays. It’s easy, takes around 15 minutes, and, as usual, is anonymous. To start, click here.

Feel free to pass this invitation on to anyone who might be interested. I suggest that we should all feel some obligation to encourage participation in SC research, because it has the potential to enhance research. Here’s to evidence-based practice! With cognitive evidence front and centre.

Statistical Cognition, Some Background

Ruth Beyth-Marom, Fiona Fidler, and I wrote about SC some years back. The full paper is here. The citation is:

Beyth-Marom, R., Fidler, F., & Cumming, G. (2008). Statistical cognition: Towards evidence-based practice in statistics and statistics education. Statistics Education Research Journal, 7, 20-39.

Some of Our SC Research

Here are a few examples:

Lai, J., Fidler, F., & Cumming, G. (2012). Subjective p intervals: Researchers underestimate the variability of p values over replication. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 8, 51-62. Abstract is here.

Coulson, M., Healey, M., Fidler, F., & Cumming, G. (2010). Confidence intervals permit, but do not guarantee, better inference than statistical significance testing. Frontiers in Quantitative Psychology and Measurement, 1:26. Full paper is here.

Belia, S., Fidler, F., Williams, J., & Cumming, G. (2005). Researchers misunderstand confidence intervals and standard error bars. Psychological Methods, 10, 389-396. Abstract is here.

I confess that all those studies date from pre-Open-Science times, so there was no preregistration, and little or no replication. Opportunity!


A Second Edition of ITNS? Here’s the Latest

Our first blog post about a possible second edition of ITNS is here. All the comments I made there, and the questions I asked, remain relevant. We’ve had some very useful feedback and suggestions, but we’d love more. You could even tell us about aspects of ITNS that you think work well. Thanks!

Meanwhile Routledge has been collecting survey responses from users, adopters, and potential adopters of ITNS. We are promised the collated responses in the next few weeks.

Back to that first post. Scroll down below the post to see some valuable comments, including two substantial contributions from Peter Baumgartner.

Peter has also been in touch with us to discuss various possibilities and, most notably, to send a link to some goodies he has sketched. One of Peter’s interests is effective didactical methods, and he has made positive comments about some of the teaching strategies and exercises in ITNS–especially those that prompt students to take initiative, explore, and discover. He’s keen to see development of these and so, as I mentioned, he has sketched some possibilities.

Peter’s Goodies

Peter’s sketches start here. Then he has a whole sequence of goodies that start here. He asks me to emphasise that he doesn’t claim to be an expert programmer and that he’s largely using tools (including shiny on top of R, and the R tutorial package learnr) that are rather new to him. (So one of the lessons for us is that it may (!) be comparatively easy to use such tools to build new components for ITNS.)

The figure below is from a Shiny app that is one of the last of Peter’s goodies, and shows how a stacked dot plot and a histogram can be linked together as the user changes the number of bins. ESCI has a version of this, and also allows clicking to display mean, median, and various other aspects of the data set. Peter emphasises that it should be comparatively easy to build such displays in Shiny, with the advantage of the full power and flexibility of R.

Thanks Peter! For sure, finding a good way to do more in R is a priority for us. Should we try to replace ESCI entirely? Retain ESCI for some simulations that are useful for learners, while transferring more of the data analysis to R? Is jamovi the way to go, perhaps alongside shiny apps? What’s the best way in R to build detailed, appealing, graphics–with onscreen user controls–such as ESCI attempts to provide?

Please, let’s have your thoughts, either below, or via email to Bob or me. Thanks!


Geoff g.cumming@latrobe.edu.au
Bob rcalinjageman@dom.edu

Play, Wonder, Empathy – Latest Educational Trends, Says The Open University

My long-time friend and colleague Mike Sharples told me about the recently released Innovating Pedagogy 2019 report from The Open University (U.K.). It’s the seventh in an annual series initiated by Mike. Each report aims to describe a number of promising trends in learning and teaching. There’s not much by way of formal evaluation of effectiveness and outcomes, but there are illuminating examples, and leads and links to resources to help adoption and further development.

The 2019 report describes 10 trends, as listed below. At this website you can click for brief summaries of any of the 10 that takes your fancy. There are also links to the previous six reports.

It strikes me that several of the 10 deserve thought, from the point of view of improving how we teach intro statistics. The one that immediately caught my eye was wonder.

I’ve always found randomness, and random sampling variability, to be a source of wonder. People typically don’t appreciate the wonder of randomness, nor do they appreciate that, in the short term, randomness is totally unpredictable and often surprising, even astonishing. In the long term, however, the laws of probability dictate that the observed proportions of particular outcomes will be very close to what we expect.

Prompted by the examples and brief discussion of wonder in the report, I now think of my years of work with the dances (of the means, of the CIs, of the p values, and more) as aiming to bring the wonder of randomness to students. Often we’ve discussed patterns, predictions, and the hopelessness of short-term prediction. We’ve compared the dances we see on screen–dancing before our eyes–with physical processes in the world that we might regard as random. (To see the dances, use ESCI, or go to YouTube and search for ‘dance of the p values’ and ‘significance roulette’.)
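If you’d like a quick taste of the dance without ESCI, here’s a minimal Python sketch. The parameters are my illustrative assumptions, not anything from ESCI: a true effect of 0.5 SD with n = 32 per group (power roughly .5), and p values from a normal approximation to the t distribution, which is reasonable at df = 62. The same experiment, replicated 25 times, gives wildly different p values.

```python
import math
import random

def two_sample_p(a, b):
    """Approximate two-sided p value for a two-sample t statistic.

    Uses a normal approximation to the t distribution, which is
    reasonable for the sample sizes used here (df = 62)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    t = (ma - mb) / math.sqrt(va / na + vb / nb)
    return math.erfc(abs(t) / math.sqrt(2))  # two-sided normal tail area

random.seed(1)
n, true_d = 32, 0.5          # per-group n; true effect in SD units (assumed)
ps = []
for _ in range(25):          # 25 replications of the identical experiment
    treat = [random.gauss(true_d, 1) for _ in range(n)]
    ctrl = [random.gauss(0, 1) for _ in range(n)]
    ps.append(two_sample_p(treat, ctrl))

print(f"min p = {min(ps):.4f}, max p = {max(ps):.2f}")
```

Run it a few times with different seeds and watch p bounce from ‘highly significant’ to nowhere near–that unpredictability is the whole point of the dance.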

I suggest it’s worth poking about in this latest report, and in the earlier reports, for trends that might spark your own thinking about statistics teaching and learning.


The ten 2019 trends:

Playful learning – Evoke creativity, imagination and happiness

Learning with robots – Use software assistants and robots as partners for conversation

Decolonising learning – Recognize, understand, and challenge the ways in which our world is shaped by colonialism

Drone-based learning – Develop new skills, including planning routes and interpreting visual clues in the landscape

Learning through wonder – Spark curiosity, investigation, and discovery

Action learning – Team-based professional development that addresses real and immediate problems

Virtual studios – Hubs of activity where learners develop creative processes together

Place-based learning – Look for learning opportunities within a local community, using the natural environment

Making thinking visible – Help students visualize their thinking and progress

Roots of empathy – Develop children’s social and emotional understanding

Sizing up behavioral neuroscience – a meta-analysis of the fear-conditioning literature

Inadequate sample sizes are kryptonite to good science–they produce waste, spurious results, and inflated effect sizes.  Doing science with an inadequate sample is worse than doing nothing. 

In the neurosciences, large-scale surveys of the literature show that inadequate sample sizes are pervasive in some subfields, but not in others (Button et al., 2013; Dumas-Mallet, Button, Boraud, Gonon, & Munafò, 2017; Szucs & Ioannidis, 2017).  This means we really need to do a case-by-case analysis of different fields/paradigms/assays. 

Along these lines, Carneiro et al. recently published a meta-analysis examining effect sizes and statistical power in research using fear conditioning (Carneiro, Moulin, Macleod, & Amaral, 2018). This is a really important analysis because fear conditioning has been an enormously useful paradigm for studying the nexus between learning, memory, and emotion. Study of this paradigm has shed light on the neural circuitry for processing fear memories and their context, on the fundamental mechanisms of neural plasticity, and more. Fear conditioning also turns out to be a great topic for meta-analysis because even though protocols can vary quite a bit (e.g. mice vs. rats!), the measurement is basically always expressed as a change in freezing behavior to the conditioned stimulus.

Carneiro et al. (2018) compiled 122 articles reporting 410 experiments with a fear-conditioning paradigm (yeah… it’s a popular technique, and this represents only the studies reported with enough detail for meta-analysis). They focused not on the basic fear-conditioning effect (which is large, I think), but on contrasts between control animals and animals given treatments thought to impair or enhance learning/memory.

On first glance, there seems to be some good news: typical effects reported seem large.  Specifically, the typical “statistically significant” enhancement effect is a 1.5 SD increase in freezing while a typical “statistically significant” impairment effect is a 1.5 SD decrease in freezing.  Those are not qualitative effects, but they are what most folks would consider to be fairly large, and thus fairly forgiving in terms of sample-size requirements.  

That sounds good, but there is a bit of thinking to do now. Not all the studies analyzed had high power, even for these large effect sizes. In fact, typical power was 75% relative to the typical statistically significant effect. That’s not terrible, but not amazing. It also means there is probably at least some effect-size inflation going on, so we need to ratchet down our estimate. That’s a bit hard to do, but one way is to take only the studies that are well-powered relative to the initial effect-size estimates and then calculate *their* typical effect size. This shrinks the estimated effects by about 20%. But *that* means typical power is actually only 65%. That’s getting a bit worrisome–in terms of waste and in terms of effect-size inflation. And it also means you might want to again ratchet down your estimated effect size… yikes. Also, though the authors don’t mention it, you do worry that in this field the very well-powered studies might be the ones where folks used run-and-check to hit a significance target (in my experience talking to folks at posters, this is *still* standard operating procedure in most labs).
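That inflation logic is easy to demonstrate by simulation. Here’s a minimal Python sketch (the true effect of 0.8 SD and n = 10 per group are my assumptions for illustration, not values from the paper): when power is modest, the subset of experiments that happen to cross the significance threshold systematically overestimates the true effect.

```python
import math
import random

def observed_d(a, b):
    """Cohen's d from two samples, using the pooled SD."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / sp

random.seed(2)
n, true_d = 10, 0.8            # assumed values, chosen to give modest power
sig = []
for _ in range(4000):          # 4000 simulated experiments
    treat = [random.gauss(true_d, 1) for _ in range(n)]
    ctrl = [random.gauss(0, 1) for _ in range(n)]
    d = observed_d(treat, ctrl)
    # For equal groups, |t| = |d| * sqrt(n/2); 2.10 is the approximate
    # two-sided 5% critical t for df = 18.
    if abs(d) * math.sqrt(n / 2) > 2.10:
        sig.append(d)

mean_sig = sum(sig) / len(sig)
print(f"true d = {true_d}, mean 'significant' d = {mean_sig:.2f}")
```

The ‘significant’ subset averages well above the true 0.8 SD, which is exactly why a literature filtered by significance needs its effect sizes ratcheted down.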

Based on these considerations, the authors estimate that you could achieve 80% power for what they estimate to be a plausible effect size with 15 animals/group.  That wouldn’t be impossible, but it’s a good bit more demanding than current practice–currently only 12% of published experiments are that size or larger. 
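As a rough plausibility check on that 15-animals-per-group figure, here’s a small Python simulation of two-sample t-test power at n = 15 for a few candidate effect sizes (the particular d values are my choices for illustration, not the authors’ estimate):

```python
import math
import random

def simulated_power(d, n, sims=3000, crit=2.048):
    """Estimate two-sided, two-sample t-test power by simulation.

    crit = 2.048 is the two-sided 5% critical t for df = 28,
    i.e. for n = 15 per group."""
    hits = 0
    for _ in range(sims):
        a = [random.gauss(d, 1) for _ in range(n)]
        b = [random.gauss(0, 1) for _ in range(n)]
        ma, mb = sum(a) / n, sum(b) / n
        va = sum((x - ma) ** 2 for x in a) / (n - 1)
        vb = sum((x - mb) ** 2 for x in b) / (n - 1)
        sp = math.sqrt((va + vb) / 2)      # pooled SD (equal n)
        t = (ma - mb) / (sp * math.sqrt(2 / n))
        if abs(t) > crit:
            hits += 1
    return hits / sims

random.seed(3)
for d in (0.8, 1.0, 1.2):
    print(f"d = {d}: power at n = 15/group is about {simulated_power(d, 15):.2f}")
```

In this sketch, 80% power at n = 15 corresponds to a true effect of a bit over 1 SD–in the same ballpark as the shrunken effect estimates discussed above.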

There are some other interesting observations in this paper–it’s well worth a read if you care about these sorts of things. The take-home I get is: a) this is without a doubt a useful assay that can reveal large enhancement and impairment effects, enabling fruitful mechanistic study, but b) the current literature has at least modestly inadequate sample sizes, meaning that a decent proportion of supposedly inactive treatments are probably active and a decent proportion of supposedly impactful treatments are probably negligible. Even in what is often seen as a pillar of neuroscience, a decent portion of the literature is probably dreck. That’s incredibly sad and also incredibly unproductive: trying to piece together how the brain works is hard enough; we don’t need to make it harder by salting in pieces from a different puzzle (not sure that metaphor really works, but I think you’ll know what I mean). Finally, doing good neuroscience is generally going to cost more in terms of animals, time, and resources than we’ve been willing to admit… but it’s better than doing cheap but unreliable work.




Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376. doi:10.1038/nrn3475
Carneiro, C. F. D., Moulin, T. C., Macleod, M. R., & Amaral, O. B. (2018). Effect size and statistical power in the rodent fear conditioning literature – A systematic review. PLOS ONE, 13(4), e0196258. doi:10.1371/journal.pone.0196258
Dumas-Mallet, E., Button, K., Boraud, T., Gonon, F., & Munafò, M. (2017). Low statistical power in biomedical science: a review of three human research domains. Royal Society Open Science, 4(2), 160254. doi:10.1098/rsos.160254
Szucs, D., & Ioannidis, J. P. A. (2017). Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLOS Biology, 15(3), e2000797. doi:10.1371/journal.pbio.2000797

Sadly, Dichotomous Thinking Persists in HCI Research

A few words about the latest from Pierre Dragicevic. He’s an HCI researcher in Paris who totally gets the need for the new statistics. I’ve written about his work before, here and here. Now, with colleague Lonni Besançon, he reports a study of how HCI researchers have reported statistical inference over the period 2010 – 2018. It’s a discouraging picture, but with glimmers of hope.

The study is:

Besançon, L., & Dragicevic, P. (2019). The continued prevalence of dichotomous inferences at CHI. It’s here.

Lonni and Pierre took the 4234 articles in the CHI conference proceedings from 2010 to 2018–these proceedings are one of the top outlets for HCI research. They used clever software to scan the text of all the articles, searching for words and symbols indicating how statistical inference was carried out and reported.

I recall the many journal studies that I carried out with students, 10 to 20 years ago. We did it all ‘manually’: we scanned articles and filled in complicated coding sheets as we found signs of our target statistical or reporting practices. Analysing a couple of hundred articles was a huge task, as we trained coders, double-coded, and checked for coding consistency. Computer analysis of text has its limitations, but a sample size of 4,000+ articles is impressive!
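For flavour, the kind of scan Lonni and Pierre ran can be sketched in a few lines of Python. These patterns are my illustrative guesses at what such a scanner might look for; their actual software and coding rules are more sophisticated and may well differ.

```python
import re

# Illustrative patterns only; the real scanner's rules may differ.
P_VALUE = re.compile(r"\bp\s*[<>=]\s*0?\.\d+", re.IGNORECASE)   # any p value
EXACT_P = re.compile(r"\bp\s*=\s*0?\.\d+", re.IGNORECASE)       # exact p value
SIG_LANG = re.compile(r"\b(?:statistically\s+)?significant\b", re.IGNORECASE)
CI_LANG = re.compile(r"\b(?:confidence\s+interval|\d+%\s*CI)\b", re.IGNORECASE)

def classify(text):
    """Flag which inference-reporting markers appear in an article's text."""
    return {
        "p_value": bool(P_VALUE.search(text)),
        "exact_p": bool(EXACT_P.search(text)),
        "sig_language": bool(SIG_LANG.search(text)),
        "ci": bool(CI_LANG.search(text)),
    }

sample = "The difference was significant, t(28) = 2.31, p = .028, 95% CI [0.2, 1.4]."
print(classify(sample))
```

Running `classify` over the full text of each article, then tallying the flags per paper per year, is essentially the shape of the analysis–just automated over thousands of PDFs rather than hand-coded sheets.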

Here’s a pic summarising how NHST appeared:

About 50% of papers reported p values and/or included language suggesting interpretation in terms of statistical significance. This percentage was pretty much constant from 2010 to 2018, with only a small tendency toward increased reporting of exact, rather than relative, p values over time. Sadly, dichotomous decision making seems just as prevalent in HCI research now as a decade ago. 🙁

If you are wondering why only 50% of papers, note that in HCI many papers are user studies, with 1 or very few users providing data. Qualitative methods and descriptive statistics are common. The 50% is probably pretty much all papers that reported statistical inference.

What about CIs? Here’s the picture:

Comparatively few papers reported CIs, and, of those, a big majority also reported p values and/or used significance language. Only about 1% of papers (40 of 4234) reported CIs without p values or any mention of statistical significance. The encouraging finding, however, was that the proportion of papers reporting CIs increased from around 6% in 2010 to 15% in 2018. Yay! But still a long way to go.

For years, Pierre has been advocating change in statistical practices in HCI research–he can probably take credit for some of the improvement in CI reporting. But, somehow, deeply entrenched tradition and the seductive lure of saying ‘statistically significant’ and concluding (alas!) ‘big, important, publishable, next grant, maybe Nobel Prize!’ persists. In HCI as in many other disciplines.

Roll on, new statistics and Open Science!