The Shape of a Confidence Interval: Cat’s Eye or Plausibility Picture, and What About Cliff?
- Curves picture how likelihood varies across and beyond a CI. Which is better: one curve (the plausibility picture) or two (the cat’s eye)? Which should we use in ITNS2?
- Curves can discourage dichotomous decision making based on a belief that there’s a cliff in strength of evidence at each limit of a 95% CI, i.e. at p = .05.
- Explanation and familiarity are probably needed for curves on a CI to encourage estimation, rather than mere dichotomous interpretation.
Variation Across and Beyond a CI
A CI is most likely to land so that some point near the centre is at the unknown but fixed μ we wish to estimate. Less likely is that a point towards a limit is at μ. Of course there’s a 5% chance that a 95% CI lands so that μ is outside the interval. This pattern of relative likelihood is illustrated in Figure 1.2 from ITNS:
The curve illustrates the relative plausibility that various values along the axis are μ. The higher the curve, the better the bet that μ lies there. Keep in mind that our interval is one from the dance, and that it’s the interval that varies over replication, while μ is assumed fixed but unknown. This single curve on a CI is the plausibility picture.
In this paper back in 2007 I played around with the black and white images at left, among others, as ways to picture how plausibility varies over and beyond the interval. The black bulge became the cat’s eye picture of a CI, as illustrated by the blue images in Figure 5.9 (below) from ITNS.
The 95% interval, in Figure 1.2 (above, at the top) and the middle of the blue figures, extends to include 95% of the area under the curve, or between the two curves. Similarly for the 99% and 80% CIs.
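For a mean with a normal plausibility curve, “95% of the area” corresponds to extending z standard errors either side of the sample mean, where z depends on the confidence level. A small sketch (my illustration, not from the post) using the standard normal CDF via `math.erf`:

```python
import math

# Sketch: the central area under a standard normal curve between -z and z.
# A CI of level C extends z * SE either side of the mean, where the
# central area for that z equals C.
def central_area(z):
    """Area under the standard normal curve between -z and +z."""
    return math.erf(z / math.sqrt(2))

# Familiar z values for the 80%, 95%, and 99% intervals in the figures.
for level, z in [(0.80, 1.282), (0.95, 1.960), (0.99, 2.576)]:
    print(level, round(central_area(z), 3))
```

So the 99% CI is wider and the 80% CI narrower than the 95% CI on the same data, just as the blue figures show.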
I don’t say that every graph with CIs needs to picture the cat’s eye, but do suggest that students and researchers would benefit from familiarity with the idea of plausibility changing smoothly across and beyond a CI. See any CI and, in your mind’s eye, see the cat’s eye bulge.
To what extent do researchers and students appreciate that pattern of variation across a CI? Pav Kalinowski and Jerry Lai, who worked with me years ago, investigated this question. This blog post (The Beautiful Face of a Confidence Interval: The Cat’s Eye Picture) describes their findings, with a link to the published results. In short, people’s intuitions are mostly inaccurate and highly diverse, but a bit of training and familiarity with the cat’s eye is encouragingly effective in improving their intuitions about the variation in plausibility. These results prompted use of the cat’s eye in UTNS (pp. 95-102) and ITNS.
More recently, the single curve, rather than the two mirror-image curves, has found favour: the plausibility picture, rather than the cat’s eye. Below is an example that Bob made in R, which appears in our eNeuro paper (Bob’s blog post is here). The CIs are 90% CIs.
The plausibility picture is shown here only on the CI on the difference, not on the two CIs to the left. It may be better to restrict the shading under the curve to the extent of the CI, and perhaps the curve could be half the height, so as not to be so visually dominant. Perhaps. Below is a variation on the same idea, from this preprint that Bob recently tweeted about.
The lower figure plots the mean and 95% CI, with plausibility picture, for the differences between the three rightmost conditions and the WT-EGFP condition at left. The dabestr package was used to make the figure. (The authors generously describe such a figure as a ‘Cumming estimation plot’. I’m happy if UTNS and ITNS have popularised the use of a difference axis to picture a difference with its CI, but I later discovered that the idea goes back a while. The earliest examples I know of are in this 1986 BMJ article by Martin Gardner and Doug Altman, which includes two figures showing a difference with its CI on a difference axis, without any curve on that CI.) Let us know of any earlier examples.
One or Two Curves? Plausibility Picture or Cat’s Eye?
Yes, plausibility picture or cat’s eye? I don’t know of any empirical study of which is more readily understood, or more effective in carrying the message of variation over the extent of a CI and beyond. There’s probably not much in it, so it comes down to a matter of taste. I’m sentimentally attached to the cat’s eye, but admit that the single curve is more visually parsimonious. Simplest would be a single fine line depicting the curve, with no shading. It would need to extend beyond both ends of the CI, but perhaps not by much. Perhaps such a curve is as good as anything. It would be great to have some evidence relevant to these questions. Meanwhile, I’d love to hear your views.
A Cliff at p=.05? At the End of a CI?
If a 95% CI is used merely to note whether or not the interval includes the null hypothesised value, we’re throwing away much of the information it offers and descending to mere dichotomous decision making. Undermining such ideas was one of my main motivations for playing with curves on a CI. In fact, no sharp change occurs exactly at a limit of a CI, as curves should make clear. Dropping the little crossbars at the ends of the CI graphic (UTNS included crossbars; ITNS does not) was another attempt to de-emphasise CI limits.
To what extent do people think that a result falling just below p=.05 rather than just above it makes a difference? What about just inside or just outside a CI? Back in 1963 Rosenthal and Gaito, in this article (image below), asked psychology faculty and graduate students about their degree of confidence in a result, on a scale from 0 to 5, for various different p values. They identified a relatively steep drop in confidence either side of .05, and described this result as a cliff.
Here are their averaged results, degree of confidence plotted against p value:
Yes, the steepest part of the curves looks to be from .05 to .10, and results for the graduate students (top 3 curves) show a kink at .05, but the cliff is hardly precipitous. If we were not so indoctrinated about .05, perhaps we’d see these curves as suggesting a relatively steady drop, rather than a sudden cliff. I wonder whether any tendency towards a cliff has increased since 1963.
Jerry Lai conducted an online version of this study, with published researchers from psychology and medicine as respondents. One version of his task asked about p values, another about CI figures, which showed intervals overlapping zero to varying extents, corresponding to the various p values. His results are summarised here, along with brief mention of other similar studies since 1963. He found a diversity of shapes of curves: only a few showed a steep cliff, many showed a weak cliff, as in the R & G average results above, some a more-or-less linear decline, and others some other shape. Psychology and medical researchers gave a similar diversity of curves. Results for CIs showed, if anything, more evidence of a cliff than did those for p values. Alas!
Jerry’s chosen title for his article was: Dichotomous Thinking: A Problem Beyond NHST. In other words, CIs can easily be used merely to carry out NHST. A 2010 article from my group titled Confidence intervals permit, but do not guarantee, better inference than statistical significance testing reported evidence that researchers in psychology, behavioural neuroscience, and medicine tended to make much better interpretations of results shown with CIs if they avoided thinking about the CIs in terms of NHST.
A remarkable Open Science story
Jerry wondered how the curves for R & G’s individual participants might have varied from the average curves in the figure above. He wrote a very polite letter to Prof Rosenthal. By return of post came a charming and encouraging note to Jerry, enclosing several photocopied sheets of handwritten notes, which neatly set out full details of the experiment and the data for individuals. Yes, there was quite a diversity of curve shapes. Some 50 years on, the original data were still available! Bob Rosenthal was putting to shame many subsequent researchers who could not maintain data beyond the life of a particular computer and/or were not willing to share it with other researchers.
What About a Violin Plot?
I was delighted to see this preprint:
Bob tweeted about it a couple of weeks ago. It reports the results of online statistical cognition surveys. A blog post of ours a year ago, here, invited participation.
The authors asked participants to rate their confidence that an effect was non-zero, given CI figures corresponding to p values ranging from .001 through .04, .05, and .06, and up to .8. The figures included the standard 95% CI graphic, with little crossbars at the ends, and a violin plot as at left. The researchers found that they needed to give some explanation of the violin plot, especially considering that a violin plot usually represents the spread of data points, rather than a CI: it’s usually a descriptive picture, rather than an inferential one as used here. I suspect that would have been clearer if the violin plot had included the standard CI graphic, as the cat’s eye and plausibility picture do.
Overall, there was a small-to-moderate cliff effect between the .04 and .06 figures. The cliff was rather smaller for the violin plot than for the standard CI graphic.
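One simple way to quantify such a cliff is to compare the confidence drop across the .04-to-.06 boundary with the average drop between neighbouring p values elsewhere on the scale. Below is a hedged sketch; the ratings are invented for illustration and are not data from any of the surveys mentioned:

```python
# Hypothetical sketch: made-up mean confidence ratings (0-5 scale) at a
# range of p values, with a deliberate dip built in across .05. The "cliff
# ratio" compares the .04 -> .06 drop with the mean of the other drops; a
# ratio near 1 means a steady decline, a large ratio means a cliff.
p_values   = [0.01, 0.02, 0.04, 0.06, 0.08, 0.10]
confidence = [4.6,  4.4,  4.1,  3.2,  3.0,  2.9]   # invented ratings

drops = [confidence[i] - confidence[i + 1] for i in range(len(confidence) - 1)]
boundary_drop = drops[2]                 # the .04 -> .06 step, straddling .05
other_mean = (sum(drops) - boundary_drop) / (len(drops) - 1)
cliff_ratio = boundary_drop / other_mean
print(round(boundary_drop, 2), round(cliff_ratio, 2))
```

With these invented numbers the boundary drop is several times the typical step, a clear cliff; the diverse individual curves Jerry found would give ratios ranging from near 1 (steady decline) to much larger.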
- The studies mentioned above don’t give us strong or definitive conclusions; we need replications.
- Curves probably help CI interpretation, especially by discouraging mere dichotomous decision making.
- The plausibility picture, perhaps without shading, may be the simplest and most parsimonious choice.
- Some training and familiarity with any picture that includes one or more curves may be needed for full effectiveness.
- There’s lots of scope for valuable empirical studies, perhaps especially of the plausibility picture.
A final question: ‘plausibility picture’, ‘plausibility curve’, ‘likelihood curve’, ‘relative likelihood curve’, or what? What’s your preference and why?
I’d love to have comments on these issues, and, especially, suggestions for our CI strategies in ITNS2, the second edition we’re currently working on.
Thanks Keith. Another reason why, for CIs as well as more generally, simpler may be better. Yes, I’d certainly like to see empirical investigation of the various pictures. Any budding statistical cognitivists out there?
In my Oxford DPhil thesis I initially used what were called raindrop plots (cat’s eyes here) to display profile likelihoods. When I almost missed an important departure from unimodality (a rise after a decrease) I immediately stopped using them and went with standard curve plots. It seemed to be something about the reflection making the multi-modality less noticeable. Something to test?
Does including all the raw data on the plot help with your concerns? (example here: https://www.dropbox.com/s/aeiz6vmpumre2rm/cat_eye_with_data.png?dl=1).
Feels like, in general, showing the raw data is almost always the way to go for just these types of reasons.
Sometimes. In my thesis the parameters were treatment effects based on two groups or more complicated data – so not likely.
On the other hand, the reason we proposed the L’Abbé plot for meta-analysis in 1987 was that it is a raw data plot, in contrast to effect plots like the forest plot.
Thanks, I agree. That all sounds reasonable to me. ‘Likelihood’ certainly does have a precise technical meaning, so I guess we should respect that, even if the aim of the curves is to give an eyeballing guide, to help build reasonable intuitions, rather than something to be interpreted to three decimal places. ‘Plausibility picture’? ‘Plausibility curve’?
Between those, I think I like “plausibility picture” more. I think it nicely conveys that these are meant as heuristic eyeball tools.
I like the terminology of the plausibility curve. We should be cautious to only use “likelihood” when we are actually using a likelihood to compute the interval/curve. A Wald-type interval follows a likelihood curve for a mean difference, but less well for a standardized mean difference, and very poorly for r-squared. (Granted, we should probably generally be using likelihood based intervals anyway.)