1) No need for all that tortured, nonintuitive, normal/SD-dependent tradition to measure distance from the test hypothesis: just measure the information against the test hypothesis supplied by its P-value p by converting it to the Shannon information (now with over 60 years of history as “surprisal”, “logworth”, and other names including “S-value”): s = -log(p). Unlike the P-value, the S-value is additive across independent tests (as Fisher exploited), equal-interval scaled, and unbounded above, so it is hard to confuse with a posterior probability; and when using base-2 logs it has an immediate translation into a coin-tossing experiment, e.g., p of 0.03 is s = -log2(0.03) ≈ 5 bits of information against the hypothesis, which is the same amount of information as 5 heads in a row supplies against fairness of a coin-tossing set-up. The 1-sided 5-sigma physics criterion becomes about 22 bits, or 22 heads in a row. And so on.
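As a quick numerical check of the figures quoted above, here is a minimal sketch (using only the Python standard library, with the 5-sigma P-value obtained from the normal upper tail via erfc):

```python
import math

def s_value(p):
    """Bits of information against the test hypothesis supplied by P-value p."""
    return -math.log2(p)

print(round(s_value(0.03), 1))   # ~5 bits, like 5 heads in a row from a fair coin

# One-sided 5-sigma criterion used in physics:
p_5sigma = 0.5 * math.erfc(5 / math.sqrt(2))  # upper-tail normal probability, ~2.9e-7
print(round(s_value(p_5sigma)))  # ~22 bits
```

The additivity claim follows directly: for independent tests, the product of P-values maps to the sum of S-values, since log2(1/(p1*p2)) = log2(1/p1) + log2(1/p2).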

Yes, what I am saying is that The New Statistics is already old and in need of an update – you should read my 2019 TAS-supplement paper and update your book accordingly:

Greenland, S. (2019). Some misleading criticisms of P-values and their resolution with S-values. The American Statistician, 73(sup1), 106–114. Open access at

http://www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1529625

2) OK, so we need a new term to refer to the value asserted by H0 and used to calculate the p value. Maybe ‘reference value for p’, or ‘H0-assumed value’? I think it was Bruce Thompson who used the term ‘non-nil null’, which I suspect you would label a contradiction.

1) The longer the distance, the stronger the evidence, of course. If MoE (margin of error) is the half-length of the CI (assumed here for simplicity to be symmetric), then an H0-assumed value that’s one-third of MoE beyond the end of the 95% CI gives approx p = .01, and two-thirds gives approx p = .001. We could no doubt work out the corresponding LR values. (LR is approx 7 for the point estimate vs. an end of the 95% CI, so LR increases from 7 as we move further from the CI.) In summary, strength of evidence increases fairly quickly as we move away from the 95% CI. The 5-sigma, etc., standard represents very, very strong evidence. But once we move much beyond, say, one MoE from an end of the CI (i.e. 4-sigma), our usual model probably is not a good guide. In practice the uncertainty due to sampling variability (as accounted for by that model) is probably overshadowed by bias or other problems not captured by that model. So in most cases we’re kidding ourselves if we report exact p values below, say, .001. (Accordingly, the APA Publication Manual recommends reporting exact, rather than relative, p values, except that p < .001 is preferred to any smaller exact value of p.)

3) A fair point that ratio measures are harder to represent well and think about clearly. Squared measures similarly.

4) Fair point. In UTNS I described my version as ‘the CI-function’ and marked the vertical axis also with corresponding p values.
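The MoE approximations in point 1) can be checked numerically under a normal sampling model. Since MoE = 1.96 SE, a null value lying k MoE beyond an interval end sits (1 + k) × 1.96 standard errors from the point estimate (a sketch, not from the original text):

```python
import math

def two_sided_p(z):
    """Two-sided P-value for a standard-normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))

for k in (0, 1/3, 2/3):
    z = (1 + k) * 1.96
    print(f"{k:.2f} MoE beyond the CI end: p = {two_sided_p(z):.4f}")
# k = 0 gives p ≈ .05 (the CI end itself); k = 1/3 gives p ≈ .009 (~.01);
# k = 2/3 gives p ≈ .001, matching the heuristics above.
```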

1) What does “some little distance” mean? The CI has to be pretty far from a parameter value to provide “strong” evidence in any sense I can think of. E.g., the 5-sigma requirement in physics corresponds to falling farther from the interval than the interval limits are from the center!

2) Please – the only correct English use of “null” is for no difference, effect, or association; check your dictionary. One of the many ways Fisher screwed stats was by misusing “null” for any tested value, just as Neyman screwed stats by calling CIs “confidence” intervals – a use which Arthur Bowley called a “confidence trick” in 1934. These abuses of English are every bit as misleading as “significance” for P<0.05.

3) I like cat's eye graphs, but I don't trust most readers to have an accurate mind's eye – especially when looking at ratio measures.

4) Nitpicking, but “Poole's P-value function”? Please, no: as Poole notes, the P-value function is not his idea – it goes back at least to Birnbaum 1961. I just think Poole's 1987 exposition (actually in two articles in the Am J Public Health that year) is the clearest and most compelling to date.

Finally, just to emphasize: If I am seriously focusing on a single association, all I would need to see is the P-value function, since all CIs and P-values can be read off that. But given that's asking for a bit much, I want to see the main results from it as given by a CI and P-values. And then I also want to see at least a model-fit P-value or some diagnostics for the model used to create those association-focused statistics (or at least have some assurance the analyst checked the model before giving us the focal results). So in my book the P-value remains a central concept of frequentist analyses.

I fully agree that “nullistic conventions… need to be challenged and broken”. I agree that much of current practice needs drastic improvement, in relation to CIs as well as p values and other techniques. I agree that, if using p values, it can often be valuable and revealing to calculate them for more than one value of the null. I agree that p values around .05, corresponding to null values near an end of a 95% CI, provide only weak evidence against those null values. Yes, in typical situations, a 95% CI provides strong evidence only against null values at least some little distance from the interval.

However, I still contend that a CI is more likely to prove effective as a basis for good understanding and interpretation than one or more p values. (Or than one or more single values, each some transformation of a p value.) Yes, “p values can be calculated across the entire relevant spectrum of parameter values”. In UTNS, p. 105, I included a version of Poole’s p value function that illustrates how the p value varies across and beyond a CI. Also, in Chapter 6 of ITNS we explain how a CI can be used to eyeball the p value for any value of the null that is of interest, anywhere across or beyond the interval. A CI, especially when supplemented (either in the graph, or in the reader’s mind’s eye) with the cat’s eye figure, indicates how the relative strength of evidence against any null of interest varies as that null takes any chosen value across and beyond the interval.

We emphasise in ITNS, and in our TAS article, that an essential part of interpreting any CI is to pay attention to the full extent of the interval. So, for your example CI of [0.997, 2.59], we would want any reader to consider, in particular, the meaning in the research context of each of those interval endpoints. Yes, this is not always done, but it should be, and providing the CI is a good first step to enabling and encouraging that.

Geoff

P-values can be calculated across the entire relevant spectrum of parameter values to visualize the P-value as a function of the tested parameter (Birnbaum 1961; Poole 1987; Modern Epidemiology 2008 Ch. 10, see p. 158-163). Even just one alternative P-value besides the null can provide a drastically altered perspective on the results, making more difficult the kind of dichotomous treatment that confidence intervals leave unchanged.

For example Brown et al. JAMA 2017 tried to pass off a hazard-ratio CI of (0.997, 2.59) as supporting the null. I can’t help thinking how much more difficult it would have been for the authors to present this false conclusion if they had been forced to give the P-values of ~0.78 for HR=1.5 and ~0.37 for HR=2 alongside the null (HR=1) P-value of 0.051.
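The P-values quoted above can be recovered from the reported interval alone. A minimal sketch, assuming (as is standard for hazard ratios) that the 95% CI was computed as estimate ± 1.96 SE on the log scale:

```python
import math

lo, hi = 0.997, 2.59                              # reported 95% CI for the HR
est = (math.log(lo) + math.log(hi)) / 2           # log-HR point estimate
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)   # SE on the log scale

def p_for(hr):
    """Two-sided P-value for the hypothesis that the true HR equals hr."""
    z = (est - math.log(hr)) / se
    return math.erfc(abs(z) / math.sqrt(2))

for hr in (1.0, 1.5, 2.0):
    print(f"HR = {hr}: p ≈ {p_for(hr):.2f}")
# HR = 1 gives p ≈ 0.05, HR = 1.5 gives p ≈ 0.78, HR = 2 gives p ≈ 0.37,
# matching the values quoted in the text.
```

Seeing p ≈ 0.78 for HR = 1.5 next to p ≈ 0.05 for HR = 1 makes plain that the data are far more compatible with a 50% hazard increase than with no effect.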

So based on my reading of the med literature and the fact that a CI forces a dichotomy on the viewer, I think simply replacing P-values with CIs perpetuates the dichotomania problem and does not effectively counter nullistic bias. Presenting multiple P-values for different contextually relevant parameter values (e.g., HR=1, 1.5, 2) as well as CI does address both problems head on.

Even better would be to convert those multiple P-values into S-values (surprisals, Shannon information), s = log2(1/p), to show how weak the evidence against a parameter value is when it falls near the 95% CI, and to present CIs not as “confidence” intervals but as ranges of high compatibility between parameter values and the data under the model used to generate the results. Conversion to an information scale would help avoid confusion of frequentist P-values and CIs with posterior probabilities and intervals.

Yes, CIs can be, and alas often are, interpreted merely in terms of ‘includes’ or ‘does not include’ the null. This impoverished dichotomous interpretation ignores much of the useful information that a CI provides.

However, I can’t see any evidence or convincing argument that it is interval estimates that perpetuate dichotomous thinking. Far more plausible, I suggest, is that NHST and the way p values and sharp cutoffs are customarily used are major reasons that dichotomous thinking and dichotomous decision making remain so prominent in statistical inference.

Conventional CIs and p values are, usually, based on the same theory, so it is not surprising that, if we make the usual statistical model assumptions, either can be converted into the other. In Chapter 6 of ITNS we give some simple heuristics that guide the eyeballing of an approximate p value, given a CI, and others that guide the approximate eyeballing of a CI, given a p value and the point estimate. The latter is probably the best way to interpret a p value: convert it (plus knowledge of the point estimate) into a CI, which makes the degree of uncertainty salient.
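The p-value-to-CI direction of conversion can be made exact under the usual assumptions. A sketch, assuming a normal model and a null value of 0 (the estimate value 0.5 and p = .03 below are purely illustrative):

```python
from statistics import NormalDist

def ci_from_p(est, p, null=0.0, level=0.95):
    """Recover an approximate CI from a two-sided p value and point estimate."""
    z_p = NormalDist().inv_cdf(1 - p / 2)          # |z| implied by the p value
    se = abs(est - null) / z_p                     # back out the standard error
    z_ci = NormalDist().inv_cdf((1 + level) / 2)   # 1.96 for a 95% CI
    return est - z_ci * se, est + z_ci * se

lo, hi = ci_from_p(est=0.5, p=0.03)
print(f"95% CI ≈ [{lo:.2f}, {hi:.2f}]")  # excludes 0, consistent with p = .03 < .05
```

The interval's width, not just whether it excludes the null, is what the conversion makes salient: a p value of .03 from a noisy estimate yields a CI whose lower limit only barely clears zero.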

Yes, there is sampling variability in CIs, but the extent of the single CI calculated from our data usually gives a good indication of the extent of that variability. In stark contrast, the p value calculated from our data is a single number that gives no indication of its underlying sampling variability. An exact replication is likely to give a considerably different p value. The same holds for any single value that is a transformation of that p value.
