In addition to the consulting projects I work on, I have been teaching an introductory statistics course online for a number of years. My teaching assistant this semester pointed out a seeming inconsistency that no one had ever raised before. Two widely-used methods in R for calculating the p-value for a two-sided binomial hypothesis test give different answers. Which one is correct?
Let’s suppose the rate of asthma in some population is 14 per 1,000, for a probability of .014. We have a cohort of 500 people with 10 cases, for a rate of 20 per 1,000. Is our cohort different from the population?
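As a quick sanity check (plain arithmetic, not part of the original question), the expected count under the population rate sits below the observed count:

```r
# Expected number of cases in a cohort of 500 at the population rate
500 * 0.014  # 7 expected cases, versus the 10 we observed
```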
For starters, let’s ask whether the rate in our cohort is greater than in the population, making for a one-sided test.
We can use the pbinom() function or the binom.test() function and get the same result. In the case of pbinom(), we find the probability of seeing 9 or fewer positive cases and subtract this from 1.

1 - pbinom(9, 500, .014)
binom.test(10, 500, .014, alternative = "greater")
# p = .1681 in both cases
But that is not how the original question was stated. We want to know whether our cohort is different, and so we have to consider the possibility that it was different in both directions – either unusually high, or unusually low. We want to know the probability of seeing a result at least as unusual as 10.
Plenty of sources tell us that we should simply double the p-value calculated originally:
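A minimal sketch of that doubling rule, reusing the one-sided value from above:

```r
# Double the one-sided p-value (.1681) to get a two-sided p-value
2 * (1 - pbinom(9, 500, .014))  # p = .336
```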
If we try using binom.test(), however, we get a different answer; clearly binom.test() is doing something different.
It helps to look at the distribution of outcomes.
library(ggplot2)
cases <- 0:17
probs <- dbinom(cases, size = 500, prob = 0.014)
example <- data.frame(cases, probs)
ggplot(example, aes(x = cases, y = probs)) +
  geom_col() +
  ylab("Probability")
We see that the distribution is not symmetrical. There is no value corresponding to 10 on the left side of the curve. When we double the p-value originally obtained, we are saying, in effect, “what is the probability of seeing 10 or more cases, or (roughly) 3.5 or fewer cases?” But 3.5 is nonsensical in this context.
In fact, what binom.test() is doing is rounding down to 3, and what we are seeing is the sum of the probability of 10 or above and the probability of 3 or below.

binom.test(10, 500, .014, alternative = "two.sided")
pbinom(3, 500, .014) + (1 - pbinom(9, 500, .014))
# p = .2484 in both cases
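One way to see why binom.test() stops at 3 rather than 4: for the two-sided test it sums the probabilities of all outcomes no more likely than the one observed, and 4 cases turns out to be a more likely outcome than 10.

```r
# binom.test() includes an outcome in the two-sided p-value only if its
# probability does not exceed that of the observed count (10 cases)
dbinom(3, 500, .014)   # less likely than 10 cases: included
dbinom(10, 500, .014)  # the observed outcome
dbinom(4, 500, .014)   # more likely than 10 cases: excluded
```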
I think three different answers here are defensible:
- Use the double-the-p-value rule. While in our example there is no value corresponding to the left half of the curve, as the sample gets larger this problem becomes less important; what we end up with is rather like a rounding error. With a p-value of .336, it is the middle value of the three.
- We are looking for the probability of results at least as extreme as observing 10 cases. 4 cases is not as extreme as 10 – it is a more likely outcome – so we drop down to 3. With a p-value of .248, this is the most liberal answer.
- We recognize that, while 4 is not as extreme as 10, it is very close – the asymmetry is slight – so we include 4 in the left tail anyway. This matches what many would find intuitive. With a p-value of .339, it is the most conservative answer.
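Putting the three candidates side by side (the same calculations described above, collected for comparison):

```r
# Three defensible two-sided p-values for 10 cases in 500
pbinom(3, 500, .014) + (1 - pbinom(9, 500, .014))  # binom.test(): .248
2 * (1 - pbinom(9, 500, .014))                     # doubling rule: .336
pbinom(4, 500, .014) + (1 - pbinom(9, 500, .014))  # include 4 too: .339
```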
Just to illustrate what happens with a larger sample, let’s increase the sample size to 5000 and assume we found 78 cases, so that the p-values are similar.
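Method 1, the doubling rule, carries over directly (a sketch assuming the same calculation as before; the output is not quoted here):

```r
# Method 1: double the one-sided p-value for 78 or more cases in 5000
2 * (1 - pbinom(77, 5000, .014))
```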
Method 2 (calculated two ways):

binom.test(78, 5000, .014, alternative = "two.sided")
pbinom(61, 5000, .014) + (1 - pbinom(77, 5000, .014))
# p = .3350 in both cases

Method 3:

pbinom(62, 5000, .014) + (1 - pbinom(77, 5000, .014))
# p = .3665
The values are all closer together, but their relative order is the same.
Whether an exact p-value in this context is truly useful is an entirely different question!