A tale of power and sample size calculation

Here is a shortened version of a lecture I gave in my introductory statistics class the other day, which I thought could be of interest to a wider audience.

My first job after college was working for a company that cleaned up hazardous waste sites. One day I was asked to calculate how many soil tests we would need to do in order to be confident that the average chromium concentration in the soil was below some regulatory threshold. The tests were expensive, so of course they wanted to do the minimum possible. I said you couldn’t know that until you had done some tests. Start with 5 or maybe 10 tests, I said, and based on that data you could then project whether you’d be looking at 50 or 100 or however many more, but you couldn’t choose a number based on no evidence whatsoever. Perhaps you could look at similar projects for guidance, but I was not comfortable with that – what did a site in California have to do with a site in New Jersey? I really dug in my heels on this one – I had been asked to do some ethically questionable things before this, but this was the one time I wasn’t just going to give them the number they wanted. We billed enough time arguing this point that it probably would have paid for a few tests.

What made me think of this story is when I came across this sentence in a statistics textbook: “The calculation of power is used to plan a study, usually before any data have been obtained, except possibly from a small preliminary study called a pilot study. Also, we usually make a projection concerning the standard deviation without actually having any data to estimate it” (emphasis mine).

Perhaps I was wrong, then! My 22-year-old self was known to be wrong on occasion. But I still don’t see how it would have worked. Regardless, it was a question of power: both statistical power and my boss’s power over me. Power is the flip side of the usual kind of statistical problem, where we are concerned with whether a measured effect is large enough to be important. That is, we normally are focused on minimizing false positives (also known as Type I error). We don’t want to be saying, “yes, it appears that this drug makes a difference” when in fact it makes no difference.

Power is concerned with limiting the risk of missing something important. It’s about minimizing false negatives (also known as Type II error). We don’t want to be saying, “this drug doesn’t do anything” when in fact it is the cure we’ve been seeking. Since, on the whole, Type II error is felt to be less concerning than Type I error, the standard used for Type II error is lower: 80% versus 95%. There is nothing magical about 80%, that’s just the usual number that has been settled upon by the scientific community.
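
To make “80% power” concrete: it is the long-run fraction of studies that would detect a real effect if one exists. Here is a minimal simulation sketch in R – the numbers (two groups of 25, a true difference of 0.8 standard deviations) are made up purely for illustration:

# Simulate many studies and count how often a t-test detects
# a true difference of 0.8 SD with 25 subjects per group
set.seed(42)
detected <- replicate(10000, {
  control <- rnorm(25, mean = 0, sd = 1)
  treated <- rnorm(25, mean = 0.8, sd = 1)
  t.test(control, treated)$p.value < 0.05
})
mean(detected)  # roughly 0.79 – about 80% power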

Power calculations are used to determine the minimum sample size needed to have 80% power to detect a particular effect. Since samples cost money, if you aspire to higher power than this, expect to be met with resistance from your superiors. On the other hand, if you are working for a large hospital or public health agency and you already have thousands of data points collected through your regular recordkeeping, then power calculations are irrelevant – you already have a large enough sample. This was the case when I worked at the New York Department of Health, where I seldom did a power calculation.

There are four ways you can increase your power in a study:

  • You can increase your sample size.
  • You can increase the effect size you are hoping to detect. Let’s say your outcome is cholesterol level. You’ll have more power to find a 30 point reduction than a 10 point reduction. But if the drug only yields a 20 point reduction, you would miss that. For this reason, this is seldom a popular option.
  • You can reduce the variance in your pilot study. This is rarely within your control – your samples are what they are.
  • You can increase the Type I error rate. Type I and Type II error are interdependent – when one goes down, the other goes up. But since 95% is so deeply ingrained, this is not a popular choice either.

Thus, it pretty much comes down to sample size, finding the ideal number that is large enough to find the desired effect but small enough that your company or agency is willing to pay for it.
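
To make the trade-offs concrete, here is a quick sketch using pwr.t.test() from the pwr package (introduced in the example below); the baseline numbers – 50 per group, effect size d = 0.4 – are hypothetical, chosen only to show how each lever moves power:

library(pwr)
# Baseline: n = 50 per group, effect size d = 0.4, alpha = 0.05
pwr.t.test(n = 50,  d = 0.4, sig.level = 0.05)$power  # ~0.51
pwr.t.test(n = 100, d = 0.4, sig.level = 0.05)$power  # larger sample: ~0.80
pwr.t.test(n = 50,  d = 0.6, sig.level = 0.05)$power  # larger effect: ~0.84
pwr.t.test(n = 50,  d = 0.4, sig.level = 0.10)$power  # looser alpha: ~0.63
# Lower variance works through d (the difference divided by the SD),
# so shrinking the SD raises d and acts like the larger-effect line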

Back to my story: a few days after our debate, while dropping off something on my boss’s desk, I saw that he had done the calculation himself, something he almost never did. He assumed the site was already well below the chromium threshold, assumed there would be almost no variation between tests, and concluded that only a handful of soil tests would be necessary. He never showed this to me, and we never discussed it. I left the company soon after that to start graduate school, but I later heard that the very first test showed concentrations well over the threshold, and they had to bring the crew back in to do more excavation, rendering our entire debate meaningless. How can you ever know what you have until you measure it? I am still on the side of my 22-year-old self.

Let’s look at a real-world example using the R package pwr. Suppose the asthma prevalence in a sample of boys in smoking households was 2%, compared with a population prevalence of 1.4%. At what sample size would we have 80% power to detect this difference?

library(pwr)
pwr.p.test(h = ES.h(p1 = 0.02, p2 = 0.014),
           sig.level = 0.05,
           power = 0.80,
           alternative = "two.sided")
## 
##      proportion power calculation for binomial distribution (arcsine transformation) 
## 
##               h = 0.04659524
##               n = 3615.126
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided

Setting aside the obscure syntax in the first line, p1 and p2 are our two proportions, and note that the significance level is given as .05, rather than as the .95 confidence level used in many other functions. (No doubt there is a package out there that does this a bit more cleanly, but if so, I am not aware of it.)
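
For the curious, ES.h() computes Cohen’s h, the difference between arcsine-transformed proportions – the “arcsine transformation” named in the output above. Computing it by hand reproduces the h value shown:

# Cohen's h: difference of arcsine-transformed proportions
2 * asin(sqrt(0.02)) - 2 * asin(sqrt(0.014))
## [1] 0.04659524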

We see that the sample size required is 3,615. Conversely, if we have already taken a sample, we can measure its power by supplying n= in place of power=:

pwr.p.test(h = ES.h(p1 = 0.02, p2 = 0.014),
           sig.level = 0.05,
           n=500,
           alternative = "two.sided")
## 
##      proportion power calculation for binomial distribution (arcsine transformation) 
## 
##               h = 0.04659524
##               n = 500
##       sig.level = 0.05
##           power = 0.1806347
##     alternative = two.sided

Only 18% power – very low, and consistent with a conventional significance test, which at this sample size would likely suggest no difference between the groups.
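
To see the whole picture, we can trace out the power curve across a range of sample sizes – a quick sketch reusing the same effect size:

# Power at several sample sizes for the same effect size h
h <- ES.h(p1 = 0.02, p2 = 0.014)
sapply(c(500, 1000, 2000, 3615), function(n)
  pwr.p.test(h = h, n = n, sig.level = 0.05,
             alternative = "two.sided")$power)
# power climbs from about 0.18 at n = 500 to about 0.80 at n = 3615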
