A wonderful blog post from a few years ago picked five statistical graphics at random from regular media sources like the New York Times and reproduced them with a handful of lines of R code. I thought it was written by Rafael Irrizary for the Simply Statistics blog, but now I am not seeing it there – perhaps it was taken down.
In any case, for the introductory statistics class that I teach, I picked a graphic more-or-less at random from the Morbidity and Mortality Weekly Report and repeated the exercise. Here is the original, depicting the percentage of adults with kidney disease, as measured by the National Health Interview Survey in the second half of 2020:
And here is my version:
There are minor differences, of course, in fonts and line weights and hues. For these, I was satisfied with an approximate match. I also did not break the y-axis to show that percentages can go all the way to 100, as this adds no real information. It’s also difficult in ggplot – its authors seem to have actively discouraged the practice, considering it unnecessary at best, and problematic if you want to switch to a logarithmic scale.
Ironically, when searching for Rafael Irrizary’s post, I came across one where he advocates banning such “dynamite plots” because they convey so little information. That name comes from the way these can look like a dynamite plunger; less so in this version that shows both the upper and lower confidence bar, rather than only the upper. While this critique was directed at data with continuous outcomes rather than binary outcomes, there are still ways to make this plot better, which I’ll explore in a future post.
The code follows. It requires downloading the 2020 National Health Interview Survey data from here.
#Replicating MMWR March 4, 2022 71(9): 363 #Have you ever been told by a doctor or other health professional that #you had weak or failing kidneys? Data from National Health Interview #Survey, 2nd half of 2020 library(dplyr) library(ggplot2) library(tidyr) #read in the data downloaded from CDC site nhis <- read.csv("C:/epi551/data/adult20.csv",header=T) #select variables to be used from original 618 nhis <- select(nhis,KIDWEAKEV_A,SEX_A,INTV_MON,AGEP_A) #rename to more memorable names nhis <- rename(nhis, age=AGEP_A, sex=SEX_A, month=INTV_MON, kf=KIDWEAKEV_A) #limit to July to December nhis <- filter(nhis, month > 6) #delete missing/unknown/refused values nhis <- filter(nhis, age<=85 & sex<=2 & kf<=2) #convert sex to a categorical value nhis$sex <- factor(nhis$sex,labels=c("Men","Women")) #classify age into categories nhis$ageg <- case_when(nhis$age<=44 ~ 2, nhis$age<=64 ~ 3, nhis$age>64 ~ 4) #obtain stats for sex by age nhis2 <- nhis %>% group_by(sex, ageg, kf) %>% count() %>% group_by(sex, ageg) %>% pivot_wider(names_from = kf, values_from = n, names_prefix="kf") %>% mutate(percentage = (kf1/(kf1+kf2)*100)) #obtain stats by sex nhis3 <- nhis %>% group_by(sex, kf) %>% count() %>% group_by(sex) %>% pivot_wider(names_from = kf, values_from = n, names_prefix="kf") %>% mutate(percentage = (kf1/(kf1+kf2)*100)) nhis3$ageg <- 1 #obtain stats by age nhis4 <- nhis %>% group_by(ageg, kf) %>% count() %>% group_by(ageg) %>% pivot_wider(names_from = kf, values_from = n, names_prefix="kf") %>% mutate(percentage = (kf1/(kf1+kf2)*100)) nhis4$sex <- " Total" #not grouped by anything nhis5 <- nhis %>% group_by(kf) %>% count() %>% pivot_wider(names_from = kf, values_from = n, names_prefix="kf") %>% mutate(percentage = (kf1/(kf1+kf2)*100)) nhis5$sex <- " Total" nhis5$ageg <- 1 #combine all stats nhis6 <- rbind(nhis2,nhis3,nhis4,nhis5) #create age group labels nhis6$ageg <- factor(nhis6$ageg, labels=c("Total", "18-44 yrs","45-64 yrs", ">=65 yrs")) #compute confidence intervals nhis6 <- nhis6 %>% rowwise() %>% mutate(lci =(prop.test(kf1,kf1+kf2,p=percentage/100))$conf.int*100, uci =(prop.test(kf1,kf1+kf2,p=percentage/100))$conf.int*100) #create final graph ggplot(data=nhis6, aes(x=sex, y=percentage, fill=ageg)) + geom_col(position=position_dodge2(), color="black", size=1, width=0.75) + scale_fill_manual(values=c("#bdd7e7", "#6baed6", "#3182bd", "#08519c")) + geom_errorbar(aes(x=sex, ymin = lci, ymax = uci), width = 0.3, position = position_dodge(0.75), size=1) + xlab("Sex") + ylab("Percentage") + theme_classic() + labs(fill = "") + theme(legend.position=c(0.12,0.80), legend.text=element_text(face="bold",size=12), axis.text.x=element_text(face="bold",size=12), axis.text.y=element_text(face="bold",size=12), axis.title.x = element_text(face="bold", size = 12), axis.title.y = element_text(face="bold", size = 12), axis.ticks.y = element_line(size = 1), panel.border = element_rect(colour = "black", fill=NA, size=1)) + scale_y_continuous(expand = expansion(mult = c(0, 0.05)), breaks=seq(0,10,2), limits=c(0,10))