Reproducing published graphics in R

A wonderful blog post from a few years ago picked five statistical graphics at random from regular media sources like the New York Times and reproduced them with a handful of lines of R code. I thought it was written by Rafael Irrizary for the Simply Statistics blog, but now I am not seeing it there – perhaps it was taken down.

In any case, for the introductory statistics class that I teach, I picked a graphic more-or-less at random from the Morbidity and Mortality Weekly Report and repeated the exercise. Here is the original, depicting the percentage of adults with kidney disease, as measured by the National Health Interview Survey in the second half of 2020:

And here is my version:

There are minor differences, of course, in fonts and line weights and hues. For these, I was satisfied with an approximate match. I also did not break the y-axis to show that percentages can go all the way to 100, as this adds no real information. It’s also difficult in ggplot – its authors seem to have actively discouraged the practice, considering it unnecessary at best, and problematic if you want to switch to a logarithmic scale.

Ironically, when searching for Rafael Irrizary’s post, I came across one where he advocates banning such “dynamite plots” because they convey so little information. That name comes from the way these can look like a dynamite plunger; less so in this version that shows both the upper and lower confidence bar, rather than only the upper. While this critique was directed at data with continuous outcomes rather than binary outcomes, there are still ways to make this plot better, which I’ll explore in a future post.

The code follows. It requires downloading the 2020 National Health Interview Survey data from here.

#Replicating MMWR March 4, 2022 71(9): 363
#Have you ever been told by a doctor or other health professional that #you had weak or failing kidneys? Data from National Health Interview #Survey, 2nd half of 2020

library(dplyr)
library(ggplot2)
library(tidyr)

#read in the data downloaded from CDC site
nhis <- read.csv("C:/epi551/data/adult20.csv",header=T) 

#select variables to be used from original 618
nhis <- select(nhis,KIDWEAKEV_A,SEX_A,INTV_MON,AGEP_A)

#rename to more memorable names
nhis <- rename(nhis, age=AGEP_A, sex=SEX_A, month=INTV_MON, kf=KIDWEAKEV_A)

#limit to July to December
nhis <- filter(nhis, month > 6)

#delete missing/unknown/refused values
nhis <- filter(nhis, age<=85 & sex<=2 & kf<=2)

#convert sex to a categorical value
nhis$sex <- factor(nhis$sex,labels=c("Men","Women"))

#classify age into categories
nhis$ageg <- case_when(nhis$age<=44 ~ 2,
                       nhis$age<=64 ~ 3,
                       nhis$age>64 ~ 4)

#obtain stats for sex by age
nhis2 <- nhis %>%
  group_by(sex, ageg, kf) %>%
  count() %>%
  group_by(sex, ageg) %>%
  pivot_wider(names_from = kf, values_from = n, names_prefix="kf") %>%
  mutate(percentage = (kf1/(kf1+kf2)*100))

#obtain stats by sex
nhis3 <- nhis %>%
  group_by(sex, kf) %>%
  count() %>%
  group_by(sex) %>%
  pivot_wider(names_from = kf, values_from = n, names_prefix="kf") %>%
  mutate(percentage = (kf1/(kf1+kf2)*100))
nhis3$ageg <- 1

#obtain stats by age
nhis4 <- nhis %>%
  group_by(ageg, kf) %>%
  count() %>%
  group_by(ageg) %>%
  pivot_wider(names_from = kf, values_from = n, names_prefix="kf") %>%
  mutate(percentage = (kf1/(kf1+kf2)*100))
nhis4$sex <- " Total"

#not grouped by anything
nhis5 <- nhis %>%
  group_by(kf) %>%
  count() %>%
  pivot_wider(names_from = kf, values_from = n, names_prefix="kf") %>%
  mutate(percentage = (kf1/(kf1+kf2)*100))
nhis5$sex <- " Total"
nhis5$ageg <- 1

#combine all stats
nhis6 <- rbind(nhis2,nhis3,nhis4,nhis5)

#create age group labels
nhis6$ageg <- factor(nhis6$ageg, labels=c("Total", "18-44 yrs","45-64 yrs", ">=65 yrs"))

#compute confidence intervals
nhis6 <- nhis6 %>%
  rowwise() %>% 
  mutate(lci =(prop.test(kf1,kf1+kf2,p=percentage/100))$conf.int[1]*100,
         uci =(prop.test(kf1,kf1+kf2,p=percentage/100))$conf.int[2]*100)

#create final graph
ggplot(data=nhis6, aes(x=sex, y=percentage, fill=ageg)) +
  geom_col(position=position_dodge2(), color="black", size=1, width=0.75) +
  scale_fill_manual(values=c("#bdd7e7",
                             "#6baed6", 
                             "#3182bd",
                             "#08519c")) +
  geom_errorbar(aes(x=sex, ymin = lci, ymax = uci),
  width = 0.3, position = position_dodge(0.75), size=1) +
  xlab("Sex") +
  ylab("Percentage") +
  theme_classic() +
  labs(fill = "") +
  theme(legend.position=c(0.12,0.80),
        legend.text=element_text(face="bold",size=12),
        axis.text.x=element_text(face="bold",size=12),
        axis.text.y=element_text(face="bold",size=12),
        axis.title.x = element_text(face="bold", size = 12),
        axis.title.y = element_text(face="bold", size = 12),
        axis.ticks.y = element_line(size = 1),
        panel.border = element_rect(colour = "black", fill=NA, size=1)) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)), 
                     breaks=seq(0,10,2), limits=c(0,10))

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: