# Improving upon a bar graph

In my preceding post, I showed how to replicate a typical Quickstats graph from the CDC’s Morbidity and Mortality Weekly Report. But such a bar graph is rather information-poor. There are only 24 pieces of data: the mean and the 95% confidence limit for each of the sex-age strata. (You could argue that the upper and lower confidence limits should each count, in which case 36 pieces of data). That’s an improvement over a table, I think, but it still occupies a lot of screen real estate.

Here, all 36 pieces of information are contained within the confidence interval symbols. The shaded bars repeat 12 of these (the means), and add no additional information. Eliminating them yields the figure below. More minimal, yes, though not necessarily that much clearer.

Orienting the bars horizontally and arranging them more logically (grouping age first, then sex, rather than the reverse) helps quite a bit. Now it is obvious that age matters a lot, and sex hardly at all.

But since age is the most important variable, why use such coarse groupings? People in their mid-40s differ considerably from those in their mid-60s – why lump them together? Single year of age is available from the NHIS, why not use it? That idea is taken up below. Meanwhile, here is the code that generated the figure above:

``````#Improving upon MMWR March 4, 2022 71(9): 363
#Have you ever been told by a doctor or other health professional that you had
#weak or failing kidneys? Data from National Health Interview Survey, 2nd half
#of 2020

library(dplyr)
library(ggplot2)
library(tidyr)
library(forcats)

#select variables to be used from original 618
nhis <- select(nhis,KIDWEAKEV_A,SEX_A,INTV_MON,AGEP_A)

#rename to more memorable names
nhis <- rename(nhis, age=AGEP_A, sex=SEX_A, month=INTV_MON, kf=KIDWEAKEV_A)

#limit to July to December
nhis <- filter(nhis, month > 6)

#delete missing/unknown/refused values
nhis <- filter(nhis, age<=85 & sex<=2 & kf<=2)

#classify age into categories
nhis\$ageg <- case_when(nhis\$age<=44 ~ 3,
nhis\$age<=64 ~ 2,
nhis\$age>64 ~ 1)

#obtain stats for sex by age
nhis2 <- nhis %>%
group_by(sex, ageg, kf) %>%
count() %>%
group_by(sex, ageg) %>%
pivot_wider(names_from = kf, values_from = n, names_prefix="kf") %>%
mutate(percentage = (kf1/(kf1+kf2)*100))

#obtain stats by sex
nhis3 <- nhis %>%
group_by(sex, kf) %>%
count() %>%
group_by(sex) %>%
pivot_wider(names_from = kf, values_from = n, names_prefix="kf") %>%
mutate(percentage = (kf1/(kf1+kf2)*100))
nhis3\$ageg <- 4

#obtain stats by age
nhis4 <- nhis %>%
group_by(ageg, kf) %>%
count() %>%
group_by(ageg) %>%
pivot_wider(names_from = kf, values_from = n, names_prefix="kf") %>%
mutate(percentage = (kf1/(kf1+kf2)*100))
nhis4\$sex <- 3

#not grouped by anything
nhis5 <- nhis %>%
group_by(kf) %>%
count() %>%
pivot_wider(names_from = kf, values_from = n, names_prefix="kf") %>%
mutate(percentage = (kf1/(kf1+kf2)*100))
nhis5\$sex <- 3
nhis5\$ageg <- 4

#combine all stats
nhis6 <- rbind(nhis2,nhis3,nhis4,nhis5)

#create age group labels
nhis6\$ageg <- factor(nhis6\$ageg, labels=c(">=65 yrs", "45-64 yrs","18-44 yrs",
"Total"))

#create sex group labels
nhis6\$sexf <- factor(nhis6\$sex, labels=c("Men", "Women","Total"))

#compute confidence intervals
nhis6 <- nhis6 %>%
rowwise() %>%
mutate(lci = (prop.test(kf1,kf1+kf2,p=percentage/100))\$conf.int*100,
uci = (prop.test(kf1,kf1+kf2,p=percentage/100))\$conf.int*100)

#reoriented plot
ggplot(data=nhis6, aes(y=ageg, x=percentage, fill=sexf)) +
geom_errorbar(aes(y=ageg, xmin = lci, xmax = uci, color=forcats::fct_rev(sexf)),
width = 0.3, position = position_dodge(0.75), size=1) +
guides(color=guide_legend(title=NULL,reverse = TRUE) ) +
geom_point(position = position_dodge(0.75), aes(color=forcats::fct_rev(sexf))) +
guides(fill="none") +
labs(fill = "") +
ylab("Age Group") +
xlab("Percentage") +
labs(fill = "") +
theme(panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line( size=.1, color="white"))``````

## Getting rid of age groups

Sometimes age groups are helpful in simplifying an analysis, but that applies to five and ten year age groups. Here we have 20 and 25 year age groups. We can drop them altogether and look at the relationship of kidney trouble as a function of age:

The dots are the actual values and the lines are the smoothed values using the default loess smoother in R. Now we have substantially more information being presented in the same amount of space. We see that the 65 and over grouping is not especially helpful, given that 85+ year olds have double the incidence of 65 year olds. (Note that the NHIS lumps 85 and above together across the entire survey). There is a similar difference between the top and bottom of the 45-64 category. Only the 18-44 age group makes sense.

Here is the additional code to create this last figure:

``````#obtain stats for sex by actual age
nhis7 <- nhis %>%
group_by(sex, age, kf) %>%
count() %>%
group_by(sex, age) %>%
pivot_wider(names_from = kf, values_from = n, names_prefix="kf") %>%
mutate(percentage = (kf1/(kf1+kf2)*100))

#where case count was 0, replace missing values with 0
nhis7 <- replace_na(nhis7,list(kf1=0,percentage=0))

#create sex group labels
nhis7\$sexf <- factor(nhis7\$sex, labels=c("Men", "Women"))

ggplot(data=nhis7,aes(x=age,y=percentage,color=sexf)) +
geom_point() +
geom_smooth(se=F) +
ylab("Percentage with kidney problems") +
xlab("Age (years)") +
labs(color = "") +
scale_x_continuous(breaks = seq(20, 85, by = 5)) +
scale_y_continuous(breaks = seq(0, 14, by = 2)) +
theme(panel.grid.minor.x = element_line(color = NA))``````