When I presented my work on visualizing socioeconomic status in the United States at the Portland R users group last month, I got a ton of great feedback, far more than I ever got within the public health community. That makes sense: there are relatively few R users in public health (so far), and here was a room full of them.
At one point I went somewhat into the weeds about how difficult it was to obtain the large amount of census data I needed for my project. I wanted populations by census block group for the whole country (all 200,000+ of them), and the Census web site presented me with two options: use their interactive Factfinder website, or work with giant raw data tables. In the first case, it is only possible to obtain data for one state at a time. Including Washington, DC and Puerto Rico, I was going to have to execute the same series of mouse clicks and dropdown menu selections 52 times. Even though I probably could have managed that in 2 hours or so, it was too tedious and prone to error: later on I might find out that I had downloaded the wrong table for Texas or the wrong year of data for Tennessee. Or worse, I wouldn’t find out, and my subsequent analysis would be wrong.
In the second case, I was obligated to download the entire American Community Survey for my year of interest. The site promised it was 6 GB, but it was something above 20 GB. How much above 20, I do not know, because I was never able to download the entire file. I live on a small island off the coast of Maine, and while the internet coverage is sufficient for most purposes, there are regular micro-interruptions in service, perhaps 10 or 20 seconds every hour. No one knows why. Most sites and apps can handle this without any issue; on a typical hour-long Skype call I’ll have one or two short periods of garbled audio or frozen video. But on the census site, any such interruption obligated me to start over. I tried different browsers and browser settings, but nothing worked. Finally I asked a friend on the mainland to download the file for me and sent him the inelegant code required to extract and link together the small part of the file I needed. With tidycensus, the whole thing would have taken minutes.
The following three lines of code download the variable I need (total population) for every census tract in the state of New York. I still need to do this 52 times, but there’s an easy solution to that, too, coming in a future post.
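library(tidycensus)
# install = TRUE saves the key to .Renviron for future sessions
census_api_key("your api key here", overwrite = TRUE, install = TRUE)
# Total population (variable B01003_001) for every census tract in New York
ny <- get_acs(geography = "tract", variables = c(population = "B01003_001"), state = "NY")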
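In the meantime, here is a minimal sketch of how the New York call could be repeated for all 52 areas. It is only one possible approach and not necessarily the solution the future post will describe. The names `states_plus` and `get_population()` are made up for illustration and are not part of tidycensus, and the sketch assumes an API key is already installed and that dplyr is available for stacking the results.

```r
# Sketch only: repeat the New York download for every state, DC, and Puerto Rico.
# `states_plus` and `get_population()` are illustrative names, not tidycensus functions.

# Base R ships the 50 state abbreviations; add Washington, DC and Puerto Rico.
states_plus <- c(state.abb, "DC", "PR")

# Download total population (B01003_001) by census tract for a single state.
get_population <- function(st) {
  tidycensus::get_acs(
    geography = "tract",
    variables = c(population = "B01003_001"),
    state = st
  )
}

# Not run here: needs an installed Census API key and a network connection.
# us <- dplyr::bind_rows(lapply(states_plus, get_population))
```

Because every state goes through the same function, there is no chance of picking the wrong table for one state or the wrong year for another, which was the main risk of the point-and-click route.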
Tidycensus was written by Kyle Walker. Learn more here. This is also where the Vermont income graph appears.