I am working on a project that involves piecing together residential histories to better explain chronic disease risk. Traditionally, we only record where a person lives at the time of a diagnosis or death. For many purposes, this is fine. If you want to know the mortality rate among 85-year-olds in Florida, you take the number of 85-year-olds who have died and divide by the total number of 85-year-olds in the state. The fact that many of these people moved there from another state is irrelevant. But if you think there is something about the climate of Florida that helps people live longer, then you are going to need to distinguish between long-term and short-term residents.
There is no single data source that contains everyone’s residential history (yet), but we are getting closer to that, for better or worse. Unless you never get mail and always pay in cash, chances are that everywhere you’ve lived since about 2000, or since you turned 18, is in a database somewhere.
As a preliminary step in the project, I’ve written a short R program to visualize a person’s residential history. The data are invented, but are based on a real-life person who has moved around a lot. There are three data sources, which I’ve called claim, credit, and hospital, meant to correspond to:
- Medical claims, which in this example are sparse and only date to 2000
- Records from a credit reporting source, which are more comprehensive and go back to the early 1980s
- Hospital admissions records, which are all recent except for a single observation from the 1970s.
The hospital and claims sources each capture snapshots in time, while the credit source gives beginning and ending dates for every address it has.
Here’s the preliminary data wrangling. I read in the data, reorganize the dates into start and end dates, then order the records by date:
library(dplyr)
library(ggplot2)
library(ggstance)
library(forcats)
d1 <- read.csv("https://www.albany.edu/~fboscoe/blog/addresses.csv")
d2 <- d1
d1$date <- as.Date(d1$date1)
d2$date <- as.Date(d1$date2)
d3 <- rbind(d1,d2)
d3 <- mutate(d3, address = forcats::fct_reorder(address, desc(date1)))
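To try the reshaping step without downloading the file, here is a minimal base R sketch on a few invented rows. The column names (address, source, date1, date2) are inferred from the code above. The rows themselves are hypothetical, and so is the assumption that snapshot sources repeat the same date in date1 and date2.

```r
# Invented rows; column names follow the code above, values are hypothetical.
toy <- data.frame(
  address = c("Oak", "Oak", "Maple"),
  source  = c("credit", "claim", "hospital"),
  date1   = c("1985-03-01", "2001-06-15", "2015-09-30"),
  # Assumption: snapshot sources (claim, hospital) repeat date1 in date2.
  date2   = c("1992-08-31", "2001-06-15", "2015-09-30"),
  stringsAsFactors = FALSE
)

# One copy of the rows per endpoint, then stack them, as in the code above.
starts <- transform(toy, date = as.Date(date1))
ends   <- transform(toy, date = as.Date(date2))
long   <- rbind(starts, ends)

# Each record now contributes two rows: an interval yields two distinct
# dates, a snapshot yields the same date twice.
long <- long[order(long$address, long$source, long$date), ]
print(long[, c("address", "source", "date")])
```

In this long form, a line drawn through each address and source group spans the interval for a credit record and collapses to a single point for a snapshot.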
Next, a graph in the form of a timeline for each address. I use position_dodgev() to get visual separation between the three possible sources for each address:
ggplot(d3, aes(x=date, y=address, color=source)) +
  geom_point(position=ggstance::position_dodgev(height=0.35)) +
  xlab("Date") +
  ylab("Address") +
  ggtitle("Sample Residential History") +
  geom_line(position=ggstance::position_dodgev(height=0.35), size=1)
We can see that these are noisy data, with gaps and overlaps. It is possible that our patient had multiple residences at some points and was homeless at others, but it would be safer to assume she only lived in one place at a time and that the data are wrong. We could come up with some business rules to piece this together. For example, any address represented by only a single data point that overlaps other addresses (as in Pine, Elm, and Sycamore) can be disregarded. We can also assume that starting dates are accurate but ending dates are less so. (I know from looking up my personal results, for example, that at least one data source believes a post office box I rented for one year in the early 1990s is still active.) We can assume that the claims data are less reliable, since they show three different addresses for the same date. To decide between Ash and Cherry, maybe you have to flip a coin. There will be errors, of course, but they should be less important when averaged over thousands of individuals.
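The first of these rules could be sketched roughly as follows. This is one possible reading, not a finished implementation: the function name drop_isolated_points is hypothetical, the rows are invented, "single data point" is taken to mean an address with exactly one record whose start and end dates are equal, and "overlaps" is taken to mean that the date falls inside the overall span of some other address.

```r
# Hypothetical records, one row per record (not the long plotting form).
recs <- data.frame(
  address = c("Oak", "Oak", "Pine", "Maple", "Birch"),
  source  = c("credit", "hospital", "claim", "credit", "hospital"),
  date1   = as.Date(c("1985-03-01", "1990-05-10", "1988-07-04",
                      "1993-01-01", "1975-02-02")),
  date2   = as.Date(c("1992-08-31", "1990-05-10", "1988-07-04",
                      "1999-12-31", "1975-02-02"))
)

drop_isolated_points <- function(recs) {
  keep <- rep(TRUE, nrow(recs))
  for (a in unique(recs$address)) {
    mine <- recs$address == a
    # Assumed definition of "single data point": one record, zero duration.
    single <- sum(mine) == 1 && recs$date1[mine] == recs$date2[mine]
    if (!single) next
    point  <- as.numeric(recs$date1[mine])
    others <- recs[!mine, ]
    # Overall span of every other address, across all of its sources.
    first_seen <- tapply(as.numeric(others$date1), others$address, min)
    last_seen  <- tapply(as.numeric(others$date2), others$address, max)
    # Assumed definition of "overlap": the point lies inside another span.
    if (any(point >= first_seen & point <= last_seen)) keep[mine] <- FALSE
  }
  recs[keep, ]
}

cleaned <- drop_isolated_points(recs)
print(cleaned)
```

With these invented rows, Pine is dropped because its single claim date falls inside the span for Oak, while Birch is kept because its single 1975 date overlaps nothing. The other rules, such as trusting start dates over end dates, would need their own logic.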