The first step in writing the Titanic R book is obtaining a clean copy of the data. I knew I would not be using the data in the R titanic package because it excludes the crew, who are just as important to the story as the passengers. What I did not expect is that the passenger data themselves would contain errors. This should not have surprised me. In every project of my career the data have required some cleanup, and sometimes the cleanup has ended up being the whole project.
I thought the Titanic data might be different, first because they are over a century old and there’s been ample time for cleaning, and second because they are used so widely as a teaching tool. A company called Kaggle uses them as the basis for its introductory data challenge: build a statistical model that predicts which passengers survived based on age, sex, class, and other variables. At present 21,521 individuals and teams have signed up. Many have written exhaustive posts about the methods they used to make their predictions. How many of those predictions were wrong owing to the data themselves?
Yes, I realize that since everyone is working from the same flawed data set, it does not matter so much if some of the data are off – the relative results should still be valid. Still, this is a group that very much cares about accuracy and precision, that extols the virtues of a model that is 81.2% correct over one that is 81.1%.
It turns out that the data in the R titanic package were the product of a handful of researchers in the 1990s, culminating in the work of Thomas Cason, a student at the University of Virginia, in 1999. He dropped duplicate passengers, filled in missing ages, and “many errors were corrected”. And then, for whatever reason, it was agreed that correcting many errors was the same as correcting all errors.
I don’t know how many errors are in the R data. I do know that I found errors in the first three entries alphabetically, which was good enough for me. First we have Rhoda Mary Abbott, a survivor from third class, whose details were long murky because she had been entered in the passenger manifest as Rosa. The mystery was cleared up in 2004, after Cason’s version of the data was produced. She was traveling with her two teenaged sons, Eugene and Rossmore, neither of whom survived. The R version counts one of them as her husband, which should have been caught in 1999 because the ages make no sense.
So for Rhoda Abbott, we have the wrong name, age (39, not 35), and familial relations. Her sons both have incorrect familial relations. It looks like I’ll be taking a deeper dive into the raw data than planned.
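The husband-who-is-really-a-son mistake is the kind of thing a simple plausibility check can flag. What follows is only a rough sketch, written in Python rather than R. The record layout (an explicit `spouse` field), the placeholder names and ages, and the two thresholds are all assumptions made up for illustration, and none of them come from the actual data.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed thresholds for illustration only; a real check would need
# values chosen with the historical context in mind.
MIN_SPOUSE_AGE = 18
MAX_SPOUSE_GAP = 15


@dataclass
class Person:
    name: str
    age: float
    spouse: Optional[str] = None  # name of the recorded spouse, if any


def implausible_spouses(people: List[Person]) -> List[Tuple[str, str]]:
    """Return recorded spouse pairs whose ages look wrong."""
    by_name = {p.name: p for p in people}
    flagged: List[Tuple[str, str]] = []
    for person in people:
        partner = by_name.get(person.spouse) if person.spouse else None
        if partner is None:
            continue
        too_young = min(person.age, partner.age) < MIN_SPOUSE_AGE
        gap_too_wide = abs(person.age - partner.age) > MAX_SPOUSE_GAP
        if too_young or gap_too_wide:
            # Sort the names so each pair is reported only once.
            pair = tuple(sorted((person.name, partner.name)))
            if pair not in flagged:
                flagged.append(pair)
    return flagged


# Hypothetical records, not real passengers.
people = [
    Person("Mother", 35, spouse="Teen son"),
    Person("Teen son", 16, spouse="Mother"),
    Person("Husband", 40, spouse="Wife"),
    Person("Wife", 38, spouse="Husband"),
]

for pair in implausible_spouses(people):
    print("Check this pair:", pair)
```

A check like this cannot say what the right relationship is. It only points at the rows that deserve a second look against the original sources.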