
A student of mine has been working on this problem and, small as it is, I think we've hit upon an innovative solution. When measuring cancer survival, you inevitably have some patients for whom the diagnosis date is not fully known. Sometimes only the day is missing – for these, you can simply choose the 15th, with negligible impact on your analysis. For others, the month is missing. Here, too, the midpoint method is typically employed – if someone dies on December 31, you assume they were diagnosed on July 1, the midpoint of the year. The diagnosis year is never missing – year is among the minimum requirements for inclusion in a database. Date of death, on the other hand, is always known: it comes from the death certificate, which always carries a complete date.

The fraction of cases with this problem is on the small side – 5%, maybe 10% tops – so the midpoint method is probably good enough. But why settle for good enough when you can come up with something better? The midpoint method assumes you were equally likely to have been diagnosed on any day between January 1 and the day you died. That is obviously not true. For a cancer with an average survival of 3 months, for example, it makes no sense to credit 6 months of survival time.

Instead, the missing diagnosis date should reflect the survival profile of everyone with that cancer, taking into account their age, stage, and gender. So if you have a 75-year-old man with late-stage pancreatic cancer who died on February 5, 2017, with an unknown diagnosis date sometime in 2016, you should find all the 75-year-old men with late-stage pancreatic cancer who died in the calendar year after their diagnosis and see what their average survival was. Maybe in this case we would assign November 18 rather than July 1.

The amount of bias we're reducing here is small but, I suspect, significant. It should make for a good paper. For the curious, here's roughly what the two approaches look like in code.
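First, the standard midpoint convention. This is just a sketch – the function name and signature are mine, not anything from a registry toolkit:

```python
from datetime import date, timedelta

def midpoint_impute(dx_year: int, dx_month: int | None, death: date) -> date:
    """The standard midpoint convention for a partially known diagnosis date."""
    if dx_month is not None:
        # Only the day is missing: the 15th is close enough
        return date(dx_year, dx_month, 15)
    if death.year == dx_year:
        # Death in the diagnosis year: midpoint of Jan 1 .. the death date
        jan1 = date(dx_year, 1, 1)
        return jan1 + timedelta(days=(death - jan1).days // 2)
    # Death in a later year: any day of dx_year is possible, so take mid-year
    return date(dx_year, 7, 1)
```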
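And a minimal sketch of the alternative. Everything here is illustrative – the column names (`survival_days` and the covariates), the exact matching, and summarizing with the mean rather than, say, drawing from the empirical survival distribution are assumptions, not a fixed recipe:

```python
import pandas as pd
from datetime import date, timedelta

def profile_impute(death: date, dx_year: int, covars: dict,
                   cohort: pd.DataFrame) -> date:
    """Impute a missing diagnosis date from the survival profile of
    comparable complete cases, instead of assuming diagnosis was
    equally likely on every feasible day."""
    # Survival times (in days) consistent with a dx_year diagnosis
    lo = max((death - date(dx_year, 12, 31)).days, 0)
    hi = (death - date(dx_year, 1, 1)).days

    # Exact match on the chosen covariates (age, stage, sex, site, ...);
    # thin strata may call for coarser matching, e.g. age bands
    mask = pd.Series(True, index=cohort.index)
    for col, val in covars.items():
        mask &= cohort[col].eq(val)
    feasible = cohort.loc[mask & cohort["survival_days"].between(lo, hi)]

    if feasible.empty:
        # No comparable patients: fall back to the midpoint method
        return death - timedelta(days=(lo + hi) // 2)

    # Credit the mean survival of the comparable, feasible cases
    return death - timedelta(days=int(round(feasible["survival_days"].mean())))
```

For the patient above, `profile_impute(date(2017, 2, 5), 2016, {"age": 75, "sex": "M", "stage": "distant", "site": "pancreas"}, cohort)` restricts the comparison group to survival times between 36 and 401 days – the only durations compatible with a 2016 diagnosis – and the short survival typical of the matched cases is what pulls the imputed date toward November rather than July.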