
One of my larger projects involves helping to come up with a new standard for measuring the completeness of cancer reporting. The concept is pretty straightforward: divide the number of cases that have been reported by the total number of cases that exist. You want this to be as close to 100% as possible. If it is, say, only 50%, you will be tempted to conclude that your cancer rates are unusually low and that it has something to do with the healthiness of your population, when it is really a problem of missing data. With cancer, this is more of a problem in developing countries, though we see it in the United States with things like influenza and Lyme disease.
It’s a hard problem, because it involves knowing what it is you don’t know. A simple solution is given by the equation above. It is a good starting point, but as we will see, it is too good to be true.
When diseases are only reported to one place (say, a state cancer registry), it is very difficult to know which cases you are missing. There are common-sense approaches you can take – for example, if a hospital reported an average of 300 cases in each of the past five years, and this year you have received only 50, you know there is a problem. But if you’ve received 270, is that a problem, or is it just an unusually light year? If diseases are reported to two places independently, on the other hand, then it is possible to use the overlap between the two to estimate how many are missing. You know how many have been reported to both, to only the first source, and to only the second source. What remains is the number reported to neither source. If we call these quantities A, B, C, and D, then:

This is not as complicated as it may look; it simply means that the number reported to neither source can be estimated from the number reported to both and the numbers reported to only one or the other.
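To make the arithmetic concrete, here is a minimal sketch in Python. It assumes the equation is the standard two-source capture-recapture estimate, D = B × C / A, which is consistent with the later observation that D is zero whenever C is zero. The function name and the example counts are hypothetical and not taken from any real registry.

```python
def estimate_completeness(both, first_only, second_only):
    """Two-source capture-recapture estimate under the independence assumption.

    both        -> A, cases reported to both sources
    first_only  -> B, cases reported only to the first source (e.g. the registry)
    second_only -> C, cases reported only to the second source
    Returns (estimated D, estimated completeness of the first source).
    """
    if both <= 0:
        raise ValueError("A must be positive; with no overlap D cannot be estimated")
    missing = first_only * second_only / both           # assumed form: D = B * C / A
    total = both + first_only + second_only + missing   # A + B + C + D
    completeness = (both + first_only) / total          # share of all cases the first source holds
    return missing, completeness


# Hypothetical counts, for illustration only.
d, completeness = estimate_completeness(both=800, first_only=100, second_only=200)
print(f"estimated missing cases: {d:.0f}, completeness: {completeness:.1%}")
```

With these made-up counts, D comes out to 25 and the estimated completeness of the first source to 80%.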
But the problem with this method, and why it is pretty much unusable, is that it requires the two sources to be independent. This is rarely, if ever, true in practice. Disease reporting is much more likely to be all-or-nothing: a hospital will either report cases everywhere they are supposed to, or not at all. The more this is true, the more completeness will be overestimated.
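A rough way to see the overestimate is to work out the expected counts when some share of cases come from hospitals that report all-or-nothing. The sketch below is a toy model with made-up parameters, not a description of any real reporting system: a fraction `dependent_share` of cases are reported to both sources or to neither, and the rest are reported to each source independently. It also assumes the independence-based estimate takes the standard form D = B × C / A.

```python
def expected_counts(n_cases, dependent_share, p_report=0.8):
    """Expected A, B, C, D for a toy reporting model (all parameters hypothetical).

    dependent_share of the cases sit at all-or-nothing hospitals: with probability
    p_report they go to both sources, otherwise to neither. The remaining cases
    are reported to each source independently with probability p_report.
    """
    dep = n_cases * dependent_share
    ind = n_cases - dep
    a = dep * p_report + ind * p_report * p_report         # reported to both
    b = ind * p_report * (1 - p_report)                    # first source only
    c = ind * (1 - p_report) * p_report                    # second source only
    d = dep * (1 - p_report) + ind * (1 - p_report) ** 2   # reported to neither
    return a, b, c, d


for share in (0.0, 0.5, 1.0):
    a, b, c, d_true = expected_counts(10_000, share)
    d_est = b * c / a                                      # estimate that assumes independence
    true_completeness = (a + b) / (a + b + c + d_true)
    est_completeness = (a + b) / (a + b + c + d_est)
    print(f"dependent share {share:.0%}: true {true_completeness:.1%}, "
          f"estimated {est_completeness:.1%}")
```

In this toy model the true completeness stays at 80%, while the estimate climbs from 80% to 90% to 100% as reporting becomes more all-or-nothing.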
We might think of disease registries and vital records as two somewhat independent pathways for reporting cancer. One is collected and reported at diagnosis, the other at death, by different processes that can be widely separated in space and time. We can imagine examples of cancer patients reported only to registries (anyone who is still alive) and of those reported only on death certificates (such as those who were diagnosed in other countries or were never diagnosed while alive).
But when you plug the numbers into the equation you get plenty of nonsensical results. One state did not have up-to-date vital records available, artificially inflating B and implying very low completeness, when the real problem was with vital records. Another state reported zero death-only cases, perhaps also because of flawed vital records data or maybe some other kind of processing error. From the equation it is obvious that if C is zero, then D is zero and completeness is 100%. The method fails both because the assumption of independence cannot be verified and because it requires the vital records data to be complete and accurate.
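The zero-death-only case is easy to reproduce. Assuming the equation is the standard two-source estimate D = B × C / A, the sketch below uses invented counts (and a hypothetical function name) to show that the estimated completeness is 100% whenever C is zero, no matter how the remaining cases split between A and B.

```python
def registry_completeness(a, b, c):
    """Estimated completeness of the first source, assuming D = B * C / A."""
    d = b * c / a                      # estimated cases reported to neither source
    return (a + b) / (a + b + c + d)


# Invented counts: c=0 means no death-only cases were found.
for a, b in [(900, 100), (500, 500), (10, 990)]:
    print(a, b, registry_completeness(a, b, c=0))
```

Every row prints a completeness of 1.0, which says nothing about how many cases are really missing; it only reflects that the second source contributed no cases of its own.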