(The allegory of the auto recall)

Suppose a car manufacturer has reason to believe that a defect in some of its engines is resulting in poor fuel economy, but it does not know which engines are affected. Rather than have every single driver come into the dealership for a diagnostic test, a costly and time-consuming process, it invites drivers to make the diagnosis themselves by reporting their current miles per gallon. Only drivers whose cars fall below a specified cutoff value will be invited to come in for the diagnostic test.
The results come in, and the values are unexpectedly all over the place. The company thought that most would be around 30 mpg, the rated fuel economy, and that a small fraction would be below 24 owing to the defect. Instead, the data are normally distributed with a mean of 28 mpg and a standard deviation of 4 mpg. With the cutoff at 24 mpg, one standard deviation below the mean, about one-sixth of the cars (roughly 16%) are eligible for the diagnostic test, many more than the company was counting on. It decides to check the cars in a single city, and strangely, of the first 50 that are tested, none seems to have anything wrong with it.
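The one-sixth figure can be checked with a short calculation. The sketch below assumes the cutoff is the 24 mpg threshold mentioned in the story (the story never states the cutoff outright), and the helper name `normal_cdf` is just an illustration, not anything the automaker used.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X <= x) for X ~ Normal(mean, sd), computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

mean_mpg = 28.0    # mean of the reported values (from the story)
sd_mpg = 4.0       # standard deviation of the reported values (from the story)
cutoff_mpg = 24.0  # ASSUMPTION: the cutoff equals the 24 mpg threshold

# Share of cars reporting below the cutoff, i.e. one SD below the mean.
fraction_below = normal_cdf(cutoff_mpg, mean_mpg, sd_mpg)
print(f"{fraction_below:.4f}")  # 0.1587
```

So a little under 16% of cars fall below the cutoff, which is close to one in six, purely as a consequence of where the cutoff sits on the bell curve.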
The company hires Fuelpump, LLC to look at its data, and the first thing the consultants notice is that cars with worse mpg tend to come from northern states. It seems the survey was taken in the winter, and fuel economy correlates with temperature. They also find that lower mpg is associated with higher-mileage cars (excess wear and tear reduces fuel economy), drivers who live in cities (more traffic lights), those who live closer to work (fewer miles at highway speeds), and those who accelerate and decelerate more aggressively (even though the self-reported aggressiveness data are of dubious quality). Even after adjusting for all these factors, there is still a lot of variation. You can take two seemingly identical drivers, and one may report 33 mpg and the other 23.
What about measurement error? While the automaker provided instructions on how to calculate fuel economy, they were not particularly detailed, so as not to be confusing. They said to start with a full tank. For most drivers, that is when the hose automatically shuts off, but others like to top off to the nearest dollar or gallon. The instructions said to record the miles driven until the tank was half full, but did that mean when the needle first touches the half-full tick mark, or when it is centered exactly on it? Then there were the drivers who forgot to check at the halfway point, drove until empty, and divided by two. Others drove until the tank was 1/8 or 1/4 full and did the appropriate math. Not all of them did it correctly.
After synthesizing all the data, the consulting firm finds that the majority of the variation is the result of weather, driving behavior, the built environment, and inconsistent data collection. Almost none of it seems to come from the mechanical defect. It turns out the mechanical defect does not even exist. During a meeting between the manufacturer and Fuelpump, an engineer bursts into the conference room to announce that the whole thing was a mix-up, something about a conflict between the imperial and metric systems of units.
I am working on a project where I think something like this is going on. States report disease data, and the values exhibit wide variation. The lowest-performing handful of states are deemed to be “underreporting” when they could just be the left tail of a distribution arising from factors having nothing whatsoever to do with data quality. The states labeled as underreporters tend to be the same ones year after year, because the factors that make them appear defective (demographics, health care provision, behavioral risk factors) do not change much from year to year. Sometimes these states have funding withheld for not doing a good job, and other times they get extra funding to help them do a better job, depending on whether carrots or sticks are more in vogue. But neither response is appropriate if the low numbers are not about the job the states are doing.
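A toy simulation shows how the same states can land in the bottom group year after year without any difference in reporting quality. Everything in it is an assumption made for illustration, not a model of real surveillance data: 50 states, 10 years, a fixed per-state "structural" offset standing in for demographics and similar slow-moving factors, independent year-to-year noise, and a rule that flags the lowest 5 each year. The function names are hypothetical.

```python
import random

def simulate_flags(n_states=50, n_years=10, n_flagged=5,
                   structural_sd=1.0, noise_sd=0.2, seed=0):
    """Return, for each year, the set of state indices flagged as lowest."""
    rng = random.Random(seed)
    # ASSUMPTION: each state has a fixed offset that never changes and has
    # nothing to do with how well the state reports its data.
    structural = [rng.gauss(0.0, structural_sd) for _ in range(n_states)]
    flagged_by_year = []
    for _ in range(n_years):
        # Reported value = fixed offset + fresh noise for that year.
        reported = [s + rng.gauss(0.0, noise_sd) for s in structural]
        ranked = sorted(range(n_states), key=lambda i: reported[i])
        flagged_by_year.append(set(ranked[:n_flagged]))
    return flagged_by_year

def mean_consecutive_overlap(flagged_by_year):
    """Average number of states flagged in both of two consecutive years."""
    pairs = list(zip(flagged_by_year, flagged_by_year[1:]))
    return sum(len(a & b) for a, b in pairs) / len(pairs)

flags = simulate_flags()
overlap = mean_consecutive_overlap(flags)
# If 5 of 50 states were flagged at random each year, the expected
# overlap between two years would be 5 * 5 / 50 = 0.5 states.
chance_overlap = 5 * 5 / 50
print(f"mean overlap between consecutive years: {overlap:.2f} of 5")
print(f"expected overlap under random flagging: {chance_overlap:.2f}")
```

In this setup there is no data-quality effect at all, yet the flagged set repeats far more often than chance would predict, simply because the structural offsets persist.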
This may be an unsolvable puzzle, but I have not given up quite yet.