2009-04-03

Statistical Sampling and the Census

We have a census coming up next year. And of course our political leaders are doing their best to turn that fact to their own parties' political advantage. Since relatively poor people tend to vote Democratic, and since relatively poor people are the ones who are most often missed during census-taking, Republicans basically want to ignore their existence, while Democrats want to figure out ways to include them.

The political dimension of this is easy to understand, but there is a problem. The Republicans say that they want to ignore people who aren't counted in the census because they feel that the numbers will then be more "real," or more accurate, than if statistical methods are used to estimate them, as Democrats and most statisticians want to do. In fact, ignoring these people is less real and much less accurate than using valid statistical techniques to estimate their numbers.

The reason for this is that the undercounts are not randomly distributed. That is, there is a systematic bias that causes potential Democratic voters to be undercounted more frequently than potential Republican voters. An example might help.

Suppose that Farmer R. is paying Worker D. to pick strawberries, at a penny per berry. At some point, the fruit will have to be counted before D. can be paid. However, counting every berry would take much too long, so instead, they are counted by boxes. Each box nominally holds 100 berries, so in the absence of any additional information, it is reasonable simply to count boxes and pay $1 per box.

However, D. decides that he is doing more work than he has been paid for, so, out of hundreds of boxes of berries, he takes a randomly selected 20 boxes and actually counts all the berries. He discovers that there were 2100 berries in the 20 boxes, or 105 per box on average. The actual count ranged from 90 to 110 in the sample boxes.

Worker D. then goes to Farmer R. and says that he wants to apply the results of his experiment to getting paid: he wants $1.05 per box instead of $1. Farmer R. refuses, since the boxes are plainly marked "100 ct.", and so they must hold 100 berries each, and in any case, why is D. so concerned about a mere 5¢ difference?

Now that empirical sampling has shown the expected number of berries per box to be about 105, the actual count of all berries picked is most likely about 105% of the number estimated by counting boxes. That is, it is very likely that using statistical estimation will be more accurate than simply using the nominal counts, even though the number is derived mathematically rather than from a direct count.
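
For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The sample figures (20 boxes, 2100 berries, a penny per berry) come from the story above. The total of 300 boxes is purely my assumption, standing in for "hundreds of boxes", and the variable and function names are just illustrative.

```python
from fractions import Fraction

# Figures taken from the example above.
PRICE_PER_BERRY = Fraction(1, 100)   # one penny per berry, in dollars
NOMINAL_PER_BOX = 100                # what the "100 ct." label claims
sampled_boxes = 20
sampled_berries = 2100

# Sample mean: berries per box in D.'s random sample.
mean_per_box = Fraction(sampled_berries, sampled_boxes)

def pay(boxes, berries_per_box):
    """Dollars owed for a number of boxes at a given berries-per-box figure."""
    return boxes * berries_per_box * PRICE_PER_BERRY

# ASSUMPTION: the story only says "hundreds of boxes"; 300 is hypothetical.
total_boxes = 300

nominal_pay = pay(total_boxes, NOMINAL_PER_BOX)   # trusting the label
estimated_pay = pay(total_boxes, mean_per_box)    # trusting the sample
shortfall = estimated_pay - nominal_pay

print(f"Sample mean per box: {float(mean_per_box):.0f}")
print(f"Pay by nominal count: ${float(nominal_pay):.2f}")
print(f"Pay by sample estimate: ${float(estimated_pay):.2f}")
print(f"Estimated underpayment: ${float(shortfall):.2f}")
```

The 5¢ per box that Farmer R. waves away adds up across every box picked, which is why a small systematic error matters more than its size suggests.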

Although this is by no means a perfect analogy with the census, the point I wanted to make was that the purpose of statistical estimation is not to favor one side or the other, but rather to reduce the amount of systematic bias in the data.

Let me add that Farmer R. could do his own empirical study, and if he did it correctly, and found a lower count than D.'s study did, then he could promote, and defend, an alternative, lower statistical estimate. The Republicans could absolutely do the same thing for the census data: instead of trying to eliminate statistical estimation, if they think that the Democratic-supported estimates introduce a new source of bias (e.g., overcounting certain potential Democratic voters), then they should do their own empirical studies and use them to support a modified estimate. In fact, if they don't do this, and continue simply to argue for not using statistical estimation at all, then it seems pretty clear to me that they basically accept the premise of systematic pro-Republican bias in the raw counts, and are just trying to preserve an error that benefits them.

Greg Shenaut