I wish this were true. I really do. But Charmer said people should call him out more often, so that's exactly what I'm going to do.
As much as I wish I could embrace this thesis, I think Charmer is making a big mistake: he's conflating bias and variance.
Nate Silver has an interesting post summarizing how the various polling firms did in their state polls. Generally it looks like he finds most of them had consistently overestimated the (R) vote. He throws out the usual cuffed explanation for this (not enough cell phones, younger people have cell phones, younger people are more (D), bla bla).

So far, so good.

But he appears to miss the elephant in the room, which is cheating. But surely no serious, scientific, quantitative genius person such as Nate Silver can possibly forget or just un-scientifically ignore the fact that the final vote contains some nonzero amount of cheating.

Cheating accounts for some fraction of the discrepancy. A non-zero fraction, but also a non-one fraction. Even if all the polls are perfectly done, there will be some tiny, tiny correlation between them. Just like it's safe to assume at least one ballot was fraudulent, it's safe to assume at least one vote's worth of error did not cancel out when all the different polls were conducted and aggregated.
Now – to echo a bunch of arguments I made against Silverbating righties – the fact that almost all these polls from all these different polling companies with all different sorts of methodologies find a consistent, systematic “(R) bias” just beggars belief. No quantitative-minded person can just accept that as the result of random chance. Sure, there will be errors and biases but wouldn’t the errors and biases cancel each other out? How likely is it that virtually all polls would come out with an (R) bias? That is an extraordinary claim which requires extraordinary evidence, which Nate Silver does not have.

Here's where things get semantically tricky, because "bias" has a particular meaning which is different from the one we use when we talk about politics. At least it does according to the quantitatively-minded persons who taught me quantitative things.
If a system has bias, it will make errors on a particular prediction regardless of the training set (or, in the case of polling, the sample set) used. Variance, OTOH, covers errors that change for that particular prediction depending on which training set/polling sample is selected. Bias is, roughly, a systematic error, while variance is a result of "noise."
For example, the C4.5 decision tree learning algorithm has a bias towards orthogonal (axis-parallel) decision boundaries. If the true decision surface isn't orthogonal in a particular dimension, then the algorithm will always make some errors because of that. If the true decision surface is orthogonal but slightly amorphous, those deviations will cause variance errors.
(The left is variance. If you're predicting blue o's will be on the left of x=0.5 and red x's on the right, you'll have gotten some wrong just based on the random noise. The right is bias. If you still thought x=0.5 was the dividing line between classes but it turns out the true dividing line is the solid purple diagonal line, then you're going to make consistent and correlated errors no matter what random sample of x's and o's you select.)
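If it helps to make that concrete, here's a minimal Python sketch of the two panels just described. The 10% label-noise rate and the particular diagonal x + y = 1 are my own stand-ins, since I'm reconstructing the picture from its caption:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Left panel (variance): the true boundary really is x = 0.5, but 10% of
# labels get flipped by noise. The x = 0.5 predictor's mistakes depend
# entirely on which noisy sample you happened to draw.
X1 = rng.uniform(0, 1, size=(n, 2))
true1 = X1[:, 0] > 0.5
labels1 = np.where(rng.uniform(size=n) < 0.10, ~true1, true1)
err_variance = np.mean((X1[:, 0] > 0.5) != labels1)

# Right panel (bias): you still predict with x = 0.5, but the true boundary
# is a diagonal (here x + y = 1, my stand-in for the purple line). Those
# mistakes recur on every resample -- they're systematic.
X2 = rng.uniform(0, 1, size=(n, 2))
labels2 = X2[:, 0] + X2[:, 1] > 1.0
err_bias = np.mean((X2[:, 0] > 0.5) != labels2)

print(f"variance-driven error: {err_variance:.2f}")
print(f"bias-driven error:     {err_bias:.2f}")
```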
The errors of every predictive system are a combination of bias and variance. In order to avoid bias you need flexibility, but increased flexibility makes you subject to over-fitting. As a result you'll always have some of each kind of error.
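You can watch that tradeoff directly in a toy curve-fitting sketch (the sine target, noise level, and sample sizes are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)

def test_error(degree, n_train=20):
    """Fit a polynomial of the given degree to one noisy training set drawn
    from y = sin(3x), and return its squared error against the clean curve."""
    x = rng.uniform(-1, 1, n_train)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n_train)
    coefs = np.polyfit(x, y, degree)
    xt = np.linspace(-1, 1, 200)
    return np.mean((np.polyval(coefs, xt) - np.sin(3 * xt)) ** 2)

for deg in (1, 3, 10):
    errs = [test_error(deg) for _ in range(300)]
    # Rigid fits (degree 1) miss the curve the same way every time: bias.
    # Flexible fits (degree 10) chase the noise in each sample: variance.
    print(f"degree {deg:2d}: mean error {np.mean(errs):.3f}, "
          f"spread across training sets {np.std(errs):.3f}")
```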
So it's totally possible, in fact likely, that the polls could have correlated bias. It's implausible they'd have correlated variance, but correlated bias is extremely likely. If there is in fact a problem with getting Dems on the phone, etc., and every poll is based on randomly phoning people, then yes, you would expect all those phone polls to make correlated errors.
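A toy simulation, with magnitudes I'm making up purely for illustration, shows why a shared bias can't be averaged away the way independent noise can:

```python
import numpy as np

rng = np.random.default_rng(2)
n_polls, n_trials = 20, 10_000

shared_bias = 1.2                                     # same skew in every phone poll (pts)
noise = rng.normal(0, 2.0, size=(n_trials, n_polls))  # each poll's own sampling error

errors = shared_bias + noise    # every poll's total error, per trial
avg = errors.mean(axis=1)       # the "poll average" in each trial

# Averaging 20 polls shrinks the noise by sqrt(20), but the shared
# 1.2-point bias survives untouched.
print(f"mean of poll-average error: {avg.mean():+.2f}")  # ~ +1.20
print(f"std of poll-average error:  {avg.std():.2f}")    # ~ 2.0/sqrt(20) = 0.45
```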
The more parsimonious and scientific inference is that the “(R) bias” Nate Silver has found is, of course, nothing other than an estimate of the (D) cheating advantage. What else could it be, after all? Yes, it could theoretically be something else – but that would require an explanation, and evidence. Surely the null hypothesis is that the “bias” showing up from these polls is just the result of voter fraud.

Can we assume that both parties cheat to some extent? There is at least one pro-GOP and one pro-Dem ballot that has been cast fraudulently. In any set of elections, some of them will receive more fraudulent R than D votes. If each party were to cheat the same amount, we would see the fraud canceling out; it would contribute no systematic skew to the poll discrepancy.
Okay. In order for Charmer's hypothesis to be the most parsimonious one, we must conclude that the correlation of cheating exceeds the correlation of poll errors.
I'm totally willing to believe one party cheats more than the other. Fine. But I also have good reason to believe that the poll errors are highly correlated. What I don't have is reason to believe ballot fraud is more highly skewed Dem than is polling error. We're right back where we started: some of the polls' overestimation of GOP votes is due to bias in the polls, and some is due to fraud. Still no reason to conclude the fraction is 100% fraud and 0% polling bias.
In other words, we now have scientific, quant-friendly evidence here that the (D)s get something like a ~1.2% advantage from fucking cheating.

This is true only if you assume all of the differential is due to cheating. This is a ceiling on the Dem cheating advantage; the true value is lower.
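To spell out the accounting (the symbols here are mine): if the observed poll-vs-result gap is the sum of correlated poll bias and any fraud advantage, and both push the same direction, then

```latex
\underbrace{1.2}_{\text{observed gap}}
  = \underbrace{b_{\text{poll}}}_{\text{correlated poll bias}}
  + \underbrace{f_{D}}_{\text{(D) cheating advantage}},
\qquad
b_{\text{poll}} \ge 0 \;\Rightarrow\; f_{D} = 1.2 - b_{\text{poll}} \le 1.2 .
```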
In order to figure that out we would need to know not only that one party cheats more than the other, but the degree to which they do so.
Again, if you have a better explanation why all these polls would come out with a R+1.2 bias on average, you are welcome to advance your argument, along with your evidence. But fair warning, if you mumble some facile BS about ‘cell phones’ or ‘hurricane Sandy’ I’m going to fucking make just as much fun of you as I made of righties who BS’ed stuff about lefties more likely to lie about being likely voters or pollsters ‘using a 2008 turnout model even though (R) enthusiasm is really high’.

Cell phones? Sandy? I don't know what the underlying cause is, if indeed there is a single one. But you don't have to be able to solve that responsibility-assignment problem in order to know whether there is bias.
If every poll is making the same assumption about the distribution of the electorate (for example, assuming there is no correlation between being willing to take 5 minutes to talk to a pollster and supporting Romney), and that assumption is wrong, then they could all very easily make correlated errors because they would have the same bias. Not the same variance, but the same bias. Since all predictive systems will make both bias- and variance-derived errors, if the bias terms are correlated, the error terms will be correlated. (Less strongly, due to canceling variances, but still somewhat correlated.)
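One way to see that dilution, in a sketch whose magnitudes are again invented: give two polls the same bias term and independent noise, then look at how their errors co-move across many hypothetical cycles.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000  # hypothetical repeated election cycles

bias = rng.normal(1.2, 1.5, n)         # bias shared by both polls, cycle to cycle
poll_a = bias + rng.normal(0, 2.0, n)  # poll A's error = shared bias + own noise
poll_b = bias + rng.normal(0, 2.0, n)  # poll B's error = same bias + different noise

# corr = var(bias) / (var(bias) + var(noise)) = 2.25 / (2.25 + 4) = 0.36:
# positive, but diluted below 1 by the uncorrelated variance terms.
print(f"correlation of errors: {np.corrcoef(poll_a, poll_b)[0, 1]:.2f}")
```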
I think I’ve earned the right to say this because I have been and continue to be consistently on the side of the quants: if you don’t see Nate Silver’s table as, absent other quantified and supported explanations, prima facie evidence of the size of the (D) cheating advantage, then guess what? You’re not on the side of the quants, and you must hate math.

I think I've also earned the right to say this, because I have also consistently been on the same side of this issue. So... ummm... there. Take that?
PS If every poll's errors were uncorrelated, why would we see this?
If Charmer's story is correct, and polls are wrong only because of cheating, then it shouldn't matter what kind of poll was conducted. They'd all accurately reflect how people intended to vote, people would go out and vote that way, and then the results would be skewed by fraud. All the poll types would show the same GOP overestimation/Dem cheating advantage.
This seems to be pretty clear evidence that the polling process itself introduces correlated errors.
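The test is simple enough to write down. With made-up numbers standing in for Silver's actual breakdown (I don't have his table in front of me), the fraud-only story predicts one shared offset across modes, so mode-sized gaps between group means are the tell:

```python
import numpy as np

# Hypothetical per-poll errors (points of (R) overestimate), grouped by
# methodology. These numbers are invented for illustration only.
polls = {
    "live phone": [2.1, 1.8, 2.5, 1.9],
    "robo-poll":  [1.4, 1.1, 1.7],
    "internet":   [0.3, 0.1, 0.5],
}

# Under fraud-only, every mode shares one offset and these means should
# agree up to sampling noise; systematic gaps point at method-specific bias.
for mode, errs in polls.items():
    print(f"{mode:10s} mean error {np.mean(errs):+.2f}")
```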
I'm editing the above to reflect a point Charmer (aka RWCG, which is what I'll call him (?) from now on) made in the comments.
Internet polls more closely reflect the observed outcome? (* Where the observations include effects of fraud.) Interesting. I would like to know more about their process.