Polling and big data in the age of Trump, Brexit, and the Colombian Referendum
Big data has revolutionized nearly every aspect of modern life, from how we shop, listen to music, and exercise to how we vote. Given the abundance of personal data, it is natural to expect near-clairvoyant accuracy from modern election models. However, recent election predictions, including those for the British referendum on EU membership, the 2016 US Presidential Election, and the 2016 Colombian Referendum, have fallen dramatically short of expectations.
The appeal of polling with big data is rooted in the common, and often incorrect, assumption that if enough people are polled, the poll’s error should be negligible. At The Data Incubator, a data science training company, correcting this central misconception about big data is one of the core lessons we try to impart to our students. We need look no further than the history of polling to find examples of it.
Though statisticians of the past may not have had access to the computational power we now possess, their mistakes illuminate not only where modern pollsters have gone wrong but also both the promise and the potential pitfalls of big data. Despite surveying over 10 million Americans, Literary Digest incorrectly predicted that the Democratic incumbent, President Franklin Delano Roosevelt, would lose the 1936 US Presidential Election by an overwhelming margin. The magazine only surveyed its subscribers and households with telephones or automobiles - a demographic radically unrepresentative of the entire American electorate. Despite the poll’s large sample size, the sample was still biased. Even now, when smartphones are ubiquitous and nearly everyone is online, similar bias persists: even with massive sample sizes and online polling, it is still difficult to acquire a truly representative sample of the population.
Even with a large and truly representative sample, a poll’s prediction may still be biased. Potentially widespread dishonesty from survey participants, especially during controversial elections, can distort a poll’s outcome. Exit polls during the 1982 California gubernatorial election incorrectly predicted the victory of Tom Bradley, an African American and the mayor of Los Angeles, by a substantial margin. Post-election analysis attributed this miscalculation to white voters deliberately misreporting their votes to avoid appearing racist. This phenomenon is known as the Bradley Effect. A similar phenomenon known as the “Shy Tory Factor” occurred during the 1992 UK general election, when polls incorrectly predicted a Labour majority. In this case, instead of deliberately misleading pollsters, a significant portion of Conservative voters simply refused to reveal their voting preference and consequently skewed predictions.
Polls are vulnerable not only to voters lying about their political preferences, but also to voters lying about their willingness to vote. A large portion of nonvoters misreport their intention to vote. In an election such as the 2016 US Presidential Election, with particularly unpopular candidates and high voter apathy, it is difficult to determine how polling data relates to actual votes. Overestimating or underestimating the willingness of a candidate’s supporters to vote directly affects model accuracy.
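To make the turnout problem concrete, here is a minimal sketch in Python. The candidates "A" and "B", the stated support figures, and the turnout rates are all hypothetical numbers chosen for illustration, and `projected_share` is a helper invented for this example rather than any pollster's actual method.

```python
def projected_share(stated_support, turnout):
    """Convert stated support into projected vote share.

    stated_support: fraction of poll respondents backing each candidate.
    turnout: assumed fraction of each candidate's supporters who actually vote.
    """
    # Expected votes per candidate = stated support x assumed turnout
    votes = {c: stated_support[c] * turnout[c] for c in stated_support}
    total = sum(votes.values())
    return {c: v / total for c, v in votes.items()}


# Hypothetical poll: candidate A leads among respondents
stated = {"A": 0.52, "B": 0.48}

# Hypothetical turnout: B's supporters are more likely to show up
shares = projected_share(stated, {"A": 0.55, "B": 0.65})

for candidate, share in shares.items():
    print(f"{candidate}: {share:.1%}")
```

Under these assumed numbers, a four-point lead for A among respondents becomes a projected loss once turnout is taken into account, which is why a wrong turnout assumption can flip a forecast even when stated preferences are measured accurately.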
The previous examples illustrate the assortment of risks and errors associated with polling in the age of big data. In any type of statistical sampling, there are two broad categories of error: statistical error (deviations caused by inherent randomness) and sampling bias (error from biased sampling techniques). Though big data has the potential to virtually eliminate statistical error, it unfortunately provides no protection against sampling bias and, as we’ve seen, may even compound the problem. This is not to say big data has no place in modern polling; in fact, it may provide alternative means of predicting election results. However, as we move forward, we must consider the limitations of big data and our overconfidence in it as a polling panacea.
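The difference between error from inherent randomness and error from a biased sampling technique can be sketched with a small simulation. Everything here is an assumption made for illustration: the true support of 50%, the 57% support among people the biased poll can reach, the sample sizes, and the helper `simulate_poll`.

```python
import random


def simulate_poll(support_in_frame, sample_size, rng):
    """Return the share of sampled respondents who back the candidate.

    support_in_frame: probability that a person the poll can reach
    supports the candidate (equals true support only if the frame is
    representative).
    """
    yes = sum(rng.random() < support_in_frame for _ in range(sample_size))
    return yes / sample_size


TRUE_SUPPORT = 0.50          # hypothetical true support in the electorate
BIASED_FRAME_SUPPORT = 0.57  # hypothetical support among reachable respondents

rng = random.Random(42)  # fixed seed so the run is repeatable

# A small poll drawn from a representative frame
small_random = simulate_poll(TRUE_SUPPORT, 1_000, rng)

# A huge poll drawn from an unrepresentative frame
large_biased = simulate_poll(BIASED_FRAME_SUPPORT, 1_000_000, rng)

small_error = abs(small_random - TRUE_SUPPORT)
large_error = abs(large_biased - TRUE_SUPPORT)

print(f"small representative poll: {small_random:.3f} (error {small_error:.3f})")
print(f"large biased poll:         {large_biased:.3f} (error {large_error:.3f})")
```

In this sketch the million-person poll converges very precisely, but on the wrong number: more respondents shrink the random scatter around the frame's level of support, not the gap between the frame and the electorate.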
Image: Matt Brown.