Prediction, Explanation and the November Surprise


Given the overhyped promise of data science, the "shock" at the broad failure to predict the election outcome was inevitable. A skim of the media and technical accounts suggests that a better understanding of prediction and explanation is necessary for fewer surprises and sounder analytics.

Let's take two examples (somewhat oversimplified to make a point). First, a game-theoretic account derived from observed behavior in a two-player game in which one player gets a sum of money and decides how to share it with another, who can only accept or reject the offer: even though it is rational to accept any offer as better than nothing, "we don’t behave rationally ... [but] emotionally ... we reject offers we consider unfair".

    "… there’s been plenty of economic growth inside the U S--vastly increasing the pile of money to be divided. But ... The first player consists of those people who have benefitted from globalization and trade: the “elites”, derisively referred to as “the 1%”. And the second player ... everyone ... who aren’t in those upper income echelons ... are seeing the pile of money in the game growing ever bigger. And ... the other player keeps an ever-larger share of that pile for themselves ... Trump allowed them to channel their feelings into a rejection of the proposal that has been made—on trade, immigration, and globalisation, and dividing up those spoils ...[and they threw] everything out". -- What voters do when they feel screwed--the economic theory.

Second, a complex algorithm that runs a multitude of sophisticated simulations on a "raft of carefully collected public and private polling numbers, as well as ground-level voter and early voting data”. Assume that “the raft” consists of vote predictors -- vote correlates discovered by computers (What didn't Clinton’s data-driven campaign's algorithm, named Ada, see?).

Suppose that (1) in the former case an appropriate hypothesis -- a correlation at the aggregate level between the vote and variables measuring affinity to the first and second player -- had been derived and proved accurate beyond pure chance, and (2) in the latter case the algorithm produced an equally accurate prediction. Would you say that both approaches have equivalent explanatory power?

For those who equate prediction with explanation, the answer is yes. For those for whom explanation is about the past and prediction about the future, the question does not come up. But these are views that obscure rather than enlighten.

In both cases there is a data pattern in the form of predictive correlations. In the first case, a theory of individual behavior specifies the causal mechanism that explains how the pattern is produced -- why it exists at the aggregate level. In the second case, the mechanism is of no particular interest and is not specified. In general, behavioral predictions with explanations are more reliable than those without.
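The notion of a correlation "accurate beyond pure chance" can be made concrete with a permutation test. The sketch below is purely illustrative -- the district-level affinity scores and vote shares are made-up numbers, not actual polling data -- and uses only the Python standard library:

```python
import random
import statistics

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient between two sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
# Hypothetical aggregate data: per-district "second player" affinity
# scores and vote shares, constructed to be related plus noise.
affinity = [random.uniform(0, 1) for _ in range(50)]
vote_share = [0.3 + 0.4 * a + random.gauss(0, 0.05) for a in affinity]

observed = pearson_r(affinity, vote_share)

# Permutation test: how often does randomly re-pairing the same numbers
# yield a correlation at least as strong as the observed one?
n_perm = 2000
shuffled = vote_share[:]
exceed = 0
for _ in range(n_perm):
    random.shuffle(shuffled)
    if abs(pearson_r(affinity, shuffled)) >= abs(observed):
        exceed += 1
p_value = exceed / n_perm

print(f"r = {observed:.2f}, permutation p = {p_value:.3f}")
```

A small permutation p-value says only that the pattern is unlikely to be a chance artifact of this sample; it says nothing about the mechanism that produces it, which is exactly the distinction at issue.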

Now, data patterns discovered by computers rather than derived from theory can yield insights for theory development -- candidate causal mechanisms -- and this is what data mining should be about. That is the context of discovery in science; it requires that predictions from the theory inferred from the discovered patterns be tested, in the context of validation, on different data. But, unlike natural phenomena, human behavior is not governed by unchanging universal laws, so it is easier to explain post hoc than to predict. Given the pressure for prediction in industry and politics, the temptation to skip the second context is too strong.
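The risk of skipping validation can be sketched in a few lines. In this toy illustration (all data and thresholds are invented), a cutoff rule is "discovered" by fitting it to one sample, and then checked against a fresh sample in which the underlying behavior has shifted slightly -- mimicking the fact that behavioral regularities, unlike physical laws, need not stay fixed:

```python
import random

random.seed(1)

# Hypothetical voter records: (affinity_score, voted_for_challenger).
def sample(n, threshold):
    data = []
    for _ in range(n):
        a = random.uniform(0, 1)
        data.append((a, a > threshold))
    return data

discovery = sample(200, threshold=0.5)   # data the pattern was mined from
validation = sample(200, threshold=0.6)  # fresh data, behavior has drifted

# "Discovered" rule: the cutoff that best separates the discovery data.
best_cut = max((c / 100 for c in range(100)),
               key=lambda c: sum((a > c) == v for a, v in discovery))

def accuracy(data, cut):
    """Fraction of records the cutoff rule classifies correctly."""
    return sum((a > cut) == v for a, v in data) / len(data)

print(f"discovery accuracy:  {accuracy(discovery, best_cut):.2f}")
print(f"validation accuracy: {accuracy(validation, best_cut):.2f}")
```

The mined rule looks excellent on the data it was mined from and degrades on the new sample -- which is precisely why the context of validation on different data cannot be skipped.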

In this age of big data, data mining, data lakes and machine learning the important difference between prediction and explanation should be understood and kept firmly in mind when performing analytics and assessing their results.


Fabian Pascal, Founder, Editor & Publisher, Database Debunkings

Fabian Pascal is an independent writer, lecturer, and analyst specializing in database management, with emphasis on data fundamentals and the relational model. He was affiliated with Codd & Date and has taught and lectured at the business and academic levels. Clients include IBM, Census Bureau, CIA, Apple, UCSF, and IRS. He is founder, editor, and publisher of Database Debunkings, a Website dedicated to dispelling myths and misconceptions about database management; and the Practical Database Foundations series of papers. Pascal, author of three books, has contributed extensively to trade publications including DM Review, Database Programming and Design, DBMS, Byte, Infoworld, and Computerworld.



Re: Surprised?
  • 1/14/2017 10:29:52 AM

Understanding emotions might well be a clue to getting a better understanding of what factors to consider in prediction. But the complexity and scope of that might just be something we won't have a good handle on for some years to come.

Re: Surprised?
  • 12/19/2016 10:26:25 PM

Odds only give the probability of an outcome. And unfortunately for gamblers and fortune tellers, those odds are not going to predict an outcome. And we have surely learned not to rely on pundits and partisan commentators to predict the future either. With so many complex variables, it's a wonder that we don't mis-predict more often.

Re: Surprised?
  • 12/5/2016 9:53:08 PM

DBdebunker writes:

    "Your comment is essentially a POST-HOC explanation of the failure to predict which, as I argue in the piece, is much easier than developing a theoretical one from which to derive the prediction IN ADVANCE."

Kinda brings to mind the phrase "Hindsight is 20/20 ..."

Re: Surprised?
  • 12/5/2016 5:47:28 PM

To reiterate, the point is not this specific election, or even elections in general. I used the election only as a vehicle for stressing the distinction between prediction and explanation, and the non-scientific tendency in data science to skip validation.

Your comment is essentially a POST-HOC explanation of the failure to predict which, as I argue in the piece, is much easier than developing a theoretical one from which to derive the prediction IN ADVANCE.

Another way to say it: there can be prediction without explanation, but it cannot be considered scientific.

Re: Surprised?
  • 12/5/2016 5:03:49 PM

I don't think the Big Polling Fail in this election was a failure of some elaborate prediction theory but a basic failure of data input validity. How the voting population felt about key economic issues, policy issues (like immigration), etc. certainly constituted underlying factors, but polling/surveying basically tries to gauge what the vote outcome is gonna be from how the surveyed population indicate they intend to vote (or which way they're "leaning", etc.).

There's a fundamental assumption that, by and large, how people say they're gonna vote is pretty indicative of how they're actually gonna vote. In this electoral contest, there were some reliability problems with that key assumption.

Also, as some have suggested in other A2 comment threads, a major problem was that pollsters apparently tended to focus excessively on big-city urban voters and to underestimate the importance of a lot of small-city, small-town, and rural voters.

Some observers and savvy pundits have also noted that apparently there was a considerable dichotomy in motivation. So people might have told pollsters that they favored Clinton, and intended to vote for her ... but when time came to actually go out, stand in line, and make the effort, a lot of them just stayed on the couch, in the office or other workplace, or went back to sleep.

There's a brier patch of other factors that likely played a role (e.g., the obstacles to voting erected by many GOP-controlled state governments). On the whole, a lot of factors could not be assessed by conventional polling, and no matter how sophisticated the analytics of the results, these factors generally remained below the polling radar. And if your survey population is skewed to start ... it's a recipe for an analytics train wreck.


Re: Surprised?
  • 12/5/2016 4:15:24 PM

That's not what my post is about.

Rather, it is about "data science" trying to get away without theorization preceding prediction. In science, predictions are derived from theory and data analysis is used to test them; data science has it upside down and backwards, inferring the theory from the prediction.

Data mining is OK for insight into theorization, but then predictions must be tested on different data, which is not done. That is not scientific.


Surprised?
  • 12/5/2016 3:58:53 PM

In the first place, Nate Silver's final prediction was 70% Clinton, 30% Trump.

Most observers seem to treat this as if it was 99% Clinton, 1% Trump. They are overly surprised and this betrays that they simply don't understand what 30% means.

In response to your topic, the prediction method that provides some basis, some explanation, for the conclusion gives more information. If we understand more about why people are voting a certain way, we can ask additional questions to better examine the 'why'. If it's emotion, for example: how strong are those emotions? How are those emotions changing with time?
