24/12/2017

Get the story right: Logistic regression as a tool for journalists querying categorical outcomes

In previous posts, I explored how linear regression can help journalists understand how different variables interact in a story. By ferreting out the relationships between several predictor variables and a continuous outcome variable, such as the percentage of support for President Trump, it is possible to examine how different demographic features are associated with that outcome.

Methods such as the chi-square test or Fisher’s exact test can also be used by journalists to test for a significant association between two categorical variables, such as Trump’s and Clinton’s placements in the 2016 primary contests, laid out in a contingency table. In that setup, Clinton’s placement could be treated as the predictor variable and Trump’s placement as the outcome variable. To go further and estimate how the probability of a categorical outcome changes with one or more predictors, and to check that you’ve got the right angle for your story, logistic regression is needed.

Logistic regression is used to estimate the probability of an event given a particular pattern of variables. In the example below, I take Trump’s percentage of the vote in each state and dichotomize it into whether he won the state or not. Turning continuous variables into categorical variables can be useful and informative in the right circumstances. But it’s important to remember that, in other circumstances, doing so may mask useful information. This will be the topic for a future post.

Logistic regression is often used in medical research to estimate the odds of contracting a disease when exposed to a given factor. Nate Silver at fivethirtyeight.com uses a variation of this method to forecast the outcome of elections and sporting events. In the last election, Silver gave Trump a 35% probability of winning based on state and national polling data. We all know how that turned out, although his group was right about Clinton winning the popular vote.

The logit link turns the probability p of event y into a quantity that is linear in the predictors, which allows for testing the effect of each predictor on the likelihood of event y:

ln(p / (1 - p)) = β0 + β1·x1 + ... + βk·xk

Here ln stands for the natural logarithm, β0 is the intercept, and β1 through βk are the regression coefficients for the predictors x1 through xk. Predictor variables can be continuous or categorical.

An example of logistic regression with one predictor variable, fitted in R, follows. The data are the 50 states plus DC from the 2016 election, along with demographic data for each state. The outcome variable is whether Trump won the state. The sole predictor is the percentage of the state’s population that is white. The key figures below come from the R output.

Odds ratios can be found from the regression coefficients by taking the exponential of the coefficient, e^β. In this case e^0.04135 gives an odds ratio of 1.04. This suggests that for every one-percentage-point increase in the white share of a state’s population, the odds of Trump winning that state increase by about 4%. The Wald, or Z, statistic was not significant, with a p-value of 0.0852. This statistic is conservative, meaning that it is more likely to result in a type II error (that is, letting a guilty man go free).

A less biased test is the change in deviance between the null model (the model with no predictors) and the model with the predictors under consideration. In this case the difference is 69.104 - 65.689 = 3.415, df = 1. This statistic follows a chi-square distribution, with degrees of freedom (df) equal to the difference in the number of predictors between the two models. By this test, the predictor is closer to significance, with a p-value of 0.06.

A more complex model adds the percentage of the population that is highly religious as a predictor alongside the percentage that is white. In this model, both predictor variables are significant, with odds ratios of 1.20 for percent white and 1.36 for percent highly religious. The change in deviance compared with the previous model is also significant (65.689 - 30.796 = 34.893, df = 1, p < 0.001).
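The odds ratio and the two deviance comparisons above can be rechecked by hand. The post's models were fitted in R; the sketch below uses plain Python only to redo the arithmetic from the figures quoted in the text. The function names are illustrative, and the p-value shortcut assumes exactly one degree of freedom, which holds for both comparisons here.

```python
import math

def odds_ratio(beta):
    """Odds ratio for a one-unit increase in a predictor: e^beta."""
    return math.exp(beta)

def deviance_test_p_value(deviance_smaller_model, deviance_larger_model):
    """P-value for the drop in deviance when ONE predictor is added (df = 1).

    With 1 degree of freedom the chi-square upper tail has a closed form:
    P(X > x) = erfc(sqrt(x / 2)). This shortcut is not valid for df > 1.
    """
    change = deviance_smaller_model - deviance_larger_model
    return math.erfc(math.sqrt(change / 2.0))

# Coefficient for percent white in the one-predictor model (from the post)
or_white = odds_ratio(0.04135)

# Null model vs. one-predictor model (deviances from the post)
p_one_predictor = deviance_test_p_value(69.104, 65.689)

# One-predictor model vs. two-predictor model (deviances from the post)
p_two_predictors = deviance_test_p_value(65.689, 30.796)

print(f"odds ratio, percent white: {or_white:.2f}")
print(f"p-value, null vs. one predictor: {p_one_predictor:.3f}")
print(f"p-value, one vs. two predictors: {p_two_predictors:.2e}")
```

The first p-value lands close to the 0.06 quoted above, and the second is far below 0.001.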

With these model parameters we can estimate the probability of Trump winning a state with a given pattern of covariates using the formula:

p = 1 / (1 + e^-(β0 + β1·x1 + β2·x2))

where x1 is the percent white and x2 is the percent highly religious. Suppose we have a state that is 70% highly religious but only 50% white. What would be Trump’s chance of winning that state? Plugging these numbers into the formula, the model estimates that this hypothetical state has a 61% chance of giving its electoral votes to Donald Trump. Louisiana, with a population that is 62% white and 71% highly religious, had a 96% chance of going to Trump, which it did.
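The probability calculation described above can be sketched in Python. The fitted coefficients are not quoted in the text, so this sketch makes two assumptions: the slopes are recovered as the natural logs of the reported odds ratios (1.20 and 1.36), and the intercept is back-solved from the 61% example rather than taken from the R output. The function and variable names are illustrative.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def predicted_probability(intercept, coefficients, values):
    """Inverse logit: turn the linear predictor into a probability."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-linear))

# Slopes recovered from the (rounded) odds ratios reported in the post
b_white = math.log(1.20)
b_religious = math.log(1.36)

# ASSUMPTION: intercept back-solved so that a state that is 50% white and
# 70% highly religious gets the 61% probability quoted in the post.
intercept = logit(0.61) - (b_white * 50 + b_religious * 70)

coefficients = [b_white, b_religious]
hypothetical = predicted_probability(intercept, coefficients, [50, 70])
louisiana = predicted_probability(intercept, coefficients, [62, 71])

print(f"hypothetical state: {hypothetical:.2f}")
print(f"Louisiana: {louisiana:.2f}")
```

Under these assumptions the Louisiana estimate comes out within about a point of the 96% quoted above; the small gap is what one would expect from working with rounded odds ratios.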

Logistic regression can also be extended to cases with more than two possible outcomes. Multinomial logistic regression is used when the outcomes are not ordered, such as predicting which sports team someone prefers. Ordinal logistic regression is used when the responses are ordered, for example the stages of cancer. As with linear regression, one must be careful to check for outliers and influential observations, which could skew the results.