20/6/2017

## What’s your angle? Using regression to understand how different variables in a story interact

In my post in March on why standardization matters for data journalists, I wrote about how it makes variables comparable from one state to another (or one data point to another). The example I used, also pictured in the chart below, showed how two standardized variables (hate groups per million and Trump’s percent of the vote) can be compared in a correlation analysis.

It’s an easy mistake to conclude that a correlation between two variables proves that one causes the other. One alternative explanation relates to the direction of cause and effect. It could be argued, for instance, that the act of voting for Trump caused the proliferation of hate groups. But this argument is doubtful because most of the Southern Poverty Law Center’s list of hate groups was compiled before the election.

Another alternative explanation is that a third variable explains the relationship. Say, the sentiments in a state that led people to be more likely to vote for Trump also created an environment where hate groups can thrive. Other possible third variables that could account for this relationship are poverty, education, and a lack of health insurance. One way to address this problem is to use a multiple regression model to assess the collective impact of these variables on the outcome variable of interest.
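To make the third-variable problem concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the numbers are made up (they are not state data), and the helper names `fit_line` and `residuals` are mine. In the toy data, `y` depends only on a hidden variable `z`, yet `x` looks like a strong predictor because it also tracks `z`. Holding `z` constant is done here by stripping `z` out of both `x` and `y` and regressing what is left, which for two predictors gives the same coefficient on `x` as a multiple regression would.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(x, y):
    """Return the part of y that a straight line in x does not explain."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Made-up numbers, NOT real data: z plays the role of a hidden third variable.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]   # x tracks z, with some wobble
y = [2.0 * zi for zi in z]            # y depends on z only, never on x

# Simple regression: x appears to be a strong predictor of y.
_, simple_slope = fit_line(x, y)

# Holding z constant: remove z from both x and y, then regress the leftovers.
_, adjusted_slope = fit_line(residuals(z, x), residuals(z, y))

print(f"slope of y on x alone:        {simple_slope:.2f}")
print(f"slope of y on x, z held fixed: {adjusted_slope:.2f}")
```

In this toy case the slope on `x` is about 1.66 when `z` is ignored and essentially zero once `z` is accounted for. Real data are rarely this clean, so expect a coefficient to shrink, not vanish.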

The chart below shows a scatter plot of the states, with Trump’s percent of the vote on the Y axis and the concentration of hate groups in each state on the X axis. A best-fit simple regression line was added to the data with the equation:

Trump % of the Vote = 42.84 + 2.33*(Hate Group Rate)

This equation states that for every increase of 1 group per million in the hate group concentration, Trump’s share of the vote increases by an estimated 2.33 percentage points. This relationship accounts for 20.1% of the variability in Trump’s vote share. If 100% of the variability were accounted for, the states on the plot would form a perfect straight line sloping upward. Plugging a hate group rate into the above equation gives a predicted percent of the vote for Trump that falls on the best-fit straight line.
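As a minimal sketch, the fitted line can be written as a small Python function. The function name `predict_trump_share` is mine, chosen for illustration; the intercept and slope come straight from the equation above, and the input rates are illustrative values, not figures for any particular state.

```python
# Coefficients taken from the simple regression equation above.
INTERCEPT = 42.84  # predicted vote share (%) when the hate group rate is 0
SLOPE = 2.33       # percentage points per additional hate group per million


def predict_trump_share(hate_group_rate):
    """Return the predicted Trump vote share (%) for a hate group rate
    expressed in groups per million residents."""
    return INTERCEPT + SLOPE * hate_group_rate


# Illustrative rates only; these are not real state values.
for rate in (0, 1, 5):
    print(f"{rate} groups/million -> {predict_trump_share(rate):.2f}%")
```

A rate of zero returns the intercept, 42.84%, and each additional group per million adds 2.33 percentage points to the prediction.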

*Image: Paul Ricci.*

For another example of how a regression equation is used, think of when you get on an exercise bike or a treadmill at the gym. If it is an old machine that does not ask for the person’s gender, age, or weight and has no heart monitor, it plugs the work the person did on the machine into a regression equation to estimate how many calories they burned. That equation would have been computed from a normative sample of people whose work and caloric output were measured precisely. We will return to this example later.

Looking back at our Trump chart above, one might conclude that having a higher population-adjusted number of hate groups in a state caused that state to have a higher percent of the vote for Trump. That conclusion could be a mistake.

For the Trump example, a lack of education was cited by Nate Silver of fivethirtyeight.com as a strong predictor of Trump’s percent of the vote. When I added the variable of education (defined as the percent of the state’s population with a bachelor’s degree or higher) to the model I found this equation:

Trump % of the Vote = 81.9 + 1.32*(Hate Group Rate) – 1.22*(% Bachelors or Higher)

This equation estimates that for every unit increase in the hate group rate, Trump’s share of the vote rises by 1.32 percentage points, while for every 1-point increase in the percent of the state with a bachelor’s degree or higher, Trump’s share falls by an estimated 1.22 percentage points. This model accounted for 58.3% of the variability in the data. Given that the model with hate groups as the only predictor accounted for 20.1% of the variability, adding education accounts for an additional 38.2 percentage points. From this model, it would appear that education is a stronger predictor of Trump’s vote. The hate group plot with the values the above equation predicts for each state can be seen below.
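The same arithmetic can be sketched in Python. The function name `predict_trump_share_two` and the sample inputs are assumptions for illustration; the coefficients and the two variability figures are the ones reported above.

```python
# Coefficients taken from the two-predictor equation above.
INTERCEPT = 81.9
B_HATE_GROUPS = 1.32   # percentage points per hate group per million
B_EDUCATION = -1.22    # percentage points per 1% with a bachelor's or higher


def predict_trump_share_two(hate_group_rate, pct_bachelors):
    """Return the predicted Trump vote share (%) from both predictors."""
    return (INTERCEPT
            + B_HATE_GROUPS * hate_group_rate
            + B_EDUCATION * pct_bachelors)


# Share of variability accounted for by each model, as reported in the text.
PCT_EXPLAINED_HATE_ONLY = 20.1
PCT_EXPLAINED_BOTH = 58.3
added_by_education = PCT_EXPLAINED_BOTH - PCT_EXPLAINED_HATE_ONLY

# Illustrative inputs only; not the values for any real state.
example = predict_trump_share_two(hate_group_rate=2, pct_bachelors=30)
print(f"Predicted share: {example:.2f}%")
print(f"Added by education: {added_by_education:.1f} percentage points")
```

Note that the hate group coefficient shrank from 2.33 to 1.32 once education entered the model, which is what you would expect if part of the original relationship was really carried by education.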

*Image: Paul Ricci.*

The blue diamonds on the graph represent the actual values for the states, and the orange squares represent the values the above regression equation predicts for them. Alaska and Hawaii are both on the Y axis. Their actual values are farther apart than the values the equation predicts for them. The model fits Montana much better, as its actual and predicted values are much closer together.

*Image: Paul Ricci.*

Above is a plot of education versus Trump’s percent of the vote with the predicted values from the equation for each state. The values on this plot more closely resemble a straight line, which indicates a better fit for this variable. If 100% of the variability were accounted for by these two variables, the orange and blue points would all sit right on top of each other.
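The "percent of variability accounted for" quoted throughout is commonly called R-squared. As a rough sketch of how it compares actual and predicted values, here is one way to compute it in Python. The function name `r_squared` and the three-point data set are made up for illustration and have nothing to do with the state data.

```python
def r_squared(actual, predicted):
    """Return the share of variability in `actual` that the predictions
    account for: 1 minus (leftover variation / total variation)."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total


# Made-up values, NOT real data.
actual = [40.0, 50.0, 60.0]
perfect = list(actual)          # every prediction lands on the actual value
flat = [50.0, 50.0, 50.0]       # always predicts the average

print(r_squared(actual, perfect))  # 1.0
print(r_squared(actual, flat))     # 0.0
```

When every predicted point sits on top of its actual point, the result is 1.0 (100%). A model that can only guess the average accounts for none of the variability and scores 0.0.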

Returning to the example of the exercise bike in the gym, the newer machines do ask the user for their gender, age, and weight to give a more precise estimate of how many calories they burned. Gender is an important variable because men’s and women’s physiologies just aren’t the same. Age is important because metabolism slows as we get older. Weight is important because the bigger you are, the more calories you have to burn. The average heart rate during the workout is important as an added indicator of the user’s metabolism. Like adding education to the Trump model, each additional predictor can improve the estimate.

Regression is a valuable tool for helping journalists understand how different variables in a story interact. It can help rule out third variables that are included in the model and clarify the role of the variables you are most interested in. However, it cannot rule out variables that were not considered in the model. Experimental methodologies are recommended for ascertaining cause-and-effect relationships, but when working with people these may not be practical or ethical to use. This article merely scratches the surface of issues related to regression and correlation methods. A whole semester course would be needed to cover these methods in detail.