Correcting outliers: The effect of race, education, and the uninsured on Trump’s vote


In my last post, I talked about how regression can be a useful tool to tease apart the relationships among correlated variables. I also talked about how outliers can be problematic because they can decrease the statistical and predictive power of regression models. But sometimes, as you investigate a story, you will have no choice but to interrogate an outlier. For example, let me show you how examining variables that might account for Washington, DC’s low Trump vote can reveal another side of America’s health insurance story at the same time.

One way of dealing with an outlier is simply to delete it from the analysis. Doing so decreases statistical power (the probability of finding a significant predictor when one does exist) and removes potentially valuable information from the model. Examining the characteristics of the outlier can be more fruitful, because it may reveal a covariate that was overlooked. I did this in my post on how Washington, DC differs from the states, and it gave me an idea for another covariate to consider in addition to the ones already in the model: concentration of hate groups, percent uninsured, percent with a bachelor’s degree or higher, and percent in poverty.

In that post, I found that DC is among the least white of the jurisdictions considered. Only 40.2% of the District’s population identifies as white or Caucasian. Hawaii is the only state with a smaller white share of its population, at 25.4%. In the exit poll for last year’s election, 60% of white women without a college education voted for Trump, as did 71% of white men without a college education. Among nonwhite voters, 74% voted for Clinton.

Adding percent white to the model significantly improved its fit when DC is included, with 78.5% of the variability in Trump’s vote accounted for. The variables for hate groups and percent in poverty were not significant and were therefore excluded to avoid decreasing the model’s statistical power. The variables for percent with bachelor’s degrees, percent white, and percent uninsured were significant (meaning the p-value is less than 0.05). The output, in the form most statistical packages produce:

The column labeled “coefficients” gives the estimated values for the regression equation that I spelled out in previous posts. The current equation reads:

Trump % of the vote = 51.55 – 1.11*(% bachelor’s) + 0.31*(% White) + 0.74*(% Uninsured)

This says that when all of the covariates are equal to zero, Trump is predicted to have 51.55% of the vote. For every one-percentage-point increase in the percent with bachelor’s degrees, there is an estimated 1.11-point decrease in Trump’s vote, holding the other variables constant. For every one-point increase in the state’s percent of white population, there is an estimated increase of 0.31 points, and for every one-point increase in the state’s percent uninsured, there is an estimated increase of 0.74 points.
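To make the interpretation concrete, here is a minimal Python sketch that plugs values into the equation above. The coefficients are the ones reported in this post; the function name and the example state (30% bachelor’s, 70% white, 10% uninsured) are hypothetical and chosen only for illustration.

```python
# Coefficients as reported in the fitted equation above.
INTERCEPT = 51.55
COEF_BACHELORS = -1.11
COEF_WHITE = 0.31
COEF_UNINSURED = 0.74


def predict_trump_share(pct_bachelors, pct_white, pct_uninsured):
    """Return the predicted Trump vote share (in percent) for one state."""
    return (INTERCEPT
            + COEF_BACHELORS * pct_bachelors
            + COEF_WHITE * pct_white
            + COEF_UNINSURED * pct_uninsured)


# Hypothetical state (not real data): 30% bachelor's, 70% white, 10% uninsured.
baseline = predict_trump_share(30.0, 70.0, 10.0)

# Same hypothetical state with one more percentage point of uninsured residents.
one_more_uninsured = predict_trump_share(30.0, 70.0, 11.0)

print(round(baseline, 2))                       # predicted share for the baseline
print(round(one_more_uninsured - baseline, 2))  # change equals the uninsured coefficient
```

Holding the other two inputs fixed and raising percent uninsured by one point moves the prediction by exactly the uninsured coefficient, which is what “for every one-point increase” means in a multiple regression.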

The column labeled “standard error” is an estimate of the uncertainty in each coefficient. The column labeled “t stat” is the test statistic for determining whether the coefficient is significantly different from zero; it is the coefficient divided by its standard error. The “p-value” is the estimated probability of observing a coefficient at least this far from zero when the true coefficient is zero. By convention, when the p-value is less than 0.05 we conclude that the true coefficient is different from zero. The last two columns show the lower and upper bounds of a 95% confidence interval for the coefficient. The confidence interval says that if the estimates were made repeatedly, the interval would contain the true coefficient 95% of the time. If the lower and upper bounds do not straddle zero, that is equivalent to the coefficient being significantly different from zero at the 0.05 level. A nonzero coefficient means that a positive or a negative relationship exists between the predictor and the outcome variable.
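The way these columns relate to one another can be sketched in a few lines of Python. The coefficients and standard errors are the ones quoted in this post. The sketch uses a normal approximation from the standard library, which is an assumption on my part: statistical packages use the t distribution with the model’s residual degrees of freedom, so their p-values and interval bounds will differ slightly from these.

```python
from statistics import NormalDist

# (coefficient, standard error) pairs quoted in this post.
ESTIMATES = {
    "pct_bachelors": (-1.11, 0.15),
    "pct_white": (0.31, 0.06),
    "pct_uninsured": (0.74, 0.26),
}


def summarize(coef, std_err, level=0.95):
    """Return (t_stat, p_value, lower, upper) using a normal approximation."""
    std_normal = NormalDist()
    t_stat = coef / std_err
    # Two-sided p-value: probability of a statistic at least this far from zero.
    p_value = 2 * (1 - std_normal.cdf(abs(t_stat)))
    # Critical value for the requested confidence level (about 1.96 for 95%).
    z_crit = std_normal.inv_cdf(0.5 + level / 2)
    lower = coef - z_crit * std_err
    upper = coef + z_crit * std_err
    return t_stat, p_value, lower, upper


for name, (coef, se) in ESTIMATES.items():
    t_stat, p_value, lower, upper = summarize(coef, se)
    excludes_zero = lower > 0 or upper < 0
    print(f"{name}: t={t_stat:.2f}, p={p_value:.4f}, "
          f"95% CI=({lower:.2f}, {upper:.2f}), excludes zero: {excludes_zero}")
```

For all three predictors the approximate interval excludes zero and the approximate p-value falls below 0.05, which illustrates why the two criteria are interchangeable.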

The scatterplot above shows the actual values (blue diamonds) and predicted values (red squares) for percent white and percent that voted for Trump, adjusting for percent with bachelor’s degrees and percent uninsured. The actual and predicted values for DC and Hawaii are very close to each other, which suggests a good fit. One state that fits this model poorly is Vermont, where the actual vote for Trump is 10 percentage points lower than the predicted vote. The predicted value can be seen directly above the blue diamond for Vermont.

The above scatterplot for percent with a bachelor’s degree or higher suggests that this predictor does not fit Trump’s vote as well as percent white does. This is reflected in the greater standard error for this predictor (0.15) than for percent white (0.06). The prediction for DC is the poorest for this predictor, with the largest gap between the actual and predicted values. The statistical significance of the coefficient indicates that the overall trend is still significant in the negative direction.

The above scatterplot, using percent uninsured as a predictor, shows an even poorer fit for Trump’s percent of the vote. DC and Alaska are clearly a poor fit for this predictor, as are many other states. The standard error for this predictor (0.26) is larger than for the other predictors, which reflects the poorer fit, though the coefficient is still statistically significant.

Multiple regression is a potentially powerful tool for teasing apart the relationships between predictor variables and a specific outcome when conducted correctly. Adding the right covariates, such as race, can help alleviate the effects of an outlier such as Washington, DC. It’s always better to include all of the data to give as complete a picture as possible.

We now see that as the percent of a state’s population with a bachelor’s degree or higher increases, the percent vote for Trump decreases. At the same time, as the percentages of white and uninsured residents in a state increase, Trump’s percent of the vote also increases. In the presence of these variables, the concentration of hate groups and the percent of the state in poverty are no longer significant predictors of Trump’s vote.

As Trump and the Republican-controlled Congress prepare to repeal the Affordable Care Act (ACA, or as the GOP says, Obamacare), the Congressional Budget Office estimates that 23 million Americans will lose their health insurance under the House version of the bill and 22 million under the Senate version. In this model, the uninsured rate in each state is positively associated with Trump’s percent of the vote. Does Trump believe that increasing the uninsured rate will increase his share of the vote in 2020?

Poverty was not associated with Trump’s vote in 2016. The decrease in the estimated number of uninsured since the ACA went into effect in 2014 is mostly due to Medicaid expansion for the poorest individuals and subsidies that allow lower-income individuals to purchase health insurance. Increasing the number of uninsured may not decrease Trump’s vote, but it is unlikely to increase it.

Image: Clayton Shonkwiler.