Why standardization matters: An example of hate groups in the US


Every year the Southern Poverty Law Center (SPLC) publishes a map showing the number of hate groups in each state. They compile the list from online searches and police reports. For the most recent year, 2016, they reported an increase of 25 groups nationwide, from 892 in 2015 to 917 in 2016, with most of these being anti-Islam groups. California had the most groups at 79, followed by Florida with 63, Texas with 55, New York with 47, and Pennsylvania with 40. While these states have more hate groups overall than the other states, they are also among the most populous states. As such, I would expect them to have more of everything to serve their large populations. I would expect them to have more NAACP chapters, for instance. They should also have more churches, but does that necessarily mean that they have more religious fervor than the other states?


Image: The Hate Map.

This is why standardization of data is important. It makes data comparable. To use an example from American cinema, when one asks what the most popular film of all time at the box office is, the answer most people might give would be Titanic, Avatar, or Star Wars Episode VII: The Force Awakens. If one looks at the total dollars made at the box office, they would be right about Star Wars Episode VII. However, the price of a movie ticket is considerably higher today than it was in the past. If one adjusts for the price of a ticket, Gone with the Wind becomes the most popular film of all time, while Star Wars Episode VII ranks eleventh. If one looks at other indicators of popularity, such as DVD, Blu-ray, TV, and Netflix viewership, answering this question becomes even more complicated.

Returning to our example of hate groups in the US, I decided to adjust for population by dividing the raw number of groups in a geographic area by its population. To put this standardized number on a more readable scale, I multiplied by one million to express it as groups per million. For example, the US rate of 2.83 groups per million is calculated by dividing the 917 groups nationwide by the US population and multiplying by one million.
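As a minimal sketch of this calculation in Python, the helper below divides a group count by a population and scales the result to groups per million. The function name is only an illustrative choice. The example uses the Washington, DC figures cited in this post (21 groups, population 681,170), since both inputs are stated here.

```python
def groups_per_million(groups: int, population: int) -> float:
    """Return the number of groups per one million residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return groups / population * 1_000_000


# Washington, DC figures cited in this post: 21 groups, population 681,170.
dc_rate = groups_per_million(21, 681_170)
print(f"DC: {dc_rate:.2f} groups per million")
```

The same function applies to any state or to the US as a whole once its group count and population are known.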


When the raw numbers for each state plus the District of Columbia (DC) are adjusted for population, California had a rate of 2.01 groups/million, which was well below the national rate. In fact, it ranked thirty-fifth. Of the other four states, Florida had a rate of 3.06 groups/million, ranking eighteenth; Texas had a rate of 1.97 groups/million, ranking thirty-sixth; New York had a rate of 2.38 groups/million, ranking thirty-second; and Pennsylvania (my home state) had a rate of 3.13 groups/million, ranking sixteenth.
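To illustrate how standardizing reorders these five states, here is a short Python sketch that sorts them once by raw count and once by rate. The counts and rates are the ones reported in this post. The variable names are illustrative.

```python
# (state, raw group count, groups per million) as reported in this post
states = [
    ("California", 79, 2.01),
    ("Florida", 63, 3.06),
    ("Texas", 55, 1.97),
    ("New York", 47, 2.38),
    ("Pennsylvania", 40, 3.13),
]

# Sort from highest to lowest by raw count, then by population-adjusted rate.
by_count = [name for name, _, _ in sorted(states, key=lambda s: s[1], reverse=True)]
by_rate = [name for name, _, _ in sorted(states, key=lambda s: s[2], reverse=True)]

print("By raw count:", by_count)
print("By rate:     ", by_rate)
```

Among these five states, Pennsylvania moves from last by raw count to first by rate, while California drops from first to fourth.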

The five highest rates belonged to Washington, DC with 30.83 groups/million, Montana with 9.59 groups/million, South Dakota with 8.01 groups/million, Idaho with 7.13 groups/million, and Mississippi with 6.02 groups/million. We can see that Washington, DC, with 21 groups for a population of 681,170, is an extreme outlier compared with the states. DC is an outlier on many measures, as can be seen in the table below. We can compare the hate group concentration to other statistics describing the populations in those states. One statistic of particular interest is the percent of the vote that Donald Trump received. Percentages are another example of standardizing data, in this case to compare election outcomes between states. The word percent comes from the Latin per centum, meaning "per hundred."

The variables for each state can be tested for a linear association by computing a correlation coefficient. In the graph below, Trump's percent of the vote is on the y axis and the hate group rate in groups per million is on the x axis. This graph is called a scatterplot. Using a regression/correlation method, we can find the best-fit straight line for the scatterplot. The line slopes upward, which means that as the concentration of hate groups increases, Trump's percent of the vote tends to increase as well. The concentration of hate groups accounts for 20.1% of the variability in Trump's vote across the states. If it accounted for 100% of the variability, all of the points on the scatterplot would fall on a perfectly straight line sloping either upward or downward.
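The correlation and best-fit line described above can be sketched in plain Python as follows. This is a minimal illustration, not the analysis behind the graph: the function names are hypothetical, and the two short lists are made-up values used only to show the mechanics, not the actual state data.

```python
import math


def _sums(xs, ys):
    """Return mean_x, mean_y, and the sums of squares and cross-products."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("each variable must take at least two distinct values")
    return mean_x, mean_y, sxy, sxx, syy


def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    _, _, sxy, sxx, syy = _sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def best_fit_line(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    mean_x, mean_y, sxy, sxx, _ = _sums(xs, ys)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up illustrative values, NOT the actual state data from this post.
rates = [1.0, 2.0, 3.0, 4.0, 5.0]       # groups per million (x axis)
votes = [40.0, 47.0, 43.0, 52.0, 48.0]  # percent of the vote (y axis)

r = pearson_r(rates, votes)
slope, intercept = best_fit_line(rates, votes)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
print(f"best-fit line: vote = {intercept:.2f} + {slope:.2f} * rate")
```

The share of variability accounted for is the square of the correlation coefficient, so the 20.1% figure corresponds to a correlation of about 0.45 (the square root of 0.201).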


As previously stated, DC was an extreme outlier on the concentration of hate groups, with 30.83 groups per million. It was also an extreme outlier on the percent of the vote Trump received, with 4.1%. These statistical anomalies, along with a high rate of residents with a college education (56.7% compared to 29.8% for the US as a whole) and a higher poverty rate (17.7% vs. 14.7% for the US), make it a potentially fruitful area for sociological study. However, for these reasons, and because it is fundamentally different from the states in other ways, DC was excluded from the correlation analysis, because including it would mask the relationship among the states. To see the rates for all the states plus DC, check out the link here.

One should always be careful not to infer cause-and-effect relationships from correlation/regression data. There could be some third variable that accounts for the relationship between hate group concentration and Trump's percent of the vote. It could be that the sentiments that drove people to create hate groups in their state also motivated them to vote for Trump. Prediction in regression research does not always mean causation.

Standardizing the totals is important for making data comparable. If I had correlated the raw number of hate groups with Trump's percent of the vote (or worse still, with the total number of votes he received in each state), I would have reached an erroneous conclusion about the relationship between the variables. A variable should always be considered in the context of other relevant variables.