18/1/2017

## Don’t test me: Using Fisher’s exact test to unearth stories about statistical relationships

A common problem faced by data journalists is how to determine whether there is a statistical relationship between two categorical variables, such as gender, race, or which of two candidates won an election. The simplest way to visualize the relationship is to put the counts for each combination of the two variables in a contingency table, with the rows representing the levels of one variable and the columns representing the levels of the other. The most commonly used statistical test for an association between the row and column variables is the chi-square (χ²) test. The example in the table below illustrates this test using 2016 primary data for the American presidential election.

The columns of the table show the primary states won by Hillary Clinton and by Bernie Sanders on the Democratic side, and the rows show where Donald Trump placed in those same states on the Republican side. The total number of states in the table is 51 because the District of Columbia is included. For example, the column percentages show that Trump won 86% of the primary states that Clinton won, while he won 55% of the states that Sanders won. Using the chi-square test, journalists can explore relationships between the states won by Trump, Clinton, and Sanders.
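The table can be rebuilt from the figures quoted in this article. The sketch below is a minimal one in Python: the six cell counts are an assumption, chosen because they are the only counts consistent with the totals and percentages the article gives (51 states, 22 Sanders states, 3 third-place finishes for Trump, 2 of them in Sanders states, and column percentages of 86% and 55%). The variable names are illustrative.

```python
# Cell counts reconstructed from the totals and percentages quoted in the
# article (an assumption; the original table is not reproduced in the text).
# Rows: where Trump placed. Columns: who won the Democratic primary.
table = {
    "Trump 1st": {"Clinton": 25, "Sanders": 12},
    "Trump 2nd": {"Clinton": 3, "Sanders": 8},
    "Trump 3rd": {"Clinton": 1, "Sanders": 2},
}

# Column totals and the grand total.
col_totals = {c: sum(row[c] for row in table.values()) for c in ("Clinton", "Sanders")}
grand_total = sum(col_totals.values())

# Column percentages: each cell as a share of its column total.
col_percent = {
    r: {c: 100 * row[c] / col_totals[c] for c in col_totals}
    for r, row in table.items()
}

print(col_totals, grand_total)
for placing, pct in col_percent.items():
    print(placing, {c: round(v) for c, v in pct.items()})
```

The first row of percentages gives the 86% and 55% mentioned above.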

The chi-square test is based on calculating an expected value for each cell in the table, that is, the count that would be expected if there were no relationship between the variables. In the example, the expected value for the cell covering states where Trump finished third on the Republican side and Bernie Sanders won on the Democratic side is found by multiplying the row total for Trump placing third (3) by the column total for states Sanders won (22), then dividing the product by the total number of observations (51). For the cell in row i and column j, with row total R_i, column total C_j, and table total N, the formula for the expected value is given by:

$$E_{ij} = \frac{R_i \, C_j}{N}, \qquad \text{here} \quad \frac{3 \times 22}{51} \approx 1.29$$
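The same arithmetic can be applied to every cell at once. This is a minimal Python sketch; the cell counts are reconstructed from the figures quoted in the article (an assumption), and `expected_counts` is an illustrative helper name, not a library function.

```python
# Reconstructed counts (assumption). Rows: Trump 1st, 2nd, 3rd.
# Columns: Clinton won, Sanders won.
observed = [[25, 12], [3, 8], [1, 2]]


def expected_counts(obs):
    """Expected count per cell under independence: row total * column total / N."""
    row_totals = [sum(row) for row in obs]
    col_totals = [sum(col) for col in zip(*obs)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(observed)

# Cell for "Trump finished third, Sanders won": 3 * 22 / 51.
print(round(expected[2][1], 2))
```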

That means a value of 1.29 would be expected for this cell if the primary states where Trump finished third and the states Sanders won were completely independent of each other. The observed value for this cell is 2, a higher count than would be expected. Expected values are computed for every cell in the table; the difference between the observed and expected value for each cell is then squared, divided by the expected value, and summed across all cells according to the formula:

$$\chi^2 = \sum_{i,j} \frac{(a_{ij} - E_{ij})^2}{E_{ij}}$$

where a_ij is the observed count and E_ij the expected count for the cell in row i and column j.

If the chi-square value exceeds the critical value for the table’s degrees of freedom (found by multiplying the number of rows minus one by the number of columns minus one) at the chosen significance level, it is concluded that there is an association between the variables.
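The whole procedure can be sketched in a few lines of Python. The cell counts are reconstructed from the figures quoted in the article (an assumption). To stay within the standard library, the sketch uses the closed forms for the chi-square distribution that hold only for 2 degrees of freedom, which is what a table with 3 rows and 2 columns has; a general-purpose version would need a statistics library.

```python
import math

# Reconstructed counts (assumption). Rows: Trump 1st, 2nd, 3rd.
# Columns: Clinton won, Sanders won.
observed = [[25, 12], [3, 8], [1, 2]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum (observed - expected)^2 / expected over every cell.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (count - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(row_totals) - 1) * (len(col_totals) - 1)
if df != 2:
    raise ValueError("the closed forms below hold only for 2 degrees of freedom")

alpha = 0.05
critical_value = -2 * math.log(alpha)  # chi-square quantile when df = 2
p_value = math.exp(-chi2 / 2)          # chi-square upper tail when df = 2

print(f"chi2={chi2:.2f} df={df} critical={critical_value:.2f} p={p_value:.4f}")
```

With these counts the statistic comes to about 6.33 against a critical value of about 5.99, so the test rejects independence at the 0.05 level.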

But there is a problem with the chi-square test.

The test is only an approximation of the distribution of counts in contingency tables. If more than 20% of the cells in the table have an expected value of less than five, as is the case in the table above, the chi-square approximation does not work for testing the hypothesis of an association between the row variable and the column variable. The major statistical packages will alert the user if this assumption is violated. Violating it causes the reported p-value to be incorrect and can lead to incorrect conclusions about the presence or absence of an association. (Both variables in the table are categorical, meaning they can take only a limited set of values, as with gender, political affiliation, or placement in an election. A continuous variable, by contrast, can take any value on the number line, as with temperature, height, or weight.)
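The 20% rule of thumb is easy to check by hand. The Python sketch below counts the cells whose expected value falls below five; the cell counts are reconstructed from the figures quoted in the article (an assumption), and the variable names are illustrative.

```python
# Reconstructed counts (assumption). Rows: Trump 1st, 2nd, 3rd.
# Columns: Clinton won, Sanders won.
observed = [[25, 12], [3, 8], [1, 2]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count per cell under independence, flattened into one list.
expected_cells = [r * c / n for r in row_totals for c in col_totals]

# Share of cells with an expected count below five.
small_cells = sum(1 for e in expected_cells if e < 5)
share_small = small_cells / len(expected_cells)

# Rule of thumb from the text: no more than 20% of cells below five.
approximation_ok = share_small <= 0.20

print(small_cells, len(expected_cells), share_small, approximation_ok)
```

Here three of the six cells have an expected count below five, so the approximation should not be trusted for this table.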

To overcome these limitations, there is an exact alternative to the chi-square test called Fisher’s exact test.

Rather than the chi-square distribution, which is an approximation of the distribution of observed and expected values, Fisher’s exact test is based on the hypergeometric probability distribution, which is the exact distribution of counts in a contingency table with fixed row and column totals. The probability of a given table is:

$$p = \frac{\prod_i R_i! \; \prod_j C_j!}{N! \; \prod_{i,j} a_{ij}!}$$

Here the R_i! are the factorials of the individual row totals (for example, 5! = 5 × 4 × 3 × 2 × 1), the C_j! are the factorials of the individual column totals, N! is the factorial of the table total, and the a_ij! are the factorials of the individual cell values. The symbol Π denotes a product, so Π_ij multiplies together the factorials of all the cell values. Such a formula is even more computationally intensive than the chi-square test, especially for tables with many rows and columns. This is why the chi-square test was favored in the past: Fisher’s exact test took too much memory for computers to run. These days that is much less of an issue, and Fisher’s exact test is easy to run in the major statistical packages, such as R, SAS, SPSS, and Stata.
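To make the factorial formula concrete, here is a minimal Python sketch that evaluates it for a single table. The cell counts are reconstructed from the figures quoted in the article (an assumption), and `table_probability` is an illustrative name. Note that this is the probability of one particular table, not yet the p-value of the test.

```python
from math import factorial, prod

# Reconstructed counts (assumption). Rows: Trump 1st, 2nd, 3rd.
# Columns: Clinton won, Sanders won.
observed = [[25, 12], [3, 8], [1, 2]]


def table_probability(table):
    """Hypergeometric probability of a table given its row and column totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Numerator: product of the factorials of all row and column totals.
    numerator = prod(factorial(t) for t in row_totals + col_totals)
    # Denominator: N! times the product of the factorials of all cells.
    denominator = factorial(n) * prod(factorial(a) for row in table for a in row)
    # Python integers are exact, so the division happens only once, at the end.
    return numerator / denominator


p_table = table_probability(observed)
print(f"{p_table:.6f}")
```

For these counts the probability of the observed table is a little under 0.006.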

The commands to conduct Fisher’s exact and the chi-square test in R can be seen below, using the US Primary Election table above (yellow for Fisher’s exact test, green for the chi-square test).

The output for Fisher’s exact test shows a probability of 0.03653 of observing table frequencies at least as extreme as these when there is no association between the rows and columns. The chi-square test output shows a p-value of 0.04217 for the same table. Using a p-value of 0.05 as the criterion for significance, we would find a relationship with both tests in this case, though the p-values differ. With a sample even smaller than in this example, the difference in p-values would be greater and could lead us to the wrong conclusion about whether there is a relationship between the variables. In this example, states that Hillary Clinton won in the primary season were more likely to be won by Donald Trump, while states that Bernie Sanders won were more likely to have Trump finish second or third.
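As a cross-check that does not depend on R, the exact p-value can be obtained by brute force: list every table that has the same row and column totals, compute each table’s hypergeometric probability, and add up the probabilities of the tables that are no more likely than the observed one. The Python sketch below does this for small tables. The cell counts are reconstructed from the figures quoted in the article (an assumption), the function names are illustrative, and the small tolerance used when comparing probabilities is a choice made here to guard against floating-point ties.

```python
from math import factorial, prod

# Reconstructed counts (assumption). Rows: Trump 1st, 2nd, 3rd.
# Columns: Clinton won, Sanders won.
observed = [[25, 12], [3, 8], [1, 2]]


def table_probability(table):
    """Hypergeometric probability of a table given its margins."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    num = prod(factorial(t) for t in rows + cols)
    den = factorial(sum(rows)) * prod(factorial(a) for r in table for a in r)
    return num / den


def row_fillings(total, caps):
    """Yield lists of non-negative ints summing to total, each at most its cap."""
    if len(caps) == 1:
        if 0 <= total <= caps[0]:
            yield [total]
        return
    for first in range(min(total, caps[0]) + 1):
        for rest in row_fillings(total - first, caps[1:]):
            yield [first] + rest


def tables_with_margins(row_totals, col_totals):
    """Yield every table of non-negative ints with the given margins."""
    if len(row_totals) == 1:
        yield [list(col_totals)]  # the last row is forced by what remains
        return
    for row in row_fillings(row_totals[0], col_totals):
        remaining = [c - a for c, a in zip(col_totals, row)]
        for rest in tables_with_margins(row_totals[1:], remaining):
            yield [row] + rest


def fisher_exact_p(table, rel_tol=1e-7):
    """Sum the probabilities of all tables no more likely than the observed one."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    p_obs = table_probability(table)
    probs = [table_probability(t) for t in tables_with_margins(rows, cols)]
    return sum(p for p in probs if p <= p_obs * (1 + rel_tol))


p_fisher = fisher_exact_p(observed)
print(f"{p_fisher:.4f}")
```

Full enumeration is only practical for small tables like this one; statistical packages use faster algorithms to reach the same answer.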

As a warning, the p-value should not be used as an indicator of the strength of the association between categorical variables. At a given significance level the test is either significant or not; it indicates whether there is evidence of a relationship, not how strong that relationship is. The p-value is also sensitive to sample size. The odds ratio can often be used to estimate the effect size, but R’s fisher.test function only computes it for tables with 2 columns and 2 rows.
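One way to get an odds ratio for this example is to collapse the table to 2 rows by 2 columns. The Python sketch below does that by merging Trump’s second- and third-place finishes into a single "did not win" row. Both the collapsing step and the reconstructed cell counts are assumptions made here for illustration. The figure it computes is the simple sample odds ratio, which is not the same quantity as the conditional estimate that R’s fisher.test reports for 2 by 2 tables.

```python
# Reconstructed counts (assumption). Rows: Trump 1st, 2nd, 3rd.
# Columns: Clinton won, Sanders won.
observed = [[25, 12], [3, 8], [1, 2]]

# Collapse to 2 x 2: "Trump won" versus "Trump did not win" (2nd or 3rd).
trump_won = observed[0]
trump_lost = [a + b for a, b in zip(observed[1], observed[2])]
collapsed = [trump_won, trump_lost]

# Sample odds ratio: (a * d) / (b * c) for the table [[a, b], [c, d]].
(a, b), (c, d) = collapsed
odds_ratio = (a * d) / (b * c)

print(collapsed, round(odds_ratio, 2))
```

An odds ratio above 1 here means the odds of a Trump win were higher in the states Clinton won than in the states Sanders won.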

Fisher’s exact test provides a criterion for deciding whether the differences in observed percentages between two categorical variables in a sample are significant or just due to random noise in the data. In the above example, the 86% of Clinton’s primary states that Trump won is significantly different from the 55% of Sanders’s primary states that Trump won. Journalists should always be careful about making these judgments just by looking at observed percentages or counts, because such decisions are subjective. Subjective decisions can be further clouded by one’s preconceived notions about the issues related to the data.