What are the odds? How odds ratios can help journalists understand relationships between events
In my last post, I talked about using Fisher’s exact test to determine if there is a significant association between two categorical variables. At the end of the article, I talked about odds ratios as a measure of effect size between two categorical variables. Journalists investigating the health effects of exposure to an environmental toxin might find odd ratios useful to estimate the likelihood of one event occurring given that another has occurred in the same area. The odds ratio comes from the field of epidemiology as a method of estimating the relative risk of contracting a disease when exposed to a certain agent.
So, say you are investigating the link between smoking and cancer. The table below provides a framework for the estimation of an association between a possible causative agent and a disease of interest (in our case, smoking and cancer) for a hypothetical population. The letter A represents those with the disease and exposure to the agent. The letter B represents those who have the disease but were not exposed to the agent. The letter C represents those who were exposed to the agent but did not contract the disease. Finally, the letter D represents those who do not have the disease and were not exposed. The columns represent the counts for individuals who were either exposed to the agent under (A+C) study or not (B+D). The row totals represent counts for those who either have the disease of interest (A+B) or not (C+D).
The estimate of the odds of contracting the disease when exposed to the agent, compared to those who do not contact the disease is found by the formula:
Algebraic manipulation of the complex fraction makes the fraction on the right possible. The odds ratio is the ratio of concordant cases (either those who were exposed and have the disease or those who were not exposed and did not have the disease) to the discordant cases (either those who were exposed and did not have the disease or those who were not exposed and did have the disease). An odds ratio of one suggests no increased risk of disease if exposed. An odds ratio of more than one suggests an increased risk of disease with exposure while an odds ratio of less than one suggests decreased risk for those exposed.
Similarly, we can apply this analysis to other journalistic questions, like the odds ratio of a candidate’s success in an election. Below, I have provided a table of US Primary Election data to illustrate the computation of the odds ratio in a 2x2 contingency table. By applying an odds ratio analysis to this data, we can find out the likelihood of Donald Trump winning a primary on the Republican side in states where Hillary Clinton won on the Democratic side relative to states where Bernie Sanders won on the same side.
In the above table, the three rows from the table in my previous article were collapsed to two by combining the rows where Trump finished second and third to form the row Trump Lose. The equation for computing the odds ratio is given by:
This 5.21 number suggests that Donald Trump was more than 5 times more likely to win primaries in states where Hillary Clinton won on the Democratic side. The statistical significance of Fisher’s exact test (p=0.024) says that the odds ratio is significantly different from one. In other words, the significant Fisher test says that there is an increased likelihood of Trump winning in states where Clinton won. The odds ratio tells you by how much this likelihood increases.
A problem in computing odds ratios occurs when there is a zero in one of the cells labeled A thru D. A zero in the A or D cell results in an odds ratio of zero. A zero in the B or C cell results in an odds ratio of infinity . As the contingency table is an estimate of the “true” odds ratio, theoretically it can never be zero or infinity for the events of interest. One way of addressing this issue is to add 0.5 to every cell and apply the same formula. In the case of an odds ratio of infinity, another approach is to invert the table so that there is an odds ratio of zero and compute a 95% confidence interval. R can compute these with the fisher.test command that I used in my previous analysis:
The odds ratio provides a useful method for estimating the likelihood of an event occurring given that a possible causative event has occurred relative to the when the causative event has not occurred. In the 2x2 contingency table, the computation of the odds ratio is a fairly simple matter. Of course, the estimates can only be as valid as the data that is collected. In the example presented in this article, all of the Presidential primaries were included for the 50 states plus the District of Columbia. Not included in the table were the results from the other US territories such as Puerto Rico and Guam. But, as there are only six of these territorial contests (mostly won by Trump and Clinton), they would not have altered the odds ratio considerably.
Image: Jorge Franganillo.