P-values are like evidence in a criminal trial


In many of my posts, I have talked about predictor variables that are significant in regression models. One question some of you might ask is how one determines whether a variable is significantly associated with the outcome of interest.

In a criminal trial the judge and/or jury has to make a decision about the defendant’s guilt or innocence. They are in essence testing a legal hypothesis about guilt or innocence, and should only base their decision on the evidence that is presented at the trial. In effect, they are blind to the truth of the defendant’s guilt or innocence. The null hypothesis is the presumption of the defendant’s innocence until the prosecution proves him or her guilty beyond a reasonable doubt.

Similarly, in statistical research we test statistical hypotheses. Is the mean treatment outcome for group 1 significantly different from that of group 2? In my articles on the effect of race on the percent of the vote for Trump in each state, the hypothesis I tested was, "Is the regression coefficient significantly different from zero?" These are examples of two-sided hypotheses tested against a null hypothesis of no effect. In the first example, the null hypothesis says there is no difference between the two treatments. In the second, the null hypothesis says that the percent of a state's population that is white has no effect on Trump's percent of the vote in that state; in other words, the regression coefficient is zero. These are non-directional hypotheses. For a test statistic with a normal distribution, the graph below shows the value of the test statistic on the x-axis, with probability given by the area under the density curve. The null hypothesis is rejected if the test statistic falls in the blue region at either tail. This rejection is analogous to a guilty verdict in the trial.
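A minimal sketch of such a two-sided test, using simulated data (the group means, spreads, and sample sizes here are illustrative assumptions, not values from any study discussed above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group1 = rng.normal(loc=10.0, scale=2.0, size=50)  # treatment 1 outcomes
group2 = rng.normal(loc=10.0, scale=2.0, size=50)  # treatment 2 outcomes

# Two-sided null hypothesis: the two group means are equal.
t_stat, p_value = stats.ttest_ind(group1, group2)

# Reject the null when the p-value falls below the chosen threshold
# (here 0.05), i.e. when the test statistic lands in a blue region.
reject = p_value < 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, reject null: {reject}")
```

Because both simulated groups are drawn from the same distribution, the null hypothesis is actually true here, and the test should usually (though not always) fail to reject it.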

Research hypotheses can also take the form of a one-tailed, or unidirectional, hypothesis, as shown in the graph below: the entire rejection region falls at one end of the test statistic's distribution. This is analogous to hypothesizing that treatment 1 is better than treatment 2, or that there is a positive regression coefficient between the percent white population in a state and the percent of the vote for Trump in that state. If the test statistic falls into the blue region, which is all on one side, the null hypothesis is rejected.
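The one-tailed version of the sketch above, again on simulated data (the assumed effect of one unit in favor of treatment 1 is illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
treatment1 = rng.normal(loc=11.0, scale=2.0, size=50)  # assumed better
treatment2 = rng.normal(loc=10.0, scale=2.0, size=50)

# One-tailed alternative: the mean of treatment 1 is GREATER than that
# of treatment 2, so the whole rejection region sits in the upper tail.
t_stat, p_one_sided = stats.ttest_ind(treatment1, treatment2,
                                      alternative='greater')
reject = p_one_sided < 0.05
print(f"t = {t_stat:.3f}, one-sided p = {p_one_sided:.3f}")
```

Note that for the same data and the same test statistic, a one-sided p-value is half the two-sided one, since only one tail counts as evidence against the null.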

The size of the blue area in both graphs is determined by how stringent the burden of proof is. The smaller the area, the smaller the probability of rejecting the null hypothesis when it is in fact true, and the larger the test statistic has to be to reject the null. This is roughly analogous to the height of the bar that a high jumper or a pole vaulter has to clear.
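The bar-height analogy can be made concrete by computing the critical values of a standard normal test statistic at a few common significance levels:

```python
from scipy import stats

# Critical z-values for a two-sided test: the more stringent the burden
# of proof (smaller alpha = smaller blue area), the larger the test
# statistic must be in absolute value to reject the null.
for alpha in (0.10, 0.05, 0.01):
    z_crit = stats.norm.ppf(1 - alpha / 2)
    print(f"alpha = {alpha:.2f} -> |z| must exceed {z_crit:.3f}")
# alpha = 0.10 -> |z| must exceed 1.645
# alpha = 0.05 -> |z| must exceed 1.960
# alpha = 0.01 -> |z| must exceed 2.576
```

Lowering alpha from 0.05 to 0.01 raises the bar from about 1.96 to about 2.58 standard errors.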

The table below shows the consequences of each decision relative to the unknown truth of the hypothesis. Ho represents the null hypothesis of the presumption of innocence in a trial. H1 represents the alternative or research hypothesis. If one accepts the null hypothesis when it is in fact true, this is called a correct acceptance. This is analogous to a truly innocent defendant being found not guilty at trial, as the TV lawyer Matlock always managed to do. If one rejects the null hypothesis when it is in fact true, this is called a type 1 error. This is analogous to an innocent defendant being placed in jail, say, as many in the US believe that Amanda Knox was in Italy. If the null hypothesis is accepted when it is in reality false, this is called a type 2 error. This is analogous to a guilty defendant being allowed to go free, as many in the US believed happened to George Zimmerman for shooting Trayvon Martin. The final scenario is rejecting Ho when it is indeed false, which is called a correct rejection. This is analogous to convicting a defendant when he or she is indeed guilty, as happened to Oscar Pistorius in South Africa.
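A type 1 error rate can be checked by simulation: if we repeatedly test a null hypothesis that is actually true, we should wrongly convict (reject) at close to the chosen alpha. This is a sketch with arbitrary simulation settings, not a calculation from the article:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_trials = 2000

# Both groups come from the SAME distribution, so the null is true in
# every trial; each rejection is a wrongful conviction (type 1 error).
false_rejections = 0
for _ in range(n_trials):
    a = rng.normal(size=30)
    b = rng.normal(size=30)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_rejections += 1

type1_rate = false_rejections / n_trials
print(f"estimated type 1 error rate: {type1_rate:.3f}")  # close to 0.05
```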

Ideally, in hypothesis testing or in criminal trials, type 1 and type 2 errors would never happen. Sadly, we must contend with the possibility of these errors and make a value judgment as to which is worse. In the US, the framers of the constitution considered a type 1 error worse: defendants who have been acquitted of a crime can never be tried again for that crime (despite this, the US has one of the highest incarceration rates in the world). For researchers, the probability of a type 1 error is under their direct control, set by choosing the value of the test statistic needed to reject the null. The probability of a type 2 error is harder to pin down and is sensitive to things like study design, sample size, and the size of the true treatment effect. The graph below shows the distribution of the sample mean under the null hypothesis and the distribution of the mean under the alternative. The area in pink shows the probability of a type 1 error. The area in light blue shows the probability of a correct rejection under the alternative. The area in white under the alternative distribution is the probability of a type 2 error. The distance between the midpoints of the two distributions represents the size of the treatment effect.
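The sensitivity of type 2 error to effect size can also be seen by simulation. Here "power" is the probability of a correct rejection (one minus the type 2 error probability); the effect sizes and sample size below are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha = 0.05

def estimated_power(effect, n, trials=1000):
    """Fraction of simulated studies that correctly reject the null
    when the true treatment effect is `effect` standard deviations."""
    hits = 0
    for _ in range(trials):
        control = rng.normal(0.0, 1.0, size=n)
        treated = rng.normal(effect, 1.0, size=n)
        if stats.ttest_ind(treated, control).pvalue < alpha:
            hits += 1
    return hits / trials

# Power rises as the distance between the two distributions grows.
low = estimated_power(effect=0.2, n=30)   # small effect
high = estimated_power(effect=0.8, n=30)  # large effect
print(f"power at d=0.2: {low:.2f}, power at d=0.8: {high:.2f}")
```

With the same sample size, the larger effect is far easier to detect, so its type 2 error probability is much smaller.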

We all must make decisions based on available evidence, which may or may not be conclusive. Journalists are no different. Statistics provide the opportunity to see patterns that are not easy to see with the naked eye, and understanding how scientific researchers and statisticians make decisions helps data-driven journalists evaluate the evidence they have when writing a story. When the likelihood of observing the evidence at hand, assuming there is no story, is sufficiently low, we conclude that there must be a story.

Image: fickleandfreckled.