In my STAT 210A class, we are now discussing hypothesis testing, which has brought back lots of memories from my very first statistics course (taken in my third semester of undergrad). Think null hypotheses, $p$-values, confidence intervals, and power levels, which are often covered in introductory statistics courses. In STAT 210A, though, we take a far more rigorous treatment of the material. Here’s the setting at a high level: we use data to infer which of two competing hypotheses is correct. To formalize it a bit: we assume some model $X \sim P_\theta$ with $\theta \in \Theta$ and test:
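- $H_0 : \theta \in \Theta_0$, the null hypothesis
- $H_1 : \theta \in \Theta_1$, the alternative hypothesis
While it is not strictly necessary that $\Theta_0 \cup \Theta_1 = \Theta$ and $\Theta_0 \cap \Theta_1 = \emptyset$, we often assume these are true for simplicity.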
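We “solve” a hypothesis testing problem by specifying some critical function $\phi$, defined as

$$\phi(x) = \begin{cases} 0 & \text{if we accept } H_0 \\ \gamma & \text{if we reject } H_0 \text{ with probability } \gamma \\ 1 & \text{if we reject } H_0 \end{cases}$$

In other words, $\phi(x)$ tells us the probability that we should reject $H_0$ after observing $x$.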
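The performance of the test $\phi$ is specified by the power function:

$$\beta(\theta) = E_\theta[\phi(X)] = P_\theta(\text{reject } H_0)$$

A closely related quantity is the significance level of a test:

$$\alpha = \sup_{\theta \in \Theta_0} \beta(\theta) = \sup_{\theta \in \Theta_0} E_\theta[\phi(X)]$$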
The level $\alpha$ here therefore represents the worst chance (among all the $\theta$ possibilities) of falsely rejecting $H_0$. Notice that $\theta$ is constrained to be in the null hypothesis region $\Theta_0$! The reason why we care about $\alpha$ is that $H_0$ often represents the status quo, and we only want to reject it if we are absolutely sure that our evidence warrants it. (In fact, we technically don’t even “accept” the null hypothesis; in many courses, this is referred to as “failing to reject”.)
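To make the power function and the level concrete, here is a minimal sketch in Python. Everything specific in it is an assumption chosen for illustration: a Binomial model with $n = 3$ trials, a hypothetical critical function `phi`, and a finite null region with two parameter values (so the "sup" is just a maximum).

```python
from fractions import Fraction
from math import comb


def binom_pmf(x, n, theta):
    """P(X = x) for X ~ Binomial(n, theta)."""
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)


def power(phi, n, theta):
    """beta(theta) = E_theta[phi(X)], the probability of rejecting H0 under theta."""
    return sum(binom_pmf(x, n, theta) * phi[x] for x in range(n + 1))


# Hypothetical test on n = 3 trials: accept at X = 0 or 1,
# reject with probability 1/2 at X = 2, always reject at X = 3.
phi = [Fraction(0), Fraction(0), Fraction(1, 2), Fraction(1)]

# Assumed finite null region Theta_0, so the sup becomes a max.
null_region = [Fraction(1, 4), Fraction(1, 2)]
level = max(power(phi, 3, theta) for theta in null_region)
print(level)  # 5/16
```

The worst case is attained at $\theta = 1/2$, the null value that makes large counts most likely.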
We may often resort to a randomized test, which was suggested by my definition of $\phi$ above, which uses a $\gamma$. This is useful for technical reasons to achieve exact significance levels. Let’s look at an example to make this concrete. Suppose that $X \sim \text{Binomial}(2, \theta)$, and that for two fixed parameter values $\theta_0$ and $\theta_1$ we are testing
- $H_0 : \theta = \theta_0$
- $H_1 : \theta = \theta_1$ (so here the hypotheses do not partition).
And, furthermore, that we want to develop a test $\phi$ with a significance level of $\alpha = 0.05$. (Note: yes, this is related to the abundant usage of 0.05 as a $p$-value threshold in many research articles.)
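To test, it is convenient to use a likelihood ratio test, which has nice properties that I will not cover here. In our case, with $\theta_0$ and $\theta_1$ the parameter values under $H_0$ and $H_1$, we have:

$$L(X) = \frac{p(X \mid \theta_1)}{p(X \mid \theta_0)} = \left(\frac{\theta_1}{\theta_0}\right)^{X} \left(\frac{1 - \theta_1}{1 - \theta_0}\right)^{2 - X}$$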
where we have simply plugged in the densities for Binomial random variables and simplified (the binomial coefficients cancel). Intuitively, if this ratio is very large, then it is more likely that we should reject the null hypothesis because $H_1$ describes the data better.
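As a quick numerical illustration of the likelihood ratio, the sketch below evaluates it at every possible outcome of two Binomial trials. The parameter values $\theta_0 = 1/2$ and $\theta_1 = 3/4$ are assumptions picked for illustration, and `likelihood_ratio` is a hypothetical helper name.

```python
from fractions import Fraction
from math import comb


def binom_pmf(x, n, theta):
    """P(X = x) for X ~ Binomial(n, theta)."""
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)


def likelihood_ratio(x, n, theta0, theta1):
    """Density under the alternative divided by density under the null."""
    return binom_pmf(x, n, theta1) / binom_pmf(x, n, theta0)


# Assumed values, for illustration only.
theta0, theta1 = Fraction(1, 2), Fraction(3, 4)
ratios = [likelihood_ratio(x, 2, theta0, theta1) for x in range(3)]
print([str(r) for r in ratios])  # ['1/4', '3/4', '9/4']
```

With these assumed values the ratio increases in the observed count, so larger counts are stronger evidence for the alternative.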
We know that $L(X)$ can take on only three values (because $X \in \{0, 1, 2\}$) and that under the null hypothesis (this is important!) the probabilities of $X$ taking on 0, 1, or 2 are $(1 - \theta_0)^2$, $2\theta_0(1 - \theta_0)$, and $\theta_0^2$, respectively.
Knowing this, how do we design the test $\phi$ with the desired significance level? It must be the case that

$$E_{\theta_0}[\phi(X)] = \sum_{x=0}^{2} P_{\theta_0}(X = x)\,\phi(x) = 0.05$$
(There is only one possibility here, so we do not need a “sup” term.)
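One way to see why this constraint is awkward for a non-randomized test is to enumerate every deterministic rule (each outcome mapped to accept or reject) and list the levels they can attain. The sketch below does this assuming a null value of $\theta_0 = 1/2$, which is an illustrative choice.

```python
from fractions import Fraction
from itertools import product
from math import comb

theta0 = Fraction(1, 2)  # assumed null value, for illustration only
n = 2
null_probs = [comb(n, x) * theta0**x * (1 - theta0) ** (n - x) for x in range(n + 1)]

# A deterministic test assigns 0 (accept) or 1 (reject) to each outcome 0..n.
levels = sorted(
    {
        sum(p * d for p, d in zip(null_probs, decisions))
        for decisions in product([0, 1], repeat=n + 1)
    }
)
print([str(level) for level in levels])  # ['0', '1/4', '1/2', '3/4', '1']
```

Under this assumption no deterministic test has level exactly 0.05; the smallest nonzero level is 1/4.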
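By using the fact that a likelihood ratio test must be defined by a cutoff point $c$ where if $L(X) > c$, we reject, else if $L(X) < c$ we accept (and with equality, we randomize), we see that our test must be $\phi(2) = 0.05 / \theta_0^2$ and $\phi(1) = \phi(0) = 0$ (taking $\theta_1 > \theta_0$, so that the ratio is largest at $X = 2$, and $\theta_0^2 \geq 0.05$). If we were to equate this with our definitions above, this would be:

$$\phi(x) = \begin{cases} 0 & \text{if } L(x) < c \\ \gamma & \text{if } L(x) = c \\ 1 & \text{if } L(x) > c \end{cases}$$

with $c = L(2)$ and $\gamma = 0.05 / \theta_0^2$. (The third case never happens here; I just added it to be consistent with our earlier definition.)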
And that is why we use randomization. This also gives us a general recipe for designing tests that achieve arbitrary significance levels.
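The recipe can be sketched in a few lines of Python for the Binomial case with two simple hypotheses. This is a rough illustration, not a definitive implementation: `randomized_lrt` is a hypothetical helper, it assumes $0 < \theta_0 < 1$ so that every outcome has positive null probability, and the values passed in the usage line ($\theta_0 = 1/2$, $\theta_1 = 3/4$) are assumptions for illustration.

```python
from fractions import Fraction
from math import comb


def binom_pmf(x, n, theta):
    """P(X = x) for X ~ Binomial(n, theta)."""
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)


def randomized_lrt(n, theta0, theta1, alpha):
    """Return (c, gamma) for a likelihood ratio test with exact level alpha.

    The test rejects when L(x) > c and rejects with probability gamma
    when L(x) == c. Assumes 0 < theta0 < 1.
    """
    null = {x: binom_pmf(x, n, theta0) for x in range(n + 1)}
    ratio = {x: binom_pmf(x, n, theta1) / null[x] for x in range(n + 1)}
    used = Fraction(0)  # null probability of outcomes rejected outright so far
    # Walk through the distinct ratio values from most to least extreme.
    for c in sorted(set(ratio.values()), reverse=True):
        mass = sum(null[x] for x in ratio if ratio[x] == c)
        if used + mass >= alpha:
            # Randomize on the boundary to spend exactly the remaining level.
            return c, (alpha - used) / mass
        used += mass
    raise ValueError("alpha must be at most 1")


c, gamma = randomized_lrt(2, Fraction(1, 2), Fraction(3, 4), Fraction(1, 20))
print(c, gamma)  # 9/4 1/5
```

Using exact fractions rather than floats keeps the equality check on the boundary $L(x) = c$ reliable.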