Probability distribution and hypothesis testing

Bioinformatics is an interdisciplinary field that combines biology (including clinical medicine), computer science, and statistics. To do in-depth biomedical or bioinformatics research, especially data-driven research, researchers need a robust and intuitive understanding of statistics. However, many bioinformaticians with a biology background find statistics hard to grasp.

In basic bioinformatics analysis, perhaps the most commonly used statistical tool is hypothesis testing, such as Student's t-test or the Chi-squared test. Here let's talk about how these tests work.

Probability distribution

Before discussing hypothesis testing, we have to talk about probability distributions first. Through repeated experiments, people have realized that in the real world the probabilities of many events follow certain patterns. For example, measurements of some phenotypic traits (size, height or weight) within a species follow one distribution pattern. The incubation period of an epidemic disease among patients follows another. The expression level of a given gene across individuals follows yet another. From such observations, people have summarized a series of distributions that the probabilities of certain events follow. Once we determine the distribution of the event under study, we can make statistical inferences based on a finite number of observations.

Hypothesis testing

Hypothesis testing asks whether our data are consistent with an assumed distribution (the null hypothesis). Take Student's t-test: it tests whether two sets of values (or one set of values and a single reference value) are significantly different. In statistical terms, it tests whether the means of two sets of data differ significantly, assuming the sample means are approximately normally distributed (for large enough samples, the Central Limit Theorem justifies this assumption). The beauty of the normal distribution here is that the probability of falling within a given number of standard deviations of the mean is fixed: about 68%, 95% and 99.7% within one, two and three standard deviations respectively.
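The 68% / 95% / 99.7% figures can be checked directly from the normal cumulative distribution function. Below is a minimal sketch using only the Python standard library; the helper name prob_within is made up for this illustration.

```python
from statistics import NormalDist


def prob_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    # Roughly 0.68, 0.95 and 0.997 for k = 1, 2, 3
    print(f"within {k} SD: {prob_within(k):.4f}")
```

Because the result depends only on k, the same probabilities hold for any normal distribution, whatever its mean and standard deviation.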

Now look at the formula of Student's t-test. Loosely speaking, it simply calculates how many standard errors (standard deviations of the mean) lie between the data mean and the value it is compared with (strictly speaking, Z-test may be the more appropriate example here). We can then estimate the probability of seeing a difference at least this large if the two sets of values were drawn from the same distribution. At a significance level of 0.05 (that is, 95% confidence), if the absolute value of the t-statistic is greater than about two (1.96 for large samples), we can conclude that they are unlikely to be drawn from the same distribution.
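To make the "how many standard deviations away" idea concrete, here is a small sketch of a one-sample test statistic. The data are made up, the function names are illustrative, and the p-value uses the normal (Z-test) approximation; a real t-test would use the t-distribution with n - 1 degrees of freedom, which matters for small samples like this one.

```python
import math
from statistics import NormalDist, mean, stdev


def t_statistic(sample, mu0):
    """Distance between the sample mean and mu0, in units of the standard error."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    return (mean(sample) - mu0) / standard_error


def approx_two_sided_p(statistic):
    """Two-sided p-value under the normal approximation (assumption: large sample)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(statistic)))


# Hypothetical measurements, compared against a reference value of 5.0
measurements = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.7, 5.5, 5.9, 5.2]
t = t_statistic(measurements, mu0=5.0)
p = approx_two_sided_p(t)
print(f"statistic = {t:.3f}, approximate p-value = {p:.4f}")
```

A statistic of 1.96 corresponds to a two-sided p-value of about 0.05 under this approximation, which is where the "two standard deviations" rule of thumb comes from.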

Chi-squared test

We usually use the Chi-squared test to determine whether two ways of categorizing the same data are associated with each other (the Chi-squared test of independence in contingency tables). For example, we can ask whether cancer status (cancer / healthy) is associated with race (African / Asian).
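As a rough illustration of the independence test, the sketch below computes the statistic for a 2 x 2 contingency table by hand. The counts are invented, the function names are illustrative, no continuity correction is applied, and the p-value formula is only valid for one degree of freedom (which is the case for a 2 x 2 table).

```python
import math


def chi_square_independence(table):
    """Sum of (observed - expected)^2 / expected over all cells of the table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two categorizations were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df1(stat):
    """Upper tail of the Chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts: rows are two groups, columns are cancer / healthy
table = [[20, 30],
         [30, 20]]
stat = chi_square_independence(table)
print(f"chi-square = {stat:.3f}, p-value = {p_value_df1(stat):.4f}")
```

The expected counts come from the row and column totals alone, which is what "independent" means here: knowing the row tells us nothing extra about the column.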

Again, let's look at the Chi-square distribution first. We discuss the normal distribution and the Chi-square distribution together not only because both are used frequently in biomedical research, but also because they are closely related. The Chi-square distribution with k degrees of freedom describes the sum of squares of k independent standard normal variables. In this sense we can view the Chi-squared test as a multivariate generalization of the t-test or z-test. This can be seen from the formula of the test statistic as well: if we treat the counts in the contingency table as Poisson-distributed, the mean equals the variance, so each term (observed - expected)^2 / expected is the square of an approximately standard normal deviate. By the Central Limit Theorem, the Chi-square distribution itself approaches a normal distribution as the degrees of freedom grow.
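The link between the two distributions can be seen in a quick simulation: summing the squares of k independent standard normal draws should give values whose mean is close to k and whose variance is close to 2k, the known moments of the Chi-square distribution. This is only a sketch; the function name, seed and sample size are arbitrary choices.

```python
import random
from statistics import mean, variance


def simulate_chi_square(k, n_draws, seed=0):
    """Draw n_draws values, each the sum of squares of k standard normal variables."""
    rng = random.Random(seed)
    return [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
        for _ in range(n_draws)
    ]


k = 3
draws = simulate_chi_square(k, n_draws=20000)
# Theory: mean = k, variance = 2k for a Chi-square distribution with k degrees of freedom
print(f"simulated mean = {mean(draws):.3f} (theory {k})")
print(f"simulated variance = {variance(draws):.3f} (theory {2 * k})")
```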

Once we have a better understanding of the Chi-square distribution, we can better understand its applications beyond the independence test on contingency tables. The Chi-squared test can also determine whether our data follow an assumed distribution (the Chi-squared test of goodness of fit). Compared with the test on a contingency table, this simply replaces the expected counts derived from the alternative categorization with the expected counts derived from the assumed distribution.
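A goodness-of-fit test can be sketched the same way. The example below checks made-up genotype counts against an assumed 1:2:1 ratio; the counts, the ratio and the function names are all illustrative. With three categories there are two degrees of freedom, for which the upper tail of the Chi-square distribution has the simple closed form exp(-x / 2).

```python
import math


def chi_square_goodness_of_fit(observed, expected_ratio):
    """Chi-square statistic of observed counts against counts implied by a ratio."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    stat = 0.0
    for obs, ratio in zip(observed, expected_ratio):
        expected = total * ratio / ratio_sum
        stat += (obs - expected) ** 2 / expected
    return stat


def p_value_df2(stat):
    """Upper tail of the Chi-square distribution with 2 degrees of freedom."""
    return math.exp(-stat / 2.0)


# Hypothetical genotype counts (AA, Aa, aa) tested against a 1:2:1 ratio
observed = [28, 48, 24]
stat = chi_square_goodness_of_fit(observed, expected_ratio=[1, 2, 1])
print(f"chi-square = {stat:.3f}, p-value = {p_value_df2(stat):.4f}")
```

A large p-value here means the observed counts are compatible with the assumed ratio; it does not prove that the ratio is correct.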

The Chi-square distribution is used in survival analysis as well. In fact, all three classical hypothesis tests used with the Cox model, namely the Wald test, the likelihood ratio test and the score (log-rank) test, have test statistics that asymptotically follow a Chi-square distribution (this can be understood from the nature of the Chi-square distribution as the multivariate generalization of the t-test or z-test). Similarly, these tests ask whether any of the variables significantly changes the probability of event occurrence (here the event usually refers to death in Cox models of survival analysis).
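For a single coefficient, the connection to the z-test is easy to show: the Wald statistic is just the squared z-score, compared against a Chi-square distribution with one degree of freedom. The sketch below uses a hypothetical coefficient and standard error rather than a fitted Cox model, and the function name is made up for this illustration.

```python
import math
from statistics import NormalDist


def wald_test(coefficient, standard_error):
    """Return (z, chi-square statistic, p-value) for a single coefficient."""
    z = coefficient / standard_error
    chi_square = z ** 2
    # Upper tail of the Chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return z, chi_square, p_value


# Hypothetical log hazard ratio and its standard error
z, chi_square, p_chi = wald_test(coefficient=0.5, standard_error=0.2)
p_normal = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, chi-square = {chi_square:.2f}")
print(f"p (chi-square, 1 df) = {p_chi:.4f}, p (two-sided z) = {p_normal:.4f}")
```

The two p-values agree, which is the one-variable case of the Chi-square distribution generalizing the z-test.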