Probability distribution and hypothesis testing

Bioinformatics is an interdisciplinary field that integrates biology (including clinical medicine), computer science, and statistics. For data-driven biomedical and bioinformatics research, a strong and intuitive grasp of statistical concepts is essential. However, bioinformaticians with a biology background often find statistics challenging to master.

Among the most fundamental statistical methods used in bioinformatics are hypothesis tests, such as Student’s t-test and the Chi-squared test. Before discussing their implementation, it is crucial to understand the role of probability distributions in statistical inference.

Probability Distributions in Bioinformatics

Many real-world biological and biomedical phenomena follow predictable probability distributions. Through repeated observations and experiments, researchers have identified distinct distribution patterns governing different types of data. For example:

Phenotypic traits (e.g., height, weight, size) within a species often follow a normal distribution.
Incubation periods of certain infectious diseases may follow an exponential or Weibull distribution.
Gene expression levels across individuals, when measured as RNA-seq read counts, are commonly modeled with a negative binomial distribution.

Once the appropriate probability distribution for a given dataset is determined, statistical inference can be performed using a limited set of observations, allowing researchers to make conclusions about broader populations.
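
As a minimal sketch of this idea, assuming a trait is approximately normal, one can estimate the parameters of the distribution from a small sample and then make a statement about the wider population. The height values and the 185 cm threshold below are made up for illustration; `NormalDist` comes from Python's standard library.

```python
from statistics import NormalDist

# Hypothetical sample of adult heights in cm (illustrative values only).
heights_cm = [162.0, 171.5, 168.2, 175.0, 159.8, 180.3, 166.7, 172.1, 169.4, 177.6]

# Estimate the mean and standard deviation from the sample,
# assuming the population is approximately normal.
fitted = NormalDist.from_samples(heights_cm)

# Use the fitted distribution to estimate a population-level quantity:
# the fraction of individuals taller than 185 cm.
frac_taller = 1.0 - fitted.cdf(185.0)

print(f"estimated mean = {fitted.mean:.1f} cm, sd = {fitted.stdev:.1f} cm")
print(f"estimated fraction taller than 185 cm = {frac_taller:.3f}")
```

With only ten observations the estimate is rough, and it is only as good as the assumption that the normal distribution is the right model.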

Hypothesis Testing

Student’s t-test

Hypothesis testing asks whether the observed data are consistent with a null hypothesis, using the distribution that a test statistic would follow if that hypothesis were true. Student’s t-test, one of the most widely used hypothesis tests, determines whether the means of two datasets (or the mean of one dataset and a reference value) differ significantly. It assumes that the data, or at least the sample means, are approximately normally distributed, an assumption that the Central Limit Theorem often justifies for sufficiently large samples.

A key property of the normal distribution is that a fixed share of the probability lies within a given number of standard deviations of the mean:

About 68% of data points fall within one standard deviation of the mean.
About 95% fall within two standard deviations.
About 99.7% fall within three standard deviations.
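
These percentages can be checked directly from the cumulative distribution function of the standard normal distribution. The short sketch below uses only Python's standard library; the helper name `prob_within` is introduced here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def prob_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {prob_within(k):.4f}")

# The exact multiplier that captures 95% of the probability.
z_95 = standard_normal.inv_cdf(0.975)
print(f"95% of the probability lies within {z_95:.2f} sd")
```

The last line shows why 1.96, rather than exactly 2, is the value usually quoted for a two-sided 95% interval.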

Mathematically, Student’s t-test calculates how many standard errors separate the sample mean from the reference value, where the standard error is the sample standard deviation divided by the square root of the sample size. Loosely speaking, this is similar to a Z-test, but the t-test is designed for situations where the sample size is small and the population variance is unknown, so the variance must be estimated from the sample.
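
To make the calculation concrete, the one-sample t-statistic is the difference between the sample mean and the reference value, divided by the standard error of the mean. The sketch below computes it by hand with the standard library; the measurements and the reference value of 5.0 are hypothetical, and `one_sample_t` is a helper name chosen for this example.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, reference):
    """Return the t-statistic and degrees of freedom for a one-sample t-test."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)  # stdev uses the n - 1 denominator
    t_stat = (mean(sample) - reference) / standard_error
    return t_stat, n - 1

# Hypothetical measurements compared against a reference value of 5.0.
measurements = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4]
t_stat, dof = one_sample_t(measurements, reference=5.0)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The statistic is then compared against a t distribution with n - 1 degrees of freedom, which has heavier tails than the normal distribution when the sample is small.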

At a significance level of 0.05 (equivalently, a 95% confidence level), if the absolute value of the t-statistic exceeds the critical value, which is roughly two for moderately large samples and larger for small ones, we reject the null hypothesis that the two datasets share the same mean.
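
The threshold of roughly two depends on the sample size. The sketch below, which assumes SciPy is installed, prints the two-sided critical value for several degrees of freedom and then runs a two-sample t-test on hypothetical expression values for a control and a treated group.

```python
from scipy import stats

alpha = 0.05  # significance level for a two-sided test

# Critical |t| values: the "about two" rule only holds for larger samples.
for dof in (5, 10, 30, 1000):
    critical = stats.t.ppf(1 - alpha / 2, dof)
    print(f"df = {dof:4d}: reject if |t| > {critical:.3f}")

# Two-sample comparison on hypothetical expression values.
control = [4.8, 5.1, 5.0, 4.7, 5.3, 4.9]
treated = [5.6, 5.9, 5.4, 6.1, 5.7, 5.8]
result = stats.ttest_ind(treated, control)  # assumes equal variances by default
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4g}")

reject_null = result.pvalue < alpha
print("reject the null hypothesis of equal means:", reject_null)
```

If equal variances cannot be assumed, `stats.ttest_ind` accepts `equal_var=False`, which performs Welch's t-test instead.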

Chi-squared Test

Chi-squared Test of Independence

The Chi-squared test is commonly used to assess associations between categorical variables, such as evaluating whether the risk of developing cancer (cancer vs. healthy) is associated with race (African vs. Asian). This is known as the Chi-squared test of independence in contingency tables.
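
The mechanics of the test can be sketched by hand for a 2x2 table. The counts below are hypothetical. The expected counts come from the assumption that rows and columns are independent, and the p-value uses a closed form that is valid only for one degree of freedom.

```python
import math

# Hypothetical 2x2 contingency table (counts are illustrative only).
# Rows: two groups; columns: (cancer, healthy).
observed = [[30, 70],
            [45, 55]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Under independence, expected count = row total * column total / grand total.
chi2_stat = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi2_stat += (obs - expected) ** 2 / expected

# A 2x2 table has (2 - 1) * (2 - 1) = 1 degree of freedom. For 1 degree of
# freedom the chi-squared tail probability has the closed form erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2_stat / 2))
print(f"chi2 = {chi2_stat:.3f}, p = {p_value:.4f}")
```

In practice one would more likely call `scipy.stats.chi2_contingency`. Note that it applies Yates' continuity correction to 2x2 tables by default, so its statistic will differ slightly from this uncorrected calculation unless `correction=False` is passed.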

Relationship Between the Chi-squared and Normal Distributions

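The Chi-squared distribution is closely related to the normal distribution. Specifically, a Chi-squared distribution with k degrees of freedom describes the sum of the squares of k independent standard normal variables. In this loose sense, the Chi-squared test can be viewed as a generalization of the Z-test to several components: with one degree of freedom, the Chi-squared statistic is simply a squared Z-statistic.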
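
This relationship is easy to check by simulation. The sketch below draws sums of squared standard normal values using only the standard library and compares their mean and variance with the theoretical values for a Chi-squared distribution, which are k and 2k for k degrees of freedom. The choice of k = 5 and 20,000 draws is arbitrary.

```python
import random
from statistics import mean, variance

random.seed(0)  # fixed seed so the sketch is reproducible

def sum_of_squared_normals(k: int) -> float:
    """Draw k independent standard normal values and return the sum of their squares."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))

k = 5             # number of squared normals, i.e. the degrees of freedom
n_draws = 20_000
draws = [sum_of_squared_normals(k) for _ in range(n_draws)]

# A chi-squared distribution with k degrees of freedom has mean k and variance 2k.
print(f"simulated mean = {mean(draws):.2f} (theory: {k})")
print(f"simulated variance = {variance(draws):.2f} (theory: {2 * k})")
```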

Additionally, by the Central Limit Theorem, the Chi-squared distribution is asymptotically normal: as the degrees of freedom increase, it approaches a normal distribution.
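
One rough way to see this convergence, assuming NumPy and SciPy are available, is to compare the Chi-squared cumulative distribution function with that of a normal distribution having the same mean and variance. The helper `max_cdf_gap` and the grid of evaluation points are choices made for this sketch.

```python
import numpy as np
from scipy import stats

def max_cdf_gap(k: int) -> float:
    """Largest absolute gap between the chi-squared(k) CDF and the normal CDF
    with the same mean (k) and variance (2k), evaluated on a grid."""
    sd = np.sqrt(2 * k)
    grid = np.linspace(max(0.0, k - 5 * sd), k + 5 * sd, 2001)
    gap = np.abs(stats.chi2.cdf(grid, df=k) - stats.norm.cdf(grid, loc=k, scale=sd))
    return float(gap.max())

gaps = {k: max_cdf_gap(k) for k in (2, 20, 200, 2000)}
for k, gap in gaps.items():
    print(f"df = {k:5d}: max CDF gap = {gap:.4f}")
```

The gap shrinks as the degrees of freedom grow, but slowly, which is why the normal approximation is poor for the small degrees of freedom typical of contingency tables.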

Chi-squared Test of Goodness of Fit

Beyond contingency table analysis, the Chi-squared test can also assess how well a dataset follows an assumed probability distribution; this is known as the Chi-squared test of goodness of fit. Here, instead of comparing two categorical variables, the test compares the observed counts in each category against the counts expected under the assumed distribution.
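
As an illustrative sketch, assuming SciPy is installed, the example below tests whether phenotype counts from a dihybrid cross are consistent with the 9:3:3:1 ratio predicted by Mendelian inheritance. The counts are used purely for illustration.

```python
from scipy import stats

# Illustrative phenotype counts from a dihybrid cross (four phenotype classes).
observed = [315, 108, 101, 32]

# Expected counts under the assumed 9:3:3:1 ratio, scaled to the same total.
ratio = [9, 3, 3, 1]
total = sum(observed)
expected = [total * r / sum(ratio) for r in ratio]

# Degrees of freedom default to (number of categories - 1) = 3.
result = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

A large p-value here means the data give no evidence against the assumed ratio; it does not prove that the ratio is correct.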

Chi-squared Distribution in Survival Analysis

The Chi-squared distribution plays a crucial role in survival analysis, particularly in the Cox proportional hazards model. Three classical hypothesis tests used with Cox models, namely the Wald test, the likelihood ratio test, and the score test (closely related to the log-rank test when comparing groups), all produce statistics that approximately follow a Chi-squared distribution under the null hypothesis. Each of them tests whether a covariate significantly alters the hazard of an event (e.g., patient death in survival analysis).
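
To show how such test statistics are turned into p-values, the sketch below computes a Wald test and a likelihood ratio test from the kind of summary numbers a fitted Cox model reports. All four input values are hypothetical and not taken from any real analysis; only `scipy.stats.chi2.sf` is assumed to be available.

```python
from scipy import stats

# Hypothetical output of a fitted Cox model with a single covariate
# (all four numbers are illustrative, not taken from real data).
coef = 0.62            # estimated log hazard ratio
std_err = 0.25         # its standard error
loglik_null = -210.4   # partial log-likelihood without the covariate
loglik_full = -207.1   # partial log-likelihood with the covariate
dof = 1                # one covariate is being tested

# Wald test: squared ratio of the estimate to its standard error.
wald_stat = (coef / std_err) ** 2
wald_p = stats.chi2.sf(wald_stat, dof)

# Likelihood ratio test: twice the gain in partial log-likelihood.
lr_stat = 2 * (loglik_full - loglik_null)
lr_p = stats.chi2.sf(lr_stat, dof)

print(f"Wald: chi2 = {wald_stat:.2f}, p = {wald_p:.4f}")
print(f"LRT:  chi2 = {lr_stat:.2f}, p = {lr_p:.4f}")
```

The two tests usually agree closely in large samples; when they disagree noticeably, the likelihood ratio test is often considered the more reliable of the two.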
