The power of biological replicates in statistical analysis
Biological replicates are common in case-control studies, and the reason is straightforward: no two individuals are biologically identical. Whenever we investigate the biological difference between individuals with different phenotypes, say cancerous vs non-cancerous, the biological variation within each phenotype (variation not associated with the phenotype of interest) acts as a confounder in our experiment. We therefore need to collect an appropriate number of samples from each phenotypic population to dilute this confounder and uncover the biological difference between phenotypes (replicates are also required to control technical variation, which is not the concern of this article). This article digs deeper into how the number of biological replicates affects the choice of analytical approach throughout a study.
One major use of biological replicates is in RNA-Seq, more specifically in differential expression (DE) analysis. The negative binomial distribution is widely used to model expression counts because it captures both technical and biological variation (See PREVIOUS ARTICLE). There are also methods based on the normal distribution, such as limma. Here we need to introduce the Central Limit Theorem (CLT), which establishes that, when the sample size is adequate, the sample mean is approximately normally distributed even though the underlying data are not. In DE analysis, we are actually asking whether the average expression level of a specific gene differs between groups. The average expression level is a sample mean, so the CLT applies to it. Therefore, we can conclude that:
When we are conducting an RNA-Seq experiment without the luxury of sequencing a large number of replicates, we have to stick to the approach that best models the biological scenario of interest. In the case of DE analysis, that is a model based on the negative binomial distribution.
When we are taking advantage of a large publicly available dataset where biological replicates are abundant (such as TCGA), we can save ourselves the trouble and use a simple method, such as Student’s t-test. Similarly, when we are comparing gene expression in scRNA-Seq, where each cell represents a biological replicate, we can use an even simpler non-parametric test, such as the Wilcoxon rank-sum test.
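The CLT argument behind these two conclusions can be checked with a small simulation. The following is a minimal Python sketch, not part of any published pipeline: it draws negative binomial counts as a gamma-Poisson mixture and measures how skewed the distribution of the average expression level is for different numbers of replicates. The mean (5), the dispersion (1), the replicate numbers and the helper names nb_count, skewness and skew_of_sample_mean are all illustrative assumptions.

```python
import random
import statistics

def nb_count(rng, mean, dispersion):
    """Draw one negative binomial count as a gamma-Poisson mixture."""
    # Gamma-distributed expression level with the given mean and
    # variance dispersion * mean**2 (the biological variation).
    level = rng.gammavariate(1.0 / dispersion, mean * dispersion)
    # Poisson draw around that level (the counting noise): count the
    # arrivals of a rate-1 process that fall before time `level`.
    count, t = 0, rng.expovariate(1.0)
    while t < level:
        count += 1
        t += rng.expovariate(1.0)
    return count

def skewness(values):
    """Population skewness; 0 for a symmetric (e.g. normal) distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def skew_of_sample_mean(n_reps, n_sims=2000, mean=5.0, dispersion=1.0, seed=0):
    """Skewness of the average count over n_reps replicates, across n_sims experiments."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([nb_count(rng, mean, dispersion) for _ in range(n_reps)])
        for _ in range(n_sims)
    ]
    return skewness(means)

# Assumed replicate numbers: a small experiment vs. replicate-rich datasets.
skew = {n: skew_of_sample_mean(n) for n in (3, 30, 100)}
for n, value in skew.items():
    print(f"{n:>3} replicates: skewness of the sample mean = {value:.2f}")
```

For independent replicates, the skewness of the sample mean shrinks in proportion to one over the square root of the number of replicates, so the average of a handful of counts is still clearly non-normal while the average of a hundred is close to symmetric. This is the sense in which abundant replicates make a normal-theory test a reasonable choice.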
Another example is DNA methylation. To identify differentially methylated cytosines (DMCs), some traditional methods apply Fisher's exact test, which is based on the hypergeometric distribution. This approach has recently been challenged for not accounting for biological variation (the same logic as the objection to using the Poisson distribution in DE), and the negative binomial distribution has been proposed as an alternative. However, if we have the luxury of acquiring more biological replicates, this dispute can be spared.
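The objection to Fisher's exact test can be illustrated with a simulation in which there is no true difference between the two groups. The sketch below is a hedged illustration rather than any published method: it pools reads over the replicates of each group, applies a hand-written two-sided Fisher's exact test, and compares the false positive rate with and without variation between individuals. The beta distribution used for individual methylation levels, the five replicates per group, the coverage of 20 reads and all function names are assumptions made for this example.

```python
import math
import random

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = math.comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Hypergeometric probability of every table with the same margins.
    probs = {
        x: math.comb(row1, x) * math.comb(row2, col1 - x) / total
        for x in range(lo, hi + 1)
    }
    # Sum the tables that are no more probable than the observed one.
    cutoff = probs[a] * (1 + 1e-7)
    return min(1.0, sum(p for p in probs.values() if p <= cutoff))

def pooled_counts(rng, n_reps, coverage, draw_level):
    """Pool methylated and unmethylated reads over the replicates of one group."""
    methylated = 0
    for _ in range(n_reps):
        level = draw_level(rng)  # methylation level of this individual
        methylated += sum(rng.random() < level for _ in range(coverage))
    return methylated, n_reps * coverage - methylated

def false_positive_rate(draw_level, n_sims=500, n_reps=5, coverage=20,
                        alpha=0.05, seed=0):
    """Fraction of null experiments in which the pooled test rejects at alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        a, b = pooled_counts(rng, n_reps, coverage, draw_level)
        c, d = pooled_counts(rng, n_reps, coverage, draw_level)
        hits += fisher_exact_two_sided(a, b, c, d) < alpha
    return hits / n_sims

# Both groups share the same average methylation level (0.5) in both scenarios.
fpr_fixed = false_positive_rate(lambda rng: 0.5)                       # no biological variation
fpr_variable = false_positive_rate(lambda rng: rng.betavariate(2, 2))  # individuals differ
print(f"false positive rate, identical individuals: {fpr_fixed:.3f}")
print(f"false positive rate, variable individuals:  {fpr_variable:.3f}")
```

When individuals differ, the pooled counts vary more than the hypergeometric model expects, so the test rejects far more often than the nominal 5% even though the groups have the same average methylation level. With identical individuals the rate stays at or below the nominal level.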
The power of biological replicates resides in their ability to mitigate biological variation, a confounder that we are very likely to encounter whenever we seek biological markers in a case-control study. In general, the more biological replicates we have, the less our study is affected by confounders, and the fewer constraints we face in choosing an appropriate statistical approach.