Normalization methods in RNA-Seq analysis

In previous POST, we discussed several possible bias sources and how to correct them. Here Ghost wanna talk about two other bias that must be normalized in every RNA-Seq analysis.

Heteroscedasticity issue

For genes that are highly expressed, their high expression count usually yield large variation from sample to sample within the same group. Besides, the small amount of highly expressed gene usually skew the expression distribution and make the analysis / visualization harder. These issues can be mitigated by proper normalization:

  • Log: Log2(n+1). For low expressed gene, the +1 may introduce dramatic change in transformed value. However, we probably don't care those.

  • VST (DESeq): Take square root.

  • Voom (Limma): Transforms the normalized counts to logarithmic (base 2) scale and estimates their mean–variance relationship to determine a weight (log-counts per million with associated precision weights)

We can see, after the transformation, we get rid of some extreme value. Thus, migrate the heteroscedasticity problem.

Library / lane size issue

Since the total number of reads generated on each library / lane is different, genes in different sample may be assigned to different number of reads although these genes are expressed at the same level. We need to normalize this by:

  • Total count (TC): Gene counts are divided by the total number of mapped reads (or library size) associated with their lane and multiplied by the mean total count across all the samples of the dataset.

  • Upper Quartile (UQ): Very similar in principle to TC, the total counts are replaced by the upper quartile of counts different from 0 in the computation of the normalization factors.

  • Median (Med): Also similar to TC, the total counts are replaced by the median counts different from 0 in the computation of the normalization factors.

  • Relative Log Expression (DESeq): This is based on the hypothesis that most genes are not DE. It normalizes the library size (mapped reads) of each lane (sample) (normalize between samples by scaling raw read counts in each lane according to a single, lane-specific factor), then divides each gene count of each sample by this size factor.

  • Trimmed Mean of M-values (EdgeR / Limma): It is also based on the hypothesis that most genes are not DE.

  • Quantile (Q): First proposed in the context of microarray data, this normalization method consists in matching distributions of gene counts across lanes.

  • Reads Per Kilobase of exon per Million mapped reads (RPKM): normalizing by exonic length and library size. TPM may be a better choose than RPKM/ FPKM since it produce the same qualitative entity between sample after normalization.

Normalization in DESeq

DESeq internally normalizes for library size (Relative Log Expression). Note that neither rlog transformation nor the VST are used by the differential expression estimation in DESeq, which always operates on the raw count data, through generalized linear modeling which incorporates knowledge of the variance-mean dependence. The rlog transformation and VST are offered as separate functionality which can be used for visualization, clustering or other machine learning tasks.

Dispersion shrinkage in DESeq2

Gene count dispersion tends to be large for dataset 1) with small number of biological replicates 2) for genes lowly expressed. An empirical Bayes (prior derived from data itself) dispersion shrinkage is implemented on DESeq2 to handle this

  1. Gene-wide dispersion estimation using MLE.

  2. Fit a curve which is used as prior probability.

  3. Obtain MAP based on the prior

The strength of shrinkage depends on the fisher information, not mean count. the shrinkage is intense when the gene counts widely spread across samples (large dispersion).

This method can be viewed as bias-variance trade-off, reducing variance at the expense of accepting bias towards zero.

Normalization in Limma

Limma was initially build to handle microarray data. It is based on linear model and use TMM library / lane normalization of the edgeR package. Then:

If library size across samples consistent (within 3-fold):

Choose limma-trend (convert count to logCPM in advance)


Choose voom (apply quantile normalization if data is noisy) in which (counts is taken logarithm)

Start standard limma DE analysis using linear model