Technical considerations in cell type decomposition

Ghost recently did some research on cell type decomposition techniques, specifically TIL (tumor-infiltrating lymphocyte) decomposition in tumor samples. TIMER and CIBERSORT came up, along with the technical debate between their authors published as correspondence in Genome Biology. Li, in "Revisit linear regression-based deconvolution methods for tumor gene expression data", asserts that, despite heavy regularization of the linear model, CIBERSORT's inclusion of cell types with similar expression features causes multicollinearity and consequently yields inaccurate results. Newman, in "Data normalization considerations for digital tumor dissection", counters that TIMER lacks proper normalization of its immune cell estimates against the total amount of leukocytes.

Here is a brief methodology description of both methods:


TIMER:

  1. Select signature genes (Gi, n=2271) overexpressed in the immune lineage from IRIS

  2. Select six cell types whose expression signatures are not correlated

  3. Select cancer-type-specific genes negatively correlated with tumor purity (Gp), then intersect them with Gi to generate G0

  4. For each gene in G0, calculate the median expression across all samples available for each cell type, giving six gene expression vectors, one per cell type.

  5. Apply iterative linear least squares regression (LLSR)
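
The regression in step 5 can be sketched as follows. This is only a minimal illustration, not TIMER's actual implementation: the toy signature matrix, the mixture, and the rule of refitting after dropping the most negative coefficient are all assumptions made for the example, and `deconvolve_iterative` is a hypothetical helper name.

```python
import numpy as np

def deconvolve_iterative(signatures, mixture):
    """Least squares fit that refits after dropping negative coefficients.

    signatures: (genes x cell_types) reference expression matrix
    mixture:    (genes,) expression vector of the bulk sample
    """
    n_types = signatures.shape[1]
    active = list(range(n_types))      # cell types still in the model
    coef = np.zeros(n_types)
    while active:
        fit, *_ = np.linalg.lstsq(signatures[:, active], mixture, rcond=None)
        if np.all(fit >= 0):
            coef[active] = fit
            break
        # assumed rule: drop the most negative cell type, then refit
        active.pop(int(np.argmin(fit)))
    return coef

# toy data: 5 genes, 3 cell types (made-up numbers)
signatures = np.array([
    [10.0, 1.0, 1.0],
    [1.0, 8.0, 1.0],
    [1.0, 1.0, 9.0],
    [5.0, 5.0, 1.0],
    [2.0, 1.0, 6.0],
])
true_coef = np.array([0.3, 0.0, 0.2])
mixture = signatures @ true_coef       # noiseless bulk expression
estimate = deconvolve_iterative(signatures, mixture)
```

On this noiseless toy mixture the estimate recovers the coefficients used to build it; real data would add noise and require the gene selection from steps 1 to 4.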


CIBERSORT:

  1. Get the expression signature matrix and calculate its 2-norm condition number (to control multicollinearity and keep the features with maximal discriminatory power).

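  2. Apply SVR with an L2 penalty. SVR fits the data with an ε-insensitive loss function: points that fall inside a tube of constant width around the fit (the 'constant distance cube') incur no penalty, so only the points outside it drive the fit (a form of feature selection), while L2 regularization mitigates multicollinearity.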
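
To make the two ingredients above concrete, here is a small sketch using NumPy. It does not reproduce CIBERSORT itself: the two signature matrices are made up, and `epsilon_insensitive_loss` is a hypothetical helper that only evaluates the loss and does not fit an SVR.

```python
import numpy as np

def epsilon_insensitive_loss(residuals, epsilon):
    """Zero inside the tube |r| <= epsilon, linear outside it."""
    return np.maximum(0.0, np.abs(residuals) - epsilon)

# two made-up signature matrices (3 genes x 2 cell types)
distinct = np.array([[10.0, 1.0], [1.0, 10.0], [5.0, 1.0]])
similar = np.array([[10.0, 9.9], [1.0, 1.1], [5.0, 5.0]])

# np.linalg.cond uses the 2-norm (ratio of extreme singular values) by default
cond_distinct = np.linalg.cond(distinct)
cond_similar = np.linalg.cond(similar)

# residuals of -0.3, 0.05 and 0.5 with a tube half-width of 0.1
losses = epsilon_insensitive_loss(np.array([-0.3, 0.05, 0.5]), epsilon=0.1)
```

The matrix with two nearly identical columns has a much larger condition number, which is the quantity step 1 tries to keep small. In the loss, the residual of 0.05 falls inside the tube and costs nothing.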

One should keep in mind the subtle yet important distinction between these two tools: CIBERSORT infers the relative abundance of immune subsets within the total leukocyte population. TIMER, on the other hand, calculates the fraction of immune cells with respect to the entire tumor microenvironment: everything in the sample, not the immune cells alone.
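
A toy calculation shows why this matters. The numbers and the helper `to_relative` are invented for illustration and do not correspond to either tool's output format.

```python
def to_relative(absolute_fractions):
    """Rescale fractions of the whole sample to fractions of total leukocytes."""
    total_leukocytes = sum(absolute_fractions.values())
    return {cell: frac / total_leukocytes
            for cell, frac in absolute_fractions.items()}

# made-up samples: each value is a fraction of the entire sample
sample_a = {"CD8 T": 0.02, "B": 0.01, "Macrophage": 0.07}  # 10% immune content
sample_b = {"CD8 T": 0.10, "B": 0.05, "Macrophage": 0.35}  # 50% immune content

relative_a = to_relative(sample_a)
relative_b = to_relative(sample_b)
```

Both samples have the same relative composition (20% CD8 T, 10% B, 70% macrophage among leukocytes), although sample_b contains five times more immune cells overall. Results on the two scales are therefore not directly comparable.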

Statistical assumptions for linear models

Both methods apply a linear model to RNA-Seq data to dissect immune cells. For a linear model to perform well, a number of assumptions should hold. Let's go over them.

Multicollinearity: First and foremost, multicollinearity is one of the main points TIMER raises against CIBERSORT in their correspondence. Some cell types (for example T-helper 1 and T-helper 2) are close in lineage and have similar expression profiles, although they may be functionally different. When we try to separate them with a linear model, their correlated expression features cause multicollinearity and make the model unstable: the estimated coefficients may change dramatically in response to small changes in the data. TIMER chooses to avoid multicollinearity by modeling only six dissimilar cell types (step 2 in the TIMER methodology described above). CIBERSORT, on the other hand, applies support vector regression (SVR) with an L2 penalty (the same penalty as in ridge regression), which is supposed to counteract multicollinearity through both the feature selection done by SVR and the coefficient shrinkage from the L2 penalty.
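
The instability can be reproduced with a small simulation. This is a sketch under stated assumptions: the two "cell type" signatures are random numbers made nearly identical on purpose, and plain ridge regression stands in for the L2 penalty (it is not SVR and not CIBERSORT).

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 50
th1 = rng.uniform(1.0, 10.0, n_genes)
th2 = th1 + rng.normal(0.0, 0.01, n_genes)   # nearly identical signature
signatures = np.column_stack([th1, th2])
true_coef = np.array([0.3, 0.2])

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ridge(X, y, lam):
    # closed-form ridge solution: (X'X + lam * I)^-1 X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

ols_fits, ridge_fits = [], []
for _ in range(200):
    # same mixture each time, only the measurement noise changes
    y = signatures @ true_coef + rng.normal(0.0, 0.1, n_genes)
    ols_fits.append(ols(signatures, y))
    ridge_fits.append(ridge(signatures, y, lam=1.0))

ols_spread = np.std(ols_fits, axis=0)      # spread of each coefficient
ridge_spread = np.std(ridge_fits, axis=0)
```

Across the noisy replicates the ordinary least squares coefficients swing widely, while the ridge coefficients barely move. The price is bias: with two almost indistinguishable signatures, the penalized fit tends to split the signal between them rather than recover the true proportions.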


Heteroscedasticity: Compared with lowly expressed genes, highly expressed genes usually have expression values with larger variance across samples. This mean-variance dependence should be handled by a normalization method, such as a log transformation or a variance-stabilizing transformation (VST).
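
A quick simulation illustrates the mean-variance dependence and what a log transform does to it. The negative binomial model, the dispersion value, and the two mean levels are assumptions chosen for the example, not estimates from real data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 2000
dispersion = 0.1  # assumed: variance = mean + dispersion * mean**2

def simulate_counts(mean):
    # negative binomial counts drawn as a gamma-Poisson mixture
    rates = rng.gamma(shape=1.0 / dispersion, scale=mean * dispersion,
                      size=n_samples)
    return rng.poisson(rates)

low = simulate_counts(50)      # lowly expressed gene
high = simulate_counts(5000)   # highly expressed gene

raw_var_ratio = high.var() / low.var()
log_var_ratio = np.log2(high + 1).var() / np.log2(low + 1).var()
```

On the raw scale the highly expressed gene has a variance that is orders of magnitude larger; after log2(count + 1) the two variances are of similar size.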

Linearity: Linearity is another issue when applying a linear model to RNA-Seq data.

  • Library construction: RNA-Seq library construction, unlike microarray, is a zero-sum game, which creates dependency: since the total number of reads in a library is fixed, including a read from one expression signature means excluding a read from another. This dependency leads to multicollinearity and consequently to non-linear relationships. The issue is inherent in the raw data and is unlikely to be fixed by any normalization method.

  • Data normalization: In RNA-Seq analysis, read counts are normalized to account for library size or heteroscedasticity. Does this transformation distort linearity? Some papers point out that, in microarray data, the logarithm does distort linearity, although the author of csSAM claims in his correspondence that in practice log-transformed values often yield a lower false discovery rate. For RNA-Seq, benchmarking shows that Kallisto.TPM and Salmon.TPM yield a more linear relationship. In addition, the author of voom suggests logCPM for better preservation of linearity.
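
Both points can be checked with simple arithmetic. The sketch below uses made-up abundances and a hypothetical helper `sequence`; it models only the expected split of a fixed number of reads and ignores sampling noise, gene length and amplification.

```python
import math

def sequence(abundance, depth):
    """Expected read counts when a fixed depth is split by relative abundance."""
    total = sum(abundance.values())
    return {gene: depth * amount / total for gene, amount in abundance.items()}

# (1) zero-sum effect: only geneA changes in absolute terms
depth = 1_000_000
before = {"geneA": 100.0, "geneB": 100.0, "geneC": 200.0}
after = dict(before, geneA=500.0)

reads_before = sequence(before, depth)
reads_after = sequence(after, depth)   # geneB and geneC lose reads

# (2) log transform and additivity: a 50/50 mixture of two cell types
cell_x, cell_y = 16.0, 1024.0          # expression of one gene, linear scale
frac_x, frac_y = 0.5, 0.5
mixture = frac_x * cell_x + frac_y * cell_y                  # 520.0
log_of_mixture = math.log2(mixture)                          # about 9.02
mixture_of_logs = (frac_x * math.log2(cell_x)
                   + frac_y * math.log2(cell_y))             # 7.0
```

In part (1), geneB drops from a quarter to an eighth of the reads although its absolute abundance did not change. In part (2), the log of the mixture is not the mixture of the logs, so a model that is linear in cell fractions on the raw scale is no longer linear after log transformation.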

In sum, as an overdetermined system, inferring cell type composition from expression profiles is theoretically feasible. Linearity may be problematic depending on the normalization method. The dependency between variables introduced during RNA-Seq library construction is inherent in the raw data and may not be fixable. In addition, PCR amplification and data normalization may bring further complications.