Dimension reduction in RNA-Seq based model

Ghost encountered a dimension reduction problem while designing a model that aims to describe a patient's immuno-oncology (IO) status. The model is supposed to incorporate the well-known immune-related factors listed in the figure below (for factor selection details, please read THIS and THIS).

Most of these factors can be evaluated by either RNA-Seq or WES (except TIL localization and perhaps PD-L1 staining for CD8 activation). From a modeling perspective, this generalization is a dimension reduction process. Expression estimates for over 20,000 genes from RNA-Seq, or hundreds of mutations from WES, are converted into a handful of IO-related features such as immunogenicity, TIL composition, or TCR status. These biologically well-defined features tell the story of the corresponding patient's IO status.

However, epitope presentation and leukocyte recruitment remain untouched. How do we extract features that describe these processes? Let's compare some common feature extraction methods:

1: Manually curate a set of genes known to affect the corresponding process.

Cons: This method is rather arbitrary and tends to include many genes that contribute little and instead dilute the other well-defined features.

2: Apply a well-known dimension reduction technique, such as NMF, PCA, or SVD, to extract features.

Cons: Although NMF can provide a sparse solution, which is often appreciated in RNA-Seq, the features obtained from such matrix transformations or decompositions lose their biological meaning and make downstream analysis non-interpretable.

Both methods have their drawbacks. Hand-selected gene-level features are not aligned with the rest of the features and may compromise the model. On the other hand, features extracted through matrix decomposition often remain biologically non-interpretable.
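To make the interpretability issue concrete, here is a minimal sketch on purely synthetic data (the matrix sizes, the planted 10-gene "program", and all variable names are assumptions for illustration, not anything from a real dataset). It runs PCA through an SVD and counts how many genes carry a non-zero weight in the first component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy log-expression matrix: 30 samples x 200 genes (synthetic, illustration only).
n_samples, n_genes = 30, 200
expr = rng.normal(size=(n_samples, n_genes))

# Plant one hypothetical "biological program": the first 10 genes move together.
program = rng.normal(size=n_samples)
expr[:, :10] += 3.0 * program[:, None]

# PCA via SVD of the column-centered matrix.
centered = expr - expr.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
pc1_loadings = vt[0]          # one weight per gene
pc1_scores = u[:, 0] * s[0]   # one value per sample

# Count the genes that contribute to PC1 with a non-negligible weight.
n_nonzero = int(np.count_nonzero(np.abs(pc1_loadings) > 1e-12))
print(f"genes with non-zero PC1 loading: {n_nonzero} / {n_genes}")
```

Even though only 10 genes were planted as a program, the component is a weighted blend of every gene in the matrix, which is what makes a feature like this hard to name biologically.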

An elegant way to extract features

Ghost was amazed by Vesteinn Thorsson's paper, which provides an exquisite solution to this problem. Let's briefly see what they did.

  1. 160 IO-related expression signatures were manually selected from a variety of sources, including MSigDB.

  2. First round of reduction: gene set enrichment analysis was performed on the expression estimates of each tumor sample, generating an n_sample * n_signature matrix of enrichment scores.

  3. Second round of reduction: hierarchical clustering (WGCNA) was performed to generate nine eigen-signatures (top of figure above).

  4. To prevent overfitting, the clusters were validated by predictive strength using a Gaussian mixture model. This cross-validation removed three clusters that were not robust. The final features were then set (box in figure above)!

  5. Clustering was performed on the five final features and generated six clusters determined by MAD.
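The two rounds of reduction in steps 2 and 3 can be sketched roughly as follows. This is not the paper's pipeline: the data are synthetic, the six signatures are hypothetical, a plain mean z-score stands in for a real enrichment statistic, and the module membership is hard-coded where WGCNA would derive it from the data. The point is only to show the shape of the transformation, genes to signature scores to an eigen-signature.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: 40 samples x 300 genes and 6 hypothetical gene signatures.
n_samples, n_genes = 40, 300
expr = rng.normal(size=(n_samples, n_genes))
signatures = {f"sig_{k}": np.arange(k * 20, k * 20 + 20) for k in range(6)}

# Let sig_0..sig_2 share one latent process so that they behave like a module.
latent = rng.normal(size=n_samples)
expr[:, :60] += 2.0 * latent[:, None]

# Round 1: genes -> signature scores (n_sample x n_signature).
# Mean z-score is a simple stand-in here, NOT a real enrichment score.
z = (expr - expr.mean(axis=0)) / expr.std(axis=0)
scores = np.column_stack([z[:, idx].mean(axis=1) for idx in signatures.values()])

# Round 2: signature scores -> one eigen-signature per module.
# Module membership is hard-coded; a clustering step would normally provide it.
module = [0, 1, 2]
block = scores[:, module]
block = (block - block.mean(axis=0)) / block.std(axis=0)
u, s, vt = np.linalg.svd(block, full_matrices=False)
eigen_signature = u[:, 0] * s[0]   # first principal component of the module

print(scores.shape)           # (40, 6)
print(eigen_signature.shape)  # (40,)
```

Because every intermediate column is either a named signature or a summary of a few related signatures, the final features stay traceable to biology in a way that a decomposition of the raw gene matrix does not.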

The beauty of this method is that it significantly reduces the feature dimension while keeping the extracted features both statistically and biologically meaningful! From the first step of feature extraction all the way down to the final feature set, the researchers kept the features meaningful at each step and, more importantly, the clusters generated from these features are biologically meaningful as well.

In the downstream survival regression, the authors applied a CoxPH model regularized by the elastic net, which combines LASSO and ridge regression and acts as another feature selection step during modeling. Elastic net regularization is a smart choice for getting rid of redundant features while keeping highly correlated variables together as a single unit (the grouping effect).
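For reference, one common way to write an elastic-net-penalized Cox model is given here. The parameterization with a mixing weight \alpha and an overall strength \lambda is an assumption (it is the convention used by glmnet-style software) and is not necessarily the exact form used in the paper. Here \delta_i is the event indicator, t_i the observed time, x_i the feature vector, and R(t_i) the set of patients still at risk at time t_i.

```latex
\hat{\beta} = \arg\max_{\beta}\;
  \Big[\, \ell(\beta) - \lambda \Big( \alpha \lVert \beta \rVert_1
        + \tfrac{1-\alpha}{2} \lVert \beta \rVert_2^2 \Big) \Big],
\qquad
\ell(\beta) = \sum_{i:\,\delta_i = 1}
  \Big[\, x_i^{\top}\beta - \log \sum_{j \in R(t_i)} \exp\big(x_j^{\top}\beta\big) \Big]
```

Under this convention, \alpha = 1 gives a pure LASSO penalty, \alpha = 0 gives pure ridge, and values in between mix the two.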


Ridge: a typical regularization against multicollinearity. Ridge penalizes the squared L2 norm (Euclidean distance) of the coefficients, which shrinks large regression coefficients in order to reduce overfitting.

LASSO: LASSO uses the L1 norm (taxicab distance) to force the sum of the absolute values of the regression coefficients to be less than a fixed value, which forces certain coefficients to zero (a sparse solution).

We can see from the comparison table above that although both methods shrink coefficients, only LASSO provides a sparse solution by setting some coefficients to exactly zero. Note that the penalty used in LASSO serves the same purpose as PCA / LDA: dimension reduction.
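The difference is easiest to see in the textbook special case of an orthonormal design, where both estimators have closed forms in terms of the ordinary least squares coefficient. The sketch below assumes the objectives (1/2)||y - Xb||^2 + (lam/2)||b||^2 for ridge and (1/2)||y - Xb||^2 + lam*||b||_1 for LASSO; the coefficient values and lam are made up for illustration.

```python
def ridge_shrink(b_ols, lam):
    """Ridge solution under an orthonormal design: proportional shrinkage."""
    return b_ols / (1.0 + lam)


def lasso_shrink(b_ols, lam):
    """LASSO solution under an orthonormal design: soft-thresholding."""
    if abs(b_ols) <= lam:
        return 0.0  # small coefficients are set exactly to zero
    return b_ols - lam if b_ols > 0 else b_ols + lam


ols = [3.0, -1.5, 0.4, -0.2]   # hypothetical least squares coefficients
lam = 0.5

ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]

print("ridge:", ridge)
print("lasso:", lasso)  # [2.5, -1.0, 0.0, 0.0]
```

Ridge scales every coefficient by the same factor, so none of them reaches zero, while LASSO subtracts a fixed amount and drops anything smaller than the threshold.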

Elastic net: Elastic net combines L1 and L2 penalties.

  • Feature selection: features are selected through the LASSO (L1) part of the penalty.

  • Grouping effect: it overcomes LASSO’s limitation of including only one of a set of correlated variables by adding a squared (L2) penalty term, so that correlated variables that LASSO alone would zero out are kept together as a single unit. The grouping effect makes sense especially in studies where the genes lie in a known pathway.
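The grouping effect can be illustrated with a small, self-written coordinate descent solver on synthetic data. Everything here is an assumption for demonstration: the objective uses the common (lam, alpha) parameterization, the two "genes" are exact copies of each other, and the solver is a bare-bones sketch, not a replacement for a tested library.

```python
import numpy as np


def soft_threshold(value, threshold):
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def elastic_net_cd(X, y, lam, alpha, n_sweeps=500):
    """Cyclic coordinate descent for
    (1/2n)||y - Xb||^2 + lam * (alpha*||b||_1 + (1 - alpha)/2 * ||b||_2^2)."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with feature j's current contribution added back in.
            partial_residual = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ partial_residual / n
            z = X[:, j] @ X[:, j] / n
            b[j] = soft_threshold(rho, lam * alpha) / (z + lam * (1.0 - alpha))
    return b


rng = np.random.default_rng(2)
x = rng.normal(size=100)
x = (x - x.mean()) / x.std()
X = np.column_stack([x, x])   # two perfectly correlated hypothetical genes
y = 2.0 * x

b_lasso = elastic_net_cd(X, y, lam=0.1, alpha=1.0)  # pure L1 penalty
b_enet = elastic_net_cd(X, y, lam=0.1, alpha=0.5)   # mixed L1 + L2 penalty

print("LASSO:      ", b_lasso)
print("elastic net:", b_enet)
```

In this toy setup the pure L1 fit puts essentially all of the weight on whichever copy it visits first, whereas the mixed penalty splits the weight evenly between the two copies, so neither correlated gene is dropped.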