Matrix factorization of high-dimensional data

In oncology research, a common goal is to uncover the specific molecular patterns of a cancer. High-throughput sequencing has become the routine starting point. However, sequencing techniques such as RNA-Seq and bisulfite sequencing generate high-dimensional data and therefore suffer from what is known as the 'curse of dimensionality'. How to reduce the dimension of the data while preserving biological patterns remains an area of active research. We previously discussed dimension reduction. In addition, a variety of matrix factorization techniques have been used for other applications such as molecular subtyping and cell type separation. This post reviews the topic on a broader scale.

From a technical point of view, the terms 'dimension reduction', 'matrix factorization' and 'unsupervised clustering' overlap to some extent. They all involve uncovering specific molecular patterns, each of which is shared among a subset of samples. Imagine an RNA-Seq expression matrix of size N*M, where N is the number of genes and M is the number of samples. The principle of matrix factorization is to decompose the original N*M matrix into an N*C amplitude matrix and a C*M pattern matrix, where C is the number of patterns.

This transformation of the N*M matrix into the product of an N*C matrix and a C*M matrix is the general, if simplified, form of matrix factorization. Each column of the amplitude matrix represents a unique molecular pattern and gives the contribution of each gene to that pattern. Each row of the pattern matrix gives the contribution of that pattern to each sample, so the patterns are assigned to individual samples in the pattern matrix. Once we cluster the samples in the pattern matrix by the similarity of their molecular patterns, the unsupervised clustering is done.
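The shapes involved can be made concrete with a small sketch. It assumes NumPy and uses a rank-C truncated SVD as just one possible way to obtain an N*C and a C*M factor. The matrix sizes, the simulated data and the variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: N genes, M samples, C patterns.
N, M, C = 200, 12, 3

# Simulate an expression matrix that really is a product of C patterns plus a little noise.
A_true = rng.gamma(shape=2.0, scale=1.0, size=(N, C))  # gene contributions
P_true = rng.gamma(shape=2.0, scale=1.0, size=(C, M))  # sample contributions
X = A_true @ P_true + rng.normal(scale=0.01, size=(N, M))

# One of many possible factorizations: keep the top C singular triplets.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
A = U[:, :C] * s[:C]  # amplitude matrix, N x C
P = Vt[:C, :]         # pattern matrix,  C x M

# The product of the two small matrices approximates the original one.
X_hat = A @ P
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(A.shape, P.shape, f"relative reconstruction error = {rel_err:.4f}")
```

The factors recovered this way are not unique: any invertible C*C matrix can be inserted between A and P without changing their product, which is why each method adds constraints of its own.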

Matrix factorization is a broad concept with a variety of forms. Widely used ones include PCA, NMF and ICA. They impose different constraints and therefore yield different results.

PCA

In brief, PCA finds the major sources of variance in the data. It transforms the initial data into a new set of variables (the columns of the amplitude matrix) under two constraints: 1) each new variable describes the data along the direction (eigenvector) of largest remaining variance; 2) the directions are orthogonal to one another. In other words, PCA is an orthogonal transformation in which each successive component lies along the eigenvector carrying the largest remaining variance. The good things about PCA are:

All patterns are quantitatively ranked. This allows us to compare the tendency of each sample toward a specific biological process.
The orthogonal transformation ensures that the factors are linearly uncorrelated. When the data are multivariate Gaussian, as is often assumed for log-transformed RNA-Seq data, uncorrelatedness implies independence as well. However, a single component may still contain more than one dependent biological process. Applying PCA to properly prepared RNA-Seq data lets us find the major biological differences captured in the top principal components, but multiple biological pathways may be enriched in a single principal component.

In practice, PCA mainly serves to reduce the dimension of the data by capturing the major variance. It ensures uncorrelatedness, but not independence, between factors.
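The properties listed above (ranked components, orthogonal directions, uncorrelated factors) can be checked on toy data. The sketch below assumes NumPy and computes PCA through an SVD of the gene-centered matrix. The simulated data and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: 500 genes x 20 samples of Gaussian noise,
# with 50 genes shifted upward in the first 10 samples.
N, M = 500, 20
X = rng.normal(size=(N, M))
X[:50, :10] += 3.0

# Center each gene across samples, then take the SVD.
Xc = X - X.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

loadings = U              # columns: orthonormal gene-space directions
scores = s[:, None] * Vt  # rows: coordinates of every sample on each component
explained = s**2 / np.sum(s**2)  # fraction of variance per component, descending

# Orthogonality of the directions and uncorrelatedness of the scores.
gram_loadings = loadings.T @ loadings  # close to the identity matrix
gram_scores = scores @ scores.T        # close to a diagonal matrix

print("variance explained by the top 3 components:", np.round(explained[:3], 3))
```

Because the singular values come back in descending order, the components are ranked by the variance they explain, which is the ranking the first bullet refers to.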

ICA

Whereas PCA finds linearly uncorrelated factors, ICA finds statistically independent factors. The results of ICA may therefore align better with annotated biological pathways. However, ICA does not rank its factors by weight.
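To illustrate what finding independent factors means, here is a minimal sketch of a symmetric FastICA-style iteration with a tanh nonlinearity, written with NumPy only. It is a toy under stated assumptions, not a reference implementation: the two non-Gaussian sources, the mixing matrix and the helper names `sym_decorrelate` and `fast_ica` are all hypothetical.

```python
import numpy as np

def sym_decorrelate(W):
    """Return (W W^T)^(-1/2) W, which makes the rows of W orthonormal."""
    vals, vecs = np.linalg.eigh(W @ W.T)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T @ W

def fast_ica(X, n_iter=200, tol=1e-6, seed=0):
    """Toy symmetric FastICA. X has shape (signals, observations)."""
    rng = np.random.default_rng(seed)
    n_sig, n_obs = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    # Whiten: after this step the signals are uncorrelated with unit variance.
    vals, vecs = np.linalg.eigh(Xc @ Xc.T / n_obs)
    Z = np.diag(1.0 / np.sqrt(vals)) @ vecs.T @ Xc
    W = sym_decorrelate(rng.normal(size=(n_sig, n_sig)))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        W_new = G @ Z.T / n_obs - np.diag((1.0 - G**2).mean(axis=1)) @ W
        W_new = sym_decorrelate(W_new)
        # Converged when every row of W stops rotating (up to sign).
        converged = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0)) < tol
        W = W_new
        if converged:
            break
    return W @ Z

# Two hypothetical independent, non-Gaussian sources, linearly mixed.
rng = np.random.default_rng(2)
n_obs = 5000
S_true = np.vstack([rng.uniform(-1, 1, n_obs), rng.laplace(size=n_obs)])
mixing = np.array([[1.0, 0.5], [0.4, 1.0]])
X = mixing @ S_true

S_est = fast_ica(X)

# Absolute correlation between each true source (rows) and each estimate (columns).
corr = np.abs(np.corrcoef(np.vstack([S_true, S_est]))[:2, 2:])
print(np.round(corr, 2))
```

The recovered components come back in arbitrary order and with arbitrary sign, which is one way to see why ICA offers no natural ranking of its factors.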

NMF

The major constraint of NMF is non-negativity. This constraint makes it suitable for biological matrices such as RNA-Seq and bisulfite sequencing data, in which negative values do not occur. The drawback is that non-negativity allows NMF to find only the over-expressed genes in a specific pathway, not the under-expressed ones. In addition, NMF is preferred in mutational signature analysis because it tends to produce sparse solutions. The underlying biological rationale is that most mutagens are highly specific in the type of damage they cause.
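As a rough illustration of the non-negativity constraint, the sketch below fits NMF with the classic multiplicative update rules for the Frobenius-norm objective, using NumPy only. The function name `nmf`, the simulated matrix and the final "dominant pattern" assignment are hypothetical choices for this example. Dedicated NMF packages add better initialization, stopping rules and rank selection.

```python
import numpy as np

def nmf(X, n_patterns, n_iter=500, eps=1e-9, seed=0):
    """Toy NMF via multiplicative updates; returns W (N x C), H (C x M), error trace."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.uniform(0.1, 1.0, size=(n, n_patterns))
    H = rng.uniform(0.1, 1.0, size=(n_patterns, m))
    errors = []
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H never go negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(X - W @ H) / np.linalg.norm(X))
    return W, H, errors

# Hypothetical non-negative matrix: 100 genes x 20 samples built from 3 patterns.
rng = np.random.default_rng(3)
N, M, C = 100, 20, 3
X = rng.gamma(2.0, 1.0, size=(N, C)) @ rng.gamma(2.0, 1.0, size=(C, M))

W, H, errors = nmf(X, n_patterns=C)

# Rescale so each column of W sums to 1; the product W @ H is unchanged.
scale = W.sum(axis=0)
W_norm = W / scale
H_norm = H * scale[:, None]

# A crude clustering: label each sample by its dominant pattern.
labels = H_norm.argmax(axis=0)
print(f"relative error after {len(errors)} iterations: {errors[-1]:.4f}")
print("dominant pattern per sample:", labels)
```

Because every entry of W and H stays non-negative, patterns can only add signal, never subtract it, which is the reason NMF picks up over-expressed genes but not under-expressed ones.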

Limitations in major usage

Accurate matrix factorization depends on each pattern being relatively unique and on the exclusion of confounders. However, tumor impurity and clonality may introduce non-trivial bias.

One major use is uncovering the molecular characteristics of tumors, which can in turn guide the molecular subtyping of a cancer type. If we assume that the major differences between tumor subtypes are reflected in a number of biological pathways, and that non-tumor cells have rather similar molecular characteristics across samples, matrix factorization may yield relatively accurate results.
