Identifying somatic mutations has always been challenging.
The major challenge in calling somatic mutations is the heterogeneity of tumor tissue. Germline variants usually have an allele frequency (AF) close to either 50% or 100%, with small deviations due to wet-lab artifacts. Tumor heterogeneity makes somatic AF much more complicated:
clonality: tumor tissue usually consists of multiple subclones, each carrying a different set of mutations.
purity: tumor tissue is usually mixed with normal tissue.
ploidy: tumor tissue usually carries extensive copy number variation (CNV).
With all these complications, the AF of a somatic mutation can fall at almost any level. The figure below shows the somatic mutation density distribution across different samples.
The left figure shows that the somatic AF distribution varies significantly between samples, with essentially no common pattern. The right figure shows that, although most somatic variants can be roughly grouped into four frequency intervals depending on clonal/subclonal status and zygosity, each interval spans more than 20% because of differences in purity and ploidy. In general, samples with more subclones, such as LPJ114 and LPJ128, tend to have higher variance in mutation AF.
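To make the effect of purity and ploidy concrete, here is a minimal sketch of the usual tumor/normal mixture model for the expected AF of a clonal variant. The function name `expected_af` and its parameters are our own illustration, not part of any tool, and the model assumes a diploid normal genome and a variant present in every tumor cell.

```python
def expected_af(purity, tumor_cn, mutant_copies, germline=False, normal_cn=2):
    """Expected allele frequency of a clonal variant in a tumor/normal mixture.

    purity:        fraction of tumor cells in the specimen (0..1)
    tumor_cn:      total copy number of the locus in tumor cells
    mutant_copies: copies carrying the variant allele in tumor cells
    germline:      True for a heterozygous germline variant, which is also
                   present on one copy in the normal cells
    normal_cn:     copy number in normal cells (assumed diploid)
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    if not 0 <= mutant_copies <= tumor_cn:
        raise ValueError("mutant_copies must be between 0 and tumor_cn")
    # Average number of copies of the locus per cell in the mixture.
    total_alleles = purity * tumor_cn + (1.0 - purity) * normal_cn
    if total_alleles == 0:
        raise ValueError("locus has no copies in the mixture")
    # Average number of variant-carrying copies per cell.
    variant_alleles = purity * mutant_copies
    if germline:
        variant_alleles += (1.0 - purity) * 1
    return variant_alleles / total_alleles


if __name__ == "__main__":
    for purity in (0.3, 0.6, 0.9):
        somatic = expected_af(purity, tumor_cn=2, mutant_copies=1)
        het = expected_af(purity, tumor_cn=2, mutant_copies=1, germline=True)
        print(f"purity={purity:.1f}  somatic AF={somatic:.3f}  germline het AF={het:.3f}")
```

Under this model, a heterozygous somatic variant at a diploid locus sits at purity/2, while a heterozygous germline variant stays at 0.5 whatever the purity. Once the tumor copy number departs from 2, both values shift, which is why the AF intervals are so wide.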
Germline mutation noise
The figure below shows the number of somatic mutations per Mb across several cancer types. The range is roughly 0.1-15/Mb, which corresponds to about 3-450 mutations on WES, assuming an exome target of about 30 Mb.
On the other hand, germline variants occur at about 1 per 1,000 bp across the genome on average. That is 1/1000 * 30 Mb (million bases) = 30,000 on WES, roughly two to four orders of magnitude more than the somatic mutations.
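The back-of-the-envelope comparison can be written out as a small calculation. This is only a sketch: the function name is ours, and the 30 Mb target size used in the example is an assumed round figure for a whole-exome capture, so plug in the size of your own panel.

```python
def expected_variant_counts(target_mb, somatic_per_mb=(0.1, 15.0),
                            germline_per_bp=1 / 1000):
    """Return (somatic_low, somatic_high, germline) expected variant counts.

    target_mb:       size of the sequenced target in megabases (assumption)
    somatic_per_mb:  low and high somatic mutation rates per Mb
    germline_per_bp: average germline variant rate per base pair
    """
    target_bp = target_mb * 1_000_000
    somatic_low = somatic_per_mb[0] * target_mb
    somatic_high = somatic_per_mb[1] * target_mb
    germline = germline_per_bp * target_bp
    return somatic_low, somatic_high, germline


if __name__ == "__main__":
    low, high, germline = expected_variant_counts(target_mb=30)  # assumed 30 Mb exome
    print(f"somatic: {low:.0f} to {high:.0f}, germline: {germline:.0f}")
    print(f"germline/somatic ratio: {germline / high:.0f}x to {germline / low:.0f}x")
```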
Both tumor heterogeneity and the large amount of germline noise make a clean separation of somatic from germline mutations almost impossible without a matched normal control. Although we can use COSMIC and germline reference sets (1000 Genomes or ExAC) to filter out some of the germline variants, the filtered dataset will still contain far more germline than somatic mutations.
Tumor-only variant caller: SGZ
Recently Foundation Medicine released a tumor-only variant caller: SGZ. It leverages the allele frequencies of variants of interest and a statistical model of genome-wide copy number and tumor/normal admixture to characterize the mutational state of the variants. It relies on the following conditions:
Sequencing depth is sufficient.
An accurate copy number model is available.
The tumor specimen is sufficiently admixed with the surrounding normal tissue (<90% tumor content) to establish a normal baseline.
The scheme below shows the workflow of SGZ. At its core, it implements Markov chain Monte Carlo (MCMC) to categorize the mutational state based on the mutation allele frequency and the CNV provided.
In general, MCMC assumes a probability distribution for the quantity of interest and then draws samples through a Markov process whose transition probabilities are chosen to be consistent with that distribution. Once the chain converges, the samples approximate the distribution, and the parameters can be estimated from them.
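As a toy illustration of the general idea (not of SGZ itself), the sketch below uses a random-walk Metropolis sampler, one of the simplest MCMC schemes, to estimate a variant's AF from its read counts. The binomial likelihood, the flat prior, and all names and default values here are our own assumptions for the example.

```python
import math
import random


def log_posterior(af, alt_reads, total_reads):
    """Log of (binomial likelihood x flat prior on (0, 1)), up to a constant."""
    if not 0.0 < af < 1.0:
        return -math.inf
    ref_reads = total_reads - alt_reads
    return alt_reads * math.log(af) + ref_reads * math.log(1.0 - af)


def metropolis_af(alt_reads, total_reads, n_iter=20000, burn_in=2000,
                  step=0.05, seed=0):
    """Return post-burn-in AF samples from a random-walk Metropolis chain."""
    rng = random.Random(seed)
    current = 0.5
    current_lp = log_posterior(current, alt_reads, total_reads)
    samples = []
    for i in range(n_iter):
        # Propose a small random move away from the current value.
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, alt_reads, total_reads)
        # Accept with probability min(1, posterior ratio).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        if i >= burn_in:
            samples.append(current)
    return samples


if __name__ == "__main__":
    samples = metropolis_af(alt_reads=30, total_reads=100)
    print(f"posterior mean AF ~ {sum(samples) / len(samples):.3f}")
```

With a flat prior the exact posterior is Beta(alt + 1, ref + 1), so the sample mean can be checked against (alt + 1) / (total + 2).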
Gibbs sampling (MCMC):
During CNV identification (CBS segmentation), the copy number and minor allele copy number of each segment are calculated, and tumor purity is then inferred.
Assume a multivariate function/distribution that relates the allele frequency and copy number of each segment to the global parameters: tumor purity and ploidy.
Set the initial parameters randomly and optimize them iteratively under a Markov process.
When optimizing one parameter, hold the other parameters fixed and derive the probability distribution of possible values for that parameter. Then randomly sample a value from that distribution (Gibbs sampling).
After thousands of iterations, discard the burn-in samples and calculate the mean value of each parameter. Taking the mean dilutes outlier values for each parameter and helps avoid settling on a local optimum (this applies to other sampling methods in general).
A grid-based method is used to find alternative solutions that also fit the model, by calculating the MSE between the measured and expected CNV.
Finally, a two-tailed binomial test, P(y|G; AF germline) / P(y|S; AF somatic), is used to determine the mutational state.
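The sampling steps above (hold the other parameters fixed, sample one, discard burn-in, take the mean) can be sketched on a toy target. The example below is not SGZ's copy number model: it runs Gibbs sampling on a standard bivariate normal with correlation `rho`, chosen only because its conditional distributions are known exactly. All names and defaults are ours.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_iter=20000, burn_in=2000, seed=0):
    """Gibbs-sample a standard bivariate normal with correlation rho.

    Returns (mean_x, mean_y, estimated_correlation) from post-burn-in draws.
    """
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)
    x, y = 5.0, -5.0  # deliberately poor starting point
    kept = []
    for i in range(n_iter):
        # Update x with y held fixed at its value from the previous iteration.
        x = rng.gauss(rho * y, cond_sd)
        # Update y with x held fixed at the value just drawn in this iteration.
        y = rng.gauss(rho * x, cond_sd)
        if i >= burn_in:  # exclude burn-in samples
            kept.append((x, y))
    n = len(kept)
    mean_x = sum(a for a, _ in kept) / n
    mean_y = sum(b for _, b in kept) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in kept) / n
    var_x = sum((a - mean_x) ** 2 for a, _ in kept) / n
    var_y = sum((b - mean_y) ** 2 for _, b in kept) / n
    return mean_x, mean_y, cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    mean_x, mean_y, corr = gibbs_bivariate_normal(rho=0.8)
    print(f"mean_x={mean_x:.3f}  mean_y={mean_y:.3f}  corr={corr:.3f}")
```

Even from a bad starting point the chain forgets its initial values during burn-in, and the averaged draws recover the target's mean and correlation.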
Figure legend: 1, inferred during this iteration; 2, currently being inferred; 3, inferred during the previous iteration.
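The final classification step can be sketched as follows. This is a minimal illustration under our own assumptions, not SGZ's actual code: the two-tailed p-value sums all outcomes no more likely than the observed one, and the decision rule (`alpha` threshold, "ambiguous" when both or neither hypothesis fits) is a hypothetical simplification. The expected AFs for the germline and somatic hypotheses are taken as given inputs.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials (0 < p < 1)."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)


def two_tailed_binom_p(k, n, p):
    """Two-tailed p-value: total probability of outcomes no more likely than k."""
    observed = binom_pmf(k, n, p)
    threshold = observed * (1.0 + 1e-7)  # small tolerance for floating-point ties
    total = sum(pmf for pmf in (binom_pmf(i, n, p) for i in range(n + 1))
                if pmf <= threshold)
    return min(1.0, total)


def classify(alt_reads, depth, af_germline, af_somatic, alpha=0.01):
    """Label a variant by which expected AF its read counts are compatible with."""
    germline_ok = two_tailed_binom_p(alt_reads, depth, af_germline) >= alpha
    somatic_ok = two_tailed_binom_p(alt_reads, depth, af_somatic) >= alpha
    if germline_ok and not somatic_ok:
        return "germline"
    if somatic_ok and not germline_ok:
        return "somatic"
    return "ambiguous"


if __name__ == "__main__":
    for alt, depth in ((150, 500), (250, 500), (8, 20)):
        print(alt, depth, classify(alt, depth, af_germline=0.5, af_somatic=0.3))
```

The low-depth case (8 of 20 reads) is compatible with both expected AFs and stays ambiguous, which illustrates why sufficient sequencing depth is one of the method's conditions.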
Unfortunately, this tool performs only the last step, mutational state categorization, so users have to obtain the variant AFs and the CNV model themselves before using it, which is inconvenient. We recently found an alternative tool, PureCN, which is probably worth a try.