MCMC II: Applying MCMC in somatic variant calling
In PREVIOUS ARTICLE, we introduced theoretical basis of MCMC, its components and how it approximates the posterior distribution of interest via sampling in high dimension data. Here let's put MCMC into actual application in genomic research.
Next Generation Sequencing (NGS) is a relatively complicated process. Multiple experimental steps can introduce technical noise, such as during sequencing. These noise affects the accuracy of cancer mutation detection. Traditional method assumes equal error rate across different genomic loci. However, numerous factors, such as sequence context, yield distinct error rate across different genomic loci. To build a more precise variant caller, it is reasonable to take those factors into consideration and put them in a Bayesian structure model, for instance MCMC.
How do we choose Bayesian model
With advance of NGS technique, mature sequencing platform now demonstrates relatively low sequencing error (~0.1% on illumina sequencer). On the other hand, sequencing depth of a tumor sample is relatively high (mostly > 500). Therefore, we can use Poisson distribution to model sequencing error rate. average error rate is denoted as λ. For each genomic locus p, mutation depth y follows specific error rate:
λ ~ d*µ;
where d is total sequencing depth and µ is variant depth. Given the sequencing depth and variant depth, we have likelihood function:
y ~ Possion(λ)
We choose a Gamma distribution, a conjugate distribution to Poisson, as prior;
y ~ Gamma (α, β)
This prior represents certain specific property of each genomic locus, such as sequence context and mappability. Estimated Gamma distribution will be different across genomic loci. We will use this site specific prior distribution to adjust likelihood function and derive posterior distribution (Gamma-Poisson distribution, alternatively called negative binomial distribution).
Approximating prior distirbution using Gibbs sampler
We can image that these site specific prior distribution is too complicated to be derived in a closed form solution. We therefore utilize MCMC to approximate the prior distribution via sampling:
In practice, we can use white blood cells (WBC) or cfDNA of healthy individual for training distribution of technical noise. With this Bayesian model, we can use approximated posterior distribution for hypothesis test on each variant candidate. Those variants with allele frequency significantly away from our estimated posterior distribution are likely to be true somatic variant.
These modeling strategy is especially suitable for somatic variant detection in cfDNA since those variants are of low allele frequency and suitable control sample is usually unavailable.