MCMC: Monte Carlo sampling and Markov Chain

In PREVIOUS ARTICLE, we mentioned that Bayesian approach is being used for somatic variant detection. We will pick up this topic from here and spend two articles on building a site specific error model using MCMC for variant detection. Before walking into actual application, let's briefly talk about two major components of MCMC: Monte Carlo sampling and Markov Chain.

Monte Carlo sampling

Monte Carlo method is a sampling technique used for statistical estimation when direct calculation is improbable. A simple example is to estimate the area of irregular 'cloud' as shown in Figure below. Due to the irregularity of the cloud, direct calculation of area is impractical. We can instead define a measured square covering the cloud and randomly assign data point within the measured square. After thousands of assignments, the area of shape can be estimated by the fraction of data point within cloud.

We can see that the core of Monte Carlo sample is the random sampling process. This process is usually based on either known probability distribution or acceptance-rejection role.

Sampling based on probability distribution

The simplest case would be sampling with equal probability, aka, through uniform distribution (0,1), just like the area of 'cloud' estimation above. Of course, we can sample through other probability distribution such as Gaussian distribution or Gamma distribution. The choice entirely depends on our understanding of the event we aim to estimate. However, sometimes we may have to estimate things we know nothing about, or things without a typical distribution.

Sampling based on acceptance-rejection rule

Acceptance-rejection role assumes a typical distribution and choose to accept or reject the current sampling based on pre-defined role.

More specifically, a typical distribution f(x) and a constant M are defined so that the distribution of interest h(x) is under Mf(x) is shown in Figure below. A value is randomly sampled under uniform distribution (0, Mf(v)). We then choose to accept this iteration of sampling if u < h(v) and vice versa. Repeat this process thousands of iterations until the distribution of interest is approximated.

However, for data with high dimensions, it is hard to find f(x) and constant M. Here comes in Markov Chain.

Markov Chain

Markov Chain defines event with multiple states and the probability of the event in specific state only depends on its previous state

A simple example of Markov Chain

Say we are conducting an clinical experiment on the efficacy of anti-tumor drug. Mouse administrated with agent can have three states: S = {D, M, I}, where D, M, I represents shrunk, stable and enlarged tumor volume respectively. The transition probability of tumor state after two weeks administration is:

If tumor shrunk after administration, the probability of tumor state being D, M and I would be 0.9, 0.05 and0.05 respectively.

Assuming the initial probability S0 = {D = 0.3, M = 0.6, I = 0.1}, we observe S every two weeks after administration:

S1 = S0 * P = {0.36, 0.515, 0.125}

S2 = S1 * P = {0.40125, 0.455, 0.14375}

S3 = S2 * P = {0.4293, 0.4128, 0.1578}

... ...

Sn-1 = Sn-1 * P = {0.48, 0.32, 0.2}

Sn = Sn * P = {0.48, 0.32, 0.2}

After sufficient iteration, the state probability becomes fixed. In addition, the state probability remains unchanged even independently from initial state probability setup. In another wprd, Markov model eventually converges to stable equilibrium distribution regardless of initial random assignment of state probability.

We walked through the basics of Monte Carlo sampling and Markov Chain in his article. In next article, we will talk about MCMC and its application in variant detection.