# Variant calling approaches: Frequentist vs Bayesian

April 10, 2018

Frequentist approach: assume a sequencing error distribution.

The frequentist method takes as its null hypothesis that a non-reference allele is caused by sequencing error, and that sequencing errors follow either a binomial or a Poisson distribution. We then plug the error rate into the chosen distribution to obtain the probability of the observed non-reference reads under the null hypothesis.
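For the binomial case, the test can be written as a one-sided tail probability. The symbols are chosen here for illustration: n is the read depth at the site, k the number of non-reference reads, and ε the assumed per-base error rate. The Poisson version would replace the binomial term with a Poisson term of mean nε.

$$
p = P(X \ge k) = \sum_{i=k}^{n} \binom{n}{i}\, \varepsilon^{i} (1 - \varepsilon)^{n - i}
$$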

Finally, we call the variant based on the p-value. This method assumes the same sequencing error rate at every chromosome position.
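The test described above could be sketched as follows for the binomial case. The function name, the example depth and read counts, the 1% error rate and the 0.05 cutoff are all assumptions made for illustration, not values from any particular variant caller.

```python
from math import comb


def binomial_pvalue(n_reads, n_alt, error_rate):
    """One-sided p-value P(X >= n_alt) for X ~ Binomial(n_reads, error_rate).

    Null hypothesis: every non-reference read is a sequencing error, and
    the error rate is the same at every position.
    """
    return sum(
        comb(n_reads, i) * error_rate**i * (1 - error_rate) ** (n_reads - i)
        for i in range(n_alt, n_reads + 1)
    )


# Hypothetical site: 30 reads, 5 of them non-reference, assumed 1% error rate.
p_value = binomial_pvalue(n_reads=30, n_alt=5, error_rate=0.01)
# Assumed significance cutoff; a real pipeline would also correct for
# testing many sites at once.
is_variant = p_value < 0.05
```

A small p-value means that the observed number of non-reference reads is unlikely to arise from sequencing error alone, so the site is called as a variant.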

A more advanced method models the sequencing error from the base call quality score (Phred score) of each base and fits the per-base error rates to a Poisson binomial distribution.

Such a method tries to make better variant calls by assigning a different error rate to each base, reflecting the actual sequencing chemistry.
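A minimal sketch of the quality-aware test, assuming each read is an independent trial whose probability of showing a non-reference base by error is given by its Phred score. The function names are hypothetical. As a simplification, the full error probability is used for each read rather than the probability of one specific wrong base.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability: 10^(-Q/10)."""
    return 10 ** (-q / 10)


def poisson_binomial_pvalue(error_probs, n_alt):
    """P(at least n_alt errors) when read i errs with probability error_probs[i].

    Builds the Poisson binomial distribution by dynamic programming:
    dist[k] is the probability of exactly k errors among the reads seen so far.
    """
    dist = [1.0]
    for e in error_probs:
        new_dist = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            new_dist[k] += prob * (1 - e)   # this read is correct
            new_dist[k + 1] += prob * e     # this read is an error
        dist = new_dist
    return sum(dist[n_alt:])


# Hypothetical site covered by three reads with quality scores 20, 10 and 50,
# one of which shows a non-reference base.
errors = [phred_to_error(q) for q in (20, 10, 50)]
p_value = poisson_binomial_pvalue(errors, n_alt=1)
```

When all reads share the same error rate, this reduces to the ordinary binomial test.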

Bayesian approach: take advantage of data from previous studies.

In general, the Bayesian approach does not rely on an assumed error distribution alone. It makes its inference with the help of previous data. According to Bayes' theorem:

$$
P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}
$$

where G is the genotype and D is the observed data.

Breaking down the formula:

• We can ignore P(D) because it is the same for all possible genotypes

• P(G) is the prior probability of the genotype in the population, based on previous large-scale studies (for example 1000 Genomes, dbSNP or ExAC)

• P(D|G) is the conditional probability of the data given the genotype

Once P(D) is ignored, we can see that P(D|G) is essentially what the frequentist approach computes. The only difference between the two approaches is whether a prior probability is taken into consideration.

Now, let's consider the following alignment with base call quality scores:

ACACGCTAGCTAGCT

TAGCT                                Qscore = 20

CTAACT                                 Qscore = 10

GCTAGC                                   Qscore = 50

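Based on the Qscores (error probability = 10^(-Q/10)), the corresponding error probabilities are 0.01, 0.1 and 0.00001. To compute P(D|G), the genotype G is first decomposed into its two haplotypes H1H2, since the sample is diploid, and the probability is then computed for each aligned base.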
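One common way to write this decomposition is given here as a sketch. It assumes that reads are independent, that each read is equally likely to come from either haplotype, and that an error is equally likely to produce any of the three other bases. The symbols b_i (base observed in read i) and e_i (its error probability) are introduced here for illustration.

$$
P(D \mid G) = \prod_{i=1}^{n} \left[ \tfrac{1}{2} P(b_i \mid H_1) + \tfrac{1}{2} P(b_i \mid H_2) \right],
\qquad
P(b_i \mid H) =
\begin{cases}
1 - e_i & \text{if } b_i = H \\
e_i / 3 & \text{if } b_i \neq H
\end{cases}
$$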

The probability of the genotype being AG given the observed data is proportional to the conditional probability P(D|AG) weighted by the prior probability P(AG) from previous studies: P(AG|D) ∝ P(D|AG) P(AG).

With the same approach, we can compute P(AA|D) and P(GG|D), and finally pick the most likely genotype.
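The whole calculation can be sketched in a few lines. This is an illustrative model, not the implementation of any specific caller. It assumes independent reads, an equal chance of sampling either haplotype, and errors spread evenly over the three other bases. The pileup (G at Q20, A at Q10, G at Q50) is read off the alignment above at the site where one read shows A instead of G. The prior values are made up for the example and do not come from 1000 Genomes, dbSNP or ExAC.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability: 10^(-Q/10)."""
    return 10 ** (-q / 10)


def base_given_allele(base, allele, error):
    """P(observed base | true allele), errors split evenly over 3 other bases."""
    return 1 - error if base == allele else error / 3


def genotype_likelihood(pileup, genotype):
    """P(D | G) for a diploid genotype written as two letters, e.g. "AG"."""
    h1, h2 = genotype
    likelihood = 1.0
    for base, q in pileup:
        e = phred_to_error(q)
        # Each read comes from haplotype H1 or H2 with probability 1/2.
        likelihood *= 0.5 * base_given_allele(base, h1, e) \
            + 0.5 * base_given_allele(base, h2, e)
    return likelihood


def genotype_posteriors(pileup, priors):
    """P(G | D) for every genotype in priors, normalized so they sum to 1."""
    unnormalized = {g: genotype_likelihood(pileup, g) * p
                    for g, p in priors.items()}
    total = sum(unnormalized.values())  # plays the role of P(D)
    return {g: v / total for g, v in unnormalized.items()}


# (base, Qscore) observed at the site in each of the three reads.
pileup = [("G", 20), ("A", 10), ("G", 50)]
# Hypothetical priors, strongly favoring the reference homozygote.
priors = {"GG": 0.90, "AG": 0.09, "AA": 0.01}

posteriors = genotype_posteriors(pileup, priors)
best_genotype = max(posteriors, key=posteriors.get)
```

With only three reads, the outcome is sensitive to the prior: under these assumptions a flat prior favors AG, while the reference-weighted prior above favors GG, because the only A is supported by a low-quality base.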

We can see from this Bayesian procedure that both base call quality and sequencing depth affect variant calling.