# Probability distribution in NGS data analysis

In previous POST, we discussed some basics of probability distribution and corresponding hypothesis testing. Here let's continue this topic and talk about the usage of probability distribution in NGS data analysis. More specifically, why do we choose what we choose to represent NGS data? What is the logics in our choice? We will take two examples commonly seen in NGS data analysis to make the point.

Let's take somatic variant calling as first example. We know that multiple steps in NGS protocol can generate artifacts, such as sequencing error. The goal of variant calling is to distinguish real somatic variants from these artifacts. Assuming the rate of sequencing error is known, we can then make null hypothesis regarding the distribution of such sequencing error. those mismatches not following the assumed distribution will be assigned as somatic variants. The question is: which probability distribution should we choose to best represent sequencing error? Considering that we are modeling independent sampling process (random sequencing error), Binomial distribution is probably the first one comes to our mind. Then we realize that, in a mature sequencing platform, the rate of sequencing error is pretty low (~0.1%). In addition, the sequencing depth of tumor sample is usually relatively high. To model distribution with such characteristics, Poisson distribution is probably a better choice. In fact, some of the current somatic variant callers take advantage of Poisson model where sigma is used to represent average rate of sequencing error. Given that the rate of sequencing error is different across chromosome, we can move one step further by using basecall quality to represent error rate and building a site specific error model. Now someone may realize that sequencing error is not the only source of artifacts, among which there are at least PCR error and sequencing misalignment. Indeed, Some modern methods apply Bayesian approach to achieve better accurate by integrating more error sources (such as mappability or sequencing context) into prior probability. We will leave this in a separate discussion in the future.

Another example we would like to make is differential expression analysis in RNA-Seq. We try to discover aberrantly expressed genes between group of sample with different phenotypical condition. However, how different is different? Now we need set up distribution of gene expression between samples with same phenotypical condition as null hypothesis. We can think the total reads as a pool and we are randomly sampling reads that comes from particular gene. This is again an Binomial process. Similarly, we have very large n (total reads) and very small p (fraction of reads from this particular gene). Now we have Poisson distribution. We should also consider that the variation of gene expression is influenced by technical variantion during experiment as well as biological variantion between individual. It is likely that we are dealing with a dispersed Poisson distribution where variance > mean. We can confirm this thought through the figure below based on gene expression of scRNA-Seq experiment. Therefore a dispersed Poisson, aka negative bionomial distribution may be a better chioic in this case. We briefly described the thinking process of choosing probability distribution from biological / biotechnology point of view. By making appropriate statistical assumption, we can make more accurate estimation given limited data access.