When comes to a general bioinformatics task, there are always a bunch of tools on which we have to evaluate based on their underlying algorithm. Moreover, when we are about to build a model on our own, we again have to pick from a variety of ML model. How do we know which models may fit better to specific biological context? Can we derive a role of thumb for this model selection?
Let's start with a general comparison: Most of ML model can be broadly classified as either generative model or discriminative model. We can characterize them based on the following attributes:
Generative model learns the model that generates the observed data. It goes back to learn 'everything' about the distribution that most likely to produce what we have observed.
Discriminative model directly discriminative the observed data regardless of the underlying generation process. The model learns the boundary instead of distribution.
Generative model learns joint probability: P(X,Y) which can be converted in Bayesian fashion: P(X|Y)=P(Y|X)P(X).
Discriminative model learns conditional probability: P(Y|X=x).
Parametric / Nonparametric:
Generative model is mostly parametric since it models the underlying distribution.
Discriminative model is mostly nonparametric. Note that, by definition, model with unfixed amount of parameters is defined as nonparametric.
Generative / nonparametric: GMM which learns Gaussian distribution and have unfixed amount of parameters (latent parameters increases depending on the sample size)
Generative / parametric: various Bayes based model
Discriminative / parametric: GLM, LDA and logistic regression
Discriminative / nonparametric: KNN, SVM, K-Mean and CART
Resistance to outlier:
Generative model shows strong resistance to outliers thanks to the soft assignment by probability distribution
Discriminative model is relatively weak resisting to outliers due to its hard classification boundary.
Training data amount:
Generative model has high bias (underfitting) due to the generalization by assumed distribution. This is why regularizations are often required for example in linear regression
Discriminative model has low bias
Generative model has low variance (as we know, bias and variance are trade-off)
Discriminative model tends to have high variance (overfitting). That is why bagging / boosting are used to average out the variance.
The consensus states that discriminative classifiers, as the more 'direct' method, almost always outperforms generative one. However, the best choice eventually depends on which assumption our data fits better. Let's make two examples
An example of Generative vs Discriminative
K-Means and Gaussian mixture model (GMM) are exemplified here. Both methods are used in unsupervised clustering. Since the latent variables (membership of each data point) are unknown and direct MLE is impossible, they applie EM strategy in a very similar fashion to approach the optimal solution.
K-Means: As a discriminative & nonparametric method, K-Means iteratively assigns data membership based on WCSS (within-cluster sum of squares) update centroid to optimal fitting position.
GMM: As generative & nonparametric model, Gaussian mixture iteratively assigns the data membership based on whatever used for loss function and update distribution to optimal fitting position.
In terms of constrains, Gaussian mixture of course requires the data to be normally distributed. K-Means on the other hand has a number of picky constrains.
Variance of distribution is spherical
Actually K-Means requires the data to be normally distributed as well. The figure below shows a funny clusters produced by K-Means out of non-normally distributed data.
Equal size and same scalability
Besides, K-Means requires the data to have equal size and same scalability. Otherwise, the boundary tends to bias to the side with larger amount of data / larger scale just like the middle figure below.
In real world, those assumptions required by K-Means are rarely perfectly met. On the other hadn, GMM appears to be more robust.
However, Both K-Means and GMM have apparent drawback: Both methods apply EM which is heuristic and may yield to local optimum solution. Let's further consider the applicability of deterministic and heuristic method.
An example of Deterministic vs Heuristic
We compare SciClone's heuristic method: Variational Bayes Mixture Model (VB) and PyClone's deterministic method: Markov Chain Monte Carlo (MCMC). VB as a generative & nonparametric & heuristic method applies uses EM in a Bayesian fashion. It, like GMM described earlier, iteratively optimizes posterior (P(A|B)) distribution based on latent variables
Where P(A) is prior and P(B) is observations
Beta and Direchlet distribution as prior of VB
Beta or Direchlet are usually used as prior distribution for the following reasons:
Conjugate prior: For unsupervised clustering, the likelihood is usually represented by binary distribution (binomial, bernoulli or negative binomial) or multiclass model (multinomial or categorical). The prior must belong to the same distribution family. Hence: Beta or Direchlet.
0-1: Ranging from 0 to 1, best representing the probability of probability
Convenient update: Update posterior conveniently when new evidence comes in: Beta(α+a, β+b). The Bayes theorem therefore becomes
While VB generatively & heuristically models the distribution that generates the observable data, PyClone applies MCMC which directly samples (non-generatively) from exact distribution (deterministic). Therefore, PyClone guarantees the global optimal solution.
PyClone's MCMC is computationally intensive. Therefore, VB is often used as surrogate when dataset is large and MCMC is computationally impractical. For clonal evolution inference however, sample size is absolutely bearable by MCMC.
I listed these two examples to illustrate the fact that different method has its own pros and cons when fitting to specific biological context. Given the huge amount of methods available, it is hell of the burden to hit and trial over different statistical methods. By getting a better understanding of general characteristics of broad model class, we can narrow our choice down and then test our data for individual classifier. After all, which model works better depends on the exact constrains our data meet.