Earlier I started a discussion about the duplicates issue in NGS. I have since had some further thoughts on the topic and would like to mention them here. They may be trivial, but they are still interesting to think about.
In the earlier article I mentioned the source of PCR duplicates and the reason we want to remove them before calling variants.
Paired-end reads are considered PCR duplicates if both mates align to exactly the same chromosome positions as those of another pair, as depicted in the purple box of the figure below.
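To make the definition concrete, here is a minimal sketch of how read pairs could be grouped by their alignment coordinates. The function name `group_duplicates` and the `(name, chrom, start, end)` tuple layout are my own assumptions for illustration; real duplicate-marking tools read BAM files and also take details such as strand orientation into account.

```python
from collections import defaultdict

def group_duplicates(pairs):
    """Group read pairs that share chromosome, fragment start and fragment end.

    `pairs` is an iterable of (name, chrom, start, end) tuples, where start and
    end are the outer alignment coordinates of the two mates. This layout is a
    hypothetical simplification, not the format of any real tool.
    """
    groups = defaultdict(list)
    for name, chrom, start, end in pairs:
        groups[(chrom, start, end)].append(name)
    # Keep only the positions occupied by more than one read pair.
    return {key: names for key, names in groups.items() if len(names) > 1}

# Made-up example pairs.
pairs = [
    ("pair1", "chr1", 1000, 1300),
    ("pair2", "chr1", 1000, 1300),  # same outer coordinates as pair1
    ("pair3", "chr1", 1000, 1310),  # same start, different end: not a duplicate
    ("pair4", "chr2", 1000, 1300),  # different chromosome: not a duplicate
]
print(group_duplicates(pairs))
```

Note that nothing in this key says which cell a fragment came from, which is exactly what the questions below are about.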
One day, I was looking at IGV and several questions came to mind:
For the paired-end reads in the purple box above, do these so-called "PCR duplicates" come from the same cell or from different cells?
Why does this make a difference? They should be treated as duplicates and removed only if they come from the same source, i.e. the same chromosome of the same cell. Otherwise, they shouldn't be considered duplicates. Statistically, we probably do create a fair number of fragments from different cells that were sonicated at the same chromosome positions at both ends.
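How many such coincidences to expect can be roughly estimated with a birthday-problem style calculation. The sketch below assumes, purely for illustration, that fragments are drawn independently and uniformly from a fixed number of possible (start, end) combinations and that PCR is ignored; real sonication is not uniform, so this only indicates the order of magnitude. The function name and all the numbers in the usage example are my own assumptions, not measurements.

```python
from math import comb

def expected_coincident_pairs(n_fragments, n_possible_fragments):
    """Expected number of fragment pairs with identical (start, end) coordinates.

    Assumes each fragment is drawn independently and uniformly from
    `n_possible_fragments` distinct (start, end) combinations.
    """
    # Each of the C(n, 2) pairs matches with probability 1 / n_possible_fragments.
    return comb(n_fragments, 2) / n_possible_fragments

# Illustrative, assumed numbers: about 3e9 start positions times 200 plausible
# fragment lengths, and 4e8 sequenced fragments.
possible = 3_000_000_000 * 200
print(expected_coincident_pairs(400_000_000, possible))
```

Because the number of pairs grows quadratically with the number of fragments, the expected count of coincidental "duplicates" rises quickly with sequencing depth, especially when the reads are concentrated on a small target region.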
There is no way to tell whether these duplicates come from different cells. By removing PCR duplicates, we actually remove all paired-end reads that share the same sonication sites, regardless of their cells of origin. Will this introduce bias into the allele frequency calculation? The answer is probably no, because we can still guarantee that each unique fragment from each cell is counted only once.
For the paired-end reads in the red box above, do these adjacent reads come from the same cell or from different cells?
The human genome is diploid. It is possible, although highly unlikely, that fragments from both chromosomes of the same cell get selected and appear in the final FASTQ file. However, if more than two reads align to overlapping positions, as depicted in the red box above, they must come from multiple cells, because a diploid cell has only two copies of each chromosome.
Considering that millions of cells are processed in an NGS library, the probability that two overlapping reads come from the same cell is extremely low. Most likely, most of the overlapping aligned reads, if not all, come from different cells.
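This intuition can be checked with a small calculation. The sketch below assumes that each overlapping read originates from a cell chosen independently and uniformly at random, which is a simplification I am making for illustration (it ignores PCR copies and any bias in library preparation). The function name is hypothetical.

```python
def prob_all_from_different_cells(n_reads, n_cells):
    """Probability that n_reads overlapping reads all come from distinct cells.

    Assumes every read comes from one of n_cells cells, chosen independently
    and uniformly at random.
    """
    p = 1.0
    for i in range(n_reads):
        # The (i + 1)-th read must avoid the i cells already used.
        p *= (n_cells - i) / n_cells
    return p

# Example: 30 overlapping reads drawn from an assumed one million cells.
print(prob_all_from_different_cells(30, 1_000_000))
```

Under these assumptions the probability stays very close to 1 as long as the number of overlapping reads is tiny compared with the number of cells, which matches the conclusion above.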