The figure below summarizes the sources of several duplicate types:
Here we will focus on the most common type, PCR duplicates, and discuss how to treat them in several types of sequencing data.
Where do PCR duplicates come from, and why do we care?
They are introduced when multiple PCR-amplified copies of the same original molecule land on different spots on the flowcell.
We care about them because they distort the allele frequency of a variant, which makes variant calling biased and unreliable.
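As a small worked illustration of that bias, the sketch below compares the allele fraction at one site with and without duplicates. All read counts are made up for the example; they do not come from any real sample.

```python
def allele_fraction(alt_reads: int, ref_reads: int) -> float:
    """Fraction of reads at a site that support the alternate allele."""
    total = alt_reads + ref_reads
    if total == 0:
        raise ValueError("no reads at this site")
    return alt_reads / total


# Hypothetical heterozygous site: 10 unique fragments carry the alt allele
# and 10 carry the ref allele.
unique_alt, unique_ref = 10, 10

# Assumption for illustration: one alt fragment was over-amplified and
# contributes 8 extra duplicate reads.
extra_alt_duplicates = 8

vaf_unique_only = allele_fraction(unique_alt, unique_ref)
vaf_with_duplicates = allele_fraction(unique_alt + extra_alt_duplicates, unique_ref)

print(f"Allele fraction, unique fragments only: {vaf_unique_only:.2f}")
print(f"Allele fraction, duplicates included:   {vaf_with_duplicates:.2f}")
```

The duplicates add no new evidence about the sample, yet they shift the apparent allele fraction away from the 0.5 supported by the unique fragments.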
PCR amplification vs bridge amplification?
Why are PCR duplicates rare even though heavy PCR amplification generates a huge number of copies?
When the library is loaded onto the chip, there are far more molecules than adapters on the chip. Each unique molecule’s chance of being represented even once (let alone twice) is small.
What about the molecules that never get represented? That is fine: we started with many copies of the genome, extracted from many cells, so the overall allele frequency is not skewed by the missed molecules.
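The argument above can be checked with a simplified sampling model. The sketch assumes every unique molecule was amplified equally and that the pool is much larger than the number of clusters, so seeding a cluster behaves like drawing a molecule ID uniformly at random; under that assumption the number of PCR copies per molecule drops out. The library sizes and read counts are arbitrary illustrative values, and real libraries amplify unevenly, so treat the numbers as qualitative.

```python
import random


def expected_duplicate_fraction(unique_molecules: int, reads: int) -> float:
    """Expected fraction of reads that repeat an already-sampled molecule,
    assuming each read picks a molecule uniformly at random."""
    expected_distinct = unique_molecules * (
        1.0 - (1.0 - 1.0 / unique_molecules) ** reads
    )
    return 1.0 - expected_distinct / reads


def simulated_duplicate_fraction(unique_molecules: int, reads: int, seed: int = 0) -> float:
    """Monte Carlo version of the same model, for a sanity check."""
    rng = random.Random(seed)
    seen = set()
    duplicates = 0
    for _ in range(reads):
        molecule = rng.randrange(unique_molecules)
        if molecule in seen:
            duplicates += 1
        else:
            seen.add(molecule)
    return duplicates / reads


reads = 10_000
for label, unique_molecules in [("high complexity", 1_000_000), ("low complexity", 5_000)]:
    expected = expected_duplicate_fraction(unique_molecules, reads)
    simulated = simulated_duplicate_fraction(unique_molecules, reads)
    print(f"{label}: expected {expected:.3f}, simulated {simulated:.3f}")
```

When unique molecules greatly outnumber the reads, almost no molecule is sampled twice. When the library holds fewer unique molecules than the reads we ask of it, duplicates dominate regardless of how many PCR copies exist.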
How many rounds of PCR amplification are needed to keep PCR duplicates low while ensuring enough depth?
The goal is a balance between a PCR-free library (low coverage because of the low amount of DNA) and excessive PCR amplification (more PCR duplicates).
It is all about library complexity: garbage in, garbage out. PCR cannot increase complexity; it can only add copies of molecules that are already there, and those copies become duplicates.
PCR duplicates in WGS, WES, targeted (amplicon) sequencing and RNA-Seq
WGS usually has relatively low depth, so the level of PCR duplicates is relatively low.
When sequencing depth is high, for example in targeted sequencing or when sequencing tumor samples, the level of PCR duplicates is relatively high.
In RNA-Seq, heavy PCR amplification is desirable so that lowly expressed regions are not drowned out by highly expressed regions. In that case some degree of PCR duplication (mostly from highly expressed regions) is unavoidable, and we do not want to remove these reads because most of them are probably biological duplicates.
Circumstances where PCR duplicate removal is not desired
As we said earlier, PCR duplicates should be removed to avoid bias during variant calling. However, not all sequencing techniques support duplicate removal:
Amplicon sequencing: All targeted regions are amplified at the same positions using primers, so reads from different molecules share the same coordinates. Duplicate removal would discard reads that are not real PCR duplicates but originate from different cells.
Single-end (SE) sequencing: Paired-end sequencing provides a pair of reads, and therefore two positions, for identifying duplicates. SE data lacks the mate position, so duplicate removal would discard a significant number of false-positive duplicates.
RNA-Seq: In an RNA-Seq library, a fair number of duplicates come from either highly expressed genes (natural duplicates) or PCR, and it is difficult to tell the two apart. Generally, as long as we have a sufficient amount of starting material, we assume that most of the duplicates are biological. This article discusses in detail how to estimate the proportions of natural and PCR duplicates.
Liquid biopsy: For liquid biopsy we usually start with a limited amount of DNA. To generate high-depth data we need heavy PCR amplification and deep sequencing. That unavoidably produces false PCR duplicates (reads that look identical but come from different DNA fragments in the blood), so conventional duplicate removal does not work. A barcoding technique, such as unique molecular identifiers (UMIs), should be applied instead.
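The cases above differ mainly in how much information is available to decide whether two reads come from the same original molecule. The following toy sketch groups reads by three different keys: position only (single-end), position plus mate position (paired-end), and position plus a molecular barcode. The `Read` type, its fields, and the four example reads are hypothetical and exist only for this illustration; real tools also account for details such as soft clipping and sequencing errors in the barcode.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Read(NamedTuple):
    chrom: str
    start: int                  # 5' alignment position of the read
    strand: str
    mate_start: Optional[int]   # None for single-end data
    umi: Optional[str]          # None when no molecular barcode was used


def single_end_key(r: Read):
    return (r.chrom, r.start, r.strand)


def paired_end_key(r: Read):
    return (r.chrom, r.start, r.strand, r.mate_start)


def umi_key(r: Read):
    return paired_end_key(r) + (r.umi,)


def count_flagged_duplicates(reads, key) -> int:
    """Number of reads that would be flagged: all but one read per key group."""
    groups = Counter(key(r) for r in reads)
    return sum(n - 1 for n in groups.values())


reads = [
    Read("chr1", 100, "+", 350, "AAGT"),  # original molecule A
    Read("chr1", 100, "+", 350, "AAGT"),  # true PCR duplicate of A
    Read("chr1", 100, "+", 350, "CGTA"),  # different molecule, same coordinates
    Read("chr1", 100, "+", 420, "TTAG"),  # different fragment, same start only
]

se_flagged = count_flagged_duplicates(reads, single_end_key)
pe_flagged = count_flagged_duplicates(reads, paired_end_key)
umi_flagged = count_flagged_duplicates(reads, umi_key)
print(f"single-end key: {se_flagged}, paired-end key: {pe_flagged}, UMI key: {umi_flagged}")
```

Only one read in this toy set is a true PCR duplicate. The position-only key over-calls the most, the paired-end key still merges the two distinct molecules that share both coordinates, and only the barcode-aware key recovers the correct count.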
Lastly, let us briefly cover two other types of duplicates:
Optical duplicates
A read is called an optical duplicate if it and the read it duplicates are both on the same tile and the distance between them is less than the threshold set in Picard's MarkDuplicates. Such duplicates can be detected by proximity.
Exclusion Amplification duplicates (ExAmp)
Such duplicates only exist on patterned flowcells (e.g., HiSeq 4000). They arise when the original library molecule leaves the clustering area and floats off to create havoc elsewhere on the flowcell. They can also be detected by proximity.
Note that both optical duplicates and ExAmp duplicates can be detected by GATK.
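To make "detected by proximity" concrete, here is a minimal sketch that reads the cluster coordinates from Illumina-style read names (`instrument:run:flowcell:lane:tile:x:y`) and checks whether two reads already known to be alignment duplicates sit close together on the same tile. The function names, the per-axis comparison, and the default threshold are assumptions for illustration, not Picard's or GATK's actual implementation; in Picard the threshold is set with `OPTICAL_DUPLICATE_PIXEL_DISTANCE`.

```python
from typing import NamedTuple


class ClusterLocation(NamedTuple):
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int


def parse_read_name(name: str) -> ClusterLocation:
    """Extract flowcell, lane, tile, x and y from an Illumina-style read name."""
    fields = name.lstrip("@").split()[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected read name format: {name!r}")
    _instrument, _run, flowcell, lane, tile, x, y = fields[:7]
    y = y.split("/")[0]  # drop a trailing /1 or /2 mate suffix if present
    return ClusterLocation(flowcell, int(lane), int(tile), int(x), int(y))


def is_proximity_duplicate(name_a: str, name_b: str, max_pixel_distance: int = 100) -> bool:
    """True if both clusters are on the same tile and within the threshold
    on each axis. Assumes the two reads are already alignment duplicates."""
    a, b = parse_read_name(name_a), parse_read_name(name_b)
    same_tile = (a.flowcell, a.lane, a.tile) == (b.flowcell, b.lane, b.tile)
    return (
        same_tile
        and abs(a.x - b.x) <= max_pixel_distance
        and abs(a.y - b.y) <= max_pixel_distance
    )


# Hypothetical read names for illustration.
near = is_proximity_duplicate("@SIM:1:FCX:1:1101:1000:2000", "@SIM:1:FCX:1:1101:1040:2030")
other_tile = is_proximity_duplicate("@SIM:1:FCX:1:1101:1000:2000", "@SIM:1:FCX:1:1102:1040:2030")
far = is_proximity_duplicate("@SIM:1:FCX:1:1101:1000:2000", "@SIM:1:FCX:1:1101:9000:2000")
print(near, other_tile, far)
```

Duplicates that fail this test are left as ordinary PCR duplicates. On patterned flowcells a larger threshold than on unpatterned ones is typically needed, because ExAmp duplicates can land further from the original cluster.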