A good tool to evaluate RNA-Seq library complexity by estimating PCR duplicates / biological duplica

For RNA-Seq, it has been challenging to distinguish PCR duplicates from biological duplicates. Ghost found a tool that tackles this and answers two questions:

  • In your RNA-Seq library, are the duplicates primarily PCR duplicates or biological duplicates?

  • Does your RNA-Seq library have enough complexity (DNA material) to continue with expression analysis?

Why are duplicates problematic in RNA-Seq?

A certain level of duplication is unavoidable in RNA-Seq. There is a long-running debate about whether duplicates should be removed during analysis (most people seem to lean toward keeping them).

Those in favor of removing duplicates argue that leaving them in biases the expression measurements toward highly expressed genes, which produce most of the PCR artifacts.

Those against removing duplicates argue that most duplicates are biological duplicates from highly expressed genes, so removing them biases the expression measurements toward lowly expressed genes.

It all comes down to this: whether to remove duplicates depends on whether most of them are biological duplicates or PCR duplicates from an over-amplified library. In other words, it depends on the library complexity.

Ways to tackle this

One way to gauge natural versus PCR duplicates is to look at the alignments in IGV. Natural duplicates have a relatively smooth distribution across most exons (isoforms) of a highly expressed gene, while PCR duplicates may show up as random depth peaks. However, this only gives a rough estimate.

Alternatively, we can use dupRadar. The idea is that, in a moderately amplified RNA-Seq library, duplicates should come only from highly expressed genes. If, on the other hand, the library had to be over-amplified because of insufficient DNA material, duplicates may come from all expressed genes. Let's check the heatmaps produced by dupRadar.

The x-axis shows the expression level per kbp (number of reads aligned per kbp of gene) and the y-axis shows the percentage of duplicates. In the left figure, the duplication level increases as expression rises; this is what a moderately amplified library looks like. The right figure shows a library with insufficient DNA material that was over-amplified, so PCR duplicates dominate the library.
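
To make the two axes concrete, here is a minimal Python sketch of the quantities plotted for each gene, plus a crude summary that separates the two patterns. It is not dupRadar's code or API (dupRadar is an R/Bioconductor package). The GeneCounts record, its field names, the max_rpk cutoff, and the toy numbers are all hypothetical, and it assumes you already have per-gene read counts with and without duplicate-marked reads.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GeneCounts:
    """Hypothetical per-gene record; field names are illustrative only."""
    name: str
    length_bp: int     # gene (exon) length in base pairs
    all_reads: int     # reads assigned to the gene, duplicates included
    dedup_reads: int   # reads left after dropping marked duplicates


def reads_per_kbp(g: GeneCounts) -> float:
    """x-axis: expression normalized by gene length only."""
    return g.all_reads / (g.length_bp / 1000)


def duplication_rate(g: GeneCounts) -> float:
    """y-axis: fraction of the gene's reads that are marked as duplicates."""
    if g.all_reads == 0:
        return 0.0
    return (g.all_reads - g.dedup_reads) / g.all_reads


def mean_dup_rate_low_expression(genes, max_rpk=10.0):
    """Average duplication rate over lowly expressed genes.

    max_rpk is an arbitrary illustrative cutoff, not a dupRadar default.
    """
    rates = [duplication_rate(g) for g in genes
             if 0 < reads_per_kbp(g) <= max_rpk]
    return mean(rates) if rates else 0.0


# Toy data, invented for illustration.
moderate = [
    GeneCounts("lowA", 2000, 10, 10),
    GeneCounts("lowB", 1000, 8, 8),
    GeneCounts("highC", 1500, 30000, 12000),
]
over_amplified = [
    GeneCounts("lowA", 2000, 10, 3),
    GeneCounts("lowB", 1000, 8, 2),
    GeneCounts("highC", 1500, 30000, 6000),
]

print(mean_dup_rate_low_expression(moderate))
print(mean_dup_rate_low_expression(over_amplified))
```

In the toy "moderate" library the lowly expressed genes carry no duplicates, while in the "over-amplified" one they are already heavily duplicated, which is the difference between the left and right figures.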

The green dotted line also hints at the complexity (input DNA amount) of the library: look at where this index, which is normalized by both gene length and library size, falls on the x-axis, which is normalized by gene length only.
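
The relationship between the two normalizations can be sketched in a few lines. RPKM divides a read count by gene length in kbp and by library size in millions of mapped reads, so a fixed RPKM value corresponds to a reads-per-kbp value that scales with library size. The function name and the example library sizes below are hypothetical, and the 0.5 RPKM threshold is an assumption about where the green line is drawn, so check it against your dupRadar version.

```python
def rpkm_to_reads_per_kbp(rpkm: float, library_size_reads: int) -> float:
    """Convert an RPKM value to reads per kbp for a given library size.

    RPKM = reads / (length_kbp * library_size_in_millions), hence
    reads / length_kbp = RPKM * library_size_in_millions.
    """
    return rpkm * (library_size_reads / 1_000_000)


# Assumed threshold for the green line; not confirmed by the text above.
ASSUMED_RPKM_THRESHOLD = 0.5

# Hypothetical library sizes (mapped reads).
for size in (5_000_000, 20_000_000, 80_000_000):
    x = rpkm_to_reads_per_kbp(ASSUMED_RPKM_THRESHOLD, size)
    print(f"{size:>11,d} reads -> line at {x:g} reads/kbp")
```

The more mapped reads a library has, the further right the same RPKM value sits on the reads/kbp axis.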

Note that, compared with a PE library, the slope in the heatmap will be much higher for an SE library, because reads are more likely to be treated as duplicates.
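
Here is a small sketch of why single-end reads are flagged more readily. It assumes the common position-based convention in which single-end reads count as duplicates when they share chromosome, strand, and start position, while paired-end fragments must also share the mate position. The tuples and coordinates are invented, and real duplicate markers apply further rules that are not modeled here.

```python
from collections import Counter

# Hypothetical fragments: (chromosome, strand, read1_start, mate_start).
# Three distinct fragments share a start position; the fourth repeats
# the third and is the only likely true duplicate.
fragments = [
    ("chr1", "+", 1000, 1180),
    ("chr1", "+", 1000, 1225),
    ("chr1", "+", 1000, 1310),
    ("chr1", "+", 1000, 1310),
]


def count_duplicates(keys):
    """Count reads beyond the first in each group of identical keys."""
    return sum(n - 1 for n in Counter(keys).values())


# Single-end: only the first read's position is known.
se_keys = [(chrom, strand, start) for chrom, strand, start, _ in fragments]
# Paired-end: the mate position is part of the key.
pe_keys = list(fragments)

se_duplicates = count_duplicates(se_keys)
pe_duplicates = count_duplicates(pe_keys)
print(se_duplicates, pe_duplicates)
```

With only the read start to go on, all four reads collapse into one group; with the mate position included, only the repeated fragment is counted as a duplicate.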