A selective look at some scenarios you may see in a FastQC report

Um... interesting plot below, it is. I believe you have all seen all sorts of weird pictures from time to time. I want to bring a few sections into the spotlight and talk about them.

Caveat: the pass / fail criteria of FastQC are designed specifically for DNA-Seq. Some kinds of sequencing (RNA-Seq, targeted sequencing or 16S) naturally have an uneven proportion of duplicate reads and a biased composition. This may cause false-positive failures in:

  • Per base sequence content

  • Per sequence GC content

  • Sequence duplication levels

Per tile sequence quality

This is a base call quality plot. During sequencing with Illumina technology, the flow cell is imaged tile by tile to capture the fluorescence signal as each nucleotide is incorporated. The quality of each base call on the tile is recorded. This figure shows the average quality of the bases called on each tile at each position in the read, relative to the other tiles. Cold (blue) colors mean the tile is at or above average; the quality decreases as the color turns red.
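To make the idea behind the plot concrete, here is a minimal Python sketch of a per-tile quality summary. It is not FastQC's own code. It assumes the usual Illumina (CASAVA 1.8+) header layout `@instrument:run:flowcell:lane:tile:x:y`, Phred+33 quality encoding, and made-up example reads; the function names `tile_of` and `per_tile_deviation` are hypothetical.

```python
from collections import defaultdict


def tile_of(header):
    """Return the tile number from a read header.

    Assumed layout: @instrument:run:flowcell:lane:tile:x:y [comment]
    """
    read_id = header[1:].split()[0]
    return int(read_id.split(":")[4])


def per_tile_deviation(records, phred_offset=33):
    """records: iterable of (header, quality_string) pairs.

    Returns {tile: {cycle: deviation}}, where deviation is the tile's mean
    Phred score at that cycle minus the average of all tile means there.
    """
    sums = defaultdict(lambda: defaultdict(int))    # tile -> cycle -> Phred sum
    counts = defaultdict(lambda: defaultdict(int))  # tile -> cycle -> base count
    for header, qual in records:
        tile = tile_of(header)
        for cycle, ch in enumerate(qual):
            sums[tile][cycle] += ord(ch) - phred_offset
            counts[tile][cycle] += 1

    means = {t: {c: sums[t][c] / counts[t][c] for c in counts[t]} for t in counts}
    cycles = sorted({c for t in means for c in means[t]})
    result = {t: {} for t in means}
    for c in cycles:
        tiles_here = [t for t in means if c in means[t]]
        avg = sum(means[t][c] for t in tiles_here) / len(tiles_here)
        for t in tiles_here:
            result[t][c] = means[t][c] - avg
    return result


# Made-up reads: tile 1102 loses quality in the last two cycles.
records = [
    ("@SIM:1:FCX:1:1101:1000:2000 1:N:0:1", "IIII"),
    ("@SIM:1:FCX:1:1101:1001:2001 1:N:0:1", "IIII"),
    ("@SIM:1:FCX:1:1102:1000:2000 1:N:0:1", "II##"),
]
deviation = per_tile_deviation(records)
for tile in sorted(deviation):
    print(tile, [round(deviation[tile][c], 1) for c in sorted(deviation[tile])])
```

A tile with strongly negative values over a stretch of cycles is the kind of tile that would show up as a warm-colored band in the plot.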

Several scenarios may occur, as shown below:

  • Overall low base call quality

Loss of quality at random spots like this may indicate a general problem with the run, leading to overall low-quality base calls. The most likely reason is overloading of the flow cell.

  • Biased sequence composition

Loss of quality in several areas can be caused by a very biased sequence composition.

  • Sequencing gets obstructed by dirt or bubbles

Sometimes we see a loss of quality that starts on a certain tile and lasts to the end of the run. This happens when the imaging gets obstructed by something as simple as dirt. One often sees these in pairs because any obstruction affects both the top and bottom swaths.

Sequences from these areas can be trimmed.

  • Bubbles cause false-positive insertions

A temporary loss of quality over a restricted area. This can happen when something like a bubble is washed into the flow cell, preventing sequencing reagents from reaching the clusters under the bubble. As a consequence, those clusters skip chemistry cycles and the last base is read repeatedly until the bubble is washed out and fresh reagents reach the clusters again. This means the sequences are artificially extended and artificial insertions are introduced.
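For the obstructed-tile scenario above, one simple way to act on it is to drop whole reads that come from the affected tiles. The sketch below is only an illustration under assumptions: it expects plain four-line FASTQ records, the Illumina header layout `@instrument:run:flowcell:lane:tile:x:y`, and a set of bad tile numbers that you picked by eye from the plot. `read_fastq` and `drop_tiles` are hypothetical helper names.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from FASTQ lines."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return
        yield tuple(record)


def drop_tiles(records, bad_tiles):
    """Yield only the records whose tile number is not in bad_tiles."""
    for rec in records:
        # Assumed header layout: @instrument:run:flowcell:lane:tile:x:y
        tile = int(rec[0][1:].split()[0].split(":")[4])
        if tile not in bad_tiles:
            yield rec


# Made-up input; in practice `lines` would be an open FASTQ file handle.
lines = [
    "@SIM:1:FCX:1:1101:1000:2000 1:N:0:1\n", "ACGT\n", "+\n", "IIII\n",
    "@SIM:1:FCX:1:1102:1000:2000 1:N:0:1\n", "TTTT\n", "+\n", "II##\n",
]
kept = list(drop_tiles(read_fastq(lines), bad_tiles={1102}))
for rec in kept:
    print("\n".join(rec))
```

For paired-end data the same tiles would have to be removed from both files so the mates stay in sync.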

Per base sequence content

In a random library you would see little difference between the bases at each position. However, this module can fail in one of the following scenarios:

  • Over-represented sequences

If a large number of identical sequences, for example PCR duplicates, adapter dimers or rRNA, exist in the raw data, bases from these sequences may be enriched at the corresponding positions.

  • Random priming

For libraries generated by priming with random hexamers, there is nearly always a selection bias in roughly the first 12 bp. Nearly all RNA-Seq libraries have this issue. However, it does not appear to bias expression estimation.
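What this module measures can be sketched in a few lines of Python. This is an illustration, not FastQC's implementation; the reads are made up and the function names are hypothetical. The imbalance helper follows the rule described in the FastQC documentation, which compares A with T and G with C at each position (a warning above a 10% difference and a failure above 20%, as far as I recall).

```python
from collections import Counter


def per_base_content(seqs):
    """Return one dict per read position: {base: percentage of reads}."""
    counts = []
    for seq in seqs:
        for pos, base in enumerate(seq.upper()):
            if pos == len(counts):
                counts.append(Counter())
            counts[pos][base] += 1
    table = []
    for c in counts:
        total = sum(c.values())  # includes N and any other symbols
        table.append({b: 100.0 * c[b] / total for b in "ACGT"})
    return table


def at_gc_imbalance(table):
    """Per position, the larger of |A - T| and |G - C| in percentage points."""
    return [max(abs(p["A"] - p["T"]), abs(p["G"] - p["C"])) for p in table]


# Made-up reads that share the same bases at positions 2 and 3.
seqs = ["ACGT", "ACGA", "TCGA", "GCGT"]
table = per_base_content(seqs)
imbalance = at_gc_imbalance(table)
for pos, (row, diff) in enumerate(zip(table, imbalance), start=1):
    print(pos, row, diff)
```

With a real random-primed library you would expect the imbalance to be large in the first dozen positions and then settle down.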

Per sequence GC content

In a random library you would see a roughly normal distribution overlapping the theoretical distribution (blue line). Note that FastQC does not know the GC content of your genome: the theoretical distribution is a normal curve built from the modal GC content of the observed data, so data from a species with unusual GC content shows a shifted peak rather than a failure. Abnormal (multiple) peaks may indicate contamination with adapter dimers or with material from other species.
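A rough sketch of the underlying histogram, assuming made-up reads and hypothetical function names (this is not FastQC's code): compute the GC percentage of every read and count how many reads fall into each percentage bin. Two well-separated bins with many reads are what a second peak looks like in the plot.

```python
from collections import Counter


def gc_percent(seq):
    """GC content of one read, rounded to a whole percentage."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return round(100.0 * gc / len(seq))


def gc_histogram(seqs):
    """Return Counter {GC percentage: number of reads}, skipping empty reads."""
    return Counter(gc_percent(s) for s in seqs if s)


# Made-up mixture: an AT-rich population plus a GC-rich contaminant.
reads = ["ATATATATGC"] * 3 + ["GCGCGCGCAT"] * 2
hist = gc_histogram(reads)
for pct in sorted(hist):
    print(f"{pct:3d}% {'#' * hist[pct]}")
```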

Sequence Duplication Levels

A high duplication level can come from either PCR duplicates or biological duplicates. For instance, reads from highly expressed transcripts tend to be over-sequenced in RNA-Seq. One can take the top sequences from the Overrepresented Sequences table and BLAST them against the nt database to check their origin.

  • Always think about the length of the target region and the coverage when evaluating the duplication level.

  • Only exact matches are counted, so poor base call quality may bias the duplication level (sequencing errors make true duplicates look distinct).
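The exact-match point is easy to see in a small sketch. This is a simplified illustration with made-up reads and a hypothetical function name, not FastQC's actual procedure: count identical sequences, then tabulate how many distinct sequences occur once, twice, and so on.

```python
from collections import Counter


def duplication_summary(seqs):
    """Return (levels, percent_remaining).

    levels: Counter {duplication level: number of distinct sequences}
    percent_remaining: share of reads left if every duplicate were collapsed.
    """
    counts = Counter(seqs)            # exact string matches only
    levels = Counter(counts.values())
    total = sum(counts.values())
    percent_remaining = 100.0 * len(counts) / total
    return levels, percent_remaining


# Made-up reads: the fourth read differs from the first three by one base,
# so it is counted as a separate sequence rather than a fourth duplicate.
reads = ["ACGTACGT"] * 3 + ["ACGTACGA"] + ["TTTTCCCC"]
levels, percent_remaining = duplication_summary(reads)
print(dict(levels), percent_remaining)
```

If that single mismatch were a base-calling error, the true duplication level would be higher than what the count reports.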