Um... interesting plots below. I believe you have all seen all sorts of weird pictures from time to time. I want to bring a few sections into the spotlight and talk about them.
Caveat: the pass / fail criteria of FastQC are specifically designed for DNA-Seq. Some kinds of sequencing (RNA-Seq, targeted sequencing or 16S) naturally have an uneven proportion of duplicate reads and biased sequence composition. This may cause false positive failures in:
Per tile sequence quality
This is a base call quality plot. During sequencing with Illumina technology, the flow cell is scanned to capture the fluorescence signal as each nucleotide is incorporated. The quality of each base call on each tile is recorded. This figure shows the average quality of the bases called on each tile, relative to the average over all tiles at the same position. The quality gets worse as the color turns red.
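If you want to look at the numbers behind this plot yourself, here is a minimal sketch of averaging quality per tile. It assumes Phred+33 qualities and the usual Illumina read header layout (@instrument:run:flowcell:lane:tile:x:y), where the tile is the fifth colon-separated field. The function names and the example reads are made up for illustration, and this is coarser than what FastQC does, since it averages over the whole read instead of per position.

```python
from collections import defaultdict
from statistics import mean

def parse_fastq(lines):
    """Yield (header, sequence, quality) from FASTQ lines (4 lines per record)."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield header.strip(), seq, qual

def tile_of(header):
    """Tile number, ASSUMING a header like @instrument:run:flowcell:lane:tile:x:y."""
    return int(header.split()[0].split(":")[4])

def mean_quality_per_tile(lines, offset=33):
    """Map tile -> mean Phred quality over all bases of all reads on that tile."""
    scores = defaultdict(list)
    for header, _seq, qual in parse_fastq(lines):
        scores[tile_of(header)].extend(ord(c) - offset for c in qual)
    return {tile: mean(vals) for tile, vals in scores.items()}

# Made-up reads: 'I' is Phred 40 and '#' is Phred 2 in Phred+33 encoding.
example = [
    "@M01:1:FC1:1:1101:100:200 1:N:0:1", "ACGT", "+", "IIII",
    "@M01:1:FC1:1:1101:101:201 1:N:0:1", "ACGT", "+", "II##",
    "@M01:1:FC1:1:2101:102:202 1:N:0:1", "ACGT", "+", "####",
]
per_tile = mean_quality_per_tile(example)
for tile, q in sorted(per_tile.items()):
    print(tile, q)
```

A tile whose mean sits well below the others is the kind of spot that would show up as a hot color in the plot.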
Several scenarios may occur, as shown below:
A loss of quality at random spots like this may indicate a general problem with the run that leads to low-quality base calls overall. The most likely reason is an overloaded flow cell.
Loss of quality in several areas can be caused by a very biased sequence composition.
Sometimes a loss of quality appears on certain tiles and lasts to the end of the run. This happens when the imaging is obstructed by something as simple as dirt. These tiles often show up in pairs, because any obstruction affects both the top and bottom swaths.
Sequences from these areas can be trimmed or filtered out.
A temporary loss of quality over a restricted area. This can happen when something like a bubble is washed into the flow cell and prevents the sequencing reagents from reaching the clusters underneath it. As a consequence, those clusters skip chemistry cycles and the last base is read repeatedly until the bubble is washed out and fresh reagents reach the clusters again. This means the sequences are artificially extended and artificial insertions are introduced.
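Getting rid of reads from a bad tile can be sketched in a few lines. This is only an illustration: it assumes a well-formed 4-line-per-record FASTQ and the same Illumina header layout (tile as the fifth colon-separated field), and `drop_tiles` and the tile numbers are hypothetical. For paired-end data you would apply the same tile list to both files so that mates stay in sync.

```python
def drop_tiles(lines, bad_tiles):
    """Return FASTQ lines with every read from a tile in bad_tiles removed."""
    records = [line.rstrip("\n") for line in lines]
    kept = []
    # Walk the file one 4-line record at a time.
    for i in range(0, len(records) - 3, 4):
        header = records[i]
        # ASSUMPTION: header looks like @instrument:run:flowcell:lane:tile:x:y
        tile = int(header.split()[0].split(":")[4])
        if tile not in bad_tiles:
            kept.extend(records[i:i + 4])
    return kept

# Made-up reads from two tiles; pretend tile 2101 was obstructed.
example = [
    "@M01:1:FC1:1:1101:100:200 1:N:0:1", "ACGT", "+", "IIII",
    "@M01:1:FC1:1:2101:102:202 1:N:0:1", "ACGT", "+", "####",
    "@M01:1:FC1:1:1101:101:201 1:N:0:1", "TTGA", "+", "IIII",
]
kept = drop_tiles(example, bad_tiles={2101})
print(len(kept) // 4, "reads kept")
```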
Per base sequence content
In a random library you would see little difference between the proportions of the four bases at each position. However, this module can fail in one of the following scenarios:
If a large number of identical sequences (for example PCR duplicates, adapter dimers or rRNA) exist in the raw data, the bases from these sequences may be enriched at the corresponding positions.
For libraries generated by priming with random hexamers, there is always a selection bias in roughly the first 12 bp. Nearly all RNA-Seq libraries have this issue. However, it does not seem to bias the expression estimates.
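To see what the module is measuring, here is a small sketch that computes the percentage of each base at every position and lists the positions where A vs T or G vs C differ by more than a threshold. The 10% and 20% cutoffs are the warning and failure thresholds described in the FastQC documentation. The function names and the toy reads are assumptions for illustration only.

```python
from collections import Counter

def base_content(seqs):
    """One dict per position: base -> percentage of reads with that base there."""
    length = max(len(s) for s in seqs)
    counts = [Counter() for _ in range(length)]
    for s in seqs:
        for i, base in enumerate(s.upper()):
            counts[i][base] += 1
    result = []
    for c in counts:
        total = sum(c.values())  # includes N calls, if any
        result.append({b: 100.0 * c[b] / total for b in "ACGT"})
    return result

def biased_positions(seqs, threshold=10.0):
    """0-based positions where |A-T| or |G-C| exceeds threshold (in percent)."""
    flagged = []
    for i, pct in enumerate(base_content(seqs)):
        if max(abs(pct["A"] - pct["T"]), abs(pct["G"] - pct["C"])) > threshold:
            flagged.append(i)
    return flagged

# Toy reads: every read starts with A, the other positions are balanced.
seqs = ["AACG", "ATGC", "AGTA", "ACAT"]
content = base_content(seqs)
biased = biased_positions(seqs, threshold=20.0)
print(biased)
```

With a hexamer-primed library you would expect the flagged positions to cluster at the start of the read.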
Per sequence GC content
In a random library you would see a roughly normal distribution that overlaps the theoretical distribution (blue line). Note that FastQC does not know the GC content of your genome: the theoretical distribution is built from the observed data, centered on the modal GC content of the reads, so data from a species with unusual GC content simply shows its peak at a different position. Abnormal (multiple) peaks may indicate contamination with adapter dimers or with material from another species.
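The observed curve in this plot is just a histogram of GC percentage per read. A rough sketch of building that histogram and finding its most common value (the mode) is below. The function names and the example reads are hypothetical, and real data would of course have far more reads and longer sequences.

```python
from collections import Counter

def gc_percent(seq):
    """GC percentage of one read, rounded to the nearest whole percent."""
    s = seq.upper()
    if not s:
        return 0
    return round(100.0 * (s.count("G") + s.count("C")) / len(s))

def gc_histogram(seqs):
    """Map GC percentage -> number of reads with that percentage."""
    return Counter(gc_percent(s) for s in seqs)

def modal_gc(seqs):
    """Most common GC percentage (ties go to the value seen first)."""
    hist = gc_histogram(seqs)
    return max(hist, key=hist.get)

# Toy reads with 100%, 50%, 50%, 0% and 50% GC.
seqs = ["GGCC", "GCAT", "ATGC", "AATT", "ACGT"]
hist = gc_histogram(seqs)
mode = modal_gc(seqs)
print(sorted(hist.items()), mode)
```

A second bump in the histogram far from the mode is the kind of thing worth checking for contamination.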
Sequence Duplication Levels
A high duplication level can come from either PCR duplicates or biological duplicates. For instance, reads from highly expressed transcripts tend to be sequenced many times over in RNA-Seq. You can take the top sequences from the Overrepresented Sequences table and BLAST them against the nt database to check their origin.
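As a rough illustration of the idea, the sketch below counts identical reads, reports the share of reads that are duplicates, and writes the most frequent sequences as FASTA so they can be pasted into BLAST. Everything here (function names, the toy reads, the cutoff passed in the example) is an assumption for illustration. The default of 0.1% mirrors the cutoff the FastQC documentation gives for the Overrepresented Sequences table, but FastQC's own counting differs in detail from this simple version.

```python
from collections import Counter

def duplicate_fraction(seqs):
    """Fraction of reads that are extra copies of a sequence already seen."""
    return 1 - len(set(seqs)) / len(seqs)

def overrepresented(seqs, min_fraction=0.001):
    """List (sequence, count, percent) for sequences above min_fraction of all reads."""
    total = len(seqs)
    return [
        (s, n, 100.0 * n / total)
        for s, n in Counter(seqs).most_common()
        if n / total > min_fraction
    ]

def to_fasta(hits):
    """Format overrepresented sequences as FASTA text for a BLAST query."""
    return "\n".join(
        f">overrepresented_{i}_count_{n}\n{s}"
        for i, (s, n, _pct) in enumerate(hits, start=1)
    )

# Toy data: 10 reads, 3 distinct sequences.
seqs = ["ACGT"] * 6 + ["TTTT"] * 3 + ["GGGG"]
dup = duplicate_fraction(seqs)
hits = overrepresented(seqs, min_fraction=0.2)
fasta = to_fasta(hits)
print(fasta)
```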