Beware of poly-G sequences from NextSeq runs
We may occasionally find an unexpected number of poly-G reads in raw data generated on a NextSeq. See the FastQC figure below for a typical example.
In the k-mer content plot, all kinds of G-enriched k-mers peak across almost the entire read length. Poly-G will probably also show up in the overrepresented sequences table.
What causes poly-G?
As shown in the left figure below, HiSeq and MiSeq use a four-color method for base calling. Each color represents one base type. In every sequencing cycle, four images are taken, one per filter channel, and the base is determined from the emission wavelength of the dye detected in each channel.
Unlike HiSeq and MiSeq, the NextSeq and NovaSeq systems use a two-color method. An A + C filter channel and an A + T filter channel replace the traditional four-channel system. In the two-channel system, "A" is called when both channels emit, "C" or "T" is called when only one channel emits, and "G" is called when neither channel emits. This method reduces the number of images per cycle from 4 to 2.
This switch in detection method causes the poly-G problem because the two-channel system handles low-quality base calls poorly. In the four-channel system, when the emission intensity is weak, ambiguous, or even undetectable, the system assigns a likely base with a poor quality score. The two-channel system, however, cannot distinguish "G" from "no signal", because both situations result in no emission in either channel.
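The ambiguity can be illustrated with a toy base caller. This is only a sketch, not Illumina's actual base-calling algorithm: the function name `call_base`, the normalized intensities, and the 0.5 threshold are hypothetical, and it assumes one channel detects A and C while the other detects A and T.

```python
def call_base(channel_1: float, channel_2: float, threshold: float = 0.5) -> str:
    """Toy two-channel base call.

    Assumption: channel_1 lights up for A and C, channel_2 for A and T.
    Intensities are normalized to 0..1; the threshold is illustrative only.
    """
    on_1 = channel_1 >= threshold
    on_2 = channel_2 >= threshold
    if on_1 and on_2:
        return "A"
    if on_1:
        return "C"
    if on_2:
        return "T"
    # Neither channel emits: a true G and a dark (failed) cluster look the same.
    return "G"


print(call_base(0.9, 0.8))  # A
print(call_base(0.9, 0.1))  # C
print(call_base(0.1, 0.9))  # T
print(call_base(0.0, 0.0))  # G (a real G, or no signal at all)
```

Any cluster whose signal drops below the threshold in both channels is reported as "G", which is why a run of failed cycles shows up as a poly-G stretch.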
What causes "no signal"?
Enough of the cluster has degraded or stalled that the remaining signal is too weak to make a confident call.
Something physically blocks the imaging of the flowcell (air bubbles, dirt on the surface, etc.).
In Illumina's bridge amplification, a fragment attaches to the flowcell and synthesis extends from one end. However, some fragments may fail to form a bridge, which results in poor base call quality in R2.
Can we guess which of these possibilities applies to our data?
When enough of the cluster has degraded, your poly-G-enriched reads probably come from clusters that degraded throughout the entire fragment, so both filter channels are prone to "no signal". In that case, you will probably see little "C" in your poly-G-enriched reads (some C migrates to either A or T, and some A or T migrates to G).
When something physically blocks the imaging of the flowcell, the "Per tile sequence quality" module in FastQC will probably fail. Please refer to THIS POST for more details.
If the poly-G reads are caused by poor bridge formation, they will most likely appear in R2.
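One quick way to check the R2 hypothesis is to compare the fraction of poly-G reads in R1 and R2. The sketch below is a minimal illustration using only the standard library. The helper names (`fastq_sequences`, `polyg_read_fraction`) and the 0.9 cutoff are hypothetical choices, and the reads are made-up toy data.

```python
from typing import Iterable, Iterator


def fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def polyg_read_fraction(seqs: Iterable[str], min_g_fraction: float = 0.9) -> float:
    """Fraction of reads whose G content is at least min_g_fraction."""
    total = 0
    flagged = 0
    for seq in seqs:
        if not seq:
            continue  # skip empty reads
        total += 1
        if seq.upper().count("G") / len(seq) >= min_g_fraction:
            flagged += 1
    return flagged / total if total else 0.0


# Toy reads (not real data): R2 carries most of the poly-G reads.
r1 = ["ACGTACGTAC", "TTGACCGTAA", "GGGGGGGGGG", "CATGCATGCA"]
r2 = ["GGGGGGGGGG", "GGGGGGGGGG", "ACGTTGCAAC", "GGGGGGGGGG"]

print(polyg_read_fraction(r1))  # 0.25
print(polyg_read_fraction(r2))  # 0.75
```

For real data, the sequences could be streamed from a file, for example `fastq_sequences(gzip.open(path, "rt"))` after `import gzip`, where `path` points to your own FASTQ file. A clearly higher fraction in R2 than in R1 would be consistent with poor bridge formation.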
Note that poly-G reads cannot be removed with traditional quality-based trimming tools. Those tools scan back from the end of the read until they reach a base with good quality and cut there. However, poly-G reads may have poor base call quality in the middle and quite good quality at the end of the read, or even perfect quality across the entire read. One can filter out poly-G reads by setting a threshold on the percentage of "G" in each read.
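A G-percentage filter can be sketched in a few lines. This is a minimal illustration and not a replacement for a dedicated tool: the names `g_fraction` and `filter_polyg` and the 0.8 cutoff are hypothetical, and a suitable threshold depends on your library.

```python
from typing import Iterable, Iterator, Tuple

# One FASTQ record: (header, sequence, separator, quality string)
Record = Tuple[str, str, str, str]


def g_fraction(seq: str) -> float:
    """Fraction of G bases in a read (0.0 for an empty read)."""
    return seq.upper().count("G") / len(seq) if seq else 0.0


def filter_polyg(records: Iterable[Record], max_g_fraction: float = 0.8) -> Iterator[Record]:
    """Yield only records whose G fraction does not exceed max_g_fraction."""
    for record in records:
        if g_fraction(record[1]) <= max_g_fraction:
            yield record


# Toy records (not real data). read2 has top quality scores throughout,
# so quality-based trimming would keep it, but the G filter removes it.
records = [
    ("@read1", "ACGTACGTAC", "+", "IIIIIIIIII"),
    ("@read2", "GGGGGGGGGG", "+", "IIIIIIIIII"),
    ("@read3", "GGGGGAGGGG", "+", "IIIII#IIII"),
]

kept = list(filter_polyg(records))
print([record[0] for record in kept])  # ['@read1']
```

For paired-end data, a reasonable design choice is to drop the whole pair when either mate exceeds the threshold, so that R1 and R2 stay in sync.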