Broadly, RNA-Seq data can be affected by the following sources of bias:
Random hexamer priming bias: This appears in roughly the first 12 bases of the per base sequence content plot. It is caused by the sequence preference of the random hexamer primers used during library preparation (cDNA synthesis). Although it is normal, check for adapter contamination if the skew is unusually strong. This bias does not distort variant detection, since allele frequencies do not change. It is only a concern when comparing coverage between different regions of a genome, for example when comparing the levels of two transcripts.
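To make the per base sequence content check concrete, here is a minimal sketch that tabulates the A/C/G/T fractions at the first few read positions. The function name `per_base_content` and the toy reads are made up for illustration; this is not how FastQC itself is implemented.

```python
from collections import Counter

def per_base_content(reads, n_positions=15):
    """Fraction of A/C/G/T at each of the first n_positions read positions."""
    profile = []
    for pos in range(n_positions):
        counts = Counter(read[pos].upper() for read in reads if len(read) > pos)
        total = sum(counts[base] for base in "ACGT")  # ignore N and other symbols
        profile.append({base: (counts[base] / total if total else 0.0)
                        for base in "ACGT"})
    return profile

# Made-up toy reads; real input would be the sequence lines of a FASTQ file.
toy_reads = ["GTACCA", "GTTCGA", "GAACTA", "CTAGGA"]
for pos, fractions in enumerate(per_base_content(toy_reads, n_positions=3), start=1):
    print(pos, fractions)
```

With an unbiased library the four fractions should be roughly flat across positions; a skew confined to the first dozen positions is the pattern described above.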
3' bias: Introduced by poly(A) selection during reverse transcription.
rRNA depletion: It is quite common for a portion of the rRNA to escape depletion and remain in the library. Here are a few places where we can check this:
GC content in FastQC: Look for an unusual peak in the graph. Keep in mind that such an abnormality may have other causes as well, such as bacterial contamination.
Duplication levels in FastQC: Although high duplication (sequences seen hundreds of times) is normal for RNA-Seq, an excessive level is a sign of incomplete rRNA depletion. BLAST some of the top duplicated sequences to check whether they are rRNA.
Picard RnaSeqMetrics (CollectRnaSeqMetrics): This reports the percentage of the alignment that falls on each type of genomic element, including rRNA.
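The GC content check in the list above can be approximated by hand. The following is a rough sketch, assuming reads are available as plain strings; `gc_fraction`, `gc_histogram` and the 10% bin width are illustrative choices, not FastQC's actual implementation.

```python
from collections import Counter

def gc_fraction(seq):
    """Fraction of G/C among the A/C/G/T bases of one read."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

def gc_histogram(reads, bin_width=10):
    """Number of reads per GC% bin; the last bin also holds reads with 100% GC."""
    hist = Counter()
    for read in reads:
        pct = gc_fraction(read) * 100
        bin_start = min(int(pct // bin_width) * bin_width, 100 - bin_width)
        hist[bin_start] += 1
    return dict(sorted(hist.items()))

# Made-up toy reads for illustration only.
toy_gc_hist = gc_histogram(["ATAT", "ATGC", "GGCC", "GCGC"])
print(toy_gc_hist)
```

A secondary peak away from the main distribution is the kind of "unusual peak" worth following up, for example by BLASTing a few reads from that bin.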
RNA-Seq samples tend to contain more natural duplicate reads (tens or hundreds of copies are OK), which come from highly expressed fragments or from rRNA that escaped depletion. In addition, we usually run a somewhat heavier PCR just to get a fair number of reads from lowly expressed genes. These natural duplicates mix with PCR duplicates and cannot be told apart from them, which is why we do not remove duplicates.
Of course, PCR is not needed if there is sufficient sample material to work with, but usually there is not.
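Before deciding how worrying a duplication level is, it can help to look at which sequences are actually duplicated. Below is a minimal sketch that counts exact-sequence duplicates and lists the most frequent ones, which are the candidates to BLAST against rRNA. The function name and the summary fields are hypothetical.

```python
from collections import Counter

def duplication_summary(reads, top_n=3):
    """Summarise exact-sequence duplication in a list of reads."""
    counts = Counter(reads)
    total = len(reads)
    # Reads whose sequence occurs more than once in the input.
    in_duplicated = sum(c for c in counts.values() if c > 1)
    return {
        "total_reads": total,
        "distinct_sequences": len(counts),
        "fraction_in_duplicated_sequences": in_duplicated / total if total else 0.0,
        "top_sequences": counts.most_common(top_n),
    }

# Made-up toy reads: one sequence dominates, as undepleted rRNA might.
toy_dup_reads = ["AAA"] * 4 + ["CCC"] * 2 + ["GGG"]
dup_summary = duplication_summary(toy_dup_reads)
print(dup_summary)
```

Note that this only counts duplicates; it cannot tell natural duplicates from PCR duplicates, which is exactly the limitation described above.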
Depending on the sample extraction and cancer type, tumor purity can be as high as 95% or as low as 50%. Studies have shown that tumor purity, as an intrinsic property of the tumor (it depends on tissue, location and microenvironment), introduces bias into differential expression and co-expression analyses.
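One way to build intuition for why purity matters is a simple linear mixing model, in which the bulk signal is a purity-weighted average of tumor and normal expression. This model is an assumption made here for illustration, not something taken from the studies mentioned above, and the expression values are invented.

```python
def observed_expression(tumor_expr, normal_expr, purity):
    """Bulk expression under a simple linear mixing assumption."""
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    return purity * tumor_expr + (1.0 - purity) * normal_expr

# Hypothetical gene: 8 units in pure tumor cells, 2 units in normal cells,
# so the true tumor/normal fold change is 4.
for purity in (0.95, 0.50):
    bulk = observed_expression(8.0, 2.0, purity)
    print(f"purity={purity:.2f} bulk={bulk:.2f} apparent fold change={bulk / 2.0:.2f}")
```

Under this assumption, the lower the purity, the more the apparent fold change shrinks towards 1, so samples with different purity are not directly comparable.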
Besides the biases listed above, technical variation can also come from batch effects, which arise when technicians split a large number of samples into several batches and process them separately. Such technical variation may be confounded with biological variation in RNA-Seq analysis.
To check whether a batch effect has been introduced:
Label the samples by each possible handling batch (sampling date, extraction date, etc.).
Run principal component analysis (PCA) or hierarchical clustering.
Plot the PCA or clustering result and see whether samples from the same batch cluster together.
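The steps above can be sketched in a few lines. The example below computes the first principal component by power iteration, using only the standard library, and then averages the scores per batch. In practice you would use an established PCA implementation and plot the result; the expression matrix and batch labels here are invented.

```python
import math

def first_pc_scores(matrix, n_iter=200):
    """PC1 score per sample; matrix is a list of samples, each a list of gene values."""
    n_samples, n_genes = len(matrix), len(matrix[0])
    # Center every gene (column) at zero mean across samples.
    means = [sum(row[g] for row in matrix) / n_samples for g in range(n_genes)]
    centered = [[row[g] - means[g] for g in range(n_genes)] for row in matrix]
    # Power iteration: repeatedly apply X^T X to a vector and renormalise.
    vec = [1.0 / math.sqrt(n_genes)] * n_genes
    for _ in range(n_iter):
        scores = [sum(x * v for x, v in zip(row, vec)) for row in centered]
        new_vec = [sum(centered[s][g] * scores[s] for s in range(n_samples))
                   for g in range(n_genes)]
        norm = math.sqrt(sum(v * v for v in new_vec))
        if norm == 0:  # start vector orthogonal to the data, or no variance
            break
        vec = [v / norm for v in new_vec]
    return [sum(x * v for x, v in zip(row, vec)) for row in centered]

def mean_score_by_batch(scores, batches):
    """Average PC1 score of the samples in each batch."""
    grouped = {}
    for score, batch in zip(scores, batches):
        grouped.setdefault(batch, []).append(score)
    return {batch: sum(vals) / len(vals) for batch, vals in grouped.items()}

# Invented log-expression values: 4 samples x 3 genes, two handling batches.
toy_matrix = [[5.0, 3.0, 8.0], [5.2, 2.9, 8.1], [7.0, 5.1, 10.0], [7.1, 5.0, 10.2]]
toy_batches = ["A", "A", "B", "B"]
pc1 = first_pc_scores(toy_matrix)
print(mean_score_by_batch(pc1, toy_batches))
```

If the batches sit at opposite ends of PC1, as they do in this toy data, the handling batch explains a large share of the variance and a batch effect is likely.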
If a batch effect is observed, we have in general three choices for downstream analysis:
Continue without doing anything. Just be aware that the batch effect will introduce some within-group variance that may make it more difficult to discover genes with smaller changes between the biological groups.
Try to remove the variation caused by the handling batches, then continue as normal with the downstream analysis. This option means that you change the data and may remove variation that should not be removed. On the positive side, it allows you to compare biological groups with more replicates and hence more statistical power.
Analyze the data within each batch, then run a meta-analysis on top to see whether the same genes are identified in each batch. This option means that you do not change the data before the analysis, but it has the drawback of reduced statistical power.
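The third option can be illustrated with a deliberately simplified sketch: compute a log2 fold change per gene within each batch, then keep the genes that pass a threshold with the same sign in every batch. A real meta-analysis would combine p-values or effect sizes with a proper statistical method. The function names, the pseudocount of 1, the threshold and the data are all assumptions made for this example.

```python
import math
from statistics import mean

def log2_fold_changes(expr, groups):
    """expr: {gene: [value per sample]}; groups: 'case' or 'control' per sample."""
    result = {}
    for gene, values in expr.items():
        case = [v for v, g in zip(values, groups) if g == "case"]
        ctrl = [v for v, g in zip(values, groups) if g == "control"]
        # Pseudocount of 1 avoids division by zero for unexpressed genes.
        result[gene] = math.log2((mean(case) + 1) / (mean(ctrl) + 1))
    return result

def consistent_hits(batches, min_abs_lfc=1.0):
    """Genes passing the threshold with the same direction in every batch."""
    per_batch = [log2_fold_changes(expr, groups) for expr, groups in batches]
    if not per_batch:
        return set()
    shared = set.intersection(*(set(lfcs) for lfcs in per_batch))
    hits = set()
    for gene in shared:
        lfcs = [batch_lfcs[gene] for batch_lfcs in per_batch]
        same_sign = all(l > 0 for l in lfcs) or all(l < 0 for l in lfcs)
        if same_sign and all(abs(l) >= min_abs_lfc for l in lfcs):
            hits.add(gene)
    return hits

# Invented expression values for two batches, each analysed on its own.
groups = ["case", "case", "control", "control"]
batch1 = ({"geneA": [100, 120, 20, 25], "geneB": [50, 55, 48, 52]}, groups)
batch2 = ({"geneA": [300, 280, 70, 60], "geneB": [10, 12, 40, 44]}, groups)
meta_hits = consistent_hits([batch1, batch2])
print(meta_hits)
```

Here geneB changes in only one batch and is dropped, which shows both the appeal of this option (no data is altered) and its cost (each batch is tested with fewer replicates).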
ComBat and SVA
The SVA package includes the popular ComBat function for directly modeling batch effects when they are known.
However, there may be a large number of environmental and biological variables that are unmeasured. In these cases, SVA may be more appropriate for removing such artifacts. It is also possible to use SVA and ComBat together to remove both known batch effects and other potential latent sources of variation.
We can see that, after removing the batch effect, samples from different batches cluster together.
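To show what removing a known batch effect does to the data, here is a toy sketch that shifts each batch so that its per-gene mean matches the overall per-gene mean. This is plain mean centering, not ComBat, which uses an empirical Bayes model and also adjusts variances. The function name and the data are invented.

```python
from statistics import mean

def center_by_batch(matrix, batches):
    """Shift each batch so its per-gene mean equals the overall per-gene mean."""
    n_genes = len(matrix[0])
    overall = [mean(row[g] for row in matrix) for g in range(n_genes)]
    corrected = [list(row) for row in matrix]  # leave the input untouched
    for batch in set(batches):
        idx = [i for i, b in enumerate(batches) if b == batch]
        for g in range(n_genes):
            batch_mean = mean(matrix[i][g] for i in idx)
            for i in idx:
                corrected[i][g] = matrix[i][g] - batch_mean + overall[g]
    return corrected

# Invented log-expression values: 4 samples x 3 genes, batch B shifted upwards.
expr_matrix = [[5.0, 3.0, 8.0], [5.2, 2.9, 8.1], [7.0, 5.1, 10.0], [7.1, 5.0, 10.2]]
batch_labels = ["A", "A", "B", "B"]
corrected_matrix = center_by_batch(expr_matrix, batch_labels)
for row in corrected_matrix:
    print([round(v, 3) for v in row])
```

Note that if a biological group is confined to one batch, this kind of correction removes the biological difference as well, which is the risk of changing the data mentioned earlier.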
Note: Don't pass ComBat-corrected values to tools that expect raw counts, such as DESeq. The corrected values are no longer integer counts, so they are only appropriate for tools like limma. Alternatively, you can handle the batch effect within DESeq by adding batch as a term in the experimental design.
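The idea of putting batch into the design rather than altering the data can be illustrated with an ordinary linear model on log-scale expression. This is only a sketch of the principle: DESeq fits a negative binomial model on counts, not least squares, and the helper functions and numbers below are made up.

```python
def design_matrix(batches, conditions, ref_batch, ref_condition):
    """Columns: intercept, batch indicator, condition indicator (two-level factors)."""
    return [[1.0, float(b != ref_batch), float(c != ref_condition)]
            for b, c in zip(batches, conditions)]

def solve_least_squares(X, y):
    """Solve the normal equations (X^T X) beta = X^T y by Gauss-Jordan elimination."""
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        piv = A[col][col]  # zero here means batch and condition are fully confounded
        A[col] = [v / piv for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]

# Invented log expression of one gene: baseline 5, batch B adds 2, treatment adds 1.
batches = ["A", "A", "B", "B"]
conditions = ["control", "treated", "control", "treated"]
y = [5.0, 6.0, 7.0, 8.0]
X = design_matrix(batches, conditions, ref_batch="A", ref_condition="control")
intercept, batch_effect, condition_effect = solve_least_squares(X, y)
print(intercept, batch_effect, condition_effect)
```

The batch column absorbs the batch shift, so the condition coefficient is estimated from the unmodified data. This only works when each batch contains samples from more than one condition.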