A few elusive information in BAM / SAM format
bam file may store more alignment information than you think. As probably the most important format in NGS analysis, almost all sorts of variants are identified from it. It is critical that we get a full image of bam format.
Here I like to mention some of the subtle tags in bam format. They may seems less critical than others. These info have been super useful for certain customized analysis or in times of retrospective check.
M in CIGAR, NM tag and MD tag
CIGAR shows matching information of corresponding alignment. According to SAM specification ( https://samtools.github.io/hts-specs/SAMv1.pdf ), "M" indicates alignment match. However, it doesn't differentiate between "match" and "mismatch". To get the mismatch, you need to go to the end of record and check "NM" and "MD" tag for the number of match and mismatch position respectively. In latest version of SAM format, "X" is introduced to indicate "mismatch" to avoid confusion.
"S" (Soft-clipped) and "H" (hard-clipped) reads in CIGAR
Clipped reads are those only partially aligned to the reference. Unaligned sequence of the read is marked as either "S" or "H". If a read can be partially aligned to multiple places, the best alignment is marked as "S" and considered as primary alignment while the rest are marked as "H" and considered as non-primary alignment.
Also, clipped sequences are treated differently in that:
Clipped sequences are ignored when calling small variation, such as SNP or Indel.
Clipped sequences are ignored when calculating depth of coverage.
Clipped sequences are considered when calling structural variation, such as CNV or fusion. Actually these clipped reads are indicative for structural variation.
MAPQ vs AS
Both MAPQ and AS indicate mapping quality. AS only considers the goodness of corresponding alignment alone while MAPQ judges from all alignments that read aligned to.
You can have high AS and low MAPQ if the read aligns perfectly at multiple positions, and you can have low AS and high MAPQ if the read aligns with mismatches but still the reported position is still much more probable than any other.
MAPQ should be the parameter used during mapping quality based filtering.
SA tag marks chimeric alignment
First we need to be clear of two concepts: linear alignment and chimeric alignment.
Linear Alignment is alignment of a read to a single reference sequence that may include insertions, deletions, skips and clipping, but may not include direction changes.
chimeric alignment An alignment of a read that cannot be represented as a linear alignment. The alignment of a read splited and aligned to different region.
From the definition we can see that chimeric alignments are clipped alignment and mark with either "S" or "H" in CIGAR depending on whether the read can be aligned to multiple places. For these chimeric alignments, an SA tag is appended at the end of the record pointing to other chimeric alignments of that read.
To summarize what have been explained with an example:
E00526:...:1028 83 chr11 11427 60 86S64M NM:i:1 MD:Z:63G0 AS:i:63 XS:i:0 SA:Z:chr11,114270502,+,59S52M39S,60,1
E00526...:1028 323 chr11 11427 60 59H52M39H NM:i:1 MD:Z:45A6 AS:i:47 XS:i:21 SA:Z:chr11,114270672,-,86S64M,60,1
E00526...:1028 163 chr11 11427 60 47S64M39S NM:i:1 MD:Z:53A10 AS:i:59 XS:i:0SA:Z:chr11,114270502,-,98S52M,60,1;
E00526...:1028 435 chr11 11427 60 98H52M
These are paired end reads partially aligned to two different places (the sequence names have been shortened for clarity).
For 1st and 3ed alignment:
FLAG indicates primary alignments ( picard has a FLAG calculator at: https://broadinstitute.github.io/picard/explain-flags.html )
CIGAR marks clipped sequences with "S".
For 1st and 3rd alignment, MD tag marks mismatch of G and A at read position 64 and 54 respectively.
SA tag points to other chimeric alignment, in this case, 2nd and 4th record.
On the other hand, For 2nd and 4th record:
FLAG indicates non-primary alignments.
CIGAR marks clipped sequences with "H".
For 2nd alignment, MD tag marks mismatch of A at read position 56.
SA tag points to other chimeric alignment, in this case, 1st and 3ed record.
different aligner writes bam file differently
It is important to realize that different aligners may implement bam format differently. For example MAPQ, a tag recording the number of alignment for corresponding read:
255 = uniquely mapped reads
0 = multiply mapped reads
It set MAPQ value based on the number of mismatches of various qualities, and the number of multi-mapping reads
It follows sam spec. 255 indicates not available.
STAR / Tophat:
255 = Uniquely mapping for STAR while 50 = Uniquely mapping for TopHat
3 = Maps to 2 locations in the target
2 = Maps to 3 locations in the target
1 = Maps to 4-9 locations in the target
0 = Maps to 10 or more locations in the target
Besides, different aligners may store different content in default. For example STAR / tophat keep only mapped records while bowtie / bwa keep both mapped and unmapped records in bam.
Difference like these are far more than I can listed here. Therefore, my recommendation is reading the documentation of aligner carefully before interpreting its bam output.