Assuming we suspect that a retrovirus or artificial vector integrated into human genome, how can we find out the exact chromosome location of the integration using NGS based method?
First we need have our vector containing sample sequenced. If we have an estimation of where the integration sites are, we can go with targeted sequencing that only focuses on the region of interest. However, In most cases we don't have this knowledge in advance and have to sequence entire human genome (WGS). However, handling WGS data is time consuming specially when we have to align WGS reads to large genome like human. We can get around with it by focusing reads that partially align to vector sequence
1. Align reads to vector sequence
Once we got FASTQ data, align it to vector sequence. For reads that span integration site, one end of their sequence come from vector and another end come from human genome. We can imagine that, when we align them to the vector sequence, they should end up as partial alignment and appears as soft-clipped reads in bam file. To get these partially aligned reads, we would favor local alignment over global alignment, i.e. bwa-aln or bowtie (bowtie allows us to set local alignment mode).
2. Simplify the bam file
Above 99% of the reads originate from human genome and they will end up unaligned. Only very small portion (length of the vector sequence / length of human genome) of the reads are aligned to vector sequence. To make our lift easier, we extract aligned reads and then sort them by reference position.
3. Calculate insert size
At this point, the filtered bam file is all we need to determine the vector integration site. If we open the sorted bam file, we can see partially aligned reads (soft-clipped clustered at both end of the file. To calculate insert size, we can check the alignment of paired reads and get a estimation of insert size (chr_pos2 - chr_pos1 + read length) based on the chromosome position of their alignment. This inset size will be used in next step.
4. Locate integration site spanning reads
Finally, we took each soft-clipped read and BLAT ( http://genome.ucsc.edu/cgi-bin/hgBlat?command=start ) against human genome. The best hit is usually the clipped sequenced aligning to human genome and alignment chromosome position is vector integration site.
However, things could be a little complicated, for example a small portion of the soft-clipped reads may in fact be misalignment. Below are some criteria that may give you clue of which soft-clipped reads are more likely to be real:
Only soft-clipped reads within vector sequence end +/- insert size are valid to use. If vector is 4000bp long and insert size is 400, Only soft-clipped reads within alignment position 0-400 and 3600-4000 are valid to use.
Based on the position of soft-clipped alignment and insert size, we can estimate the status (fully aligned, soft-clipped or unaligned) and the position of the its paired read. If the paired read is out of expectation, we may want to drop this soft-clipped read and move on to the next.
For each integration site, its supporting soft-clipped reads should come from both beginning and ending side of the vector sequence. Unless the sequencing depth is low and one side of the vector integration site happen to have no reads cover, then we can only consider such position as possible integration site.
For each soft-clipped read, the top hit of BLAT result should be partial alignment and the alignment breaking position should be consistent with soft-clipped position.
As the sequencing depth increases, we may have to deal with a large number of soft-clipped reads. It could be a painful task to manually BLAT each of them. An easier way is grep those reads, generate a fasta file and BLAT them in batch.