DNA fragment size, insert size and pre-merging

During standard library preparation, DNA is usually fragmented by physical or enzymatic methods. How does fragment size affect downstream analysis, and how do we choose the optimal fragment size?

Generally speaking, the optimal fragment size depends on the goal of the analysis, the region of interest and the sequencing technique. Let's split fragment size into three scenarios and discuss them separately.
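
The three scenarios can be written down as a small helper. This is only a sketch: the function name and the handling of the exact boundaries (a fragment equal to the read length, or equal to 2*read length, is put in the middle scenario) are assumptions made for illustration.

```python
def classify_fragment(fragment_len: int, read_len: int) -> str:
    """Place a fragment into one of the three scenarios (boundary handling is an assumption)."""
    if fragment_len <= 0 or read_len <= 0:
        raise ValueError("lengths must be positive")
    if fragment_len < read_len:
        # Each read runs past the end of the fragment.
        return "shorter than read length"
    if fragment_len <= 2 * read_len:
        # The two reads overlap or just meet; nothing is left unsequenced.
        return "between read length and 2*read length"
    # An unsequenced region remains between the two reads.
    return "longer than 2*read length"


if __name__ == "__main__":
    for size in (100, 250, 500):
        print(size, "->", classify_fragment(size, read_len=150))
```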

Fragment longer than 2*read length

A longer DNA fragment leaves an insert region (the unsequenced region between the inner ends of the two paired-end reads). The insert size provides additional information for:

  • Structural variation (SV) detection: A longer fragment is more likely to span an SV breakpoint. Comparing the observed insert size with the expected insert size provides additional evidence for calling an SV.

  • RNA-Seq: As above, a longer fragment is more likely to cover an exon junction. It also provides more splicing information.
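
As a rough illustration of how the expected size can be used, the sketch below computes the unsequenced gap for a fragment and flags a read pair whose observed fragment size is far from what the library preparation should produce. The function names, the threshold of 3 standard deviations and the example numbers are all assumptions; real SV callers use more elaborate models.

```python
def inner_gap(fragment_len: int, read_len: int) -> int:
    """Unsequenced bases between the two reads; negative means the reads overlap."""
    return fragment_len - 2 * read_len


def is_discordant(observed_size: float, expected_mean: float,
                  expected_sd: float, k: float = 3.0) -> bool:
    """Flag a pair whose apparent fragment size deviates more than k SDs (k is an assumed cutoff)."""
    return abs(observed_size - expected_mean) > k * expected_sd


if __name__ == "__main__":
    print(inner_gap(400, 150))           # gap left by a 400 bp fragment with 150 bp reads
    print(is_discordant(2000, 400, 50))  # pair mapping much further apart than expected
    print(is_discordant(420, 400, 50))   # pair within the expected range
```

A pair that maps much further apart than expected is consistent with a deletion between the two reads, which is why the expected size is useful evidence.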

Between read length and 2*read length

A shorter DNA fragment size is generally preferable in capture-based sequencing when the captured regions are short.

  • A shorter fragment size reduces the proportion of reads aligned to off-target regions, especially when targeting a large number of regions.

  • A shorter fragment size produces a more even depth of coverage: because the two reads meet or overlap, the middle of the fragment is not covered less than its ends.


For amplicon sequencing, where the DNA fragment size is fixed, we can merge paired-end data into single-end data using tools like PEAR. Note that pre-merging is generally not applicable to probe-based sequencing, where the DNA fragment size varies, because you risk throwing away the read pairs that don't overlap.

When merging overlapping sequences, tools adjust the base quality depending on whether the overlapping bases agree. This relieves the aligner and variant caller of some of the burden of error correction and base quality adjustment.
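
To make the quality adjustment concrete, here is a deliberately simplified sketch. It is not PEAR's actual scoring scheme: the rule (add the qualities when the bases agree, keep the higher-quality base and subtract the qualities when they disagree), the quality cap of 41 and the function names are assumptions. It also assumes that read 2 has already been reverse-complemented and that the overlap length is known.

```python
def consensus_base(b1, q1, b2, q2, cap=41):
    """Combine two overlapping base calls (simplified rule, not PEAR's)."""
    if b1 == b2:
        return b1, min(q1 + q2, cap)   # agreement raises confidence
    if q1 >= q2:
        return b1, q1 - q2             # disagreement lowers confidence
    return b2, q2 - q1


def merge_pair(r1, q1, r2_rc, q2_rc, overlap, cap=41):
    """Merge read 1 with reverse-complemented read 2 over a known overlap length."""
    if not 0 < overlap <= min(len(r1), len(r2_rc)):
        raise ValueError("overlap must be between 1 and the shorter read length")
    # Part of read 1 that is not covered by read 2.
    bases = list(r1[:-overlap])
    quals = list(q1[:-overlap])
    offset = len(r1) - overlap
    for i in range(overlap):
        b, q = consensus_base(r1[offset + i], q1[offset + i], r2_rc[i], q2_rc[i], cap)
        bases.append(b)
        quals.append(q)
    # Part of read 2 that extends beyond read 1.
    bases.extend(r2_rc[overlap:])
    quals.extend(q2_rc[overlap:])
    return "".join(bases), quals


if __name__ == "__main__":
    seq, qual = merge_pair("ACGTAC", [30] * 6, "TAGGGA", [10] * 6, overlap=3)
    print(seq, qual)
```

In the usage example the third overlapping base disagrees (C with quality 30 against G with quality 10), so the merged read keeps C with a reduced quality.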

Pre-merging helps to detect large indels. A normal aligner can only report an insertion in the CIGAR string if the insertion is shorter than the read. So the longer the reads, the longer the insertions that can be called.
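
The relationship between read length and insertion length can be seen directly in the CIGAR string, since inserted bases are part of the read. The sketch below parses a CIGAR string and reports the longest insertion and the read length it implies. The function names and the example CIGAR strings are made up for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def longest_insertion(cigar):
    """Length of the longest I operation, or 0 if there is none."""
    return max((n for n, op in parse_cigar(cigar) if op == "I"), default=0)


def read_length_from_cigar(cigar):
    """Sum of the operations that consume read bases (M, I, S, =, X)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


if __name__ == "__main__":
    for cigar in ("60M30I60M", "100M120I30M"):
        print(cigar, longest_insertion(cigar), read_length_from_cigar(cigar))
```

The second example, a 120 bp insertion, only fits because the read is 250 bp long; it could not be represented in a single 150 bp read.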

Shorter than read length

This is usually an abnormal situation. Sequencing capacity is wasted, because each read runs past the end of the fragment into the adapter, so adapter sequence appears at the 3' end of the reads.
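
A minimal sketch of what this read-through looks like, and of a naive way to remove it, is shown below. The function names, the exact-match rule and the minimum overlap of 3 bases are assumptions, and the adapter string is only an example value; dedicated trimming tools also tolerate mismatches and sequencing errors.

```python
def adapter_bases(fragment_len: int, read_len: int) -> int:
    """Number of read bases that fall beyond the end of the fragment."""
    return max(0, read_len - fragment_len)


def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Cut the read where its 3' end starts to match the adapter exactly (naive rule)."""
    for start in range(len(read)):
        tail = read[start:start + len(adapter)]
        if len(tail) < min_overlap:
            break  # remaining tails are too short to trust
        if adapter.startswith(tail):
            return read[:start]
    return read


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"          # example adapter prefix, an assumption
    read = "ACGTACGT" + "AGATCGGAAG"   # 8 bp fragment followed by adapter read-through
    print(adapter_bases(fragment_len=8, read_len=len(read)))
    print(trim_adapter(read, adapter))
```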