1. Introduction
SOAP3-dp, like its predecessor SOAP3, is GPU-based software for aligning short reads to a reference sequence. It improves on SOAP3 in both speed and sensitivity by exploiting whole-genome indexing and dynamic programming on a GPU. SOAP3 is limited to finding alignments with at most 4 mismatches, while SOAP3-dp can find alignments involving mismatches, INDELs, and small gaps. The number of reads aligned, especially for paired-end data, typically increases by 5 to 10 percent from SOAP3 to SOAP3-dp. More interestingly, SOAP3-dp's alignment time is much shorter than that of SOAP3, because GPU-based dynamic programming, when coupled with indexing, turns out to be much more efficient. For example, when aligning length-100 single-end reads to the human genome, SOAP3 typically requires tens of seconds per million reads, while SOAP3-dp takes only a few seconds.
The alignment program in this package is optimized to process data sets containing many millions of short reads by using a multi-core CPU and a GPU concurrently. The hardware requirements and usage of SOAP3-dp are similar to those of SOAP3 (see next section for details). Roughly speaking, SOAP3-dp first aligns reads using SOAP3 with a small number of mismatches only; reads that remain unaligned are then aligned using index-assisted dynamic programming (semi-global alignment with affine gap penalty). The default setting finds alignments with similarity down to 75%. Users can control the alignment similarity via five dynamic programming parameters in the .INI file (the scores for a match, a mismatch, gap opening, and gap extension, plus the cutoff threshold; defaults: 1, -2, -3, -1, and 30 for read length 100). SOAP3-dp has an option to disable dynamic programming, which makes SOAP3-dp function exactly the same as SOAP3 (i.e., aligning with mismatches only).
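As an illustration of the dynamic programming step, the following is a minimal CPU-side Python sketch of semi-global alignment scoring with an affine gap penalty, using the default scores listed above. It is not the SOAP3-dp GPU implementation. The function names, the convention that a gap of length k scores gap_open + k * gap_extend, and the gapless estimate in max_mismatches are all assumptions made for illustration.

```python
NEG_INF = float("-inf")


def semi_global_score(read, ref, match=1, mismatch=-2,
                      gap_open=-3, gap_extend=-1):
    """Best score for aligning the whole read against any window of ref.

    Semi-global: the read must be aligned end to end, while unaligned
    reference bases before and after the window are free.
    Assumed gap convention: a gap of length k scores
    gap_open + k * gap_extend.
    """
    n, m = len(read), len(ref)
    # Row 0: the alignment may start at any reference position for free.
    prev_h = [0] * (m + 1)
    prev_f = [NEG_INF] * (m + 1)  # states ending with a read base vs. gap
    for i in range(1, n + 1):
        # Column 0: the first i read bases are all aligned against gaps.
        first = gap_open + i * gap_extend
        cur_h = [first] + [NEG_INF] * m
        cur_f = [first] + [NEG_INF] * m
        e = NEG_INF  # state ending with a reference base vs. gap
        for j in range(1, m + 1):
            e = max(cur_h[j - 1] + gap_open + gap_extend, e + gap_extend)
            cur_f[j] = max(prev_h[j] + gap_open + gap_extend,
                           prev_f[j] + gap_extend)
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            cur_h[j] = max(prev_h[j - 1] + sub, e, cur_f[j])
        prev_h, prev_f = cur_h, cur_f
    # The alignment may end at any reference position for free.
    return max(prev_h)


def max_mismatches(read_len, threshold, match=1, mismatch=-2):
    """Largest mismatch count a gapless alignment can have and still
    reach the cutoff threshold (hypothetical helper, gapless case only)."""
    return (read_len * match - threshold) // (match - mismatch)


def passes_cutoff(read, ref, threshold=30):
    """True if the best semi-global score reaches the cutoff threshold."""
    return semi_global_score(read, ref) >= threshold
```

Under these assumptions, a gapless alignment of a length-100 read reaches the default threshold of 30 with up to max_mismatches(100, 30) = 23 mismatches (score 100 - 3 x 23 = 31), i.e. about 77% identity, which is roughly comparable to the 75% similarity figure quoted above.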
SOAP3-dp version 2.3 has the following new features:
- A new option (inside the soap3-dp.ini file) for outputting a BWA-like MAPQ score, which is enabled by default. The scale and range of the scores are similar to those reported by BWA. This is useful for further processing the alignment results with software such as GATK, which was tuned according to the MAPQ scores reported by BWA.
- An alternative version for Amazon EC2 and machines with less than 24GB of memory. Its default setting is tuned to use an SA value sampling frequency of 4 for building the 2BWT index and 6 threads for running SOAP3-dp.
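
As a sketch of how the MAPQ scores might be used downstream, the following Python snippet filters SAM-format text records by MAPQ. It relies only on the SAM format itself (header lines start with "@", and MAPQ is the fifth tab-separated field). The function names and the cutoff of 30 are hypothetical, and reading the score as Phred-scaled, as the SAM specification defines MAPQ, is an assumption about the BWA-like scores rather than something stated in this manual.

```python
def mapq_to_error_prob(mapq):
    """Phred-scaled reading of MAPQ: probability that the mapping
    position is wrong (assumes MAPQ = -10 * log10(P_wrong))."""
    return 10 ** (-mapq / 10)


def filter_sam_by_mapq(lines, min_mapq=30):
    """Yield SAM header lines unchanged, and alignment records whose
    MAPQ (5th tab-separated field) is at least min_mapq."""
    for line in lines:
        if line.startswith("@"):
            yield line  # header line, always kept
            continue
        fields = line.rstrip("\n").split("\t")
        # A SAM alignment record has at least 11 mandatory fields.
        if len(fields) >= 11 and int(fields[4]) >= min_mapq:
            yield line
```

The generator accepts any iterable of lines, so an open SAM file object can be passed in directly and the kept lines written to a new file.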
SOAP3-dp version 2.2 has the following new features:
- Improved performance when aligning longer Illumina reads (i.e., lengths between 150 and 300).
- Increased sensitivity by adding one more step to align each end of the unmapped paired-end reads separately.
SOAP3-dp version 2.1 has the following new features:
- SOAP3-dp now supports up to 65,000 reference sequences (chromosomes).
SOAP3-dp version 2.0 has the following new features:
- A more accurate mapping quality score, which indicates how reliable the resulting alignment is. (More information about the mapping quality score is given in section 2.3 of the manual, and the accuracy of SOAP3-dp can be checked from the chart in the following section.)
- The alignment results can be output in BAM format.
- The GPU card used for running SOAP3-dp can be specified (useful when a machine has more than one GPU card).
- The index in memory (CPU side) can be shared among all running instances of SOAP3-dp.
- The input format for multiple sets of reads has been updated. It gives the user greater flexibility: for example, each set of reads can have a different insert size range and a different location for outputting the alignment results.
The algorithms and software in this package were developed by the algorithms research group of the University of Hong Kong (T.W. Lam, L.K. Lee, C.M. Liu, Ruibang Luo, H.F. Ting, Thomas Wong, Edward Wu, S.M. Yiu & Jianqiao Zhu), in collaboration with BGI (Yingrui Li, Bingqiang Wang & Chang Yu) and Peking University (Ruiqiang Li).
