1. Introduction

SOAP3-dp, like its predecessor SOAP3, is GPU-based software for aligning short reads to a reference sequence. It improves on SOAP3 in both speed and sensitivity by exploiting whole-genome indexing and dynamic programming on a GPU. SOAP3 is limited to finding alignments with at most 4 mismatches, while SOAP3-dp can find alignments involving mismatches, INDELs, and small gaps. The number of reads aligned, especially for paired-end data, typically increases by 5 to 10 percent from SOAP3 to SOAP3-dp. More interestingly, SOAP3-dp's alignment time is much shorter than SOAP3's, because GPU-based dynamic programming, when coupled with indexing, turns out to be much more efficient. For example, when aligning length-100 single-end reads to the human genome, SOAP3 typically requires tens of seconds per million reads, while SOAP3-dp takes only a few seconds.

The alignment program in this package is optimized to process data sets containing many millions of short reads by using a multi-core CPU and a GPU concurrently. The hardware requirements and usage of SOAP3-dp are similar to those of SOAP3 (see next section for details). Roughly speaking, SOAP3-dp first aligns reads using SOAP3 with a small number of mismatches only; reads left unaligned are then aligned using index-assisted dynamic programming (semi-global alignment with affine gap penalty). The default setting finds alignments with similarity down to 75%. Users can control the alignment similarity via five dynamic programming parameters in the .INI file: the scores of a match, a mismatch, a gap opening and a gap extension, and the cutoff threshold (defaults: 1, -2, -3, -1, and 30 for read length 100). SOAP3-dp also has an option to disable dynamic programming, which makes it function exactly like SOAP3 (i.e., aligning with mismatches only).
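To make the five dynamic programming parameters more concrete, the following is a minimal CPU-side sketch of semi-global alignment scoring with affine gap penalties, using the default scores quoted above. It is an illustration only, not SOAP3-dp's GPU implementation: the function name `semiglobal_affine_score` is hypothetical, the index-assisted candidate selection is omitted, and the gap convention (a gap of length k costs the gap opening score plus k times the gap extension score) is an assumption that may differ from the one SOAP3-dp uses.

```python
NEG_INF = float("-inf")

def semiglobal_affine_score(read, ref, match=1, mismatch=-2,
                            gap_open=-3, gap_extend=-1):
    """Best score for aligning ALL of `read` against ANY substring of `ref`.

    Assumption (not confirmed by the SOAP3-dp documentation): a gap of
    length k scores gap_open + k * gap_extend.
    """
    n, m = len(read), len(ref)
    # Three states per cell (Gotoh-style recurrence):
    #   M: last column is a match/mismatch
    #   X: last column is a read base against a gap (insertion)
    #   Y: last column is a reference base against a gap (deletion)
    # Row 0: no read base consumed yet; starting anywhere in ref is free.
    m_prev = [0] * (m + 1)
    x_prev = [NEG_INF] * (m + 1)
    y_prev = [NEG_INF] * (m + 1)
    for i in range(1, n + 1):
        m_cur = [NEG_INF] * (m + 1)
        x_cur = [NEG_INF] * (m + 1)
        y_cur = [NEG_INF] * (m + 1)
        # Column 0: the first i read bases form one insertion.
        x_cur[0] = gap_open + i * gap_extend
        for j in range(1, m + 1):
            s = match if read[i - 1] == ref[j - 1] else mismatch
            m_cur[j] = max(m_prev[j - 1], x_prev[j - 1], y_prev[j - 1]) + s
            x_cur[j] = max(m_prev[j] + gap_open + gap_extend,
                           y_prev[j] + gap_open + gap_extend,
                           x_prev[j] + gap_extend)
            y_cur[j] = max(m_cur[j - 1] + gap_open + gap_extend,
                           x_cur[j - 1] + gap_open + gap_extend,
                           y_cur[j - 1] + gap_extend)
        m_prev, x_prev, y_prev = m_cur, x_cur, y_cur
    # Ending anywhere in ref is free, so take the best cell of the last row.
    return max(max(m_prev), max(x_prev), max(y_prev))

def passes_cutoff(score, threshold=30):
    """True if an alignment score reaches the cutoff threshold."""
    return score >= threshold
```

Under these scores, a gapless alignment of a length-100 read with x mismatches scores 100 - 3x, so a cutoff of 30 admits at most 23 mismatches, which is roughly in line with the stated similarity of about 75%.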

SOAP3-dp version 2.3 has the following new features:

  • A new option (in the soap3-dp.ini file, enabled by default) to output BWA-like MAPQ scores. The scale and range of the scores are similar to those reported by BWA. This is useful for further processing of the alignment results with software such as GATK, which was tuned to the MAPQ scores reported by BWA.
  • An alternative version for Amazon EC2 and machines with less than 24 GB of memory. Its default setting is tuned to use an SA value sampling frequency of 4 when building the 2BWT index and 6 threads when running SOAP3-dp.

SOAP3-dp version 2.2 has the following new features:

  • Improved performance when aligning longer Illumina reads (i.e., lengths between 150 and 300).
  • Increased sensitivity by adding one more step to align each end of the unmapped paired-end reads separately.

SOAP3-dp version 2.1 has the following new features:

  • SOAP3-dp can now support at most 65,000 reference sequences (chromosomes).

SOAP3-dp version 2.0 has the following new features:

  • A more accurate mapping quality score, which indicates how reliable the resulting alignment is. (More information about the mapping quality score is given in section 2.3 of the manual; the accuracy of SOAP3-dp can be checked from the chart in the following section.)
  • The alignment results can be output in BAM format.
  • The GPU card used to run SOAP3-dp can be specified (useful when a machine has more than one GPU card).
  • The index can be shared in memory (CPU side) among all running instances of SOAP3-dp.
  • The input format for multiple sets of reads has been updated. It gives the user greater flexibility: each set of reads can have its own insert size range and its own output location for the alignment results.

The algorithms & software in this package were developed by the algorithms research group of the University of Hong Kong (T.W. Lam, L.K. Lee, C.M. Liu, Ruibang Luo, H.F. Ting, Thomas Wong, Edward Wu, S.M. Yiu & Jianqiao Zhu), in collaboration with BGI (Yingrui Li, Bingqiang Wang & Chang Yu) and Peking University (Ruiqiang Li).

2. Performance & Reference

Below is the performance of SOAP3-dp when running on a GTX 580 (3.2 GB RAM) to align 25 million length-100 reads (or read pairs) to the human genome.

  • Human reference genome: 37.1
  • Read data: Sequence Read Archive (SRA) accession SRR211279; volume 25 M (x 2), read length 100, average insert size 300, SD 30
  • Read file format: FASTQ
  • Output option: all best alignments

Performance comparison of SOAP3-dp with SOAP3, BWA and Bowtie2:


                        25 M single-end reads               25 M paired-end reads
Aligner                 Overall time  % of reads aligned    Overall time  % of read pairs aligned  % of read pairs properly paired
SOAP3-dp (default)      233 sec       96.8%                 492 sec       97.4%                    96.9%
SOAP3 (4 mismatches)    312 sec       92.2%                 774 sec       86.6%                    86.6%
BWA (default)           2,694 sec     93.2%                 7,101 sec     95.3%                    93.6%
Bowtie2 (default)       2,777 sec     96.7%                 5,057 sec     96.5%                    94.6%

The overall time reported above comprises three components: index loading, read loading and alignment. The index loading time is the same for SOAP3 and SOAP3-dp (~134 seconds).
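As a rough reading aid, the short sketch below derives speedup ratios and a per-million-read time from the overall times in the table. The names (`speedup_over`, `sec_per_million_excl_index`) are illustrative only. Subtracting the approximate 134-second index loading time still leaves read loading included in the result, and that figure is only stated for SOAP3 and SOAP3-dp.

```python
# Overall times in seconds, copied from the comparison table above.
OVERALL_SINGLE_END = {"SOAP3-dp": 233, "SOAP3": 312, "BWA": 2694, "Bowtie2": 2777}
OVERALL_PAIRED_END = {"SOAP3-dp": 492, "SOAP3": 774, "BWA": 7101, "Bowtie2": 5057}

INDEX_LOAD_SEC = 134   # approximate; stated for SOAP3 and SOAP3-dp only
READS_MILLIONS = 25    # 25 M reads (or read pairs) in the benchmark

def speedup_over(table, baseline="SOAP3-dp"):
    """Ratio of each tool's overall time to the baseline's overall time."""
    base = table[baseline]
    return {tool: round(seconds / base, 1) for tool, seconds in table.items()}

def sec_per_million_excl_index(overall_sec, index_load=INDEX_LOAD_SEC,
                               reads_millions=READS_MILLIONS):
    """Seconds per million reads after removing index loading.

    Read loading time is still included, so this is an upper bound on
    the pure alignment time per million reads.
    """
    return (overall_sec - index_load) / reads_millions

print(speedup_over(OVERALL_SINGLE_END))
print(speedup_over(OVERALL_PAIRED_END))
print(sec_per_million_excl_index(OVERALL_SINGLE_END["SOAP3-dp"]))
```

On these numbers, SOAP3-dp's overall time is roughly 11 to 12 times shorter than that of BWA and Bowtie2 for single-end reads, and its single-end run spends just under 4 seconds per million reads once index loading is excluded.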


Accuracy comparison of SOAP3-dp with SOAP3, BWA and Bowtie2:


To assess the accuracy of SOAP3-dp, we simulated 5 million pairs of reads of length 100 bp from the human genome.

  • Simulation program: Mason
  • Insert size: average 350 with standard deviation 50.
  • Two haplotypes. For each haplotype: SNP and indel rate: 0.1%; size of indel ~ [1,6].
  • Error model: 0.4% mutation; 0.1% insertion; 0.1% deletion.

We calculated the cumulative number of correct and incorrect alignments from high to low mapping quality, and considered an alignment correct if its leftmost position was within 50 nt of the position assigned by the simulator on the same strand. The resulting statistics are as follows:




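The evaluation procedure described above can be sketched as follows. This is an illustrative reconstruction, not the script used to produce the reported statistics: the `Aln` record and both function names are hypothetical, "within 50 nt" is assumed to be inclusive, and the check that both positions lie on the same reference sequence is an added assumption.

```python
from collections import namedtuple

# Hypothetical minimal alignment record; mapq is unused for simulator truth.
Aln = namedtuple("Aln", "chrom pos strand mapq", defaults=(None,))

def is_correct(reported, truth, tolerance=50):
    """Correct if the leftmost position is within `tolerance` nt of the
    simulated position on the same strand (and, by assumption, the same
    reference sequence)."""
    return (reported.chrom == truth.chrom
            and reported.strand == truth.strand
            and abs(reported.pos - truth.pos) <= tolerance)

def cumulative_by_mapq(pairs):
    """Accumulate correct/incorrect counts from high to low mapping quality.

    `pairs` is an iterable of (reported, truth) records. Returns a list of
    (mapq, cumulative_correct, cumulative_incorrect) tuples.
    """
    per_mapq = {}
    for reported, truth in pairs:
        counts = per_mapq.setdefault(reported.mapq, [0, 0])
        counts[0 if is_correct(reported, truth) else 1] += 1
    rows, correct, incorrect = [], 0, 0
    for q in sorted(per_mapq, reverse=True):
        correct += per_mapq[q][0]
        incorrect += per_mapq[q][1]
        rows.append((q, correct, incorrect))
    return rows

example = [
    (Aln("chr1", 1000, "+", 40), Aln("chr1", 1010, "+")),  # 10 nt off: correct
    (Aln("chr1", 5000, "-", 40), Aln("chr1", 5100, "-")),  # 100 nt off: incorrect
    (Aln("chr2", 200, "+", 10), Aln("chr2", 200, "-")),    # wrong strand: incorrect
    (Aln("chr2", 300, "+", 0), Aln("chr2", 350, "+")),     # exactly 50 nt: correct
]
print(cumulative_by_mapq(example))
```

Plotting the cumulative incorrect count against the cumulative correct count for each aligner gives one curve per tool, which is a common way to compare how well mapping quality separates right from wrong alignments.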
Contact:

HKU: T.W. Lam, Thomas Wong; BGI: Yingrui Li; Peking U: Ruiqiang Li

3. Copyright

Copyright © 2012, Department of Computer Science, The University of Hong Kong

SOAP3-dp is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

SOAP3-dp is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

Remark: SAM-tools v0.1.18 is included in the SOAP3-dp package to facilitate writing alignment results in SAM format. We have slightly modified the original SAM-tools code so that it compiles under g++. Please see http://samtools.sourceforge.net/ for details of this package.