‘Stampy’ algorithm software
A statistical algorithm for sensitive and fast mapping of Illumina sequence reads.
Existing read mapping software
High-volume sequencing of DNA and RNA is now within reach of any research lab and is quickly becoming established as a key research tool.
In many workflows, each of the short sequences (reads) resulting from a sequencing run is first mapped (aligned) to a reference sequence to infer the genomic location from which the read derived. This is a challenging task because of the high data volumes and the often large genomes involved.
Existing read mapping software excels either in speed (e.g. BWA [2], Bowtie [3], Eland [4]) or in sensitivity (e.g. Novoalign [5]), but not in both. In addition, performance often deteriorates in the presence of sequence variation, particularly for short insertions and deletions (indels).
Speed and sensitivity of new software
Oxford researchers have developed a read mapper, Stampy, which uses a hybrid mapping algorithm and a detailed statistical model to achieve both speed and sensitivity, particularly when reads include sequence variation. The result is a higher useable sequence yield and improved accuracy compared to existing software.
With the ever-increasing throughput of sequencing machines, time- and memory-efficient algorithms need no justification. However, sequence error rates are low, so why is sensitivity important?
One answer is that reduced sensitivity in the presence of variation leads to unwanted mapping biases, particularly for reads from regions of higher divergence and for reads containing indels.
Similarly, improved sensitivity may enable analyses that are otherwise impossible. For example, it can allow the analysis of samples that are divergent from the available reference genomes, or help to identify unknown splice donor and acceptor sites in mRNA-seq experiments.
Finally, in any experiment a fraction of reads will exhibit elevated error rates. Being able to reliably include data from these reads improves the power of downstream analyses and reduces the total cost of sequencing.
Hashing
To achieve good sensitivity, Stampy uses a hash table that records the locations of selected 15-mers in the reference genome. The hash table uses a novel data structure, which gives faster searches than standard implementations and makes efficient use of the available memory.
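The general idea of a k-mer hash index can be illustrated with a minimal Python sketch. This is not Stampy's data structure or its selection scheme: it uses an ordinary dictionary, indexes every k-mer rather than a selected subset, and the function names build_kmer_index and candidate_positions are hypothetical. The toy example uses k=5 on a short made-up reference so that it fits on screen, whereas the text above refers to 15-mers.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference, k=15):
    """Map every k-mer in the reference to the 0-based positions where it starts.

    Assumption for illustration: all k-mers are indexed in a plain dict.
    """
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read, index, k=15):
    """Return (read_start, votes) pairs, best supported first.

    Each k-mer of the read that is found in the index votes for the
    reference position at which the read would have to start.
    """
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            votes[pos - offset] += 1
    return votes.most_common()


# Toy data: a made-up reference and a read taken from position 10
# with a single substitution introduced at read offset 7.
reference = "ACGTACGGTCAGTTCAGGCATTACGCAGTCAAGT"
read = list(reference[10:25])
read[7] = "T"  # the original base at this offset is "G"
read = "".join(read)

index = build_kmer_index(reference, k=5)
for start, count in candidate_positions(read, index, k=5):
    print(start, count)
```

K-mers that overlap the substitution no longer vote for the true location, but the unaffected k-mers on either side still do. A full mapper would then align the read at the best supported candidates and score them, a step this sketch leaves out.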