Papapetrou, Panagiotis and Athitsos, V. and Kollios, G. and Gunopulos, D. (2009) Reference-based alignment in large sequence databases. Proceedings of the VLDB Endowment 2 (1), pp. 205-216. ISSN 2150-8097.
Abstract
This paper introduces a novel method, called Reference-Based String Alignment (RBSA), that speeds up retrieval of optimal subsequence matches in large databases of sequences under the edit distance and the Smith-Waterman similarity measure. RBSA operates under the assumption that the optimal match deviates from the query by a relatively small amount, one that does not exceed a prespecified fraction of the query length. RBSA has an exact version that guarantees no false dismissals and can handle large queries efficiently. An approximate version of RBSA is also described, which achieves significant additional improvements over the exact version with negligible losses in retrieval accuracy. RBSA filters candidate matches using precomputed alignment scores between the database sequence and a set of fixed-length reference sequences. At query time, the query sequence is partitioned into segments of length equal to that of the reference sequences. For each of those segments, the alignment scores between the segment and the reference sequences are used to efficiently identify a relatively small number of candidate subsequence matches. An alphabet collapsing technique is employed to improve the pruning power of the filter step. In our experimental evaluation, RBSA significantly outperforms state-of-the-art biological sequence alignment methods, such as q-grams, BLAST, and BWT.
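The filter-and-verify idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes plain edit distance, considers only database windows of the same length as the reference sequences, and prunes with the triangle-inequality lower bound |d(segment, r) - d(window, r)| <= d(segment, window), which holds because edit distance is a metric. The function names (`build_index`, `filter_candidates`, `search`) and the parameter `max_dist` are hypothetical, and alphabet collapsing is not shown.

```python
from __future__ import annotations


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def build_index(database: str, references: list[str], seg_len: int) -> list[list[int]]:
    """Offline step (assumed layout): index[p][k] is the edit distance between
    the database window starting at position p and references[k]."""
    return [
        [edit_distance(database[p:p + seg_len], r) for r in references]
        for p in range(len(database) - seg_len + 1)
    ]


def filter_candidates(segment: str, references: list[str],
                      index: list[list[int]], max_dist: int) -> list[int]:
    """Keep window positions whose lower bound on the distance to `segment`
    does not exceed max_dist. No window within max_dist is ever dropped."""
    seg_scores = [edit_distance(segment, r) for r in references]
    candidates = []
    for p, win_scores in enumerate(index):
        lower_bound = max(abs(s - w) for s, w in zip(seg_scores, win_scores))
        if lower_bound <= max_dist:
            candidates.append(p)
    return candidates


def search(query: str, database: str, references: list[str],
           seg_len: int, max_dist: int) -> list[tuple[int, int]]:
    """Split the query into segments of the reference length, filter with the
    index, then verify each surviving candidate with the true distance.
    Returns sorted (query_offset, database_position) pairs."""
    index = build_index(database, references, seg_len)
    hits = set()
    for start in range(0, len(query) - seg_len + 1, seg_len):
        segment = query[start:start + seg_len]
        for p in filter_candidates(segment, references, index, max_dist):
            if edit_distance(segment, database[p:p + seg_len]) <= max_dist:
                hits.add((start, p))
    return sorted(hits)
```

In this sketch the index is built once per database, so the only distance computations at query time are between each query segment and the references, plus verification of the candidates that survive the filter. How tightly the filter prunes depends on the choice of reference sequences, which the sketch leaves to the caller.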
Metadata
| Item Type: | Article |
|---|---|
| School: | Birkbeck Faculties and Schools > Faculty of Science > School of Computing and Mathematical Sciences |
| Depositing User: | Administrator |
| Date Deposited: | 11 Jun 2013 10:47 |
| Last Modified: | 09 Aug 2023 12:33 |
| URI: | https://eprints.bbk.ac.uk/id/eprint/7440 |