Bathie, G. and Charalampopoulos, Panagiotis and Starikovskaya, T. (2024) Longest common extensions with wildcards: trade-off and applications. Leibniz International Proceedings in Informatics (LIPIcs) 308, 19:1-19:17. ISSN 1868-8969.
Text
esa_final.pdf - Published Version of Record. Available under License Creative Commons Attribution.
Abstract
We study the Longest Common Extension (LCE) problem in a string containing wildcards. Wildcards (also called "don't cares" or "holes") are special characters that match any other character in the alphabet, similar to the character "?" in Unix commands or "." in regular expression engines. We consider the problem parametrized by G, the number of maximal contiguous groups of wildcards in the input string. Our main contribution is a simple data structure for this problem that can be built in O(n (G/t) log n) time, occupies O(nG/t) space, and answers queries in O(t) time, for any t ∈ [1 .. G]. Up to the O(log n) factor, this interpolates smoothly between the data structure of Crochemore et al. [JDA 2015], which has O(nG) preprocessing time and space, and O(1) query time, and a simple solution based on the "kangaroo jumping" technique [Landau and Vishkin, STOC 1986], which has O(n) preprocessing time and space, and O(G) query time. By establishing a connection between this problem and Boolean matrix multiplication, we show that our solution is optimal up to subpolynomial factors when G = Ω(n) under a widely believed hypothesis. In addition, we develop a new simple, deterministic and combinatorial algorithm for sparse Boolean matrix multiplication.
Finally, we show that our data structure can be used to obtain efficient algorithms for approximate pattern matching and structural analysis of strings with wildcards. First, we consider the problem of pattern matching with k errors (i.e., edit operations) in the setting where both the pattern and the text may contain wildcards. The "kangaroo jumping" technique can be adapted to yield an algorithm for this problem with runtime O(n(k+G)), where G is the total number of maximal contiguous groups of wildcards in the text and the pattern and n is the length of the text. By combining "kangaroo jumping" with a tailor-made data structure for LCE queries, Akutsu [IPL 1995] devised an O(n√(km) polylog m)-time algorithm. We improve on both algorithms when k ≪ G ≪ m by giving an algorithm with runtime O(n(k + √(Gk log n))). Secondly, we give O(n√G log n)-time and O(n)-space algorithms for computing the prefix array, as well as the quantum/deterministic border and period arrays of a string with wildcards. This is an improvement over the O(n√(n log n))-time algorithms of Iliopoulos and Radoszewski [CPM 2016] when G = O(n / log n).
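To make the problem statement in the abstract concrete, here is a minimal Python sketch of what an LCE query with wildcards computes and how the parameter G is counted. It is a brute-force reference only, not the data structure from the paper: a query here takes time linear in the answer, with no preprocessing. The wildcard symbol "?" and the function names `count_wildcard_groups` and `lce_wildcards` are assumptions chosen for illustration.

```python
WILDCARD = "?"  # assumed wildcard symbol; it matches any character


def count_wildcard_groups(s: str) -> int:
    """Return G, the number of maximal contiguous runs of wildcards in s."""
    groups = 0
    prev_wild = False
    for ch in s:
        is_wild = ch == WILDCARD
        if is_wild and not prev_wild:
            groups += 1  # a new run of wildcards starts here
        prev_wild = is_wild
    return groups


def lce_wildcards(s: str, i: int, j: int) -> int:
    """Length of the longest common extension of the suffixes s[i:] and s[j:].

    Two characters match if they are equal or if either is a wildcard.
    Brute-force scan, used here only to pin down the definition.
    """
    n = len(s)
    length = 0
    while i + length < n and j + length < n:
        a, b = s[i + length], s[j + length]
        if a != b and a != WILDCARD and b != WILDCARD:
            break  # genuine mismatch between two non-wildcard characters
        length += 1
    return length
```

For a string such as "ab?ab??b", `count_wildcard_groups` counts two groups ("?" and "??"), and `lce_wildcards` compares the two suffixes position by position, treating every wildcard as a match.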
Metadata
| Field | Value |
|---|---|
| Item Type | Article |
| Additional Information | 32nd Annual European Symposium on Algorithms (ESA 2024) |
| School | Birkbeck Faculties and Schools > Faculty of Science > School of Computing and Mathematical Sciences |
| Depositing User | Panagiotis Charalampopoulos |
| Date Deposited | 17 Oct 2024 13:03 |
| Last Modified | 17 Oct 2024 15:59 |
| URI | https://eprints.bbk.ac.uk/id/eprint/54397 |