The Complexity of the Co-Occurrence Problem
Let S be a string of length n over an alphabet Σ and let Q be a subset of Σ of size q ≥ 2. The 'co-occurrence problem' is to construct a compact data structure that supports the following query: given an integer w, return the number of length-w substrings of S that contain each character of Q at least once. This is a natural string problem with applications to, e.g., data mining, natural language processing, and DNA analysis. The state of the art is an O(√(nq)) space data structure that, with some minor additions, supports queries in O(log log n) time [CPM 2021]. Our contributions are as follows. Firstly, we analyze the problem in terms of a new, natural parameter d, giving a simple data structure that uses O(d) space and supports queries in O(log log n) time. The preprocessing algorithm does a single pass over S, runs in expected O(n) time, and uses O(d) space in addition to the input. Furthermore, we show that O(d) space is optimal and that O(log log n)-time queries are optimal given optimal space. Secondly, we bound d = O(√(nq)), giving clean bounds in terms of n and q that match the state of the art. Furthermore, we prove that Ω(√(nq)) bits of space is necessary in the worst case, meaning that the O(√(nq)) upper bound is tight to within polylogarithmic factors. All of our results are based on simple and intuitive combinatorial ideas that simplify the state of the art.
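To make the query concrete, the following is a minimal sketch of a naive baseline that answers a single query by sliding a window of length w over S. It is not the O(d)-space data structure described above: it stores nothing between queries and takes O(n) time per query. The function name `cooccurrence_count` and its parameters are hypothetical, chosen only for illustration.

```python
from collections import Counter


def cooccurrence_count(s: str, q: set, w: int) -> int:
    """Count the length-w substrings of s that contain every character of q.

    Naive sliding-window baseline (O(n) time per query); an illustration of
    the query semantics only, not the compact data structure from the paper.
    """
    n = len(s)
    if w <= 0 or w > n:
        return 0

    counts = Counter()  # occurrences of each character of q in the current window
    covered = 0         # number of distinct characters of q present in the window
    total = 0

    for i, ch in enumerate(s):
        # Extend the window to include position i.
        if ch in q:
            counts[ch] += 1
            if counts[ch] == 1:
                covered += 1

        # Drop position i - w so the window stays at length w.
        if i >= w:
            old = s[i - w]
            if old in q:
                counts[old] -= 1
                if counts[old] == 0:
                    covered -= 1

        # The window s[i - w + 1 .. i] is complete once i >= w - 1.
        if i >= w - 1 and covered == len(q):
            total += 1

    return total
```

For example, with S = "abcab" and Q = {a, b}, the length-2 substrings are "ab", "bc", "ca", "ab", of which two contain both characters, so the query w = 2 returns 2.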