The Complexity of the Co-Occurrence Problem

by Philip Bille, et al.

Let S be a string of length n over an alphabet Σ and let Q be a subset of Σ of size q ≥ 2. The 'co-occurrence problem' is to construct a compact data structure that supports the following query: given an integer w return the number of length-w substrings of S that contain each character of Q at least once. This is a natural string problem with applications to, e.g., data mining, natural language processing, and DNA analysis. The state of the art is an O(√(nq)) space data structure that, with some minor additions, supports queries in O(loglog n) time [CPM 2021]. Our contributions are as follows. Firstly, we analyze the problem in terms of a new, natural parameter d, giving a simple data structure that uses O(d) space and supports queries in O(loglog n) time. The preprocessing algorithm does a single pass over S, runs in expected O(n) time, and uses O(d) space in addition to the input. Furthermore, we show that O(d) space is optimal and that O(loglog n)-time queries are optimal given optimal space. Secondly, we bound d = O(√(nq)), giving clean bounds in terms of n and q that match the state of the art. Furthermore, we prove that Ω(√(nq)) bits of space is necessary in the worst case, meaning that the O(√(nq)) upper bound is tight to within polylogarithmic factors. All of our results are based on simple and intuitive combinatorial ideas that simplify the state of the art.
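To make the query itself concrete, the following is a minimal baseline sketch in Python. It is not the O(d)-space data structure described above, and it does not achieve O(loglog n) query time. It only illustrates what a co-occurrence query counts. The function names `min_cover_lengths` and `co_occurrence_count` are hypothetical, chosen for this illustration. The sketch uses a standard sliding window to find, for each start position, the shortest substring that contains every character of Q, and then answers a query by a linear scan.

```python
def min_cover_lengths(s, q):
    """For each start i, return the length of the shortest substring of s
    beginning at i that contains every character of q, or None if no such
    substring exists. Runs in O(len(s)) time with a sliding window."""
    n = len(s)
    counts = {c: 0 for c in set(q)}   # occurrences of each Q-character in the window
    missing = len(counts)             # number of Q-characters absent from the window
    result = [None] * n
    right = 0                         # the current window is s[left:right]
    for left in range(n):
        # Grow the window to the right until it covers Q or the string ends.
        while right < n and missing > 0:
            c = s[right]
            if c in counts:
                if counts[c] == 0:
                    missing -= 1
                counts[c] += 1
            right += 1
        if missing == 0:
            result[left] = right - left
        # Drop s[left] before moving on to the next start position.
        c = s[left]
        if c in counts:
            counts[c] -= 1
            if counts[c] == 0:
                missing += 1
    return result


def co_occurrence_count(s, q, w):
    """Number of length-w substrings of s that contain each character of q
    at least once. A naive O(len(s)) scan per query, for illustration only."""
    lengths = min_cover_lengths(s, q)
    return sum(
        1
        for i in range(len(s) - w + 1)
        if lengths[i] is not None and lengths[i] <= w
    )
```

For example, with S = "abcab" and Q = {a, b}, the length-2 substrings are "ab", "bc", "ca", "ab", so `co_occurrence_count("abcab", "ab", 2)` returns 2. A length-w substring starting at position i contains all of Q exactly when the shortest covering substring starting at i has length at most w, which is the condition the scan checks.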



Run Compressed Rank/Select for Large Alphabets

Given a string of length n that is composed of r runs of letters from th...

Shortest Unique Palindromic Substring Queries in Semi-dynamic Settings

A palindromic substring T[i.. j] of a string T is said to be a shortest ...

Optimal Heaviest Induced Ancestors

We revisit the Heaviest Induced Ancestors (HIA) problem that was introdu...

GaKCo: a Fast GApped k-mer string Kernel using COunting

String Kernel (SK) techniques, especially those using gapped k-mers as f...

Space-Efficient Data Structures for Lattices

A lattice is a partially-ordered set in which every pair of elements has...

Acceleration of FM-index Queries Through Prefix-free Parsing

FM-indexes are a crucial data structure in DNA alignment, for example, b...

Simulating the DNA String Graph in Succinct Space

Converting a set of sequencing reads into a lossless compact data struct...
