The Complexity of the Co-Occurrence Problem

06/21/2022
by   Philip Bille, et al.

Let S be a string of length n over an alphabet Σ and let Q be a subset of Σ of size q ≥ 2. The 'co-occurrence problem' is to construct a compact data structure that supports the following query: given an integer w, return the number of length-w substrings of S that contain each character of Q at least once. This is a natural string problem with applications to, e.g., data mining, natural language processing, and DNA analysis. The state of the art is an O(√(nq)) space data structure that, with some minor additions, supports queries in O(log log n) time [CPM 2021]. Our contributions are as follows. Firstly, we analyze the problem in terms of a new, natural parameter d, giving a simple data structure that uses O(d) space and supports queries in O(log log n) time. The preprocessing algorithm does a single pass over S, runs in expected O(n) time, and uses O(d) space in addition to the input. Furthermore, we show that O(d) space is optimal and that O(log log n)-time queries are optimal given optimal space. Secondly, we bound d = O(√(nq)), giving clean bounds in terms of n and q that match the state of the art. Furthermore, we prove that Ω(√(nq)) bits of space is necessary in the worst case, meaning that the O(√(nq)) upper bound is tight to within polylogarithmic factors. All of our results are based on simple and intuitive combinatorial ideas that simplify the state of the art.
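
To make the query itself concrete, the following is a minimal brute-force sketch in Python. It is not the data structure described in the abstract: it answers each query by sliding a length-w window over S in O(n) time, with no preprocessing and no O(d)-space index. The function name `co_occurrences` and its parameters are hypothetical, chosen here for illustration only.

```python
def co_occurrences(s, q, w):
    """Count length-w substrings of s that contain every character of q.

    Naive baseline (an assumption for illustration, not the paper's method):
    a fixed-size sliding window that costs O(len(s)) time per query.
    """
    if w <= 0 or w > len(s):
        return 0

    # counts[c] = occurrences of c (for c in q) inside the current window
    counts = dict.fromkeys(set(q), 0)
    needed = len(counts)   # number of distinct characters that must appear
    covered = 0            # how many of them appear in the current window
    total = 0

    for i, ch in enumerate(s):
        # Character entering the window on the right.
        if ch in counts:
            counts[ch] += 1
            if counts[ch] == 1:
                covered += 1

        # Character leaving the window on the left, once the window is full.
        if i >= w:
            out = s[i - w]
            if out in counts:
                counts[out] -= 1
                if counts[out] == 0:
                    covered -= 1

        # The window s[i-w+1 .. i] is complete from index w-1 onwards.
        if i >= w - 1 and covered == needed:
            total += 1

    return total


if __name__ == "__main__":
    # S = "abcab", Q = {a, b}: the length-2 windows "ab" (twice) qualify.
    print(co_occurrences("abcab", {"a", "b"}, 2))
```

A baseline like this can serve as a reference when checking a compact structure: both should return the same count for every w from 1 to n.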

research
11/08/2017

Run Compressed Rank/Select for Large Alphabets

Given a string of length n that is composed of r runs of letters from th...
research
04/15/2022

Shortest Unique Palindromic Substring Queries in Semi-dynamic Settings

A palindromic substring T[i.. j] of a string T is said to be a shortest ...
research
02/02/2023

Optimal Heaviest Induced Ancestors

We revisit the Heaviest Induced Ancestors (HIA) problem that was introdu...
research
04/24/2017

GaKCo: a Fast GApped k-mer string Kernel using COunting

String Kernel (SK) techniques, especially those using gapped k-mers as f...
research
02/13/2019

Space-Efficient Data Structures for Lattices

A lattice is a partially-ordered set in which every pair of elements has...
research
05/10/2023

Acceleration of FM-index Queries Through Prefix-free Parsing

FM-indexes are a crucial data structure in DNA alignment, for example, b...
research
01/29/2019

Simulating the DNA String Graph in Succinct Space

Converting a set of sequencing reads into a lossless compact data struct...
