Constant delay algorithms for regular document spanners

03/14/2018
by   Fernando Florenzano, et al.

Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages to locate the data that a user wants to extract from a text document, and then store this data in variables. Since document spanners can easily generate large outputs, it is important to have good evaluation algorithms that can generate the extracted data in quick succession and with relatively little precomputation time. Towards this goal, we present a practical evaluation algorithm that allows constant-delay enumeration of a spanner's output after a precomputation phase that is linear in the document. While the algorithm assumes that the spanner is specified in a syntactic variant of variable set automata, we also study how it can be applied when the spanner is specified by general variable set automata, regex formulas, or spanner algebras. Finally, we study the related problem of counting the number of outputs of a document spanner, providing a fine-grained analysis of the classes of document spanners that support efficient enumeration of their results.
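To make the notion of a spanner and the size of its output concrete, here is a minimal brute-force sketch in Python. It is not the algorithm of the paper: it tries every span of the document and therefore has neither linear precomputation nor constant delay. The function name `naive_spanner`, the single capture variable `x`, and the half-open span convention are assumptions made only for this illustration.

```python
import re
from typing import Dict, Iterator, Tuple

Span = Tuple[int, int]  # half-open [start, end) character offsets (assumed convention)


def naive_spanner(document: str, inner_pattern: str) -> Iterator[Dict[str, Span]]:
    """Yield every mapping {"x": span} where document[start:end] fully
    matches inner_pattern.

    Brute force over all non-empty spans; for illustration only.
    """
    compiled = re.compile(inner_pattern)
    n = len(document)
    for start in range(n + 1):
        for end in range(start + 1, n + 1):
            # Test the candidate span in isolation against the pattern.
            if compiled.fullmatch(document[start:end]):
                yield {"x": (start, end)}


if __name__ == "__main__":
    for mapping in naive_spanner("baab", "a+"):
        print(mapping)
```

On a document consisting of n copies of `a`, the pattern `a+` matches every non-empty span, so the output has n(n+1)/2 mappings. This quadratic growth is the kind of large output that motivates enumerating results one at a time with a small delay between them, rather than materializing them all first.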

research
04/12/2023

Skyline Operators for Document Spanners

When extracting a relation of spans (intervals) from a text document, a ...
research
03/15/2020

Grammars for Document Spanners

A new grammar-based language for defining information-extractors from te...
research
10/12/2020

Constant-delay enumeration algorithms for document spanners over nested documents

Some of the most relevant document schemas used online, such as XML and ...
research
08/30/2019

Annotated Document Spanners

We introduce annotated document spanners, which are document spanners th...
research
09/15/2017

Foundations of Complex Event Processing

Complex Event Processing (CEP) has emerged as the unifying field for tec...
research
08/30/2019

Weight Annotation in Information Extraction

The framework of document spanners abstracts the task of information ext...
