Data Extraction via Semantic Regular Expression Synthesis

05/17/2023
by Qiaochu Chen, et al.

Many data extraction tasks of practical relevance require not only syntactic pattern matching but also semantic reasoning about the content of the underlying text. While regular expressions are very well suited for tasks that require only syntactic pattern matching, they fall short for data extraction tasks that involve both a syntactic and semantic component. To address this issue, we introduce semantic regexes, a generalization of regular expressions that facilitates combined syntactic and semantic reasoning about textual data. We also propose a novel learning algorithm that can synthesize semantic regexes from a small number of positive and negative examples. Our proposed learning algorithm uses a combination of neural sketch generation and compositional type-directed synthesis for fast and effective generalization from a small number of examples. We have implemented these ideas in a new tool called Smore and evaluated it on representative data extraction tasks involving several textual datasets. Our evaluation shows that semantic regexes can better support complex data extraction tasks than standard regular expressions and that our learning algorithm significantly outperforms existing tools, including state-of-the-art neural networks and program synthesis tools.
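To make the idea of combined syntactic and semantic matching concrete, here is a minimal sketch in Python. It is not Smore's DSL or API, and it does not show the synthesis algorithm. The helper `semantic_match`, the predicate `is_real_date`, and the sample pattern are all hypothetical names introduced for illustration. The sketch assumes a semantic check can be modeled as a Boolean predicate applied to a named capture group; a hand-written calendar check stands in for whatever semantic reasoning a real system would use.

```python
import re
from datetime import date
from typing import Callable, Dict, Optional


def semantic_match(
    pattern: str,
    predicates: Dict[str, Callable[[str], bool]],
    text: str,
) -> Optional[Dict[str, str]]:
    """Return the named groups if `text` matches `pattern` syntactically
    AND every named group passes its semantic predicate; otherwise None."""
    m = re.fullmatch(pattern, text)
    if m is None:
        return None  # syntactic mismatch
    groups = m.groupdict()
    for name, predicate in predicates.items():
        if not predicate(groups[name]):
            return None  # syntactically fine, semantically rejected
    return groups


def is_real_date(s: str) -> bool:
    """Hypothetical semantic predicate: is `s` an actual calendar date?"""
    try:
        date.fromisoformat(s)
        return True
    except ValueError:
        return False


# The regex alone accepts any digits in YYYY-MM-DD shape, including
# impossible dates; the predicate supplies the semantic constraint.
PATTERN = r"due: (?P<when>\d{4}-\d{2}-\d{2})"

print(semantic_match(PATTERN, {"when": is_real_date}, "due: 2023-05-17"))
print(semantic_match(PATTERN, {"when": is_real_date}, "due: 2023-02-30"))
```

The second call shows the gap the abstract describes: `2023-02-30` has the right shape for a plain regular expression but is rejected once the semantic predicate is applied.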


