DocTr: Document Transformer for Structured Information Extraction in Documents

07/16/2023
by   Haofu Liao, et al.
0

We present a new formulation for structured information extraction (SIE) from visually rich documents. It aims to address the limitations of existing IOB tagging or graph-based formulations, which are either overly reliant on the correct ordering of input text or struggle with decoding a complex graph. Instead, motivated by anchor-based object detectors in vision, we represent an entity as an anchor word and a bounding box, and represent entity linking as the association between anchor words. This is more robust to text ordering, and maintains a compact graph for entity linking. The formulation motivates us to introduce 1) a DOCument TRansformer (DocTr) that aims at detecting and associating entity bounding boxes in visually rich documents, and 2) a simple pre-training strategy that helps learn entity detection in the context of language. Evaluations on three SIE benchmarks show the effectiveness of the proposed formulation, and the overall approach outperforms existing solutions.

READ FULL TEXT

page 7

page 8

page 12

page 13

page 14

page 15

page 16

research
06/02/2021

A Span Extraction Approach for Information Extraction on Visually-Rich Documents

Information extraction (IE) from visually-rich documents (VRDs) has achi...
research
08/06/2021

StrucTexT: Structured Text Understanding with Multi-Modal Transformers

Structured text understanding on Visually Rich Documents (VRDs) is a cru...
research
03/27/2019

Graph Convolution for Multimodal Information Extraction from Visually Rich Documents

Visually rich documents (VRDs) are ubiquitous in daily business and life...
research
05/29/2022

Anchor Prediction: A Topic Modeling Approach

Networks of documents connected by hyperlinks, such as Wikipedia, are ub...
research
06/05/2023

ICDAR 2023 Competition on Structured Text Extraction from Visually-Rich Document Images

Structured text extraction is one of the most valuable and challenging a...
research
05/10/2021

DocReader: Bounding-Box Free Training of a Document Information Extraction Model

Information extraction from documents is a ubiquitous first step in many...
research
09/10/2020

OCR Graph Features for Manipulation Detection in Documents

Detecting manipulations in digital documents is becoming increasingly im...

Please sign up or login with your details

Forgot password? Click here to reset