VSR: A Unified Framework for Document Layout Analysis combining Vision, Semantics and Relations

by   Peng Zhang, et al.

Document layout analysis is crucial for understanding document structures. On this task, vision and semantics of documents, and relations between layout components contribute to the understanding process. Though many works have been proposed to exploit the above information, they show unsatisfactory results. NLP-based methods model layout analysis as a sequence labeling task and show insufficient capabilities in layout modeling. CV-based methods model layout analysis as a detection or segmentation task, but bear limitations of inefficient modality fusion and lack of relation modeling between layout components. To address the above limitations, we propose a unified framework VSR for document layout analysis, combining vision, semantics and relations. VSR supports both NLP-based and CV-based methods. Specifically, we first introduce vision through document image and semantics through text embedding maps. Then, modality-specific visual and semantic features are extracted using a two-stream network, which are adaptively fused to make full use of complementary information. Finally, given component candidates, a relation module based on graph neural network is incorported to model relations between components and output final results. On three popular benchmarks, VSR outperforms previous models by large margins. Code will be released soon.


page 1

page 2

page 3

page 4


Document Layout Analysis with Aesthetic-Guided Image Augmentation

Document layout analysis (DLA) plays an important role in information ex...

Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis

Recognizing the layout of unstructured digital documents is crucial when...

Neural Graph Matching for Modification Similarity Applied to Electronic Document Comparison

In this paper, we present a novel neural graph matching approach applied...

VTLayout: Fusion of Visual and Text Features for Document Layout Analysis

Documents often contain complex physical structures, which make the Docu...

BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding

For understanding generic documents, information like font sizes, column...

LAMBERT: Layout-Aware language Modeling using BERT for information extraction

In this paper we introduce a novel approach to the problem of understand...

MatchVIE: Exploiting Match Relevancy between Entities for Visual Information Extraction

Visual Information Extraction (VIE) task aims to extract key information...

Please sign up or login with your details

Forgot password? Click here to reset