GeoLayoutLM: Geometric Pre-training for Visual Information Extraction

by Chuwei Luo, et al.

Visual information extraction (VIE) plays an important role in Document Intelligence. Generally, it is divided into two tasks: semantic entity recognition (SER) and relation extraction (RE). Recently, pre-trained models for documents have achieved substantial progress in VIE, particularly in SER. However, most existing models learn the geometric representation implicitly, which has been found insufficient for the RE task, since geometric information is especially crucial for RE. Moreover, we reveal that another factor limiting RE performance is the objective gap between the pre-training phase and the fine-tuning phase for RE. To tackle these issues, we propose a multi-modal framework for VIE, named GeoLayoutLM. GeoLayoutLM explicitly models geometric relations in pre-training, which we call geometric pre-training. Geometric pre-training is achieved through three specially designed geometry-related pre-training tasks. Additionally, novel relation heads, which are pre-trained by the geometric pre-training tasks and fine-tuned for RE, are designed to enrich and enhance the feature representation. In extensive experiments on standard VIE benchmarks, GeoLayoutLM achieves highly competitive scores on the SER task and significantly outperforms the previous state-of-the-art methods for RE (e.g., the F1 score of RE on FUNSD is boosted from 80.35% to 89.45%). The code and models are publicly available at




