StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

03/01/2023
by Yuechen Yu, et al.

In this paper, we present StrucTexTv2, an effective document image pre-training framework based on masked visual-textual prediction. It consists of two self-supervised pre-training tasks, masked image modeling and masked language modeling, both built on text region-level image masking. The proposed method randomly masks image regions according to the bounding box coordinates of text words, and the pre-training objectives are to reconstruct the pixels of the masked image regions and to predict the corresponding masked tokens simultaneously. Hence, the pre-trained encoder captures more textual semantics than masked image modeling alone, which typically only predicts masked image patches. Compared to masked multi-modal modeling methods for document image understanding that rely on both the image and text modalities, StrucTexTv2 models image-only input and can potentially handle more application scenarios free from OCR pre-processing. Extensive experiments on mainstream document image understanding benchmarks demonstrate the effectiveness of StrucTexTv2: it achieves competitive or even new state-of-the-art performance on downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction under the end-to-end scenario.
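The key mechanism described above is text region-level image masking: a random subset of word bounding boxes is selected and the corresponding pixel regions are masked out before encoding, with both pixel reconstruction and masked-token prediction driven by the same selection. Below is a minimal sketch of such region-level masking in PyTorch; the function name, the mask ratio, the zero-fill mask value, and the tensor layout are illustrative assumptions, not the authors' implementation.

```python
import torch


def mask_text_regions(images, word_boxes, mask_ratio=0.3):
    """Mask a random subset of text-word regions in a batch of document images.

    images:     (B, C, H, W) float tensor of document images.
    word_boxes: list of length B; entry b is an (N_b, 4) float tensor of word
                bounding boxes in pixel coordinates (x1, y1, x2, y2).
    Returns the masked images and, per image, a boolean tensor marking which
    words were masked (these indices would drive both the pixel-reconstruction
    and masked-token objectives).
    """
    masked = images.clone()
    masked_flags = []
    for b, boxes in enumerate(word_boxes):
        n = boxes.shape[0]
        num_mask = max(1, int(n * mask_ratio))  # how many word regions to mask
        idx = torch.randperm(n)[:num_mask]      # random subset of word indices
        flags = torch.zeros(n, dtype=torch.bool)
        flags[idx] = True
        masked_flags.append(flags)
        for x1, y1, x2, y2 in boxes[idx].round().long().tolist():
            # Replace the pixels of the selected word region with a mask value.
            masked[b, :, y1:y2, x1:x2] = 0.0
    return masked, masked_flags


# Example: one 3-channel 256x256 page with three detected words.
imgs = torch.rand(1, 3, 256, 256)
boxes = [torch.tensor([[10., 20., 60., 35.],
                       [70., 20., 140., 35.],
                       [10., 50., 90., 65.]])]
masked_imgs, flags = mask_text_regions(imgs, boxes, mask_ratio=0.5)
```

Note that the word boxes are only needed to place the masks during pre-training; at inference time the model takes the image alone, which is what frees downstream use from OCR pre-processing.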

Related research

- LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding (05/30/2023). Visually-rich Document Understanding (VrDU) has attracted much research ...
- Sequence-to-Sequence Pre-training with Unified Modality Masking for Visual Document Understanding (05/16/2023). This paper presents GenDoc, a general sequence-to-sequence document unde...
- MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding (11/27/2022). Document images are a ubiquitous source of data where the text is organi...
- Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration (09/03/2023). We propose a novel end-to-end document understanding model called SeRum ...
- Multimodal Pre-training Based on Graph Attention Network for Document Understanding (03/25/2022). Document intelligence as a relatively new research topic supports many b...
- Text-DIAE: Degradation Invariant Autoencoders for Text Recognition and Document Enhancement (03/09/2022). In this work, we propose Text-Degradation Invariant Auto Encoder (Text-D...
- Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types (10/11/2021). In the genome biology research, regulatory genome modeling is an importa...