Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering

by   Wenjin Wang, et al.

The pre-training-fine-tuning paradigm based on layout-aware multimodal pre-trained models has achieved significant progress on document image question answering. However, domain-specific pre-training and task-specific fine-tuning of additional visual, layout, and task modules prevent these models from directly utilizing off-the-shelf instruction-tuned language foundation models, which have recently shown promising potential in zero-shot learning. Instead of aligning language models to the domain of document image question answering, we align document image question answering to off-the-shelf instruction-tuned language foundation models in order to exploit their zero-shot capability. Specifically, we propose a layout- and task-aware instruction prompt, called LATIN-Prompt, which consists of layout-aware document content and task-aware descriptions. The former recovers the layout information among text segments produced by OCR tools through appropriate spaces and line breaks. The latter ensures that the model generates answers that meet the task's requirements, especially format requirements, through a detailed description of the task. Experimental results on three benchmarks show that LATIN-Prompt improves the zero-shot performance of instruction-tuned language foundation models on document image question answering and helps them reach levels comparable to SOTA methods based on the pre-training-fine-tuning paradigm. Quantitative and qualitative analyses demonstrate the effectiveness of LATIN-Prompt. We provide the code in the supplementary material and will release it to facilitate future research.
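To make the layout-aware part of the idea concrete, the following is a minimal sketch (not the paper's exact algorithm) of how OCR text segments with bounding-box coordinates could be rendered onto a character grid, so that horizontal gaps become runs of spaces and vertical gaps become line breaks. The function name, the fixed `char_width`/`line_height` heuristics, and the input format `(text, x, y)` are all assumptions for illustration:

```python
def build_layout_prompt(segments, char_width=10, line_height=20):
    """Render OCR segments as layout-preserving plain text.

    segments: list of (text, x, y) tuples, where (x, y) is the
    top-left pixel coordinate of each OCR text segment.
    """
    # Group segments into text lines by quantizing the vertical position.
    lines = {}
    for text, x, y in segments:
        row = y // line_height
        lines.setdefault(row, []).append((x, text))

    rendered = []
    for row in sorted(lines):
        cursor = 0  # current column position on this line
        parts = []
        for x, text in sorted(lines[row]):
            col = x // char_width  # target column from pixel x-coordinate
            # Pad with spaces up to the target column (at least one space
            # between adjacent segments on the same line).
            parts.append(" " * max(col - cursor, 1 if cursor else 0))
            parts.append(text)
            cursor = col + len(text)
        rendered.append("".join(parts).rstrip())
    return "\n".join(rendered)
```

The resulting string can then be embedded in a prompt together with a task-aware instruction (e.g. describing the expected answer format) and the question, before being passed to an instruction-tuned language model.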




