Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA

11/14/2019
by Ronghang Hu, et al.

Many visual scenes contain text that carries crucial information, so understanding text in images is essential for downstream reasoning tasks. For example, a "deep water" label on a warning sign warns people about the danger in the scene. Recent work has explored the TextVQA task, which requires reading and understanding text in images to answer a question. However, existing approaches to TextVQA are mostly based on custom pairwise fusion mechanisms between two modalities and are restricted to a single prediction step by casting TextVQA as a classification task. In this work, we propose a novel model for the TextVQA task based on a multimodal transformer architecture accompanied by a rich representation for text in images. Our model fuses different modalities homogeneously by embedding them into a common semantic space, where self-attention models both inter- and intra-modality context. Furthermore, it enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification. Our model outperforms existing approaches on three benchmark datasets for the TextVQA task by a large margin.
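The abstract's two key ideas, homogeneous fusion through a shared embedding space and iterative decoding with a dynamic pointer over OCR tokens, can be made concrete with a minimal PyTorch sketch. This is not the authors' released implementation: the class name, feature dimensions, layer counts, and the use of learned per-step queries (in place of feeding predictions back autoregressively under a causal mask, as an iterative decoder would) are all illustrative assumptions.

import torch
import torch.nn as nn


class PointerAugmentedMultimodalTransformer(nn.Module):
    """Sketch: one transformer over question words, visual objects, and
    OCR tokens, all projected into a common semantic space; answers are
    scored against a fixed vocabulary plus a dynamic pointer over the
    OCR tokens."""

    def __init__(self, d_question=768, d_object=2048, d_ocr=300,
                 d_model=768, vocab_size=5000, num_layers=4, max_steps=12):
        super().__init__()
        # Per-modality projections into the common embedding space.
        self.proj_question = nn.Linear(d_question, d_model)
        self.proj_object = nn.Linear(d_object, d_model)
        self.proj_ocr = nn.Linear(d_ocr, d_model)
        # Self-attention over the joint sequence models inter- and
        # intra-modality context in one homogeneous operation.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Classifier head over a fixed answer vocabulary.
        self.vocab_head = nn.Linear(d_model, vocab_size)
        # One learned query per decoding step (a simplification: the
        # paper's model feeds previous predictions back step by step).
        self.step_queries = nn.Embedding(max_steps, d_model)
        self.max_steps = max_steps

    def forward(self, q_feats, obj_feats, ocr_feats):
        b = q_feats.size(0)
        # Embed every modality homogeneously, then concatenate.
        q = self.proj_question(q_feats)   # (b, n_q,   d_model)
        o = self.proj_object(obj_feats)   # (b, n_obj, d_model)
        t = self.proj_ocr(ocr_feats)      # (b, n_ocr, d_model)
        steps = self.step_queries.weight.unsqueeze(0).expand(b, -1, -1)
        enc = self.encoder(torch.cat([q, o, t, steps], dim=1))
        # Recover the encoded OCR tokens and the per-step outputs.
        ocr_start = q.size(1) + o.size(1)
        ocr_enc = enc[:, ocr_start:ocr_start + t.size(1)]
        dec = enc[:, -self.max_steps:]
        # Fixed-vocabulary scores for each decoding step.
        vocab_scores = self.vocab_head(dec)
        # Dynamic pointer scores: similarity between each step's output
        # and each encoded OCR token, so the model can copy scene text
        # that never appears in the training vocabulary.
        pointer_scores = torch.bmm(dec, ocr_enc.transpose(1, 2))
        # Per step, the predicted token is the argmax over the combined
        # vocabulary-plus-pointer score vector.
        return torch.cat([vocab_scores, pointer_scores], dim=-1)


# Quick shape check with random features (batch of 2 images,
# 20 question words, 36 objects, 50 OCR tokens):
model = PointerAugmentedMultimodalTransformer()
scores = model(torch.randn(2, 20, 768),
               torch.randn(2, 36, 2048),
               torch.randn(2, 50, 300))
print(scores.shape)  # torch.Size([2, 12, 5050]): 5000 vocab + 50 pointer

The pointer scores are what let the model produce scene-text answers that never appear in any fixed vocabulary, which is why the one-step classification formulation criticized in the abstract falls short.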


Related research

10/06/2020
Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering
Image text carries essential information to understand the scene and per...

12/16/2021
KAT: A Knowledge Augmented Transformer for Vision-and-Language
The primary focus of recent work with large-scale transformers has been o...

06/16/2022
Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos
Multimodal sentiment analysis in videos is a key task in many real-world...

07/23/2020
Spatially Aware Multimodal Transformers for TextVQA
Textual cues are essential for everyday tasks like buying groceries and ...

08/03/2023
Multimodal Neurons in Pretrained Text-Only Transformers
Language models demonstrate remarkable capacity to generalize representa...

09/13/2019
A Gated Self-attention Memory Network for Answer Selection
Answer selection is an important research problem, with applications in ...

12/07/2020
Confidence-aware Non-repetitive Multimodal Transformers for TextCaps
When describing an image, reading text in the visual scene is crucial to...
