Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation

by Tao Tu, et al.

GuessWhat?! is a two-player visual dialog guessing game in which player A asks a sequence of yes/no questions (Questioner) and makes a final guess (Guesser) about a target object in an image, based on answers from player B (Oracle). The Guesser makes this final guess from the dialog history between the Questioner and the Oracle. The previous baseline Oracle model encodes no visual information, and it cannot fully understand complex questions about color, shape, relationships and so on. Most existing work on the Guesser encodes the dialog history as a whole and trains the Guesser models from scratch on the GuessWhat?! dataset. This is problematic because language encoders tend to forget long-term history and the GuessWhat?! data is sparse in terms of learning visual grounding of objects. Previous work on the Questioner introduces a state tracking mechanism into the model, but the state is learned as a soft intermediate without any prior vision-linguistic insights. To bridge these gaps, in this paper we propose a Vilbert-based Oracle, Guesser and Questioner, all built on top of a pretrained vision-linguistic model, Vilbert. We introduce a two-way background/target fusion mechanism into Vilbert-Oracle to account for both intra-object and inter-object questions. We propose a unified framework for Vilbert-Guesser and Vilbert-Questioner, in which a state estimator is introduced to best utilize Vilbert's power on single-turn referring expression comprehension. Experimental results show that our proposed models outperform state-of-the-art models significantly by 7 Oracle, Guesser and End-to-End Questioner respectively.
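
To make the three roles in the game concrete, the following is a minimal toy sketch of the GuessWhat?! protocol only, not of the Vilbert-based models. It assumes objects are described by two symbolic attributes and that questions are (attribute, value) pairs, whereas the real game uses natural-language questions about objects in an image. All names here (`SceneObject`, `oracle_answer`, `next_question`, `play`) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical symbolic stand-in for an object in the image."""
    category: str
    color: str


Question = Tuple[str, str]            # (attribute, value), e.g. ("color", "black")
Dialog = List[Tuple[Question, str]]   # list of (question, "yes" / "no") turns


def oracle_answer(target: SceneObject, question: Question) -> str:
    """Oracle role: answer a yes/no question about the hidden target."""
    attribute, value = question
    return "yes" if getattr(target, attribute) == value else "no"


def consistent(obj: SceneObject, dialog: Dialog) -> bool:
    """True if obj would have produced every answer recorded in the dialog."""
    return all(oracle_answer(obj, q) == a for q, a in dialog)


def next_question(candidates: List[SceneObject],
                  asked: List[Question]) -> Optional[Question]:
    """Toy Questioner: ask about the first attribute value not asked yet."""
    for obj in candidates:
        for attribute in ("category", "color"):
            question = (attribute, getattr(obj, attribute))
            if question not in asked:
                return question
    return None  # nothing new left to ask


def play(objects: List[SceneObject], target: SceneObject,
         max_turns: int = 5) -> Tuple[SceneObject, Dialog]:
    """Run one game; assumes target is one of objects."""
    dialog: Dialog = []
    candidates = list(objects)
    for _ in range(max_turns):
        if len(candidates) <= 1:
            break
        question = next_question(candidates, [q for q, _ in dialog])
        if question is None:
            break
        dialog.append((question, oracle_answer(target, question)))
        # Toy Guesser state: keep only objects consistent with the history.
        candidates = [o for o in candidates if consistent(o, dialog)]
    return candidates[0], dialog
```

Calling `play(objects, target)` with a small list of `SceneObject` instances returns the guessed object and the question/answer history. The hard candidate filtering used here is a deliberate simplification that only mirrors the flow of information between Questioner, Oracle and Guesser.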



Modeling Coreference Relations in Visual Dialog

Visual dialog is a vision-language task where an agent needs to answer a...

VD-BERT: A Unified Vision and Dialog Transformer with BERT

Visual dialog is a challenging vision-language task, where a dialog agen...

History for Visual Dialog: Do we really need it?

Visual Dialog involves "understanding" the dialog history (what has been...

Learning to Ground Visual Objects for Visual Dialog

Visual dialog is challenging since it needs to answer a series of cohere...

Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition

In an open-world setting, it is inevitable that an intelligent agent (e....

Learning to Respond with Stickers: A Framework of Unifying Multi-Modality in Multi-Turn Dialog

Stickers with vivid and engaging expressions are becoming increasingly p...

Guessing State Tracking for Visual Dialogue

The Guesser plays an important role in GuessWhat?! like visual dialogues...
