Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations

10/07/2020
by Wanrong Zhu, et al.

A major challenge in visually grounded language generation is to build robust benchmark datasets and models that generalize well in real-world settings. To do this, it is critical to ensure that our evaluation protocols are correct and our benchmarks are reliable. In this work, we design a set of experiments to understand an important but often ignored problem in visually grounded language generation: given that humans have different utilities and visual attention, how does the sample variance in multi-reference datasets affect model performance? Empirically, we study several multi-reference datasets and the corresponding vision-and-language tasks. We show that it is of paramount importance to report variance in experiments; that human-generated references can vary drastically across datasets and tasks, revealing the nature of each task; and that, metric-wise, CIDEr shows systematically larger variance than other metrics. Our per-instance reference evaluations shed light on the design of reliable datasets in the future.
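The core observation, that a score computed against a single sampled reference depends heavily on which human reference was drawn, can be illustrated with a minimal sketch. The metric below is a hypothetical unigram-F1 stand-in (not the paper's actual BLEU/CIDEr setup), and the captions are an invented example:

```python
import statistics

def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two strings (toy stand-in for BLEU/CIDEr)."""
    cand, ref = set(candidate.split()), set(reference.split())
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical multi-reference instance: one image, several human captions.
references = [
    "a man rides a brown horse on the beach",
    "a person riding a horse along the shore",
    "someone on horseback near the ocean",
]
candidate = "a man riding a horse on the beach"

# Score the same candidate against each reference separately: the spread
# shows how much the choice of reference alone moves a single-reference score.
scores = [unigram_f1(candidate, ref) for ref in references]
print("per-reference scores:", [round(s, 3) for s in scores])
print(f"mean={statistics.mean(scores):.3f}  stdev={statistics.stdev(scores):.3f}")
```

The per-reference spread here is large relative to the mean, which is exactly why the abstract argues that variance, not just a point estimate, should be reported.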


