Thinking Hallucination for Video Captioning

09/28/2022
by Nasib Ullah, et al.

With the advent of rich visual representations and pre-trained language models, video captioning has seen continuous improvement over time. Despite this progress, video captioning models remain prone to hallucination: the generation of highly pathological descriptions that are detached from the source material. In video captioning, there are two kinds of hallucination: object hallucination and action hallucination. Rather than endeavoring to learn better representations of a video, in this work we investigate the fundamental sources of the hallucination problem. We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influence of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy. To alleviate these problems, we propose two robust solutions: (a) auxiliary heads trained in a multi-label setting on top of the extracted visual features and (b) context gates, which dynamically select features during fusion. The standard evaluation metrics for video captioning measure similarity with ground-truth captions and do not adequately capture object and action relevance. To this end, we propose a new metric, COAHA (caption object and action hallucination assessment), which assesses the degree of hallucination. Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets, with a particularly large margin in CIDEr score.
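The context gate idea above can be sketched in a few lines: a learned sigmoid gate decides, per dimension, how much of the visual (source) context versus the textual (target) context enters the fused representation. This is a minimal NumPy sketch; the concatenation-based gating, the weight shapes, and all parameter values here are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_gate(visual, textual, W, b):
    # Gate computed from both contexts; fused output is an elementwise
    # convex combination of the visual and textual features.
    gate = sigmoid(W @ np.concatenate([visual, textual]) + b)
    return gate * visual + (1.0 - gate) * textual

# Toy inputs standing in for a frame feature and a decoder state.
rng = np.random.default_rng(0)
d = 4
visual = rng.normal(size=d)
textual = rng.normal(size=d)
W = rng.normal(size=(d, 2 * d))  # illustrative, untrained weights
b = np.zeros(d)

fused = context_gate(visual, textual, W, b)
print(fused.shape)  # (4,)
```

Because the gate lies in (0, 1), each fused coordinate stays between the corresponding visual and textual values, which is what lets the model suppress whichever modality would otherwise dominate the fusion.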

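The abstract does not give COAHA's exact formula, but a metric that "assesses the degree of hallucination" over objects and actions can be sketched as a hallucination rate: the fraction of mentioned items absent from the reference annotations. The equal weighting of the two rates and the exact set-based matching below are assumptions for illustration only.

```python
def hallucination_rate(predicted, reference):
    # Fraction of predicted items (objects or actions) that do not
    # appear in the reference set, i.e. hallucinated mentions.
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted - set(reference)) / len(predicted)

def coaha_score(pred_objects, ref_objects, pred_actions, ref_actions):
    # Illustrative COAHA-style aggregate: average of the object and
    # action hallucination rates (the paper's weighting may differ).
    return 0.5 * (hallucination_rate(pred_objects, ref_objects)
                  + hallucination_rate(pred_actions, ref_actions))

# "a man is driving a horse" vs. reference "a man is riding a horse":
# objects match, but the action "driving" is hallucinated.
print(coaha_score({"man", "horse"}, {"man", "horse"},
                  {"driving"}, {"riding"}))  # 0.5
```

A higher score indicates more hallucinated content; a caption whose objects and actions all appear in the references scores 0.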


