Taking an Emotional Look at Video Paragraph Captioning

03/12/2022
by Qinyu Li, et al.

Translating visual data into natural language is essential for machines to understand the world and interact with humans. In this work, we conduct a comprehensive study of video paragraph captioning, whose goal is to generate a paragraph-level description for a given video. Current research, however, focuses mainly on detecting objective facts, ignoring the need to establish logical associations between sentences and to capture the emotions conveyed by video content. This limitation impairs the fluency and expressiveness of predicted captions, which fall far below human language standards. To address it, we construct a large-scale, emotion- and logic-driven multilingual dataset for this task. The dataset, named EMVPC (for "Emotional Video Paragraph Captioning"), contains 53 emotions widely used in daily life, 376 common scenes corresponding to these emotions, 10,291 high-quality videos, and 20,582 carefully written paragraph captions in both English and Chinese. Relevant emotion categories, scene labels, emotion word labels, and logic word labels are also provided. EMVPC is intended to support full-fledged video paragraph captioning with rich emotions, coherent logic, and elaborate expressions, and can also benefit other tasks in vision-language fields. Furthermore, we conduct a comprehensive experimental study on existing benchmark video paragraph captioning datasets and on EMVPC: state-of-the-art schemes from different visual captioning tasks are compared on 15 popular metrics, and their detailed objective as well as subjective results are summarized. Finally, remaining problems and future directions of video paragraph captioning are discussed. The unique perspective of this work is expected to boost further development in video paragraph captioning research.
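
To make the annotation layout concrete, the Python sketch below shows how a single sample from such a dataset might be represented. The class name, field names, and example values are hypothetical illustrations derived from the annotation types listed in the abstract, not the authors' released schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EMVPCSample:
    """Hypothetical layout of one EMVPC record, inferred from the abstract;
    the actual released format may differ."""
    video_id: str     # identifier of the source video clip
    emotion: str      # one of the 53 emotion categories
    scene: str        # one of the 376 scene labels
    caption_en: str   # English paragraph caption
    caption_zh: str   # Chinese paragraph caption
    emotion_words: List[str] = field(default_factory=list)  # annotated emotion words
    logic_words: List[str] = field(default_factory=list)    # annotated logic/connective words

# Invented example of what one annotated sample could look like.
sample = EMVPCSample(
    video_id="emvpc_000123",
    emotion="joy",
    scene="birthday party",
    caption_en="A girl blows out the candles. Then everyone claps and cheers happily.",
    caption_zh="一个女孩吹灭了蜡烛。然后大家开心地鼓掌欢呼。",
    emotion_words=["happily"],
    logic_words=["Then"],
)
```

Keeping emotion words and logic (connective) words as explicit fields would make it straightforward to score a model's emotional accuracy and inter-sentence coherence separately from generic n-gram overlap.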

Related research

04/06/2019  VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
We present a new large-scale multilingual video description dataset, VAT...

01/19/2021  ArtEmis: Affective Language for Visual Art
We present a novel large-scale dataset and accompanying machine learning...

02/12/2021  Annotation Cleaning for the MSR-Video to Text Dataset
The video captioning task is to describe the video contents with natural...

04/15/2022  It is Okay to Not Be Okay: Overcoming Emotional Bias in Affective Image Captioning by Contrastive Data Collection
Datasets that capture the connection between vision, language, and affec...

05/02/2020  Are Emojis Emotional? A Study to Understand the Association between Emojis and Emotions
Given the growing ubiquity of emojis in language, there is a need for me...

09/05/2023  A method for Selecting Scenes and Emotion-based Descriptions for a Robot's Diary
In this study, we examined scene selection methods and emotion-based des...

07/26/2018  Move Forward and Tell: A Progressive Generator of Video Descriptions
We present an efficient framework that can generate a coherent paragraph...
