Reconstruct and Represent Video Contents for Captioning via Reinforcement Learning

06/03/2019
by Wei Zhang, et al.

In this paper, the problem of describing the visual content of a video sequence with natural language is addressed. Unlike previous video captioning work, which mainly exploits forward cues from video content to generate a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture that leverages both the forward flow (video to sentence) and the backward flow (sentence to video) for video captioning. Specifically, the encoder-decoder component uses the forward flow to produce a sentence description from the encoded video semantic features. Two types of reconstructors are then proposed to employ the backward flow and reproduce the video features from local and global perspectives, respectively, capitalizing on the hidden state sequence generated by the decoder. Moreover, to achieve a comprehensive reconstruction of the video features, we propose to fuse the two types of reconstructors. The generation loss yielded by the encoder-decoder component and the reconstruction loss introduced by the reconstructor are jointly used to train the proposed RecNet in an end-to-end fashion. Furthermore, the RecNet is fine-tuned by CIDEr optimization via reinforcement learning, which significantly boosts captioning performance. Experimental results on benchmark datasets demonstrate that the proposed reconstructor consistently improves video captioning performance.
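The two training signals described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the shapes, the weight `lam`, the mean-pooled "global" reconstruction distance, and the function names are simplifying assumptions. The second function sketches the self-critical REINFORCE-style update commonly used for CIDEr optimization, where the reward of a greedy decode serves as the baseline for a sampled caption.

```python
import numpy as np

def joint_loss(logits, targets, video_feats, recon_feats, lam=0.2):
    """RecNet-style joint objective (hypothetical sketch): caption
    generation loss plus a weighted video-reconstruction loss.

    logits:      (T, V) decoder scores over a vocabulary of size V
    targets:     (T,)   ground-truth word indices
    video_feats: (N, D) encoded frame features
    recon_feats: (N, D) features reproduced by the reconstructor
    """
    # Generation loss: per-step cross-entropy over the vocabulary,
    # computed from a numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    gen_loss = -log_probs[np.arange(len(targets)), targets].mean()

    # Reconstruction loss: mean Euclidean distance between the original
    # and reproduced video features (a "global" style reconstruction).
    rec_loss = np.linalg.norm(video_feats - recon_feats, axis=1).mean()

    return gen_loss + lam * rec_loss

def self_critical_loss(log_prob_sampled, reward_sampled, reward_greedy):
    """Self-critical REINFORCE-style loss for CIDEr fine-tuning
    (hypothetical sketch): the CIDEr score of the greedy decode is the
    baseline, so only captions that beat it are reinforced."""
    advantage = reward_sampled - reward_greedy
    return -advantage * log_prob_sampled
```

A perfect reconstruction drives the second term to zero, so the objective smoothly falls back to plain cross-entropy training; the `lam` weight trades off the two flows.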

Related research

03/30/2018  Reconstruction Network for Video Captioning
In this paper, the problem of describing visual contents of a video sequ...

12/20/2020  Guidance Module Network for Video Captioning
Video captioning has been a challenging and significant task that descri...

04/04/2019  An End-to-End Baseline for Video Captioning
Building correspondences across different modalities, such as video and ...

01/16/2020  Delving Deeper into the Decoder for Video Captioning
Video captioning is an advanced multi-modal task which aims to describe ...

05/22/2022  GL-RG: Global-Local Representation Granularity for Video Captioning
Video captioning is a challenging task as it needs to accurately transfo...

01/04/2022  Variational Stacked Local Attention Networks for Diverse Video Captioning
While describing Spatio-temporal events in natural language, video capti...

03/05/2018  Less Is More: Picking Informative Frames for Video Captioning
In video captioning task, the best practice has been achieved by attenti...
