Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning

03/31/2018
by Jingwen Wang, et al.

Dense video captioning is a newly emerging task that aims at both localizing and describing all events in a video. We identify and tackle two challenges in this task, namely, (1) how to utilize both past and future contexts for accurate event proposal predictions, and (2) how to construct informative input to the decoder for generating natural event descriptions. First, previous works predominantly generate temporal event proposals in the forward direction, which neglects future video context. We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions. Second, different events ending at (nearly) the same time are indistinguishable in previous works, resulting in identical captions. We solve this problem by representing each event with an attentive fusion of hidden states from the proposal module and video contents (e.g., C3D features). We further propose a novel context gating mechanism to dynamically balance the contributions from the current event and its surrounding contexts. We empirically show that our attentively fused event representation is superior to the proposal hidden states or video contents alone. By coupling the proposal and captioning modules into one unified framework, our model outperforms the state of the art on the ActivityNet Captions dataset with a relative gain of over 100%.
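The abstract describes two fusion steps: an attentive fusion that pools surrounding video context for each event, and a context gate that balances the current event's representation against that pooled context. The paper itself does not give equations here, so the following NumPy sketch is only an illustration under assumed shapes and parameter names (`W_att`, `W_g`, `b_g` are hypothetical); it shows a bilinear attention over context features followed by a sigmoid gate interpolating between the event and its context, not the authors' exact architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attentive_fusion(event_hidden, context_feats, W_att):
    """Attend over context features (e.g., per-clip C3D vectors),
    conditioned on the event's proposal hidden state.

    event_hidden:  (d,)   hidden state for the current event proposal
    context_feats: (T, d) features of the surrounding video clips
    W_att:         (d, d) bilinear attention parameters (assumed form)
    """
    scores = context_feats @ (W_att @ event_hidden)        # (T,)
    weights = np.exp(scores - scores.max())                # softmax, stabilized
    weights /= weights.sum()
    return weights @ context_feats                         # (d,) pooled context

def context_gate(event_hidden, context, W_g, b_g):
    """Sigmoid gate that dynamically balances the current event
    against its attended surrounding context.

    W_g: (d, 2d), b_g: (d,)  gate parameters (assumed form)
    """
    g = sigmoid(W_g @ np.concatenate([event_hidden, context]) + b_g)
    return g * event_hidden + (1.0 - g) * context          # (d,) fused input
```

The gated vector would then serve as the decoder input in place of the proposal hidden state or the raw video features alone, which is the comparison the abstract reports.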


Related research

- 04/03/2018 · End-to-End Dense Video Captioning with Masked Transformer. Dense video captioning aims to generate text descriptions for all events...
- 04/08/2019 · Streamlined Dense Video Captioning. Dense video captioning is an extremely challenging task since accurate a...
- 06/22/2018 · RUC+CMU: System Report for Dense Captioning Events in Videos. This notebook paper presents our system in the ActivityNet Dense Caption...
- 04/23/2018 · Jointly Localizing and Describing Events for Dense Video Captioning. Automatically describing a video with natural language is regarded as a ...
- 07/18/2022 · Unifying Event Detection and Captioning as Sequence Generation via Pre-Training. Dense video captioning aims to generate corresponding text descriptions ...
- 06/25/2018 · Best Vision Technologies Submission to ActivityNet Challenge 2018-Task: Dense-Captioning Events in Videos. This note describes the details of our solution to the dense-captioning ...
- 11/28/2016 · Bidirectional Multirate Reconstruction for Temporal Modeling in Videos. Despite the recent success of neural networks in image feature learning,...
