An End-to-End Baseline for Video Captioning

04/04/2019
by Silvio Olivastri, et al.

Building correspondences across different modalities, such as video and language, has recently become critical in many visual recognition applications, such as video captioning. Inspired by machine translation, recent models tackle this task using an encoder-decoder strategy. The (video) encoder is traditionally a Convolutional Neural Network (CNN), while the decoding (for language generation) is done using a Recurrent Neural Network (RNN). Current state-of-the-art methods, however, train encoder and decoder separately. CNNs are pretrained on object and/or action recognition tasks and used to encode video-level features. The decoder is then optimised on such static features to generate the video's description. This disjoint setup is arguably sub-optimal for input (video) to output (description) mapping. In this work, we propose to optimise both encoder and decoder simultaneously in an end-to-end fashion. In a two-stage training setting, we first initialise our architecture using pre-trained encoders and decoders -- then, the entire network is trained end-to-end in a fine-tuning stage to learn the most relevant features for video caption generation. In our experiments, we use GoogLeNet and Inception-ResNet-v2 as encoders and an original Soft-Attention (SA-) LSTM as a decoder. Analogously to gains observed in other computer vision problems, we show that end-to-end training significantly improves over the traditional, disjoint training process. We evaluate our End-to-End (EtENet) Networks on the Microsoft Research Video Description (MSVD) and the MSR Video to Text (MSR-VTT) benchmark datasets, showing how EtENet achieves state-of-the-art performance across the board.
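The pipeline the abstract describes maps naturally onto a small PyTorch sketch. The snippet below is a hypothetical, minimal reconstruction, not the authors' code: it uses torchvision's GoogLeNet as a stand-in frame encoder (the paper also uses Inception-ResNet-v2, which torchvision does not ship), a Bahdanau-style soft-attention LSTM decoder, and a `set_stage` helper that mimics the two-stage schedule (decoder training on frozen encoder features, then end-to-end fine-tuning). All module names, dimensions, and learning rates here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SoftAttention(nn.Module):
    """Additive soft attention over per-frame features (SA-LSTM style)."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, T, feat_dim); hidden: (B, hidden_dim)
        e = self.score(torch.tanh(self.feat_proj(feats)
                                  + self.hidden_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)        # weights over the T frames
        return (alpha * feats).sum(dim=1)      # (B, feat_dim) context vector

class EtENet(nn.Module):
    """CNN frame encoder + soft-attention LSTM decoder, trainable end-to-end."""
    def __init__(self, vocab_size, feat_dim=1024, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1,
                               aux_logits=False)
        cnn.fc = nn.Identity()                 # keep the 1024-d pooled features
        self.encoder = cnn
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = SoftAttention(feat_dim, hidden_dim)
        self.cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, captions):
        # frames: (B, T, 3, 224, 224); captions: (B, L) token ids
        B, T = frames.shape[:2]
        # Frames are encoded inside the graph, so gradients reach the CNN:
        # this is what makes the training end-to-end rather than disjoint.
        feats = self.encoder(frames.flatten(0, 1)).view(B, T, -1)
        h = frames.new_zeros(B, self.cell.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(captions.size(1) - 1):  # teacher forcing on GT tokens
            ctx = self.attn(feats, h)
            h, c = self.cell(torch.cat([self.embed(captions[:, t]), ctx], 1),
                             (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, 1)          # (B, L-1, vocab_size)

def set_stage(model, end_to_end, lr=1e-4):
    """Stage 1 (end_to_end=False): decoder trained on frozen encoder features.
    Stage 2 (end_to_end=True): whole network fine-tuned at a reduced LR."""
    for p in model.encoder.parameters():
        p.requires_grad = end_to_end
    params = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(params, lr=lr if not end_to_end else lr / 10)
```

A training loop would call `set_stage(model, False)` first, then rebuild the optimiser with `set_stage(model, True)` for the fine-tuning stage, using cross-entropy against `captions[:, 1:]` as the loss. The tenfold learning-rate reduction in stage two is an illustrative choice, not a value taken from the paper.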

