Diverse Video Captioning by Adaptive Spatio-temporal Attention

08/19/2022
by Zohreh Ghaderi, et al.

To generate proper captions for videos, a captioning system needs to identify the relevant concepts and attend both to the spatial relationships between them and to their temporal development over the clip. Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures: an adapted transformer for joint spatio-temporal video analysis and a self-attention-based decoder for advanced text generation. Furthermore, we introduce an adaptive frame selection scheme that reduces the number of required input frames while retaining the relevant content when training both transformers. Additionally, we estimate semantic concepts relevant for video captioning by aggregating all ground-truth captions of each sample. Our approach achieves state-of-the-art results on the MSVD dataset as well as on the large-scale MSR-VTT and VATEX benchmark datasets across multiple Natural Language Generation (NLG) metrics. Additional evaluations of diversity scores highlight the expressiveness and structural diversity of the generated captions.
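
For illustration, here is a minimal PyTorch sketch of the pipeline the abstract describes: adaptive frame selection followed by a joint spatio-temporal transformer encoder and a self-attention caption decoder. The motion-based frame scoring, the raw-patch tokenization, and all module names and dimensions are simplifying assumptions for this sketch, not the authors' implementation; positional encodings and training details are omitted.

```python
import torch
import torch.nn as nn


def select_frames(frames: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k frames whose content changes most w.r.t. the previous frame.

    A simple motion-based stand-in for the paper's adaptive frame selection.
    frames: (T, C, H, W)
    """
    diffs = (frames[1:] - frames[:-1]).abs().flatten(1).mean(dim=1)  # (T-1,)
    scores = torch.cat([diffs.new_zeros(1), diffs])                  # give frame 0 a score too
    idx = scores.topk(min(k, frames.size(0))).indices.sort().values  # chronological order
    return frames[idx]


class VideoCaptioner(nn.Module):
    """Joint spatio-temporal transformer encoder + self-attention text decoder (sketch)."""

    def __init__(self, vocab_size=10000, d_model=512, nhead=8, num_layers=4, patch=16):
        super().__init__()
        self.patch = patch
        self.patch_embed = nn.Linear(3 * patch * patch, d_model)     # flattened RGB patches
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)  # attends jointly over all patches of all frames
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)  # masked self- and cross-attention
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, frames: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) -> one token per patch per frame
        B, T, C, H, W = frames.shape
        p = self.patch
        patches = frames.unfold(3, p, p).unfold(4, p, p)             # (B, T, C, H/p, W/p, p, p)
        patches = patches.permute(0, 1, 3, 4, 2, 5, 6).reshape(B, -1, C * p * p)
        memory = self.encoder(self.patch_embed(patches))             # spatio-temporal video memory
        tgt = self.tok_embed(captions)                               # (B, L, d_model)
        L = captions.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                                     # next-token logits


# Shape check: keep 8 of 32 frames, decode against a 12-token caption prefix.
clip = torch.randn(32, 3, 64, 64)
frames = select_frames(clip, k=8).unsqueeze(0)                       # (1, 8, 3, 64, 64)
logits = VideoCaptioner()(frames, torch.randint(0, 10000, (1, 12)))  # (1, 12, 10000)
```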
