Collaborative Three-Stream Transformers for Video Captioning

by Hao Wang, et al.

As the most critical components of a sentence, the subject, predicate, and object require special attention in the video captioning task. To implement this idea, we design a novel framework, named COllaborative three-Stream Transformers (COST), which models the three parts separately and lets them complement each other for better representation. Specifically, COST consists of three transformer branches that exploit visual-linguistic interactions of different granularities in the spatial-temporal domain: between videos and text, between detected objects and text, and between actions and text. Meanwhile, we propose a cross-granularity attention module to align the interactions modeled by the three branches, so that the branches can support each other in exploiting the most discriminative semantic information of different granularities for accurate caption prediction. The whole model is trained in an end-to-end fashion. Extensive experiments conducted on three large-scale challenging datasets, i.e., YouCookII, ActivityNet Captions and MSVD, demonstrate that the proposed method performs favorably against state-of-the-art methods.
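To make the idea of branches "supporting each other" concrete, the following is a minimal sketch, not the authors' implementation. It assumes each branch yields a list of token feature vectors of equal width, and that cross-granularity alignment can be approximated by letting each stream attend to the tokens of the other two and adding the result residually. The names `attend`, `cross_granularity_step`, and the stream keys are hypothetical, and learned projections, multiple heads, and the text modality are omitted.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    # Plain scaled dot-product attention over lists of equal-width vectors.
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def cross_granularity_step(streams):
    # ASSUMPTION: each stream queries the tokens of the other streams and
    # adds the attended context to its own tokens (residual connection).
    updated = {}
    for name, tokens in streams.items():
        others = [t for other, toks in streams.items()
                  if other != name for t in toks]
        context = attend(tokens, others, others)
        updated[name] = [[x + c for x, c in zip(tok, ctx)]
                         for tok, ctx in zip(tokens, context)]
    return updated


# Toy features: 2-dimensional tokens for the three hypothetical streams.
streams = {
    "video": [[1.0, 0.0], [0.5, 0.5]],
    "object": [[0.0, 1.0]],
    "action": [[1.0, 1.0], [0.0, 0.0], [0.2, 0.8]],
}
fused = cross_granularity_step(streams)
```

In this sketch each stream keeps its own token count and feature width, so the fused features could in principle be passed on to separate per-branch decoders; how COST actually combines the branches is described in the paper itself, not here.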



Text with Knowledge Graph Augmented Transformer for Video Captioning

Video captioning aims to describe the content of videos using natural la...

SEM-POS: Grammatically and Semantically Correct Video Captioning

Generating grammatically and semantically correct captions in video capt...

Diverse Video Captioning by Adaptive Spatio-temporal Attention

To generate proper captions for videos, the inference needs to identify ...

Areas of Attention for Image Captioning

We propose "Areas of Attention", a novel attention-based model for autom...

GL-RG: Global-Local Representation Granularity for Video Captioning

Video captioning is a challenging task as it needs to accurately transfo...

Deep Video Restoration for Under-Display Camera

Images or videos captured by the Under-Display Camera (UDC) suffer from ...

MD-CSDNetwork: Multi-Domain Cross Stitched Network for Deepfake Detection

The rapid progress in the ease of creating and spreading ultra-realistic...
