Semantic Grouping Network for Video Captioning

02/01/2021
by   Hobin Ryu, et al.
0

This paper considers a video caption generating network referred to as Semantic Grouping Network (SGN) that attempts (1) to group video frames with discriminating word phrases of partially decoded caption and then (2) to decode those semantically aligned groups in predicting the next word. As consecutive frames are not likely to provide unique information, prior methods have focused on discarding or merging repetitive information based only on the input video. The SGN learns an algorithm to capture the most discriminating word phrases of the partially decoded caption and a mapping that associates each phrase to the relevant video frames - establishing this mapping allows semantically related frames to be clustered, which reduces redundancy. In contrast to the prior methods, the continuous feedback from decoded words enables the SGN to dynamically update the video representation that adapts to the partially decoded caption. Furthermore, a contrastive attention loss is proposed to facilitate accurate alignment between a word phrase and video frames without manual annotations. The SGN achieves state-of-the-art performances by outperforming runner-up methods by a margin of 2.1 score on MSVD and MSR-VTT datasets, respectively. Extensive experiments demonstrate the effectiveness and interpretability of the SGN.

READ FULL TEXT

page 3

page 7

research
06/18/2015

"The Sum of Its Parts": Joint Learning of Word and Phrase Representations with Autoencoders

Recently, there has been a lot of effort to represent words in continuou...
research
04/07/2022

HunYuan_tvr for Text-Video Retrivial

Text-Video Retrieval plays an important role in multi-modal understandin...
research
10/06/2022

Video Referring Expression Comprehension via Transformer with Content-aware Query

Video Referring Expression Comprehension (REC) aims to localize a target...
research
08/05/2021

O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning

Video captioning combines video understanding and language generation. D...
research
04/29/2016

Distance Metric Learning for Aspect Phrase Grouping

Aspect phrase grouping is an important task in aspect-level sentiment an...
research
06/13/2019

Contrastive Bidirectional Transformer for Temporal Representation Learning

This paper aims at learning representations for long sequences of contin...

Please sign up or login with your details

Forgot password? Click here to reset