Dense Video Captioning Using Unsupervised Semantic Information

12/15/2021
by   Valter Estevam, et al.

We introduce a method to learn unsupervised semantic visual information based on the premise that complex events (e.g., lasting minutes) can be decomposed into simpler events (e.g., lasting a few seconds), and that these simple events are shared across several complex events. We split a long video into short frame sequences and extract their latent representations with three-dimensional convolutional neural networks. A clustering method groups these representations to produce a visual codebook (i.e., a long video is represented by a sequence of integers given by the cluster labels). A dense representation is then learned by encoding the co-occurrence probability matrix over the codebook entries. We demonstrate how this representation improves performance on the dense video captioning task in a scenario with only visual features. Using this approach, we replace the audio signal in the Bi-Modal Transformer (BMT) method and produce temporal proposals with comparable performance. Furthermore, by concatenating the visual signal with our descriptor in a vanilla transformer, we achieve state-of-the-art captioning performance among methods that use only visual features, as well as competitive performance against multi-modal methods. Our code is available at https://github.com/valterlej/dvcusi.
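The codebook idea described in the abstract can be illustrated with a small sketch. The snippet below (NumPy only; the function names and toy parameters are ours, not the authors') clusters clip-level features into a visual vocabulary with a minimal k-means, maps each clip to its cluster label, and counts label co-occurrences within a temporal window, the kind of statistics one would feed to a GloVe-style embedding learner to obtain the dense representation.

```python
# Hedged sketch, not the authors' implementation: build a visual codebook
# from clip features and a co-occurrence matrix over its labels.
import numpy as np

def build_codebook(features, k=4, iters=10, seed=0):
    """Toy k-means: returns cluster centers and per-clip integer labels."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # Assign each clip to its nearest center, then recompute centers.
        d = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return centers, labels

def cooccurrence(labels, k, window=2):
    """Symmetric co-occurrence counts of codebook labels within a window."""
    M = np.zeros((k, k))
    for i, a in enumerate(labels):
        lo, hi = max(0, i - window), min(len(labels), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                M[a, labels[j]] += 1
    return M

# Toy data: 60 short clips, each with an 8-D latent feature (stand-in for
# the 3D-CNN clip embeddings used in the paper).
rng = np.random.default_rng(1)
feats = rng.normal(size=(60, 8))
centers, labels = build_codebook(feats, k=4)
M = cooccurrence(labels, k=4)
print(M.shape)  # (4, 4)
```

In the paper's pipeline, the rows of a matrix learned from these co-occurrence statistics serve as dense semantic descriptors that can stand in for the audio stream in BMT or be concatenated with the visual features.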
