Integrating both Visual and Audio Cues for Enhanced Video Caption

by   Wangli Hao, et al.

Video captioning refers to automatically generating a descriptive sentence for a short video clip, a task that has seen remarkable progress recently. However, most existing methods focus on visual information while ignoring the synchronized audio cues. We propose three multimodal deep fusion strategies to maximize the benefit of visual-audio resonance information. The first explores the impact of cross-modality feature fusion from low to high order. The second establishes a short-term visual-audio dependency by sharing the weights of the corresponding front-end networks. The third extends this temporal dependency to the long term by sharing a multimodal memory across the visual and audio modalities. Extensive experiments validate the effectiveness of our three cross-modality fusion strategies on two benchmark datasets: Microsoft Research Video to Text (MSRVTT) and Microsoft Video Description (MSVD). It is worth mentioning that weight sharing coordinates visual-audio feature fusion effectively and achieves state-of-the-art performance on both the BLEU and METEOR metrics. Furthermore, we are the first to propose a dynamic multimodal feature fusion framework that handles the case where some modalities are missing. Experimental results demonstrate that even when audio is absent, we can still obtain comparable results with the aid of an additional audio-modality inference module.
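The weight-sharing idea behind the second strategy can be illustrated with a minimal NumPy sketch: both modality streams are projected through a single shared weight matrix before fusion, so the two front-ends are forced into a common feature space. The dimensions, the tanh nonlinearity, and the concatenation step here are illustrative assumptions, not the paper's actual architecture, which uses learned front-end networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: per-frame features for each modality are
# projected into the same hidden space.
feat_dim, hidden_dim = 128, 64

# One weight matrix serves as the front-end for BOTH modalities,
# coupling the visual and audio streams (the weight-sharing strategy).
W_shared = rng.standard_normal((feat_dim, hidden_dim)) * 0.01

def encode(features, W):
    """Project per-frame features and apply a tanh nonlinearity."""
    return np.tanh(features @ W)

visual = rng.standard_normal((10, feat_dim))  # 10 frames of visual features
audio = rng.standard_normal((10, feat_dim))   # 10 time-aligned audio features

h_v = encode(visual, W_shared)  # (10, 64)
h_a = encode(audio, W_shared)   # (10, 64)

# Simple late fusion for the sketch: concatenate the encoded streams
# per time step before feeding a caption decoder.
fused = np.concatenate([h_v, h_a], axis=1)
print(fused.shape)  # (10, 128)
```

Because the projection is shared, gradients from either modality would update the same parameters, which is what lets the two streams coordinate during training.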




