Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization

by Yuanyuan Jiang, et al.

Audio-visual event (AVE) localization has attracted much attention in recent years. Most existing methods are limited to encoding and classifying each video segment independently, in isolation from the full video (these can be regarded as segment-level representations of events). As a result, they ignore the semantic consistency of the event within the same full video (which can be regarded as the video-level representation of the event). In contrast to existing methods, we propose a novel video-level semantic consistency guidance network for the AVE task. Specifically, we propose an event semantic consistency modeling (ESCM) module to explore the video-level semantic consistency of events. It consists of two components: a cross-modal event representation extractor (CERE) and an intra-modal semantic consistency enhancer (ISCE). CERE obtains the event semantic representation at the video level, covering both the audio and visual modalities. ISCE then takes the video-level event semantic representation as prior knowledge to guide the model to focus on the semantic continuity of the event within each modality. Moreover, we propose a new negative pair filter loss to encourage the network to filter out irrelevant segment pairs, and a new smooth loss to further increase the gap between different categories of events under the weakly supervised setting. We perform extensive experiments on the public AVE dataset and outperform state-of-the-art methods in both the fully supervised and weakly supervised settings, verifying the effectiveness of our method.
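To make the idea of video-level guidance concrete, the following is a minimal sketch and not the authors' architecture: the abstract does not specify how CERE or ISCE are built, so the mean-pooling fusion, the softmax weighting, the function names, and the tensor sizes below are all illustrative assumptions. It only shows the general pattern of pooling segment features into one video-level vector and then using that vector as a prior when refining each modality's segment features.

```python
import numpy as np

def video_level_representation(audio_seg, visual_seg):
    """Hypothetical stand-in for CERE (assumption, not the paper's design):
    average the two modalities, then mean-pool over the segment axis."""
    fused = 0.5 * (audio_seg + visual_seg)   # shape (T, D)
    return fused.mean(axis=0)                # shape (D,)

def consistency_enhance(segments, video_repr):
    """Hypothetical stand-in for ISCE (assumption, not the paper's design):
    weight each segment by its similarity to the video-level vector and
    add the weighted video-level prior back to the segment features."""
    scores = segments @ video_repr / np.sqrt(segments.shape[1])  # shape (T,)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return segments + weights[:, None] * video_repr[None, :]     # shape (T, D)

# Toy inputs: T segments with D-dimensional features per modality (assumed sizes).
rng = np.random.default_rng(0)
T, D = 10, 8
audio = rng.standard_normal((T, D))
visual = rng.standard_normal((T, D))

v = video_level_representation(audio, visual)
audio_enh = consistency_enhance(audio, v)
visual_enh = consistency_enhance(visual, v)
```

The same video-level vector is applied separately to the audio and visual streams, which mirrors the abstract's description of a cross-modal representation guiding each modality individually.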




Multi-Modulation Network for Audio-Visual Event Localization

We study the problem of localizing audio-visual events that are both aud...

Investigating Modality Bias in Audio Visual Video Parsing

We focus on the audio-visual video parsing (AVVP) problem that involves ...

Positive Sample Propagation along the Audio-Visual Event Line

Visual and audio signals often coexist in natural environments, forming ...

Decompose the Sounds and Pixels, Recompose the Events

In this paper, we propose a framework centering around a novel architect...

Dual-modality seq2seq network for audio-visual event localization

Audio-visual event localization requires one to identify the event which ...

Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching

This paper proposes an approach to Dense Video Captioning (DVC) without ...

Contrastive Positive Sample Propagation along the Audio-Visual Event Line

Visual and audio signals often coexist in natural environments, forming ...
