Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams

06/21/2021
by   Yuanbo Hou, et al.

Many previous audio-visual voice-related works focus on speech, ignoring the singing voice in the growing number of musical video streams on the Internet. Voice activity detection is a necessary step in processing such diverse musical video data. This paper attempts to detect the speech and singing voices of target performers in musical video streams using audio-visual information. To integrate the audio and visual modalities, a multi-branch network is proposed to learn audio and image representations; these representations are then fused by attention based on semantic similarity, which shapes the acoustic representations through the probability of anchor vocalization. Experiments show that the proposed audio-visual multi-branch network far outperforms the audio-only model in challenging acoustic environments, indicating that cross-modal information fusion based on semantic correlation is sensible and successful.
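
The abstract does not spell out the fusion mechanism, so the following is only a minimal sketch of one way similarity-based attention could gate acoustic features; it is not the authors' network. It assumes per-frame audio and visual embeddings of equal dimension, uses cosine similarity as the "semantic similarity", and maps it through a sigmoid to a weight that loosely plays the role of a vocalization probability. The names `fuse_by_similarity`, `cosine_similarity`, and the `scale` parameter are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fuse_by_similarity(audio_frames, visual_frames, scale=5.0):
    """Gate each audio frame by its similarity to the matching visual frame.

    Assumption (not from the paper): the attention weight for a frame is
    sigmoid(scale * cosine_similarity(audio, visual)), and the fused
    representation is the audio embedding multiplied by that weight.
    Returns (fused_frames, weights).
    """
    if len(audio_frames) != len(visual_frames):
        raise ValueError("audio and visual sequences must have the same length")
    fused, weights = [], []
    for audio, visual in zip(audio_frames, visual_frames):
        weight = sigmoid(scale * cosine_similarity(audio, visual))
        weights.append(weight)
        fused.append([weight * x for x in audio])
    return fused, weights


if __name__ == "__main__":
    # Toy 2-D embeddings: agreeing, opposing, and unrelated audio/visual frames.
    audio = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    visual = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
    fused, weights = fuse_by_similarity(audio, visual)
    for w, f in zip(weights, fused):
        print(round(w, 3), [round(x, 3) for x in f])
```

In this toy setup, frames whose visual embedding agrees with the audio embedding keep most of their acoustic energy, while frames with conflicting visual evidence are suppressed. A learned model would replace the fixed cosine gate with trained projections.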

