Audio-Visual Grouping Network for Sound Localization from Mixtures

by Shentong Mo, et al.

Sound source localization is a typical and challenging task that predicts the location of sound sources in a video. Previous single-source methods mainly used audio-visual association as a cue to localize sounding objects in each image. Because multiple sound sources are mixed together in the original space, few multi-source approaches exist that localize multiple sources simultaneously; the exception is one recent work that uses a contrastive random walk in a graph with images and separated sounds as nodes. Despite its promising performance, that approach can only handle a fixed number of sources and cannot learn compact class-aware representations for individual sources. To alleviate these shortcomings, we propose a novel audio-visual grouping network, AVGN, that directly learns category-wise semantic features for each source from the input audio mixture and image in order to localize multiple sources simultaneously. Specifically, AVGN leverages learnable audio-visual class tokens to aggregate class-aware source features. The aggregated semantic features for each source are then used as guidance to localize the corresponding visual regions. Compared to existing multi-source methods, our framework can localize a flexible number of sources and disentangle category-aware audio-visual representations for individual sound sources. We conduct extensive experiments on the MUSIC, VGGSound-Instruments, and VGG-Sound Sources benchmarks. The results demonstrate that AVGN achieves state-of-the-art sounding object localization performance in both single-source and multi-source scenarios. Code is available at <>.
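The grouping idea in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: it assumes that each class token aggregates features through a single softmax attention step, and that a localization map is the cosine similarity between each aggregated class feature and each visual patch feature. All names (`group_features`, `localization_maps`, `demo`), the dimensions, and the random inputs are hypothetical and stand in for learned tokens and encoder outputs.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / (norm + 1e-8)


def group_features(class_tokens, features):
    """Assumed grouping step: each class token attends over all features
    and returns their attention-weighted average (one vector per class)."""
    dim = len(features[0])
    grouped = []
    for token in class_tokens:
        weights = softmax([dot(token, f) / math.sqrt(dim) for f in features])
        grouped.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return grouped


def localization_maps(class_features, patch_features):
    """One similarity score per (class, visual patch) pair."""
    return [[cosine(c, p) for p in patch_features] for c in class_features]


def demo(num_classes=3, num_audio_frames=8, num_patches=16, dim=4, seed=0):
    # Random vectors stand in for learned tokens and encoder outputs.
    rng = random.Random(seed)

    def rand_vec():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    class_tokens = [rand_vec() for _ in range(num_classes)]
    audio_features = [rand_vec() for _ in range(num_audio_frames)]
    patch_features = [rand_vec() for _ in range(num_patches)]

    # Aggregate class-aware features from the audio mixture, then use them
    # as guidance to score every visual patch.
    audio_class_features = group_features(class_tokens, audio_features)
    return localization_maps(audio_class_features, patch_features)
```

Calling `demo()` returns one list of patch scores per class token. Because the number of maps follows the number of class tokens rather than a fixed source count, a sketch like this suggests how a flexible number of sources could be handled by keeping only the maps of classes judged to be present.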




Class-Incremental Grouping Network for Continual Audio-Visual Learning

Continual learning is a challenging problem in which models need to be t...

Mix and Localize: Localizing Sound Sources in Mixtures

We present a method for simultaneously localizing multiple sound sources...

Class-aware Sounding Objects Localization via Audiovisual Correspondence

Audiovisual scenes are pervasive in our daily life. It is commonplace fo...

Prompting Segmentation with Sound is Generalizable Audio-Visual Source Localizer

Never having seen an object and heard its sound simultaneously, can the ...

Sound Localization and Separation in Three-dimensional Space Using a Single Microphone with a Metamaterial Enclosure

Conventional approaches to sound localization and separation are based o...

Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

Self-supervised sound source localization is usually challenged by the m...

Multi-goal Audio-visual Navigation using Sound Direction Map

Over the past few years, there has been a great deal of research on navi...
