AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

09/19/2023
by Yuan Tseng, et al.

Audio-visual representation learning aims to develop systems with human-like perception by exploiting correlations between auditory and visual information. However, current models often focus on a limited set of tasks, and the generalization abilities of the learned representations are unclear. To this end, we propose the AV-SUPERB benchmark, which enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations on 7 datasets covering 5 audio-visual tasks in speech and audio processing. We evaluate 5 recent self-supervised models and show that none of them generalizes to all tasks, emphasizing the need for future study on improving universal model performance. In addition, we show that representations may be improved with intermediate-task fine-tuning, and that audio event classification with AudioSet serves as a strong intermediate task. We release our benchmark with evaluation code and a model submission platform to encourage further research in audio-visual learning.

research
05/04/2020

Does Visual Self-Supervision Improve Learning of Speech Representations?

Self-supervised learning has attracted plenty of recent research interes...
research
05/23/2023

Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation

Self-supervised learning general-purpose audio representations have demo...
research
12/08/2020

I'm Sorry for Your Loss: Spectrally-Based Audio Distances Are Bad at Pitch

Growing research demonstrates that synthetic failure modes imply poor ge...
research
03/06/2022

HEAR 2021: Holistic Evaluation of Audio Representations

What audio embedding approach generalizes best to a wide range of downst...
research
08/01/2023

AnyLoc: Towards Universal Visual Place Recognition

Visual Place Recognition (VPR) is vital for robot localization. To date,...
research
11/23/2021

Towards Learning Universal Audio Representations

The ability to learn universal audio representations that can solve dive...
research
11/28/2022

Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation

Current computer vision models, unlike the human visual system, cannot y...
