Audio Retrieval with Natural Language Queries: A Benchmark Study

12/17/2021
by A. Sophia Koepke, et al.

The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the goal is to retrieve the audio content from a pool of candidates that best matches a given written description and vice versa. Text-audio retrieval enables users to search large databases through an intuitive interface: they simply issue free-form natural language descriptions of the sound they would like to hear. To study the tasks of text-audio and audio-text retrieval, which have received limited attention in the existing literature, we introduce three challenging new benchmarks. We first construct text-audio and audio-text retrieval benchmarks from the AudioCaps and Clotho audio captioning datasets. Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho. We employ these three benchmarks to establish baselines for cross-modal text-audio and audio-text retrieval, where we demonstrate the benefits of pre-training on diverse audio tasks. We hope that our benchmarks will inspire further research into audio retrieval with free-form text queries. Code, audio features for all datasets used, and the SoundDescs dataset are publicly available at https://github.com/akoepke/audio-retrieval-benchmark.
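
To make the retrieval setup concrete, here is a minimal sketch of how text-audio and audio-text retrieval over a pool of candidates might be scored. It is not the authors' implementation. It assumes that some text encoder and some audio encoder have already mapped each description and each clip into a shared embedding space, that candidates are ranked by cosine similarity, and that query i is paired with exactly one correct candidate i (captioning datasets often provide several captions per clip, which this sketch ignores). The names `cosine_similarity_matrix` and `recall_at_k`, and the random stand-in embeddings, are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(query_emb, candidate_emb):
    """Return sim[i, j] = cosine similarity of query i and candidate j."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = candidate_emb / np.linalg.norm(candidate_emb, axis=1, keepdims=True)
    return q @ c.T


def recall_at_k(sim, k):
    """Fraction of queries whose paired candidate is ranked in the top k.

    Assumes the correct candidate for query i is candidate i.
    The rank is the number of candidates scoring strictly higher,
    so ties are resolved in favour of the correct candidate.
    """
    correct = np.diag(sim)[:, None]
    ranks = (sim > correct).sum(axis=1)
    return float((ranks < k).mean())


# Stand-in embeddings (assumption): real ones would come from trained encoders.
rng = np.random.default_rng(0)
audio_emb = rng.normal(size=(5, 8))
text_emb = audio_emb + 0.01 * rng.normal(size=(5, 8))

sim = cosine_similarity_matrix(text_emb, audio_emb)
text_to_audio_r1 = recall_at_k(sim, k=1)    # text query, audio candidates
audio_to_text_r1 = recall_at_k(sim.T, k=1)  # audio query, text candidates
```

Transposing the similarity matrix swaps the roles of query and candidate, which is why one matrix serves both retrieval directions in this sketch.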

research
05/05/2021

Audio Retrieval with Natural Language Queries

We consider the task of retrieving audio using free-form natural languag...
research
09/28/2022

Audio Retrieval with WavText5K and CLAP Training

Audio-Text retrieval takes a natural language query to retrieve relevant...
research
03/19/2023

Audio-Text Models Do Not Yet Leverage Natural Language

Multi-modal contrastive learning techniques in the audio-text domain hav...
research
03/25/2022

Audio-text Retrieval in Context

Audio-text retrieval based on natural language descriptions is a challen...
research
11/22/2020

QuerYD: A video dataset with high-quality textual and audio narrations

We introduce QuerYD, a new large-scale dataset for retrieval and event l...
research
02/23/2023

Data leakage in cross-modal retrieval training: A case study

The recent progress in text-based audio retrieval was largely propelled ...
research
02/28/2023

Audio Retrieval for Multimodal Design Documents: A New Dataset and Algorithms

We consider and propose a new problem of retrieving audio files relevant...
