Killing two birds with one stone: Can an audio captioning system also be used for audio-text retrieval?

by Etienne Labbé, et al.

Automated Audio Captioning (AAC) aims to develop systems capable of describing an audio recording with a textual sentence. In contrast, Audio-Text Retrieval (ATR) systems seek to find the best matching audio recording(s) for a given textual query (Text-to-Audio) or vice versa (Audio-to-Text). These tasks require different types of systems: AAC employs a sequence-to-sequence model, while ATR uses a ranking model that compares audio and text representations within a shared projection subspace. This work nevertheless investigates the relationship between AAC and ATR by exploring the ATR capabilities of an unmodified AAC system, without fine-tuning for the new task. Our AAC system consists of an audio encoder (ConvNeXt-Tiny) trained on AudioSet for audio tagging, and a transformer decoder responsible for generating sentences. For AAC, it achieves a high SPIDEr-FL score of 0.298 on Clotho and 0.472 on AudioCaps on average. For ATR, we propose using the standard Cross-Entropy loss values obtained for any audio/caption pair as the matching score. Experimental results on the Clotho and AudioCaps datasets demonstrate decent recall values using this simple approach. For instance, we obtained a Text-to-Audio R@1 value of 0.382 on AudioCaps, which is above the current state-of-the-art method without external data. Interestingly, we observe that normalizing the loss values was necessary for Audio-to-Text retrieval.
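The loss-based retrieval idea can be illustrated with a minimal sketch. It assumes a precomputed matrix `loss[a][c]` holding the Cross-Entropy loss of caption `c` given audio `a`, with the ground-truth caption of audio `i` at index `i`. The toy values, the function names, and the per-caption z-score normalization are all illustrative assumptions; the abstract does not specify which normalization the authors use.

```python
from statistics import mean, pstdev


def recall_at_k(scores, k=1):
    """scores[q][c]: lower is better; the correct candidate for query q is index q."""
    hits = 0
    for q, row in enumerate(scores):
        ranked = sorted(range(len(row)), key=lambda c: row[c])
        if q in ranked[:k]:
            hits += 1
    return hits / len(scores)


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def normalize_per_caption(loss):
    """Z-score each caption's losses over all audios (an assumed scheme)."""
    normed_cols = []
    for col in transpose(loss):
        mu, sigma = mean(col), pstdev(col)
        normed_cols.append([(x - mu) / sigma if sigma > 0 else 0.0 for x in col])
    return transpose(normed_cols)


# Hypothetical loss matrix: rows are audios, columns are captions.
# Caption 1 has a low loss for every audio, caption 2 a high one.
loss = [
    [2.0, 1.5, 5.0],
    [3.0, 1.0, 5.5],
    [3.5, 2.5, 4.0],
]

# Text-to-Audio: each caption is a query, audios are ranked by loss.
t2a_r1 = recall_at_k(transpose(loss), k=1)

# Audio-to-Text: each audio is a query, captions are ranked by loss.
a2t_r1_raw = recall_at_k(loss, k=1)
a2t_r1_norm = recall_at_k(normalize_per_caption(loss), k=1)

print(t2a_r1, a2t_r1_raw, a2t_r1_norm)
```

In this toy setup, raw losses favor the caption that is cheap to generate for every audio, so Audio-to-Text ranking degrades until each caption's losses are put on a comparable scale. This is one plausible reading of why normalization matters, not a claim about the authors' analysis.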




