Clotho: An Audio Captioning Dataset

10/21/2019
by Konstantinos Drossos, et al.

Audio captioning is the novel task of describing general audio content in free text. It is an intermodal translation task (not speech-to-text): a system accepts an audio signal as input and outputs a textual description (i.e., the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples, each 15 to 30 seconds long, and 24 905 captions, each 8 to 20 words long, together with a baseline method that provides initial results. Clotho is built with a focus on audio content and caption diversity, and the data splits do not hamper the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk with annotators from English-speaking countries. Unique words, named entities, and speech transcription are removed in post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).
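
As a rough illustration of the figures quoted in the abstract, the following Python sketch encodes the stated ranges (15 to 30 seconds per audio sample, 8 to 20 words per caption) and derives the average number of captions per sample from the two totals. The function names and the whitespace-based word count are assumptions made for this example; they are not taken from the Clotho tooling.

```python
# Figures quoted in the abstract.
NUM_AUDIO_SAMPLES = 4981
NUM_CAPTIONS = 24905
MIN_SECONDS, MAX_SECONDS = 15.0, 30.0
MIN_WORDS, MAX_WORDS = 8, 20


def duration_in_range(seconds: float) -> bool:
    """Return True if a clip length falls in the stated 15-30 s range."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS


def caption_in_range(caption: str) -> bool:
    """Return True if a caption has 8-20 words.

    Assumption: words are counted by splitting on whitespace, which may
    differ from how the dataset authors counted them.
    """
    return MIN_WORDS <= len(caption.split()) <= MAX_WORDS


# Average captions per audio sample implied by the two totals.
captions_per_sample = NUM_CAPTIONS / NUM_AUDIO_SAMPLES
```

With these totals, `captions_per_sample` works out to exactly five captions per audio sample.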

research
07/22/2019

Crowdsourcing a Dataset of Audio Captions

Audio captioning is a novel field of multi-modal translation and it is t...
research
02/22/2022

Hidden bawls, whispers, and yelps: can text be made to sound more than just its words?

Whether a word was bawled, whispered, or yelped, captions will typically...
research
04/15/2018

Transcribing Lyrics From Commercial Song Audio: The First Step Towards Singing Content Processing

Spoken content processing (such as retrieval and browsing) is maturing, ...
research
06/27/2020

Listen carefully and tell: an audio captioning system based on residual learning and gammatone audio representation

Automated audio captioning is machine listening task whose goal is to de...
research
07/06/2020

Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning

Audio captioning is the task of automatically creating a textual descrip...
research
05/13/2022

Joint Generation of Captions and Subtitles with Dual Decoding

As the amount of audio-visual content increases, the need to develop aut...
research
05/15/2023

A Whisper transformer for audio captioning trained with synthetic captions and transfer learning

The field of audio captioning has seen significant advancements in recen...
