Audio Captioning with Composition of Acoustic and Semantic Information

05/13/2021
by Ayşegül Özkaya Eren, et al.

Generating audio captions is a new research area that combines audio and natural language processing to create meaningful textual descriptions for audio clips. To address this problem, previous studies mostly use encoder-decoder based models without considering semantic information. To fill this gap, we present a novel encoder-decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings. We extract semantic embeddings by obtaining subjects and verbs from the audio clip captions and combine these embeddings with audio embeddings to feed the BiGRU-based encoder-decoder model. To enable semantic embeddings for the test audio clips, we introduce a Multilayer Perceptron classifier to predict the semantic embeddings of those clips. We also present exhaustive experiments to show the efficiency of different features and datasets for our proposed model on the audio captioning task. To extract audio features, we use log Mel energy features, VGGish embeddings, and pretrained audio neural network (PANN) embeddings. Extensive experiments on two audio captioning datasets, Clotho and AudioCaps, show that our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics and that using the semantic information improves the captioning performance. Keywords: Audio captioning; PANNs; VGGish; GRU; BiGRU.
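
The combination of semantic and audio embeddings described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the semantic embedding is encoded or how it is combined with the audio embedding, so the multi-hot encoding and the per-frame concatenation below are assumptions, not the authors' confirmed method. The function names (`build_vocab`, `multi_hot`, `combine`) and the toy data are hypothetical, and the subjects and verbs are assumed to have been extracted from the captions already.

```python
from typing import Dict, List, Sequence


def build_vocab(keyword_lists: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map every distinct keyword to a fixed index, in sorted order."""
    words = sorted({w.lower() for kws in keyword_lists for w in kws})
    return {w: i for i, w in enumerate(words)}


def multi_hot(keywords: Sequence[str], vocab: Dict[str, int]) -> List[float]:
    """Encode one clip's keywords as a multi-hot vector over the vocabulary."""
    vec = [0.0] * len(vocab)
    for w in keywords:
        idx = vocab.get(w.lower())
        if idx is not None:  # keywords outside the vocabulary are ignored
            vec[idx] = 1.0
    return vec


def combine(audio_frames: Sequence[Sequence[float]],
            semantic: Sequence[float]) -> List[List[float]]:
    """Append the clip-level semantic vector to every audio frame (assumed scheme)."""
    return [list(frame) + list(semantic) for frame in audio_frames]


# Hypothetical subjects and verbs, assumed to be already extracted from captions.
train_keywords = [["dog", "bark"], ["rain", "fall"], ["dog", "run"]]
vocab = build_vocab(train_keywords)

# Toy audio embedding: 2 frames of 3 values (real embeddings are much larger).
audio = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
semantic = multi_hot(["dog", "bark"], vocab)
encoder_input = combine(audio, semantic)
```

In this sketch the semantic vector comes from the clip's own caption, which is only possible for training clips. For test clips, the abstract states that a Multilayer Perceptron classifier predicts the semantic embedding instead, so its output would take the place of `semantic` here.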


research
10/04/2021

Audio Captioning Using Sound Event Detection

This technical report proposes an audio captioning system for DCASE 2021...
research
06/05/2020

Audio Captioning using Gated Recurrent Units

Audio captioning is a recently proposed task for automatically generatin...
research
08/05/2021

An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning

Automated audio captioning aims to use natural language to describe the ...
research
01/10/2022

Local Information Assisted Attention-free Decoder for Audio Captioning

Automated audio captioning (AAC) aims to describe audio data with captio...
research
09/01/2023

CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding

Automated Audio Captioning (AAC) involves generating natural language de...
research
06/30/2017

Automated Audio Captioning with Recurrent Neural Networks

We present the first approach to automated audio captioning. We employ a...
research
06/01/2023

Encoder-decoder multimodal speaker change detection

The task of speaker change detection (SCD), which detects points where s...
