Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks

by Darius Petermann, et al.

Emulating the human ability to solve the cocktail party problem, i.e., to focus on a source of interest in a complex acoustic scene, is a long-standing goal of audio source separation research. Much of this research investigates separating speech from noise, speech from speech, musical instruments from each other, or sound events from each other. In this paper, we focus on the cocktail fork problem, which takes a three-pronged approach to source separation by separating an audio mixture such as a movie soundtrack or podcast into the three broad categories of speech, music, and sound effects (SFX, understood to include ambient noise and natural sound events). We benchmark the performance of several deep learning-based source separation models on this task and evaluate them with respect to simple objective measures such as signal-to-distortion ratio (SDR) as well as objective metrics that better correlate with human perception. Furthermore, we evaluate how source separation can influence downstream transcription tasks. First, we investigate the task of activity detection on the three sources as a way to both further improve source separation and perform transcription. We then formulate the transcription tasks as speech recognition for speech and audio tagging for music and SFX. We observe that, while using source separation estimates improves transcription performance compared with using the original soundtrack, performance is still sub-optimal due to artifacts introduced by the separation process. Therefore, we investigate how remixing the three separated source stems at various relative levels can reduce artifacts and consequently improve transcription performance. We find that remixing music and SFX interferences at a target SNR of 17.5 dB reduces the speech recognition word error rate, and we observe a similar benefit from remixing when tagging music and SFX content.
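To make the two quantities named in the abstract concrete, the following is a minimal sketch of a plain SDR computation and of remixing separated stems at a target SNR. It is not the authors' implementation. The function names (`power`, `sdr_db`, `remix_at_snr`) are hypothetical, and the sketch assumes that SNR means the ratio of the speech stem's average power to the average power of the summed music and SFX stems, measured over the whole clip. The abstract does not state the exact definition used in the paper.

```python
import numpy as np


def power(x):
    """Average power of a signal (mean of squared samples)."""
    return float(np.mean(np.square(x)))


def sdr_db(reference, estimate, eps=1e-12):
    """Plain signal-to-distortion ratio in dB.

    Assumption: distortion is simply reference minus estimate,
    with no scale or filter compensation.
    """
    distortion = reference - estimate
    return 10.0 * np.log10((power(reference) + eps) / (power(distortion) + eps))


def remix_at_snr(speech, music, sfx, target_snr_db=17.5, eps=1e-12):
    """Add music + SFX back to the speech stem at a target SNR.

    Assumption: the interference is the sum of the music and SFX stems,
    scaled by a single gain so that speech power over scaled interference
    power equals target_snr_db. Returns the remix and the applied gain.
    """
    interference = music + sfx
    snr_linear = 10.0 ** (target_snr_db / 10.0)
    gain = np.sqrt(power(speech) / (power(interference) * snr_linear + eps))
    return speech + gain * interference, gain


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random noise stands in for separated stems (illustration only).
    speech = rng.standard_normal(16000)
    music = rng.standard_normal(16000)
    sfx = rng.standard_normal(16000)

    remix, gain = remix_at_snr(speech, music, sfx, target_snr_db=17.5)
    achieved = 10.0 * np.log10(power(speech) / power(gain * (music + sfx)))
    print(f"gain = {gain:.4f}, achieved SNR = {achieved:.2f} dB")
```

In this sketch the remix would be fed to the speech recognizer in place of the isolated speech estimate. The same function applies to the tagging tasks by swapping which stem plays the role of the target.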




The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks

The cocktail party problem aims at isolating any source of interest with...

Joint Online Multichannel Acoustic Echo Cancellation, Speech Dereverberation and Source Separation

This paper presents a joint source separation algorithm that simultaneou...

Upsampling layers for music source separation

Upsampling artifacts are caused by problematic upsampling layers and due...

An Improved Measure of Musical Noise Based on Spectral Kurtosis

Audio processing methods operating on a time-frequency representation of...

Breaking Speech Recognizers to Imagine Lyrics

We introduce a new method for generating text, and in particular song ly...

A Hands-on Comparison of DNNs for Dialog Separation Using Transfer Learning from Music Source Separation

This paper describes a hands-on comparison on using state-of-the-art mus...

Does Phase Matter For Monaural Source Separation?

The "cocktail party" problem of fully separating multiple sources from a...
