Joined Audio-Visual Speech Enhancement and Recognition in the Cocktail Party: The Tug Of War Between Enhancement and Recognition Losses

04/16/2019
by Luca Pasa, et al.

In this paper, we propose an end-to-end LSTM-based model that performs single-channel speech enhancement and phone recognition in a cocktail party scenario where visual information of the target speaker is available. In the speech enhancement phase, the proposed system uses a "visual attention" signal of the speaker of interest to extract her speech from the input mixed-speech signal, while in the ASR phase it recognizes her phone sequence through a phone recognizer trained with a CTC loss. It is well known that learning multiple related tasks from data simultaneously can yield better performance than learning these tasks independently; we therefore train the model by optimizing both tasks at the same time. This also allowed us to explore whether (and how) this joint optimization leads to better results. We analyzed different training strategies, which revealed some interesting and unexpected behaviors. In particular, the experiments demonstrated that while the ASR phase is being optimized, the speech enhancement capability of the model decreases significantly, and vice versa. We evaluated our approach on mixed-speech versions of GRID and TCD-TIMIT. The results show a remarkable drop in Phone Error Rate (PER) compared to the audio-visual baseline models trained only to perform phone recognition.
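For readers unfamiliar with the metric, PER is conventionally computed as the edit (Levenshtein) distance between the recognized and reference phone sequences, divided by the number of reference phones. The sketch below illustrates that conventional definition only; the abstract does not say which scoring tool the authors used, and the function names `edit_distance` and `phone_error_rate` are hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences."""
    # prev[j] is the distance between the ref prefix processed so far and hyp[:j].
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: a reference phone is missing
                curr[j - 1] + 1,     # insertion: an extra hypothesis phone
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Conventional PER: edit distance divided by the reference length.

    Illustrative helper, not the authors' evaluation code.
    """
    if len(ref) == 0:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, one substituted phone in a three-phone reference gives a PER of 1/3, and the value can exceed 1 when the hypothesis contains many insertions.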


