Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement

09/19/2023
by Rui-Chen Zheng, et al.

Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech using additional visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes incorporating ultrasound tongue images to further improve the performance of lip-based AV-SE systems. Because ultrasound tongue images are difficult to acquire during inference, we first employ knowledge distillation during training to investigate whether tongue-related information can be leveraged without directly inputting ultrasound tongue images. Specifically, we guide an audio-lip speech enhancement student model to learn from a pre-trained audio-lip-tongue speech enhancement teacher model, thereby transferring tongue-related knowledge. To better model the alignment between the lip and tongue modalities, we further introduce a lip-tongue key-value memory network into the AV-SE model. This network retrieves tongue features from readily available lip features, which then assist the subsequent speech enhancement task. Experimental results demonstrate that both methods significantly improve the quality and intelligibility of the enhanced speech compared to traditional lip-based AV-SE baselines. Both methods also generalize well to unseen speakers and unseen noises. Furthermore, phone error rate (PER) analysis of automatic speech recognition (ASR) reveals that while all phonemes benefit from introducing ultrasound tongue images, palatal and velar consonants benefit most.
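The two ideas in the abstract, retrieving tongue features from lip features through a key-value memory and training a student against a teacher, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the cosine-similarity addressing, the `temperature` parameter, the mean-squared-error losses, and the `alpha` weighting are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def retrieve_tongue_features(lip_feats, keys, values, temperature=0.1):
    """Hypothetical key-value memory read.

    lip_feats: (T, D_k) per-frame lip features used as queries.
    keys:      (N, D_k) memory keys living in the lip feature space.
    values:    (N, D_v) memory values living in the tongue feature space.
    Returns the retrieved tongue features (T, D_v) and the
    addressing weights (T, N).
    """
    # Cosine similarity between each query frame and each memory key
    # (assumed addressing scheme).
    q = lip_feats / (np.linalg.norm(lip_feats, axis=1, keepdims=True) + 1e-8)
    k = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    weights = softmax(q @ k.T / temperature, axis=1)
    # Weighted sum of the value slots gives the estimated tongue features.
    return weights @ values, weights


def distillation_loss(student_out, teacher_out, target, alpha=0.5):
    """Assumed distillation objective: a task loss against the clean
    target plus a term pulling the audio-lip student towards the
    audio-lip-tongue teacher's output."""
    task = np.mean((student_out - target) ** 2)
    distill = np.mean((student_out - teacher_out) ** 2)
    return (1.0 - alpha) * task + alpha * distill


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lip = rng.normal(size=(5, 8))       # 5 frames of 8-dim lip features
    keys = rng.normal(size=(16, 8))     # 16 memory slots
    values = rng.normal(size=(16, 12))  # 12-dim tongue features per slot
    tongue, w = retrieve_tongue_features(lip, keys, values)
    print(tongue.shape, w.shape)
```

In a trained system the keys and values would be learned parameters and the retrieved features would be fed to the enhancement network alongside the audio and lip streams; here they are random arrays used only to show the shapes involved.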

research
05/24/2023

Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement through Knowledge Distillation

Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech ...
research
11/18/2021

Towards Intelligibility-Oriented Audio-Visual Speech Enhancement

Existing deep learning (DL) based speech enhancement approaches are gene...
research
02/20/2023

Improving Speech Enhancement via Event-based Query

Existing deep learning based speech enhancement (SE) methods either use ...
research
07/01/2019

Synchronising audio and ultrasound by learning cross-modal embeddings

Audiovisual synchronisation is the task of determining the time offset b...
research
03/30/2017

Audio-Visual Speech Enhancement based on Multimodal Deep Convolutional Neural Networks

Speech enhancement (SE) aims to reduce noise in speech signals. Most SE ...
research
09/21/2020

Correlating Subword Articulation with Lip Shapes for Embedding Aware Audio-Visual Speech Enhancement

In this paper, we propose a visual embedding approach to improving embed...
research
10/26/2018

Scaling Speech Enhancement in Unseen Environments with Noise Embeddings

We address the problem of speech enhancement generalisation to unseen en...
