Jointly optimal denoising, dereverberation, and source separation
This paper proposes methods for optimizing a Convolutional BeamFormer (CBF) to perform denoising, dereverberation, and source separation (DN+DR+SS) simultaneously. Conventionally, a cascade configuration composed of a Weighted Prediction Error minimization (WPE) dereverberation filter followed by a Minimum Variance Distortionless Response (MVDR) beamformer has been used as the state-of-the-art frontend for far-field speech recognition; however, the overall optimality of this approach is not guaranteed. In the blind signal processing area, an approach for jointly optimizing dereverberation and source separation (DR+SS) has been proposed; however, it incurs a huge computing cost and has not been extended to DN+DR+SS. To overcome these limitations, this paper develops new approaches that optimize DN+DR+SS in a computationally much more efficient way. To this end, we introduce two different techniques for factorizing a CBF into WPE filters and beamformers, one based on an extension of the conventional joint optimization approach proposed for DR+SS and the other based on a novel factorization technique, and we derive methods for optimizing them for DN+DR+SS based on maximum likelihood estimation with neural network-supported steering vector estimation. Experiments using noisy reverberant sound mixtures show that the proposed optimization approaches greatly improve speech enhancement performance over the conventional cascade configuration in terms of both signal distortion measures and automatic speech recognition (ASR) performance. It is also shown that the proposed approaches greatly reduce the computing cost while improving estimation accuracy in comparison with the conventional joint optimization approach.
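For readers unfamiliar with the MVDR beamformer used in the conventional cascade, the following is a minimal sketch of the textbook MVDR weight formula for a single frequency bin, w = R^{-1} v / (v^H R^{-1} v), where R is a noise spatial covariance matrix and v is a steering vector. It is not the optimization method proposed in the paper. The function name `mvdr_weights`, the microphone count, and the randomly generated covariance and steering vector are all assumptions made purely for illustration.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """Textbook MVDR weights w = R^{-1} v / (v^H R^{-1} v) for one frequency bin."""
    # Solve R x = v rather than forming the explicit inverse of R.
    rinv_v = np.linalg.solve(noise_cov, steering)
    return rinv_v / (steering.conj() @ rinv_v)

# Hypothetical inputs (not data from the paper): 4 microphones,
# a random Hermitian positive-definite noise covariance,
# and a random complex steering vector.
rng = np.random.default_rng(0)
n_mics = 4
a = rng.standard_normal((n_mics, n_mics)) + 1j * rng.standard_normal((n_mics, n_mics))
noise_cov = a @ a.conj().T + n_mics * np.eye(n_mics)
steering = rng.standard_normal(n_mics) + 1j * rng.standard_normal(n_mics)

w = mvdr_weights(noise_cov, steering)

# Distortionless constraint: the response toward the steering vector is 1.
response = w.conj() @ steering

# Output noise power of MVDR versus a plain matched filter that
# satisfies the same distortionless constraint.
w_matched = steering / (steering.conj() @ steering)
mvdr_noise_power = (w.conj() @ noise_cov @ w).real
matched_noise_power = (w_matched.conj() @ noise_cov @ w_matched).real
```

Here `response` checks the distortionless property, and the two power values compare MVDR with another beamformer under the same constraint. In the cascade described above, weights of this kind would be applied per frequency bin to the output of the WPE filter, with the steering vector estimated separately.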