End to End Bangla Speech Synthesis

08/01/2021
by Prithwiraj Bhattacharjee, et al.

A Text-to-Speech (TTS) system synthesizes speech from a given text. The main approaches to implementing a TTS system include concatenative synthesis, Hidden Markov Model (HMM) based synthesis, and Deep Learning (DL) based synthesis with multiple building blocks. Here, we present our deep learning-based end-to-end Bangla speech synthesis system. It has been implemented with minimal human annotation using only three major components: an encoder, a decoder, and a post-processing net that includes waveform synthesis. It requires neither a frontend preprocessor nor a Grapheme-to-Phoneme (G2P) converter. Our model has been trained on 20 hours of phonetically balanced, single-speaker speech data. In subjective evaluation it obtained a Mean Opinion Score (MOS) of 3.79 on a scale of 5.0, and in objective evaluation it obtained a Perceptual Evaluation of Speech Quality (PESQ) score of 0.77 on a scale of [-0.5, 4.5]. In terms of naturalness, it outperforms all existing non-commercial state-of-the-art Bangla TTS systems.
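
As a rough illustration of how a subjective score such as the MOS above is typically derived, the sketch below averages listener ratings on a 1 to 5 scale and attaches a normal-approximation 95% confidence interval. The function name `mean_opinion_score` and the example ratings are hypothetical and are not taken from the paper; the authors' actual listening-test protocol is not described in this abstract.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, half_width) for a list of 1-5 listener ratings.

    half_width is a normal-approximation 95% confidence interval
    half-width; this is an assumed convention, not the paper's method.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    mos = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings for illustration only; not data from the paper.
example_ratings = [4, 4, 3, 5, 4, 3, 4, 4]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

In practice, each listener usually rates many utterances, so ratings are often averaged per listener or per utterance before the interval is computed.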

research
06/26/2019

RUSLAN: Russian Spoken Language Corpus for Speech Synthesis

We present RUSLAN -- a new open Russian spoken language corpus for the t...
research
03/29/2017

Tacotron: Towards End-to-End Speech Synthesis

A text-to-speech synthesis system typically consists of multiple stages,...
research
07/06/2021

Location, Location: Enhancing the Evaluation of Text-to-Speech Synthesis Using the Rapid Prosody Transcription Paradigm

Text-to-Speech synthesis systems are generally evaluated using Mean Opin...
research
04/20/2021

Review of end-to-end speech synthesis technology based on deep learning

As an indispensable part of modern human-computer interaction system, sp...
research
10/31/2022

The Importance of Accurate Alignments in End-to-End Speech Synthesis

Unit selection synthesis systems required accurate segmentation and labe...
research
06/29/2020

Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis

Recent advances in deep learning methods have elevated synthetic speech ...
research
04/08/2021

Flavored Tacotron: Conditional Learning for Prosodic-linguistic Features

Neural sequence-to-sequence text-to-speech synthesis (TTS), such as Taco...
