Representation Mixing for TTS Synthesis

11/17/2018
by Kyle Kastner, et al.

Recent character- and phoneme-based parametric TTS systems using deep learning have shown strong performance in natural speech generation. However, the choice between character and phoneme input can create serious limitations for practical deployment, as direct control of pronunciation is crucial in certain cases. We demonstrate a simple method, named representation mixing, for combining multiple types of linguistic information in a single encoder, enabling flexible choice between character, phoneme, or mixed representations during inference. Experiments and user studies on a public audiobook corpus show the efficacy of our approach.
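The abstract does not spell out how the mixing is done, so the following is only a minimal sketch of one plausible input-side scheme, not the authors' implementation. It assumes that each word is rendered either as characters or as phonemes, and that a parallel mask records which type each symbol is, so that a single encoder could consume both. The lexicon, the function name `mix_words`, and the parameter `p_phoneme` are all hypothetical.

```python
import random

# Hypothetical toy pronunciation lexicon (ARPAbet-style symbols), for illustration only.
TOY_LEXICON = {
    "read": ["R", "IY", "D"],
    "the": ["DH", "AH"],
    "book": ["B", "UH", "K"],
}


def mix_words(words, lexicon, p_phoneme, rng):
    """Render each word as characters or phonemes (an assumed mixing scheme).

    Returns (symbols, mask), where mask[i] is 1 if symbols[i] is a phoneme
    and 0 if it is a character. Word-separating spaces are marked as characters.
    """
    symbols, mask = [], []
    for i, word in enumerate(words):
        if i > 0:
            symbols.append(" ")
            mask.append(0)
        # Words missing from the lexicon always fall back to characters.
        use_phonemes = word in lexicon and rng.random() < p_phoneme
        units = lexicon[word] if use_phonemes else list(word)
        symbols.extend(units)
        mask.extend([1 if use_phonemes else 0] * len(units))
    return symbols, mask


words = ["read", "the", "book"]
rng = random.Random(0)

# p_phoneme = 0.0 gives pure character input, 1.0 gives pure phoneme input
# (for in-lexicon words), and values in between give a per-word mixture.
char_symbols, char_mask = mix_words(words, TOY_LEXICON, 0.0, rng)
phone_symbols, phone_mask = mix_words(words, TOY_LEXICON, 1.0, rng)
mixed_symbols, mixed_mask = mix_words(words, TOY_LEXICON, 0.5, rng)
```

Under this assumed scheme, the same interface covers all three inference modes named in the abstract: setting the hypothetical `p_phoneme` to 0 or 1 selects a single representation, while forcing phonemes for chosen words would give direct control over their pronunciation.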

