Grapheme or phoneme? An Analysis of Tacotron's Embedded Representations
End-to-end models, particularly Tacotron-based ones, are currently a popular solution for text-to-speech synthesis. They produce high-quality synthesized speech with little to no text preprocessing. Phoneme inputs are usually preferred over graphemes in order to limit the number of pronunciation errors. In this work we show that, in the case of a well-curated French dataset, graphemes can be used as input without increasing the number of pronunciation errors. Furthermore, we analyze the representations learned by the Tacotron model and show that the contextual grapheme embeddings encode phoneme information, and that they can be used for grapheme-to-phoneme conversion and for phoneme control of synthetic speech.
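The claim that contextual grapheme embeddings encode phoneme information is the kind of claim commonly tested with a linear probe: a simple classifier trained to predict a phoneme label from each embedding. The abstract does not say which analysis method was used, so the following is only a minimal sketch of the general idea, not the authors' procedure. The embeddings here are synthetic stand-ins (random class centroids plus noise), and the names `train_linear_probe` and `predict`, the dimension, and the number of phoneme classes are all hypothetical.

```python
import numpy as np

def train_linear_probe(X, y, n_classes, lr=0.01, epochs=300):
    """Fit a softmax-regression probe by gradient descent; returns (W, b)."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot phoneme targets
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def predict(X, W, b):
    """Return the most likely phoneme class index for each embedding."""
    return np.argmax(X @ W + b, axis=1)

# ASSUMPTION: synthetic data standing in for contextual grapheme embeddings.
# Real embeddings would come from a trained Tacotron encoder, paired with
# aligned phoneme labels.
rng = np.random.default_rng(0)
n_classes, dim, per_class = 3, 16, 200
centroids = rng.normal(size=(n_classes, dim)) * 3.0
y = np.repeat(np.arange(n_classes), per_class)
X = centroids[y] + rng.normal(size=(len(y), dim))

# Hold out a quarter of the samples to measure probe accuracy.
perm = rng.permutation(len(y))
train, test = perm[:450], perm[450:]
W, b = train_linear_probe(X[train], y[train], n_classes)
accuracy = float(np.mean(predict(X[test], W, b) == y[test]))
print(f"probe accuracy on held-out synthetic data: {accuracy:.2f}")
```

If a probe this simple reaches high held-out accuracy on real embeddings, that suggests the phoneme identity is linearly recoverable from them, which is also the property a grapheme-to-phoneme converter built on those embeddings would rely on.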