Text-to-speech duration models for resource-scarce languages in neural architectures

Louw, Johannes A

DOI: https://doi.org/10.1007/978-3-030-66151-9_9
http://hdl.handle.net/10204/11999

Abstract:

Sequence-to-sequence end-to-end models for text-to-speech have shown significant gains in naturalness of the produced synthetic speech. These models have an encoder-decoder architecture without an explicit duration model, relying instead on a learned attention-based alignment mechanism, which simplifies the training procedure and reduces the language expertise required for building synthetic voices. However, there are some drawbacks: attention-based alignment systems such as those used in the Tacotron, Tacotron 2, Char2Wav and DC-TTS end-to-end architectures typically suffer from low training efficiency as well as model instability, and several approaches have been attempted to address these problems. Recent neural acoustic models have moved away from using attention-based mechanisms to align the linguistic and acoustic encoding and decoding, and have instead reverted to using an explicit duration model for the alignment. In this work we develop an efficient neural network based duration model and compare it to the traditional Gaussian mixture model based architectures as used in hidden Markov model (HMM)-based speech synthesis. We show through objective results that our proposed model is better suited to resource-scarce language settings than the traditional HMM-based models.

Reference:

Louw, J.A. 2020. Text-to-speech duration models for resource-scarce languages in neural architectures. Communications in Computer and Information Science, 1342. http://hdl.handle.net/10204/11999

Louw, J. A. (2020). Text-to-speech duration models for resource-scarce languages in neural architectures. Communications in Computer and Information Science, 1342, http://hdl.handle.net/10204/11999

Louw, Johannes A "Text-to-speech duration models for resource-scarce languages in neural architectures." Communications in Computer and Information Science, 1342 (2020) http://hdl.handle.net/10204/11999

Louw JA. Text-to-speech duration models for resource-scarce languages in neural architectures. Communications in Computer and Information Science, 1342. 2020; http://hdl.handle.net/10204/11999.

Dec 2020

Hidden Markov Model
HMM
Speech synthesis
Duration modelling
Resource-scarce languages

Source

Communications in Computer and Information Science, 1342

This item appears in the following Collection(s)

Journal Articles
