Sequence-to-sequence end-to-end models for text-to-speech have shown significant gains in the naturalness of the synthetic speech they produce. These models have an encoder-decoder architecture with no explicit duration model; instead, they rely on a learned attention-based alignment mechanism, which simplifies the training procedure and reduces the language expertise required to build synthetic voices. However, there are some drawbacks: attention-based alignment systems, such as those used in the Tacotron, Tacotron 2, Char2Wav and DC-TTS end-to-end architectures, typically suffer from low training efficiency as well as model instability, and several approaches have been attempted to address these problems. Recent neural acoustic models have moved away from using attention-based mechanisms to align the linguistic and acoustic encoding and decoding, and have instead reverted to using an explicit duration model for the alignment. In this work we develop an efficient neural network based duration model and compare it to the traditional Gaussian mixture model based architectures used in hidden Markov model (HMM)-based speech synthesis. We show through objective results that our proposed model is better suited to resource-scarce language settings than the traditional HMM-based models.
Reference:
Louw, J.A. 2020. Text-to-speech duration models for resource-scarce languages in neural architectures. Communications in Computer and Information Science, 1342. http://hdl.handle.net/10204/11999