dc.contributor.author |
Louw, Johannes A
|
|
dc.date.accessioned |
2021-04-23T15:52:19Z |
|
dc.date.available |
2021-04-23T15:52:19Z |
|
dc.date.issued |
2020-12 |
|
dc.identifier.citation |
Louw, J.A. 2020. Text-to-speech duration models for resource-scarce languages in neural architectures. <i>Communications in Computer and Information Science, 1342.</i> http://hdl.handle.net/10204/11999 |
en_ZA |
dc.identifier.issn |
1865-0929 |
|
dc.identifier.uri |
DOI: https://doi.org/10.1007/978-3-030-66151-9_9
|
|
dc.identifier.uri |
http://hdl.handle.net/10204/11999
|
|
dc.description.abstract |
Sequence-to-sequence end-to-end models for text-to-speech have shown significant gains in the naturalness of the produced synthetic speech. These models have an encoder-decoder architecture without an explicit duration model, relying instead on a learned attention-based alignment mechanism, which simplifies the training procedure and reduces the language expertise required for building synthetic voices. However, there are some drawbacks: attention-based alignment systems such as those used in the Tacotron, Tacotron 2, Char2Wav and DC-TTS end-to-end architectures typically suffer from low training efficiency as well as model instability, and several approaches have been attempted to address these problems. Recent neural acoustic models have moved away from using an attention-based mechanism to align the linguistic and acoustic encoding and decoding, and have instead reverted to using an explicit duration model for the alignment. In this work we develop an efficient neural network based duration model and compare it to the traditional Gaussian mixture model based architectures used in hidden Markov model (HMM)-based speech synthesis. We show through objective results that our proposed model is better suited to resource-scarce language settings than the traditional HMM-based models. |
en_US |
dc.format |
Fulltext |
en_US |
dc.language.iso |
en |
en_US |
dc.relation.uri |
https://www.springer.com/gp/book/9783030661502 |
en_US |
dc.relation.uri |
https://link.springer.com/content/pdf/10.1007%2F978-3-030-66151-9_9.pdf |
en_US |
dc.source |
Communications in Computer and Information Science, 1342 |
en_US |
dc.subject |
Hidden Markov Model |
en_US |
dc.subject |
HMM |
en_US |
dc.subject |
Speech synthesis |
en_US |
dc.subject |
Duration modelling |
en_US |
dc.subject |
Resource-scarce languages |
en_US |
dc.title |
Text-to-speech duration models for resource-scarce languages in neural architectures |
en_US |
dc.type |
Article |
en_US |
dc.description.pages |
141-153 |
en_US |
dc.description.cluster |
Next Generation Enterprises & Institutions |
en_US |
dc.description.impactarea |
Digital Audio-Visual Technologies |
en_US |
dc.identifier.apacitation |
Louw, J. A. (2020). Text-to-speech duration models for resource-scarce languages in neural architectures. <i>Communications in Computer and Information Science, 1342</i>, http://hdl.handle.net/10204/11999 |
en_ZA |
dc.identifier.chicagocitation |
Louw, Johannes A. "Text-to-speech duration models for resource-scarce languages in neural architectures." <i>Communications in Computer and Information Science, 1342</i> (2020) http://hdl.handle.net/10204/11999 |
en_ZA |
dc.identifier.vancouvercitation |
Louw JA. Text-to-speech duration models for resource-scarce languages in neural architectures. Communications in Computer and Information Science, 1342. 2020; http://hdl.handle.net/10204/11999. |
en_ZA |
dc.identifier.ris |
TY - Article
AU - Louw, Johannes A
AB - Sequence-to-sequence end-to-end models for text-to-speech have shown significant gains in the naturalness of the produced synthetic speech. These models have an encoder-decoder architecture without an explicit duration model, relying instead on a learned attention-based alignment mechanism, which simplifies the training procedure and reduces the language expertise required for building synthetic voices. However, there are some drawbacks: attention-based alignment systems such as those used in the Tacotron, Tacotron 2, Char2Wav and DC-TTS end-to-end architectures typically suffer from low training efficiency as well as model instability, and several approaches have been attempted to address these problems. Recent neural acoustic models have moved away from using an attention-based mechanism to align the linguistic and acoustic encoding and decoding, and have instead reverted to using an explicit duration model for the alignment. In this work we develop an efficient neural network based duration model and compare it to the traditional Gaussian mixture model based architectures used in hidden Markov model (HMM)-based speech synthesis. We show through objective results that our proposed model is better suited to resource-scarce language settings than the traditional HMM-based models.
DA - 2020-12
DB - ResearchSpace
DP - CSIR
J1 - Communications in Computer and Information Science, 1342
KW - Hidden Markov Model
KW - HMM
KW - Speech synthesis
KW - Duration modelling
KW - Resource-scarce languages
LK - https://researchspace.csir.co.za
PY - 2020
SM - 1865-0929
T1 - Text-to-speech duration models for resource-scarce languages in neural architectures
TI - Text-to-speech duration models for resource-scarce languages in neural architectures
UR - http://hdl.handle.net/10204/11999
ER - |
en_ZA |
dc.identifier.worklist |
24343 |
en_US |