ResearchSpace

Text-to-speech duration models for resource-scarce languages in neural architectures


dc.contributor.author Louw, Johannes A
dc.date.accessioned 2021-04-23T15:52:19Z
dc.date.available 2021-04-23T15:52:19Z
dc.date.issued 2020-12
dc.identifier.citation Louw, J.A. 2020. Text-to-speech duration models for resource-scarce languages in neural architectures. <i>Communications in Computer and Information Science, 1342.</i> http://hdl.handle.net/10204/11999 en_ZA
dc.identifier.issn 1865-0929
dc.identifier.uri DOI: https://doi.org/10.1007/978-3-030-66151-9_9
dc.identifier.uri http://hdl.handle.net/10204/11999
dc.description.abstract Sequence-to-sequence end-to-end models for text-to-speech have shown significant gains in the naturalness of the produced synthetic speech. These models have an encoder-decoder architecture without an explicit duration model, relying instead on a learned attention-based alignment mechanism, which simplifies the training procedure as well as reducing the language expertise requirements for building synthetic voices. However, there are some drawbacks: attention-based alignment systems such as those used in the Tacotron, Tacotron 2, Char2Wav and DC-TTS end-to-end architectures typically suffer from low training efficiency as well as model instability, and several approaches have been attempted to address these problems. Recent neural acoustic models have moved away from using attention-based mechanisms to align the linguistic and acoustic encoding and decoding, and have instead reverted to using an explicit duration model for the alignment. In this work we develop an efficient neural network-based duration model and compare it to the traditional Gaussian mixture model-based architectures as used in hidden Markov model (HMM)-based speech synthesis. We show through objective results that our proposed model is better suited to resource-scarce language settings than the traditional HMM-based models. en_US
dc.format Fulltext en_US
dc.language.iso en en_US
dc.relation.uri https://www.springer.com/gp/book/9783030661502 en_US
dc.relation.uri https://link.springer.com/content/pdf/10.1007%2F978-3-030-66151-9_9.pdf en_US
dc.source Communications in Computer and Information Science, 1342 en_US
dc.subject Hidden Markov Model en_US
dc.subject HMM en_US
dc.subject Speech synthesis en_US
dc.subject Duration modelling en_US
dc.subject Resource-scarce languages en_US
dc.title Text-to-speech duration models for resource-scarce languages in neural architectures en_US
dc.type Article en_US
dc.description.pages 141-153 en_US
dc.description.cluster Next Generation Enterprises & Institutions en_US
dc.description.impactarea Digital Audio-Visual Technologies en_US
dc.identifier.apacitation Louw, J. A. (2020). Text-to-speech duration models for resource-scarce languages in neural architectures. <i>Communications in Computer and Information Science, 1342</i>, http://hdl.handle.net/10204/11999 en_ZA
dc.identifier.chicagocitation Louw, Johannes A. "Text-to-speech duration models for resource-scarce languages in neural architectures." <i>Communications in Computer and Information Science, 1342</i> (2020) http://hdl.handle.net/10204/11999 en_ZA
dc.identifier.vancouvercitation Louw JA. Text-to-speech duration models for resource-scarce languages in neural architectures. Communications in Computer and Information Science, 1342. 2020; http://hdl.handle.net/10204/11999. en_ZA
dc.identifier.ris TY - Article AU - Louw, Johannes A AB - Sequence-to-sequence end-to-end models for text-to-speech have shown significant gains in the naturalness of the produced synthetic speech. These models have an encoder-decoder architecture without an explicit duration model, relying instead on a learned attention-based alignment mechanism, which simplifies the training procedure as well as reducing the language expertise requirements for building synthetic voices. However, there are some drawbacks: attention-based alignment systems such as those used in the Tacotron, Tacotron 2, Char2Wav and DC-TTS end-to-end architectures typically suffer from low training efficiency as well as model instability, and several approaches have been attempted to address these problems. Recent neural acoustic models have moved away from using attention-based mechanisms to align the linguistic and acoustic encoding and decoding, and have instead reverted to using an explicit duration model for the alignment. In this work we develop an efficient neural network-based duration model and compare it to the traditional Gaussian mixture model-based architectures as used in hidden Markov model (HMM)-based speech synthesis. We show through objective results that our proposed model is better suited to resource-scarce language settings than the traditional HMM-based models. DA - 2020-12 DB - ResearchSpace DP - CSIR J1 - Communications in Computer and Information Science, 1342 KW - Hidden Markov Model KW - HMM KW - Speech synthesis KW - Duration modelling KW - Resource-scarce languages LK - https://researchspace.csir.co.za PY - 2020 SM - 1865-0929 T1 - Text-to-speech duration models for resource-scarce languages in neural architectures TI - Text-to-speech duration models for resource-scarce languages in neural architectures UR - http://hdl.handle.net/10204/11999 ER - en_ZA
dc.identifier.worklist 24343 en_US
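
To make the abstract's contrast concrete, the sketch below shows what a neural network-based duration model of the general kind described there might look like: a small feed-forward network mapping a per-phone linguistic feature vector to a predicted log-duration, taking the place of the Gaussian mixture duration densities used in HMM-based synthesis. This is a minimal illustrative sketch, not the paper's actual architecture; the feature dimension, layer sizes and training loop are assumptions for demonstration only.

import torch
import torch.nn as nn

class DurationModel(nn.Module):
    """Hypothetical per-phone duration predictor (illustrative, not from the paper)."""
    def __init__(self, feat_dim: int = 60, hidden: int = 128):
        super().__init__()
        # feat_dim and hidden are illustrative choices, not values from the paper.
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # predicted log-duration (e.g. in frames)
        )

    def forward(self, linguistic_features: torch.Tensor) -> torch.Tensor:
        return self.net(linguistic_features).squeeze(-1)

# Minimal training step on dummy data, purely to show the fitting loop.
model = DurationModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

features = torch.randn(32, 60)        # batch of per-phone linguistic feature vectors
log_durations = torch.rand(32) * 3.0  # dummy log-duration targets

optimizer.zero_grad()
loss = loss_fn(model(features), log_durations)
loss.backward()
optimizer.step()

At synthesis time such a model would be queried once per phone to obtain an explicit alignment between linguistic and acoustic frames, which is the role the abstract describes it playing in place of attention-based alignment.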

