Recent work on sequence-to-sequence neural networks with attention mechanisms, such as the Tacotron 2 and DCTTS architectures, has brought substantial naturalness improvements in synthesised speech. These architectures, however, require at least an order of magnitude more data than is generally available in resource-scarce language environments. In this paper we propose an efficient feed-forward deep neural network (DNN)-based acoustic model, using stacked bottleneck features, that together with the recently introduced LPCNet vocoder can be used in resource-scarce language environments, with corpora of less than one hour in size, to build text-to-speech systems of high perceived naturalness. We compare traditional hidden Markov model (HMM)-based acoustic modelling for speech synthesis with the proposed architecture, using the WORLD and LPCNet vocoders, and give both objective and MUSHRA-based subjective results, showing that the DNN-LPCNet combination leads to more natural synthesised speech that can be confused with natural speech. The proposed acoustic model also admits an efficient implementation, with faster-than-real-time synthesis.
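The paper itself gives the exact configuration; as a rough illustration of the stacked-bottleneck idea, a minimal PyTorch sketch might look like the following. All layer sizes, the ±1-frame stacking window, and the input dimensionality are assumptions for illustration, not the paper's settings; the 20-dimensional output corresponds to LPCNet's feature set of 18 Bark-scale cepstral coefficients plus 2 pitch parameters.

```python
import torch
import torch.nn as nn


class BottleneckAcousticModel(nn.Module):
    """Sketch of a feed-forward acoustic model with stacked bottleneck features.

    A first network maps per-frame linguistic features to a narrow
    bottleneck representation; bottleneck vectors from neighbouring
    frames are then stacked and fed, together with the original
    linguistic features, into a second network that predicts the
    vocoder features. All dimensions below are hypothetical.
    """

    def __init__(self, n_linguistic=400, hidden=512, bottleneck=64, n_out=20):
        super().__init__()
        # Stage 1: linguistic features -> bottleneck representation.
        self.stage1 = nn.Sequential(
            nn.Linear(n_linguistic, hidden), nn.ReLU(),
            nn.Linear(hidden, bottleneck), nn.ReLU(),
        )
        # Stage 2: linguistic features concatenated with bottleneck
        # features stacked over a +/-1-frame context window.
        self.stage2 = nn.Sequential(
            nn.Linear(n_linguistic + 3 * bottleneck, hidden), nn.ReLU(),
            nn.Linear(hidden, n_out),
        )

    def forward(self, x):                    # x: (frames, n_linguistic)
        b = self.stage1(x)                   # (frames, bottleneck)
        prev = torch.roll(b, 1, dims=0)      # bottleneck of previous frame
        nxt = torch.roll(b, -1, dims=0)      # bottleneck of next frame
        stacked = torch.cat([prev, b, nxt], dim=1)
        return self.stage2(torch.cat([x, stacked], dim=1))


# Usage sketch: predict vocoder features for 100 frames of linguistic input.
model = BottleneckAcousticModel()
features = model(torch.randn(100, 400))     # (100, 20) LPCNet-style features
```

Because the model is a small stack of matrix multiplications with no recurrence or attention, each frame can be computed cheaply and in parallel, which is what makes faster-than-real-time synthesis plausible on modest hardware.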
Reference:
Louw, J.A. 2019. Neural speech synthesis for resource-scarce languages. In: Proceedings of the South African Forum for Artificial Intelligence, Cape Town, 4-6 December 2019. http://hdl.handle.net/10204/11541