In this work, we explore the use of phonological features in cross-lingual transfer within resource-scarce settings. We modify the VITS architecture to accept a phonological feature vector as input instead of phonemes or characters. We then train multispeaker base models on data from LibriTTS and fine-tune them on single-speaker Afrikaans and isiXhosa datasets of varying sizes, which represent the resource-scarce setting. We evaluate the synthetic speech both objectively and subjectively and compare it to speech from models trained on the same data with the standard VITS architecture. In our experiments, the proposed system, which takes phonological features as input, converges significantly faster and requires less data than the base system. We demonstrate that the model using phonological features can produce sounds in the target language that were unseen in the source language, even when the source and target languages differ significantly, and with only 5 minutes of target-language data.
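The intuition behind feeding feature vectors rather than phoneme identities can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the seven-feature inventory, the toy phoneme table, and the helper names (`feature_vector`, `make_projection`, `project`, `seen_features`) are hypothetical and are not taken from the paper or from the VITS code base. The sketch only shows that a phoneme missing from the source inventory can still be encoded as a new combination of features that the source data already covers, which a per-phoneme lookup table cannot do.

```python
import random

# Hypothetical, heavily simplified feature inventory (not the paper's feature set).
FEATURES = ["voiced", "nasal", "bilabial", "alveolar", "velar", "plosive", "fricative"]

# Toy phoneme-to-feature table, assumed for illustration only.
PHONEME_FEATURES = {
    "p": {"bilabial", "plosive"},
    "b": {"voiced", "bilabial", "plosive"},
    "m": {"voiced", "nasal", "bilabial"},
    "t": {"alveolar", "plosive"},
    "s": {"alveolar", "fricative"},
    "k": {"velar", "plosive"},
    "x": {"velar", "fricative"},  # deliberately absent from the toy source inventory
}

# Phonemes assumed to occur in the (toy) source-language training data.
SOURCE_INVENTORY = ["p", "b", "m", "t", "s", "k"]


def feature_vector(phoneme):
    """Encode a phoneme as a binary vector over FEATURES."""
    active = PHONEME_FEATURES[phoneme]
    return [1.0 if name in active else 0.0 for name in FEATURES]


def make_projection(dim, seed=0):
    """Random weights standing in for a learned linear input layer.

    One row per feature, `dim` columns. In a real model these weights
    would be trained, not sampled.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in FEATURES]


def project(vec, weights):
    """Map a feature vector to a dense vector (plain matrix-vector product)."""
    dim = len(weights[0])
    return [sum(v * row[j] for v, row in zip(vec, weights)) for j in range(dim)]


def seen_features(inventory):
    """Collect every feature that is active in at least one inventory phoneme."""
    seen = set()
    for phoneme in inventory:
        seen |= PHONEME_FEATURES[phoneme]
    return seen


if __name__ == "__main__":
    weights = make_projection(dim=4)
    print("x in source inventory:", "x" in SOURCE_INVENTORY)
    print("features of x covered by source:",
          PHONEME_FEATURES["x"] <= seen_features(SOURCE_INVENTORY))
    print("dense input for x:", project(feature_vector("x"), weights))
```

In this toy setup, the dense input for "x" is the sum of the weight rows for "velar" and "fricative", both of which are exercised by source phonemes ("k" and "s"). A lookup table indexed by phoneme identity would have no trained row for "x" at all.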
Reference:
Louw, J.A. 2023. Cross-lingual transfer using phonological features for resource-scarce text-to-speech. 12th ISCA Speech Synthesis Workshop, Grenoble, France, 26-28 August 2023. http://hdl.handle.net/10204/13404