ResearchSpace

NCHLT Auxiliary speech data for ASR technology development in South Africa

dc.contributor.author Badenhorst, Jacob AC
dc.contributor.author De Wet, Febe
dc.date.accessioned 2022-05-04T13:20:41Z
dc.date.available 2022-05-04T13:20:41Z
dc.date.issued 2022-04
dc.identifier.citation Badenhorst, J.A. & De Wet, F. 2022. NCHLT Auxiliary speech data for ASR technology development in South Africa. Data in Brief, 41. http://hdl.handle.net/10204/12379 en_ZA
dc.identifier.issn 2352-3409
dc.identifier.uri https://doi.org/10.1016/j.dib.2022.107860
dc.identifier.uri http://hdl.handle.net/10204/12379
dc.description.abstract The aim of the National Centre for Human Language Technology (NCHLT) project was to create speech and text resources that would enable Human Language Technology (HLT) development for the 11 official languages of South Africa. The speech data described in this paper was collected during the NCHLT project using a smartphone application. The official NCHLT Speech Corpus was released in 2014, but it did not include all recordings that were made during the data collection campaign. This paper describes the additional data that was recently released as auxiliary corpora [2]. The auxiliary data sets contain between 20 and 170 hours of speech data per language as well as the transcriptions associated with each utterance. In terms of the resources required for HLT development, South Africa’s official languages are all under-resourced. The data described in this paper contributes toward alleviating this situation, specifically for the development of speech technology. en_US
dc.format Fulltext en_US
dc.language.iso en en_US
dc.relation.uri https://www.sciencedirect.com/science/article/pii/S2352340922000725 en_US
dc.source Data in Brief, 41 en_US
dc.subject Automatic speech recognition en_US
dc.subject Human language technology en_US
dc.subject Speech data en_US
dc.subject South African languages en_US
dc.subject Under-resourced languages en_US
dc.title NCHLT Auxiliary speech data for ASR technology development in South Africa en_US
dc.type Article en_US
dc.description.pages 8 en_US
dc.description.note © 2022 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/) en_US
dc.description.cluster Next Generation Enterprises & Institutions en_US
dc.description.impactarea Voice Computing en_US
dc.identifier.apacitation Badenhorst, J. A., & De Wet, F. (2022). NCHLT Auxiliary speech data for ASR technology development in South Africa. Data in Brief, 41, http://hdl.handle.net/10204/12379 en_ZA
dc.identifier.chicagocitation Badenhorst, Jacob AC, and Febe De Wet. "NCHLT Auxiliary speech data for ASR technology development in South Africa." Data in Brief, 41 (2022). http://hdl.handle.net/10204/12379 en_ZA
dc.identifier.vancouvercitation Badenhorst JA, De Wet F. NCHLT Auxiliary speech data for ASR technology development in South Africa. Data in Brief, 41. 2022; http://hdl.handle.net/10204/12379. en_ZA
dc.identifier.ris TY - Article AU - Badenhorst, Jacob AC AU - De Wet, Febe AB - The aim of the National Centre for Human Language Technology (NCHLT) project was to create speech and text resources that would enable Human Language Technology (HLT) development for the 11 official languages of South Africa. The speech data described in this paper was collected during the NCHLT project using a smartphone application. The official NCHLT Speech Corpus was released in 2014, but it did not include all recordings that were made during the data collection campaign. This paper describes the additional data that was recently released as auxiliary corpora [2]. The auxiliary data sets contain between 20 and 170 hours of speech data per language as well as the transcriptions associated with each utterance. In terms of the resources required for HLT development, South Africa’s official languages are all under-resourced. The data described in this paper contributes toward alleviating this situation, specifically for the development of speech technology. DA - 2022-04 DB - ResearchSpace DP - CSIR J1 - Data in Brief, 41 KW - Automatic speech recognition KW - Human language technology KW - Speech data KW - South African languages KW - Under-resourced languages LK - https://researchspace.csir.co.za PY - 2022 SM - 2352-3409 T1 - NCHLT Auxiliary speech data for ASR technology development in South Africa TI - NCHLT Auxiliary speech data for ASR technology development in South Africa UR - http://hdl.handle.net/10204/12379 ER - en_ZA
dc.identifier.worklist 25585 en_US

