NCHLT Auxiliary speech data for ASR technology development in South Africa

Badenhorst, Jacob AC; De Wet, Febe

dc.contributor.author	Badenhorst, Jacob AC
dc.contributor.author	De Wet, Febe
dc.date.accessioned	2022-05-04T13:20:41Z
dc.date.available	2022-05-04T13:20:41Z
dc.date.issued	2022-04
dc.identifier.citation	Badenhorst, J.A. & De Wet, F. 2022. NCHLT Auxiliary speech data for ASR technology development in South Africa. Data in Brief, 41. http://hdl.handle.net/10204/12379	en_ZA
dc.identifier.issn	2352-3409
dc.identifier.uri	https://doi.org/10.1016/j.dib.2022.107860
dc.identifier.uri	http://hdl.handle.net/10204/12379
dc.description.abstract	The aim of the National Centre for Human Language Technology (NCHLT) project was to create speech and text resources that would enable Human Language Technology (HLT) development for the 11 official languages of South Africa. The speech data described in this paper was collected during the NCHLT project using a smartphone application. The official NCHLT Speech Corpus was released in 2014, but it did not include all recordings that were made during the data collection campaign. This paper describes the additional data that was recently released as auxiliary corpora [2]. The auxiliary data sets contain between 20 and 170 hours of speech data per language as well as the transcriptions associated with each utterance. In terms of the resources required for HLT development, South Africa’s official languages are all under-resourced. The data described in this paper contributes toward alleviating this situation, specifically for the development of speech technology.	en_US
dc.format	Fulltext	en_US
dc.language.iso	en	en_US
dc.relation.uri	https://www.sciencedirect.com/science/article/pii/S2352340922000725	en_US
dc.source	Data in Brief, 41	en_US
dc.subject	Automatic speech recognition	en_US
dc.subject	Human language technology	en_US
dc.subject	Speech data	en_US
dc.subject	South African languages	en_US
dc.subject	Under-resourced languages	en_US
dc.title	NCHLT Auxiliary speech data for ASR technology development in South Africa	en_US
dc.type	Article	en_US
dc.description.pages	8	en_US
dc.description.note	© 2022 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)	en_US
dc.description.cluster	Next Generation Enterprises & Institutions	en_US
dc.description.impactarea	Voice Computing	en_US
dc.identifier.apacitation	Badenhorst, J. A., & De Wet, F. (2022). NCHLT Auxiliary speech data for ASR technology development in South Africa. Data in Brief, 41. http://hdl.handle.net/10204/12379	en_ZA
dc.identifier.chicagocitation	Badenhorst, Jacob AC, and Febe De Wet. "NCHLT Auxiliary speech data for ASR technology development in South Africa." Data in Brief, 41 (2022). http://hdl.handle.net/10204/12379	en_ZA
dc.identifier.vancouvercitation	Badenhorst JA, De Wet F. NCHLT Auxiliary speech data for ASR technology development in South Africa. Data in Brief, 41. 2022; http://hdl.handle.net/10204/12379.	en_ZA
dc.identifier.ris	TY - Article AU - Badenhorst, Jacob AC AU - De Wet, Febe AB - The aim of the National Centre for Human Language Technology (NCHLT) project was to create speech and text resources that would enable Human Language Technology (HLT) development for the 11 official languages of South Africa. The speech data described in this paper was collected during the NCHLT project using a smartphone application. The official NCHLT Speech Corpus was released in 2014, but it did not include all recordings that were made during the data collection campaign. This paper describes the additional data that was recently released as auxiliary corpora [2]. The auxiliary data sets contain between 20 and 170 hours of speech data per language as well as the transcriptions associated with each utterance. In terms of the resources required for HLT development South Africa’s official languages are all under-resourced. The data described in this paper contributes toward alleviating this situation, specifically for the development of speech technology. DA - 2022-04 DB - ResearchSpace DP - CSIR J1 - Data in Brief, 41 KW - Automatic speech recognition KW - Human language technology KW - Speech data KW - South African languages KW - Under-resourced languages LK - https://researchspace.csir.co.za PY - 2022 SM - 2352-3409 T1 - NCHLT Auxiliary speech data for ASR technology development in South Africa TI - NCHLT Auxiliary speech data for ASR technology development in South Africa UR - http://hdl.handle.net/10204/12379 ER -	en_ZA
dc.identifier.worklist	25585	en_US

Files in this item

Name: Badenhorst_2022.pdf

Size: 601.5Kb

Format: PDF

Description: Article

This item appears in the following Collection(s)

Journal Articles
