NCHLT Auxiliary speech data for ASR technology development in South Africa

Badenhorst, Jacob AC; De Wet, Febe

https://doi.org/10.1016/j.dib.2022.107860
http://hdl.handle.net/10204/12379

Abstract:

The aim of the National Centre for Human Language Technology (NCHLT) project was to create speech and text resources that would enable Human Language Technology (HLT) development for the 11 official languages of South Africa. The speech data described in this paper was collected during the NCHLT project using a smartphone application. The official NCHLT Speech Corpus was released in 2014, but it did not include all the recordings made during the data collection campaign. This paper describes the additional data that was recently released as auxiliary corpora [2]. The auxiliary data sets contain between 20 and 170 hours of speech data per language, as well as the transcription associated with each utterance. In terms of the resources required for HLT development, South Africa's official languages are all under-resourced. The data described in this paper helps to alleviate this situation, specifically for the development of speech technology.

Reference:

Badenhorst, J.A. & De Wet, F. 2022. NCHLT Auxiliary speech data for ASR technology development in South Africa. Data in Brief, 41. http://hdl.handle.net/10204/12379

Badenhorst, J. A., & De Wet, F. (2022). NCHLT Auxiliary speech data for ASR technology development in South Africa. Data in Brief, 41, http://hdl.handle.net/10204/12379

Badenhorst, Jacob AC, and Febe De Wet. "NCHLT Auxiliary speech data for ASR technology development in South Africa." Data in Brief, 41 (2022). http://hdl.handle.net/10204/12379

Badenhorst JA, De Wet F. NCHLT Auxiliary speech data for ASR technology development in South Africa. Data in Brief, 41. 2022; http://hdl.handle.net/10204/12379.

Apr 2022

Automatic speech recognition
Human language technology
Speech data
South African languages
Under-resourced languages

Source

Data in Brief, 41
