dc.contributor.author |
Badenhorst, Jacob AC
|
|
dc.contributor.author |
De Wet, Febe
|
|
dc.date.accessioned |
2022-05-04T13:20:41Z |
|
dc.date.available |
2022-05-04T13:20:41Z |
|
dc.date.issued |
2022-04 |
|
dc.identifier.citation |
Badenhorst, J.A. & De Wet, F. 2022. NCHLT Auxiliary speech data for ASR technology development in South Africa. <i>Data in Brief, 41.</i> http://hdl.handle.net/10204/12379 |
en_ZA |
dc.identifier.issn |
2352-3409 |
|
dc.identifier.uri |
https://doi.org/10.1016/j.dib.2022.107860
|
|
dc.identifier.uri |
http://hdl.handle.net/10204/12379
|
|
dc.description.abstract |
The aim of the National Centre for Human Language Technology (NCHLT) project was to create speech and text resources that would enable Human Language Technology (HLT) development for the 11 official languages of South Africa. The speech data described in this paper was collected during the NCHLT project using a smartphone application. The official NCHLT Speech Corpus was released in 2014, but it did not include all recordings that were made during the data collection campaign. This paper describes the additional data that was recently released as auxiliary corpora [2]. The auxiliary data sets contain between 20 and 170 hours of speech data per language as well as the transcriptions associated with each utterance. In terms of the resources required for HLT development, South Africa’s official languages are all under-resourced. The data described in this paper contributes toward alleviating this situation, specifically for the development of speech technology. |
en_US |
dc.format |
Fulltext |
en_US |
dc.language.iso |
en |
en_US |
dc.relation.uri |
https://www.sciencedirect.com/science/article/pii/S2352340922000725 |
en_US |
dc.source |
Data in Brief, 41 |
en_US |
dc.subject |
Automatic speech recognition |
en_US |
dc.subject |
Human language technology |
en_US |
dc.subject |
Speech data |
en_US |
dc.subject |
South African languages |
en_US |
dc.subject |
Under-resourced languages |
en_US |
dc.title |
NCHLT Auxiliary speech data for ASR technology development in South Africa |
en_US |
dc.type |
Article |
en_US |
dc.description.pages |
8 |
en_US |
dc.description.note |
© 2022 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/) |
en_US |
dc.description.cluster |
Next Generation Enterprises & Institutions |
en_US |
dc.description.impactarea |
Voice Computing |
en_US |
dc.identifier.apacitation |
Badenhorst, J. A., & De Wet, F. (2022). NCHLT Auxiliary speech data for ASR technology development in South Africa. <i>Data in Brief, 41</i>, http://hdl.handle.net/10204/12379 |
en_ZA |
dc.identifier.chicagocitation |
Badenhorst, Jacob AC, and Febe De Wet. "NCHLT Auxiliary speech data for ASR technology development in South Africa." <i>Data in Brief, 41</i> (2022). http://hdl.handle.net/10204/12379 |
en_ZA |
dc.identifier.vancouvercitation |
Badenhorst JA, De Wet F. NCHLT Auxiliary speech data for ASR technology development in South Africa. Data in Brief, 41. 2022; http://hdl.handle.net/10204/12379. |
en_ZA |
dc.identifier.ris |
TY - Article
AU - Badenhorst, Jacob AC
AU - De Wet, Febe
AB - The aim of the National Centre for Human Language Technology (NCHLT) project was to create speech and text resources that would enable Human Language Technology (HLT) development for the 11 official languages of South Africa. The speech data described in this paper was collected during the NCHLT project using a smartphone application. The official NCHLT Speech Corpus was released in 2014, but it did not include all recordings that were made during the data collection campaign. This paper describes the additional data that was recently released as auxiliary corpora [2]. The auxiliary data sets contain between 20 and 170 hours of speech data per language as well as the transcriptions associated with each utterance. In terms of the resources required for HLT development, South Africa’s official languages are all under-resourced. The data described in this paper contributes toward alleviating this situation, specifically for the development of speech technology.
DA - 2022-04
DB - ResearchSpace
DP - CSIR
J1 - Data in Brief, 41
KW - Automatic speech recognition
KW - Human language technology
KW - Speech data
KW - South African languages
KW - Under-resourced languages
LK - https://researchspace.csir.co.za
PY - 2022
SM - 2352-3409
T1 - NCHLT Auxiliary speech data for ASR technology development in South Africa
TI - NCHLT Auxiliary speech data for ASR technology development in South Africa
UR - http://hdl.handle.net/10204/12379
ER -
|
en_ZA |
dc.identifier.worklist |
25585 |
en_US |