Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi

Marivate, Vukosi N; Sefara, Tshephisho J; Chabalala, V; Makhaya, K; Mokgonyane, T; Mokoena, R; Modupe, A

https://www.sadilar.org/index.php/en/news/events/rail2020
https://www.aclweb.org/anthology/2020.rail-1.0.pdf
https://www.aclweb.org/anthology/2020.rail-1.3.pdf
http://hdl.handle.net/10204/11510

Abstract:

Recent advances in Natural Language Processing have been a boon only for well-represented languages, neglecting research in lesser-known global languages. This is due in part to the uneven availability of curated data and research resources. One of the current challenges for low-resourced languages is the lack of clear guidelines on the collection, curation and preparation of datasets for different use cases. In this work, we create two datasets of news headlines (i.e., short text) for Setswana and Sepedi and define a news topic classification task from these datasets. We document our work, propose baselines for classification, and investigate a data augmentation approach better suited to low-resourced languages in order to improve the performance of the classifiers.

Reference:

Marivate, V. et al. 2020. Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi. Proceedings of the First Workshop on Resources for African Indigenous Languages, Marseille, France, 16 May 2020, 6 pp.

Marivate, V. N., Sefara, T. J., Chabalala, V., Makhaya, K., Mokgonyane, T., Mokoena, R., & Modupe, A. (2020). Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi. http://hdl.handle.net/10204/11510

Marivate, Vukosi N, Tshephisho J Sefara, V Chabalala, K Makhaya, T Mokgonyane, R Mokoena, and A Modupe. "Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi." (2020): http://hdl.handle.net/10204/11510

Marivate VN, Sefara TJ, Chabalala V, Makhaya K, Mokgonyane T, Mokoena R, et al. Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi; 2020. http://hdl.handle.net/10204/11510

Marivate, Vukosi N
Sefara, Tshephisho J
Chabalala, V
Makhaya, K
Mokgonyane, T
Mokoena, R
Modupe, A

May 2020

Low-resource languages
Natural Language Processing
Sepedi
Setswana

Files in this item

RS_Investigating an approach for low resource language.pdf

This item appears in the following Collection(s)

Conference Publications
