Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi

Marivate, Vukosi N; Sefara, Tshephisho J; Chabalala, V; Makhaya, K; Mokgonyane, T; Mokoena, R; Modupe, A

dc.contributor.author	Marivate, Vukosi N
dc.contributor.author	Sefara, Tshephisho J
dc.contributor.author	Chabalala, V
dc.contributor.author	Makhaya, K
dc.contributor.author	Mokgonyane, T
dc.contributor.author	Mokoena, R
dc.contributor.author	Modupe, A
dc.date.accessioned	2020-07-27T06:43:50Z
dc.date.available	2020-07-27T06:43:50Z
dc.date.issued	2020-05
dc.identifier.citation	Marivate, V. et al. 2020. Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi. Proceedings of the First Workshop on Resources for African Indigenous Languages, Marseille, France, 16 May 2020, 6pp.	en_US
dc.identifier.uri	https://www.sadilar.org/index.php/en/news/events/rail2020
dc.identifier.uri	https://www.aclweb.org/anthology/2020.rail-1.0.pdf
dc.identifier.uri	https://www.aclweb.org/anthology/2020.rail-1.3.pdf
dc.identifier.uri	http://hdl.handle.net/10204/11510
dc.description	Copyright: The South African Centre for Digital Language Resources (SADiLaR). This is the full text version of the work.	en_US
dc.description.abstract	The recent advances in Natural Language Processing have only been a boon for well represented languages, negating research in lesser known global languages. This is in part due to the limited availability of curated data and research resources. One of the current challenges concerning low-resourced languages is the lack of clear guidelines on the collection, curation and preparation of datasets for different use-cases. In this work, we take on the task of creating two datasets that are focused on news headlines (i.e. short text) for Setswana and Sepedi and the creation of a news topic classification task from these datasets. In this study, we document our work, propose baselines for classification, and investigate an approach to data augmentation better suited to low-resourced languages in order to improve the performance of the classifiers.	en_US
dc.language.iso	en	en_US
dc.relation.ispartofseries	Worklist;23605
dc.subject	Low-resource languages	en_US
dc.subject	Natural Language Processing	en_US
dc.subject	Sepedi	en_US
dc.subject	Setswana	en_US
dc.title	Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi	en_US
dc.type	Conference Presentation	en_US
dc.identifier.apacitation	Marivate, V. N., Sefara, T. J., Chabalala, V., Makhaya, K., Mokgonyane, T., Mokoena, R., & Modupe, A. (2020). Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi. http://hdl.handle.net/10204/11510	en_ZA
dc.identifier.chicagocitation	Marivate, Vukosi N, Tshephisho J Sefara, V Chabalala, K Makhaya, T Mokgonyane, R Mokoena, and A Modupe. "Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi." (2020): http://hdl.handle.net/10204/11510	en_ZA
dc.identifier.vancouvercitation	Marivate VN, Sefara TJ, Chabalala V, Makhaya K, Mokgonyane T, Mokoena R, et al. Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi; 2020. http://hdl.handle.net/10204/11510.	en_ZA
dc.identifier.ris	TY - Conference Presentation AU - Marivate, Vukosi N AU - Sefara, Tshephisho J AU - Chabalala, V AU - Makhaya, K AU - Mokgonyane, T AU - Mokoena, R AU - Modupe, A AB - The recent advances in Natural Language Processing have only been a boon for well represented languages, negating research in lesser known global languages. This is in part due to the limited availability of curated data and research resources. One of the current challenges concerning low-resourced languages is the lack of clear guidelines on the collection, curation and preparation of datasets for different use-cases. In this work, we take on the task of creating two datasets that are focused on news headlines (i.e. short text) for Setswana and Sepedi and the creation of a news topic classification task from these datasets. In this study, we document our work, propose baselines for classification, and investigate an approach to data augmentation better suited to low-resourced languages in order to improve the performance of the classifiers. DA - 2020-05 DB - ResearchSpace DP - CSIR KW - Low-resource languages KW - Natural Language Processing KW - Sepedi KW - Setswana LK - https://researchspace.csir.co.za PY - 2020 T1 - Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi TI - Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi UR - http://hdl.handle.net/10204/11510 ER -	en_ZA

Files in this item

Name: RS_Investigating ...

Size: 746.0Kb

Format: PDF

This item appears in the following Collection(s)

Conference Publications

Resources on this site are free to download and reuse according to associated licensing provision. Please read the terms and conditions of usage of each resource.