ResearchSpace

Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi

dc.contributor.author Marivate, Vukosi N
dc.contributor.author Sefara, Tshephisho J
dc.contributor.author Chabalala, V
dc.contributor.author Makhaya, K
dc.contributor.author Mokgonyane, T
dc.contributor.author Mokoena, R
dc.contributor.author Modupe, A
dc.date.accessioned 2020-07-27T06:43:50Z
dc.date.available 2020-07-27T06:43:50Z
dc.date.issued 2020-05
dc.identifier.citation Marivate, V. et al. 2020. Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi. Proceedings of the First Workshop on Resources for African Indigenous Languages, Marseille, France, 16 May 2020, 6pp en_US
dc.identifier.uri https://www.sadilar.org/index.php/en/news/events/rail2020
dc.identifier.uri https://www.aclweb.org/anthology/2020.rail-1.0.pdf
dc.identifier.uri https://www.aclweb.org/anthology/2020.rail-1.3.pdf
dc.identifier.uri http://hdl.handle.net/10204/11510
dc.description Copyright: The South African Centre for Digital Language Resources (SADiLaR). This is the full text version of the work. en_US
dc.description.abstract The recent advances in Natural Language Processing have only been a boon for well-represented languages, negating research in lesser-known global languages. This is in part due to the availability of curated data and research resources. One of the current challenges concerning low-resourced languages is the lack of clear guidelines on the collection, curation and preparation of datasets for different use-cases. In this work, we take on the task of creating two datasets that are focused on news headlines (i.e., short text) for Setswana and Sepedi and the creation of a news topic classification task from these datasets. In this study, we document our work, propose baselines for classification, and investigate an approach to data augmentation better suited to low-resourced languages in order to improve the performance of the classifiers. en_US
dc.language.iso en en_US
dc.relation.ispartofseries Worklist;23605
dc.subject Low-resource languages en_US
dc.subject Natural Language Processing en_US
dc.subject Sepedi en_US
dc.subject Setswana en_US
dc.title Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi en_US
dc.type Conference Presentation en_US
dc.identifier.apacitation Marivate, V. N., Sefara, T. J., Chabalala, V., Makhaya, K., Mokgonyane, T., Mokoena, R., & Modupe, A. (2020). Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi. http://hdl.handle.net/10204/11510 en_ZA
dc.identifier.chicagocitation Marivate, Vukosi N, Tshephisho J Sefara, V Chabalala, K Makhaya, T Mokgonyane, R Mokoena, and A Modupe. "Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi." (2020): http://hdl.handle.net/10204/11510 en_ZA
dc.identifier.vancouvercitation Marivate VN, Sefara TJ, Chabalala V, Makhaya K, Mokgonyane T, Mokoena R, et al. Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi; 2020. http://hdl.handle.net/10204/11510 en_ZA
dc.identifier.ris TY - Conference Presentation AU - Marivate, Vukosi N AU - Sefara, Tshephisho J AU - Chabalala, V AU - Makhaya, K AU - Mokgonyane, T AU - Mokoena, R AU - Modupe, A AB - The recent advances in Natural Language Processing have only been a boon for well-represented languages, negating research in lesser-known global languages. This is in part due to the availability of curated data and research resources. One of the current challenges concerning low-resourced languages is the lack of clear guidelines on the collection, curation and preparation of datasets for different use-cases. In this work, we take on the task of creating two datasets that are focused on news headlines (i.e., short text) for Setswana and Sepedi and the creation of a news topic classification task from these datasets. In this study, we document our work, propose baselines for classification, and investigate an approach to data augmentation better suited to low-resourced languages in order to improve the performance of the classifiers. DA - 2020-05 DB - ResearchSpace DP - CSIR KW - Low-resource languages KW - Natural Language Processing KW - Sepedi KW - Setswana LK - https://researchspace.csir.co.za PY - 2020 T1 - Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi TI - Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi UR - http://hdl.handle.net/10204/11510 ER - en_ZA

