dc.contributor.author | Makgatho, M |
dc.contributor.author | Marivate, V |
dc.contributor.author | Sefara, Tshephisho J |
dc.contributor.author | Wagner, V |
dc.date.accessioned | 2022-05-13T09:34:55Z |
dc.date.available | 2022-05-13T09:34:55Z |
dc.date.issued | 2022-02 |
dc.identifier.citation | Makgatho, M., Marivate, V., Sefara, T.J. & Wagner, V. 2022. Training cross-lingual embeddings for Setswana and Sepedi. http://hdl.handle.net/10204/12414. | en_ZA
dc.identifier.uri | https://doi.org/10.55492/dhasa.v3i03.3822 |
dc.identifier.uri | http://hdl.handle.net/10204/12414 |
dc.description.abstract | How can we transfer semantic information between two low-resourced African languages when one has more resources than the other? African languages still lag behind in advances in Natural Language Processing, one reason being the lack of representative data; a technique that can transfer information between languages can help mitigate this data problem. This paper trains Setswana and Sepedi monolingual word vectors and uses VecMap to create Setswana-Sepedi cross-lingual embeddings, since cross-lingual embeddings can be used to transfer semantic information from resource-rich to low-resourced languages. Word embeddings represent words as vectors of continuous floating-point numbers, where semantically similar words are mapped to nearby points in n-dimensional space; words used in similar contexts are mapped close to each other in the vector space, and the geometric relations between words can be understood by computing the cosine distance between their vectors. The idea of word embeddings is based on the distributional hypothesis, which states that semantically similar words are distributed in similar contexts (Harris, 1954). Monolingual embeddings can be used for cross-lingual transfer, improving the representation of one language with the help of another: cross-lingual embeddings leverage monolingual embeddings by learning a shared vector space for two separately trained sets of monolingual vectors, such that words with similar meanings are represented by similar vectors. In this work we investigate cross-lingual embeddings for Setswana-Sepedi monolingual word vectors. We use the unsupervised cross-lingual mapping in VecMap to train the Setswana-Sepedi cross-lingual word embeddings and evaluate the quality of the resulting cross-lingual word representations using a semantic similarity task, for which we translated the WordSim and SimLex datasets into Setswana and Sepedi. We release these datasets as part of this work for other researchers. We evaluate the intrinsic quality of the embeddings to determine whether the semantic representation of the words improves. | en_US
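The abstract describes two techniques worth making concrete: learning a shared vector space for two separately trained sets of monolingual vectors, and intrinsically evaluating embeddings by correlating cosine similarities with human similarity judgements. The Python sketch below is illustrative only, not the authors' code: the orthogonal (Procrustes) mapping shown is the supervised cousin of VecMap's unsupervised method, and the tab-separated pair-file format and dict-of-arrays embedding format are assumptions made for the example.

import numpy as np
from scipy.stats import spearmanr


def orthogonal_map(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Solve W = argmin ||XW - Y||_F with W orthogonal (Procrustes).

    X and Y are row-aligned source/target embedding matrices, e.g.
    vectors for a seed dictionary of Setswana-Sepedi word pairs.
    (VecMap's unsupervised method induces this alignment without a
    seed dictionary; this is only the core mapping idea.)
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt  # apply as X @ W to map source vectors into target space


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def spearman_on_pairs(emb: dict, pairs_path: str) -> float:
    """Spearman correlation between model and gold similarity scores.

    `pairs_path` is assumed to hold tab-separated (word1, word2, score)
    rows, e.g. a SimLex-style file translated into Setswana or Sepedi.
    """
    model, gold = [], []
    with open(pairs_path, encoding="utf-8") as f:
        for line in f:
            w1, w2, score = line.rstrip("\n").split("\t")
            if w1 in emb and w2 in emb:  # skip out-of-vocabulary pairs
                model.append(cosine(emb[w1], emb[w2]))
                gold.append(float(score))
    corr, _ = spearmanr(model, gold)
    return corr

Given monolingual Setswana and Sepedi vectors loaded into such a dictionary, spearman_on_pairs could be run on the translated WordSim and SimLex files to reproduce the kind of intrinsic evaluation the abstract describes.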
dc.format | Fulltext | en_US
dc.language.iso | en | en_US
dc.relation.uri | https://upjournals.up.ac.za/index.php/dhasa/article/view/3822 | en_US
dc.source | Proceedings of the International Conference of the Digital Humanities Association of Southern Africa (DHASA), Virtual Conference, 29 November - 3 December 2021 | en_US
dc.subject | Cross-lingual embeddings | en_US
dc.subject | Word embeddings | en_US
dc.subject | Intrinsic evaluation | en_US
dc.subject | Setswana language | en_US
dc.subject | Sepedi language | en_US
dc.title | Training cross-lingual embeddings for Setswana and Sepedi | en_US
dc.type | Conference Presentation | en_US
dc.description.pages | 9 | en_US
dc.description.note | Paper published in Proceedings of the International Conference of the Digital Humanities Association of Southern Africa (DHASA), Virtual Conference, 29 November - 3 December 2021 | en_US
dc.description.cluster | Next Generation Enterprises & Institutions | en_US
dc.description.impactarea | Data Science | en_US
dc.identifier.apacitation | Makgatho, M., Marivate, V., Sefara, T. J., & Wagner, V. (2022). Training cross-lingual embeddings for Setswana and Sepedi. http://hdl.handle.net/10204/12414 | en_ZA
dc.identifier.chicagocitation | Makgatho, M, V Marivate, Tshephisho J Sefara, and V Wagner. "Training cross-lingual embeddings for Setswana and Sepedi." <i>Proceedings of the International Conference of the Digital Humanities Association of Southern Africa (DHASA), Virtual Conference, 29 November - 3 December 2021</i> (2022): http://hdl.handle.net/10204/12414 | en_ZA
dc.identifier.vancouvercitation | Makgatho M, Marivate V, Sefara TJ, Wagner V, Training cross-lingual embeddings for Setswana and Sepedi; 2022. http://hdl.handle.net/10204/12414. | en_ZA
dc.identifier.ris |
TY - Conference Presentation
AU - Makgatho, M
AU - Marivate, V
AU - Sefara, Tshephisho J
AU - Wagner, V
AB - How can we transfer semantic information between two low-resourced African languages when one has more resources than the other? African languages still lag behind in advances in Natural Language Processing, one reason being the lack of representative data; a technique that can transfer information between languages can help mitigate this data problem. This paper trains Setswana and Sepedi monolingual word vectors and uses VecMap to create Setswana-Sepedi cross-lingual embeddings, since cross-lingual embeddings can be used to transfer semantic information from resource-rich to low-resourced languages. Word embeddings represent words as vectors of continuous floating-point numbers, where semantically similar words are mapped to nearby points in n-dimensional space; words used in similar contexts are mapped close to each other in the vector space, and the geometric relations between words can be understood by computing the cosine distance between their vectors. The idea of word embeddings is based on the distributional hypothesis, which states that semantically similar words are distributed in similar contexts (Harris, 1954). Monolingual embeddings can be used for cross-lingual transfer, improving the representation of one language with the help of another: cross-lingual embeddings leverage monolingual embeddings by learning a shared vector space for two separately trained sets of monolingual vectors, such that words with similar meanings are represented by similar vectors. In this work we investigate cross-lingual embeddings for Setswana-Sepedi monolingual word vectors. We use the unsupervised cross-lingual mapping in VecMap to train the Setswana-Sepedi cross-lingual word embeddings and evaluate the quality of the resulting cross-lingual word representations using a semantic similarity task, for which we translated the WordSim and SimLex datasets into Setswana and Sepedi. We release these datasets as part of this work for other researchers. We evaluate the intrinsic quality of the embeddings to determine whether the semantic representation of the words improves.
DA - 2022-02
DB - ResearchSpace
DP - CSIR
J1 - Proceedings of the International Conference of the Digital Humanities Association of Southern Africa (DHASA), Virtual Conference, 29 November - 3 December 2021
KW - Cross-lingual embeddings
KW - Word embeddings
KW - Intrinsic evaluation
KW - Setswana language
KW - Sepedi language
LK - https://researchspace.csir.co.za
PY - 2022
T1 - Training cross-lingual embeddings for Setswana and Sepedi
TI - Training cross-lingual embeddings for Setswana and Sepedi
UR - http://hdl.handle.net/10204/12414
ER -
| en_ZA
dc.identifier.worklist | 25677 | en_US