A toolkit for text extraction and analysis for natural language processing tasks

Sefara, Tshephisho J; Mbooi, Mahlatse S; Mashile, Katlego J; Rambuda, Thompho; Rangata, Mapitsi R

DOI: 10.1109/icABCD54961.2022.9856269
http://hdl.handle.net/10204/12565

Abstract:

Text extraction is an important part of natural language processing (NLP) tasks. Most NLP tasks, such as text classification, machine translation, text-to-speech, text-based language identification, text summarization, and named-entity recognition, involve the use of textual data. Such data is limited for low-resourced languages, making it difficult to experiment with advanced NLP techniques on these languages. This paper presents a Python-based toolkit for text analysis and text extraction from different types of images, documents, and audio files. The toolkit is built as a library whose functions can be imported and used for text extraction.
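
The abstract does not describe the toolkit's actual interface, so the following is only a minimal sketch of how a library of importable extraction functions might be organised. All names here (`EXTRACTORS`, `register`, `extract_text`, `word_frequencies`) are hypothetical and are not taken from the paper. Only a plain-text extractor is implemented, using the standard library; extractors for images, PDFs, or audio would presumably wrap OCR or speech-recognition packages and could be registered in the same way.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical registry mapping a file extension to an extractor function.
# This is an illustrative design, not the toolkit's documented API.
EXTRACTORS = {}


def register(*extensions):
    """Return a decorator that registers an extractor for the given extensions."""
    def wrap(func):
        for ext in extensions:
            EXTRACTORS[ext.lower()] = func
        return func
    return wrap


@register(".txt")
def extract_plain_text(path):
    """Read a UTF-8 text file and return its contents."""
    return Path(path).read_text(encoding="utf-8")


def extract_text(path):
    """Dispatch to the extractor registered for the file's extension."""
    ext = Path(path).suffix.lower()
    try:
        extractor = EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"no extractor registered for {ext!r}") from None
    return extractor(path)


def word_frequencies(text):
    """Very simple text analysis: lowercase word counts."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


if __name__ == "__main__":
    # Write a small sample file, extract its text, and count the words.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.txt"
        sample.write_text("Dumela lefatshe. Dumela!", encoding="utf-8")
        text = extract_text(sample)
        print(word_frequencies(text).most_common(1))
```

Keeping extraction behind a single `extract_text` entry point means the analysis step does not need to know whether the text came from a document, an image, or an audio file.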

Reference:

Sefara, T.J., Mbooi, M.S., Mashile, K.J., Rambuda, T. & Rangata, M.R. 2022. A toolkit for text extraction and analysis for natural language processing tasks. http://hdl.handle.net/10204/12565

Sefara, T. J., Mbooi, M. S., Mashile, K. J., Rambuda, T., & Rangata, M. R. (2022). A toolkit for text extraction and analysis for natural language processing tasks. http://hdl.handle.net/10204/12565

Sefara, Tshephisho J, Mahlatse S Mbooi, Katlego J Mashile, Thompho Rambuda, and Mapitsi R Rangata. "A toolkit for text extraction and analysis for natural language processing tasks." 2022 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems (icABCD), Durban, South Africa, 4-5 August 2022 (2022): http://hdl.handle.net/10204/12565

Sefara TJ, Mbooi MS, Mashile KJ, Rambuda T, Rangata MR. A toolkit for text extraction and analysis for natural language processing tasks; 2022. http://hdl.handle.net/10204/12565

Aug 2022

Text recognition
Text categorization
Big data
Natural Language Processing
Machine translation
Data communication

Source

2022 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems (icABCD), Durban, South Africa, 4-5 August 2022

This item appears in the following Collection(s)

Conference Publications
