The growing use of informal social text messages on Twitter is one of the known sources of big data. These types of messages are noisy and frequently rife with acronyms, slang, grammatical errors and non-standard words, which cause problems for natural language processing (NLP) techniques. In this study, we target non-standard words in short texts and propose a method for identifying the standard form to which a given word is likely to be transformed. Our method uses language model probability to characterise the relationship between formal and informal words, then employs string similarity within a log-linear model that includes features for both word-level transformation and local context similarity. The weights of these features are trained in a maximum likelihood framework using stochastic gradient descent (SGD) to hypothesise the best clean form of a given informal short text. Experiments were conducted on a publicly available English-language tweet dataset, and the approach is able to normalise inflected words in an online social network.
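As an illustration only, the following minimal sketch shows how a log-linear model over candidate normalisations can be trained by maximum likelihood with SGD. It is not the authors' implementation: the toy unigram counts, the training pairs, the two features (a `difflib` string-similarity ratio and a unigram log-probability) and all function names are assumptions made for this example, and the local context similarity feature mentioned in the abstract is omitted for brevity.

```python
import math
from difflib import SequenceMatcher

# Hypothetical toy unigram counts standing in for a real language model.
UNIGRAM_COUNTS = {"tomorrow": 60, "tonight": 60, "great": 60, "tomato": 20, "grate": 5}
TOTAL = sum(UNIGRAM_COUNTS.values())
VOCAB = sorted(UNIGRAM_COUNTS)

# Hypothetical (informal word, formal word) training pairs.
TRAIN_PAIRS = [("tmrw", "tomorrow"), ("gr8", "great"), ("2nite", "tonight")]


def features(informal, candidate):
    """Feature vector: [string similarity, unigram log-probability]."""
    similarity = SequenceMatcher(None, informal, candidate).ratio()
    log_prob = math.log(UNIGRAM_COUNTS[candidate] / TOTAL)
    return [similarity, log_prob]


def probabilities(weights, informal):
    """Log-linear (softmax) distribution over all candidates in VOCAB."""
    scores = [
        sum(w * f for w, f in zip(weights, features(informal, c))) for c in VOCAB
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def train(pairs, epochs=50, lr=0.1):
    """Maximise the conditional log-likelihood with per-example SGD updates."""
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for informal, gold in pairs:
            probs = probabilities(weights, informal)
            cand_feats = [features(informal, c) for c in VOCAB]
            gold_feats = features(informal, gold)
            for k in range(len(weights)):
                # Gradient: observed feature value minus its model expectation.
                expected = sum(p * f[k] for p, f in zip(probs, cand_feats))
                weights[k] += lr * (gold_feats[k] - expected)
    return weights


def normalise(weights, informal):
    """Return the candidate with the highest model probability."""
    probs = probabilities(weights, informal)
    best = max(range(len(VOCAB)), key=probs.__getitem__)
    return VOCAB[best]


weights = train(TRAIN_PAIRS)
for word, _ in TRAIN_PAIRS:
    print(word, "->", normalise(weights, word))
```

In this sketch the candidate set is the whole toy vocabulary; a realistic system would presumably generate candidates per word and use a much richer language model and feature set.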
Reference:
Modupe, A. et al. 2017. Semi-supervised probabilistics approach for normalising informal short text messages. 2017 Conference on Information Communications Technology and Society (ICTAS), Umhlanga, South Africa, 8-10 March 2017
Modupe, A., Celik, T., Marivate, V. N., & Diale, M. (2017). Semi-supervised probabilistics approach for normalising informal short text messages. IEEE. http://hdl.handle.net/10204/9934
Copyright: 2017 IEEE. Due to copyright restrictions, the attached PDF file only contains the abstract of the full text item. For access to the full text item, please consult the publisher's website.