Effects of Various Preprocessing Techniques to Turkish Text Categorization Using N-Gram Features

2017-10-08
Deniz, Ayca
Kiziloz, Llakan Ezgi
Natural Language Processing (NLP) is a prominent field that includes subcategories such as text classification, error correction, and machine translation. Compared with other languages, there is a limited number of Turkish NLP studies in the literature. In this study, we apply text classification to Turkish documents by using n-gram features. Our algorithm applies different preprocessing techniques, namely, n-gram choice (character level or word level, bigram or trigram models), stemming, and use of punctuation, and then determines the Turkish document's author and genre, and the gender of the author. For this purpose, Naive Bayes, Support Vector Machines and Random Forest are used as classification techniques. Finally, we discuss the effects of the above-mentioned preprocessing techniques on the performance of Turkish text classification.
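
The preprocessing choices listed in the abstract (character or word level, bigram or trigram, with or without punctuation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `ngrams`, its parameters, and the sample sentence are hypothetical, punctuation removal covers ASCII punctuation only, and stemming is left out because it would need a Turkish stemmer that the record does not name.

```python
import string
from collections import Counter


def ngrams(text, n=2, level="char", keep_punctuation=True):
    """Count n-grams of a text at character or word level.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if not keep_punctuation:
        # Simplifying assumption: strip ASCII punctuation only.
        text = text.translate(str.maketrans("", "", string.punctuation))
    if level == "word":
        units = text.split()
    else:
        # Collapse runs of whitespace so character n-grams stay comparable.
        units = list(" ".join(text.split()))
    # Slide a window of width n over the units and count each tuple.
    return Counter(tuple(units[i:i + n]) for i in range(len(units) - n + 1))


sample = "Bu kitap çok güzel, değil mi?"  # made-up example sentence

# Word-level bigrams without punctuation.
word_bigrams = ngrams(sample, n=2, level="word", keep_punctuation=False)

# Character-level trigrams with punctuation kept.
char_trigrams = ngrams(sample, n=3, level="char", keep_punctuation=True)
```

The resulting counts could then serve as feature vectors for a classifier such as Naive Bayes, Support Vector Machines, or Random Forest, which are the three classifiers named in the abstract.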

Suggestions

Effects of web-based multimedia annotated vocabulary learning in context model on foreign language vocabulary retention of intermediate level English language learners
Baturay, Meltem Huri; Yıldırım, Soner İ.; Department of Computer Education and Instructional Technology (2007)
The aim of this study was to investigate the effects of a web-based multimedia annotated vocabulary learning in context model and of spaced repetitions on the vocabulary retention of intermediate level English language learners. The research study encompassed two main phases, which were related to the development of the material and its implementation. In WEBVOCLE, which stands for web-based vocabulary learning material, the contextual presentation of vocabulary was enriched with audible online dictionary, pictures and ...
Probabilistic learning of Turkish morphosemantics by latent syntax
Üstün, Ahmet; Bozşahin, Hüseyin Cem; Department of Cognitive Sciences (2017)
The language processing capability of humans is highly dependent on the transparent interface between syntax and semantics which is formalized as the grammar. Morphology also interferes with this interface, in languages having rich morphology such as Turkish. This thesis aims to discover word semantics in Turkish from the compositional morphosemantics by underlying latent syntax. A computational model has been developed to learn a morpheme lexicon in which each morpheme contains semantic information in logi...
Investigating the performance of segmentation methods with deep learning models for sentiment analysis on turkish informal texts
Kurt, Fatih; Karagöz, Pınar; Department of Information Systems (2018)
This work investigates segmentation approaches for informal short texts in morphologically rich languages in order to effectively classify the sentiment. The two building blocks of the proposed work in this thesis are segmentation and deep neural network model building. Segmentation focuses on preprocessing of text with different methodologies. These methodologies are grouped under four distinct approaches; namely, morphological, sub-word, tokenization, and hybrid approaches. There is mostly multiple numbers ...
Frequency-driven late fusion-based word decomposition approach on the phrase-based statistical machine translation systems
Tatlıcıoğlu, Mehmet; Yazıcı, Adnan; Department of Computer Engineering (2013)
Machine translation is the process of translating texts from a natural language to another by computers based on linguistic motivations, statistical approaches, or the combination of them. In this study, the frequency-driven late fusion-based word decomposition approach is introduced to improve the translation quality of the phrase-based statistical machine translation system from Turkish to English. This late fusion-based approach is compared with the standalone statistical and rule-based word decompositio...
Head finalization and morphological analysis in factored phrase-based statistical machine translation from English to Turkish
İmren, Haydar; Çakıcı, Ruket; Department of Computer Engineering (2015)
Machine Translation is a field of study which deals with translating text from one natural language to another automatically. Statistical Machine Translation generates the translations using statistical methods and bilingual text corpora. In this study, an approach for translating from English to Turkish is introduced. Turkish is an agglutinative language with a free constituent order, whereas English is not agglutinative and the constituent order is strict. Besides these differences, there is a lack of par...
Citation Formats
A. Deniz and L. E. Kiziloz, “Effects of Various Preprocessing Techniques to Turkish Text Categorization Using N-Gram Features,” 2017, p. 655, Accessed: 00, 2020. [Online]. Available: https://hdl.handle.net/11511/65577.