An Efficient Part-of-Speech Tagger for Arabic

2011-02-26
Kopru, Selcuk
In this paper, we present an efficient part-of-speech (POS) tagger for Arabic that is based on a Hidden Markov Model. We explore different enhancements to improve the baseline system. Despite the morphological complexity of Arabic, our approach is data-driven and, unlike many other Arabic POS taggers, does not use a morphological analyzer or a lexicon. This makes the approach simple, very efficient, and suitable for real-life applications, while the accuracy obtained remains comparable to that of other Arabic POS taggers. In the experiments, we also thoroughly investigate aspects of Arabic POS tagging that were not examined in detail before, including tag sets and prefix and suffix analyses. Our part-of-speech tagger achieves an accuracy of 95.57% on a standard tagset for Arabic. A detailed error analysis is provided for a better evaluation of the system. We also applied the same approach to other languages, such as Farsi and German, to show that it is language independent. Accuracy rates for these languages are also provided.
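
To make the general idea of lexicon-free HMM tagging concrete, the following is a minimal sketch of a bigram HMM tagger with Viterbi decoding. It is not the paper's implementation: the function names (`train`, `log_prob`, `viterbi`), the add-one smoothing, the toy English corpus, and the three-tag tagset are all assumptions chosen for illustration. In particular, unseen words are handled here only by smoothing, not by the prefix and suffix analyses the abstract mentions.

```python
import math
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Count tag-transition and tag-word emission events from tagged sentences."""
    trans = defaultdict(Counter)   # trans[prev_tag][tag]
    emit = defaultdict(Counter)    # emit[tag][word]
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"               # sentence-start pseudo-tag
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (an illustrative smoothing choice)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence under the bigram HMM."""
    if not words:
        return []
    tags = sorted(emit)
    n_words = len(vocab) + 1       # one extra slot for unseen words
    # best[i][t] = (log score of the best path ending in tag t at position i,
    #               previous tag on that path)
    best = [{t: (log_prob(trans["<s>"], t, len(tags))
                 + log_prob(emit[t], words[0], n_words), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + log_prob(trans[p], t, len(tags)), p)
                for p in tags)
            column[t] = (score + log_prob(emit[t], word, n_words), prev)
        best.append(column)
    # Backtrace from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

# Toy training data (hypothetical, for illustration only).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
trans, emit, vocab = train(corpus)
print(viterbi(["a", "cat", "barks"], trans, emit, vocab))
```

Because the model is estimated purely from tagged text, nothing in the sketch is specific to one language, which is the property the abstract relies on when applying the same approach to Farsi and German.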

Suggestions

A heuristic algorithm for optical character recognition of Arabic script
Yarman Vural, Fatoş T.; Atici, A. Alper (1996-03-20)
In this paper, a heuristic method is developed for segmentation, feature extraction and recognition of the Arabic script. The study is part of a large project for the transcription of the documents in Ottoman Archives. A geometrical and topological feature analysis method is developed for segmentation and feature extraction stages. Chain code transformation is applied to main strokes of the characters which are then classified by the hidden Markov model (HMM) in the recognition stage. Experimental results i...
A heuristic algorithm for optical character recognition of Arabic script
Atici, A. Alper; Yarman Vural, Fatoş T. (1997-10-01)
In this paper, a heuristic method is developed for segmentation, feature extraction and recognition of the Arabic script. The study is part of a large project for transcription of the documents in Ottoman Archives. A geometrical and topological feature analysis method is developed for segmentation and feature extraction stages. Chain code transformation is applied to main strokes of the characters, which are classified by the hidden Markov model (HMM) in the recognition stage. Experimental results indicate ...
A character recognizer for Turkish language
Korkmaz, SU; Akinci, GKY; Atalay, Mehmet Volkan (2003-01-01)
This paper presents a contextual post-processing subsystem for a Turkish machine-printed character recognition system. The contextual post-processing subsystem is based on positional binary 3-gram statistics for the Turkish language, an error-corrector parser, and a lexicon that contains root words and their inflected forms. The error-corrector parser is used for correcting CR alternatives using Turkish morphology.
A framework for sentiment analysis in Turkish: application to polarity detection of movie reviews in Turkish
Vural, Gural; Cambazoglu, Barla; Tokgoz, Özge Zehra; Karagöz, Pınar (2012-10-28)
In this work, we present a framework for unsupervised sentiment analysis in Turkish text documents. As part of our framework, we customize the SentiStrength sentiment analysis library by translating its lexicon to Turkish. We apply our framework to the problem of classifying the polarity of movie reviews. For performance evaluation, we use a large corpus of Turkish movie reviews obtained from a popular Turkish social media site. Although our framework is unsupervised, it is demonstrated to achieve a fairly ...
Pronominal anaphora resolution in Turkish and English
Ertan, Melek; Zeyrek Bozşahin, Deniz; Department of Cognitive Sciences (2023-1-27)
This research analyzes pronominal anaphora in a Turkish and English translated TED corpus, namely the TED-MDB (Zeyrek et al., 2020), and presents a heuristic-based resolution algorithm for resolving pronominal anaphora in these languages separately. The corpus has characteristics of spoken language and has 364 English sentences aligned with their Turkish counterparts. The research is divided into two stages. In the first stage, the data was annotated using a web-based annotation tool INCEpTION (Klie et al., ...
Citation Formats
S. Kopru, “An Efficient Part-of-Speech Tagger for Arabic,” 2011, vol. 6608, p. 202, Accessed: 00, 2020. [Online]. Available: https://hdl.handle.net/11511/64364.