Named entity recognition in Turkish with Bayesian learning and hybrid approaches

2011
Yavuz, Sermet Reha
Information Extraction (IE) is the process of extracting structured and important pieces of information from a set of unstructured natural-language text documents. The ultimate goal of structured information extraction is to populate a database so that the data can be accessed effectively. Our study focuses on named entity recognition (NER), which is an important subtask of IE. NER deals with the extraction of named entities such as person, location and organization names, temporal expressions (date and time) and numerical expressions (money and percent). NER research on Turkish is relatively scarce. There are rule-based, learning-based and hybrid systems for NER on Turkish texts. Some of the learning approaches used for NER in Turkish are conditional random fields (CRF), rote learning, and rule extraction and generalization. In this thesis, we propose a learning-based named entity recognizer for Turkish texts which employs a modified version of Bayesian learning as the learning scheme. To the best of our knowledge, this is the first learning-based system that uses a Bayesian approach for NER in Turkish. Several features (such as token length, capitalization and lexical meaning) are used in the system to observe the effects of different features on the NER process. We also propose a hybrid system in which the Bayesian learning-based system is utilized along with a rule-based recognition system. There are two versions of the hybrid system, which utilize the output of the rule-based recognizer in different phases. We observed an increase in F-Measure values for both hybrid versions. When partial scoring is active, the hybrid system reaches an F-Measure of 91.44%, compared with 87.43% for the rule-based system and 88.41% for the learning-based system. In the future, the hybrid system can be improved by utilizing the rule-based and learning-based components differently. It can also be improved by using different learning approaches and combining them with the existing hybrid system, or by forming the hybrid system with a completely new approach.
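As a rough illustration of how a Bayesian token classifier can use surface features such as capitalization and token length, the following is a minimal sketch of a plain naive Bayes tagger. It is not the modified Bayesian scheme proposed in the thesis: the feature definitions, the two-label set (ENTITY / O), the smoothing constant and the toy training tokens are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def token_features(token):
    """Map a token to two coarse, hypothetical surface features."""
    return {
        "capitalized": token[:1].isupper(),
        "length": "short" if len(token) <= 3 else "long",
    }


class NaiveBayesTagger:
    """Plain naive Bayes over token features (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)
        self.label_counts = Counter()
        # (label, feature name) -> counts of each observed feature value
        self.feature_counts = defaultdict(Counter)
        # feature name -> set of values seen in training
        self.feature_values = defaultdict(set)

    def fit(self, tokens, labels):
        for token, label in zip(tokens, labels):
            self.label_counts[label] += 1
            for name, value in token_features(token).items():
                self.feature_counts[(label, name)][value] += 1
                self.feature_values[name].add(value)
        return self

    def predict(self, token):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(count / total)
            for name, value in token_features(token).items():
                seen = self.feature_counts[(label, name)][value]
                n_values = len(self.feature_values[name])
                score += math.log(
                    (seen + self.alpha) / (count + self.alpha * n_values)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example only.
train_tokens = ["Ankara", "Ahmet", "ve", "bir", "İstanbul", "geldi", "Ayşe", "de"]
train_labels = ["ENTITY", "ENTITY", "O", "O", "ENTITY", "O", "ENTITY", "O"]
tagger = NaiveBayesTagger().fit(train_tokens, train_labels)
```

A real recognizer would distinguish entity types, use richer features and need far more training data. The sketch only shows how per-feature likelihoods combine with a class prior to score each candidate label.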

Suggestions

Named Entity Recognition in Turkish with Bayesian Learning and Hybrid Approaches
Yavuz, Sermet Reha; Kucuk, Dilek; Yazıcı, Adnan (2013-10-29)
Named entity recognition is one of the significant textual information extraction tasks. In this paper, we present two approaches for named entity recognition on Turkish texts. The first is a Bayesian learning approach which is trained on a considerably limited training set. The second approach comprises two hybrid systems based on joint utilization of this Bayesian learning approach and a previously proposed rule-based named entity recognizer. All three proposed approaches achieve promising performa...
Clustering scientific literature using sparse citation graph analysis
Bolelli, Levent; Ertekin Bolelli, Şeyda; Giles, C. Lee (2006-01-01)
It is well known that connectivity analysis of linked documents provides significant information about the structure of the document space for unsupervised learning tasks. However, the ability to identify distinct clusters of documents based on link graph analysis is proportional to the density of the graph and depends on the availability of the linking and/or linked documents in the collection. In this paper, we present an information theoretic approach towards measuring the significance of individual word...
On Fuzzy Extensions to Energy Ontologies for Text Processing Applications
Kucuk, Dilek; Kucuk, Dogan; Yazıcı, Adnan (2014-10-28)
Ubiquitous application areas of domain ontologies include text processing applications like categorizing related documents of the domain, extraction of information from these documents, and semantic search. In this paper, we focus on the utilization of two energy ontologies, one for electrical power quality and the second for wind energy, within such applications. For this purpose, we present fuzzy extensions to these domain ontologies as fuzziness is an essential feature of the ultimate forms of the ontolo...
Comparison of multidimensional data access methods for feature-based image retrieval
Arslan, Serdar; Saçan, Ahmet; Açar, Esra; Toroslu, İsmail Hakkı; Yazıcı, Adnan (2010-11-18)
Within the scope of information retrieval, efficient similarity search in large document or multimedia collections is a critical task. In this paper, we present a rigorous comparison of three different approaches to the image retrieval problem, including cluster-based indexing, distance-based indexing, and multidimensional scaling methods. The time and accuracy tradeoffs for each of these methods are demonstrated on a large Corel image database. Similarity of images is obtained via a feature-based similarity...
Exploiting information extraction techniques for automatic semantic annotation and retrieval of news videos in Turkish
Küçük, Dilek; Yazıcı, Adnan; Department of Computer Engineering (2011)
Information extraction (IE) is known to be an effective technique for automatic semantic indexing of news texts. In this study, we propose a text-based fully automated system for the semantic annotation and retrieval of news videos in Turkish which exploits several IE techniques on the video texts. The IE techniques employed by the system include named entity recognition, automatic hyperlinking, person entity extraction with coreference resolution, and event extraction. The system utilizes the outputs of th...
Citation Formats
S. R. Yavuz, “Named entity recognition in Turkish with Bayesian learning and hybrid approaches,” M.S. - Master of Science, Middle East Technical University, 2011.