A Hybrid method for toponym recognition on informal Turkish text

2014
Sağcan, Meryem
As Internet access becomes easier and people share more information online than previous generations did, the data available from such sources grows rapidly. Because these sources are popular and widely used, the information that researchers and organizations are interested in can often be found somewhere in these data collections. The purpose of Information Extraction (IE) is to analyze this mass of information and extract the desired data from it. This study designs a system for a subfield of Information Extraction, namely Named Entity Recognition (NER), which many IE systems use as a basis. NER identifies the entities in a text that are relevant to the desired information and classifies them into a set of predefined categories such as person, location, and organization names, and date and money expressions. Much of the desired information, such as people's trends, agendas, needs, and thoughts, varies by location, and a single location name can refer to more than one place, so extracting location information is a research area in its own right. This field, called Toponym Extraction, uses NER as a basic step for recognizing location names. Toponym Extraction consists of two steps: Toponym Recognition and Toponym Resolution. The first step, Toponym Recognition, is the subject of this study. It aims to extract named entities that refer to locations, whereas Toponym Resolution aims to decide which geographical coordinates an entity refers to, since one location name can correspond to more than one set of coordinates.
The prominence of social media such as Twitter and Facebook has drawn attention from companies and researchers interested in detecting trends. However, the informal and popular nature of these services leads to noisy text with many misspellings, missing punctuation, non-standard abbreviations, and irregular capitalization, all of which make recognition difficult. This creates a new challenge for NER and, consequently, for Toponym Recognition. This thesis proposes a hybrid NER system that uses both rule-based and machine-learning-based techniques to extract toponyms from informally written, unstructured text consisting of Turkish tweets. Conditional Random Fields (CRF) are used as the machine learning tool, and features such as POS tags and a Conjunction Window are defined to train the CRF model. In the rule-based part, regular expressions encode rules for extracting words that contain substrings such as "köy", "deniz", "şehir", and "istan". The output of the rule-based part is used as a feature in the machine learning part. All defined features are tested interchangeably and incrementally. In addition, various learning mechanisms within CRF are compared in terms of accuracy. Finally, the study shows the effect of the size of the training and test data sets on system accuracy. All of these parameters are evaluated experimentally, and the combination that gives the best result is used in the comparison part, where the system is compared with previous studies.
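The way a rule-based result can feed a CRF as a feature might be sketched roughly as follows. This is an illustration only, not the thesis implementation: the four substrings are the ones named in the abstract, while `turkish_lower`, `rule_flag`, `token_features`, and the plain neighbor-token window are hypothetical names and simplifications introduced here (the thesis's actual rules, POS-tag features, and Conjunction Window definition are not given in the abstract).

```python
import re

# Substrings quoted in the abstract; the full rule set of the thesis is not known here.
TOPONYM_HINTS = ("köy", "deniz", "şehir", "istan")
HINT_RE = re.compile("|".join(re.escape(h) for h in TOPONYM_HINTS))


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish dotted/dotless i handling (assumed preprocessing step)."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def rule_flag(token: str) -> bool:
    """Return True when the token contains one of the hint substrings."""
    return HINT_RE.search(turkish_lower(token)) is not None


def token_features(tokens: list[str], i: int, window: int = 1) -> dict:
    """Build a feature dict for tokens[i], including the rule result as a feature.

    The window here is a simple neighbor-token window and is only an
    approximation of whatever the thesis calls a Conjunction Window.
    """
    feats = {
        "word": turkish_lower(tokens[i]),
        "is_capitalized": tokens[i][:1].isupper(),
        "rule_hit": rule_flag(tokens[i]),
    }
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats[f"word[{offset:+d}]"] = turkish_lower(tokens[j])
            feats[f"rule_hit[{offset:+d}]"] = rule_flag(tokens[j])
    return feats


if __name__ == "__main__":
    tweet = ["Bugün", "KADIKÖY", "çok", "güzel"]
    for idx in range(len(tweet)):
        print(token_features(tweet, idx))
```

Feature dictionaries of this shape are what many CRF toolkits accept per token. Substring rules of this kind will also match non-toponyms, which is one reason to treat the rule output as a feature for the learner rather than as a final decision.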

Suggestions

A Context Aware Notification Architecture Based on Distributed Focused Crawling in the Big Data Era
AKYOL, MEHMET ALİ; Gökalp, Mert Onuralp; Kayabay, Kerem; Eren, Pekin Erhan; Koçyiğit, Altan (2017-09-08)
The amount of data created in various sources over the Web is tremendously increasing. Trying to keep track of relevant sources is an increasingly time-consuming task. The traditional way of accessing information over the Web is pull-based. Users need to query data sources in certain time intervals where an important piece of information can be lately recognized or even missed completely. Technologies including RSS help users to get push-based notifications from websites. Discovering the relevant informatio...
A content boosted collaborative filtering approach for recommender systems based on multi level and bidirectional trust data
Şahinkaya, Ferhat; Alpaslan, Ferda Nur; Department of Computer Engineering (2010)
As the Internet became widespread all over the world, people started to share great amount of data on the web and almost every people joined different data networks in order to have a quick access to data shared among people and survive against the information overload on the web. Recommender systems are created to provide users more personalized information services and to make data available for people without an extra effort. Most of these systems aim to get or learn user preferences, explicitly or impli...
A hybrid recommendation system capturing the effect of time and demographic data
Oktay, Fulya; Alpaslan, Ferda Nur; Department of Computer Engineering (2010)
The information that World Wide Web (WWW) provides have grown up very rapidly in recent years, which resulted in new approaches for people to reach the information they need. Although web pages and search engines are indeed strong enough for us to reach what we want, it is not an efficient solution to present data and wait people to reach it. Some more creative and beneficial methods had to be developed for decreasing the time to reach the information and increase the quality of the information. Recommendat...
A content boosted collaborative filtering approach for movie recommendation based on local & global similarity and missing data prediction
Özbal, Gözde; Alpaslan, Ferda Nur; Department of Computer Engineering (2009)
Recently, it has become more and more difficult for the existing web based systems to locate or retrieve any kind of relevant information, due to the rapid growth of the World Wide Web (WWW) in terms of the information space and the amount of the users in that space. However, in today's world, many systems and approaches make it possible for the users to be guided by the recommendations that they provide about new items such as articles, news, books, music, and movies. However, a lot of traditional recommen...
A Hybrid Approach for Credibility Detection in Twitter
Gun, Alper; Karagöz, Pınar (2014-06-13)
Nowadays, microblogging services are seen as a source of information. It brings us a question. Can we trust information in a microblogging service? In this paper, we focus on one of the popular microblogging services, Twitter, and try to answer which information in Twitter is credible. Newsworthiness, importance and correctness are the dimensions to be measured in this study. We propose a hybrid credibility analysis which combines feature based and graph based approaches. Our model is based on three types o...
Citation Formats
M. Sağcan, “A Hybrid method for toponym recognition on informal Turkish text,” M.S. - Master of Science, Middle East Technical University, 2014.