Efficient name disambiguation for large-scale databases

Date

2006-01-01

Author

Huang, Jian
Ertekin Bolelli, Şeyda
Giles, C. Lee

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Name disambiguation is needed when one seeks the list of publications of an author who has used different name variations and when multiple other authors share the same name. We present an efficient integrative framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names, and a clustering method, DBSCAN, clusters papers by author. The distance metric between papers used in DBSCAN is computed by an online active-selection support vector machine algorithm (LASVM), which yields a simpler model, lower test error, and faster prediction than a standard SVM. We prove that, by recasting transitivity as density reachability in DBSCAN, transitivity is guaranteed for core points. For evaluation, we manually annotated 3,355 papers, yielding 490 authors, and achieved 90.6% pairwise F1. For scalability, authors in the entire CiteSeer dataset, over 700,000 papers, were readily disambiguated.
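
The pipeline described in the abstract (blocking, then DBSCAN within each block, then pairwise-F1 scoring) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The blocking key (last name plus first initial), the Jaccard distance over coauthor sets (a stand-in for the LASVM-learned distance), the toy records, and the values of eps and min_pts are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_key(name):
    """Hypothetical blocking key: last name plus first initial, lowercased."""
    last, _, first = name.partition(",")
    return (last.strip().lower(), first.strip()[:1].lower())


def jaccard_distance(a, b):
    """Stand-in for the learned distance: 1 - Jaccard overlap of coauthor sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0


def dbscan(items, dist, eps, min_pts):
    """Plain DBSCAN; returns one label per item, with -1 marking noise."""
    n = len(items)
    neighbors = [
        [j for j in range(n) if dist(items[i], items[j]) <= eps] for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


def disambiguate(papers, eps, min_pts=2):
    """papers: list of (paper_id, author_name, coauthor_set) tuples."""
    blocks = defaultdict(list)
    for paper in papers:
        blocks[block_key(paper[1])].append(paper)
    result = {}
    for key, block in blocks.items():
        labels = dbscan([p[2] for p in block], jaccard_distance, eps, min_pts)
        for (pid, _, _), label in zip(block, labels):
            # Noise papers are treated as single-paper authors.
            result[pid] = (key, label if label != -1 else f"solo-{pid}")
    return result


def pairwise_f1(predicted, truth):
    """F1 over pairs of papers placed in the same cluster."""
    ids = sorted(predicted)
    pred = {p for p in combinations(ids, 2) if predicted[p[0]] == predicted[p[1]]}
    true = {p for p in combinations(ids, 2) if truth[p[0]] == truth[p[1]]}
    if not pred or not true:
        return 0.0
    precision = len(pred & true) / len(pred)
    recall = len(pred & true) / len(true)
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy records with hypothetical names and coauthor tokens.
papers = [
    ("p1", "Smith, John", {"a", "b"}),
    ("p2", "Smith, J.", {"a", "b", "c"}),
    ("p3", "Smith, J. R.", {"b", "c"}),
    ("p4", "Smith, Jane", {"x", "y"}),
    ("p5", "Smith, J.", {"x", "y", "z"}),
    ("p6", "Lee, Ann", {"q"}),
]
truth = {"p1": 0, "p2": 0, "p3": 0, "p4": 1, "p5": 1, "p6": 2}
clusters = disambiguate(papers, eps=0.5)
print(clusters, pairwise_f1(clusters, truth))
```

In this toy data, p1 and p3 are farther apart than eps, yet they end up in the same cluster because both are density-reachable through p2. This is the sense in which the abstract recasts transitivity as density reachability.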

URI

https://hdl.handle.net/11511/55057

Journal

KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2006, PROCEEDINGS

Collections

Department of Computer Engineering, Article

Suggestions

Efficient Name Disambiguation for Large Scale Datasets. Huang, Jian; Ertekin Bolelli, Şeyda; Giles, C Lee (2006-09-18) Name disambiguation can occur when one is seeking a list of publications of an author who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names and a clustering method, DBSCAN, clusters papers by author. The distance metric between papers used in DBSCAN is calculated by an online active selection supp...
Incremental clustering with vector expansion for online event detection in microblogs Ozdikis, Ozer; Karagöz, Pınar; Oğuztüzün, Mehmet Halit S. (2017-11-04) Identifying similarities in microblog posts for event detection poses challenges due to short texts with idiosyncratic spellings, irregular writing styles, abbreviations and synonyms. In order to overcome these challenges, we present an enhancement to the incremental clustering techniques by detecting similar terms in microblog posts in a temporal context. We devise an unsupervised method to measure the similarities online using co-occurrence-based techniques and use them in a vector expansion process. The ...
Employing Named Entities for Semantic Retrieval of News Videos in Turkish Kucuk, Dilek; Yazıcı, Adnan (2009-09-16) Named entities are known to be important means for semantic annotation of news texts. Considerable work has been carried out for semantic indexing of both textual news and news videos especially in English through the employment of named entities extracted from textual news or transcriptions of the news videos. In this paper, we present our semantic retrieval architecture for news videos in Turkish based on prior semantic annotation of the videos with the corresponding named entities in the news transcripti...
Selective word encoding for effective text representation Ozkan, Savas; Ozkan, Akin (The Scientific and Technological Research Council of Turkey, 2019-01-01) Determining the category of a text document from its semantic content is highly motivated in the literature and it has been extensively studied in various applications. Also, the compact representation of the text is a fundamental step in achieving precise results for the applications and the studies are generously concentrated to improve its performance. In particular, the studies which exploit the aggregation of word-level representations are the mainstream techniques used in the problem. In this paper, w...
Person name recognition in Turkish financial texts by using local grammar approach Bayraktar, Özkan; Taşkaya Temizel, Tuğba; Department of Information Systems (2007) Named entity recognition (NER) is the task of identifying the named entities (NEs) in the texts and classifying them into semantic categories such as person, organization, and place names and time, date, monetary, and percent expressions. NER has two principal aims: identification of NEs and classification of them into semantic categories. The local grammar (LG) approach has recently been shown to be superior to other NER techniques such as the probabilistic approach, the symbolic approach, and the hybrid a...

Citation Formats

J. Huang, Ş. Ertekin Bolelli, and C. L. Giles, “Efficient name disambiguation for large-scale databases,” KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2006, PROCEEDINGS, pp. 536–544, 2006, Accessed: 2020. [Online]. Available: https://hdl.handle.net/11511/55057.