Efficient Name Disambiguation for Large Scale Datasets.

2006-09-18
Huang, Jian
Ertekin Bolelli, Şeyda
Giles, C Lee
The name disambiguation problem arises when one seeks the publication list of an author who has used several name variations, or when multiple authors share the same name. We present an efficient integrative framework for solving it: a blocking method retrieves candidate classes of authors with similar names, and a clustering method, DBSCAN, clusters papers by author. The distance metric between papers used in DBSCAN is computed by an online active selection support vector machine algorithm (LASVM), which yields a simpler model, lower test error, and faster prediction than a standard SVM. We prove that, by recasting transitivity as density reachability in DBSCAN, transitivity is guaranteed for core points. For evaluation, we manually annotated 3,355 papers, yielding 490 authors, and achieved a pairwise F1 of 90.6%. For scalability, authors in the entire CiteSeer dataset, over 700,000 papers, were readily disambiguated.
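The pipeline described in the abstract (blocking, then DBSCAN over a pairwise paper distance, then pairwise-F1 scoring) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the blocking key (last name plus first initial), the Jaccard distance on coauthor sets that stands in for the LASVM-learned metric, the `eps` and `min_pts` values, the treatment of noise points as singleton authors, the common pair-counting definition of pairwise F1, and the toy records.

```python
from collections import defaultdict
from itertools import combinations

def block_key(author_name):
    """Hypothetical blocking key: lowercase last name plus first initial."""
    last, first = [part.strip() for part in author_name.split(",", 1)]
    return (last.lower(), first[:1].lower())

def jaccard_distance(a, b):
    """Placeholder distance on coauthor sets (the paper learns its metric with LASVM)."""
    union = a["coauthors"] | b["coauthors"]
    if not union:
        return 1.0
    return 1.0 - len(a["coauthors"] & b["coauthors"]) / len(union)

def dbscan(points, distance, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 meaning noise."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        # A point always counts as its own neighbor.
        return [j for j in range(n) if j == i or distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point, so expand from it
                queue.extend(reachable)
    return labels

def disambiguate(papers, eps=0.5, min_pts=2):
    """Block papers by name key, then cluster each block independently."""
    blocks = defaultdict(list)
    for paper in papers:
        blocks[block_key(paper["author"])].append(paper)
    assignment = {}
    for key, block in blocks.items():
        labels = dbscan(block, jaccard_distance, eps, min_pts)
        for paper, label in zip(block, labels):
            # Assumption: each noise point is treated as its own author.
            tag = label if label >= 0 else "singleton-" + paper["id"]
            assignment[paper["id"]] = (key, tag)
    return assignment

def pairwise_f1(predicted, truth):
    """F1 over paper pairs that are placed in the same cluster."""
    ids = sorted(predicted)
    pred_pairs = {p for p in combinations(ids, 2) if predicted[p[0]] == predicted[p[1]]}
    true_pairs = {p for p in combinations(ids, 2) if truth[p[0]] == truth[p[1]]}
    hits = len(pred_pairs & true_pairs)
    if hits == 0:
        return 0.0
    precision = hits / len(pred_pairs)
    recall = hits / len(true_pairs)
    return 2 * precision * recall / (precision + recall)

# Toy records with made-up names and coauthors.
papers = [
    {"id": "p1", "author": "Smith, John", "coauthors": {"a", "b"}},
    {"id": "p2", "author": "Smith, J.", "coauthors": {"a", "b", "c"}},
    {"id": "p3", "author": "Smith, John", "coauthors": {"x", "y"}},
    {"id": "p4", "author": "Smith, J", "coauthors": {"x", "y", "z"}},
    {"id": "p5", "author": "Jones, Ann", "coauthors": {"q"}},
]
truth = {"p1": "A", "p2": "A", "p3": "B", "p4": "B", "p5": "C"}
result = disambiguate(papers)
score = pairwise_f1(result, truth)
```

Blocking keeps the quadratic neighbor search confined to papers whose author names could plausibly match, which is what makes this kind of clustering tractable on a large collection. In this sketch the two "Smith" authors are separated because their coauthor sets do not overlap.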

Citation Formats
J. Huang, Ş. Ertekin Bolelli, and C. L. Giles, “Efficient Name Disambiguation for Large Scale Datasets,” 2006, Accessed: 00, 2020. [Online]. Available: https://hdl.handle.net/11511/69527.