Efficient Name Disambiguation for Large Scale Datasets.

2006-09-18
Huang, Jian
Ertekin Bolelli, Şeyda
Giles, C Lee
Name disambiguation can occur when one is seeking a list of publications of an author who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names and a clustering method, DBSCAN, clusters papers by author. The distance metric between papers used in DBSCAN is calculated by an online active selection support vector machine algorithm (LASVM), yielding a simpler model, lower test errors and faster prediction time than a standard SVM. We prove that by recasting transitivity as density reachability in DBSCAN, transitivity is guaranteed for core points. For evaluation, we manually annotated 3,355 papers yielding 490 authors and achieved 90.6% pairwise-F1. For scalability, authors in the entire CiteSeer dataset, over 700,000 papers, were readily disambiguated.
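
The pipeline described in the abstract (blocking by similar names, then density-based clustering of papers within each block, then pairwise-F1 evaluation) can be illustrated with a small, hedged sketch. It is not the authors' implementation. The blocking key (last name plus first initial), the Jaccard distance on keyword sets (a stand-in for the LASVM-derived metric), the `eps` and `min_pts` values, and the toy records are all assumptions made for illustration, and names are assumed to be in "Last, First" form.

```python
from collections import defaultdict
from itertools import combinations

def block_key(name):
    """Blocking key: lowercase last name plus first initial (assumed scheme)."""
    last, first = [part.strip().lower() for part in name.split(",", 1)]
    return f"{last}_{first[:1]}"

def jaccard_distance(a, b):
    """Stand-in distance on feature sets; the paper uses an LASVM-derived metric."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0

def dbscan(items, distance, eps, min_pts):
    """Minimal DBSCAN; returns one label per item, with -1 meaning noise."""
    n = len(items)
    # Each point counts as its own neighbor, as in standard DBSCAN.
    neighbors = [[j for j in range(n) if distance(items[i], items[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                frontier.extend(neighbors[j])  # only core points expand the cluster
    return labels

def disambiguate(papers, eps=0.6, min_pts=2):
    """papers: list of (author_name, feature_set). Returns one label per paper."""
    blocks = defaultdict(list)
    for idx, (name, _) in enumerate(papers):
        blocks[block_key(name)].append(idx)
    result = [None] * len(papers)
    for key, members in blocks.items():
        feats = [papers[i][1] for i in members]
        for i, label in zip(members, dbscan(feats, jaccard_distance, eps, min_pts)):
            # Noise points are treated as singleton authors.
            result[i] = (key, label if label != -1 else f"noise{i}")
    return result

def pairwise_f1(predicted, truth):
    """F1 over paper pairs placed in the same cluster vs. the same true author."""
    pairs = list(combinations(range(len(truth)), 2))
    pred_same = {p for p in pairs if predicted[p[0]] == predicted[p[1]]}
    true_same = {p for p in pairs if truth[p[0]] == truth[p[1]]}
    tp = len(pred_same & true_same)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_same), tp / len(true_same)
    return 2 * precision * recall / (precision + recall)

# Toy records (hypothetical): author name as printed, plus keyword features.
papers = [
    ("Huang, Jian", {"giles", "citeseer", "svm"}),
    ("Huang, J.", {"giles", "citeseer", "dbscan"}),
    ("Huang, Jian", {"giles", "svm", "dbscan"}),
    ("Huang, J", {"protein", "folding", "nmr"}),
    ("Huang, Jing", {"protein", "nmr", "enzyme"}),
    ("Giles, C. Lee", {"citeseer", "huang"}),
]
truth = ["A", "A", "A", "B", "B", "C"]
labels = disambiguate(papers)
print(labels)
print(pairwise_f1(labels, truth))
```

Blocking keeps the quadratic distance computation confined to papers whose author names could plausibly match, which is what makes this kind of approach tractable on a collection the size of CiteSeer. In this toy run the five "Huang, J" papers fall into one block and split into two clusters, while the lone "Giles" paper is left as a singleton.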

Suggestions

Incremental clustering with vector expansion for online event detection in microblogs
Ozdikis, Ozer; Karagöz, Pınar; Oğuztüzün, Mehmet Halit S. (2017-11-04)
Identifying similarities in microblog posts for event detection poses challenges due to short texts with idiosyncratic spellings, irregular writing styles, abbreviations and synonyms. In order to overcome these challenges, we present an enhancement to the incremental clustering techniques by detecting similar terms in microblog posts in a temporal context. We devise an unsupervised method to measure the similarities online using co-occurrence-based techniques and use them in a vector expansion process. The ...
Predicting the effect of hydrophobicity surface on binding affinity of PCP-like compounds using machine learning methods
Yoldaş, Mine; Alpaslan, Ferda Nur; Büyükbingöl, Erdem; Department of Computer Engineering (2011)
This study aims to predict the binding affinity of PCP-like compounds by means of molecular hydrophobicity. Molecular hydrophobicity is an important property that affects the binding affinity of molecules. The molecular hydrophobicity values are obtained on a three-dimensional coordinate system. Our aim is to reduce the number of points on the hydrophobicity surface of the molecules. This is modeled by using self-organizing maps (SOM) and k-means clustering. The feature sets obtained fro...
K-SVMeans: A hybrid clustering algorithm for multi-type interrelated datasets
Bolelli, Levent; Ertekin Bolelli, Şeyda; Zhou, Ding; Giles, C. Lee (2007-01-01)
Identification of distinct clusters of documents in text collections has traditionally been addressed by assuming that the data instances can only be represented by homogeneous and uniform features. Many real-world datasets, on the other hand, comprise multiple types of heterogeneous interrelated components, such as web pages and hyperlinks, or online scientific publications with their authors and publication venues, to name a few. In this paper, we present K-SVMeans, a clustering algorithm for multi-type ...
Employing Named Entities for Semantic Retrieval of News Videos in Turkish
Kucuk, Dilek; Yazıcı, Adnan (2009-09-16)
Named entities are known to be important means for semantic annotation of news texts. Considerable work has been carried out for semantic indexing of both textual news and news videos especially in English through the employment of named entities extracted from textual news or transcriptions of the news videos. In this paper, we present our semantic retrieval architecture for news videos in Turkish based on prior semantic annotation of the videos with the corresponding named entities in the news transcripti...
Citation Formats
J. Huang, Ş. Ertekin Bolelli, and C. L. Giles, “Efficient Name Disambiguation for Large Scale Datasets,” 2006. Accessed: 2020. [Online]. Available: https://hdl.handle.net/11511/69527.