Unsupervised identification of redundant domain entries in InterPro database using clustering techniques

2015-09-12
InterPro is a widely used database that integrates functional signatures from different protein sequence annotation databases, together with manual curation, to provide a comprehensive resource for functional sequence annotation. However, integrating the signatures leads to inconsistent and/or redundant annotations in some cases. In this study, we propose an unsupervised method for the automatic detection of inconsistent and redundant entries in the InterPro database. Two clustering methods, the Markov Cluster Algorithm (MCL) and hierarchical clustering, are employed to investigate to what extent these signatures can be detected. The results show that a considerable proportion (~75%) of the redundant entries can be identified. The future goal is to develop a system that identifies redundant and inconsistent signatures with very high performance using supervised machine learning techniques. The findings of the study may help InterPro curators fix the problematic entries. The method may also serve curators as a road map before new signatures are integrated.
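The abstract does not say how similarity between entries is measured or how the clusters are cut, so the following is only a minimal sketch under stated assumptions. It assumes each entry can be represented by the set of proteins it matches, uses Jaccard distance between those sets, and applies a simple single-linkage variant of hierarchical clustering cut at a distance threshold. The entry names, protein identifiers, and the 0.3 threshold are hypothetical, and MCL is not shown.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of protein identifiers."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def single_linkage_clusters(entries, max_distance):
    """Naive agglomerative single-linkage clustering, cut at max_distance.

    entries maps an entry name to the set of proteins it matches
    (an assumed representation, not taken from the study).
    """
    clusters = [{name} for name in sorted(entries)]
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = min(
                jaccard_distance(entries[x], entries[y])
                for x in clusters[i]
                for y in clusters[j]
            )
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > max_distance:
            break  # remaining clusters are too far apart to merge
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


# Hypothetical toy data: entry name -> matched protein identifiers.
toy_entries = {
    "ENTRY_A": {"P1", "P2", "P3", "P4"},
    "ENTRY_B": {"P1", "P2", "P3", "P4", "P5"},
    "ENTRY_C": {"P6", "P7"},
    "ENTRY_D": {"P6", "P7", "P8", "P9", "P10"},
}

clusters = single_linkage_clusters(toy_entries, max_distance=0.3)
# Clusters with more than one member are candidates for redundancy review.
redundant_candidates = [sorted(c) for c in clusters if len(c) > 1]
print(redundant_candidates)
```

In this toy setup, only entries whose protein sets overlap heavily end up in the same cluster; any such cluster would still need a curator's judgment before entries are treated as redundant.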

Suggestions

Derivation of Transcriptional Regulatory Relationships by Partial Least Squares Regression
Tan, Mehmet; Polat, Faruk; Alhajj, Reda (2009-11-04)
As the number of genes in a transcriptional regulatory network is large and the number of samples in biological data types is usually small, there is a need for integrating multiple data types for reverse engineering these networks. In this paper, we propose a method to integrate microarray gene expression, ChIP-chip and transcription factor binding motif data sets in a partial least squares regression model to derive transcription factors (TFs) gene interactions. Both single and synergistic effects of TFs ...
Process ontology development using natural language processing: a multiple case study
Gurbuz, Ozge; Rabhi, Fethi; Demirörs, Onur (2019-09-17)
Purpose: Integrating ontologies with process modeling has gained increasing attention in recent years since it enhances data representations and makes it easier to query, store and reuse knowledge at the semantic level. The authors focused on a process and ontology integration approach by extracting the activities, roles and other concepts related to the process models from organizational sources using natural language processing techniques. As part of this study, a process ontology population (PrOnPo) metho...
Flexible Content Extraction and Querying for Videos
Demir, Utku; Koyuncu, Murat; Yazıcı, Adnan; Yilmaz, Turgay; Sert, Mustafa (2011-10-28)
In this study, a multimedia database system which includes a semantic content extractor, a high-dimensional index structure and an intelligent fuzzy object-oriented database component is proposed. The proposed system is realized by following a component-oriented approach. It supports different flexible query capabilities for the requirements of video users, which is the main focus of this paper. The query performance of the system (including automatic semantic content extraction) is tested and analyzed in t...
Semantic concept recognition from structured and unstructured inputs within cyber security domain
Hoşsucu, Alp Gökhan; Baykal, Nazife; Department of Information Systems (2015)
The linked data initiative has been quite successful in terms of publishing and interlinking data over ontological structures. The success is due to answering semantically rich queries over highly structured data. Linked data structures are widely used in various domains to solve the problem of producing domain specific knowledge which can be interpreted by automated agents without any human interference. The cyber security field is one of the domains that suffer from the excessiveness of the raw...
Image Annotation With Semi-Supervised Clustering
Sayar, Ahmet; Yarman Vural, Fatoş Tunay (2009-09-16)
Methods developed for image annotation usually make use of region clustering algorithms. Visual codebooks are generated from the region clusters of low level features. These codebooks are then matched, in various ways, with the words of the text document related to the image. In this paper, we supervise the clustering process by using three types of side information. The first one is the topic probability information obtained from the text document associated with the image. The second is the orientation an...
Citation Formats
A. S. Rifaioğlu and T. Can, “Unsupervised identification of redundant domain entries in InterPro database using clustering techniques,” 2015, Accessed: 00, 2020. [Online]. Available: https://hdl.handle.net/11511/31766.