Unsupervised identification of redundant domain entries in InterPro database using clustering techniques

Date

2015-09-12

Author

Rifaioğlu, Ahmet Süreyya
Can, Tolga

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

InterPro is a widely used database that integrates functional signatures from different protein sequence annotation databases, with manual curation, to present a comprehensive resource for functional sequence annotation. However, integrating these signatures produces inconsistent and/or redundant annotations in some cases. In this study, we propose an unsupervised method for the automatic detection of inconsistent and redundant entries in the InterPro database. Two clustering methods, the Markov Cluster Algorithm (MCL) and hierarchical clustering, are employed to investigate the extent to which such entries can be detected. Results show that a considerable proportion (~75%) of redundant entries can be identified. The future goal is to develop a system that identifies redundant and inconsistent signatures with very high performance using supervised machine learning techniques. The findings of the study may help InterPro curators fix the problematic entries. The method may also be used by curators as a road map before new signatures are integrated.
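
The abstract does not say how entries are compared or how the clustering is configured, so the following is only a minimal sketch of the general idea, not the authors' pipeline. It assumes that each entry is represented by the set of proteins it matches, that similarity between entries is the Jaccard index of those sets, and that entries falling into the same MCL cluster are candidates for redundancy. The entry names, protein IDs, and the inflation value of 2.0 are all hypothetical, and the toy groups are deliberately disjoint so the outcome is easy to follow.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix), in place."""
    n = len(m)
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        if s > 0:
            for i in range(n):
                m[i][j] /= s
    return m

def matmul(a, b):
    """Plain dense matrix product for small square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mcl(sim, inflation=2.0, max_iter=100, tol=1e-9):
    """Basic Markov Cluster Algorithm on a symmetric similarity matrix."""
    n = len(sim)
    m = [[sim[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        m[i][i] = 1.0  # self-loops, as is customary for MCL
    normalize_columns(m)
    for _ in range(max_iter):
        expanded = matmul(m, m)                                   # expansion
        inflated = [[x ** inflation for x in row] for row in expanded]
        normalize_columns(inflated)                               # inflation
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row with non-negligible mass lists the members of one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)

# Hypothetical toy data: entry name -> set of matched protein IDs.
entries = {
    "IPR_A": {"P1", "P2", "P3", "P4"},
    "IPR_B": {"P1", "P2", "P3", "P4", "P5"},
    "IPR_C": {"P6", "P7", "P8"},
    "IPR_D": {"P6", "P7", "P8"},
    "IPR_E": {"P9", "P10"},
}
names = sorted(entries)
sim = [[jaccard(entries[a], entries[b]) for b in names] for a in names]

clusters = mcl(sim)
# Clusters with more than one entry are flagged as redundancy candidates.
redundant = [[names[i] for i in c] for c in clusters if len(c) > 1]
print(redundant)
```

On real data the similarity graph would not split into disjoint groups, and the inflation parameter would control how coarse or fine the resulting clusters are; the flagged groups would then be handed to curators for review rather than treated as confirmed redundancies.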

Subject Keywords

Applied computing, Human-centered computing, Computing methodologies, Mathematics of computing, Theory of computation, Markov processes, Markov decision processes

URI

https://hdl.handle.net/11511/31766

DOI

https://doi.org/10.1145/2808719.2811430

Collections

Graduate School of Natural and Applied Sciences, Conference / Seminar

Citation Formats

A. S. Rifaioğlu and T. Can, “Unsupervised identification of redundant domain entries in InterPro database using clustering techniques,” 2015. [Online]. Available: https://hdl.handle.net/11511/31766.