Similarity search in protein sequence databases using metric access methods

2013-09-13
Cetintas, Ahmet
Sacan, Ahmet
Toroslu, İsmail Hakkı
The rapid growth of biological sequence data, driven by advances in high-throughput sequencing techniques, together with the increasing complexity of hypothesis-driven exploration of this data, which requires a massive number of similarity queries, calls for new approaches to managing and analyzing sequence databases. The metric space representation of sequences is well suited to similarity search and allows several sophisticated metric-indexing techniques to be applied. In this work, we provide a thorough survey and analysis of the application of metric access methods to similarity search in protein sequence databases. We develop a framework that supports different metric space indexing methods, and we use a non-redundant sequence database to benchmark these methods in terms of the number of distance computations incurred and the computation time required during the database compilation and query phases. The parameters of each method are optimized on a subset of the experimental conditions. We demonstrate that Onion-Tree, a hybrid metric access method, performs best in both the index building and querying phases for the protein database investigated, and scales well to large databases, incurring distance computations with 0.5% of the database sequences per query.
5th International Conference on Bioinformatics and Computational Biology (4-6 March 2013)
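
To illustrate the idea behind the abstract's cost measure (number of distance computations per query), here is a minimal sketch of metric indexing. It is not the Onion-Tree or the authors' framework. It swaps in a much simpler pivot table, assumes unit-cost edit distance as the metric, and uses made-up toy sequences; the names `edit_distance` and `PivotIndex` are hypothetical. The index stores distances from every sequence to a few pivots, and the triangle inequality then lets a range query discard candidates without computing their distance to the query.

```python
from typing import Callable, List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (an assumed metric, not the paper's)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


class PivotIndex:
    """Hypothetical pivot-table index over any metric distance function."""

    def __init__(self, items: Sequence[str], pivots: Sequence[str],
                 dist: Callable[[str, str], int]) -> None:
        self.items = list(items)
        self.pivots = list(pivots)
        self.dist = dist
        # Build phase: one distance per (item, pivot) pair.
        self.table = [[dist(x, p) for p in self.pivots] for x in self.items]

    def range_query(self, query: str, radius: int) -> Tuple[List[str], int]:
        """Return (items within radius, distance computations used)."""
        q_to_p = [self.dist(query, p) for p in self.pivots]
        computed = len(self.pivots)
        matches = []
        for x, row in zip(self.items, self.table):
            # Triangle inequality: d(q, x) >= |d(q, p) - d(x, p)| for every
            # pivot p, so x cannot match if any lower bound exceeds radius.
            if any(abs(qp - xp) > radius for qp, xp in zip(q_to_p, row)):
                continue
            computed += 1
            if self.dist(query, x) <= radius:
                matches.append(x)
        return matches, computed


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    db = ["MKTAYIAK", "MKTAYIAR", "MKTAYLAK",
          "GAVLIPFW", "GAVLIPFY", "QQQQNNNN"]
    index = PivotIndex(db, pivots=["MKTAYIAK", "GAVLIPFW"], dist=edit_distance)
    hits, cost = index.range_query("MKTAYIAQ", radius=1)
    print(hits, cost)
```

The query always returns the same matches as a linear scan, because pruning only removes sequences whose lower bound already exceeds the radius. How many distance computations it saves depends on the choice of pivots and on the data, which is the kind of trade-off the benchmark described above measures.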

Citation Formats
A. Cetintas, A. Sacan, and İ. H. Toroslu, “Similarity search in protein sequence databases using metric access methods,” Honolulu, HI, United States, 2013, p. 131, Accessed: 00, 2021. [Online]. Available: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=84883633278&origin=inward.