Approximate similarity search in genomic sequence databases using landmark-guided embedding
Date
2008-04-12
Author
Sacan, Ahmet
Toroslu, İsmail Hakkı
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Similarity search in sequence databases is of paramount importance in bioinformatics research. As the size of genomic databases increases, similarity search of proteins in these databases becomes a bottleneck in large-scale studies, calling for more efficient methods of content-based retrieval. In this study, we present a metric-preserving, landmark-guided embedding approach to represent sequences in the vector domain in order to allow efficient indexing and similarity search. We analyze various properties of the embedding and show that the approximation achieved by the embedded representation is sufficient to achieve biologically relevant results. The approximate representation is shown to provide several orders of magnitude speed-up in similarity search compared to the exact representation, while maintaining comparable search accuracy.
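The general idea of a landmark-based embedding can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not say how landmarks are chosen or which sequence distance is used, so the sketch assumes plain edit distance as the metric, takes the landmarks as given, and compares embedded vectors with the Chebyshev (L-infinity) distance. The function names `embed`, `chebyshev`, and `approximate_search` are hypothetical and serve only for illustration.

```python
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as an assumed stand-in metric on sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def embed(seq: str, landmarks: Sequence[str]) -> List[int]:
    """Map a sequence to a vector of its distances to each landmark."""
    return [edit_distance(seq, lm) for lm in landmarks]


def chebyshev(u: Sequence[int], v: Sequence[int]) -> int:
    """L-infinity distance between two embedded vectors."""
    return max(abs(x - y) for x, y in zip(u, v))


def approximate_search(query: str,
                       database: Sequence[str],
                       landmarks: Sequence[str],
                       vectors: Sequence[Sequence[int]],
                       k: int = 3) -> List[str]:
    """Rank database entries by embedded distance to the query.

    Only len(landmarks) exact distance computations are needed for the
    query; the rest of the work happens in the vector domain.
    """
    q = embed(query, landmarks)
    ranked = sorted(range(len(database)),
                    key=lambda i: chebyshev(q, vectors[i]))
    return [database[i] for i in ranked[:k]]


if __name__ == "__main__":
    # Toy sequences and an arbitrary landmark choice (both hypothetical).
    database = ["MKTAYIAKQR", "MKTAYIAKQQ", "GAVLIPFMW",
                "MKVAYLAKQR", "GAVLIPFWW"]
    landmarks = database[:2]
    vectors = [embed(s, landmarks) for s in database]  # precomputed once
    print(approximate_search("MKTAYIAKQK", database, landmarks, vectors, k=2))
```

For any true metric, the triangle inequality guarantees that the Chebyshev distance between two embedded vectors never exceeds the original distance between the sequences, which is one reason landmark embeddings of this kind are useful as a cheap filter before exact comparison. In practice the embedded vectors would be stored in a multidimensional index rather than scanned linearly as in this sketch.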
Subject Keywords
Genomics, Bioinformatics, Databases, Sequences, Indexing, Matrices, Proteins, Data engineering, Large-scale systems, Computer science
URI
https://hdl.handle.net/11511/45983
DOI
https://doi.org/10.1109/icdew.2008.4498343
Collections
Department of Computer Engineering, Conference / Seminar
Suggestions
Mask Combination of Multi-Layer Graphs for Global Structure Inference
Bayram, Eda; Thanou, Dorina; Vural, Elif; Frossard, Pascal (2020-01-01)
Structure inference is an important task for network data processing and analysis in data science. In recent years, quite a few approaches have been developed to learn the graph structure underlying a set of observations captured in a data space. Although real-world data is often acquired in settings where relationships are influenced by a priori known rules, such domain knowledge is still not well exploited in structure inference problems. In this paper, we identify the structure of signals defined in a da...
Clustering of manifold-modeled data based on tangent space variations
Gökdoğan, Gökhan; Vural, Elif; Department of Electrical and Electronics Engineering (2017)
An important research topic in recent years has been to understand and analyze data collections for clustering and classification applications. In many data analysis problems, the data sets at hand have an intrinsically low-dimensional structure and admit a manifold model. Most state-of-the-art clustering methods developed for data of non-linear and low-dimensional structure are based on local linearity assumptions. However, clustering algorithms based on locally linear representations can tolerate diff...
Computational representation of protein sequences for homology detection and classification
Oğul, Hasan; Mumcuoğlu, Ünal Erkan; Department of Information Systems (2006)
Machine learning techniques have been widely used for classification problems in computational biology. They require that the input be a collection of fixed-length feature vectors. Since proteins are of varying lengths, there is a need for a means of representing protein sequences by a fixed number of features. This thesis introduces three novel methods for this purpose: n-peptide compositions with reduced alphabets, pairwise similarity scores by maximal unique matches, and pairwise similarity scores by...
Comparison of multidimensional data access methods for feature-based image retrieval
Arslan, Serdar; Saçan, Ahmet; Açar, Esra; Toroslu, İsmail Hakkı; Yazıcı, Adnan (2010-11-18)
Within the scope of information retrieval, efficient similarity search in large document or multimedia collections is a critical task. In this paper, we present a rigorous comparison of three different approaches to the image retrieval problem, including cluster-based indexing, distance-based indexing, and multidimensional scaling methods. The time and accuracy tradeoffs for each of these methods are demonstrated on a large Corel image database. Similarity of images is obtained via a feature-based similarity...
Derivation of Transcriptional Regulatory Relationships by Partial Least Squares Regression
Tan, Mehmet; Polat, Faruk; Alhajj, Reda (2009-11-04)
As the number of genes in a transcriptional regulatory network is large and the number of samples in biological data types is usually small, there is a need for integrating multiple data types for reverse engineering these networks. In this paper, we propose a method to integrate microarray gene expression, ChIP-chip and transcription factor binding motif data sets in a partial least squares regression model to derive transcription factor (TF)-gene interactions. Both single and synergistic effects of TFs ...
Citation Formats
IEEE
A. Sacan and İ. H. Toroslu, “Approximate similarity search in genomic sequence databases using landmark-guided embedding,” 2008, Accessed: 00, 2020. [Online]. Available: https://hdl.handle.net/11511/45983.