Div-BLAST: Diversification of Sequence Search Results

2014-12-22
Eser, Elif
Can, Tolga
Ferhatosmanoglu, Hakan
Sequence similarity tools, such as BLAST, seek the sequences most similar to a query from a database of sequences. They return results that are significantly similar to the query sequence and that are typically also highly similar to each other. Most sequence analysis tasks in bioinformatics require an exploratory approach, where the initial results guide the user to new searches. However, diversity has not yet been considered an integral component of sequence search tools for this discipline. Some redundancy can be avoided by enforcing non-redundancy during database construction, but it is not feasible to dynamically set a level of non-redundancy tailored to a query sequence. We introduce the problem of diverse search and browsing in sequence databases, with the goal of producing non-redundant results optimized for any given query. We define diversity measures for sequences and propose methods to obtain diverse results extracted from current sequence similarity search tools. We also propose a new measure to evaluate the diversity of a set of sequences that is returned as a result of a sequence similarity query. We evaluate the effectiveness of the proposed methods in post-processing BLAST and PSI-BLAST results. We also assess the functional diversity of the returned results based on available Gene Ontology annotations. Additionally, we include a comparison with a current redundancy elimination tool, CD-HIT. Our experiments show that the proposed methods achieve more diverse yet significant result sets compared to static non-redundancy approaches. In both sequence-based and functional diversity evaluation, the proposed diversification methods significantly outperform original BLAST results and other baselines. A web-based tool implementing the proposed methods, Div-BLAST, can be accessed at cedar.cs.bilkent.edu.tr/Div-BLAST
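The general idea of post-processing a hit list for diversity can be illustrated with a small sketch. The code below is not the Div-BLAST algorithm and does not use the paper's diversity measures. It is a generic greedy re-ranking in the style of maximal marginal relevance, and it assumes a crude k-mer Jaccard similarity in place of an alignment score. The function names (`kmers`, `jaccard`, `diversify`) and the trade-off parameter `lam` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """k-mer Jaccard similarity in [0, 1] (assumed stand-in for an alignment score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def diversify(query: str, hits: Dict[str, str], n: int, lam: float = 0.5) -> List[str]:
    """Greedily select up to n hit IDs.

    Each step picks the hit with the best trade-off between similarity to the
    query (weight lam) and similarity to hits already selected (weight 1 - lam).
    """
    relevance = {hid: jaccard(query, seq) for hid, seq in hits.items()}
    selected: List[str] = []
    remaining = set(hits)
    while remaining and len(selected) < n:
        def score(hid: str) -> float:
            # Redundancy is the highest similarity to any hit chosen so far.
            redundancy = max(
                (jaccard(hits[hid], hits[s]) for s in selected), default=0.0
            )
            return lam * relevance[hid] - (1 - lam) * redundancy

        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` the function reduces to ranking by similarity to the query alone. Lower values penalize hits that resemble those already selected, so a near-duplicate of the top hit can be displaced by a less similar but less redundant sequence.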

Suggestions

mESAdb: microRNA expression and sequence analysis database.
Kaya, KD; Karakülah, G; Yakicier, CM; Acar, Aybar Can; Konu, O (2011-01-01)
MicroRNA expression and sequence analysis database (http://konulab.fen.bilkent.edu.tr/mirna/) (mESAdb) is a regularly updated database for the multivariate analysis of sequences and expression of microRNAs from multiple taxa. mESAdb is modular and has a user interface implemented in PHP and JavaScript and coupled with statistical analysis and visualization packages written for the R language. The database primarily comprises mature microRNA sequences and their target data, along with selected human, mouse a...
Fuzzy data representation and querying in XML database
Ustunkaya, Ekin; Yazıcı, Adnan; George, Roy (2007-02-01)
Real-world information, including subjective opinions and judgments, requires imprecise data to be modeled for representation and querying in databases. The Extensible Markup Language (XML) has become a de facto standard for data modeling and exchange in recent years. Efforts on modeling imprecision and representing such data in XML have not been fully developed. In this paper, an XML-based fuzzy data representation and querying system is presented. Complex and imprecise data are represented using a fuzzy extensi...
An attempt to classify Turkish district data : K-Means and Self-Organizing Map (SOM) algorithms
Aksoy, Ece; Işık, Oğuz; Department of Geodetic and Geographical Information Technologies (2004)
There is no universally applicable clustering technique for discovering the variety of structures displayed in data sets. Also, a single algorithm or approach is not adequate to solve every clustering problem. There are many methods available, the criteria used differ, and hence different classifications may be obtained for the same data. While larger and larger amounts of data are collected and stored in databases, there is an increasing need for efficient and effective analysis methods. Grouping or classific...
Image Annotation With Semi-Supervised Clustering
Sayar, Ahmet; Yarman Vural, Fatoş Tunay (2009-09-16)
Methods developed for image annotation usually make use of region clustering algorithms. Visual codebooks are generated from the region clusters of low-level features. These codebooks are then matched with the words of the text document related to the image, in various ways. In this paper, we supervise the clustering process by using three types of side information. The first one is the topic probability information obtained from the text document associated with the image. The second is the orientation an...
Joint linear complexity of multisequences consisting of linear recurring sequences
Fu, Fang-Wei; Niederreiter, Harald; Özbudak, Ferruh (Springer Science and Business Media LLC, 2009-04-01)
The linear complexity of sequences is one of the important security measures for stream cipher systems. Recently, in the study of vectorized stream cipher systems, the joint linear complexity of multisequences has been investigated. In this paper, we study the joint linear complexity of multisequences consisting of linear recurring sequences. The expectation and variance of the joint linear complexity of random multisequences consisting of linear recurring sequences are determined. These results extend the ...
Citation Formats
E. Eser, T. Can, and H. Ferhatosmanoglu, “Div-BLAST: Diversification of Sequence Search Results,” PLoS ONE, 2014, Accessed: 00, 2020. [Online]. Available: https://hdl.handle.net/11511/28647.