Computational representation of protein sequences for homology detection and classification

Date

2006

Author

Oğul, Hasan

Machine learning techniques have been widely used for classification problems in computational biology. They require the input to be a collection of fixed-length feature vectors. Since proteins vary in length, a means of representing protein sequences by a fixed number of features is needed. This thesis introduces three novel methods for this purpose: n-peptide compositions with reduced alphabets, pairwise similarity scores by maximal unique matches, and pairwise similarity scores by probabilistic suffix trees. The new sequence representations described in the thesis are applied, with some problem-specific modifications, to three challenging problems in computational biology: remote homology detection, subcellular localization prediction, and solvent accessibility prediction. Rigorous experiments are conducted on common benchmarking datasets, and for each problem the new methods are compared with the existing ones. In remote homology detection tests, all three methods achieve accuracies competitive with those of state-of-the-art methods while being much more efficient. A combination of the new representations is used to devise a hybrid system, called PredLOC, for predicting the subcellular localization of proteins, and it is tested on two distinct eukaryotic datasets. To the best of the author's knowledge, the accuracy achieved by PredLOC is the highest ever reported on those datasets. The maximal unique match method yields only a slight improvement in solvent accessibility predictions.
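To illustrate the first idea named in the abstract, the sketch below maps a protein sequence of any length to a fixed-length vector of n-peptide frequencies over a reduced alphabet. It is a minimal illustration, not the thesis's implementation: the four-group alphabet in `GROUPS`, the function name `npeptide_composition`, and the choice to normalize counts to frequencies are all assumptions made here for the example.

```python
from collections import Counter
from itertools import product

# Hypothetical 4-group reduction of the 20 standard amino acids.
# This grouping is an assumption for illustration, not the thesis's alphabet.
GROUPS = {
    "h": "AVLIMFWC",  # hydrophobic
    "p": "STNQYGP",   # polar or small
    "+": "KRH",       # positively charged
    "-": "DE",        # negatively charged
}
# Lookup table: amino acid letter -> reduced-alphabet symbol
REDUCE = {aa: g for g, members in GROUPS.items() for aa in members}


def npeptide_composition(seq, n=2):
    """Return a vector of length len(GROUPS)**n with reduced n-peptide frequencies."""
    # Map residues to group symbols, skipping non-standard letters (e.g. X)
    reduced = [REDUCE[aa] for aa in seq.upper() if aa in REDUCE]
    # Fixed, sorted ordering of all possible reduced n-peptides
    keys = ["".join(p) for p in product(sorted(GROUPS), repeat=n)]
    # Count overlapping n-peptides in the reduced sequence
    counts = Counter(
        "".join(reduced[i:i + n]) for i in range(len(reduced) - n + 1)
    )
    total = max(sum(counts.values()), 1)  # avoid division by zero
    return [counts[k] / total for k in keys]


short_vec = npeptide_composition("MKV", n=2)
long_vec = npeptide_composition("MKVLAAGIVDERSTNQYGPKRHFWC", n=2)
print(len(short_vec), len(long_vec))  # both vectors have the same length
```

With a 4-symbol alphabet and n = 2 the vector has 4^2 = 16 entries regardless of sequence length, whereas the full 20-letter alphabet would give 400; this reduction in dimensionality is what makes larger n practical.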

Subject Keywords

Computer science.

URI

http://etd.lib.metu.edu.tr/upload/12606997/index.pdf
https://hdl.handle.net/11511/15857

Collections

Graduate School of Informatics, Thesis

Citation Formats

H. Oğul, “Computational representation of protein sequences for homology detection and classification,” Ph.D. - Doctoral Program, Middle East Technical University, 2006.