Coevolution based prediction of protein-protein interactions with reduced training data

Date

2009

Author

Pamuk, Bahar

Protein-protein interactions are important for predicting protein functions, since two interacting proteins usually have similar functions in a cell. Available protein interaction networks are incomplete, but they can be used to predict new interactions in a supervised learning framework. However, when the known protein network includes a large number of protein pairs, the training time of the machine learning algorithm becomes quite long. In this thesis, our aim is to predict protein-protein interactions using a known portion of the interaction network. We used Support Vector Machines (SVM) as the machine learning algorithm, trained on the already known protein pairs in the network. We chose phylogenetic profiles of proteins to form the feature vectors required by the learner, since the evolutionary similarity of two proteins gives a reasonable indication of whether they interact. Because SVM training time becomes quite long for large data sets, we reduced the data size in a sensible way while keeping approximately the same prediction accuracy. We applied a number of clustering techniques to extract the most representative data and features in a two-category framework. Since the training data set is a two-dimensional matrix, we applied data reduction methods in both dimensions, i.e., in both data size and feature vector size. We observed that data clustered by the k-means technique gave superior prediction accuracy compared to another data clustering algorithm that was also developed for reducing data size for SVM training. Still, the true positive and false positive rates (TPR-FPR) of the training data sets constructed by the two clustering methods did not clearly indicate which method outperforms the other.
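The data-size reduction described above can be illustrated with a minimal sketch. It assumes each protein pair is already encoded as a numeric feature vector with a 0/1 interaction label, and that clustering is run separately within each class so that centroids replace the original training rows. The function names (`kmeans`, `reduce_training_set`), the per-class strategy, and all parameters are hypothetical choices for illustration; the abstract does not specify how the thesis implemented this step.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means on a list of vectors; returns k centroids."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct training rows.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # an emptied cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def reduce_training_set(vectors, labels, k_per_class):
    """Replace each class's rows by at most k_per_class k-means centroids.

    Assumption: clustering per class keeps interacting and non-interacting
    pairs from being merged into one representative.
    """
    reduced_vectors, reduced_labels = [], []
    for label in sorted(set(labels)):
        members = [v for v, y in zip(vectors, labels) if y == label]
        k = min(k_per_class, len(members))
        for centroid in kmeans(members, k):
            reduced_vectors.append(centroid)
            reduced_labels.append(label)
    return reduced_vectors, reduced_labels
```

The reduced vectors and labels would then be passed to an SVM trainer in place of the full training set; the trade-off between `k_per_class` and prediction accuracy is the quantity the abstract reports on.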
In addition, we applied feature selection methods to the feature vectors of the training data, selecting the features that are most representative in a biological and in a statistical sense. We used the phylogenetic tree of organisms to identify the organisms that are evolutionarily significant. Additionally, we applied Fisher’s test method to select the features that are most representative statistically. The accuracy and TPR-FPR values obtained by the feature selection methods did not allow a definite conclusion in the performance comparison. However, the phylogenetic tree method resulted in acceptable prediction values when compared to Fisher’s test.
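As a rough illustration of statistical feature selection, the sketch below ranks features by the Fisher score (squared difference of class means divided by the sum of class variances). This is only one possible reading of "Fisher’s test method": the abstract does not say whether the thesis used the Fisher score, Fisher's exact test, or another variant, so the choice here is an assumption. The function names and the 0/1 label convention are likewise hypothetical.

```python
from statistics import mean, pvariance


def fisher_scores(vectors, labels):
    """Fisher score per feature: (mean_pos - mean_neg)^2 / (var_pos + var_neg).

    Assumes labels are 1 (interacting) and 0 (non-interacting) and that
    both classes are present.
    """
    pos = [v for v, y in zip(vectors, labels) if y == 1]
    neg = [v for v, y in zip(vectors, labels) if y == 0]
    scores = []
    for j in range(len(vectors[0])):
        pos_col = [v[j] for v in pos]
        neg_col = [v[j] for v in neg]
        numerator = (mean(pos_col) - mean(neg_col)) ** 2
        denominator = pvariance(pos_col) + pvariance(neg_col)
        if denominator > 0:
            scores.append(numerator / denominator)
        elif numerator > 0:
            # Zero within-class variance but different means: perfectly separating.
            scores.append(float("inf"))
        else:
            scores.append(0.0)
    return scores


def select_top_features(vectors, labels, n_features):
    """Return the sorted column indices of the n_features highest-scoring features."""
    scores = fisher_scores(vectors, labels)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_features])
```

With phylogenetic profiles, each column corresponds to an organism, so the selected indices name the organisms whose presence/absence pattern best separates the two classes under this score.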

Subject Keywords

Computer engineering.

URI

http://etd.lib.metu.edu.tr/upload/3/12610389/index.pdf
https://hdl.handle.net/11511/18451

Collections

Graduate School of Natural and Applied Sciences, Thesis

Suggestions

Prediction of protein-protein interactions from sequence using evolutionary relations of proteins and species Güney, Tacettin Doğacan; Can, Tolga; Department of Computer Engineering (2009) Prediction of protein-protein interactions is an important part of understanding the biological processes in a living cell. There are completely sequenced organisms that do not yet have experimentally verified protein-protein interaction networks. For such organisms, we cannot generally use a supervised method, where a portion of the protein-protein interaction network is used as the training set. Furthermore, for newly-sequenced organisms, many other data sources, such as gene expression data and gene ontolog...
Using semantic web services for data integration in banking domain Okat, Çağlar; Doğru, Ali Hikmet; Department of Computer Engineering (2010) A semantic model oriented transformation mechanism is developed for the centralization of intra-enterprise data integration. Such a mechanism is especially crucial in the banking domain which is selected in this study. A new domain ontology is constructed to provide basis for annotations. A bottom-up approach is preferred for semantic annotations to utilize existing web service definitions. Transformations between syntactic web service XML responses and semantic model concepts are defined in transformation ...
A clustering method for the problem of protein subcellular localization Bezek, Perit; Atalay, Mehmet Volkan; Department of Computer Engineering (2006) In this study, the focus is on predicting the subcellular localization of a protein, since subcellular localization is helpful in understanding a protein’s functions. Function of a protein may be estimated from its sequence. Motifs or conserved subsequences are strong indicators of function. In a given sample set of protein sequences known to perform the same function, a certain subsequence or group of subsequences should be common; that is, occurrence (frequency) of common subsequences should be high. Our ...
Prediction of enzyme classes in a hierarchical approach by using spmap Yaman, Ayşe Gül; Atalay, Mehmet Volkan; Department of Computer Engineering (2009) Enzymes are proteins that play an important role in biochemical reactions as catalysts. They are classified, based on the reaction they catalyze, in a hierarchical scheme by the International Enzyme Commission (EC). This hierarchical scheme is expressed as a four-level tree structure and a unique number is assigned to each enzyme class. There are six major classes at the top level according to the reaction they carry out, and the sub-classes at the lower levels are more specific reactions of these classes. The ...
Modeling of various biological networks via LCMARS AYYILDIZ DEMİRCİ, EZGİ; Purutçuoğlu Gazi, Vilda (Elsevier BV, 2018-09-01) In systems biology, the interactions between components such as genes and proteins can be represented by a network. To understand the molecular mechanism of complex biological systems, construction of their networks plays a crucial role. However, estimation of these biological networks is a challenging problem because of their high dimensional and sparse structures. Several statistical methods are proposed to overcome this issue. The Conic Multivariate Adaptive Regression Splines (CMARS) is one of the recent n...

Citation Formats

B. Pamuk, “Coevolution based prediction of protein-protein interactions with reduced training data,” M.S. - Master of Science, Middle East Technical University, 2009.