An Extension to GOPred to annotate swiss-prot and trembl sequences for all gene ontology categories and EC numbers

2015
Rifaioğlu, Ahmet Süreyya
Traditional protein function annotation methods cannot keep up with the annotation of proteins, as the number of proteins with known sequences is increasing exponentially. For this reason, protein function prediction has become an important research area. In this thesis, the GOPred method is used, with improvements, for the protein function prediction problem. GOPred consists of the SPMap, Blast-kNN and Pepstats methods, which are subsequence-, similarity- and feature-based methods, respectively. The previous version of GOPred was used for the functional classification of proteins based on 300 molecular function Gene Ontology (GO) terms. In this study, the improved system is trained for 514 molecular function, 2909 biological process and 438 cellular component GO terms. The system is also applied to the functional prediction of enzymes based on 851 Enzyme Commission (EC) numbers. Hierarchical evaluation of predictions is proposed to give reliable predictions for EC numbers. In addition, we use a new method to calculate an optimal decision threshold for each functional term, and only predictions whose scores exceed the optimal threshold of the corresponding term are reported. The performance of each functional term is measured separately, and the averages of these performances are calculated to evaluate the system. GO term prediction results show that the performance of our system is better for the prediction of multi-functional proteins. To the best of our knowledge, this is the best performance achieved for EC number prediction in the literature. The improved system is tested on about 58 million TrEMBL proteins to compare its predictions with those of the reference systems that provide annotations for the TrEMBL database, namely EMBL, HAMAP, PDB, PIR, PIRNR and RuleBase. The results show that most of the predictions given by our system are consistent with the predictions given by the other systems.
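The per-term thresholding described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which criterion the thesis optimizes, so the use of F1 here is an assumption, as are the function names, the toy scores, and the choice of `>=` for the cut-off. The GO identifiers serve only as example dictionary keys.

```python
from typing import Dict, List, Sequence


def f1_at_threshold(scores: Sequence[float], labels: Sequence[bool], threshold: float) -> float:
    """F1 of the rule 'predict the term when score >= threshold' (assumed criterion)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Try every observed score as a cut-off and keep the one with the highest F1.

    Ties are broken in favour of the higher (stricter) threshold.
    """
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: (f1_at_threshold(scores, labels, t), t))


def predict(term_scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the terms whose score reaches that term's own threshold."""
    return sorted(
        term for term, score in term_scores.items()
        if term in thresholds and score >= thresholds[term]
    )


# Toy validation data (invented for illustration): per term, scores and true labels.
validation = {
    "GO:0003824": ([0.9, 0.8, 0.4, 0.3], [True, True, False, False]),
    "GO:0005515": ([0.7, 0.6, 0.5, 0.2], [True, False, True, False]),
}

# One threshold per functional term, fitted independently.
thresholds = {term: best_threshold(s, y) for term, (s, y) in validation.items()}

# Scores of a hypothetical query protein; only terms passing their threshold are kept.
query_scores = {"GO:0003824": 0.85, "GO:0005515": 0.45}
predicted_terms = predict(query_scores, thresholds)
```

Fitting one threshold per term, rather than a single global cut-off, lets terms with different score distributions each use the cut-off that suits them.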

Suggestions

Automated Negative Gene Ontology Based Functional Predictions for Proteins with UniGOPred
Doğan, Tunca; Rifaioğlu, Ahmet Süreyya; Saidi, Rabi; Martin, Maria Jesus; Atalay, Mehmet Volkan; Atalay, Rengül (2018-07-07)
Functional annotation of biomolecules in the gene and protein databases is mostly incomplete. This is especially valid for multi-domain proteins. There is a grey area in the protein function data resources, where the truly negative functions and the ones possessed by the protein but have not been discovered or documented yet (i.e. false negatives), reside together. In many cases the information about the functions absent from the target biomolecule can be as important as the assigned functions. It’s possibl...
Large scale prediction of protein domain functions using shared annotations
Ulusoy, Erva; Doğan, Tunca (Orta Doğu Teknik Üniversitesi Enformatik Enstitüsü; 2022-10)
Discovering the unknown functions of proteins is a major step toward understanding how biological processes work. The expensive and time-consuming nature of wet-lab experimental approaches prompted researchers to develop computational strategies for biomolecular function identification. The main idea behind these approaches, let it be network analysis- or machine learning-based, is that annotations can be transferred among proteins sharing similar characteristics (e.g., sequence, structure, protein-protein ...
A clustering method for the problem of protein subcellular localization
Bezek, Perit; Atalay, Mehmet Volkan; Department of Computer Engineering (2006)
In this study, the focus is on predicting the subcellular localization of a protein, since subcellular localization is helpful in understanding a protein’s functions. Function of a protein may be estimated from its sequence. Motifs or conserved subsequences are strong indicators of function. In a given sample set of protein sequences known to perform the same function, a certain subsequence or group of subsequences should be common; that is, occurrence (frequency) of common subsequences should be high. Our ...
A discriminative method for remote homology detection based on n-peptide compositions with reduced amino acid alphabets
OĞUL, Hasan; Mumcuoğlu, Ünal Erkan (2007-01-01)
In this study, n-peptide compositions are utilized for protein vectorization over a discriminative remote homology detection framework based on support vector machines (SVMs). The size of the amino acid alphabet is gradually reduced for increasing values of n so that the method fits within the memory resources of conventional workstations. A hash structure is implemented for accelerated search of n-peptides. The method is tested for its ability to classify proteins into families on a subset of SCOP famil...
Partially Observable Gene Regulatory Network Control Without a Boundary on Horizon
Erdogdu, Utku; Polat, Faruk; Alhajj, Reda (2012-11-09)
Gene regulatory networks (GRNs) govern the protein transcription process in the cell and interactions among genes play a vital role in determining the biosynthesis rate of proteins. By using intervention techniques discovered by biological research it is possible to control a GRN, thus promoting or demoting the expression rate of a certain gene. In this work, this control task is studied in a partially observable setting where interventions lack perfect knowledge of the expression level of all genes. Moreov...
Citation Formats
A. S. Rifaioğlu, “An Extension to GOPred to annotate swiss-prot and trembl sequences for all gene ontology categories and EC numbers,” M.S. - Master of Science, Middle East Technical University, 2015.