An Extension to GOPred to annotate swiss-prot and trembl sequences for all gene ontology categories and EC numbers

2015
Rifaioğlu, Ahmet Süreyya
Traditional protein function annotation methods cannot keep pace as the number of proteins with known sequences increases exponentially. For this reason, protein function prediction has become an important research area. In this thesis, the GOPred method is used, with improvements, for the protein function prediction problem. GOPred consists of the SPMap, Blast-kNN and Pepstats methods, which are subsequence-, similarity- and feature-based methods, respectively. The previous version of GOPred was used for functional classification of proteins based on 300 molecular function Gene Ontology (GO) terms. In this study, the improved system is trained for 514 molecular function, 2909 biological process and 438 cellular component GO terms. The system is also applied to the functional prediction of enzymes based on 851 Enzyme Commission (EC) numbers. Hierarchical evaluation of predictions is proposed to give reliable predictions for EC numbers. In addition, we use a new method to calculate an optimal decision threshold for each functional term, and only predictions whose scores exceed the corresponding threshold are presented. The performance of each functional term is measured separately, and these performances are averaged to evaluate the system. GO term prediction results show that the performance of our system is better for the prediction of multi-functional proteins. To the best of our knowledge, the system's EC number prediction performance is the best reported in the literature. The improved system is tested on about 58 million TrEMBL proteins to compare its predictions with those of the reference systems that provide annotations for the TrEMBL database: EMBL, HAMAP, PDB, PIR, PIRNR and RuleBase. Results show that most of the predictions given by our system are consistent with those given by the other systems.
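The abstract does not say how the per-term thresholds are chosen or how the hierarchical evaluation of EC predictions works. The Python sketch below is therefore only an illustration under stated assumptions: it picks, for one functional term, the score threshold that maximizes F1 on labeled validation data, and it keeps an EC prediction only when every ancestor level of that EC number also passes its own threshold. The function names, the F1 criterion and the ancestor rule are all hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Sequence


def f1_at_threshold(scores: Sequence[float], labels: Sequence[bool], threshold: float) -> float:
    """F1 score when every score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def optimal_threshold(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Assumed criterion: the observed score that maximizes F1 (ties go to the lowest score)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


def ec_ancestors(ec: str) -> List[str]:
    """Ancestor levels of an EC number, e.g. 1.1.1.1 -> 1.-.-.-, 1.1.-.-, 1.1.1.-"""
    levels = [p for p in ec.split(".") if p != "-"]
    return [".".join(levels[:i] + ["-"] * (4 - i)) for i in range(1, len(levels))]


def hierarchical_predictions(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Assumed rule: keep an EC term only if it and all of its ancestors pass their thresholds."""
    passed = {term for term, s in scores.items() if s >= thresholds[term]}
    return sorted(t for t in passed if all(a in passed for a in ec_ancestors(t)))


if __name__ == "__main__":
    # Toy validation data for a single functional term (made up for illustration).
    print(optimal_threshold([0.1, 0.4, 0.35, 0.8], [False, False, True, True]))

    # Toy EC scores for one protein, with a uniform threshold of 0.5 (also made up).
    ec_scores = {"1.-.-.-": 0.9, "1.1.-.-": 0.8, "1.1.1.-": 0.2, "1.1.1.1": 0.95}
    print(hierarchical_predictions(ec_scores, {t: 0.5 for t in ec_scores}))
```

In the toy EC example, the four-digit term scores highly but is dropped because its third-level parent falls below the threshold. That is one plausible reading of how a hierarchical check could make EC predictions more reliable.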

Suggestions

Automated Negative Gene Ontology Based Functional Predictions for Proteins with UniGOPred
Doğan, Tunca; Rifaioğlu, Ahmet Süreyya; Saidi, Rabi; Martin, Maria Jesus; Atalay, Mehmet Volkan; Atalay, Rengül (2018-07-07)
Functional annotation of biomolecules in the gene and protein databases is mostly incomplete. This is especially valid for multi-domain proteins. There is a grey area in the protein function data resources, where the truly negative functions and the ones possessed by the protein but have not been discovered or documented yet (i.e. false negatives), reside together. In many cases the information about the functions absent from the target biomolecule can be as important as the assigned functions. It’s possibl...
A clustering method for the problem of protein subcellular localization
Bezek, Perit; Atalay, Mehmet Volkan; Department of Computer Engineering (2006)
In this study, the focus is on predicting the subcellular localization of a protein, since subcellular localization is helpful in understanding a protein’s functions. Function of a protein may be estimated from its sequence. Motifs or conserved subsequences are strong indicators of function. In a given sample set of protein sequences known to perform the same function, a certain subsequence or group of subsequences should be common; that is, occurrence (frequency) of common subsequences should be high. Our ...
Partially Observable Gene Regulatory Network Control Without a Boundary on Horizon
Erdogdu, Utku; Polat, Faruk; Alhajj, Reda (2012-11-09)
Gene regulatory networks (GRNs) govern the protein transcription process in the cell and interactions among genes play a vital role in determining the biosynthesis rate of proteins. By using intervention techniques discovered by biological research it is possible to control a GRN, thus promoting or demoting the expression rate of a certain gene. In this work, this control task is studied in a partially observable setting where interventions lack perfect knowledge of the expression level of all genes. Moreov...
Prediction of protein subcellular localization based on primary sequence data
Ozarar, M; Atalay, Mehmet Volkan; Atalay, Rengül (2003-01-01)
This paper describes a system called prediction of protein subcellular localization (P2SL) that predicts the subcellular localization of proteins in eukaryotic organisms based on the amino acid content of primary sequences using amino acid order. Our approach for prediction is to find the most frequent motifs for each protein (class) based on clustering and then to use these most frequent motifs as features for classification. This approach allows a classification independent of the length of the sequence. ...
Prediction of protein subcellular localization based on primary sequence data
Ozarar, M; Atalay, Mehmet Volkan; Atalay, Rengül (2004-04-30)
Subcellular localization is crucial for determining the functions of proteins. A system called prediction of protein subcellular localization (P2SL) that predicts the subcellular localization of proteins in eukaryotic organisms based on the amino acid content of primary sequences using amino acid order is designed. The approach for prediction is to find the most frequent motifs for each protein in a given class based on clustering via self organizing maps and then to use these most frequent motifs as featur...
Citation Formats
A. S. Rifaioğlu, “An Extension to GOPred to annotate swiss-prot and trembl sequences for all gene ontology categories and EC numbers,” M.S. - Master of Science, Middle East Technical University, 2015.