Multi-view subcellular localization prediction of human proteins

2019
Özsarı, Gökhan
Determining the subcellular localization of proteins is crucial for understanding protein function and for drug targeting, systems biology, and proteomics research. Experimental validation of subcellular localization is an expensive and challenging process. Several computational methods exist for automated prediction of protein subcellular localization; however, there is still room for better performance. Here, we propose a multi-view SVM-based approach that provides predictions for human proteins. We represent each protein sequence by multi-view features, i.e., physicochemical properties, amino acid compositions, and homology-based features. Our classification model contains seven classifiers for each localization, where each classifier provides a probabilistic result. To develop a multi-view voting classifier, we employ a weighted classifier combination method that assigns different weights to classifiers based on their discriminative strengths. We evaluated the described method on previously used datasets, as well as on our in-house dataset, called the Trust dataset. The Trust dataset was created using a new subcellular localization hierarchy that merges the UniProt Subcellular Location hierarchy and the GO Cellular Component hierarchy, applied only to manual experimental annotations in UniProtKB. We compared our results with five state-of-the-art methods: SubCons, LocTree2, CELLO2.5, MultiLoc2, and DeepLoc. Our approach outperformed the others, with overall Matthews correlation coefficient (MCC) scores of 59%, 61%, and 68% on the Trust, Golden (SubCons benchmark dataset), and Golden-Trust (refined Golden dataset) datasets, respectively, whereas the MCC scores of SubCons were 43%, 53%, and 56%.
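The weighted classifier combination described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not state how the weights are derived, so this sketch assumes each classifier's weight is its MCC on a held-out validation set (clipped at zero), and it uses three hypothetical per-view classifiers for a single localization rather than seven. All names and numbers below are invented for illustration.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def weighted_vote(probs, weights):
    """Weighted average of per-classifier probabilities for one protein."""
    total = sum(weights[name] for name in probs)
    if total == 0:  # no classifier is informative: fall back to a plain mean
        return sum(probs.values()) / len(probs)
    return sum(weights[name] * p for name, p in probs.items()) / total

# Hypothetical validation data for one localization (1 = protein is located there).
val_labels = [1, 1, 0, 0, 1, 0]
val_probs = {
    "physicochemical": [0.90, 0.60, 0.40, 0.20, 0.70, 0.55],
    "composition":     [0.80, 0.40, 0.30, 0.60, 0.90, 0.10],
    "homology":        [0.95, 0.90, 0.10, 0.20, 0.80, 0.30],
}

# Assumed weighting scheme: validation MCC at a 0.5 threshold, clipped at zero.
weights = {
    name: max(mcc(val_labels, [1 if p >= 0.5 else 0 for p in probs]), 0.0)
    for name, probs in val_probs.items()
}

# Probabilistic outputs of the three classifiers for a new, unseen protein.
new_protein = {"physicochemical": 0.40, "composition": 0.30, "homology": 0.90}
score = weighted_vote(new_protein, weights)
prediction = score >= 0.5

print(weights)
print(round(score, 3), prediction)
```

In this toy example the homology-based classifier separates the validation set best, so it dominates the vote; the combined score is higher than the unweighted mean of the three probabilities, which is the intended effect of weighting classifiers by discriminative strength.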

Citation Formats
G. Özsarı, “Multi-view subcellular localization prediction of human proteins,” Thesis (M.S.) -- Graduate School of Natural and Applied Sciences. Computer Engineering., Middle East Technical University, 2019.