A classification system for the problem of protein subcellular localization

Alay, Gökçen
This study focuses on predicting the subcellular localization of a protein. Subcellular localization information is important for protein function annotation, which is a fundamental problem in computational biology. For this problem, a classification system is built that has two main parts: a predictor based on a feature mapping technique that extracts biologically meaningful information from protein sequences, and a client/server architecture for searching and predicting subcellular localizations. In the first part of the thesis, we describe a feature mapping technique based on frequent patterns. Frequent patterns in a protein sequence dataset are identified using a search technique based on the Apriori property, and the distribution of these patterns over a new sample is used as a feature vector for classification. The effect of a number of feature selection methods on classification performance is investigated, and the best one is applied. The method is assessed on the subcellular localization prediction problem with four compartments (endoplasmic reticulum (ER) targeted, cytosolic, mitochondrial, and nuclear), using the same dataset as P2SL. Our method improves the overall accuracy to 91.71%, compared with 81.96% for P2SL. In the second part of the thesis, a client/server architecture based on Simple Object Access Protocol (SOAP) technology is designed and implemented; it provides a user-friendly interface for accessing the protein subcellular localization predictions. The client is a Cytoscape plug-in used for functional enrichment of biological networks. Rather than using subcellular localization information for individual proteins, this plug-in lets biologists analyze a set of genes/proteins from a system-level view.
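The frequent-pattern feature mapping described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes that patterns are contiguous subsequences, that support is the fraction of sequences containing a pattern, and that the "distribution" is the normalized count of each pattern in the new sample. The function names, the toy sequences, and the thresholds are hypothetical.

```python
from collections import Counter


def frequent_patterns(sequences, min_support, max_len):
    """Level-wise search for contiguous substrings that occur in at least
    a min_support fraction of the sequences (assumed pattern definition)."""
    min_count = min_support * len(sequences)
    # Level 1: residues that appear in enough sequences.
    counts = Counter(ch for seq in sequences for ch in set(seq))
    current = {p for p, c in counts.items() if c >= min_count}
    alphabet = sorted(current)
    frequent = []
    k = 1
    while current:
        frequent.extend(sorted(current))
        if k == max_len:
            break
        # Apriori pruning: a candidate of length k+1 is kept only if both
        # its length-k prefix and its length-k suffix are frequent.
        candidates = {p + a for p in current for a in alphabet
                      if (p + a)[1:] in current}
        counts = Counter()
        for seq in sequences:
            present = {seq[i:i + k + 1] for i in range(len(seq) - k)}
            counts.update(present & candidates)
        current = {p for p in candidates if counts[p] >= min_count}
        k += 1
    return frequent


def feature_vector(sequence, patterns):
    """Normalized occurrence counts of each pattern in a new sequence."""
    counts = [
        sum(1 for i in range(len(sequence) - len(p) + 1)
            if sequence.startswith(p, i))
        for p in patterns
    ]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


# Toy data, purely illustrative.
train = ["MKTLLA", "MKTAAL", "GGMKTL"]
patterns = frequent_patterns(train, min_support=1.0, max_len=3)
vector = feature_vector("MKTMK", patterns)
```

The resulting vectors have one dimension per frequent pattern, so any standard classifier can be trained on them. The feature selection step mentioned in the abstract would then reduce this set of dimensions.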


A clustering method for the problem of protein subcellular localization
Bezek, Perit; Atalay, Mehmet Volkan; Department of Computer Engineering (2006)
This study focuses on predicting the subcellular localization of a protein, since subcellular localization is helpful in understanding a protein’s functions. The function of a protein may be estimated from its sequence. Motifs or conserved subsequences are strong indicators of function. In a given sample set of protein sequences known to perform the same function, a certain subsequence or group of subsequences should be common; that is, the occurrence (frequency) of common subsequences should be high. Our ...
A systematic study of probabilistic aggregation strategies in swarm robotic systems
Soysal, Onur; Şahin, Erol; Department of Computer Engineering (2005)
In this study, a systematic analysis of probabilistic aggregation strategies in swarm robotic systems is presented. A generic aggregation behavior is proposed as a combination of four basic behaviors: obstacle avoidance, approach, repel, and wait. The latter three basic behaviors are combined using a three-state finite state machine with two probabilistic transitions among them. Two different metrics were used to compare performance of strategies. Through systematic experiments, how the aggregation performa...
Subsequence feature maps for protein function annotation
Saraç, Ömer Sinan; Atalay, Mehmet Volkan; Department of Computer Engineering (2008)
With the advances in sequencing technologies, the number of protein sequences with unknown function increases rapidly. Hence, computational methods for functional annotation of these protein sequences become of the utmost importance. In this thesis, we first defined a feature space mapping of protein primary sequences to fixed dimensional numerical vectors. This mapping, which is called the Subsequence Profile Map (SPMap), takes into account the models of the subsequences of protein sequences. The resulting...
An integrative approach to structured SNP prioritization and representative SNP selection for genome-wide association studies
Üstünkar, Gürkan; Aydın Son, Yeşim; Weber, Gerhard Wilhelm; Department of Information Systems (2011)
Single Nucleotide Polymorphisms (SNPs) are the most frequent genomic variations and the main basis for genetic differences among individuals and many diseases. As genotyping millions of SNPs at once is now possible with the microarrays and advanced sequencing technologies, SNPs are becoming more popular as genomic biomarkers. Like other high-throughput research techniques, genome wide association studies (GWAS) of SNPs usually hit a bottleneck after statistical analysis of significantly associated SNPs, as ...
Special index and retrieval mechanism for ontology based medical domain search engines
Kubilay, Mustafa; Baykal, Nazife; Department of Information Systems (2005)
This thesis focuses on the indexing and retrieval mechanisms of an ontology-based medical domain search engine. First, indexing techniques and retrieval methods are reviewed. Then, special indexing and retrieval mechanisms are introduced. The thesis also specifies the functional requirements of these mechanisms. Finally, an evaluation is given that indicates the positive and negative aspects of the mechanisms.
Citation Formats
G. Alay, “A classification system for the problem of protein subcellular localization,” M.S. - Master of Science, Middle East Technical University, 2007.