Prediction of the effects of single amino acid variations on protein functionality with structural and annotation centric modeling

Date

2020

Author

Cankara, Fatma

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Whole-genome and exome sequencing studies have indicated that genomic variations may have deleterious effects on protein functionality via various mechanisms. Single nucleotide variations that alter the protein sequence, and thus the structure and the function, namely non-synonymous SNPs (nsSNPs), are associated with many genetic diseases in humans. The current rate of manual annotation of reported nsSNPs cannot keep up with the rate at which new sequencing data are produced. To aid this process, automated computational approaches are being developed and applied to unannotated data. In this study, we propose a new methodology to collect and organize information related to the effects of nsSNPs at the amino acid sequence level from various biological databases, and to use this information in a supervised machine learning system to predict the function-disrupting capacity of mutations with unknown consequences. For this, 157,138 annotated mutation data points (89,363 deleterious and 67,775 neutral) were collected from multiple resources such as UniProt, ClinVar and Protein Mutant Database. For each mutation data point, a feature vector was constructed using protein 3-D structure information and site-specific feature annotations in the UniProt database. Information about the spatial proximity of the reported mutations to these protein features was also incorporated into the feature vector. The system was trained on these feature vectors and their respective labels in a supervised fashion using random forest, with the ultimate aim of constructing a model that classifies unknown mutations as either deleterious or neutral. The prediction model was evaluated in detail to observe the contribution of different feature types to prediction success. The finalized model displayed a satisfactory performance (AUROC: 0.86, precision: 0.77, recall: 0.90, accuracy: 0.78, F1-score: 0.83 and MCC: 0.54) on the independent test dataset.
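
For readers less familiar with the reported metrics, the following is a minimal sketch of how precision, recall, accuracy, F1-score and MCC are derived from confusion-matrix counts. It is not taken from the thesis: the function names and the example counts are hypothetical, and only the reported precision (0.77) and recall (0.90) come from the abstract. The last line checks that those two values are consistent with the reported F1-score.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute binary-classification metrics from confusion-matrix counts.

    Here "positive" is assumed to mean "deleterious" (an assumption for
    illustration; the abstract does not state which class is positive).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, used only to exercise the function.
print(classification_metrics(tp=8, fp=2, tn=6, fn=4))

# Consistency check on the values reported in the abstract.
print(round(f1_from_precision_recall(0.77, 0.90), 2))  # 0.83
```
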
In addition, the performance of the proposed model was compared to that of widely used variant effect predictors in the literature on standard benchmark datasets. As future work, we plan to conduct a case study on interesting prediction examples and to validate our results using literature-based information. Finally, we plan to build a ready-to-use command-line variant effect prediction tool and to share it with the research community through an open access data repository. We believe that this system will be complementary to the well-known methods in the literature and that its incorporation into ensemble-based tools will improve the state of the art in variant effect prediction.
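
The abstract states that the spatial proximity of each mutation to annotated protein features is part of the feature vector, but it does not say how that proximity is encoded. The sketch below shows one plausible encoding, under stated assumptions: proximity is taken as the minimum Euclidean distance between C-alpha atoms, annotation types with no measurable distance receive a sentinel value, and all names (`min_distance_to_annotations`, `build_feature_vector`, the annotation labels, the coordinates) are hypothetical.

```python
import math


def min_distance_to_annotations(mutation_pos, annotated_positions, ca_coords):
    """Minimum C-alpha distance from the mutated residue to annotated residues.

    ca_coords maps residue position -> (x, y, z). Returns None when the
    mutated residue or every annotated residue lacks coordinates.
    """
    if mutation_pos not in ca_coords:
        return None
    origin = ca_coords[mutation_pos]
    distances = [
        math.dist(origin, ca_coords[pos])
        for pos in annotated_positions
        if pos in ca_coords
    ]
    return min(distances) if distances else None


def build_feature_vector(mutation_pos, annotations, ca_coords, missing=-1.0):
    """One distance per annotation type, in sorted name order.

    The sentinel `missing` is an assumption of this sketch, not a value
    taken from the thesis.
    """
    vector = []
    for name in sorted(annotations):
        d = min_distance_to_annotations(mutation_pos, annotations[name], ca_coords)
        vector.append(missing if d is None else d)
    return vector


# Toy structure: three residues with made-up coordinates.
ca_coords = {10: (0.0, 0.0, 0.0), 12: (3.0, 4.0, 0.0), 50: (0.0, 0.0, 10.0)}
annotations = {
    "active_site": [12],
    "binding_site": [50, 99],   # residue 99 has no coordinates and is skipped
    "disulfide_bond": [200],    # no coordinates at all -> sentinel
}
print(build_feature_vector(10, annotations, ca_coords))  # [5.0, 10.0, -1.0]
```

Vectors of this kind, paired with deleterious/neutral labels, are the sort of input a random forest classifier would be trained on.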

Subject Keywords

Single amino acid variations, variant effect prediction, protein sequence annotations, machine learning, random forest.

URI

http://etd.lib.metu.edu.tr/upload/12625250/index.pdf
https://hdl.handle.net/11511/45389

Collections

Graduate School of Informatics, Thesis

Suggestions

Prediction of protein subcellular localization based on primary sequence data Ozarar, M; Atalay, Mehmet Volkan; Atalay, Rengül (2003-01-01) This paper describes a system called prediction of protein subcellular localization (P2SL) that predicts the subcellular localization of proteins in eukaryotic organisms based on the amino acid content of primary sequences using amino acid order. Our approach for prediction is to find the most frequent motifs for each protein (class) based on clustering and then to use these most frequent motifs as features for classification. This approach allows a classification independent of the length of the sequence. ...
Prediction of Protein Interactions by Structural Matching: Prediction of PPI Networks and the Effects of Mutations on PPIs that Combines Sequence and Structural Information Tunçbağ, Nurcan; Nussinov, Ruth; Gursoy, Attila (Humana Press Inc., 2017) Structural details of protein interactions are invaluable to the understanding of cellular processes. However, the identification of interactions at atomic resolution is a continuing challenge in the systems biology era. Although the number of structurally resolved complexes in the Protein Databank increases exponentially, the complexes only cover a small portion of the known structural interactome. In this chapter, we review the PRISM system that is a protein–protein interaction (PPI) prediction tool—its r...
Analysis of protein-protein interaction networks using random walks Can, Tolga; Singh, Ambuj K. (2005-08-21) Genome wide protein networks have become reality in recent years due to high throughput methods for detecting protein interactions. Recent studies show that a networked representation of proteins provides a more accurate model of biological systems and processes compared to conventional pair-wise analyses. Complementary to the availability of protein networks, various graph analysis techniques have been proposed to mine these networks for pathway discovery, function assignment, and prediction of complex mem...
Predicting the effect of hydrophobicity surface on binding affinity of PCP-like compounds using machine learning methods Yoldaş, Mine; Alpaslan, Ferda Nur; Büyükbingöl, Erdem; Department of Computer Engineering (2011) This study aims to predict the binding affinity of the PCP-like compounds by means of molecular hydrophobicity. Molecular hydrophobicity is an important property which affects the binding affinity of molecules. The values of molecular hydrophobicity of molecules are obtained on three-dimensional coordinate system. Our aim is to reduce the number of points on the hydrophobicity surface of the molecules. This is modeled by using self organizing maps (SOM) and k-means clustering. The feature sets obtained fro...
Single Mutation in Shine-Dalgarno-Like Sequence Present in the Amino Terminal of Lactate Dehydrogenase of Plasmodium Effects the Production of an Eukaryotic Protein Expressed in a Prokaryotic System Çiçek, Mustafa; MUTLU, ÖZAL; ERDEMİR ÜSTÜNDAĞ, Ayşegül; Ozkan, Ebru; Saricay, Yunus; BALIK, DİLEK (2013-06-01) One of the most important steps in structure-based drug design studies is obtaining the protein in active form after cloning the target gene. In one of our previous studies, it was determined that an internal Shine-Dalgarno-like sequence present just before the third methionine at the N-terminus of the wild type lactate dehydrogenase enzyme of Plasmodium falciparum prevents the translation of the full length protein. Inspection of the same region in P. vivax LDH, which was overproduced as an active enzyme, indicated that t...

Citation Formats

F. Cankara, “Prediction of the effects of single amino acid variations on protein functionality with structural and annotation centric modeling,” Thesis (M.S.) -- Graduate School of Informatics. Bioinformatics., Middle East Technical University, 2020.