Enzyme prediction with word embedding approach

Date

2019

Author

Akın, Erkan

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Information such as molecular function, biological process, and cellular localization can be inferred from a protein sequence. However, protein sequences vary in length, so a sequence cannot be used directly as a feature vector for pattern recognition and machine learning algorithms, which require fixed-length feature vectors. We describe an approach based on the Word2vec model, more specifically the continuous skip-gram model, to generate a vector representation of a given protein sequence. In the Word2vec model, a protein sequence is treated as a document or a sentence, and its subsequences correspond to words. The continuous skip-gram model is a Word2vec model trained to predict the surrounding subsequences from a given subsequence. Feature vectors from the Word2vec model can be coupled with classifiers to infer information from the sequence. As a sample application, we consider the problem of determining whether a given protein sequence is an enzyme or not. We use a sample dataset that contains 19,155 enzyme and non-enzyme protein sequences; 20% of these sequences are set aside for testing and 80% are used for 5-fold cross-validation. The best performance scores, 0.97 for precision, recall, F1, and accuracy and 0.93 for the Matthews correlation coefficient, are obtained by the Word2vec model with a vector size of 100, a window size of 25, and 180 epochs, combined with the Random Forest classifier. We also generate vector representations for the first level of Enzyme Commission classes using the same hyper-parameter set for the Word2vec model. Binary classification is applied to the vector representations of each class, and the average performance scores obtained with the Random Forest classifier are 0.87 for precision, recall, F1, and accuracy and 0.70 for the Matthews correlation coefficient.
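
The idea of treating a protein sequence as a sentence whose subsequences are words, and of pairing each subsequence with its neighbours for skip-gram training, can be illustrated with a minimal Python sketch. This is not the thesis implementation: the abstract does not state the subsequence length or whether subsequences overlap, so the overlapping 3-letter words, the function names `to_words` and `skipgram_pairs`, and the toy sequence are all assumptions made for illustration. Only the window size of 25 comes from the abstract.

```python
from typing import Iterator, List, Tuple


def to_words(sequence: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-letter subsequences ("words").

    Assumption: overlapping k-mers with k=3; the abstract does not specify
    how subsequences are formed.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(words: List[str], window: int = 25) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs for every word within `window` positions.

    These pairs are the training examples of a skip-gram model: the model
    learns to predict each context word from the center word.
    """
    for i, center in enumerate(words):
        start = max(0, i - window)
        stop = min(len(words), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield center, words[j]


if __name__ == "__main__":
    toy_sequence = "MKTAYIAK"  # hypothetical sequence, not from the dataset
    words = to_words(toy_sequence, k=3)
    print(words)
    # A small window keeps the printed list short; the abstract reports 25.
    for center, context in skipgram_pairs(words, window=1):
        print(center, "->", context)
```

A trained skip-gram model assigns each word a fixed-length vector (100 dimensions in the best reported configuration), which is what makes variable-length sequences usable as classifier input.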

Subject Keywords

Word2vec, Word Embedding, Proteins, Classification, Enzymes

URI

http://etd.lib.metu.edu.tr/upload/12624043/index.pdf
https://hdl.handle.net/11511/44706

Collections

Graduate School of Natural and Applied Sciences, Thesis

Citation Formats

E. Akın, “Enzyme prediction with word embedding approach,” M.S. thesis, Graduate School of Natural and Applied Sciences, Computer Engineering, Middle East Technical University, 2019.