Enzyme prediction with word embedding approach

Akın, Erkan
Information such as molecular function, biological process, and cellular localization can be inferred from the protein sequence. However, protein sequences vary in length, so the sequence itself cannot be used directly as a feature vector for pattern recognition and machine learning algorithms, which require fixed-length feature vectors. We describe an approach based on the Word2vec model, more specifically the continuous skip-gram model, to generate a vector representation of a given protein sequence. In this setting, a protein sequence is treated as a document or a sentence, and its subsequences correspond to words. The continuous skip-gram model is trained to predict the surrounding subsequences of a given subsequence. Feature vectors from the Word2vec model can then be coupled with classifiers to infer information from the sequence. As a sample application, we consider the problem of determining whether a given protein sequence is an enzyme or not. We use a sample dataset of 19,155 enzyme and non-enzyme protein sequences; 20% of these sequences are set apart for testing and the remaining 80% are used for 5-fold cross-validation. The best performance is obtained with a Word2vec model with a vector size of 100, a window size of 25, and 180 epochs, combined with a Random Forest classifier: 0.97 for precision, recall, F1, and accuracy, and 0.93 for the Matthews correlation coefficient. We also generate vector representations for the first level of Enzyme Commission classes using the same hyper-parameter set for the Word2vec model. Applying binary classification for each class with the Random Forest classifier yields average scores of 0.87 for precision, recall, F1, and accuracy, and 0.70 for the Matthews correlation coefficient.
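The core idea above — treating a protein sequence as a sentence whose "words" are overlapping subsequences, and training skip-gram to predict surrounding subsequences from a center subsequence — can be sketched as follows. This is a minimal illustration, not the thesis code: the k-mer length (3), the window size, and the toy sequence are assumptions for demonstration. In practice the k-mer lists produced this way would be fed to a skip-gram Word2vec implementation (e.g., gensim's `Word2Vec` with `sg=1`).

```python
def kmers(sequence, k=3):
    """Split a protein sequence into overlapping k-mer 'words'.

    The k-mer length of 3 is an illustrative choice, not the
    thesis's actual setting.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(words, window=2):
    """Generate (center, context) pairs, as the skip-gram objective
    does when learning to predict surrounding subsequences."""
    pairs = []
    for i, center in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs


# Toy protein sequence (hypothetical), treated as a "sentence".
seq = "MKTAYIAKQR"
words = kmers(seq)                       # e.g. ['MKT', 'KTA', 'TAY', ...]
pairs = skipgram_pairs(words, window=2)  # training pairs for skip-gram
```

Once a Word2vec model is trained on such k-mer sentences, a fixed-length feature vector for a whole sequence can be obtained (for example, by averaging its k-mer vectors) and passed to a classifier such as a Random Forest.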