A DEEP LEARNING BASED PROTEIN REPRESENTATION MODEL FOR LOW-DATA PROTEIN FUNCTION PREDICTION

2023-3-27
Ünsal, Serbülent
Protein science is a broad discipline that studies proteins at the individual and proteome levels through both experimental and computational methods. Protein informatics is the branch of protein science that focuses on the computational and data-centric aspects of protein analysis, including the modeling of proteins' quantitative properties. The functional characterization of proteins is a critical part of protein science because it is necessary for developing new biomedical strategies and biotechnological products. However, the experimental and manual methods typically used for functional characterization are time-consuming and costly; as a result, only a small fraction of the millions of protein entries in databases such as UniProt have been manually reviewed and annotated by experts. To address this gap, in silico approaches, including protein function prediction (PFP), are used to predict protein functions computationally. PFP applies machine learning, natural language processing, and other techniques to various types of data, including protein sequence, structure, and interactome information. Developing accurate and reusable PFP methods is an important goal in protein science, as it has the potential to improve our understanding of protein function and advance molecular biology. Nevertheless, PFP remains an open problem, and current methods do not consistently achieve high accuracy. One area that has received relatively little attention in the literature is low-data PFP, that is, the prediction of protein functions for which only a small number of positive training samples are available. To address this challenge, we developed a reusable benchmarking framework called Protein RepresentatiOn BEnchmark (PROBE) for evaluating different methods for PFP.
PROBE allows different approaches to PFP to be compared along several dimensions, including data abundance and the specificity of the predicted terms. We also developed novel methods specifically designed for low-data PFP and evaluated them using PROBE. Our results show that the PROBE framework and the novel low-data PFP methods represent a significant contribution to the field and have the potential to shape future research efforts, particularly in contexts where data availability is limited. Overall, we hope that this study will benefit researchers working in the PFP domain and contribute to ongoing efforts to improve our understanding of protein function.
Citation Formats
S. Ünsal, “A DEEP LEARNING BASED PROTEIN REPRESENTATION MODEL FOR LOW-DATA PROTEIN FUNCTION PREDICTION,” Ph.D. - Doctoral Program, Middle East Technical University, 2023.