Learning functional properties of proteins with language models

Unsal, Serbulent
Atas, Heval
Acar, Aybar Can
Data-centric approaches have been used to develop predictive methods for elucidating uncharacterized properties of proteins; however, studies indicate that these methods should be further improved to effectively solve critical problems in biomedicine and biotechnology, which can be achieved by better representing the data at hand. Novel data representation approaches mostly take inspiration from language models that have yielded ground-breaking improvements in natural language processing. Lately, these approaches have been applied to the field of protein science and have displayed highly promising results in terms of extracting complex sequence–structure–function relationships. In this study, we conducted a detailed investigation of protein representation learning by first categorizing and explaining each approach, and then benchmarking their performance on predicting: (1) semantic similarities between proteins, (2) ontology-based protein functions, (3) drug target protein families and (4) protein–protein binding affinity changes following mutations. We evaluate and discuss the advantages and disadvantages of each method based on the benchmark results, source datasets and algorithms used, in comparison with classical model-driven approaches. Finally, we discuss current challenges and suggest future directions. We believe that the conclusions of this study will help researchers to apply machine/deep learning-based representation techniques to protein data for various predictive tasks, and inspire the development of novel methods.
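
To make the first benchmark task (semantic similarity between proteins) more concrete, the following is a minimal sketch of one way such an evaluation could be scored: pairwise similarities between representation vectors are compared with reference similarity scores using a rank correlation. The toy vectors, the reference scores, the protein labels, and the choice of cosine similarity and Spearman correlation are all illustrative assumptions, not details taken from the abstract.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical 3-dimensional representation vectors for four proteins.
embeddings = {
    "P1": [1.0, 0.0, 0.0],
    "P2": [0.9, 0.1, 0.0],
    "P3": [0.0, 1.0, 0.0],
    "P4": [0.0, 0.9, 0.1],
}

# Hypothetical reference similarities (e.g. derived from shared annotations).
reference_similarity = {
    ("P1", "P2"): 0.90,
    ("P1", "P3"): 0.10,
    ("P1", "P4"): 0.20,
    ("P2", "P3"): 0.15,
    ("P2", "P4"): 0.25,
    ("P3", "P4"): 0.80,
}

pairs = list(combinations(sorted(embeddings), 2))
predicted = [cosine_similarity(embeddings[a], embeddings[b]) for a, b in pairs]
expected = [reference_similarity[pair] for pair in pairs]

score = spearman(predicted, expected)
print(f"Spearman correlation between predicted and reference similarity: {score:.3f}")
```

A higher correlation would suggest that distances in the representation space track the reference notion of similarity more closely; with real data, the vectors would come from a trained representation model and the reference scores from curated annotations.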


Prediction of enzyme classes in a hierarchical approach by using SPMap
Yaman, A.; Atalay, Mehmet Volkan; Atalay, Rengül (2010-04-01)
Enzymes are proteins that play important roles in biochemical reactions as catalysts. They are classified based on the reaction they catalyze, in a hierarchical scheme defined by the International Enzyme Commission (EC). This hierarchical scheme is expressed as a four-level tree structure, and a unique number is assigned to each enzyme class. There are six major classes at the top level according to the reaction they carry out, and the sub-classes at the lower levels are more specific reactions of these classes. The aim ...
Prediction of Protein Interactions by Structural Matching: Prediction of PPI Networks and the Effects of Mutations on PPIs that Combines Sequence and Structural Information
Tunçbağ, Nurcan; Nussinov, Ruth; Gursoy, Attila (Humana Press Inc., 2017)
Structural details of protein interactions are invaluable to the understanding of cellular processes. However, the identification of interactions at atomic resolution is a continuing challenge in the systems biology era. Although the number of structurally resolved complexes in the Protein Databank increases exponentially, the complexes only cover a small portion of the known structural interactome. In this chapter, we review the PRISM system that is a protein–protein interaction (PPI) prediction tool—its r...
Integration of topological measures for eliminating non-specific interactions in protein interaction networks
Bayir, Murat Ali; Guney, Tacettin Dogacan; Can, Tolga (Elsevier BV, 2009-05-28)
High-throughput protein interaction assays aim to provide a comprehensive list of interactions that govern the biological processes in a cell. These large-scale sets of interactions, represented as protein-protein interaction networks, are often analyzed by computational methods for detailed biological interpretation. However, as a result of the tradeoff between speed and accuracy, the interactions reported by high-throughput techniques occasionally include non-specific (i.e., false-positive) interactions. ...
Distance-based Indexing of Residue Contacts for Protein Structure Retrieval and Alignment
Sacan, Ahmet; Toroslu, İsmail Hakkı; Ferhatosmanoglu, Hakan (2008-10-10)
New protein structures are continuously being determined with the hope of deriving insights into the function and mechanisms of proteins, and consequently, protein structure repositories are growing by leaps and bounds. However, we are still far from having the right methods for sensitive and effective use of the available structural data. The fact that current structural analysis tools are impractical for large-scale applications has given rise to several approaches that try to quickly identify candidate ...
Analysis of protein-protein interaction networks using random walks
Can, Tolga; Singh, Ambuj K. (2005-08-21)
Genome-wide protein networks have become a reality in recent years due to high-throughput methods for detecting protein interactions. Recent studies show that a networked representation of proteins provides a more accurate model of biological systems and processes compared to conventional pair-wise analyses. Complementary to the availability of protein networks, various graph analysis techniques have been proposed to mine these networks for pathway discovery, function assignment, and prediction of complex mem...
Citation Formats
S. Unsal, H. Atas, M. Albayrak, K. Turhan, A. C. Acar, and T. Doğan, “Learning functional properties of proteins with language models,” Nature Machine Intelligence, vol. 4, no. 3, pp. 227–245, 2022. [Online]. Available: https://hdl.handle.net/11511/97190.