Large scale prediction of protein domain functions using shared annotations

2022-10
Ulusoy, Erva
Doğan, Tunca
Discovering the unknown functions of proteins is a major step toward understanding how biological processes work. The expensive and time-consuming nature of wet-lab experiments has prompted researchers to develop computational strategies for biomolecular function identification. The main idea behind these approaches, whether network analysis- or machine learning-based, is that annotations can be transferred among proteins that share similar characteristics (e.g., sequence, structure, protein-protein interactions, phylogenetic profiles). Because the biological functions of genes and proteins are multifaceted, and a vast amount of scientific knowledge exists on the subject, biomolecular functions must be defined in a systematic and machine-readable way before they can be used for function annotation. Biological ontologies (e.g., the Gene Ontology, GO) are frequently used to meet this need by providing standardized vocabularies of functional information. Protein domains are also relevant in this context. The domain composition of a protein can reveal important properties, as domains are structural and functional units that dictate how the protein acts at the molecular level. In this study, we proposed a new method called Domain2GO, which aims to identify unknown functions of proteins by associating their domains with Gene Ontology terms, thus redefining the problem as domain function prediction (Figure 1 displays the overall methodology). Domain2GO mappings are generated from the domain content of proteins and their documented GO annotations, obtained from the InterPro and UniProt Gene Ontology Annotation (UniProt-GOA) databases, respectively. To obtain highly reliable associations, we employed statistical resampling and analyzed the co-occurrence patterns of domains and GO terms on the same proteins.
Furthermore, three different probabilistic association measures were calculated via the expectation-maximization (EM) algorithm, in order to assess the significance of Domain2GO mappings and to calculate the predictive performance of the proposed method in an ablation setting. Finalized domain-GO mappings were generated by thresholding the association scores. To evaluate protein function prediction performance and compare it against other methods, we employed the Critical Assessment of Functional Annotation 3 (CAFA3) challenge datasets. The results demonstrated the potential of Domain2GO, especially for predicting molecular function (MFO) and biological process (BPO) terms, as it performed better than baseline predictors, curated GO associations, and several methods that participated in the challenge (with Fmax = 0.48 and 0.36 for MFO and BPO, respectively). In addition, we developed a hybrid/ensemble function prediction approach by combining the domain-based Domain2GO with sequence-based BLAST (to benefit from a larger fraction of the biomolecular knowledge space). This approach performed especially well in predicting cellular component annotations, indicating that the two approaches are complementary. We conducted use-case studies and observed that Domain2GO predicts more specific/informative function terms than the manually curated GO annotations of the same proteins. Finally, Domain2GO was applied to predict currently unknown functions of the proteins in the UniProtKB/Swiss-Prot database by propagating domain-associated GO terms to full proteins that contain those domains. Apart from its high performance, another advantage of Domain2GO is its speed, as it is multiple orders of magnitude faster than machine/deep learning methods that have comparable or slightly higher prediction performance.
Furthermore, its results are explainable, unlike those of black-box models, since Domain2GO's function predictions are localized to specific regions/structural units in proteins. The methodology of Domain2GO can easily be adapted to predict other types of biomolecular relationships, such as the disease, phenotype, and ligand/drug associations of genes and proteins. The source code, datasets, and results of the study are fully available at https://github.com/HUBioDataLab/Domain2GO.
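
The co-occurrence idea described in the abstract can be illustrated with a minimal Python sketch. It is not the Domain2GO implementation: the function names, the identifiers in the toy data, and both thresholds are hypothetical, and the score is a plain co-occurrence ratio, n(domain, GO) / n(domain), used as an assumed stand-in for the EM-based association measures and the statistical resampling applied in the study.

```python
from collections import Counter, defaultdict


def build_domain_go_mappings(protein_domains, protein_go_terms,
                             min_cooccurrence=2, min_score=0.5):
    """Score domain-GO pairs by co-occurrence on the same annotated proteins.

    Illustrative score: n(domain, GO) / n(domain). Both thresholds are
    assumed values for this sketch, not those used in the study.
    """
    domain_counts = Counter()
    pair_counts = Counter()
    for protein, domains in protein_domains.items():
        go_terms = protein_go_terms.get(protein)
        if not go_terms:
            continue  # proteins without annotations carry no evidence
        for domain in set(domains):
            domain_counts[domain] += 1
            for go_term in set(go_terms):
                pair_counts[(domain, go_term)] += 1

    # Keep only pairs that pass both thresholds.
    mappings = defaultdict(dict)
    for (domain, go_term), n_pair in pair_counts.items():
        score = n_pair / domain_counts[domain]
        if n_pair >= min_cooccurrence and score >= min_score:
            mappings[domain][go_term] = score
    return dict(mappings)


def predict_protein_functions(domains, mappings):
    """Propagate domain-associated GO terms to a protein with those domains."""
    predictions = {}
    for domain in domains:
        for go_term, score in mappings.get(domain, {}).items():
            # If several domains support a term, keep the highest score.
            predictions[go_term] = max(score, predictions.get(go_term, 0.0))
    return predictions


# Toy data with made-up identifiers (not real InterPro or GO entries).
toy_domains = {
    "P1": {"DOM_A"},
    "P2": {"DOM_A", "DOM_B"},
    "P3": {"DOM_B"},
    "P4": {"DOM_A"},  # unannotated protein to be predicted
}
toy_annotations = {
    "P1": {"GO:X"},
    "P2": {"GO:X", "GO:Y"},
    "P3": {"GO:Y"},
}

toy_mappings = build_domain_go_mappings(toy_domains, toy_annotations)
toy_prediction = predict_protein_functions(toy_domains["P4"], toy_mappings)
```

In this toy example the pairs DOM_A/GO:Y and DOM_B/GO:X each co-occur on a single protein only, so the count threshold filters them out, and the unannotated protein P4 inherits GO:X through its DOM_A domain.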

Citation Formats
E. Ulusoy and T. Doğan, “Large scale prediction of protein domain functions using shared annotations,” Erdemli, Mersin, TÜRKİYE, 2022, p. 3022, Accessed: 00, 2023. [Online]. Available: https://hibit2022.ims.metu.edu.tr.