Distance matrices as protein representations

2022-09-02
Dinç, Mehmet
Representing protein sequences is a crucial problem in bioinformatics, since the performance of any data-driven model is limited by the information contained in its input features. A protein's biological function is dictated by its structure, and knowing that structure can help predict its interactions with drug candidates or its Gene Ontology (GO) terms. Yet off-the-shelf protein representations do not contain such information, because experiments to determine structures are costly and only a small fraction of the billions of known protein sequences have experimentally determined structures. AlphaFold, a recently introduced neural network-based structure prediction model, is reported to predict protein structures with high accuracy. In this study, two-dimensional distance matrices generated from AlphaFold structure predictions are used as input features for two bioinformatics problems: drug-target interaction (DTI) prediction and GO term prediction. For DTI prediction, a state-of-the-art model that already uses two-dimensional protein features is employed as a baseline, and the effect of the distance matrices is then observed through ablation studies. The same model is also adapted to the GO term prediction problem, and its performance is compared with that of off-the-shelf protein representations.
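
As a rough illustration of the kind of feature the abstract describes, the sketch below builds a residue-by-residue distance matrix from a predicted structure. It is not the thesis's implementation: the choice of C-alpha atoms as the representative point for each residue, the use of PDB-format text as input, and the function names parse_ca_coords and ca_distance_matrix are all assumptions made for this example.

```python
import math


def parse_ca_coords(pdb_text):
    """Collect (x, y, z) tuples for C-alpha atoms from PDB-format text.

    Assumption: one C-alpha atom per residue is used as the residue position.
    """
    coords = []
    for line in pdb_text.splitlines():
        # PDB ATOM records use fixed columns:
        # atom name in columns 13-16, x/y/z in columns 31-38, 39-46, 47-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            coords.append((x, y, z))
    return coords


def ca_distance_matrix(coords):
    """Return an n x n list of lists of Euclidean distances between residues."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]
```

Given the text of a structure file, `ca_distance_matrix(parse_ca_coords(text))` yields a symmetric matrix with zeros on the diagonal, whose side length equals the number of residues. A model that expects fixed-size two-dimensional inputs would still need some padding or resizing step, which is not shown here.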

Suggestions

Distance-based discretization of parametric signal manifolds
Vural, Elif (2010-06-28)
The characterization of signals and images in manifolds often leads to efficient dimensionality reduction algorithms based on manifold distance computation for analysis or classification tasks. In this paper, we propose a method for the discretization of signal manifolds given in a parametric form. We present an iterative algorithm for selecting samples on the manifold that minimizes the average error in the manifold distance computation. Experimental results with image appearance manifolds d...
Large-scale automated function prediction of protein sequences and an experimental case study validation on PTEN transcript variants
Rifaioğlu, Ahmet Süreyya; Sarac, Omer Sinan; Ersahin, Tulin; Saidi, Rabie; Atalay, Mehmet Volkan; Atalay, Rengül (2018-02-01)
Recent advances in computing power and machine learning empower functional annotation of protein sequences and their transcript variations. Here, we present UniGOPred, an automated prediction system for GO annotations, together with a database of GO term predictions for the proteomes of several organisms in the UniProt Knowledgebase (UniProtKB). UniGOPred provides function predictions for 514 molecular function (MF), 2909 biological process (BP), and 438 cellular component (CC) GO terms for each protein sequence. UniGOPred co...
Similarity matrix framework for data from union of subspaces
Aldroubi, Akram; Sekmen, Ali; Koku, Ahmet Buğra; Cakmak, Ahmet Faruk (2018-09-01)
This paper presents a framework for finding similarity matrices for the segmentation of data $W = [w_1 \cdots w_N] \subset \mathbb{R}^D$ drawn from a union $U = \bigcup_{i=1}^{M} S_i$ of independent subspaces $\{S_i\}_{i=1}^{M}$ of dimensions $\{d_i\}_{i=1}^{M}$. It is shown that any factorization of $W = BP$, where columns of $B$ form a basis for data $W$ and they also come from $U$, can be used to produce a similarity matrix $\Xi_W$. In other words, $\Xi_W(i, j) \neq 0$ when the columns $w_i$ and $w_j$ of $W$ come from the same subspace, ...
Short Time Series Microarray Data Analysis and Biological Annotation
Sökmen, Zerrin; Atalay, Mehmet Volkan; Atalay, Rengül (2008-01-01)
The significant gene list that results from microarray data analysis should be explained in terms of biological functions. The aim of this study is to extract biologically related gene clusters from short time series microarray gene data by applying unsupervised methods, and to automatically perform biological annotation of those clusters. In the first step of the study, short time series microarray expression data is clustered according to similar expression profiles. After that, several biological d...
TRACEMIN-Fiedler: A Parallel Algorithm for Computing the Fiedler Vector
Manguoğlu, Murat; Saied, Faisal; Sameh, Ahmed (2010-06-25)
The eigenvector corresponding to the second smallest eigenvalue of the Laplacian of a graph, known as the Fiedler vector, has a number of applications in areas that include matrix reordering, graph partitioning, protein analysis, data mining, machine learning, and web search. The computation of the Fiedler vector has been regarded as an expensive process as it involves solving a large eigenvalue problem. We present a novel and efficient parallel algorithm for computing the Fiedler vector of large graphs bas...
Citation Formats
M. Dinç, “Distance matrices as protein representations,” M.S. - Master of Science, Middle East Technical University, 2022.