Biological data integration and relation prediction by matrix factorization

2020
Abay, Gökçe
The volume of available molecular sequence data has grown greatly over the last few decades, thanks to new technological developments in the life sciences. For this data to be useful to the scientific community, it must be characterized. Traditionally, this characterization is done manually: experimentally produced molecular data is curated and stored in biological databases. The huge volume of currently available data calls for automatic and systematic analysis. A crucial part of such analysis is data integration, together with the identification of relationships between elements of different biological data types. In this study, we propose to integrate large-scale gene/protein annotation data using non-negative matrix factorization (NMF), a method frequently used in recommender systems with successful real-world applications. NMF has also been employed to unite multi-relational data in many fields, including bioinformatics and cheminformatics. For this study, we first collected protein annotations such as molecular functions, biological processes, sub-cellular localizations, and disease relations from resources such as UniProt-GOA and DisGeNET, and organized them as binary relation matrices. We then applied various NMF-based algorithms to this multi-dimensional relational biomolecular sequence annotation data (i.e., genes/proteins vs. functions, genes/proteins vs. diseases, diseases vs. functions) and evaluated each model, via cross-validation, on its capacity to learn the intrinsic structure of the relational data. The results indicated that NMF can retrieve most of the known protein annotations without using any sequence- or structure-based protein features (AUROC: 0.80 – 0.94, accuracy: 0.53 – 0.64, F1-score: 0.06 – 0.40, MCC: 0.13 – 0.38).
Using NMF, the ultimate aim is to predict the unknown binary relationships between these biological entities and to represent the entities (i.e., proteins, functions, and disease entries) as informative, non-redundant quantitative feature vectors (using the low-rank feature matrices generated by the factorization), which can be used in future data mining and machine learning tasks such as the automated annotation of proteins or the construction of biological knowledge graphs.
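The general idea of factorizing a binary relation matrix and scoring held-out entries can be illustrated with a minimal sketch. It is not the thesis implementation: the toy 6 x 6 protein-vs-function matrix, the choice of masked multiplicative updates, the rank, and the names `masked_nmf` and `auroc` are all assumptions made for illustration.

```python
import numpy as np

def masked_nmf(R, mask, rank=2, n_iter=500, seed=0, eps=1e-9):
    """Factor R ~ W @ H with non-negative W, H, fitting only entries where mask == 1.

    Uses multiplicative updates for the masked squared-error loss
    (an illustrative choice, not necessarily the algorithm used in the thesis).
    """
    rng = np.random.default_rng(seed)
    n, m = R.shape
    W = rng.random((n, rank)) + 0.1
    H = rng.random((rank, m)) + 0.1
    for _ in range(n_iter):
        WH = W @ H
        W *= ((mask * R) @ H.T) / ((mask * WH) @ H.T + eps)
        WH = W @ H
        H *= (W.T @ (mask * R)) / (W.T @ (mask * WH) + eps)
    return W, H

def auroc(pos_scores, neg_scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

# Hypothetical toy data: rows are proteins, columns are functions, 1 = known annotation.
R = np.zeros((6, 6))
R[:3, :3] = 1.0
R[3:, 3:] = 1.0

# Hide a few entries (two known relations, two non-relations) to mimic one validation fold.
held_out = [(0, 1), (4, 4), (0, 4), (4, 1)]
mask = np.ones_like(R)
for i, j in held_out:
    mask[i, j] = 0.0

W, H = masked_nmf(R, mask)
scores = W @ H  # predicted relation scores for every protein-function pair

pos = [scores[i, j] for i, j in held_out if R[i, j] == 1]
neg = [scores[i, j] for i, j in held_out if R[i, j] == 0]
print("held-out AUROC:", auroc(pos, neg))
```

In this sketch, the rows of `W` and the columns of `H` play the role of the low-rank feature vectors for proteins and functions, and `scores` ranks the hidden pairs. A real evaluation would repeat the hold-out over several folds and over each relation matrix.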

Suggestions

Collaborative building control: a conceptual mixed-initiative framework
Topak, Fatih; Pekeriçli, Mehmet Koray (Taylor & Francis, 2021-6-22)
In the last two decades, automation systems have shown advanced developments and are widely adopted for various purposes in many fields. However, automation in buildings has not gained popularity and has a low acceptance level amongst the occupants. Decreased perceived control, ever-changing dynamic human needs, and standardized, one-size-fits-all approach in current automation systems lead to disharmony in human-machine coexistence. Although well-established continuous interaction between building control ...
Computational approaches leveraging integrated connections of multi-omic data toward clinical applications
Demirel, Habibe Cansu; Tunçbağ, Nurcan (2021-10-01)
In line with the advances in high-throughput technologies, multiple omic datasets have accumulated to study biological systems and diseases coherently. No single omics data type is capable of fully representing cellular activity. The complexity of the biological processes arises from the interactions between omic entities such as genes, proteins, and metabolites. Therefore, multi-omic data integration is crucial but challenging. The impact of the molecular alterations in multi-omic data is not local in the ...
Multi-modal learning with generalizable nonlinear dimensionality reduction
Kaya, Semih; Vural, Elif; Department of Electrical and Electronics Engineering (2019)
Thanks to significant advancements in information technologies, people can acquire various types of data from the universe. This data may include multiple features in different domains. Widespread machine learning methods benefit from distinctive features of data to reach desired outputs. Numerous studies demonstrate that machine learning algorithms that make use of multi-modal representations of data have more potential than methods with single modal structure. This potential comes from the mutual agreemen...
Kinetic approach for the purification of nucleotides with magnetic separation
Tural, Servet; Tural, Bilsen; Ece, Mehmet Sakir; Yetkin, Evren; Özkan, Necati (2014-11-01)
The isolation of beta-nicotinamide adenine dinucleotide is of great importance since it is widely used in different scientific and technological fields such as biofuel cells, sensor technology, and hydrogen production. In order to isolate beta-nicotinamide adenine dinucleotide, first 3-aminophenylboronic acid functionalized magnetic nanoparticles were prepared to serve as a magnetic solid support and subsequently they were used for reversible adsorption/desorption of beta-nicotinamide adenine dinucleotide in a...
Automated biological data acquisition and integration using machine learning techniques
Çarkacıoğlu, Levent; Atalay, Mehmet Volkan; Department of Computer Engineering (2009)
Since the initial genome sequencing projects, recent advances in technology, molecular biology, and large-scale transcriptome analysis have resulted in data accumulation at a large scale. These data have been provided on different platforms and come from different laboratories; therefore, there is a need for compilation and comprehensive analysis. In this thesis, we addressed the automation of biological data acquisition and integration from these non-uniform data using machine learning techniques....
Citation Formats
G. Abay, “Biological data integration and relation prediction by matrix factorization,” M.S. thesis, Graduate School of Informatics, Bioinformatics, Middle East Technical University, 2020.