Machine learning-based prediction of survival in cancer using multi-omics data

Çoruh, Ayşe Nur
Doğan, Tunca
Ayşe Nur Çoruh 1,2; Tunca Doğan 1,2,*
1 Biological Data Science Lab, Department of Computer Engineering, Hacettepe University, Ankara, Turkey
2 Department of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara, Turkey
*To whom correspondence should be addressed:

According to the World Health Organization, cancer is one of the leading causes of death worldwide. The high lethality of some cancer sub-types increases the importance of correct diagnosis, complete follow-up, and effective treatment. Survivability in cancer can be defined as the length of time that patients live after the diagnosis and/or the administration of a certain treatment. The estimation of survival, which is a critical topic in biomedicine, is possible using relevant indicators and historical patient data. For this purpose, computational methods such as machine learning techniques and statistical approaches have been utilized. Until recently, researchers mainly used the clinical and demographic data of patients to model survivability, which generally resulted in low success because it ignored patient-specific molecular properties that affect both the response to a treatment and the progression of the disease in general. To solve this problem, personalized medicine-based approaches have been developed and used in cancer research in recent years. As different types of “omics” data have become easier for an ordinary lab to produce, more data have accumulated on public servers, which has allowed computational scientists to build more successful prediction models. There are studies in the literature that use single-omic data to predict survivability in cancer; however, the utilization of multi-omics data is still understudied in this context.
Here, our research question is: “Would it be possible to construct more successful survival prediction models by diversifying the input data used during modeling?” In this study, we propose a new computational method to predict the survival of cancer patients. For this purpose, we utilized multi-omics data of patients diagnosed with one of 13 different types of cancer, obtained from the Genomic Data Commons (GDC) data portal. GDC contains data from different cancer projects carried out within the scope of The Cancer Genome Atlas (TCGA) Program. In our study, we used mutation, copy number variation (CNV), gene expression, and miRNA expression as our input omic data types. In addition, we incorporated the clinical data and administered drug information of the patients in our dataset into our input features. We divided patients into two survival groups by determining a specific survival-time threshold for each tissue/cancer type. For the input genes, we employed (i) the whole human genome and (ii) the genes in the L1000 (landmark) set, in two different settings, to determine the better choice in terms of dimensionality and computational cost. We utilized the random forest algorithm and trained 13 tissue/cancer-specific binary classification models (i.e., the classes are 0: the patient dies before reaching the threshold duration, such as 3 years; 1: the patient lives longer than the threshold duration). We employed a leave-one-out cross-validation (LOOCV) strategy to calculate performance scores, and Kaplan–Meier plots to evaluate the prediction output. Figure 1 summarizes the overall workflow of the study. According to our results, models that use multiple types of omic data achieved better prediction performance than models using a single omic type.
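The classification setup described above (threshold-based survival labels, a random forest, and LOOCV) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the helper `label_by_threshold`, the choice to drop patients censored before the threshold, the feature matrix, the 3-year threshold, and the hyperparameters are all assumptions made for the example. It assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict


def label_by_threshold(time_years, event_observed, threshold_years):
    """Hypothetical helper: turn follow-up data into binary survival classes.

    Class 1: patient is known to have lived longer than the threshold.
    Class 0: patient died at or before the threshold.
    Patients censored at or before the threshold have an unknown class
    and are dropped (an assumption, not stated in the abstract).
    Returns (labels_of_kept_patients, keep_mask).
    """
    time_years = np.asarray(time_years, dtype=float)
    event_observed = np.asarray(event_observed, dtype=bool)
    long_survivor = time_years > threshold_years
    short_survivor = (time_years <= threshold_years) & event_observed
    keep = long_survivor | short_survivor
    return long_survivor[keep].astype(int), keep


# Synthetic stand-in for one tissue: 40 patients, 20 "omic" features.
rng = np.random.default_rng(0)
n_patients, n_features = 40, 20
X = rng.normal(size=(n_patients, n_features))
# Survival time loosely depends on feature 0 so there is some signal.
time_years = 3.0 * np.exp(0.8 * X[:, 0] + rng.normal(scale=0.3, size=n_patients))
event_observed = rng.random(n_patients) < 0.8  # about 20% censored

y, keep = label_by_threshold(time_years, event_observed, threshold_years=3.0)
X_kept = X[keep]

# LOOCV: each patient is predicted by a model trained on all the others.
model = RandomForestClassifier(n_estimators=50, random_state=0)
y_pred = cross_val_predict(model, X_kept, y, cv=LeaveOneOut())
loocv_accuracy = accuracy_score(y, y_pred)
print(f"patients kept: {len(y)}, LOOCV accuracy: {loocv_accuracy:.2f}")
```

In a per-tissue setting, the same steps would be repeated once per cancer type with that tissue's own threshold; with real data, metrics that are robust to class imbalance may be preferable to plain accuracy.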
We also found that using the clinical data and drug information of patients in addition to the multi-omics data further increased the prediction performance in most tissues. Among the different types of omics data, models that utilize mutation and gene expression features obtained the highest prediction performance in the majority of the tissues. Breast and kidney tissue models performed better than the models of other tissues, probably due to lower heterogeneity among patient signatures. Utilizing the L1000 genes reduced both the noise and the curse of dimensionality, and provided better performance than using the whole genome at the input level. We also observed that applying feature selection during the data preprocessing step did not significantly improve the prediction performance. These results support the idea that multi-omics data can be successfully used to predict the survival of cancer patients. For future studies, we plan to incorporate additional types of data, such as proteomics, lipidomics, and glycomics, where available. We also plan to try novel machine/deep learning algorithms, especially in the framework of multi-modal learning.
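The Kaplan–Meier evaluation mentioned in the abstract rests on the product-limit estimate S(t) = Π (1 - d_i / n_i), where d_i is the number of deaths at event time t_i and n_i is the number of patients still at risk just before t_i. A minimal pure-Python sketch is given below; the function name `kaplan_meier` and the toy follow-up data are assumptions for illustration, and the way predicted classes are compared here is not taken from the study.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up duration per patient.
    events: 1 if death was observed at that time, 0 if censored.
    Returns a list of (t, S(t)) pairs, one per distinct death time.
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)  # everyone leaving the risk set at time t
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # deaths and censored patients both leave
    return curve


# Toy data: follow-up in years, grouped by a model's predicted class.
predicted_short = {"times": [0.5, 1.0, 1.5, 2.0, 2.5], "events": [1, 1, 0, 1, 1]}
predicted_long = {"times": [2.0, 3.5, 4.0, 5.0, 6.0], "events": [0, 1, 0, 1, 0]}

for name, group in [("short", predicted_short), ("long", predicted_long)]:
    print(name, kaplan_meier(group["times"], group["events"]))
```

If the classifier separates patients well, the curve for the predicted long-survival group should stay above the curve for the predicted short-survival group over the follow-up period.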


Citation Formats
A. N. Çoruh and T. Doğan, “Machine learning-based prediction of survival in cancer using multi-omics data,” Erdemli, Mersin, TÜRKİYE, 2022, p. 2118, Accessed: 00, 2023. [Online]. Available: