Cost-sensitive learning for rare subtype classification of lung cancer
Download
HIBIT22_paper_111.pdf
Date
2022-10
Author
Kızılilsoley, Nehir
Tanıl, Ezgi
Nikerel, Emrah
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Item Usage Stats
89 views, 31 downloads
Machine learning (ML) algorithms typically assume that the training set is balanced among classes. On imbalanced datasets, even when overall accuracy is high, classical ML algorithms are biased toward the majority class, causing the model to fit the minority class poorly [1,2], which hinders the use of these algorithms for classifying rare events. Proposed strategies to overcome this problem include altering the training data directly to reduce the difference between classes, or changing the learning procedure so that the algorithm also takes the minority class into account [2]. The imbalance problem is usually handled by oversampling the minority class, undersampling the majority class, and/or generating synthetic samples from the original training data. Gene expression data are highly valuable and popular for cancer classification by ML; however, they are high-dimensional and severely imbalanced, making gene expression classification a cost-sensitive problem [1]. Cost-sensitive learning (CSL) uses unequal misclassification costs for the classes when making predictions and is required when predicting the minority class is more “interesting” than the other class(es). Instead of maximizing overall accuracy across all classes under equal costs, the goal is to minimize cost (the penalty of a misclassification), since classes are associated with different misclassification penalties. In this work, subtypes of lung cancer (AD, SC, LaC and SCLC) are classified using different CSL models that are either classical learners (e.g., support vector machines, naïve Bayes, random forest) or ensemble learners, using imbalanced RNA-seq data from TCGA and microarray data from NCBI-GEO. The best-performing model is evaluated with appropriate performance metrics (G-mean, accuracy, F-score, etc.), and the most important feature(s) will be extracted from this model using variable importance values.
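As a rough illustration of the cost-sensitive setup described in the abstract (not the authors' implementation), the sketch below trains a class-weighted random forest on a synthetic imbalanced four-class dataset standing in for the expression data, then reports G-mean, macro F-score, and accuracy, and ranks features by impurity-based variable importance. The dataset, class proportions, and hyperparameters are assumptions chosen only for illustration.

```python
# Minimal sketch of cost-sensitive learning via class weights (not the
# authors' code): rarer classes receive proportionally higher
# misclassification cost through class_weight="balanced".
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced 4-class expression matrix
# (e.g., AD / SC / LaC / SCLC); real data would come from TCGA / NCBI-GEO.
X, y = make_classification(
    n_samples=2000, n_features=200, n_informative=30, n_classes=4,
    weights=[0.55, 0.25, 0.15, 0.05], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

# class_weight="balanced" sets each class weight inversely proportional to
# its frequency, i.e., a simple form of cost-sensitive learning.
model = RandomForestClassifier(
    n_estimators=300, class_weight="balanced", random_state=0
)
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# G-mean: geometric mean of per-class recalls; plus macro F-score and accuracy.
per_class_recall = recall_score(y_te, y_pred, average=None)
g_mean = float(np.prod(per_class_recall) ** (1.0 / len(per_class_recall)))
print(f"G-mean:   {g_mean:.3f}")
print(f"F-score:  {f1_score(y_te, y_pred, average='macro'):.3f}")
print(f"Accuracy: {accuracy_score(y_te, y_pred):.3f}")

# Variable importance: rank features by the forest's impurity-based importances.
top = np.argsort(model.feature_importances_)[::-1][:10]
print("Top-10 features by importance:", top)
```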
URI
https://hibit2022.ims.metu.edu.tr
https://hdl.handle.net/11511/101351
Conference Name
The International Symposium on Health Informatics and Bioinformatics
Collections
Graduate School of Informatics, Conference / Seminar
Suggestions
OpenMETU Core
Reducing Features to Improve Link Prediction Performance in Location Based Social Networks, Non-Monotonically Selected Subset from Feature Clusters
Bayrak, Ahmet Engin; Polat, Faruk (2019-01-01)
In most cases, feature sets available for machine learning algorithms require a feature engineering approach to pick the subset for optimal performance. During our link prediction research, we had observed the same challenge for features of Location Based Social Networks (LBSNs). We applied multiple reduction approaches to avoid performance issues caused by redundancy and relevance interactions between features. One of the approaches was the custom two-step method; starts with clustering features based on t...
Domain adaptation on graphs by learning graph topologies: theoretical analysis and an algorithm
Vural, Elif (The Scientific and Technological Research Council of Turkey, 2019-01-01)
Traditional machine learning algorithms assume that the training and test data have the same distribution, while this assumption does not necessarily hold in real applications. Domain adaptation methods take into account the deviations in data distribution. In this work, we study the problem of domain adaptation on graphs. We consider a source graph and a target graph constructed with samples drawn from data manifolds. We study the problem of estimating the unknown class labels on the target graph using the...
Cross-modal Representation Learning with Nonlinear Dimensionality Reduction
KAYA, SEMİH; Vural, Elif (2019-08-22)
In many problems in machine learning there exist relations between data collections from different modalities. The purpose of multi-modal learning algorithms is to efficiently use the information present in different modalities when solving multi-modal retrieval problems. In this work, a multi-modal representation learning algorithm is proposed, which is based on nonlinear dimensionality reduction. Compared to linear dimensionality reduction methods, nonlinear methods provide more flexible representations e...
Domain Adaptation on Graphs via Frequency Analysis
Pilancı, Mehmet; Vural, Elif (2019-08-22)
Classical machine learning algorithms assume the training and test data to be sampled from the same distribution, while this assumption may be violated in practice. Domain adaptation methods aim to exploit the information available in a source domain in order to improve the performance of classification in a target domain. In this work, we focus on the problem of domain adaptation in graph settings. We consider a source graph with many labeled nodes and aim to estimate the class labels on a target graph wit...
MODELLING OF KERNEL MACHINES BY INFINITE AND SEMI-INFINITE PROGRAMMING
Ozogur-Akyuz, S.; Weber, Gerhard Wilhelm (2009-06-03)
In Machine Learning (ML) algorithms, one of the crucial issues is the representation of the data. As the data become heterogeneous and large-scale, single kernel methods become insufficient to classify nonlinear data. The finite combinations of kernels are limited up to a finite choice. In order to overcome this discrepancy, we propose a novel method of "infinite" kernel combinations for learning problems with the help of infinite and semi-infinite programming regarding all elements in kernel space. Looking...
Citation Formats
IEEE
N. Kızılilsoley, E. Tanıl, and E. Nikerel, “Cost-sensitive learning for rare subtype classification of lung cancer,” Erdemli, Mersin, TÜRKİYE, 2022, p. 3111, Accessed: 00, 2023. [Online]. Available: https://hibit2022.ims.metu.edu.tr.