HCAB-SMOTE: A Hybrid Clustered Affinitive Borderline SMOTE Approach for Imbalanced Data Binary Classification

2020-04-01
Al Majzoub, Hisham
Elgedawy, Islam
Akaydin, Oyku
Ulukok, Mehtap Kose
Binary datasets are considered imbalanced when one of their two classes (the minority class) has less than 40% of the total number of data instances. Existing classification algorithms are biased when applied to imbalanced binary datasets, as they misclassify instances of the minority class. Many techniques have been proposed to minimize this bias and to increase classification accuracy. Synthetic Minority Oversampling Technique (SMOTE) is a well-known approach proposed to address this problem. It generates new synthetic data instances to balance the dataset. Unfortunately, it generates these instances randomly, which leads to useless new instances and wastes time and memory. Different SMOTE derivatives (such as Borderline SMOTE) were proposed to overcome this problem, yet the number of generated instances changed only slightly. To overcome this problem, this paper proposes a novel approach for generating synthesized data instances known as Hybrid Clustered Affinitive Borderline SMOTE (HCAB-SMOTE). It minimizes the number of generated instances while increasing classification accuracy. It combines undersampling, to remove noisy majority instances, with oversampling, to increase the density of the borderline area. It applies k-means clustering to the borderline area and identifies which clusters to oversample to achieve better results. Experimental results show that HCAB-SMOTE outperformed the SMOTE, Borderline SMOTE, AB-SMOTE, and CAB-SMOTE approaches that were developed before reaching HCAB-SMOTE, as it provided the highest classification accuracy with the fewest generated instances.
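The following is a minimal sketch of the general idea described in the abstract, not the authors' implementation. It assumes the borderline rule of standard Borderline-SMOTE (a minority instance is borderline when at least half, but not all, of its k nearest neighbours belong to the majority class) and an assumed noise rule (a majority instance is noise when all of its k nearest neighbours are minority). The k-means step is left out because the abstract does not state how clusters are selected. All function names are hypothetical.

```python
import math
import random

def minority_fraction(labels, minority=1):
    """Share of instances that belong to the minority class."""
    return sum(1 for y in labels if y == minority) / len(labels)

def k_nearest(idx, points, k):
    """Indices of the k nearest neighbours of points[idx], excluding itself."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]

def borderline_minority(points, labels, k=5, minority=1):
    """Minority indices with at least half, but not all, majority neighbours."""
    border = []
    for i, y in enumerate(labels):
        if y != minority:
            continue
        m = sum(1 for j in k_nearest(i, points, k) if labels[j] != minority)
        if k / 2 <= m < k:
            border.append(i)
    return border

def majority_noise(points, labels, k=5, minority=1):
    """Majority indices whose k neighbours are all minority (assumed rule)."""
    return [i for i, y in enumerate(labels)
            if y != minority
            and all(labels[j] == minority for j in k_nearest(i, points, k))]

def hybrid_resample(points, labels, n_new, k=5, minority=1, seed=0):
    """Drop majority noise, then add n_new synthetic borderline instances."""
    noise = set(majority_noise(points, labels, k, minority))
    kept = [(p, y) for i, (p, y) in enumerate(zip(points, labels))
            if i not in noise]
    pts = [p for p, _ in kept]
    lbls = [y for _, y in kept]
    border = borderline_minority(pts, lbls, k, minority)
    minority_pts = [p for p, y in kept if y == minority]
    if not border or len(minority_pts) < 2:
        return pts, lbls  # nothing to oversample
    rng = random.Random(seed)
    for _ in range(n_new):
        base = pts[rng.choice(border)]
        # SMOTE-style step: pick one of the k nearest minority neighbours
        # and interpolate at a random position between the two points.
        mates = sorted((q for q in minority_pts if q is not base),
                       key=lambda q: math.dist(base, q))[:k]
        mate = rng.choice(mates)
        gap = rng.random()
        pts.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
        lbls.append(minority)
    return pts, lbls
```

A caller would typically check `minority_fraction(labels) < 0.40` first, matching the imbalance threshold given in the abstract, and only then call `hybrid_resample`. Restricting the base points to the borderline set is what keeps the number of synthetic instances small compared with sampling from every minority instance.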
ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING

Citation Formats
H. Al Majzoub, I. Elgedawy, O. Akaydin, and M. K. Ulukok, “HCAB-SMOTE: A Hybrid Clustered Affinitive Borderline SMOTE Approach for Imbalanced Data Binary Classification,” ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, pp. 3205–3222, 2020, Accessed: 00, 2020. [Online]. Available: https://hdl.handle.net/11511/67770.