HCAB-SMOTE: A Hybrid Clustered Affinitive Borderline SMOTE Approach for Imbalanced Data Binary Classification

2020-04-01
Al Majzoub, Hisham
Elgedawy, Islam
Akaydin, Oyku
Ulukok, Mehtap Kose
Binary datasets are considered imbalanced when one of their two classes (the minority class) holds less than 40% of the total number of data instances. Existing classification algorithms are biased when applied to imbalanced binary datasets, as they tend to misclassify instances of the minority class. Many techniques have been proposed to minimize this bias and increase classification accuracy. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known approach to this problem: it generates new synthetic data instances to balance the dataset. Unfortunately, it generates these instances randomly, which produces useless new instances and wastes time and memory. Different SMOTE derivatives, such as Borderline SMOTE, were proposed to overcome this problem, yet the number of generated instances changed only slightly. To address this, the paper proposes a novel approach for generating synthetic data instances, the Hybrid Clustered Affinitive Borderline SMOTE (HCAB-SMOTE), which minimizes the number of generated instances while increasing classification accuracy. It combines undersampling, to remove majority noise instances, with oversampling, to enhance the density of the borderline. It applies k-means clustering to the borderline area and identifies which clusters to oversample to achieve better results. Experimental results show that HCAB-SMOTE outperformed SMOTE, Borderline SMOTE, and the AB-SMOTE and CAB-SMOTE approaches that were developed before reaching HCAB-SMOTE, as it provided the highest classification accuracy with the fewest generated instances.
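As a rough illustration of the ideas named in the abstract, the following Python sketch checks the 40% imbalance criterion, selects borderline minority instances with the usual Borderline SMOTE "danger" rule (at least half, but not all, of the k nearest neighbours belong to the majority class), and interpolates synthetic instances from those seeds only. It is not the authors' HCAB-SMOTE: the k-means step and the undersampling of majority noise are omitted, and all function names, parameters, and the toy data are assumptions made for this example.

```python
import numpy as np

def is_imbalanced(y, threshold=0.40):
    """True if the smaller class holds less than `threshold` of all instances."""
    _, counts = np.unique(y, return_counts=True)
    return counts.min() / counts.sum() < threshold

def borderline_minority(X, y, minority, k=5):
    """Indices of minority points whose k nearest neighbours are at least
    half, but not all, majority (the 'danger' set of Borderline SMOTE)."""
    idx = np.flatnonzero(y == minority)
    # Pairwise distances from every minority point to every point.
    d = np.linalg.norm(X[idx, None, :] - X[None, :, :], axis=2)
    d[np.arange(len(idx)), idx] = np.inf  # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]
    n_major = (y[nn] != minority).sum(axis=1)
    return idx[(n_major >= k / 2) & (n_major < k)]

def smote_like(X, seeds, pool, n_new, k=5, rng=None):
    """Create n_new points by interpolating between a random seed and one of
    its k nearest neighbours in `pool` (assumes every seed is in `pool`)."""
    rng = np.random.default_rng(rng)
    P = X[pool]
    out = []
    for _ in range(n_new):
        s = X[rng.choice(seeds)]
        dist = np.linalg.norm(P - s, axis=1)
        neigh = np.argsort(dist)[1:k + 1]  # position 0 is the seed itself
        nb = P[rng.choice(neigh)]
        out.append(s + rng.random() * (nb - s))
    return np.array(out)

# Hypothetical 1-D toy data: 13 majority points, 6 minority points.
X = np.concatenate([np.arange(-10, 3), [2.5, 2.6, 10, 10.1, 10.2, 10.3]])
X = X.reshape(-1, 1).astype(float)
y = np.array([0] * 13 + [1] * 6)

if is_imbalanced(y):
    minority_idx = np.flatnonzero(y == 1)
    seeds = borderline_minority(X, y, minority=1, k=3)
    if len(seeds) == 0:  # fall back to plain SMOTE-style seeding
        seeds = minority_idx
    synthetic = smote_like(X, seeds, minority_idx, n_new=5, k=3, rng=0)
    print(seeds, synthetic.ravel())
```

Restricting the seeds to the borderline set is what keeps the number of generated instances small in this sketch; a fuller version would additionally cluster the borderline points and choose which clusters to oversample, as the abstract describes.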
ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING

Suggestions

Adaptive Oversampling for Imbalanced Data Classification
Ertekin Bolelli, Şeyda (2013-01-01)
Data imbalance is known to significantly hinder the generalization performance of supervised learning algorithms. A common strategy to overcome this challenge is synthetic oversampling, where synthetic minority class examples are generated to balance the distribution between the examples of the majority and minority classes. We present a novel adaptive oversampling algorithm, Virtual, that combines the benefits of oversampling and active learning. Unlike traditional resampling methods which require preproce...
A Methodology to Implement Box-Cox Transformation When No Covariate is Available
Dag, Osman; Asar, Ozgur; İlk Dağ, Özlem (2014-01-01)
Box-Cox transformation is one of the most commonly used methodologies when data do not follow a normal distribution. However, its use is restricted since it usually requires the availability of covariates. In this article, the use of a non-informative auxiliary variable is proposed for the implementation of Box-Cox transformation. Simulation studies are conducted to illustrate that the proposed approach is successful in attaining normality under different sample sizes and most of the distributions and in esti...
Adapting a Robust Model into Hybrid Implementations of Machine Learning Algorithms and Statistical Methods for Longitudinal Data
Erduran, İbrahim Hakkı; Gökalp Yavuz, Fulya; Ebegil, Meral; Department of Statistics (2021-9)
Data structures in which the same characteristics are measured repeatedly at different time points are counted among the longitudinal data types. These datasets require the use of advanced modeling techniques because of the dependency structure amongst replicates. The linear mixed model (LMM) is an advanced regression method used in the analysis of such datasets. Although the LMM method provides considerable flexibility and many advantages, the model setup is based on a number of assumptions that are challenging to provid...
Improving the scalability of ILP-based multi-relational concept discovery system through parallelization
Mutlu, Ayşe Ceyda; Karagöz, Pınar; Kavurucu, Yusuf (2012-03-01)
Due to the increase in the amount of relational data that is being collected and the limitations of propositional problem definition in relational domains, multi-relational data mining has arisen to be able to extract patterns from relational data. In order to cope with the intractably large search space and still be able to generate high-quality patterns, ILP-based multi-relational data mining and concept discovery systems employ several search strategies and pattern limitations. Another direction to cope w...
Independently weighted value difference metric
Ortakaya, Ahmet Fatih (2017-10-01)
The majority of the difference metrics used in categorical classification algorithms do not take the dependence structure among attributes into account. Some of these metrics even make strong assumptions about attribute independence that are not realistic for many real-world datasets. In addition, these metrics do not consider the importance of attributes to the class variable. In this paper, a new difference metric is proposed, named the Independently Weighted Value Difference Metric (IWVDM). IWVDM includes a...
Citation Formats
H. Al Majzoub, I. Elgedawy, O. Akaydin, and M. K. Ulukok, “HCAB-SMOTE: A Hybrid Clustered Affinitive Borderline SMOTE Approach for Imbalanced Data Binary Classification,” ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, pp. 3205–3222, 2020, Accessed: 00, 2020. [Online]. Available: https://hdl.handle.net/11511/67770.