A Similarity Based Oversampling Method for Multi-Label Imbalanced Text Data

2022-9-1
Karaman, İsmail Hakkı
Although the amount of available data keeps growing, labeled data for Machine Learning projects remains hard to find because labeling is costly and labor-intensive. In addition, most Machine Learning projects, especially multi-label classification problems, suffer from data imbalance; some classes do not even have enough data to train a classifier. In this study, an oversampling method for multi-label text classification problems is developed and studied to address the performance problems that arise from data imbalance. The proposed method finds new samples in unlabeled data by exploiting similarities between instances. For a given class, it retrieves similar instances from the unlabeled set and checks whether they improve classifier performance. The unlabeled set is searched iteratively, and the instances that improve performance are added to the labeled set. The experiments show that our method works well and that classifier performance improves after oversampling.
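The iterative search described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not name a text representation, a similarity measure, a classifier, or an evaluation metric, so the bag-of-words vectors, cosine similarity to a class centroid, the threshold-based stand-in classifier, micro-averaged F1, and the hypothetical parameters `threshold` and `top_k` are all assumptions made for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    # Assumption: plain bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(labeled, label):
    # Summed term counts of all labeled texts that carry `label`.
    total = Counter()
    for text, labels in labeled:
        if label in labels:
            total.update(vectorize(text))
    return total

def predict(labeled, text, threshold=0.3):
    # Stand-in multi-label classifier (assumption): assign every label
    # whose class centroid is at least `threshold` similar to the text.
    vec = vectorize(text)
    all_labels = {l for _, labels in labeled for l in labels}
    return {l for l in all_labels if cosine(vec, centroid(labeled, l)) >= threshold}

def micro_f1(labeled, validation):
    # Micro-averaged F1 over a held-out validation set (assumed metric).
    tp = fp = fn = 0
    for text, gold in validation:
        pred = predict(labeled, text)
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def oversample(labeled, unlabeled, validation, minority, top_k=2):
    # Repeatedly pull the unlabeled texts most similar to the minority
    # class and keep only those that raise the validation score.
    labeled, pool = list(labeled), list(unlabeled)
    best = micro_f1(labeled, validation)
    improved = True
    while improved and pool:
        improved = False
        target = centroid(labeled, minority)
        candidates = sorted(
            pool, key=lambda t: cosine(vectorize(t), target), reverse=True
        )[:top_k]
        for text in candidates:
            trial = labeled + [(text, {minority})]
            score = micro_f1(trial, validation)
            if score > best:  # keep the instance only if it helps
                labeled, best = trial, score
                pool.remove(text)
                improved = True
    return labeled, best
```

Here `labeled` and `validation` are lists of `(text, set_of_labels)` pairs and `unlabeled` is a list of strings. A call such as `oversample(labeled, unlabeled, validation, minority="sports")` returns the enlarged labeled set together with the final validation score. Accepting a candidate only when the score strictly improves keeps dissimilar or harmful instances out of the labeled set, which mirrors the "checks for improvement" step in the abstract.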

Suggestions

A clustering method for web data with multi-type interrelated components
Bolelli, Levent; Ertekin Bolelli, Şeyda; Zhou, Ding; Giles, C Lee (2007-05-08)
Traditional clustering algorithms work on "flat" data, making the assumption that the data instances can only be represented by a set of homogeneous and uniform features. Much real-world data, however, is heterogeneous in nature, comprising multiple types of interrelated components. We present a clustering algorithm, K-SVMeans, that integrates the well-known K-Means clustering with the highly popular Support Vector Machines (SVM) in order to utilize the richness of data. Our experimental results on author...
A Hybrid Approach for Process Mining Using From to Chart Arranged by Genetic Algorithms LNCS San Sebastian Spain June 2010
Esgin, Eren; Karagöz, Pınar (2010-06-18)
In the scope of this study, a hybrid data analysis methodology for business process modeling is proposed, in which the From-to Chart, which is basically used as the front-end to figure out the observed patterns among the activities in realistic event logs, is rearranged by Genetic Algorithms to convert these derived raw relations into an activity sequence. According to experimental results, acceptably good (sub-optimal or optimal) solutions are obtained for relatively complex business processes at a reaso...
A new framework of multi-objective evolutionary algorithms for feature selection and multi-label classification of video data
Karagoz, Gizem Nur; Yazıcı, Adnan; Dokeroglu, Tansel; Coşar, Ahmet (2020-06-01)
Few studies in the literature address multi-objective multi-label feature selection for the classification of video data using evolutionary algorithms. Selecting the most appropriate subset of features while maintaining or improving the accuracy of the prediction results is a significant problem. This study proposes a framework of parallel multi-objective Non-dominated Sorting Genetic Algorithms (NSGA-II) for exploring a Pareto set of non-dominated solutions. The subsets of non-dominated featu...
A deep learning approach for the transonic flow field predictions around airfoils
Duru, Cihat; Alemdar, Hande; Baran, Özgür Uğraş (2022-01-01)
Learning from data offers new opportunities for developing computational methods in research fields, such as fluid dynamics, which constantly accumulate a large amount of data. This study presents a deep learning approach for the transonic flow field predictions around airfoils. The physics of transonic flow is integrated into the neural network model by utilizing Reynolds-averaged Navier–Stokes (RANS) simulations. A detailed investigation on the performance of the model is made both qualitatively and quant...
A Hybrid Computational Method based on Convex Optimization for Outlier Problems
Yerlikaya Ozkurt, Fatma; Askan Gündoğan, Ayşegül; Weber, Gerhard Wilhelm (2015-11-01)
Statistical modeling plays a central role for any prediction problem of interest. However, predictive models may give misleading results when the data contain outliers. In many applications, it is important to identify and treat the outliers without direct elimination. To handle such issues, a hybrid computational method based on conic quadratic programming is introduced and employed on earthquake ground motion data set. Results are compared against widely-used ground motion prediction models.
Citation Formats
İ. H. Karaman, “A Similarity Based Oversampling Method for Multi-Label Imbalanced Text Data,” M.S. - Master of Science, Middle East Technical University, 2022.