A Similarity Based Oversampling Method for Multi-Label Imbalanced Text Data

Karaman, İsmail Hakkı
Although the amount of available data keeps growing, labeled data for Machine Learning projects remains hard to find because labeling is costly and labor-intensive. In addition, most Machine Learning projects, especially multi-label classification problems, struggle with data imbalance. In these problems, some classes do not even have enough data to train a classifier. In this study, an oversampling method for multi-label text classification problems is developed and studied to address the performance problems arising from data imbalance. The proposed method finds new samples in unlabeled data by utilizing similarities between instances. For a given class, it finds similar instances in the unlabeled set and checks whether these instances improve classifier performance. The unlabeled set is searched iteratively, and the instances that improve performance are added to the labeled set. The experiments show that the method works well and that the performance of the classifier improves after oversampling.
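The loop described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not name the similarity measure, the classifier, or the acceptance rule, so the bag-of-words cosine similarity, the `threshold` parameter, and the `score_fn` callback below are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def oversample(labeled, unlabeled, target, score_fn, threshold=0.6):
    """Grow `labeled` with unlabeled texts that resemble class `target`.

    labeled:   list of (text, set_of_labels) pairs
    unlabeled: list of texts
    score_fn:  hypothetical callback (an assumption, not from the source)
               that trains and evaluates a classifier on a labeled list
               and returns a score, higher being better
    """
    labeled, pool = list(labeled), list(unlabeled)
    best = score_fn(labeled)
    changed = True
    while changed:  # search the unlabeled pool iteratively
        changed = False
        # Current members of the target class act as similarity seeds.
        seeds = [bow(t) for t, labels in labeled if target in labels]
        for text in list(pool):
            sim = max((cosine(bow(text), s) for s in seeds), default=0.0)
            if sim < threshold:
                continue  # not similar enough to the target class
            trial = labeled + [(text, {target})]
            score = score_fn(trial)
            if score > best:  # keep only instances that improve the score
                labeled, best = trial, score
                pool.remove(text)
                changed = True
    return labeled
```

In practice `score_fn` would wrap training a multi-label classifier and measuring it on a held-out set; passing it as a callback keeps the sketch independent of any particular model or metric.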


A clustering method for web data with multi-type interrelated components
Bolelli, Levent; Ertekin Bolelli, Şeyda; Zhou, Ding; Giles, C Lee (2007-05-08)
Traditional clustering algorithms work on "flat" data, making the assumption that the data instances can only be represented by a set of homogeneous and uniform features. Much real-world data, however, is heterogeneous in nature, comprising multiple types of interrelated components. We present a clustering algorithm, K-SVMeans, that integrates the well-known K-Means clustering with the highly popular Support Vector Machines (SVM) in order to utilize the richness of data. Our experimental results on author...
A Hybrid Approach for Process Mining Using From to Chart Arranged by Genetic Algorithms LNCS San Sebastian Spain June 2010
Esgin, Eren; Karagöz, Pınar (2010-06-18)
In this study, a hybrid data analysis methodology for business process modeling is proposed: the From-to Chart, which is used as the front end to identify the observed patterns among the activities in realistic event logs, is rearranged by Genetic Algorithms to convert these derived raw relations into an activity sequence. According to experimental results, acceptably good (sub-optimal or optimal) solutions are obtained for relatively complex business processes at a reaso...
A deep learning approach for the transonic flow field predictions around airfoils
Duru, Cihat; Alemdar, Hande; Baran, Özgür Uğraş (2022-01-01)
Learning from data offers new opportunities for developing computational methods in research fields, such as fluid dynamics, which constantly accumulate a large amount of data. This study presents a deep learning approach for the transonic flow field predictions around airfoils. The physics of transonic flow is integrated into the neural network model by utilizing Reynolds-averaged Navier–Stokes (RANS) simulations. A detailed investigation on the performance of the model is made both qualitatively and quant...
A Graph-Based Concept Discovery Method for n-Ary Relations
Abay, Nazmiye Ceren; MUTLU, ALEV; Karagöz, Pınar (2015-09-04)
Concept discovery is a multi-relational data mining task for inducing definitions of a specific relation in terms of other relations in the data set. Such learning tasks usually have to deal with large search spaces and hence have efficiency and scalability issues. In this paper, we present a hybrid approach that combines association rule mining methods and graph-based approaches to cope with these issues. The proposed method inputs the data in relational format, converts it into a graph representation, and...
A Hybrid Computational Method based on Convex Optimization for Outlier Problems
Yerlikaya Ozkurt, Fatma; Askan Gündoğan, Ayşegül; Weber, Gerhard Wiehelm (2015-11-01)
Statistical modeling plays a central role for any prediction problem of interest. However, predictive models may give misleading results when the data contain outliers. In many applications, it is important to identify and treat the outliers without direct elimination. To handle such issues, a hybrid computational method based on conic quadratic programming is introduced and employed on earthquake ground motion data set. Results are compared against widely-used ground motion prediction models.
Citation Formats
İ. H. Karaman, “A Similarity Based Oversampling Method for Multi-Label Imbalanced Text Data,” M.S. - Master of Science, Middle East Technical University, 2022.