Mixed integer programming and heuristics approaches for clustering with cluster-based feature selection

2019
Önen Öz, Sen
Cluster analysis seeks hidden similarities among unlabeled data points so that similar points are placed in the same group and dissimilar points in separate groups. As the dimensionality of a data set grows, the data become harder to interpret and an algorithm's ability to find informative clusters declines. The most common remedy for high-dimensional data is to identify the relevant features and cluster on those. However, selecting or deselecting features globally assumes that every selected feature is equally relevant to all clusters. In this study, it is assumed that the features used in clustering may differ for each cluster. The number of clusters and the number of relevant features in each cluster are given in advance. Using a center-based clustering approach, identifying the cluster centers, assigning data points to clusters, and selecting the relevant features for each cluster are performed simultaneously. A mixed integer mathematical model is proposed that minimizes the total distance between data points and their cluster centers, measured over the features selected for each cluster. Since the proposed model is nonlinear, it is linearized in several ways, and the resulting mathematical models are used to solve the problem. In addition to these models, a Benders decomposition solution method is implemented for the problem. Two heuristic algorithms that exploit the structure of the problem are also developed. The proposed mathematical models and heuristic algorithms are tested on several data sets of different problem sizes, in terms of the number of clusters, the number of relevant features, and the number of data points.
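To make the problem statement concrete, the following is a minimal Python sketch of the objective described above, together with a simple alternating heuristic. It is not one of the heuristics developed in the thesis, whose details the abstract does not give. The function names `objective` and `alternating_heuristic`, the use of squared Euclidean distance, and the single `n_features` value shared by all clusters are assumptions made for illustration only.

```python
import numpy as np


def objective(X, centers, features, labels):
    """Total squared distance from each point to its cluster center,
    measured only over the features selected for that cluster."""
    total = 0.0
    for k, feats in enumerate(features):
        members = X[labels == k]
        total += ((members[:, feats] - centers[k, feats]) ** 2).sum()
    return float(total)


def alternating_heuristic(X, n_clusters, n_features, n_iter=50, seed=0):
    """Illustrative k-means-style loop (an assumption, not the thesis method):
    alternate between assigning points, updating centers, and re-selecting
    the n_features lowest-spread features of each cluster."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Start from randomly chosen data points and random feature subsets.
    centers = X[rng.choice(n, size=n_clusters, replace=False)].copy()
    features = [np.sort(rng.choice(d, size=n_features, replace=False))
                for _ in range(n_clusters)]
    labels = np.full(n, -1)

    for _ in range(n_iter):
        # Assignment step: distance to each center over that cluster's features.
        dist = np.stack(
            [((X[:, f] - centers[k, f]) ** 2).sum(axis=1)
             for k, f in enumerate(features)],
            axis=1,
        )
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so nothing else would change
        labels = new_labels

        for k in range(n_clusters):
            members = X[labels == k]
            if len(members) == 0:
                continue  # keep the previous center and features
            # Center step: the mean minimizes squared distance per feature.
            centers[k] = members.mean(axis=0)
            # Feature step: keep the features with the smallest within-cluster spread.
            spread = ((members - centers[k]) ** 2).sum(axis=0)
            features[k] = np.sort(np.argsort(spread)[:n_features])

    return centers, features, labels
```

Because every cluster uses the same number of features in this sketch, distances to different centers are directly comparable in the assignment step. Like any local search of this kind, it can stop at a poor local optimum, which is one reason an exact mixed integer formulation is of interest.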

Suggestions

A memetic algorithm for clustering with cluster based feature selection
Şener, İlyas Alper; İyigün, Cem; Department of Operational Research (2022-8)
Clustering is a well-known unsupervised learning method that aims to group similar data points and separate dissimilar ones. Data sets subject to clustering are mostly high dimensional, and these dimensions include both relevant and redundant features. Therefore, selecting the relevant features is a significant problem in obtaining successful clusters. In this study, it is considered that the relevant features can vary from cluster to cluster, as each cluster in a data set is grouped by a different set of fe...
Mixed integer programming and heuristics approaches for clustering with cluster-based feature selection
İyigün, Cem (2019-10-20)
In this study, we work on a clustering problem where it is assumed that the features identifying the clusters may differ for each cluster. The number of clusters and the number of relevant features in each cluster are given in advance. A center-based clustering approach is proposed. Finding the cluster centers, assigning the data points, and selecting relevant features for each cluster are performed simultaneously. A nonlinear mixed integer mathematical model is proposed which minimizes the total distance between da...
Temporal clustering of time series via threshold autoregressive models: application to commodity prices
Aslan, Sipan; Yozgatlıgil, Ceylan; İyigün, Cem (2018-01-01)
The primary aim in this study is grouping time series according to the similarity between their data generating mechanisms (DGMs) rather than comparing pattern similarities in the time series trajectories. The approximation to the DGM of each series is accomplished by fitting the linear autoregressive and the non-linear threshold autoregressive models, and outputs of the estimates are used for feature extraction. Threshold autoregressive models are recognized for their ability to represent nonlinear feature...
A marginalized multilevel model for bivariate longitudinal binary data
Inan, Gul; İlk Dağ, Özlem (Springer Science and Business Media LLC, 2019-06-01)
This study considers analysis of bivariate longitudinal binary data. We propose a model based on marginalized multilevel model framework. The proposed model consists of two levels such that the first level associates the marginal mean of responses with covariates through a logistic regression model and the second level includes subject/time specific random intercepts within a probit regression model. The covariance matrix of multiple correlated time-specific random intercepts for each subject is assumed to ...
Parallel computing in linear mixed models
Gökalp Yavuz, Fulya (Springer Science and Business Media LLC, 2020-09-01)
In this study, we propose a parallel programming method for linear mixed models (LMM) generated from big data. A commonly used algorithm, expectation maximization (EM), is preferred for its use of maximum likelihood estimations, as the estimations are stable and simple. However, EM has a high computational cost. In our proposed method, we use a divide-and-recombine approach to split the data into smaller subsets, running the algorithm steps in parallel on multiple local cores and combining the results. The proposed me...
Citation Formats
S. Önen Öz, “Mixed integer programming and heuristics approaches for clustering with cluster-based feature selection,” Thesis (M.S.) -- Graduate School of Natural and Applied Sciences. Industrial Engineering., Middle East Technical University, 2019.