Kernel probabilistic distance clustering algorithms

2022-7
Özkan, Dilay
Clustering is an unsupervised learning method that groups data according to the similarities between objects (data points). Probabilistic Distance Clustering (PDC) is a soft clustering approach built on a small set of principles that relate an object's cluster membership probabilities to its distances from the clusters. Instead of assigning each object directly to a single cluster, it assigns the object to every cluster with a membership probability. PDC is a simple yet effective clustering algorithm that performs well on spherical and linearly separable data sets. Like other traditional clustering algorithms, however, PDC fails when the data set is non-spherical or not linearly separable. The kernel method overcomes this problem by implicitly mapping the data into a higher-dimensional space via a nonlinear transformation, where the data may be linearly separable. This study applies the principles of PDC to develop kernel-based clustering algorithms for non-spherical or non-linearly separable data sets, and proposes three kernel-based PDC algorithms. In addition, unlike the classical approach, the Mahalanobis distance is also considered in kernel clustering, and a new kernel-based Mahalanobis distance is developed for use in soft kernel clustering techniques. An experimental study on real and synthetic data sets measures the performance of the proposed kernel-based PDC algorithms.
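The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not one of the three algorithms proposed in the thesis. It assumes the PDC principle as it is usually stated (for each point, membership probability times distance is the same for every cluster), an RBF kernel, and the standard kernel-trick expansion of the feature-space distance to a weighted center. The function names, the `gamma` parameter, and the `P**2 / D` weight update are assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel matrix between rows of X and rows of Y (gamma is an assumed parameter)."""
    sq = (X**2).sum(axis=1)[:, None] - 2.0 * X @ Y.T + (Y**2).sum(axis=1)[None, :]
    return np.exp(-gamma * np.maximum(sq, 0.0))

def pdc_memberships(D, eps=1e-12):
    """Membership probabilities from an (n, k) distance matrix.

    Uses the rule p[i, k] * D[i, k] = constant over k, which gives
    p[i, k] proportional to 1 / D[i, k], normalized over the clusters.
    """
    inv = 1.0 / np.maximum(D, eps)
    return inv / inv.sum(axis=1, keepdims=True)

def kernel_distances(K, W):
    """Feature-space distances from every mapped point to every weighted center.

    K: (n, n) kernel matrix. W: (n, k) nonnegative weights; column k defines
    the center c_k = sum_i W[i, k] * phi(x_i) / sum_i W[i, k].
    ||phi(x_i) - c_k||^2 = K_ii - 2 * (K A)_ik + A_k^T K A_k, with A = normalized W.
    """
    A = W / W.sum(axis=0, keepdims=True)
    cross = K @ A                      # inner products <phi(x_i), c_k>
    center_norms = (A * cross).sum(axis=0)  # squared norms ||c_k||^2
    sq = np.diag(K)[:, None] - 2.0 * cross + center_norms[None, :]
    return np.sqrt(np.maximum(sq, 0.0))

def kernel_pdc_sketch(K, n_clusters, n_iter=20, seed=0):
    """Illustrative alternating loop (an assumption, not the thesis's method)."""
    rng = np.random.default_rng(seed)
    W = rng.random((K.shape[0], n_clusters))  # random initial center weights
    for _ in range(n_iter):
        D = np.maximum(kernel_distances(K, W), 1e-12)
        P = pdc_memberships(D)
        W = P**2 / D                   # assumed PDC-style center weights
    return P

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 2))
    P = kernel_pdc_sketch(rbf_kernel(X, X, gamma=0.5), n_clusters=2)
    print(P[:5])  # soft memberships; each row sums to 1
```

With a linear kernel (`K = X @ X.T`), `kernel_distances` reduces to ordinary Euclidean distances to the weighted means, which is a convenient sanity check for the expansion.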

Suggestions

K-median clustering algorithms for time series data
Gökçem, Yiğit; İyigün, Cem; Department of Industrial Engineering (2021-3-10)
Clustering is an unsupervised learning method, that groups the unlabeled data for gathering valuable information. Clustering can be applied on various types of data. In this study, we have focused on time series clustering. When the studies about time series clustering are reviewed in the literature, for the time series data, the centers of the formed clusters are selected from the existing time series samples in the clusters. In this study, we have changed that view and have proposed clustering algorithms based...
Temporal clustering of time series via threshold autoregressive models: application to commodity prices
Aslan, Sipan; Yozgatlıgil, Ceylan; İyigün, Cem (2018-01-01)
The primary aim in this study is grouping time series according to the similarity between their data generating mechanisms (DGMs) rather than comparing pattern similarities in the time series trajectories. The approximation to the DGM of each series is accomplished by fitting the linear autoregressive and the non-linear threshold autoregressive models, and outputs of the estimates are used for feature extraction. Threshold autoregressive models are recognized for their ability to represent nonlinear feature...
MODELLING OF KERNEL MACHINES BY INFINITE AND SEMI-INFINITE PROGRAMMING
Ozogur-Akyuz, S.; Weber, Gerhard Wilhelm (2009-06-03)
In Machine Learning (ML) algorithms, one of the crucial issues is the representation of the data. As the data become heterogeneous and large-scale, single kernel methods become insufficient to classify nonlinear data. The finite combinations of kernels are limited up to a finite choice. In order to overcome this discrepancy, we propose a novel method of "infinite" kernel combinations for learning problems with the help of infinite and semi-infinite programming regarding all elements in kernel space. Looking...
Similarity matrix framework for data from union of subspaces
Aldroubi, Akram; Sekmen, Ali; Koku, Ahmet Buğra; Cakmak, Ahmet Faruk (2018-09-01)
This paper presents a framework for finding similarity matrices for the segmentation of data $W = [w_1 \cdots w_N] \subset \mathbb{R}^D$ drawn from a union $U = \bigcup_{i=1}^{M} S_i$ of independent subspaces $\{S_i\}_{i=1}^{M}$ of dimensions $\{d_i\}_{i=1}^{M}$. It is shown that any factorization of $W = BP$, where columns of $B$ form a basis for data $W$ and they also come from $U$, can be used to produce a similarity matrix $\Xi_W$. In other words, $\Xi_W(i, j) \neq 0$ when the columns $w_i$ and $w_j$ of $W$ come from the same subspace, ...
A memetic algorithm for clustering with cluster based feature selection
Şener, İlyas Alper; İyigün, Cem; Department of Operational Research (2022-8)
Clustering is a well known unsupervised learning method which aims to group the similar data points and separate the dissimilar ones. Data sets that are subject to clustering are mostly high dimensional and these dimensions include relevant and redundant features. Therefore, selection of related features is a significant problem to obtain successful clusters. In this study, it is considered that relevant features for each cluster can be varied as each cluster in a data set is grouped by different set of fe...
Citation Formats
D. Özkan, “Kernel probabilistic distance clustering algorithms,” M.S. - Master of Science, Middle East Technical University, 2022.