Clustering of manifold-modeled data based on tangent space variations

2017
Gökdoğan, Gökhan
Understanding and analyzing data collections for clustering and classification applications has been an important research topic in recent years. In many data analysis problems, the data sets at hand have an intrinsically low-dimensional structure and admit a manifold model. Most state-of-the-art clustering methods developed for data with nonlinear, low-dimensional structure are based on local linearity assumptions. However, clustering algorithms based on locally linear representations can tolerate difficult sampling conditions only to some extent, and may fail for sparsely sampled data manifolds or in high-curvature regions. In this thesis, we consider a setting where each cluster is concentrated around a manifold and propose a manifold clustering algorithm that relies on the observation that the variation of the tangent space must be consistent along curves over the same data manifold. We argue that the nonlinear geometric structure of manifold-modeled data sets can be better handled by taking into account the global data geometry via the change in the tangent space over the whole manifold. We first theoretically characterize some properties of manifolds of bounded curvature. We then use these observations to develop a geometry-based clustering approach. Finally, we evaluate the performance of the presented method with experiments on synthetic and real data sets. The results show that, on some types of data sets, the proposed method outperforms the compared manifold clustering algorithms based on Euclidean distance, geodesic distance, and sparse representations. Our study suggests that geometry-based dissimilarity measures can provide promising tools for the clustering of intrinsically low-dimensional data sets.
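As a rough illustration of the quantity the abstract refers to, and not the algorithm developed in the thesis, the sketch below estimates tangent spaces by local PCA and measures how much the tangent space changes between two sample points using the largest principal angle. The function names `local_tangent_basis` and `tangent_dissimilarity`, the neighborhood size `k`, and the intrinsic dimension `dim` are all assumptions made for this example.

```python
import numpy as np


def local_tangent_basis(points, index, k, dim):
    """Estimate an orthonormal tangent basis at points[index] via local PCA.

    Hypothetical helper: uses the k nearest neighbors (Euclidean distance)
    and keeps the top `dim` principal directions of that neighborhood.
    """
    dists = np.linalg.norm(points - points[index], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]  # position 0 is the point itself
    centered = points[neighbors] - points[neighbors].mean(axis=0)
    # Rows of vt are the principal directions, strongest first.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:dim].T  # shape: (ambient_dim, dim)


def tangent_dissimilarity(basis_a, basis_b):
    """Largest principal angle (in radians) between two tangent subspaces."""
    # Singular values of A^T B are the cosines of the principal angles.
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.arccos(np.clip(cosines.min(), -1.0, 1.0)))


# Example data (assumed): 200 evenly spaced samples on the unit circle.
angles = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])

t0 = local_tangent_basis(circle, 0, k=4, dim=1)
t1 = local_tangent_basis(circle, 1, k=4, dim=1)
t50 = local_tangent_basis(circle, 50, k=4, dim=1)

near = tangent_dissimilarity(t0, t1)   # adjacent samples: small change
far = tangent_dissimilarity(t0, t50)   # a quarter turn away: large change
```

On a smooth curve such as this circle, the tangent direction changes gradually between adjacent samples and accumulates over longer paths. A clustering method in the spirit of the abstract could combine such pairwise changes along paths in a neighborhood graph; how the thesis actually does this is not stated here.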

Suggestions

PROGRESSIVE CLUSTERING OF MANIFOLD-MODELED DATA BASED ON TANGENT SPACE VARIATIONS
Gokdogan, Gokhan; Vural, Elif (2017-09-28)
Understanding and analyzing manifold-modeled data for clustering and classification applications has been an important research topic in recent years. Most clustering methods developed for data with nonlinear, low-dimensional structure are based on local linearity assumptions. However, clustering algorithms based on locally linear representations can tolerate difficult sampling conditions only to some extent, and may fail for sparsely sampled data manifolds or in high-curvature regions. In this paper, w...
Cluster based model diagnostic for logistic regression
Tanju, Özge; Kalaylıoğlu Akyıldız, Zeynep Işıl; Department of Statistics (2016)
Model selection methods are commonly used to identify the best approximation that explains the data. Existing methods are generally based on information theory, such as the Akaike Information Criterion (AIC), corrected Akaike Information Criterion (AICc), Consistent Akaike Information Criterion (CAIC), and Bayesian Information Criterion (BIC). These criteria do not depend on any modeling purpose. In this thesis, we propose a new method for logistic regression model selection where the modeling purpose is c...
CLUSTER STABILITY ESTIMATION BASED ON A MINIMAL SPANNING TREES APPROACH
Volkovich, Zeev (Vladimir); Barzily, Zeev; Weber, Gerhard Wilhelm; Toledano-Kitai, Dvora (2009-06-03)
Among the areas of data and text mining employed today in science, economy, and technology, clustering theory serves as a preprocessing step in data analysis. However, many open questions still await theoretical and practical treatment; for example, the problem of determining the true number of clusters has not been satisfactorily solved. In the current paper, this problem is addressed by the cluster stability approach. For several possible numbers of clusters we estimate the stabil...
Multiobjective evolutionary feature subset selection algorithm for binary classification
Deniz Kızılöz, Firdevsi Ayça; Coşar, Ahmet; Dökeroğlu, Tansel; Department of Computer Engineering (2016)
This thesis investigates the performance of multiobjective feature subset selection (FSS) algorithms combined with state-of-the-art machine learning techniques for the binary classification problem. Recent studies try to improve the accuracy of classification by including all of the features in the dataset, neglecting to determine the best performing subset of features. However, for some problems, the number of features may reach thousands, which will cause too much computation power to be consumed during t...
Learning Multi-Modal Nonlinear Embeddings: Performance Bounds and an Algorithm
Kaya, Semih; Vural, Elif (2021-01-01)
While many approaches exist in the literature to learn low-dimensional representations for data collections in multiple modalities, the generalizability of multi-modal nonlinear embeddings to previously unseen data is a rather overlooked subject. In this work, we first present a theoretical analysis of learning multi-modal nonlinear embeddings in a supervised setting. Our performance bounds indicate that for successful generalization in multi-modal classification and retrieval problems, the regularity of th...
Citation Formats
G. Gökdoğan, “Clustering of manifold-modeled data based on tangent space variations,” M.S. - Master of Science, Middle East Technical University, 2017.