Dimension reduction for tree-structured data

2021-9
Durak, Erdinç
Statistical analysis of tree-structured data is one of the exciting research areas with expanding application areas. In classical data analysis, the data objects are points in the Euclidean space, whereas they are trees in the analysis of tree-structured data. The replacement of points in the Euclidean space with trees brings in additional complexity in data analysis and necessitates the use of dimension reduction techniques. In this study, we aim to develop dimension reduction techniques for tree-structured data where each tree is rooted and labeled. We consider two classical dimension reduction techniques; namely, the principal component analysis (PCA) and multidimensional scaling (MDS), and adapt them to tree-structured data. Unlike a previous study on the PCA for tree-structured data, the PCA techniques proposed in this thesis maximize a measure of variance. Computational experiments on randomly generated data and real life data show the superiority of the proposed PCA techniques over the existing one. In the literature, there is no study that performs MDS on tree-structured data to project them in the tree space. In this direction, we propose the first MDS method for tree-structured data, where the edges of the trees are considered as the dimensions. In the proposed MDS method, the aim is to keep the Hamming distances between pairs of trees proportionally similar. For this purpose, we propose a mixed-integer linear programming model which finds the edges to be kept in an optimal way and heuristic methods where the edges are greedily selected one by one. Computational experiments show that the proposed MDS methods are successful in keeping useful information as high clustering accuracy is achieved with only a fraction of the edges. To be able to perform the computational experiments in a systematic way, a random tree generator algorithm is developed. This algorithm is able to generate clusters of trees with different skewness and density parameters. By changing these parameters systematically, we are able to understand the strength and weaknesses of the proposed methods.

Suggestions

Mask Combination of Multi-Layer Graphs for Global Structure Inference
Bayram, Eda; Thanou, Dorina; Vural, Elif; Frossard, Pascal (2020-01-01)
Structure inference is an important task for network data processing and analysis in data science. In recent years, quite a few approaches have been developed to learn the graph structure underlying a set of observations captured in a data space. Although real-world data is often acquired in settings where relationships are influenced by a priori known rules, such domain knowledge is still not well exploited in structure inference problems. In this paper, we identify the structure of signals defined in a da...
Dimension reduction using global and local pattern information-based maximum margin criterion
Sakarya, Ufuk (2016-07-01)
Dimension reduction is an important research area in pattern recognition when dealing with high-dimensional data. In this paper, a novel supervised dimension reduction approach is introduced for classification. Advantages of using not only global pattern information but also local pattern information are examined in the maximum margin criterion framework. Experimental comparative results in object recognition, handwritten digit recognition, and hyperspectral image classification are presented. According to ...
PROGRESSIVE CLUSTERING OF MANIFOLD-MODELED DATA BASED ON TANGENT SPACE VARIATIONS
Gokdogan, Gokhan; Vural, Elif (2017-09-28)
An important research topic of the recent years has been to understand and analyze manifold-modeled data for clustering and classification applications. Most clustering methods developed for data of non-linear and low-dimensional structure are based on local linearity assumptions. However, clustering algorithms based on locally linear representations can tolerate difficult sampling conditions only to some extent, and may fail for scarcely sampled data manifolds or at high-curvature regions. In this paper, w...
Uncertainty models for vector based functional curves and assessing the reliability of G-Band
Kurtar, Ahmet Kürşat; Düzgün, H. Şebnem; Department of Geodetic and Geographical Information Technologies (2006)
This study is about uncertainty medelling for vector features in geographic information systems (GIS). It has mainly two objectives which are about the band models used for uncertainty modelling . The first one is the assessment of accuracy of GBand model, which is the latest and the most complex uncertainty handling model for vector features. Some simulations and tests are applied to test the reliability of accuracy of G-Band with comparing Chrisman’s epsilon band model, which is the most frequently used b...
Integration of geophysical - geological data using geographic information systems
Şirinyıldız, Tunç; Toprak, Vedat; Department of Geodetic and Geographical Information Technologies (2003)
This study attempts to integrate geophysical data with other spatial data using Geographic Information Systems (GIS). The study is carried out in a part of Galatean Volcanic Province, north of Ankara. Gravity, magnetic, topographic, rock type and volcanic eruption center data are the data layers used in the study. All data layers are converted to raster format with a grid spacing of 100 m. The first step in the analysis is the pair-wise analyses of all data layers. For the geophysical data, different layers...
Citation Formats
E. Durak, “Dimension reduction for tree-structured data,” M.S. - Master of Science, Middle East Technical University, 2021.