The Hubness Phenomenon in High-Dimensional Spaces

Mani, Priya
Vazquez, Marilyn
Metcalf-Burton, Jessica Ruth
Domeniconi, Carlotta
Fairbanks, Hillary
Bal Bozkurt, Gülce
Beer, Elizabeth
Tarı, Zehra Sibel
High-dimensional data analysis is often negatively affected by the curse of dimensionality. In high-dimensional spaces, data become extremely sparse and the distances between points become nearly indistinguishable. As a consequence, reliable density estimates and meaningful distance-based similarity measures cannot be obtained. This issue is particularly prevalent in clustering, which is commonly employed in exploratory data analysis. A further challenge in clustering high-dimensional data is that the data often lie in subspaces formed by combinations of dimensions, with different subspaces being relevant to different clusters. The hubness phenomenon is a recently discovered aspect of high-dimensional spaces. In intrinsically high-dimensional data, the distribution of neighbor occurrences (how often each point appears among the nearest neighbors of other points) becomes skewed: a few points, the hubs, have very high occurrence counts. Hubness becomes more pronounced as dimensionality increases. Hubs are also known to exhibit useful clustering properties and could be leveraged to mitigate the challenges of high-dimensional data analysis. In this chapter, we identify new geometric relationships between hubness, data density, and the distribution of data distances, as well as between hubness, subspaces, and the intrinsic dimensionality of data. In addition, we formulate several potential research directions for leveraging hubness in clustering and in subspace estimation.
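
The skew in neighbor occurrences described in the abstract can be illustrated with a small numerical sketch. It is not taken from the chapter: the function names `k_occurrences` and `skewness`, the choice of i.i.d. Gaussian data, Euclidean distance, and the values of `n`, `k`, and the two dimensionalities are all assumptions made for illustration. The sketch counts how often each point appears among the k nearest neighbors of the other points and reports the skewness of those counts.

```python
import numpy as np


def k_occurrences(X, k):
    """Count how often each point appears in the k-nearest-neighbor lists
    of the other points (Euclidean distance, brute force)."""
    n = X.shape[0]
    sq_norms = np.sum(X * X, axis=1)
    # Squared pairwise distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    # Indices of the k smallest distances in each row (order within the k is arbitrary)
    neighbors = np.argpartition(d2, k - 1, axis=1)[:, :k]
    return np.bincount(neighbors.ravel(), minlength=n)


def skewness(counts):
    """Standardized third moment of the occurrence counts."""
    c = counts.astype(float)
    centered = c - c.mean()
    return np.mean(centered ** 3) / np.std(c) ** 3


if __name__ == "__main__":
    rng = np.random.default_rng(0)  # fixed seed, illustrative only
    n, k = 1000, 5                  # assumed sample size and neighborhood size
    for dim in (3, 100):            # assumed low and high dimensionality
        X = rng.standard_normal((n, dim))
        counts = k_occurrences(X, k)
        print(f"d={dim}: max count={counts.max()}, skewness={skewness(counts):.2f}")
```

Under these assumptions the occurrence counts always average exactly k, so any hubs show up as a long right tail: a larger maximum count and a larger positive skewness in the higher-dimensional sample.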


Citation Formats
P. Mani et al., The Hubness Phenomenon in High-Dimensional Spaces. 2019, p. 45.