Cluster based model diagnostic for logistic regression

2016
Tanju, Özge
Model selection methods are commonly used to identify the model that best approximates the data. Existing methods are generally based on information theory, such as the Akaike Information Criterion (AIC), corrected Akaike Information Criterion (AICc), Consistent Akaike Information Criterion (CAIC), and Bayesian Information Criterion (BIC). These criteria do not depend on the purpose of the modeling. In this thesis, we propose a new method for logistic regression model selection where the modeling purpose is classification. The method is based on a measure of distance between two clusterings. Of the many clustering similarity measures in the literature, our model selection procedure uses the Jaccard index (Downton and Brennan, 1980) and the Fowlkes-Mallows index (Fowlkes and Mallows, 1983). The new approach is compared against the commonly used methods in an extensive simulation study covering many different realistic scenarios. The scenarios are divided into two groups according to the modeling purpose, and are further grouped according to whether or not the true model is among the candidate models. As true models, we consider linear and nonlinear logistic models, both nested and non-nested, and both random-effects and fixed-effects models. Simulation results show that the new method is highly promising. In addition to the new method, this thesis provides an extensive comparison of the current methods based on information criteria. Finally, cluster-based and information-based criteria are applied to a real data set to select a binary model.
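As an illustration of the two similarity measures named in the abstract, the following minimal sketch computes the pair-counting forms of the Jaccard index and the Fowlkes-Mallows index between two labelings of the same observations. The abstract does not state how the thesis builds the two clusterings, so treating the observed binary classes and a candidate model's predicted classes as the two partitions is an assumption made here for illustration only, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pair_counts(labels_a, labels_b):
    """Count pairs of points grouped together in both, only A, or only B."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    both = only_a = only_b = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
    return both, only_a, only_b


def jaccard_index(labels_a, labels_b):
    """Pairs together in both clusterings / pairs together in at least one."""
    both, only_a, only_b = pair_counts(labels_a, labels_b)
    denom = both + only_a + only_b
    # Convention chosen for this sketch: two all-singleton clusterings agree.
    return both / denom if denom else 1.0


def fowlkes_mallows_index(labels_a, labels_b):
    """Geometric mean of pairwise precision and pairwise recall."""
    both, only_a, only_b = pair_counts(labels_a, labels_b)
    denom = sqrt((both + only_a) * (both + only_b))
    # Convention chosen for this sketch: return 0 when either side has no pairs.
    return both / denom if denom else 0.0


# Assumed setup: observed classes versus classes predicted by one candidate model.
observed = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1]
print(jaccard_index(observed, predicted))
print(fowlkes_mallows_index(observed, predicted))
```

Both indices depend only on which observations are grouped together, not on the label values, so swapping the labels 0 and 1 in one labeling leaves them unchanged. Under the assumed setup, a candidate model whose predicted partition scores closer to 1 against the observed classes would be preferred.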

Suggestions

Clustering of manifold-modeled data based on tangent space variations
Gökdoğan, Gökhan; Vural, Elif; Department of Electrical and Electronics Engineering (2017)
An important research topic of recent years has been to understand and analyze data collections for clustering and classification applications. In many data analysis problems, the data sets at hand have an intrinsically low-dimensional structure and admit a manifold model. Most state-of-the-art clustering methods developed for data of non-linear and low-dimensional structure are based on local linearity assumptions. However, clustering algorithms based on locally linear representations can tolerate diff...
CMARS: a new contribution to nonparametric regression with multivariate adaptive regression splines supported by continuous optimization
Weber, Gerhard-Wilhelm; Batmaz, İnci; Köksal, Gülser; Taylan, Pakize; Yerlikaya-Ozkurt, Fatma (2012-01-01)
Regression analysis is a widely used statistical method for modelling relationships between variables. Multivariate adaptive regression splines (MARS) is especially useful for high-dimensional problems and for fitting nonlinear multivariate functions. A special advantage of MARS lies in its ability to estimate contributions of some basis functions so that both additive and interactive effects of the predictors are allowed to determine the response variable. The MARS method consists of two parts: forward an...
Consensus clustering of time series data
Yetere Kurşun, Ayça; Batmaz, İnci; İyigün, Cem; Department of Scientific Computing (2014)
In this study, we aim to develop a methodology that merges Dynamic Time Warping (DTW) and consensus clustering in a single algorithm. The most commonly used time series distance measures require the data to be of the same length, and the distance they measure between time series depends mostly on the similarity of each coinciding data pair in time. DTW is a relatively new measure used to compare two time-dependent sequences which may be out of phase or may not have the same lengths or frequencies. DTW aligns two time serie...
Estimation and hypothesis testing in stochastic regression
Sazak, Hakan Savaş; Tiku, Moti Lal; İslam, Qamarul; Department of Statistics (2003)
Regression analysis is very popular among researchers in various fields, but almost all of them use the classical methods, which assume that X is nonstochastic and the error is normally distributed. However, in real-life problems, X is generally stochastic and the error can be nonnormal. The maximum likelihood (ML) estimation technique, which is known to have optimal features, is very problematic in situations where the distribution of X (marginal part) or of the error (conditional part) is nonnormal. Modified maximum...
Micro-level analysis of unregistered employment in Turkey with group comparisons
İner, Mehmet; Akkaya, Ayşen D.; Department of Statistics (2019)
Comparing groups in logistic regression models in the same way as with OLS is misleading because of the unobserved heterogeneity in logistic regression. In this sense, this study focuses on the group comparison problem in logistic regression. In order to get to the root of the comparison problem in logistic regression, the theoretical background of logistic regression is explained with the latent propensity interpretation in which the extent of the dependent variable’s closeness to success is taken in...
Citation Formats
Ö. Tanju, “Cluster based model diagnostic for logistic regression,” M.S. - Master of Science, Middle East Technical University, 2016.