Variable Selection and Classification for Longitudinal Binary Data Through Three-Step Sparse Boosting

Date

2022-6

Author

Emer, Deniz Esin

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

With the rapid evolution of technology, it is now possible to obtain the expression levels of thousands of genes in a single experiment. In these experiments, the sample size is relatively small while the number of covariates under consideration is extremely large, and only a small number of expressions may be related to the outcome of interest. Hence, the selection of causal features is much needed along with model estimation. In this thesis, we propose a three-step sparse boosting model for detecting the most important covariates that classify individuals into binary groups, for longitudinal data with spatial and temporal correlations. Following the idea of Yue, Li and Cheng (2019), in the first step the observations are assumed to be independent and the logistic regression coefficients are estimated by directly minimizing the binary cross-entropy loss using a boosting method. In the second step, the temporal correlation is taken into account by applying a weight matrix constructed from the errors made in the first step. Finally, in the third step, the spatial correlation is added via a weight matrix that reflects the correlation structure. A Monte Carlo simulation study was designed, and nine scenarios with different temporal and spatial correlation structures were run using parallel computing methods. The proposed model decreased the number of mistakenly chosen significant covariates while identifying all the true ones as significant. Classification performance improved as the spatial and temporal correlations increased.
In addition, a comparison study including the Boruta (RF), Support Vector Machine (SVM), Logistic Regression, Ridge Regression, Lasso Regression and Elastic Net algorithms showed that i) these methods selected a very large number of mistakenly chosen significant covariates, whereas our proposed algorithm selected at most one along with the true significant variables, and ii) the Three-Step Sparse Boosting algorithm performed best in terms of the specificity and precision metrics in the simulation study. The algorithm was also applied to a real-life data set from the Type 1 Diabetes Prediction and Prevention (DIPP) study, using both balanced and unbalanced sets. Our algorithm identified only a few genes as significant, which can be beneficial in terms of time and money. The comparison results showed that the Three-Step Sparse Boosting technique performs well in terms of variable selection, estimation and classification.
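
To make the first step described in the abstract more concrete, the following is a minimal sketch of componentwise gradient boosting for logistic regression under the independence assumption. It is not the thesis's implementation: it uses plain componentwise boosting with a fixed number of iterations rather than a sparse boosting selection criterion, and it omits the temporal and spatial weight matrices of the second and third steps. The function name `componentwise_logit_boost`, the parameters `n_iter` and `step`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-z))


def componentwise_logit_boost(X, y, n_iter=300, step=0.1):
    """Hypothetical sketch: componentwise boosting on the binary cross-entropy loss.

    At each iteration the negative gradient of the loss with respect to the
    linear predictor (y - p) is regressed on every single covariate, and only
    the best-fitting covariate's coefficient is updated by a small step.
    Covariates that are never picked keep a coefficient of exactly zero.
    """
    n, p = X.shape
    beta = np.zeros(p)
    intercept = 0.0
    col_norm2 = (X ** 2).sum(axis=0)  # squared norm of each column
    for _ in range(n_iter):
        prob = sigmoid(intercept + X @ beta)
        resid = y - prob                    # negative gradient of the loss
        intercept += step * resid.mean()    # small gradient step for the intercept
        coef = X.T @ resid / col_norm2      # univariate least-squares fits
        gain = coef ** 2 * col_norm2        # reduction in residual sum of squares
        j = int(np.argmax(gain))            # best single covariate this round
        beta[j] += step * coef[j]
    return intercept, beta


# Synthetic example (assumed data): 30 covariates, only the first 3 are relevant.
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 30))
true_beta = np.zeros(30)
true_beta[:3] = [2.0, -2.0, 1.5]
y = (rng.random(400) < sigmoid(X @ true_beta)).astype(float)

intercept, beta = componentwise_logit_boost(X, y)
top3 = set(np.argsort(-np.abs(beta))[:3].tolist())
print("covariates with the largest coefficients:", sorted(top3))
print("number of nonzero coefficients:", int(np.count_nonzero(beta)))
```

Because only one coefficient changes per iteration, stopping early acts as a form of variable selection. A sparse boosting variant would additionally penalize model complexity when choosing the covariate and the stopping iteration.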

Subject Keywords

Binary longitudinal data, Logistic regression, Spatial and temporal correlations, Sparse boosting, Variable selection, Classification

URI

https://hdl.handle.net/11511/98095

Collections

Graduate School of Natural and Applied Sciences, Thesis

Citation Formats

D. E. Emer, “Variable Selection and Classification for Longitudinal Binary Data Through Three-Step Sparse Boosting,” Ph.D. - Doctoral Program, Middle East Technical University, 2022.