Is it necessary to apply the outlier detection for protein-protein interaction data?

Ayyıldız, Ezgi
Purutçuoğlu Gazi, Vilda
Objective: Outlier detection is a crucial problem in many fields. Although there are many outlier detection methods in the literature, only a few are suitable for dependent, sparse and high-dimensional data structures. In this study, we apply various univariate and multivariate outlier detection methods as a pre-processing step before modeling protein-protein interaction networks in order to investigate whether outlier detection can improve the accuracy of the model. Material and Methods: Among the univariate approaches, we implement the z-score and box-plot methods, which are the best-known outlier detection approaches. In addition, we apply the multivariate outlier detection methods PCOut and Sign, which are based on robust principal component analysis, and the BACON method, which is a distance-based approach. These methods are applicable to data in which the number of variables is larger than the number of observations. In the analysis, we use several synthetic and real benchmark biological datasets. We then infer the networks with three network models, namely GGM, MARS and CMARS, and finally we check the validity of the models via the F-measure and accuracy measures. Results: The results show that applying outlier detection methods before modeling does not improve the performance of the models on our datasets. Conclusion: Based on the results obtained from different datasets, we suggest that protein-protein interaction networks can be estimated with the GGM, MARS and CMARS methods without the need for an outlier detection process.
Journal of Biostatistics-Turkish Clinics
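
The two univariate rules and the two validation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the cutoffs (|z| > 3 and 1.5 × IQR fences) are conventional defaults that the abstract does not state, and the multivariate methods (PCOut, Sign, BACON) are not covered.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds `threshold`.

    The cutoff of 3 is an assumed, conventional default.
    """
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0:
        return [False] * len(values)
    return [abs((v - mean) / sd) > threshold for v in values]


def boxplot_outliers(values, k=1.5):
    """Flag values outside the box-plot fences [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is the usual fence multiplier, assumed here.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def f_measure_and_accuracy(true_edges, predicted_edges):
    """Compare true and estimated network edges given as 0/1 sequences."""
    pairs = list(zip(true_edges, predicted_edges))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs)
    return f_measure, accuracy


# Made-up measurements for one variable; the last value is an obvious outlier.
sample = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.1, 0.9, 1.0, 1.2, 9.0]
z_flags = zscore_outliers(sample)
box_flags = boxplot_outliers(sample)

# Made-up flattened adjacency entries: true network vs. estimated network.
f_measure, accuracy = f_measure_and_accuracy([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```

In a pipeline like the one the abstract describes, flags of this kind would be computed per variable before the network is estimated, and the F-measure and accuracy would then be compared between the runs with and without the pre-processing step.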


CMARS: a new contribution to nonparametric regression with multivariate adaptive regression splines supported by continuous optimization
Weber, Gerhard-Wilhelm; Batmaz, İnci; Köksal, Gülser; Taylan, Pakize; Yerlikaya-Ozkurt, Fatma (2012-01-01)
Regression analysis is a widely used statistical method for modelling relationships between variables. Multivariate adaptive regression splines (MARS) especially is very useful for high-dimensional problems and fitting nonlinear multivariate functions. A special advantage of MARS lies in its ability to estimate contributions of some basis functions so that both additive and interactive effects of the predictors are allowed to determine the response variable. The MARS method consists of two parts: forward an...
A Comparative Study on Feature Selection based Improvement of Medium-Term Demand Forecast Accuracy
Ilseven, Engin; Göl, Murat (2019-01-01)
Use of the proper demand forecasting method and data set is very important for reliable system operation and planning. In this study, we compare the performance of various pairs of feature selection methods and forecasting algorithms in terms of forecast accuracy for the medium-term demand forecasting case. We utilize correlation, recursive feature elimination, random forest, multivariate adaptive regression splines (MARS), stepwise regression and genetic algorithms as feature selection methods. As for forecasting algorith...
Do Base Functional Component types affect the relationship between software functional size and effort?
Gencel, Cigdem; Buglione, Luigi (2007-11-07)
One of the most debated issues in Software Engineering is effort estimation, and one of the main questions is which data (and how much) from a historical database should be used in order to obtain reliable estimates. In many of these studies, software size (measured in either lines of code or functional size units) is the primary input. However, the relationship between effort and the components of functional size (BFC - Base Functional Components) has not yet been fully analyzed. This study e...
Batmaz, İnci; Kartal-Koc, Elcin; Köksal, Gülser (2010-02-04)
Multivariate Adaptive Regression Splines (MARS) is a very popular nonparametric regression method that is particularly useful for modeling nonlinear relationships that may exist among the variables. Recently, we developed the CMARS method as an alternative to the backward stepwise part of the MARS algorithm. Comparative studies have indicated that CMARS performs better than MARS for modeling nonlinear relationships. In those studies, however, only main and two-factor interaction effects were sufficient to model the nonline...
A computational approach to nonparametric regression: bootstrapping CMARS method
Yazici, Ceyda; Yerlikaya-Ozkurt, Fatma; Batmaz, İnci (2015-10-01)
Bootstrapping is a computer-intensive statistical method which treats the data set as a population and draws samples from it with replacement. This resampling method has wide application areas especially in mathematically intractable problems. In this study, it is used to obtain the empirical distributions of the parameters to determine whether they are statistically significant or not in a special case of nonparametric regression, conic multivariate adaptive regression splines (CMARS), a statistical machin...
Citation Formats
E. Ayyıldız and V. Purutçuoğlu Gazi, “Is it necessary to apply the outlier detection for protein-protein interaction data?,” Journal of Biostatistics-Turkish Clinics, pp. 173–186, 2018.