Privacy-preserving horizontal federated learning methodology through a novel boosting-based federated random forest algorithm

2023-01-04
Gençtürk, Mert
This thesis proposes Boosting-based Federated Random Forest (BOFRF), a novel federated ensemble classification algorithm for horizontally partitioned data. BOFRF increases the predictive power of all participating sites and yields a particularly large improvement for sites whose local models perform poorly. To this end, a federated version of random forest, a well-known bagging algorithm, is implemented by adapting the idea of boosting to it. In the integration step, a novel aggregation and weight calculation methodology assigns weights to local classifiers based on their classification performance at each site, rather than in proportion to sample size or site index, and does so without increasing communication or computation cost. To further increase the predictive power of the federated models, a personalized implementation is presented in which each participant fine-tunes the hyperparameters of BOFRF locally and arrives at a federated model that performs better on its own dataset. In addition, a clustered extension is proposed in which participants are clustered according to the similarities or differences in their data distributions before the algorithm is run. Finally, to prevent security breaches and increase the level of privacy, two implementations of BOFRF are proposed: a centralized implementation with a trusted third party and a decentralized implementation using the secure sum protocol. The performance of the proposed solution was evaluated in different federated environments set up using four healthcare datasets. The empirical results show that the BOFRF algorithm and its extensions improve the predictive power of local random forest models in all cases.
Unlike existing solutions, the proposed methodology provides a significantly high level of improvement for sites with unsuccessful local models.
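The abstract names the secure sum protocol for the decentralized implementation but does not describe how it is constructed. As a rough illustration only, the following Python sketch simulates one common ring-based variant, in which an initiating site hides the running total behind a random mask. The function names, the modulus, and the fixed-point scaling of performance scores are all assumptions made for this example and are not taken from the thesis.

```python
import secrets

# Hypothetical public modulus; it must exceed any total the sites can produce.
MODULUS = 2**61 - 1


def to_fixed(score, scale=10**6):
    """Encode a score in [0, 1] as a non-negative integer (assumed encoding)."""
    return round(score * scale)


def secure_sum(private_values, modulus=MODULUS):
    """Ring-based secure sum of non-negative integers, modulo `modulus`.

    All sites are simulated in one process. In a real deployment each loop
    iteration would run at a different site, and only `running` would be
    passed along the ring.
    """
    if not private_values:
        raise ValueError("need at least one site")
    # The initiating site draws a random mask and keeps it secret.
    mask = secrets.randbelow(modulus)
    running = mask
    for value in private_values:
        # Each site adds its private value to the masked running total,
        # so the value it receives reveals nothing about earlier sites.
        running = (running + value) % modulus
    # Back at the initiator: removing the mask reveals only the total.
    return (running - mask) % modulus


# Example: three sites contribute local scores without disclosing them.
site_scores = [0.81, 0.64, 0.92]
total = secure_sum([to_fixed(s) for s in site_scores])
```

This basic variant assumes honest-but-curious sites that do not collude: two neighbours of a site in the ring could otherwise recover that site's value by comparing what it received with what it sent.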

Citation Formats
M. Gençtürk, “Privacy-preserving horizontal federated learning methodology through a novel boosting-based federated random forest algorithm,” Ph.D. - Doctoral Program, Middle East Technical University, 2023.