Computing the Truck Factor in a Software Repository: A Machine Learning Approach

Date

2024-7

Author

El Cheikh Ammar, Ahmad

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

In every software engineering project, it is crucial to know which members play a key role in its progress, so that their departure does not halt the project's advancement. This is where the truck factor (also called the bus factor) comes into play: a metric that evaluates which developers would cause the development process to slow down should they be removed (or hit by a truck or bus). Measuring the truck factor in software development is complex because of the many variables involved. Several algorithms have been developed to address this, using data from version control systems, where developers "commit" changes and thereby record who changed what, when, and where; this gives algorithms that study the truck factor access to a large volume of data. Existing algorithms, however, tend to focus narrowly on code-centric metrics such as the commits made by a developer. While such a feature is important in assessing a developer's contribution, it does not tell the whole story behind that contribution. This thesis therefore examines which features the algorithms in the literature use and designs a feature set that covers coding-based metrics, collaborative behaviours, developer activity patterns, and the broader technological context of a project. Multiple supervised machine learning models based on different algorithms, including Random Forest and Naive Bayes, are then built on this feature set to predict the key contributors in GitHub repositories, ultimately yielding the truck factor. Random Forest with tuned hyperparameters and an aggregated model combining the tuned Random Forest with Naive Bayes with priors achieve the best performance, with mean F1-scores of 84% and 86%, respectively. These models outperform existing algorithms, with consistently high precision and recall across most repositories, demonstrating robust identification of true truck factor members.
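The evaluation described in the abstract (precision, recall, and a mean F1-score across repositories) can be illustrated with a minimal sketch. This is not the thesis's code: the repository names, developer names, and the helper `precision_recall_f1` are hypothetical, and treating the truck factor as the number of predicted key contributors is an assumption made here for illustration.

```python
from statistics import mean


def precision_recall_f1(predicted, actual):
    """Score a predicted set of key contributors against a reference set."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical data: repository -> (predicted key contributors, reference key contributors).
repos = {
    "repo_a": ({"alice", "bob"}, {"alice", "bob", "carol"}),
    "repo_b": ({"dave"}, {"dave"}),
}

# Per-repository (precision, recall, F1) triples.
scores = {name: precision_recall_f1(pred, ref) for name, (pred, ref) in repos.items()}

# Assumption: the truck factor is taken as the size of the predicted set.
truck_factors = {name: len(pred) for name, (pred, _) in repos.items()}

# Mean F1 across repositories, the summary statistic the abstract reports.
mean_f1 = mean(f1 for _, _, f1 in scores.values())
```

In this toy data, `repo_a` misses one reference contributor, so its recall falls below its precision, while `repo_b` is matched exactly; averaging the per-repository F1 values gives the kind of summary figure quoted above.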

Subject Keywords

Truck Factor, Bus Factor, Machine Learning, Software Repositories, Version Control System, Random Forest, Naive Bayes

URI

https://hdl.handle.net/11511/111566

Collections

Northern Cyprus Campus, Thesis

Citation Formats

A. El Cheikh Ammar, “Computing the Truck Factor in a Software Repository: A Machine Learning Approach,” M.S. - Master of Science, Middle East Technical University, 2024.