Income Classification Benchmark: From R (Academic Study) to Python (ML Pipeline)

2025-10-01
The primary aim of this research is to construct a robust machine learning pipeline for income classification, predicting whether an individual earns above $50K based on demographic attributes such as work class, education, race, and gender. Initially developed as a statistical study in RStudio to explore variable relationships and perform exploratory data analysis (EDA), the project has been refactored into a production-ready Python environment to demonstrate modern MLOps standards. The methodology is an end-to-end Scikit-Learn pipeline that incorporates data cleaning, K-Nearest Neighbors (KNN) imputation for missing values, and automated feature scaling. While the initial research explored a broad range of algorithms, the current benchmark compares Logistic Regression, Decision Tree, and Random Forest classifiers to establish a strong baseline. Model performance was assessed using Accuracy, Sensitivity, and F1-Score to account for categorical complexity. This dual-language approach highlights the transition from academic statistical inference to applied machine learning engineering.
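As a rough illustration of the kind of pipeline described above, the following minimal sketch chains KNN imputation, feature scaling, and one-hot encoding in Scikit-Learn, then compares the three classifiers on Accuracy, Sensitivity (recall), and F1-Score. It is not the project's actual code: the synthetic data, the column names (`age`, `hours_per_week`, `workclass`, `education`), the helper `make_preprocessor`, and all hyperparameters are assumptions chosen only to make the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the income data (assumption: the real dataset differs).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "age": rng.integers(18, 70, n).astype(float),
    "hours_per_week": rng.integers(10, 60, n).astype(float),
    "workclass": rng.choice(["Private", "Self-emp", "Gov"], n),
    "education": rng.choice(["HS-grad", "Bachelors", "Masters"], n),
})
# Hypothetical binary target standing in for "earns above $50K".
y = ((X["age"] + X["hours_per_week"] + rng.normal(0, 10, n)) > 85).astype(int)
# Knock out some numeric values so the imputer has something to fill.
X.loc[rng.choice(n, 40, replace=False), "age"] = np.nan

NUMERIC = ["age", "hours_per_week"]
CATEGORICAL = ["workclass", "education"]


def make_preprocessor():
    """Build a fresh preprocessing step: KNN-impute and scale numeric
    columns, one-hot encode categorical columns."""
    numeric = Pipeline([
        ("impute", KNNImputer(n_neighbors=5)),
        ("scale", StandardScaler()),
    ])
    return ColumnTransformer([
        ("num", numeric, NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])


models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

results = {}
for name, model in models.items():
    # Preprocessing lives inside the pipeline, so it is fit on training data only.
    pipe = Pipeline([("prep", make_preprocessor()), ("model", model)])
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "sensitivity": recall_score(y_test, pred),  # recall of the positive class
        "f1": f1_score(y_test, pred),
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Keeping imputation and scaling inside the `Pipeline` means those steps are learned from the training split alone, which avoids leaking test-set statistics into the benchmark.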
Citation Formats
M. A. Erkan, “Income Classification Benchmark: From R (Academic Study) to Python (ML Pipeline),” 2025, Accessed: 00, 2025. [Online]. Available: https://zenodo.org/records/17662766.