Multiobjective evolutionary feature subset selection algorithm for binary classification

Deniz Kızılöz, Firdevsi Ayça
This thesis investigates the performance of multiobjective feature subset selection (FSS) algorithms combined with the state-of-the-art machine learning techniques for binary classification problem. Recent studies try to improve the accuracy of classification by including all of the features in the dataset, neglecting to determine the best performing subset of features. However, for some problems, the number of features may reach thousands, which will cause too much computation power to be consumed during the feature evaluation and classification phases, also possibly reducing the accuracy of the results. Therefore, selecting the minimum number of features while preserving the accuracy of the results at a high level becomes an important issue for achieving fast and accurate binary classification. The multiobjective algorithms implemented in this thesis include two phases, selecting feature subsets and applying supervised/unsupervised machine learning techniques to these selected subsets. For the FSS part of the algorithms, first a brute-force approach is implemented. Since exhaustively investigating all of the feature subsets is unfeasible when the number of features is larger than 20, secondly, a greedy algorithm implemented to find good-enough feature subsets. Finally, in order to select the most appropriate feature subsets intelligently, a genetic algorithm is proposed at the FSS part of the algorithms. Crossover and mutation operators are used to improve a population of individuals (each representing a selected feature subset) and obtain (near-)optimal solutions through generations. At the second phase of the algorithms, the performance of the selected feature subsets is evaluated by using five different machine learning techniques: Logistic Regression, Support Vector Machines, Extreme Learning Machine, K-means, and Affinity Propagation. The best performing multiobjective evolutionary algorithm is selected after comprehensive experiments and compared with the state-of-the-art algorithms in literature; Particle Swarm Optimization, Greedy Search, Tabu Search, and Scatter Search. 11 different datasets, mostly obtained from the well-known machine learning data repository of University of California UCI Machine Learning Repository, are used for the performance evaluation of the implemented algorithms. Experimental results show that the classification accuracy increases significantly with the most suitable subset of features and also execution time reduces greatly after applying proposed algorithm on the datasets.