Rahman, Mehtab Ur
Voice disorders are a widespread issue affecting people of all ages, and accurate diagnosis is crucial for effective treatment. Recent advances in Artificial Intelligence based audio and speech processing have led to a focus on binary and multi-class classification of voice disorders. However, existing work has mostly focused on the binary (two-class) classification of voice disorders. Some researchers have also explored multi-class classification, but their results are not promising. The primary objective of this work was to enhance the performance of a machine learning-powered system for voice disorder diagnosis in multi-class classification. This research proposes a two-stage framework for the binary and multi-class classification of voice disorders. In the first stage, high-level feature embeddings are extracted from spectrograms of voice data using the pre-trained VGGish model. In the second stage, we employ four classifiers: Support Vector Machine (SVM), Logistic Regression (LR), Multi-layer Perceptron (MLP), and an Ensemble Classifier (EC). Separate experiments are conducted for male speakers, female speakers, and both combined. We evaluated our models on a subset of the Saarbruecken Voice Database (SVD). In binary classification, VGGish-SVM achieved the highest accuracy for male speakers, while VGGish-EC performed best for female speakers. In multi-class classification, VGGish-SVM outperformed the other models in terms of mean accuracy for both genders, but VGGish-EC demonstrated the best performance for minority classes. We conducted a comparative analysis against baseline methods, including mel frequency cepstral coefficient (MFCC) features, MFCC-glottal features, and features extracted with the wav2vec and HuBERT models, with SVM employed as the classifier in each case. The findings show that our approach outperforms these baselines in all classification tasks except the binary classification of healthy vs. disordered voices for female speakers.
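As an illustration of how an ensemble classifier in the second stage might combine the outputs of the individual classifiers, the sketch below uses hard (majority) voting. This is an assumption: the abstract does not state which combination rule the EC uses. The function name `majority_vote` and the example predictions are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def majority_vote(per_model_predictions):
    """Combine label predictions from several classifiers by hard voting.

    per_model_predictions: list of lists, one inner list per classifier,
    each holding one predicted label per utterance.
    Ties are broken in favour of the earliest classifier in the list.
    (Assumed combination rule; the source does not specify one.)
    """
    n_samples = len(per_model_predictions[0])
    if any(len(p) != n_samples for p in per_model_predictions):
        raise ValueError("all classifiers must predict the same number of samples")

    combined = []
    for votes in zip(*per_model_predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Pick the first label, in classifier order, that reaches the top count.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


# Hypothetical predictions from three classifiers for four utterances.
svm_pred = ["healthy", "disordered", "healthy", "disordered"]
lr_pred = ["healthy", "healthy", "healthy", "disordered"]
mlp_pred = ["disordered", "disordered", "healthy", "healthy"]

ensemble_pred = majority_vote([svm_pred, lr_pred, mlp_pred])
print(ensemble_pred)
```

In a real pipeline the three prediction lists would come from classifiers trained on the VGGish embeddings; here they are written by hand only to show the voting step.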
Additionally, we proposed a framework for the multi-class classification of voice disorders using OpenL3 embeddings. A pre-trained OpenL3 model is utilized to extract high-level embedding features from the mel spectrogram. Different classifiers are then evaluated after Neighbourhood Component Analysis (NCA) based feature selection. Random Forest (RF), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN) are utilized separately to classify the selected features. The evaluation and comparison are performed on a balanced subset of the SVD. Without any speech enhancement preprocessing, our best model, OpenL3-KNN, improves on the accuracy of existing work by 4.9% and on the F1 score by 8.7%.
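The final classification step of such a pipeline can be sketched with a plain K-Nearest Neighbors classifier. The example below is a minimal sketch under stated assumptions: it uses Euclidean distance, omits the NCA-based feature selection step entirely, and replaces real OpenL3 embeddings with hand-written two-dimensional toy vectors. The function `knn_predict` and the class labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training vectors (Euclidean distance)."""
    if not 1 <= k <= len(train_vectors):
        raise ValueError("k must be between 1 and the number of training vectors")

    # Sort training points by distance to the query.
    by_distance = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    # On a tie in vote counts, the label seen first (the nearest one) wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy 2-D vectors standing in for selected embedding features (hypothetical).
train_vectors = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.2),   # cluster for "healthy"
    (1.0, 1.1), (0.9, 1.0), (1.1, 0.9),   # cluster for "disorder_1"
    (2.0, 0.1), (1.9, 0.0), (2.1, 0.1),   # cluster for "disorder_2"
]
train_labels = ["healthy"] * 3 + ["disorder_1"] * 3 + ["disorder_2"] * 3

print(knn_predict(train_vectors, train_labels, (0.1, 0.1)))
print(knn_predict(train_vectors, train_labels, (1.0, 1.0)))
print(knn_predict(train_vectors, train_labels, (2.0, 0.0)))
```

Real embedding vectors have far more dimensions than this toy data, which is one reason a feature selection step such as NCA is applied before a distance-based classifier.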
M. U. Rahman, “MULTI-CLASS CLASSIFICATION OF VOICE DISORDERS USING DEEP LEARNING,” M.S. - Master of Science, Middle East Technical University, 2024.