Automatic Identification and Classification of Web Tables Using Machine Learning

2024-09-05
F S ALGHARABLI, SAMEH
Tables are one of the most important and common ways of presenting data, especially dense and complex data. However, the wide variety of web table formats and styles makes it difficult to present tables clearly to all users, especially those with cognitive or visual disabilities. This thesis focuses on the automatic identification and classification of web tables using Machine Learning and Deep Learning techniques, and addresses the limitations of existing methods that rely solely on HTML structure features. Specifically, we classify web tables by the location of their headers, using both rendered features and image-based approaches. Because the features and images are taken from tables after the browser has rendered them, they capture tables as users actually experience them, which improves the accuracy and robustness of the classification models. The dataset combines an available dataset with newly collected, manually labeled data, for a total of 5,437 tables. Several Machine Learning and Deep Learning models are developed to automatically identify and classify different types of web tables. In the experiments, the best performance is achieved by Random Forest with rendered features (F1-Score = 0.92) and by a custom-built Convolutional Neural Network with images (F1-Score = 0.93). The primary contribution of this thesis is the automated processing and understanding of web table types, which improves existing automated web table mining approaches and can potentially improve accessibility for visually impaired web users.
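For readers unfamiliar with the metric reported above, the F1-Score is the harmonic mean of precision and recall. The sketch below is only an illustration: the function names and all counts are hypothetical and are not taken from the thesis, and the abstract does not state which averaging scheme (macro, micro, or weighted) was used across table classes, so the macro average shown here is an assumption.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for a single class from true positives, false positives, false negatives."""
    if tp == 0:
        # No correct predictions for this class: precision and recall are both 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class_counts: list[tuple[int, int, int]]) -> float:
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    scores = [f1_score(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(scores) / len(scores)


# Hypothetical (tp, fp, fn) counts for two table classes, for illustration only.
counts = [(90, 10, 10), (40, 10, 10)]
single_class = f1_score(90, 10, 10)
averaged = macro_f1(counts)
```

With these made-up counts, the first class has precision and recall of 0.9 each, so its F1 is also 0.9; the macro average then weights both classes equally regardless of their size.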
Citation Formats
S. F S ALGHARABLI, “Automatic Identification and Classification of Web Tables Using Machine Learning,” M.S. - Master of Science, Middle East Technical University, 2024.