Although table recognition is a long-standing research area, results remain far from satisfactory because of the diversity of table formats and styles on the web. Automatically recognising tables is especially important for visually disabled people who cannot see them, because tables are among the most common ways of presenting and structuring data with a high information density. This thesis aims to automatically detect table headings using Artificial Intelligence (AI) techniques, in particular machine learning algorithms. We first examine the variety of table structures and the accessibility challenges they pose for visually disabled people, to inform the design of our approach. We then analyse the existing studies on table recognition in web pages and PDF documents. Based on this existing work, we propose first identifying relational tables and then identifying their headings, using features of the rendered web page with machine learning algorithms. We also investigate the existing datasets and conclude that they cannot be used for the proposed approach; we therefore create our own dataset based on HTML and related information such as CSS. We then build a tool that extracts tables from a given page, and use the created dataset with machine learning algorithms to detect table headings automatically. To do this, we also compare existing machine learning algorithms to identify the one best suited to this purpose. This thesis makes two main contributions: (1) an understanding of table structures on the web, together with a large dataset of web tables; and (2) an investigation of machine learning algorithms for automatically identifying table headings. Based on the results we achieved by applying machine learning to rendered pages to detect relational tables and their headings, we conclude that features extracted from rendered pages perform better than HTML structural features alone.


