AUTOMATICALLY DETECTING TABLE HEADINGS IN WEB PAGES WITH MACHINE LEARNING

Date

2021-1

Author

Haider, Waqar

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Although table recognition is a long-standing research area, results remain far from satisfactory because of the diversity of table formats and styles on the web. Automatically recognising tables is especially important for visually disabled people who cannot see them, because tables are among the most common ways of presenting and structuring data with a high information density. This thesis aims to automatically detect table headings using Artificial Intelligence (AI) techniques, especially machine learning algorithms. We first examine various table structures and the accessibility challenges they pose for visually disabled people. We then analyse the existing studies on table recognition in web pages and PDF documents. Based on the existing work, we propose first identifying relational tables and then their headings, using rendered web page features with machine learning algorithms. We also investigate the existing datasets and conclude that they cannot be used for the proposed approach; we therefore create our own dataset based on HTML and related information such as CSS. After that, we create a tool that extracts tables from a given page, and we use the created dataset with machine learning algorithms to detect the table headings automatically. To do this, we also investigate existing machine learning algorithms to identify the one best suited to this purpose. This thesis has two main contributions: (1) an understanding of table structures on the web, together with a large dataset of web tables; (2) an investigation of machine learning algorithms for automatically identifying the headings of tables. Based on our results from using machine learning with rendered pages to detect relational tables and their headings, we conclude that features extracted from rendered pages perform better than HTML structural features alone.
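The table-extraction step mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's tool: the names `TableExtractor`, `extract_tables`, and `row_features`, and the two features `th_ratio` and `numeric_ratio`, are assumptions made for illustration. The sketch uses only HTML markup via Python's standard `html.parser`; the rendered-page features the thesis reports as more effective (for example, computed CSS styles) would require a browser engine and are not shown here.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows; each row is a list of cells.

    A cell is a dict holding its tag ('th' or 'td') and its text.
    Nested tables are collected separately, but text in an outer cell
    that follows a nested table is not captured (sketch limitation).
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, in the order they close
        self._stack = []   # tables that are currently open
        self._cell = None  # cell currently receiving text, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append([])
        elif tag == "tr" and self._stack:
            self._stack[-1].append([])
        elif tag in ("td", "th") and self._stack and self._stack[-1]:
            self._cell = {"tag": tag, "text": ""}
            self._stack[-1][-1].append(self._cell)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._cell = None
        elif tag == "table" and self._stack:
            self.tables.append(self._stack.pop())

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data


def extract_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables


def row_features(row):
    """Compute two hypothetical structural features for one table row."""
    count = len(row)
    if count == 0:
        return {"th_ratio": 0.0, "numeric_ratio": 0.0}
    th_cells = sum(1 for cell in row if cell["tag"] == "th")
    numeric_cells = sum(
        1 for cell in row
        if cell["text"].strip().replace(".", "", 1).isdigit()
    )
    return {
        "th_ratio": th_cells / count,
        "numeric_ratio": numeric_cells / count,
    }
```

Feature dictionaries of this kind could be fed to any classifier to label rows as heading or data. Many web tables do not mark headings with `th` at all, which is one reason markup-only features like these may fall short of features taken from the rendered page.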

Subject Keywords

Accessibility, Table Mining, Information Extraction, Table Classification, Web Mining, Machine Learning

URI

https://hdl.handle.net/11511/93189

Collections

Northern Cyprus Campus, Thesis

Citation Formats

W. Haider, “AUTOMATICALLY DETECTING TABLE HEADINGS IN WEB PAGES WITH MACHINE LEARNING,” M.S. - Master of Science, Middle East Technical University, 2021.