Exploiting word and sentence embeddings for diversification in crawling and ranking

Date

2022-9

Author

Ünaldı, Can Duran

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

The growth of the Web and of microblogging sites has produced copious amounts of duplicate or near-duplicate content, which gave rise to the diversification paradigm. A typical search system has three main components: a crawler, an indexer, and a query processor. While most diversification approaches target the query processing stage, in this work we apply the diversification paradigm to both the crawling and the query processing stages. First, we introduce a diversification-aware focused crawler, which considers all aspects of a given search query in order to construct a collection that covers them equally. Second, we focus on the diversification of short texts, such as social media posts, at the query processing stage. For both contributions, we apply well-known diversification approaches from the literature and extend them by exploiting neural language models, which are state of the art for several information retrieval and natural language processing tasks. Our experiments, which evaluate both approaches with carefully designed experimental setups, show that the diversification paradigm is successful both for the crawling stage and for short texts. Moreover, neural language models achieve comparable results for the diversification paradigm.
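
The abstract does not name the specific diversification approaches the thesis applies. Purely as an illustration of how embedding similarity can drive result diversification, the sketch below uses Maximal Marginal Relevance (MMR), a widely known method from the literature, and this choice is an assumption rather than a confirmed detail of the thesis. The function names `cosine_sim_matrix` and `mmr_rerank`, the trade-off parameter `lam`, and the toy two-dimensional vectors are all hypothetical; in practice the vectors would come from a word or sentence embedding model.

```python
import numpy as np


def cosine_sim_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def mmr_rerank(query_vec, doc_vecs, k, lam=0.5):
    """Greedy MMR: pick k documents, trading relevance against redundancy.

    lam = 1.0 ranks purely by relevance to the query; lower values
    penalise documents that resemble ones already selected.
    (Hypothetical helper, not taken from the thesis.)
    """
    relevance = cosine_sim_matrix(doc_vecs, query_vec[None, :])[:, 0]
    pairwise = cosine_sim_matrix(doc_vecs, doc_vecs)
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = None, -np.inf
        for i in candidates:
            # Redundancy is the similarity to the closest selected document.
            redundancy = max((pairwise[i, j] for j in selected), default=0.0)
            score = lam * relevance[i] - (1.0 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy embeddings (assumed values): documents 0 and 1 are near duplicates,
# document 2 covers a different aspect of the query.
query = np.array([1.0, 0.0])
docs = np.array([[1.0, 0.10], [1.0, 0.12], [0.6, -0.8]])

print(mmr_rerank(query, docs, k=2, lam=0.5))  # diversified ranking
print(mmr_rerank(query, docs, k=2, lam=1.0))  # relevance-only ranking
```

With `lam=0.5` the near-duplicate document is skipped in favour of the one covering a different aspect, whereas `lam=1.0` returns the two most relevant documents regardless of their overlap.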

Subject Keywords

Word Embeddings, Sentence Transformers, Document Embeddings, Diversification, Focused Crawling, Microblogging

URI

https://hdl.handle.net/11511/102533

Collections

Graduate School of Natural and Applied Sciences, Thesis

Suggestions


Spam detection by using network and text embedding approaches Yılmaz, Cennet Merve; İyigün, Cem; Department of Industrial Engineering (2019) The authenticity and reliability of the information spread over cyberspace are becoming increasingly important, especially in e-commerce. This is because potential customers check reviews and customer feedback online before making a purchasing decision. Although this information is easily accessible through related websites, the lack of verification of the authenticity of these reviews raises concerns about their reliability. Besides, fraudulent users disseminate disinformation to deceive people into acting aga...
Exploiting Navigational Queries for Result Presentation and Caching in Web Search Engines Ozcan, Rifat; Altıngövde, İsmail Sengör; Ulusoy, Ozgur (Wiley, 2011-04-01) Caching of query results is an important mechanism for efficiency and scalability of web search engines. Query results are cached and presented in terms of pages, which typically include 10 results each. In navigational queries, users seek a particular website, which would be typically listed at the top ranks (maybe, first or second) by the search engine, if found. For this type of query, caching and presenting results in the 10-per-page manner may waste cache space and network bandwidth. In this article, w...
Characterizing web search queries that match very few or no results Altıngövde, İsmail Sengör; Cambazoglu, Berkant Barla; Ozcan, Rifat; Sarigil, Erdem; Ulusoy, Özgür (2012-12-19) Despite the continuous efforts to improve the web search quality, a non-negligible fraction of user queries end up with very few or even no matching results in leading web search engines. In this work, we provide a detailed characterization of such queries based on an analysis of a real-life query log. Our experimental setup allows us to characterize the queries with few/no results and compare the mechanisms employed by the major search engines in handling them.
Adaptive time-to-live strategies for query result caching in web search engines Alici, Sadiye; Altıngövde, İsmail Sengör; Ozcan, Rifat; Barla Cambazoglu, B.; Ulusoy, Özgür (2012-04-27) An important research problem that has recently started to receive attention is the freshness issue in search engine result caches. In the current techniques in literature, the cached search result pages are associated with a fixed time-to-live (TTL) value in order to bound the staleness of search results presented to the users, potentially as part of a more complex cache refresh or invalidation mechanism. In this paper, we propose techniques where the TTL values are set in an adaptive manner, on a per-quer...
Image classification for content based indexing Taner, Serdar; Severcan, Mete; Department of Electrical and Electronics Engineering (2003) As the size of image databases increases over time, the need for content-based image indexing and retrieval becomes important. Image classification is a key to content-based image indexing. In this thesis, supervised learning with feed-forward back-propagation artificial neural networks is used for image classification. Low-level features derived from the images are used to classify the images and to interpret the high-level features that yield semantics. Features are derived using detail histogram correlations obt...

Citation Formats

C. D. Ünaldı, “Exploiting word and sentence embeddings for diversification in crawling and ranking,” M.S. - Master of Science, Middle East Technical University, 2022.