Show/Hide Menu
Hide/Show Apps
Logout
Türkçe
Türkçe
Search
Search
Login
Login
OpenMETU
OpenMETU
About
About
Open Science Policy
Open Science Policy
Open Access Guideline
Open Access Guideline
Postgraduate Thesis Guideline
Postgraduate Thesis Guideline
Communities & Collections
Communities & Collections
Help
Help
Frequently Asked Questions
Frequently Asked Questions
Guides
Guides
Thesis submission
Thesis submission
MS without thesis term project submission
MS without thesis term project submission
Publication submission with DOI
Publication submission with DOI
Publication submission
Publication submission
Supporting Information
Supporting Information
General Information
General Information
Copyright, Embargo and License
Copyright, Embargo and License
Contact us
Contact us
Exploiting word and sentence embeddings for diversification in crawling and ranking
Download
Can_Thesis.pdf
Date
2022-9
Author
Ünaldı, Can Duran
Metadata
Show full item record
This work is licensed under a
Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
.
Item Usage Stats
162
views
82
downloads
Cite This
The increase in the volume of the Web and Microblogging sites caused copious amounts of duplicate or near duplicate content which emerged the diversification paradigm. On a typical search system, there are three main components, namely, a crawler, an indexer and a query processor. While most diversification approaches aim at the query processing stage of the search system, in this work, we aim to apply the diversification paradigm to both crawling and query processing stages. First, we introduce a diversification-aware focused crawler, which considers all the aspects of a given search query in order to construct a collection that contains equal coverage of them. Second, we focus on the diversification of short texts, such as social media posts, for the query processing stage. For both contributions, we apply well-known diversification approaches in the literature and extend them by exploiting the neural language models that are state-of-the-art for several information retrieval and natural language processing tasks. Our experiments, in which we evaluate both approaches with well-crafted experimental setups, show that the diversification paradigm is successful for both the crawling stage and short texts. Moreover, neural language models perform comparable results for the diversification paradigm.
Subject Keywords
Word Embeddings
,
Sentence Transformers
,
Document Embeddings
,
Diversification
,
Focused Crawling
,
Microblogging
URI
https://hdl.handle.net/11511/102533
Collections
Graduate School of Natural and Applied Sciences, Thesis
Suggestions
OpenMETU
Core
Spam detection by using network and text embedding approaches
Yılmaz, Cennet Merve; İyigün, Cem; Department of Industrial Engineering (2019)
Authenticity and reliability of the information spread over the cyberspace is becoming increasingly important, especially in e-commerce. This is because potential customers check reviews and customer feedbacks online before making a purchasing decision. Although this information is easily accessible through related websites, lack of verification of the authenticity of these reviews raises concerns about their reliability. Besides, fraudulent users disseminate disinformation to deceive people into acting aga...
Exploiting Navigational Queries for Result Presentation and Caching in Web Search Engines
Ozcan, Rifat; Altıngövde, İsmail Sengör; Ulusoy, Ozgur (Wiley, 2011-04-01)
Caching of query results is an important mechanism for efficiency and scalability of web search engines. Query results are cached and presented in terms of pages, which typically include 10 results each. In navigational queries, users seek a particular website, which would be typically listed at the top ranks (maybe, first or second) by the search engine, if found. For this type of query, caching and presenting results in the 10-per-page manner may waste cache space and network bandwidth. In this article, w...
Characterizing web search queries that match very few or no results
Altıngövde, İsmail Sengör; Cambazoglu, Berkant Barla; Ozcan, Rifat; Sarigil, Erdem; Ulusoy, Özgür (2012-12-19)
Despite the continuous efforts to improve the web search quality, a non-negligible fraction of user queries end up with very few or even no matching results in leading web search engines. In this work, we provide a detailed characterization of such queries based on an analysis of a real-life query log. Our experimental setup allows us to characterize the queries with few/no results and compare the mechanisms employed by the major search engines in handling them.
Adaptive time-to-live strategies for query result caching in web search engines
Alici, Sadiye; Altıngövde, İsmail Sengör; Ozcan, Rifat; Barla Cambazoglu, B.; Ulusoy, Özgür (2012-04-27)
An important research problem that has recently started to receive attention is the freshness issue in search engine result caches. In the current techniques in literature, the cached search result pages are associated with a fixed time-to-live (TTL) value in order to bound the staleness of search results presented to the users, potentially as part of a more complex cache refresh or invalidation mechanism. In this paper, we propose techniques where the TTL values are set in an adaptive manner, on a per-quer...
Image classification for content based indexing
Taner, Serdar; Severcan, Mete; Department of Electrical and Electronics Engineering (2003)
As the size of image databases increases in time, the need for content based image indexing and retrieval become important. Image classification is a key to content based image indexing. In this thesis supervised learning with feed forward back propagation artificial neural networks is used for image classification. Low level features derived from the images are used to classify the images to interpret the high level features that yield semantics. Features are derived using detail histogram correlations obt...
Citation Formats
IEEE
ACM
APA
CHICAGO
MLA
BibTeX
C. D. Ünaldı, “Exploiting word and sentence embeddings for diversification in crawling and ranking,” M.S. - Master of Science, Middle East Technical University, 2022.