Natural Language Querying for Databases: Cross-Domain Datasets and Retrieval-Augmented Approaches to Query Generation
Date
2026-2-23
Author
ÖZER, ARİF GÖRKEM
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Natural Language Querying (NLQ) aims to enable users to access and interact with databases using natural language, reducing the need for expertise in formal query languages and database schemas. While substantial progress has been made in text-to-SQL research for relational databases, existing approaches remain limited in two key respects: their strong dependence on SQL-centric benchmarks and their reliance on data-intensive, fine-tuned models that often fail to generalize across domains and schemas. This thesis addresses these challenges by advancing NLQ research along two complementary directions. First, it introduces DocSpider, the first publicly available cross-domain text-to-MongoDB dataset. DocSpider is constructed by systematically translating gold-standard SQL queries from the Spider benchmark into MongoDB Query Language (MQL) while preserving the original natural language questions and database schemas. This dataset enables, for the first time, large-scale and systematic evaluation of NLQ systems for document-oriented NoSQL databases and provides a foundation for studying cross-domain generalization beyond relational settings. Second, the thesis proposes LWRAG-SQL, a lightweight Retrieval-Augmented Generation (RAG) framework for text-to-SQL that entirely avoids task-specific fine-tuning. The framework generates semantic descriptions of database schema elements and employs similarity-based retrieval to select relevant schema context at inference time. By injecting this retrieved context into large language model prompts, LWRAG-SQL guides query generation in a schema-aware manner while remaining resource-efficient, natural language-agnostic and adaptable to unseen domains. Extensive experimental evaluations demonstrate that DocSpider enables meaningful cross-paradigm NLQ analysis for NoSQL systems and that LWRAG-SQL achieves competitive performance on text-to-SQL benchmarks despite requiring no annotated training data or pre-training. 
Together, these contributions broaden the scope of NLQ research to document-oriented databases and present a practical alternative to fine-tuning-heavy approaches, advancing more general, scalable, and deployable natural language interfaces to databases.
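The retrieval step described in the abstract (describe schema elements, select the most similar ones for a question, inject them into the prompt) can be illustrated with a minimal sketch. This is not the thesis implementation: the schema descriptions, the function names `vectorize`, `cosine`, `retrieve`, and `build_prompt`, and the prompt layout are all hypothetical, and a simple bag-of-words cosine similarity stands in for whatever embedding model the framework actually uses.

```python
import math
import re
from collections import Counter

# Hypothetical schema descriptions. In the framework described above these
# would be generated semantic descriptions of tables and columns.
SCHEMA_DESCRIPTIONS = {
    "singer.name": "name of the singer who performs songs",
    "singer.age": "age of the singer in years",
    "concert.year": "year in which the concert took place",
    "stadium.capacity": "maximum number of people the stadium can hold",
}


def vectorize(text):
    """Bag-of-words count vector; an assumed stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, descriptions, k=2):
    """Return the k (element, description) pairs most similar to the question."""
    q = vectorize(question)
    return sorted(
        descriptions.items(),
        key=lambda item: cosine(q, vectorize(item[0] + " " + item[1])),
        reverse=True,
    )[:k]


def build_prompt(question, descriptions, k=2):
    """Inject the retrieved schema context into a text-to-SQL prompt."""
    context = "\n".join(
        f"- {name}: {desc}" for name, desc in retrieve(question, descriptions, k)
    )
    return f"Relevant schema elements:\n{context}\n\nQuestion: {question}\nSQL:"


if __name__ == "__main__":
    print(build_prompt("What is the age of each singer?", SCHEMA_DESCRIPTIONS))
```

Because only the retrieved elements reach the prompt, the context stays small even for large schemas, and no model weights are updated. That is the property the abstract attributes to avoiding task-specific fine-tuning.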
Subject Keywords
Natural language querying, text-to-SQL, text-to-MongoDB query, Document oriented databases, Cross-domain generalization, Retrieval-augmented generation, Large language models, Schema-aware query generation, Resource-efficient NLP systems
URI
https://hdl.handle.net/11511/118782
Collections
Graduate School of Natural and Applied Sciences, Thesis
Citation Formats
A. G. ÖZER, “Natural Language Querying for Databases: Cross-Domain Datasets and Retrieval-Augmented Approaches to Query Generation,” Ph.D. - Doctoral Program, Middle East Technical University, 2026.