Natural Language Querying for Databases: Cross-Domain Datasets and Retrieval-Augmented Approaches to Query Generation

2026-02-23
ÖZER, ARİF GÖRKEM
Natural Language Querying (NLQ) aims to enable users to access and interact with databases using natural language, reducing the need for expertise in formal query languages and database schemas. While substantial progress has been made in text-to-SQL research for relational databases, existing approaches remain limited in two key respects: their strong dependence on SQL-centric benchmarks and their reliance on data-intensive, fine-tuned models that often fail to generalize across domains and schemas. This thesis addresses these challenges by advancing NLQ research along two complementary directions. First, it introduces DocSpider, the first publicly available cross-domain text-to-MongoDB dataset. DocSpider is constructed by systematically translating gold-standard SQL queries from the Spider benchmark into MongoDB Query Language (MQL) while preserving the original natural language questions and database schemas. This dataset enables, for the first time, large-scale and systematic evaluation of NLQ systems for document-oriented NoSQL databases and provides a foundation for studying cross-domain generalization beyond relational settings. Second, the thesis proposes LWRAG-SQL, a lightweight Retrieval-Augmented Generation (RAG) framework for text-to-SQL that entirely avoids task-specific fine-tuning. The framework generates semantic descriptions of database schema elements and employs similarity-based retrieval to select relevant schema context at inference time. By injecting this retrieved context into large language model prompts, LWRAG-SQL guides query generation in a schema-aware manner while remaining resource-efficient, natural language-agnostic, and adaptable to unseen domains. Extensive experimental evaluations demonstrate that DocSpider enables meaningful cross-paradigm NLQ analysis for NoSQL systems and that LWRAG-SQL achieves competitive performance on text-to-SQL benchmarks despite requiring no annotated training data or pre-training.
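The kind of SQL-to-MQL translation DocSpider performs can be illustrated with a minimal, hypothetical example. The question, SQL query, and schema names below are invented in the style of Spider and are not taken from the dataset itself; the MQL side is written as a pymongo-style aggregation pipeline using plain Python dicts, so no database connection is needed.

```python
# Hypothetical Spider-style example (not an actual DocSpider entry).

# Natural language question, preserved unchanged across both paradigms:
question = "How many singers are there for each country?"

# Gold SQL over a relational 'singer' table:
sql = "SELECT country, COUNT(*) FROM singer GROUP BY country"

# Equivalent MQL over a 'singer' collection, expressed as a
# MongoDB aggregation pipeline in pymongo's list-of-stages format:
mql_pipeline = [
    # $group mirrors SQL's GROUP BY country + COUNT(*):
    # documents are grouped by the 'country' field and each group is counted.
    {"$group": {"_id": "$country", "count": {"$sum": 1}}},
]
```

Because the question and schema stay fixed while only the query language changes, pairs like this allow the same NLQ system to be evaluated across the relational and document-oriented paradigms.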
Together, these contributions broaden the scope of NLQ research to document-oriented databases and present a practical alternative to fine-tuning-heavy approaches, advancing more general, scalable, and deployable natural language interfaces to databases.
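The retrieval step at the heart of LWRAG-SQL can be sketched as follows. This is a minimal stand-in, not the thesis's implementation: a bag-of-words cosine similarity substitutes for the neural embedding model, and the schema elements and their semantic descriptions are invented for illustration.

```python
import math
import re
from collections import Counter

def bow(text: str) -> Counter:
    """Tokenize into a lowercase bag-of-words vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical semantic descriptions of schema elements; in the framework
# described above these are generated automatically and compared with a
# proper embedding model rather than word overlap.
schema_descriptions = {
    "singer.country": "the country a singer comes from",
    "singer.age": "the age of a singer in years",
    "concert.year": "the year a concert took place",
}

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k schema elements whose descriptions best match the question."""
    q = bow(question)
    ranked = sorted(schema_descriptions,
                    key=lambda e: cosine(q, bow(schema_descriptions[e])),
                    reverse=True)
    return ranked[:k]

# The retrieved elements are then injected into the LLM prompt as schema context:
top = retrieve("How many singers are from each country?")
prompt = "Relevant schema: " + ", ".join(top) + "\nQuestion: ..."
```

Because only the description store changes per database, this retrieve-then-prompt loop needs no fine-tuning to move to an unseen schema, which is the property the framework exploits.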
Citation Formats
A. G. ÖZER, “Natural Language Querying for Databases: Cross-Domain Datasets and Retrieval-Augmented Approaches to Query Generation,” Ph.D. dissertation, Middle East Technical University, 2026.