DocSpider: a dataset of cross-domain natural language querying for MongoDB

Date

2025-02-12

Author

Özer, Arif Görkem
Çekinel, Recep Fırat
Toroslu, İsmail Hakkı
Karagöz, Pınar

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Natural language querying allows users to formulate questions in natural language without requiring knowledge of a database query language. Large language models have been very successful in addressing the text-to-SQL problem, the task of translating natural language questions into SQL statements. Document-oriented NoSQL databases are gaining popularity in the era of big data due to their ability to handle vast amounts of semi-structured data and provide advanced querying functionalities. However, studies on text-to-NoSQL systems, particularly those targeting document databases, are very scarce. In this study, we use large language models to create DocSpider, a cross-domain dataset that pairs natural language questions with document database queries, building on the well-known text-to-SQL challenge dataset Spider. We use MongoDB as the document database. Furthermore, we conduct experiments to assess how effectively the DocSpider dataset fine-tunes a text-to-NoSQL model, compared against a cross-language transfer learning approach, SQL-to-NoSQL, and zero-shot instruction prompting. The experimental results reveal a significant improvement in the execution accuracy of fine-tuned language models when utilizing the DocSpider dataset.
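To make the task and the metric concrete, the following is a minimal sketch, not the authors' evaluation code. The example question, SQL statement, and MongoDB aggregation pipeline are hypothetical illustrations and are not taken from DocSpider. The `execution_accuracy` helper assumes the common definition of execution accuracy (the share of examples whose predicted query returns the same result as the gold query) and, as a simplification, compares results without regard to row order.

```python
import json
from collections import Counter

# Hypothetical example pair (illustrative only, not a DocSpider record):
# one question, an SQL statement, and a MongoDB aggregation pipeline
# that would be passed to a collection's aggregate() call.
example = {
    "question": "How many singers are older than 30?",
    "sql": "SELECT COUNT(*) FROM singer WHERE age > 30",
    "mongo_pipeline": [
        {"$match": {"age": {"$gt": 30}}},
        {"$count": "count"},
    ],
}


def canonical(rows):
    """Turn a list of result documents into an order-insensitive multiset."""
    return Counter(json.dumps(row, sort_keys=True, default=str) for row in rows)


def execution_accuracy(predicted_results, gold_results):
    """Fraction of examples whose predicted result equals the gold result.

    Both arguments are lists with one entry per example; each entry is the
    list of documents returned by executing the corresponding query.
    """
    if len(predicted_results) != len(gold_results):
        raise ValueError("need one predicted result per gold result")
    if not gold_results:
        return 0.0
    matches = sum(
        canonical(pred) == canonical(gold)
        for pred, gold in zip(predicted_results, gold_results)
    )
    return matches / len(gold_results)
```

In this sketch the queries are assumed to have been executed elsewhere (for instance against a MongoDB instance), so only their returned documents are compared. Queries whose correctness depends on sort order would need an order-sensitive comparison instead.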

URI

https://hdl.handle.net/11511/113599

Journal

Natural Language Processing

DOI

https://doi.org/10.1017/nlp.2024.63

Collections

Department of Computer Engineering, Article

Citation Formats

A. G. Özer, R. F. Çekinel, İ. H. Toroslu, and P. Karagöz, “DocSpider: a dataset of cross-domain natural language querying for MongoDB,” Natural Language Processing, 2025. [Online]. Available: https://hdl.handle.net/11511/113599.