Turkish legal NLP: a comprehensive AI framework for legal document understanding, summarization, and retrieval-augmented generation

2025-8-26
Erkan, Mehmet Ali
Natural Language Processing (NLP) has been widely applied across many domains; however, legal texts, particularly in Turkish, remain relatively underexplored. The complexity and richness of Turkish legal language make it a valuable yet challenging source for artificial intelligence (AI) systems. This thesis presents an end-to-end AI-based framework designed to understand, summarize, and evaluate Turkish legal texts. A large-scale dataset was compiled, consisting of 23,035 judicial decisions and 9,277 legislative and regulatory documents collected from Yargıtay, Danıştay, the Constitutional Court, UYAP, legislative bulletins, and ministry decisions. These texts were cleaned, normalized, and structured to create benchmark datasets for legal NLP in Turkish. A detailed tokenization analysis was conducted, highlighting the trade-offs between efficiency and semantic representation across character-level, word-based, and legal-aware approaches. Embedding methods were systematically compared, and the findings indicated that no single technique universally dominates; instead, effectiveness varies with the legal text category and the downstream task. Additionally, a new summarization algorithm was developed specifically for the structural features of Turkish legal language. It outperformed traditional baselines in relevance, coherence, and legal comprehensiveness. Finally, a Retrieval-Augmented Generation (RAG) system was built by combining dense retrieval with generation models suited to the legal domain. The system achieved high factual accuracy and citation reliability, demonstrating its potential to support lawyers and citizens in research, decision-making, and access to legal information.
Citation Formats
M. A. Erkan, “Turkish legal NLP: a comprehensive AI framework for legal document understanding, summarization, and retrieval-augmented generation,” M.S. - Master of Science, Middle East Technical University, 2025.