REFERENCE SELECTION IN TURKISH: A CORPUS-BASED APPROACH

2025-06-18
Büyüktekin, Faruk
This thesis investigates reference selection in natural language, focusing on the mechanisms that shape the form of referring expressions. Drawing on both linguistic theory and data-driven computational methods, the study seeks to uncover how grammatical, discourse, and cognitive factors jointly influence referential form. As the target language, Turkish offers a testing ground for exploring referential choices beyond the patterns observed in well-studied languages because of its typologically distinct characteristics, specifically its rich morphology, frequent use of null pronouns, and flexible word order. A central contribution of this work is the creation of a novel coreference corpus based on spontaneous, goal-directed dialog. Unlike prior research, which has typically relied on semi-artificial or isolated sentences or on written texts, this study uses situated task-based interaction, capturing reference in real-time naturalistic speech. To facilitate this, a new annotation scheme was developed to represent the full range of referential forms, including full noun phrases, overt pronouns, and null pronouns, together with their contextual and grammatical properties. The resulting corpus, which is the most comprehensive coreference corpus of Turkish dialogs to date, enables systematic and computationally viable analyses of referential phenomena. Building on this resource, the thesis conducts extensive statistical analyses and employs machine learning to evaluate the effects and interactions of multiple features on referential form. These include speaker role, turn-taking, grammatical role, competition, distance, topicality, and sentential position. Among the findings, competition and distance emerged as the most predictive features in model performance, while speaker role and turn-taking showed weaker but interpretable effects. Statistical tests confirmed that many of these factors significantly influence form choice, supporting and extending the predictions of major theories and models of referential form selection. By integrating corpus development, feature engineering, statistical modeling, and explanatory machine learning, this thesis offers a unified framework for analyzing reference in Turkish. It not only contributes to theoretical accounts of referential choice in typologically diverse settings but also provides scalable tools for future research in natural language processing, cognitive modeling, and dialog systems.
Citation Formats
F. Büyüktekin, “REFERENCE SELECTION IN TURKISH: A CORPUS-BASED APPROACH,” Ph.D. - Doctoral Program, Middle East Technical University, 2025.