IMPROVING VISION LLM PERFORMANCE ON STANDARDIZED TEST QUESTIONS
Date
2025-08-29
Author
Sert, Sefik Egemen
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Abstract
In our research, we show that open-source vision-language models can be trained to rival proprietary systems on complex, multimodal Turkish high-school exam questions, a domain where no benchmark previously existed. This thesis introduces the first standardized benchmark for evaluating Vision Language Models (VLMs) on the Turkish high-school curriculum. We present a manually curated dataset of 1,854 five-choice Yükseköğretim Kurumları Sınavı (YKS) questions, evenly sampled from 309 topics, designed to comprehensively test multimodal reasoning over complex, exam-style problems. We establish performance baselines by evaluating both open-source and proprietary VLMs, revealing a 23-point accuracy gap between the best proprietary model (Gemini 2.5 Flash, 84.68%) and the strongest open-source model (Qwen-2.5VL-32B, 62.46%). To close this gap, we curated three large-scale multimodal training datasets (D, M, L) totaling 161.4 million tokens, augmented with solutions from advanced models (Gemini 2.0, Gemini 2.5) and video-assisted prompting for complex questions. Using our optimized QMSA (Question–Metadata–Solution–Answer) syntax, we fine-tuned Qwen-2.5VL-32B, achieving 78.59% accuracy, a 25.82% relative improvement, narrowing the gap to proprietary performance to 7.9%. This work delivers three contributions: (1) a publicly available benchmark for Turkish academic evaluation of VLMs, (2) a high-quality, domain-specific training dataset enabling competitive open-source performance, and (3) an empirical demonstration that data-centric fine-tuning can substantially close the open–proprietary performance gap. We also outline key challenges, such as spatial reasoning and domain-specific diagram interpretation, and propose targeted post-training, tool-assisted reasoning, and synthetic data generation as promising future directions.
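As a quick sanity check (a reader's verification, not part of the thesis), the reported 25.82% relative improvement follows directly from the two accuracies quoted above:

\[
\frac{78.59 - 62.46}{62.46} \approx 0.2582 \quad\Rightarrow\quad 25.82\% \text{ relative improvement over the open-source baseline}
\]

The abstract names the QMSA ordering (Question, then Metadata, then Solution, then Answer) but not its concrete serialization. The Python sketch below is a hypothetical illustration of assembling one training example in that order; the field labels, delimiters, and metadata schema are assumptions chosen for illustration, not the thesis's actual format.

```python
# Hypothetical sketch: only the Question -> Metadata -> Solution -> Answer
# ordering comes from the abstract; field names and delimiters are assumed.

def format_qmsa(question: str, metadata: dict, solution: str, answer: str) -> str:
    """Assemble one training example in Question-Metadata-Solution-Answer order."""
    meta = ", ".join(f"{key}: {value}" for key, value in metadata.items())
    return (
        f"Question: {question}\n"
        f"Metadata: {meta}\n"
        f"Solution: {solution}\n"
        f"Answer: {answer}"
    )

# Example with a YKS-style five-choice item (contents are placeholders):
record = format_qmsa(
    question="A car accelerates uniformly from rest... (five-choice question text)",
    metadata={"exam": "YKS", "topic": "kinematics"},
    solution="Apply v = v0 + a*t with the given values to find the final speed...",
    answer="C",
)
print(record)
```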
Subject Keywords
Large language models, Vision language models, Multimodal reasoning
URI
https://hdl.handle.net/11511/116126
Collections
Graduate School of Natural and Applied Sciences, Thesis
Citation Formats
IEEE
S. E. Sert, “IMPROVING VISION LLM PERFORMANCE ON STANDARDIZED TEST QUESTIONS,” M.S. - Master of Science, Middle East Technical University, 2025.