Exploring the capabilities of large language models in visual question answering: a new approach using question-driven image captions as prompts
Download: ovgu_ozdemir_tez.pdf
Date
2024-9
Author
Özdemir, Övgü
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Item Usage Stats: 29 views, 38 downloads
Abstract
Visual question answering (VQA) refers to the artificial intelligence task of providing natural language answers to natural language questions based on a visual input. Due to its requirement for multimodal processing, VQA is considered a challenging problem. It often necessitates understanding the scene depicted in the image, extracting the relationships between objects and their attributes, and performing multi-step reasoning. Over the past few years, numerous deep learning architectures have been proposed for VQA. More recently, pre-trained vision-language models and Multimodal Large Language Models (MLLMs) with billions of parameters have demonstrated superior performance on benchmark tests. Nevertheless, there is still room for performance improvement in zero-shot VQA. Zero-shot VQA requires adapting to tasks without input-output guidance, necessitating advanced reasoning abilities. Consequently, recent research has focused on designing prompts that can elicit reasoning capabilities in MLLMs. This thesis proposes a new approach aimed at improving performance in zero-shot VQA by using LLMs and integrating context-aware image captioning as an intermediate step. Evaluations and comparisons were conducted using recent MLLMs, such as CogVLM, GPT-4, and GPT-4o, on the GQA test set, which includes structurally and semantically diverse questions that often require multi-step reasoning. Additionally, the thesis examines the impact of different prompt designs on VQA performance. The findings highlight the potential of using image captions and optimized prompts to enhance VQA performance under zero-shot settings.
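The following is a minimal sketch of the general idea described in the abstract: obtaining a question-driven image caption as an intermediate step and then prompting a text-only LLM with that caption to answer the question zero-shot. It assumes the OpenAI Python client; the model names, prompt wording, and the answer_visual_question helper are illustrative placeholders, not the exact prompts or pipeline used in the thesis.

```python
# Two-step zero-shot VQA sketch:
#   1) ask a multimodal model for a caption of the image focused on the question,
#   2) pass the caption plus the question to a text-only LLM for the final answer.
# Assumption: OpenAI Python client (openai>=1.0); prompts and models are illustrative.
from openai import OpenAI

client = OpenAI()

def answer_visual_question(image_url: str, question: str) -> str:
    # Step 1: question-driven image caption from a multimodal model.
    caption = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Describe this image, focusing on details relevant to: {question}"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    ).choices[0].message.content

    # Step 2: answer the question from the caption alone (zero-shot prompt).
    answer = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Image description: {caption}\nQuestion: {question}\nAnswer briefly.",
        }],
    ).choices[0].message.content
    return answer
```

In this setup the second model never sees the image, so answer quality depends on how well the caption prompt surfaces the question-relevant details, which is why the thesis also studies different prompt designs.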
Subject Keywords
Visual question answering, Large language models, Image captioning, Prompt design, Zero-shot VQA
URI
https://hdl.handle.net/11511/111434
Collections
Graduate School of Informatics, Thesis
Citation Formats
IEEE
Ö. Özdemir, “Exploring the capabilities of large language models in visual question answering: a new approach using question-driven image captions as prompts,” M.S. - Master of Science, Middle East Technical University, 2024.