Comparison of whole scene image caption models

Görgülü, Tuğrul
Image captioning, the task of automatically describing the content of an image in grammatical natural language, is one of the most challenging problems in deep learning. In recent years, new studies have been published steadily to improve the quality of this task. However, a detailed comparison of the proposed approaches has not yet been carried out, so the relative performance of the solutions in the literature remains unclear. This thesis addresses that gap by implementing six different models and analyzing them comparatively. In the literature, the selected models are generally trained only on the MS COCO dataset; to make the comparison more objective, they are also trained on the Flickr30k dataset in this study. The selected models are as follows: Self-critical Sequence Training for Image Captioning, Neural Baby Talk, Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, Unsupervised Image Captioning, Meshed-Memory Transformer for Image Captioning [7], and Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning. First, captions are generated with all of these models and the results are compared with those reported in the respective papers. In addition to the popular metrics usually used in the papers, the generated captions are also evaluated with Word Mover’s Distance and BERT metrics. The findings of this thesis show that, although Bottom-Up and Top-Down Attention and Neural Baby Talk can generate highly appropriate captions, Meshed-Memory Transformer generally provides more promising results than the other models. Unsupervised Image Captioning, on the other hand, is far less successful, since it does not use the direct relationship between images and their descriptions during training.


