Comparison of whole scene image caption models

Görgülü, Tuğrul
Image captioning, which automatically describes the content of an image using words and grammar, is one of the most challenging tasks in deep learning. In recent years, studies aiming to improve the quality of this task have been published constantly. However, a detailed comparison of the available approaches has not yet been made, so the relative performance of the solutions proposed in the literature remains unknown. This thesis aims to address this problem by implementing six different models and analyzing them comparatively. In the literature, the selected models are generally trained only on the MsCOCO dataset. To make the comparison more objective, they are also trained on the Flickr30k dataset in this study. The selected models are as follows: Self-critical Sequence Training for Image Captioning, Neural Baby Talk, Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, Unsupervised Image Caption, Meshed Memory Transformer for Image Caption [7], and Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning. First, captions are obtained from all of these models and the results are compared with those reported in their respective papers. In addition to the popular metrics usually used in the papers, the captions are also evaluated with Word Mover’s Distance and BERT metrics. The findings of this thesis demonstrate that, even though Bottom-Up and Top-Down Attention and Neural Baby Talk can generate highly suitable captions, Meshed Memory Transformer for Image Caption generally provides more promising results than the rest. Unsupervised Image Caption, on the other hand, is a far less successful algorithm, since it does not use the direct relationship between images and their descriptions during the training stage.


Kilickaya, Mert; Erdem, Erkut; Erdem, Aykut; İKİZLER CİNBİŞ, NAZLI; Çakıcı, Ruket (2014-04-25)
Automatic image captioning, the process of producing a description for an image, is a very challenging problem which has only recently received interest from the computer vision and natural language processing communities. In this study, we present a novel data-driven image captioning strategy, which, for a given image, finds the most visually similar image in a large dataset of image-caption pairs and transfers its caption as the description of the input image. Our novelty lies in employing a recently pr...
Image annotation with semi-supervised clustering
Sayar, Ahmet; Yarman Vural, Fatoş Tunay; Department of Computer Engineering (2009)
Image annotation is defined as generating a set of textual words for a given image, learning from the available training data consisting of visual image content and annotation words. Methods developed for image annotation usually make use of region clustering algorithms to quantize the visual information. Visual codebooks are generated from the region clusters of low-level visual features. These codebooks are then matched, in various ways, with the words of the text document related to the image. In this th...
Contrast Enhancement of Microscopy Images Using Image Phase Information
Çakır, Serhat; Atalay, Rengül; ÇETİN, AHMET ENİS (2018-01-01)
Contrast enhancement is an important preprocessing step for the analysis of microscopy images. The main aim of contrast enhancement techniques is to increase the visibility of the cell structures and organelles by modifying the spatial characteristics of the image. In this paper, a phase information-based contrast enhancement framework is proposed to overcome the limitations of existing image enhancement techniques. Inspired by the groundbreaking design of phase contrast microscopy (PCM), the proposed ima...
Alignment of uncalibrated images for multi-view classification
Arık, Sercan Ömer; Vural, Elif; Frossard, Pascal (2011-12-29)
Efficient solutions for the classification of multi-view images can be built on graph-based algorithms when little information is known about the scene or cameras. Such methods typically require a pairwise similarity measure between images, where a common choice is the Euclidean distance. However, the accuracy of the Euclidean distance as a similarity measure is restricted to cases where images are captured from nearby viewpoints. In settings with large transformations and viewpoint changes, alignment of im...
Novel refinement method for automatic image annotation systems
Demircioğlu, Erşan; Yarman Vural, Fatoş Tunay; Department of Computer Engineering (2011)
Image annotation can be defined as the process of assigning a set of content-related words to an image. An automatic image annotation system constructs the relationships between words and low-level visual descriptors extracted from images, and uses these relationships to annotate a newly seen image. The high demand for image annotation increases the need for automatic image annotation systems. However, the performance of current annotation methods is far from practical usage. The most ...
Citation Formats
T. Görgülü, “Comparison of whole scene image caption models,” M.S. - Master of Science, Middle East Technical University, 2021.