Caption generation on scenes with seen and unseen object categories

2022-08-01
Demirel, Berkan
Cinbiş, Ramazan Gökberk
Image caption generation is one of the most challenging problems at the intersection of vision and language domains. In this work, we propose a realistic captioning task where the input scenes may incorporate visual objects with no corresponding visual or textual training examples. For this problem, we propose a detection-driven approach that consists of a single-stage generalized zero-shot detection model to recognize and localize instances of both seen and unseen classes, and a template-based captioning model that transforms detections into sentences. To improve the generalized zero-shot detection model, which provides essential information for captioning, we define effective class representations in terms of class-to-class semantic similarities, and leverage their special structure to construct an effective unseen/seen class confidence score calibration mechanism. We also propose a novel evaluation metric that provides additional insights for the captioning outputs by separately measuring the visual and non-visual contents of generated sentences. Our experiments highlight the importance of studying captioning in the proposed zero-shot setting, and verify the effectiveness of the proposed detection-driven zero-shot captioning approach.
IMAGE AND VISION COMPUTING

Suggestions

Image Captioning with Unseen Objects
Demirel, Berkan; Cinbiş, Ramazan Gökberk; İkizler Cinbiş, Nazlı (2019-09-12)
Image caption generation is a long-standing and challenging problem at the intersection of computer vision and natural language processing. A number of recently proposed approaches utilize a fully supervised object recognition model within the captioning approach. Such models, however, tend to generate sentences that consist only of objects predicted by the recognition models, excluding instances of the classes without labelled training examples. In this paper, we propose a new challenging scenario that ta...
Comparison of whole scene image caption models
Görgülü, Tuğrul; Ulusoy, İlkay; Department of Electrical and Electronics Engineering (2021-02-10)
Image captioning, which automatically describes the content of an image using words and grammar, is one of the most challenging tasks in deep learning. In recent years, studies have been published continually to improve the quality of this task. However, a detailed comparison of all possible approaches has not yet been made, so the comparative performance of the solutions proposed in the literature is unknown. Thus, this thesis aims to redress this problem by making a comparative analysis among six diff...
DATA-DRIVEN IMAGE CAPTIONING WITH META-CLASS BASED RETRIEVAL
Kilickaya, Mert; Erdem, Erkut; Erdem, Aykut; İkizler Cinbiş, Nazlı; Çakıcı, Ruket (2014-04-25)
Automatic image captioning, the process of producing a description for an image, is a very challenging problem which has only recently received interest from the computer vision and natural language processing communities. In this study, we present a novel data-driven image captioning strategy, which, for a given image, finds the most visually similar image in a large dataset of image-caption pairs and transfers its caption as the description of the input image. Our novelty lies in employing a recently pr...
Image resolution enhancement using wavelet domain Hidden Markov Tree and coefficient sign estimation
Temizel, Alptekin (2007-01-01)
Image resolution enhancement using wavelets is a relatively new subject and many new algorithms have been proposed recently. These algorithms assume that the low resolution image is the approximation subband of a higher resolution image and attempt to estimate the unknown detail coefficients to reconstruct a high resolution image. A subset of these recent approaches utilized probabilistic models to estimate these unknown coefficients. Particularly, hidden Markov tree (HMT) based methods using Gaussian mixt...
Analysis of dataset, object tag, and object attribute components in novel object captioning
Şahin, Enes Muvahhid; Akar, Gözde; Department of Electrical and Electronics Engineering (2022-07)
Image captioning is a popular yet challenging task which lies at the intersection of Computer Vision and Natural Language Processing. A specific branch of image captioning called Novel Object Captioning has drawn attention in recent years. Unlike general image captioning, Novel Object Captioning focuses on describing images with novel objects which are not seen during training. Recently, numerous image captioning approaches have been proposed to increase the quality of the generated captions for both gene...
Citation Formats
B. Demirel and R. G. Cinbiş, “Caption generation on scenes with seen and unseen object categories,” IMAGE AND VISION COMPUTING, vol. 124, Art. no. 104515, 2022. Accessed: 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0262885622001445.