Analysis of dataset, object tag, and object attribute components in novel object captioning

Şahin, Enes Muvahhid
Image captioning is a popular yet challenging task that lies at the intersection of Computer Vision and Natural Language Processing. A specific branch of image captioning called Novel Object Captioning has drawn attention in recent years. Unlike general image captioning, Novel Object Captioning focuses on describing images that contain novel objects not seen during training. Recently, numerous image captioning approaches have been proposed to improve the quality of the generated captions for both general image captioning and Novel Object Captioning. For Novel Object Captioning, these methods benefit from large object detection datasets. They also utilize a specific set of object tags (class names) for the image. Even though these approaches are very successful in many respects, they require GPU-weeks of training on several large datasets. Furthermore, captions generated by these methods may lack visual grounding and overlook details in the image. Thus, in this thesis, we analyze the dataset, object tag, and object attribute components for Novel Object Captioning. To analyze the effect of dataset size, we perform Visual Vocabulary Pretraining (VIVO) [1] on small-scale [2] and large-scale [3] datasets and compare the resulting captioning performance of a state-of-the-art method [4]. To analyze the effect of tag quality on Novel Object Captioning performance, we compare the performance of captioning methods [4] trained with two different sets of object tags: a large set of tags that lacks novel objects and a small set of tags that includes novel objects. Finally, to obtain richer captions and reduce the number of overlooked details in the image, we propose a novel approach that exploits the attributes of objects in the image. Experimental results are reported for both Novel Object Captioning and general image captioning tasks.
The results show that novel object tags play a vital role in Novel Object Captioning and that the proposed method generates richer and more detailed captions than the baseline.


E. M. Şahin, “Analysis of dataset, object tag, and object attribute components in novel object captioning,” M.S. - Master of Science, Middle East Technical University, 2022.