Text Generation and Comprehension for Objects in Images and Videos

Anayurt Özyeğin, Hazan
Text generation from visual data is a widely studied deep learning problem with a broad range of applications. This thesis addresses two aspects of the problem, proposing both supervised and unsupervised methods. In the first part of the thesis, we work on referring expression comprehension and generation from videos. Specifically, we work with relational referring expressions, which we define as expressions that describe an object with respect to another object. For this, we first collect a novel dataset of referring expressions and videos in which there are multiple copies of the same object, making relational referring expressions necessary to describe them. We then train two baseline deep networks on this dataset, which show promising results. Finally, we propose a deep attention network that significantly outperforms the baselines on our dataset. In the second part of the thesis, we tackle the same problem in an unsupervised way. Models that generate text from videos or images tend to be supervised, which means that there needs to be a corresponding textual description for every visual example in the datasets they use. However, collecting such paired data is costly, and much of the available data is unlabeled. Because the lack of data was one of the bottlenecks in the supervised part of the thesis, we consider the same problem in an unsupervised setting in this part. For this, we adapt the CycleGAN architecture by Zhu et al. to translate between the visual and text domains. We then use this architecture in experiments on several video and image captioning datasets, achieving promising results on some of them.
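The cycle-consistency idea behind CycleGAN can be illustrated with a minimal sketch. The abstract does not describe the thesis's actual networks or feature representations, so everything below is an assumption made for illustration: visual and text examples are represented as plain float vectors, the two mappings are hypothetical scaling functions rather than trained generators, and the adversarial losses that CycleGAN also uses are left out. Only the L1 cycle term is shown: mapping an example to the other domain and back should reproduce the original.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Mapping = Callable[[Vector], Vector]


def l1_distance(a: Vector, b: Vector) -> float:
    """Mean absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(
    visual: Vector,
    text: Vector,
    visual_to_text: Mapping,
    text_to_visual: Mapping,
) -> float:
    """L1 cycle loss over both directions.

    visual -> text -> visual should recover `visual`, and
    text -> visual -> text should recover `text`.
    """
    visual_cycle = text_to_visual(visual_to_text(visual))
    text_cycle = visual_to_text(text_to_visual(text))
    return l1_distance(visual_cycle, visual) + l1_distance(text_cycle, text)


# Hypothetical stand-in mappings (simple scalings), not trained networks.
def toy_visual_to_text(v: Vector) -> list:
    return [2.0 * x for x in v]


def toy_text_to_visual(t: Vector) -> list:
    return [0.5 * x for x in t]


def lossy_text_to_visual(t: Vector) -> list:
    # Not the inverse of toy_visual_to_text, so the cycle does not close.
    return [0.25 * x for x in t]


if __name__ == "__main__":
    visual_example = [1.0, 2.0]
    text_example = [4.0, 6.0]
    print(cycle_consistency_loss(
        visual_example, text_example, toy_visual_to_text, toy_text_to_visual))
    print(cycle_consistency_loss(
        visual_example, text_example, toy_visual_to_text, lossy_text_to_visual))
```

When the two mappings are exact inverses of each other the loss is zero, and it grows as the round trip drifts from the input. This term needs no paired visual and text examples, which is what makes the approach usable in an unsupervised setting.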


Kilickaya, Mert; Erdem, Erkut; Erdem, Aykut; İKİZLER CİNBİŞ, NAZLI; Çakıcı, Ruket (2014-04-25)
Automatic image captioning, the process of producing a description for an image, is a very challenging problem which has only recently received interest from the computer vision and natural language processing communities. In this study, we present a novel data-driven image captioning strategy, which, for a given image, finds the most visually similar image in a large dataset of image-caption pairs and transfers its caption as the description of the input image. Our novelty lies in employing a recently pr...
Multi-Modal Learning With Generalizable Nonlinear Dimensionality Reduction
KAYA, SEMİH; Vural, Elif (2019-08-26)
In practical machine learning settings, there often exist relations or links between data from different modalities. The goal of multimodal learning algorithms is to efficiently use the information available in different modalities to solve multi-modal classification or retrieval problems. In this study, we propose a multi-modal supervised representation learning algorithm based on nonlinear dimensionality reduction. Nonlinear embeddings often yield more flexible representations compared to linear counterpa...
Mesh Learning for Object Classification using fMRI Measurements
Ekmekci, Ömer; Ozay, Mete; Oztekin, Ilke; GİLLAM, İLKE; Oztekin, Uygar (2013-09-18)
Machine learning algorithms have been widely used as reliable methods for modeling and classifying cognitive processes using functional Magnetic Resonance Imaging (fMRI) data. In this study, we aim to classify fMRI measurements recorded during an object recognition experiment. Previous studies focus on Multi Voxel Pattern Analysis (MVPA) which feeds a set of active voxels in a concatenated vector form to a machine learning algorithm to train and classify the cognitive processes. In most of the MVPA methods,...
Learning semi-supervised nonlinear embeddings for domain-adaptive pattern recognition
Vural, Elif (2019-05-20)
We study the problem of learning nonlinear data embeddings in order to obtain representations for efficient and domain-invariant recognition of visual patterns. Given observations of a training set of patterns from different classes in two different domains, we propose a method to learn a nonlinear mapping of the data samples from different domains into a common domain. The nonlinear mapping is learnt such that the class means of different domains are mapped to nearby points in the common domain in order to...
Data-driven image captioning via salient region discovery
Kilickaya, Mert; Akkuş, Burak Kerim; Çakıcı, Ruket; Erdem, Aykut; Erdem, Erkut; İKİZLER CİNBİŞ, NAZLI (Institution of Engineering and Technology (IET), 2017-09-01)
In the past few years, automatically generating descriptions for images has attracted a lot of attention in computer vision and natural language processing research. Among the existing approaches, data-driven methods have been proven to be highly effective. These methods compare the given image against a large set of training images to determine a set of relevant images, then generate a description using the associated captions. In this study, the authors propose to integrate an object-based semantic image r...
Citation Formats
H. Anayurt Özyeğin, “Text Generation and Comprehension for Objects in Images and Videos,” M.S. - Master of Science, Middle East Technical University, 2021.