Text Generation and Comprehension for Objects in Images and Videos

Date

2021-09-09

Author

Anayurt Özyeğin, Hazan

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Text generation from visual data is a problem with a wide range of applications that is often studied using deep learning. This thesis addresses two aspects of the problem, proposing both supervised and unsupervised methods. In the first part of the thesis, we work on referring expression comprehension and generation from videos. We focus on relational referring expressions, which we define as expressions that describe an object with respect to another object. To this end, we first collect a novel dataset of referring expressions and videos in which multiple copies of the same object appear, making relational referring expressions necessary to describe them. We then train two baseline deep networks on this dataset, which show promising results. Finally, we propose a deep attention network that significantly outperforms the baselines on our dataset.

In the second part of the thesis, we tackle the same problem in an unsupervised way. Models that generate text from videos or images tend to be supervised, which means that every visual example in their datasets needs a corresponding textual description. However, collecting such paired data is costly, and much of the available data is unlabeled. Because the lack of data was one of the bottlenecks in the supervised part of the thesis, in this part we consider the problem in an unsupervised setting. For this, we adapt the CycleGAN architecture of Zhu et al. to map between the visual and text domains. We use this architecture in experiments on several video and image captioning datasets, achieving promising results on some of them.
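
As a rough illustration of the cycle-consistency idea behind CycleGAN, the sketch below computes an L1 reconstruction loss for two mappings between a "visual" and a "text" feature space. It is not the thesis's model: the feature vectors, the toy mappings `to_text`, `to_visual`, and `lossy_to_visual`, and the function names are all hypothetical, and a real system would presumably use learned networks, adversarial losses, and some continuous representation of text.

```python
from typing import Callable, List

Vector = List[float]
Mapping = Callable[[Vector], Vector]


def l1_distance(a: Vector, b: Vector) -> float:
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(
    visual_batch: List[Vector],
    text_batch: List[Vector],
    to_text: Mapping,
    to_visual: Mapping,
) -> float:
    """Sum of the visual->text->visual and text->visual->text L1 errors."""
    visual_term = sum(
        l1_distance(to_visual(to_text(v)), v) for v in visual_batch
    ) / len(visual_batch)
    text_term = sum(
        l1_distance(to_text(to_visual(t)), t) for t in text_batch
    ) / len(text_batch)
    return visual_term + text_term


# Toy stand-ins for learned generators (assumptions, for illustration only).
def to_text(v: Vector) -> Vector:
    return [2.0 * x for x in v]


def to_visual(t: Vector) -> Vector:
    return [0.5 * x for x in t]  # exact inverse of to_text


def lossy_to_visual(t: Vector) -> Vector:
    return [0.25 * x for x in t]  # not an inverse, so cycles do not close


visual_batch = [[1.0, 2.0], [3.0, 4.0]]
text_batch = [[2.0, 4.0]]

perfect = cycle_consistency_loss(visual_batch, text_batch, to_text, to_visual)
lossy = cycle_consistency_loss(visual_batch, text_batch, to_text, lossy_to_visual)
print(perfect, lossy)
```

When the two mappings invert each other, both cycles reconstruct their inputs and the loss is zero; any mismatch shows up as a positive penalty. No paired visual-text examples are needed to compute it, which is what makes this kind of objective usable in an unpaired setting.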

Subject Keywords

Referring Expressions, Video Captioning, Image Captioning, Unsupervised Learning

URI

https://hdl.handle.net/11511/93203

Collections

Graduate School of Natural and Applied Sciences, Thesis

Citation Formats

H. Anayurt Özyeğin, “Text Generation and Comprehension for Objects in Images and Videos,” M.S. - Master of Science, Middle East Technical University, 2021.