Exploring deep spatio-temporal fusion architectures towards late temporal modeling of human action recognition

Kalafoğlu, Muhammet Esat
Visual action recognition (AR) is the problem of identifying the labels of activities that occur in a video. In this thesis, different spatio-temporal representations are analyzed, and the factors that make these representations better suited for AR are determined. Specifically, three main concepts are analyzed: the effects of different architectural choices, the input modalities (RGB, optical flow, human pose), and temporal modeling concepts. Additionally, the joint utilization of BERT-based late temporal modeling with 3D CNN architectures is proposed, and a novel distillation concept is introduced within this approach. First, for the architectural analysis, both 2D and 3D CNN structures are considered. For 3D CNN architectures, the effects of clip length, input spatial resolution, group convolution, and separable 3D convolution are analyzed. This analysis covers popular 3D CNN architectures for AR, such as MFNET, SlowFast Networks, R(2+1)D networks, I3D, MARS networks (knowledge distillation), and various ResNet architectures. Temporal shift modules are also investigated as an extension to 2D CNN architectures. For the input modality analysis, popular two-stream architectures (RGB+Flow) are analyzed within both 2D and 3D CNN architectures. Moreover, as an extension to the RGB and flow modalities, a pose input modality is utilized with an approach that differs from the literature and is studied within the 2D CNN architectures. For the temporal modeling analysis, various techniques are analyzed within 2D CNN architectures, such as average pooling, LSTM, convolutional GRU, BERT, and non-local blocks. As a novel extension, conventional 3D convolutions are combined with late temporal modeling for AR.
The popular temporal global average pooling (TGAP) layer at the end of the 3D convolutional architecture is replaced with the recent Bidirectional Encoder Representations from Transformers (BERT) layer in order to better exploit the attention mechanism of BERT. Such a replacement is shown to improve the performance of many popular 3D convolution architectures, including ResNeXt, I3D, SlowFast, and R(2+1)D. State-of-the-art performance is obtained on both the HMDB51 and UCF101 datasets, with 85.10% and 98.69% Top-1 accuracy, respectively. Finally, a novel knowledge distillation concept is proposed using a 3D-BERT architecture that yields quite promising performance.
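The difference between TGAP and attention-based late temporal modeling can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes the 3D CNN has already produced one feature vector per temporal position, and it stands in for the BERT layer with a single head of scaled dot-product attention driven by a hypothetical classification-style query vector (`cls_query`). The function names and toy values are illustrative assumptions only.

```python
import math


def temporal_average_pool(features):
    """TGAP: average the T per-timestep feature vectors into one vector."""
    t = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / t for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, query):
    """Weight each timestep by its scaled dot-product similarity to `query`.

    Simplified stand-in for a BERT layer: one head, no learned projections,
    no positional encoding, no feed-forward sublayer.
    """
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, f)) / math.sqrt(dim) for f in features
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)
    ]
    return pooled, weights


# Toy clip: three temporal positions, two feature channels (made-up values).
clip_features = [[1.0, 0.0], [0.0, 1.0], [4.0, 0.0]]
cls_query = [1.0, 0.0]  # hypothetical query vector; learned in a real model

avg = temporal_average_pool(clip_features)
pooled, weights = attention_pool(clip_features, cls_query)
```

TGAP gives every temporal position the same weight, whereas the attention variant gives more weight to the positions that match the query. With an all-zero query the attention weights become uniform and the two pooling results coincide.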


Comparison of Cuboid and Tracklet Features for Action Recognition on Surveillance Videos
Bayram, Ulya; Ulusoy, İlkay; Cicekli, Nihan Kesim (2013-01-01)
For recognition of human actions in surveillance videos, action recognition methods in the literature are analyzed, and coherent feature extraction methods that are promising for such videos are identified. Among local methods, the two most popular feature extraction methods (Dollar's "cuboid" feature definition and Raptis and Soatto's "tracklet" feature definition) are tested and compared. Both methods were classified by different methods in their original applications. In order to obtain a more fair ...
Comparison of deep networks for gesture recognition
Sofu, Buğra; Ulusoy, İlkay; Department of Electrical and Electronics Engineering (2021-9-06)
Gesture recognition is an important problem and has been studied over the years, especially in fields such as surveillance systems, analysis of human behavior, and robotics. In this thesis, different state-of-the-art algorithms based on deep learning were implemented and compared in terms of model complexity and accuracy. A new approach was also proposed and compared with them. The tested algorithms can be classified into two main categories: hybrid approaches, which use CNN and LSTM architectu...
Evaluation of Feature Channels for Correlation Filter Based Visual Object Tracking in Infrared Spectrum
Gündoğdu, Erhan; SOLMAZ, Berkan; Koç, AYKUT; HAMMOUD, RI; Alatan, Abdullah Aydın (2016-07-01)
Correlation filters for visual object tracking in visible imagery have been well studied. Most of the correlation-filter-based methods use either raw image intensities or feature maps of gradient orientations or color channels. However, well-known features designed for the visible spectrum may not be ideal for infrared object tracking, since infrared and visible spectra have dissimilar characteristics in general. We assess the performance of two state-of-the-art correlation-filter-based object tracking methods o...
Automatic target recognition and detection in infrared imagery under cluttered background
GÜNDOĞDU, ERHAN; KOÇ, AYKUT; Alatan, Abdullah Aydın (2017-09-14)
Visual object classification has long been studied in the visible spectrum by utilizing conventional cameras. Since the number of labeled images has recently increased, it is possible to train deep Convolutional Neural Networks (CNN) with a significant number of parameters. As infrared (IR) sensor technology has improved during the last two decades, labeled images extracted from IR sensors have started to be used for object detection and recognition tasks. We address the problem of infrared object ...
Aygunes, Bulut; AKSOY, SELİM; Cinbiş, Ramazan Gökberk (2019-01-01)
The challenging task of training object detectors for fine-grained classification faces additional difficulties when there are registration errors between the image data and the ground truth. We propose a weakly supervised learning methodology for the classification of 40 types of trees by using fixed-sized multispectral images with a class label but with no exact knowledge of the object location. Our approach consists of an end-to-end trainable convolutional neural network with separate branches for learni...
Citation Formats
M. E. Kalafoğlu, “Exploring deep spatio-temporal fusion architectures towards late temporal modeling of human action recognition,” M.S. - Master of Science, Middle East Technical University, 2020.