Comparison of deep networks for gesture recognition

Sofu, Buğra
Gesture recognition is an important problem that has been studied for years, especially in fields such as surveillance systems, human behavior analysis, and robotics. In this thesis, several state-of-the-art deep learning algorithms were implemented and compared in terms of model complexity and accuracy. A new approach was also proposed and compared with them. The tested algorithms fall into two main categories: hybrid approaches, which apply CNN and LSTM architectures in succession, and three-dimensional convolutional neural networks (3D-CNNs). For the hybrid approaches, we studied CNN-LSTM models and investigated the effect of different feature extractors, namely the Inception-V3 and ResNext50 models. For the ResNext50 architecture, in addition to the original network, we included an attention model called the Squeeze and Excitation (SE) block. With this new approach, a 21% accuracy increase was reached while the number of parameters decreased, which means lower model complexity than the original approach. For the 3D-CNNs, the I3D model, which has pre-trained ImageNet weights, was applied and compared with C3D models, which cannot use ImageNet weights directly. The ability to use ImageNet weights gives the advantage of fast training, since the network is initialized with ImageNet features, and can also result in a more accurate and effective model overall. A 16.5% accuracy increase was obtained for the 3D-CNN architecture when the I3D model was trained on the Kinetics dataset.
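As an illustration of the Squeeze and Excitation block mentioned above, the following is a minimal sketch of the general technique in plain Python, not the thesis implementation. It assumes the commonly described form of the block: a global average pool per channel (squeeze), two fully connected layers with ReLU and sigmoid (excitation), and a per-channel rescaling of the input. The function name `squeeze_excite`, the toy shapes, the reduction ratio of 2, and the zero-valued weights are all hypothetical choices made for the example; the abstract does not state where the block was placed in ResNext50 or how it was configured.

```python
import math

def squeeze_excite(x, w1, b1, w2, b2):
    """Apply a Squeeze and Excitation step to one feature map.

    x  : list of C channels, each an H x W list of lists of floats
    w1 : (C/r) x C weights of the bottleneck layer, b1 its biases
    w2 : C x (C/r) weights of the expansion layer, b2 its biases
    """
    # Squeeze: global average pooling, one scalar per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation, step 1: bottleneck fully connected layer with ReLU.
    h = [max(0.0, sum(w * v for w, v in zip(row, z)) + b)
         for row, b in zip(w1, b1)]
    # Excitation, step 2: expansion layer with sigmoid, one gate per channel.
    s = [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, h)) + b)))
         for row, b in zip(w2, b2)]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[[v * g for v in row] for row in ch] for ch, g in zip(x, s)]

# Toy input (assumed values): 4 channels of size 2 x 2, channel c filled with c + 1.
x = [[[float(c + 1)] * 2 for _ in range(2)] for c in range(4)]
# Zero weights and biases (assumed) make every gate equal to sigmoid(0) = 0.5.
w1 = [[0.0] * 4 for _ in range(2)]
b1 = [0.0, 0.0]
w2 = [[0.0] * 2 for _ in range(4)]
b2 = [0.0] * 4
y = squeeze_excite(x, w1, b1, w2, b2)
```

With trained weights, the gates differ per channel, so informative channels are kept near their original magnitude and others are suppressed. The two small layers add few parameters relative to the convolutional backbone.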


B. Sofu, “Comparison of deep networks for gesture recognition,” M.S. - Master of Science, Middle East Technical University, 2021.