Transformer Based Sensor Fusion and Pose Estimation in End-to-End Supervised Learning of Visual Inertial Odometry

2024-09
Kurt, Yunus Bilge
This thesis investigates the application of the Transformer architecture for temporal modeling in visual-inertial odometry (VIO) networks. The objective is to improve pose estimation accuracy by leveraging the attention mechanisms in Transformers, which utilize historical data more effectively than the Long Short-Term Memory (LSTM) networks used in recent methods. The proposed method is end-to-end trainable and requires only monocular camera and IMU measurements during inference. We observe that latent visual-inertial features contain essential information for pose estimation, enabling Transformers to perform effective temporal updates from past measurements within a local window. To facilitate real-time deployment, all attention mechanisms are designed to work with causal masks. This thesis also explores the use of tokenization mechanisms for continuous data in time-series prediction problems, and evaluates regression by classification in the odometry task. The study examines the impact of data uncertainty in supervised end-to-end odometry learning and considers specialized loss functions for the pose space. Experimental results demonstrate that Transformer-based architectures enhance the accuracy of monocular VIO networks, achieving results better than or comparable to state-of-the-art methods on standard odometry datasets.
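For illustration only (this sketch is not taken from the thesis), the causal masking mentioned in the abstract can be shown with a minimal NumPy implementation of scaled dot-product attention: each time step is restricted to attend only to itself and earlier steps, which is what makes streaming, real-time inference possible.

```python
import numpy as np

def causal_mask(t):
    """Boolean (t, t) mask; True entries mark future positions to block."""
    return np.triu(np.ones((t, t), dtype=bool), k=1)

def causal_attention(q, k, v):
    """Scaled dot-product attention over a (T, d) sequence with a causal mask.

    q, k, v: arrays of shape (T, d). Output at step i depends only on
    steps 0..i, so future measurements cannot leak into past estimates.
    """
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)          # (T, T) attention logits
    scores[causal_mask(t)] = -np.inf       # block attention to the future
    # Numerically stable softmax over the key axis.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

Because the first step can attend only to itself, its output is exactly `v[0]`, and perturbing later inputs leaves earlier outputs unchanged; both properties are easy to verify numerically.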
Citation Formats
Y. B. Kurt, “Transformer Based Sensor Fusion and Pose Estimation in End-to-End Supervised Learning of Visual Inertial Odometry,” M.S. - Master of Science, Middle East Technical University, 2024.