Abstract:
In this paper, we propose the spatiotemporal co-attention hybrid neural network (SC-HNN), a novel hybrid neural network model with both spatial and temporal attention mechanisms for pose-invariant inertial odometry. The main idea is to extract both local and global features from a window of IMU measurements for velocity prediction. SC-HNN uses a convolutional neural network (CNN) to capture local features and a long short-term memory (LSTM) recurrent neural network (RNN) to extract long-range dependencies. Attention mechanisms are designed and embedded in both the CNN and LSTM modules to improve model representation. Specifically, in the CNN attention block, the convolved features are refined along both the channel and element dimensions. In the LSTM module, softmax scoring is applied to weight the hidden states along the temporal axis. We evaluate SC-HNN on RoNIN, the benchmark with the largest and most natural IMU data. Extensive ablation experiments demonstrate the effectiveness of the SC-HNN model. Compared with the state of the art, the 50th percentile accuracy of SC-HNN is 18.21% higher and the 90th percentile accuracy is 21.15% higher for phone holders who do not appear in the training set. Real-world inertial tracking trials on the CUHK campus further demonstrate the superior generalization ability of the SC-HNN model.
Note to Practitioners: This paper aims to improve the localization accuracy of deep inertial odometry. We focus on indoor localization using only the low-cost IMU embedded in a smartphone, without any restriction on the phone's daily use. An IMU is well suited to indoor localization because of its low power consumption, strong privacy protection, and independence from external infrastructure. This paper proposes a novel hybrid convolutional and recurrent neural network with a set of carefully designed attention mechanisms to improve the representation ability of deep inertial odometry models.
Specifically, convolutional layers are applied to extract the local spatial features among the 6D IMU signals, followed by cascaded channel attention and element attention modules that boost the representation ability of the CNN. The complex long-term dependencies are then identified by the LSTM layers. To adaptively capture the temporal features of the multimodal inertial signals, an attention mechanism weights the hidden states to generate the final features. The effectiveness of the SC-HNN design is validated by extensive ablation studies. To the best of our knowledge, our model is the first HNN fused with attention mechanisms for inertial tracking. Extensive experiments show that the proposed method outperforms the state of the art.
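
The temporal attention step described above, in which softmax scores weight the LSTM hidden states along the time axis, can be illustrated with a minimal sketch. The abstract does not specify the scoring function, so the dot product with a `query` vector used here is an assumption, and the names `temporal_attention`, `hidden`, and `query` are hypothetical. The sketch shows only the general technique of softmax-weighted pooling over time, not the SC-HNN implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(hidden_states, query):
    """Pool per-time-step hidden states into one feature vector.

    hidden_states: list of T vectors (each a list of D floats).
    query: scoring vector of length D. ASSUMPTION: a dot-product score
           stands in for the scoring function, which the abstract does
           not specify; in practice this vector would be learned.
    Returns (context, weights): the weighted sum of the hidden states
    and the softmax weights over the T time steps.
    """
    # One scalar score per time step.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query))
              for h in hidden_states]
    # Normalize the scores into weights that sum to 1 over time.
    weights = softmax(scores)
    # Weighted sum of the hidden states along the temporal axis.
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return context, weights


# Toy input: T = 3 time steps, D = 2 hidden units (illustrative values).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]  # an all-zero query gives every time step equal weight
context, weights = temporal_attention(hidden, query)
```

With an all-zero `query`, every time step receives the same weight and the result reduces to mean pooling over time; a non-zero query shifts the weight toward the time steps whose hidden states align with it.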