Abstract:
For AR applications, existing monocular VO methods still leave room for improvement in providing real-time and accurate self-localization in dynamic environments with motion disturbances. This letter proposes CCVO (Cascaded CNNs for Visual Odometry), a monocular VO approach that realizes end-to-end pose estimation with two cascaded CNNs. The first CNN detects trackable feature points and concurrently performs semantic segmentation within milliseconds. Feature points belonging to dynamic objects are removed as outliers to reduce their interference with pose estimation. The second CNN takes the static feature points of two consecutive images as input and predicts the transformation matrix at true scale. Our experiments show that CCVO achieves better real-time performance as well as satisfactory positioning accuracy and generalization ability compared with traditional and DL (Deep Learning)-based VO methods. The results of the geometric consistency check and the forward-backward consistency check also demonstrate its potential as an effective vSLAM front end for AR applications.
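The following is a minimal sketch, not the authors' implementation, of the cascaded data flow described above. The function names (`detect_and_segment`, `estimate_pose`, `ccvo_step`) and the placeholder logic inside them are hypothetical; they only illustrate how dynamic feature points might be filtered out between the two stages before pose regression.

```python
# Illustrative sketch of a CCVO-style cascaded pipeline; the stage
# implementations are placeholders, not the paper's actual CNN architectures.
import numpy as np


def detect_and_segment(image):
    """Stage 1 (placeholder): return feature-point coordinates and a
    per-point 'dynamic' flag derived from semantic segmentation."""
    h, w = image.shape[:2]
    rng = np.random.default_rng(0)
    points = rng.uniform([0, 0], [w, h], size=(200, 2))  # stand-in keypoints
    dynamic = rng.random(200) < 0.2                      # stand-in dynamic-object labels
    return points, dynamic


def estimate_pose(static_pts_prev, static_pts_curr):
    """Stage 2 (placeholder): regress a 4x4 transformation matrix at true
    scale from the static feature points of two consecutive frames."""
    return np.eye(4)  # identity stands in for the CNN's predicted transform


def ccvo_step(prev_image, curr_image):
    # First CNN: feature detection + semantic segmentation on both frames.
    pts_prev, dyn_prev = detect_and_segment(prev_image)
    pts_curr, dyn_curr = detect_and_segment(curr_image)
    # Remove feature points on dynamic objects (treated as outliers).
    static_prev = pts_prev[~dyn_prev]
    static_curr = pts_curr[~dyn_curr]
    # Second CNN: predict the relative camera transformation.
    return estimate_pose(static_prev, static_curr)


if __name__ == "__main__":
    frame0 = np.zeros((480, 640, 3), dtype=np.uint8)
    frame1 = np.zeros((480, 640, 3), dtype=np.uint8)
    print(ccvo_step(frame0, frame1))
```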