Abstract:
Photometric differences are widely used as supervision signals to train neural networks that estimate depth and camera pose from unlabeled monocular videos. However, this supervision is unreliable for model optimization because occlusions and moving objects in a scene violate the underlying static-scene assumption. In addition, pixels in textureless regions and other weakly discriminative pixels hinder training. To address these problems, we handle moving objects and occlusions by exploiting, respectively, the differences between flow fields and the differences between the depth structures produced by affine transformation and by view synthesis. Second, we mitigate the effect of textureless regions on optimization by measuring differences between features that carry richer semantic and contextual information, without requiring additional networks. Moreover, although each sub-objective incorporates a bidirectional component, each image pair is processed only once, which reduces computational overhead. Extensive experiments and visual analyses demonstrate the effectiveness of the proposed method, which outperforms existing state-of-the-art self-supervised methods under the same conditions and without introducing additional auxiliary information.
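To make the photometric supervision signal concrete, the sketch below shows one common formulation (not the authors' implementation): a source frame is warped into the target view via the predicted depth and relative pose (view synthesis), and the per-pixel difference to the target frame serves as the training loss. All tensor names, shapes, and the intrinsics `K` are illustrative assumptions.

```python
# Minimal sketch of a photometric reconstruction loss via view synthesis.
# Assumed inputs (not from the paper): target/source frames, predicted depth
# of the target frame, relative pose target -> source, and intrinsics K.
import torch
import torch.nn.functional as F


def photometric_loss(target, source, depth, pose, K):
    """L1 photometric difference between the target frame and the source
    frame warped into the target view.

    target, source: (B, 3, H, W) images
    depth:          (B, 1, H, W) predicted depth of the target frame
    pose:           (B, 4, 4) relative pose, target -> source
    K:              (B, 3, 3) camera intrinsics
    """
    B, _, H, W = target.shape
    device, dtype = target.device, target.dtype

    # Pixel grid in homogeneous coordinates: (B, 3, H*W)
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=dtype, device=device),
        torch.arange(W, dtype=dtype, device=device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0)
    pix = pix.view(1, 3, -1).expand(B, -1, -1)

    # Back-project target pixels to 3D points using the predicted depth
    cam_points = torch.linalg.inv(K) @ pix * depth.view(B, 1, -1)

    # Transform points into the source camera and project back to pixels
    ones = torch.ones(B, 1, H * W, dtype=dtype, device=device)
    cam_points_h = torch.cat([cam_points, ones], dim=1)          # (B, 4, H*W)
    src_points = K @ (pose @ cam_points_h)[:, :3, :]             # (B, 3, H*W)
    src_pix = src_points[:, :2, :] / src_points[:, 2:3, :].clamp(min=1e-6)

    # Normalize coordinates to [-1, 1] and warp the source frame
    x = 2.0 * src_pix[:, 0, :].view(B, H, W) / (W - 1) - 1.0
    y = 2.0 * src_pix[:, 1, :].view(B, H, W) / (H - 1) - 1.0
    grid = torch.stack([x, y], dim=-1)                           # (B, H, W, 2)
    warped = F.grid_sample(source, grid, padding_mode="border",
                           align_corners=True)

    # Per-pixel photometric difference; occlusions, moving objects, and
    # textureless regions make this term unreliable, which motivates the
    # additional flow, depth-structure, and feature cues described above.
    return (target - warped).abs().mean()
```

In practice such a loss is typically combined with masking and robust terms; the abstract's contribution lies in replacing or augmenting this raw photometric signal with flow-difference, depth-structure, and feature-difference cues.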