Abstract:
Stereo matching is a classical problem in computer vision that has been widely applied in many fields, and in recent years especially in autonomous driving. Speed and accuracy are both desirable in autonomous driving, but they conflict with each other. In this paper, we present CMNet, a lightweight stereo matching architecture that improves the trade-off between speed and accuracy on resource-limited devices. We propose a novel feature extraction network consisting of a patch embedding layer and a ConvMLP-mixer. The patch embedding layer enlarges the receptive field and makes the feature vectors compact. The ConvMLP-mixer increases the accuracy of the disparity map by mixing spatial information in the channel dimension. The absolute difference volume is concatenated with the group-wise correlation volume to provide multi-dimensional matching cost information for the cost aggregation stage. Evaluated on the KITTI 2012 and KITTI 2015 stereo matching datasets, CMNet has an inference time of 8.7 ms on an NVIDIA RTX 2080 Ti GPU. While running faster than real time, it achieves D1-all errors of 3.41% on KITTI 2012 and 3.84% on KITTI 2015, a state-of-the-art trade-off between speed and accuracy. In addition, the lightweight architecture of CMNet achieves an inference time of 40.7 ms on an NVIDIA Jetson Nano, enabling real-time applications on edge devices.
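
To illustrate the cost volume construction described above, the following is a minimal NumPy sketch of concatenating a group-wise correlation volume with an absolute difference volume. It is not the CMNet implementation: the function names, the tensor layout (channels, disparity, height, width), the choice to keep every feature channel in the absolute difference volume, and the zero-filling of columns with no valid match are all assumptions made for illustration.

```python
import numpy as np


def groupwise_correlation(left, right, num_groups):
    """Mean of the channel-wise product within each channel group.

    left, right: feature maps of shape (C, H, W). Returns (num_groups, H, W).
    """
    c, h, w = left.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    prod = (left * right).reshape(num_groups, c // num_groups, h, w)
    return prod.mean(axis=1)


def build_cost_volume(left, right, max_disp, num_groups):
    """Hypothetical helper: stack both volumes along the channel axis.

    Returns an array of shape (num_groups + C, max_disp, H, W).
    Columns that have no valid match at a given disparity stay zero.
    """
    c, h, w = left.shape
    gwc = np.zeros((num_groups, max_disp, h, w), dtype=left.dtype)
    diff = np.zeros((c, max_disp, h, w), dtype=left.dtype)
    for d in range(max_disp):
        # Left pixel at column x is compared with right pixel at column x - d.
        l = left[:, :, d:]
        r = right[:, :, : w - d]
        gwc[:, d, :, d:] = groupwise_correlation(l, r, num_groups)
        diff[:, d, :, d:] = np.abs(l - r)
    return np.concatenate([gwc, diff], axis=0)


# Toy usage with random features (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
left_feat = rng.standard_normal((8, 4, 6)).astype(np.float32)
right_feat = rng.standard_normal((8, 4, 6)).astype(np.float32)
volume = build_cost_volume(left_feat, right_feat, max_disp=3, num_groups=2)
print(volume.shape)  # (10, 3, 4, 6)
```

The correlation channels measure similarity between grouped features, while the difference channels measure per-channel dissimilarity, so the concatenated volume gives the aggregation stage two complementary views of the matching cost.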