Abstract:
Image-based vehicle re-identification (ReID) has witnessed considerable progress in recent years. However, most existing works struggle to extract robust yet discriminative features from a single image to represent a vehicle instance. We argue that images taken from distinct viewpoints, e.g., front and back, exhibit significantly different appearances and patterns for recognition. To identify each vehicle, these models must capture consistent “ID codes” from totally different views, which causes learning difficulties. Additionally, we claim that part-level correspondences among views, i.e., different vehicle parts observed within the same image and the same part visible from different viewpoints, also contribute to instance-level feature learning. Motivated by these observations, we propose to extract comprehensive vehicle instance representations from multiple views by modelling part-wise correlations. To this end, we present an efficient transformer-based framework that exploits both inner- and inter-view correlations for vehicle ReID. Specifically, we first adopt a convnet encoder to condense a series of patch embeddings from each view. Then our efficient transformer, equipped with a distillation token and a noise token in addition to the regular classification token, enforces these patch embeddings to interact with each other regardless of whether they come from the same view or from different views. We conduct extensive experiments on widely used vehicle ReID benchmarks, and our approach achieves state-of-the-art performance, demonstrating the effectiveness of our method.
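To make the pipeline described above concrete, the following is a minimal PyTorch sketch of the overall structure: a convnet encoder condenses each view into patch embeddings, three learnable tokens (classification, distillation, noise) are prepended, and a transformer lets patches from all views attend to one another. This is an illustrative assumption of the architecture, not the authors' released code; the class name `MultiViewReIDNet`, the ResNet-50 backbone choice, all hyper-parameters, and the omission of positional embeddings and of the distillation/noise-token losses are my own simplifications.

```python
# Hypothetical sketch of a multi-view, token-augmented transformer for vehicle ReID.
# NOT the paper's implementation; names and hyper-parameters are illustrative only.
import torch
import torch.nn as nn
import torchvision


class MultiViewReIDNet(nn.Module):
    def __init__(self, num_ids=576, embed_dim=256, depth=4, num_heads=8):
        super().__init__()
        # ConvNet encoder: a ResNet-50 trunk condenses each view into a grid of
        # spatial features, projected to patch embeddings of size embed_dim.
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # (B, 2048, H/32, W/32)
        self.proj = nn.Conv2d(2048, embed_dim, kernel_size=1)

        # Three learnable tokens: classification, distillation, and noise.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.noise_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

        # Transformer encoder over the concatenated patches of all views, so
        # self-attention captures both inner- and inter-view correlations.
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=embed_dim * 4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.id_head = nn.Linear(embed_dim, num_ids)

    def forward(self, views):
        # views: (B, V, 3, H, W) -- V images of the same vehicle from different viewpoints.
        b, v = views.shape[:2]
        feats = self.proj(self.backbone(views.flatten(0, 1)))   # (B*V, D, h, w)
        patches = feats.flatten(2).transpose(1, 2)               # (B*V, N, D)
        patches = patches.reshape(b, -1, patches.shape[-1])      # concat views: (B, V*N, D)

        tokens = torch.cat([
            self.cls_token.expand(b, -1, -1),
            self.dist_token.expand(b, -1, -1),
            self.noise_token.expand(b, -1, -1),
            patches], dim=1)
        out = self.transformer(tokens)
        return self.id_head(out[:, 0])                           # ID logits from the cls token


if __name__ == "__main__":
    model = MultiViewReIDNet()
    dummy = torch.randn(2, 4, 3, 256, 256)   # batch of 2 vehicles, 4 views each
    print(model(dummy).shape)                 # torch.Size([2, 576])
```

In this sketch the identity logits are read from the classification token; in a full system the distillation and noise tokens would carry their own objectives, which are left out here for brevity.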