Abstract:
Visible-thermal person re-identification (VT-ReID) is an image retrieval task that aims to match a target pedestrian across the visible and thermal modalities. However, intra-class variations and the cross-modality discrepancy degrade VT-ReID performance. Recent methods focus on extracting discriminative local features from each modality to alleviate intra-class variations and the cross-modality discrepancy, but they ignore the semantic relations between the local features of the two modalities, i.e., the spatial relations and the channel relations. In this paper, we propose a feature aggregation module (FAM) that enhances the correlations between local features, including both spatial and channel dependencies. FAM then performs cross-modality feature aggregation on the enhanced features to reduce the cross-modality discrepancy. We further propose a near neighbor cross-modality loss (NNCLoss) that mines feature consistency between modalities by constructing a cross-modality near neighbor set, which facilitates feature alignment between the two modalities. Extensive experiments on two datasets demonstrate the superior performance of our approach over existing state-of-the-art methods.
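To make the two contributions summarized above concrete, the following PyTorch sketches illustrate one plausible reading of each. The abstract does not specify the actual architecture or loss formulation, so all module names (`FAMSketch`, `nnc_loss_sketch`), layer choices, and the aggregation and neighbor-selection rules here are assumptions for illustration only. The first sketch enhances each modality's local features with channel and spatial dependencies, then aggregates the enhanced maps across modalities:

```python
import torch
import torch.nn as nn

class FAMSketch(nn.Module):
    """Hypothetical feature aggregation module (FAM) sketch.

    Enhances local features with channel and spatial dependencies,
    then aggregates the enhanced features across the two modalities.
    The paper's actual design may differ; this is illustrative only.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel dependencies: squeeze-and-excitation style gating (assumption).
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial dependencies: a 7x7 conv over pooled channel statistics (assumption).
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def enhance(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)  # reweight channel relations
        pooled = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1
        )
        return x * self.spatial_gate(pooled)  # reweight spatial relations

    def forward(self, visible: torch.Tensor, thermal: torch.Tensor):
        v, t = self.enhance(visible), self.enhance(thermal)
        # Cross-modality aggregation: mix the mean of the enhanced maps
        # back into each branch (one simple choice among many).
        shared = 0.5 * (v + t)
        return v + shared, t + shared


# Usage on dummy feature maps (batch of 4, 256 channels, 24x12 spatial grid).
fam = FAMSketch(channels=256)
vis_out, thr_out = fam(torch.randn(4, 256, 24, 12), torch.randn(4, 256, 24, 12))
print(vis_out.shape, thr_out.shape)  # torch.Size([4, 256, 24, 12]) twice
```

The second sketch gives one possible form of a near neighbor cross-modality loss: for each embedding in one modality, a near neighbor set of the k closest embeddings from the other modality is constructed, and the anchor is pulled toward that set. The neighbor count, distance metric, and symmetric averaging are all assumed:

```python
import torch
import torch.nn.functional as F

def nnc_loss_sketch(feat_v: torch.Tensor, feat_t: torch.Tensor, k: int = 4):
    """Hypothetical near neighbor cross-modality loss (NNCLoss) sketch.

    Builds a cross-modality near neighbor set for each anchor and
    minimizes the distance to it, encouraging feature alignment.
    The paper's actual neighbor selection and weighting may differ.
    """
    feat_v = F.normalize(feat_v, dim=1)
    feat_t = F.normalize(feat_t, dim=1)
    dist = torch.cdist(feat_v, feat_t)  # pairwise L2 distances, (Nv, Nt)
    # k nearest thermal neighbors per visible anchor, and vice versa.
    v2t = dist.topk(k, dim=1, largest=False).values.mean()
    t2v = dist.topk(k, dim=0, largest=False).values.mean()
    return 0.5 * (v2t + t2v)

# Usage on dummy embeddings (batch of 32, 512-d per modality).
loss = nnc_loss_sketch(torch.randn(32, 512), torch.randn(32, 512))
print(loss.item())
```

In both sketches the cross-modality coupling is deliberately minimal; the point is only to show where spatial/channel enhancement, cross-modality aggregation, and near-neighbor mining would sit in a training pipeline.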