Abstract:
Video person retrieval aims at matching video clips of the same person across non-overlapping camera views, where video sequences contain richer information than still images, e.g., temporal cues. Extracting useful temporal cues is key to the success of a video person retrieval system. Gait, a unique biometric modality describing the way a person walks, carries rich temporal information. To date, however, it remains unclear how to fully exploit gait to boost the performance of video person retrieval. In this paper, to validate whether gait can help retrieve persons in videos, we build a two-stream architecture, named the appearance-gait network (AGNet), which jointly learns appearance features and gait features from RGB video clips and silhouette video clips, respectively. We further explore how to fully exploit gait features to enhance the video feature representation. Specifically, we propose an appearance-gait attention module (AGA) that fuses the two streams into a discriminative feature representation for the person retrieval task. Furthermore, to remove the need for silhouette video clips during inference, we propose a simple yet effective appearance-gait distillation module (AGD) that transfers gait knowledge to the appearance stream. As such, we can perform enhanced video person retrieval without silhouette video clips at inference time, which makes inference more flexible and practical. To the best of our knowledge, our work is the first to successfully introduce such an appearance-gait knowledge distillation design for video person retrieval. We verify the effectiveness of the proposed method on two large-scale, challenging benchmarks, MARS and DukeMTMC-VideoReID. Extensive experiments demonstrate performance superior or comparable to state-of-the-art methods while being much simpler.
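To make the described design concrete, the sketch below illustrates the two-stream idea under heavy assumptions: the module names (AGNet, AGA, AGD) follow the abstract, but the tiny 3D-conv encoders, feature dimensions, gated-sum fusion, and MSE-based distillation objective are hypothetical stand-ins chosen only for illustration, not the paper's actual architecture or losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AGAFusion(nn.Module):
    """Hypothetical appearance-gait attention (AGA): a learned gate that
    mixes the appearance and gait features into one fused descriptor."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, f_app, f_gait):
        a = self.gate(torch.cat([f_app, f_gait], dim=-1))  # attention weights in [0, 1]
        return a * f_app + (1.0 - a) * f_gait               # fused clip descriptor


class AGNetSketch(nn.Module):
    """Toy two-stream network: tiny 3D-conv encoders stand in for the real
    appearance/gait backbones, which the abstract does not specify."""
    def __init__(self, dim=128):
        super().__init__()
        def encoder(in_ch):
            return nn.Sequential(
                nn.Conv3d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, dim))
        self.app_enc = encoder(3)    # RGB clip -> appearance feature
        self.gait_enc = encoder(1)   # silhouette clip -> gait feature
        self.fuse = AGAFusion(dim)

    def forward(self, rgb_clip, sil_clip=None):
        f_app = self.app_enc(rgb_clip)
        if sil_clip is None:                  # inference: appearance stream only
            return f_app, None, None
        f_gait = self.gait_enc(sil_clip)      # training: gait stream available
        return f_app, f_gait, self.fuse(f_app, f_gait)


def agd_distill_loss(f_app, f_gait):
    """Hypothetical appearance-gait distillation (AGD) objective: pull the
    appearance feature toward the (detached) gait feature so gait knowledge
    is absorbed by the appearance stream during training."""
    return F.mse_loss(F.normalize(f_app, dim=-1),
                      F.normalize(f_gait.detach(), dim=-1))


if __name__ == "__main__":
    model = AGNetSketch()
    rgb = torch.randn(2, 3, 8, 64, 32)   # (batch, channels, frames, H, W)
    sil = torch.randn(2, 1, 8, 64, 32)
    f_app, f_gait, f_fused = model(rgb, sil)
    loss = agd_distill_loss(f_app, f_gait)   # added to the usual re-ID losses
    print(f_fused.shape, loss.item())
```

In this reading, the distillation term is what lets silhouettes be dropped at test time: once training ends, only `app_enc` is queried, yet its features have been regularized toward the gait stream.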