Spatial Temporal Graph Network for Video Crowd Counting

Abstract:

In recent years, researchers have developed many deep-learning-based methods to estimate crowd counts in static images. However, far fewer works focus on video-based crowd counting, in which the critical challenge of temporal correlation has not been well explored. This paper proposes a Spatial-Temporal Graph Network (STGN) to achieve efficient and accurate crowd counting in videos by learning pixel-wise and patch-wise relations in local spatial-temporal domains. Specifically, we design a pyramid graph module to leverage multi-scale features. At each scale, we sequentially construct three graphs: a spatial-temporal pixel graph, a temporal patch graph, and a spatial pixel graph, in which we apply the self-attention mechanism to capture pixel-wise relations, learn structure-aware relations, and aggregate local features, respectively. Furthermore, we propose a spatial-aware channel-wise attention to fuse multi-scale features effectively. To demonstrate the effectiveness of the proposed method, we conduct experiments on five crowd counting datasets, including a large-scale video crowd dataset (FDST). Moreover, the proposed model is also applied to a vehicle counting dataset (TRANCOS). The results show that the proposed model outperforms existing spatial-temporal crowd counting models and achieves state-of-the-art performance.
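
Since the abstract only sketches the architecture, the following is a minimal PyTorch illustration of what the spatial-aware channel-wise attention fusion could look like. The class name `SpatialAwareChannelAttention`, the 1x1-convolution gating, and all tensor shapes are assumptions made for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of spatial-aware channel-wise attention for fusing
# multi-scale features; layer choices and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAwareChannelAttention(nn.Module):
    """Reweights channels of fused multi-scale features, letting the
    attention weights vary per spatial location (hence "spatial-aware")."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # 1x1 convs compute the channel weights at every pixel,
        # conditioned on the local feature vector at that location.
        self.attn = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, scales: list) -> torch.Tensor:
        # Upsample every scale to the finest resolution, then concatenate
        # along the channel dimension.
        target = scales[0].shape[-2:]
        fused = torch.cat(
            [F.interpolate(s, size=target, mode="bilinear",
                           align_corners=False) for s in scales],
            dim=1)
        # Per-pixel, per-channel gate in [0, 1].
        return fused * self.attn(fused)

# Toy usage: three pyramid feature maps with 64 channels each.
feats = [torch.randn(1, 64, 32 // 2**i, 32 // 2**i) for i in range(3)]
out = SpatialAwareChannelAttention(channels=192)(feats)
print(out.shape)  # torch.Size([1, 192, 32, 32])
```

Computing the gate with 1x1 convolutions makes the channel weights depend on each spatial location, which is one plausible reading of "spatial-aware"; a global pooling variant would instead produce a single weight per channel for the whole map.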