Abstract:
Learning-based video saliency detection, an important problem in computer vision, aims to discover the visually interesting regions in a video sequence. Capturing intra-frame and inter-frame information from several complementary aspects (spatial context, motion information, temporal consistency across frames, and multiscale representation) is important for this task. A key challenge is how to jointly model all of these factors within a unified, data-driven, end-to-end scheme. In this article, we propose an end-to-end spatiotemporal deep video saliency detection approach that captures both spatial context and motion characteristics. It further encodes temporal consistency across consecutive frames with a convolutional long short-term memory (Conv-LSTM) module, and adaptively integrates multiscale saliency cues for each frame through a collaborative feature pyramid to produce the final saliency prediction. All of these components are unified into a single end-to-end, jointly trained deep learning scheme. Experimental results demonstrate the effectiveness of our approach in comparison with state-of-the-art methods.
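To make the temporal-consistency component concrete, below is a minimal sketch of a Conv-LSTM cell in PyTorch. All names, shapes, and hyperparameters (ConvLSTMCell, hidden_channels, the kernel size) are our own illustrative assumptions and do not reproduce the paper's implementation; the sketch only shows the general mechanism of replacing the fully connected gates of an LSTM with convolutions so that spatial structure is preserved across time.

# Illustrative Conv-LSTM cell; an assumption-laden sketch, not the authors' code.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """LSTM cell whose gates are computed with convolutions instead of
    fully connected layers, so hidden states remain spatial feature maps."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution produces all four gates at once
        # (input, forget, output, candidate).
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size, padding=padding)
        self.hidden_channels = hidden_channels

    def forward(self, x, state):
        h, c = state  # hidden and cell states, each (B, hidden_channels, H, W)
        z = self.gates(torch.cat([x, h], dim=1))
        i, f, o, g = torch.chunk(z, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c + i * g        # update cell state
        h = o * torch.tanh(c)    # new hidden state, still a feature map
        return h, (h, c)

# Usage: run the cell over per-frame feature maps of a short clip.
if __name__ == "__main__":
    B, T, C, H, W = 2, 5, 64, 28, 28       # batch, frames, channels, height, width
    feats = torch.randn(B, T, C, H, W)     # hypothetical per-frame spatial features
    cell = ConvLSTMCell(in_channels=C, hidden_channels=32)
    h = torch.zeros(B, 32, H, W)
    c = torch.zeros(B, 32, H, W)
    for t in range(T):                     # propagate temporal context frame by frame
        out, (h, c) = cell(feats[:, t], (h, c))
    print(out.shape)                       # torch.Size([2, 32, 28, 28])

In a full saliency pipeline of the kind the abstract describes, the per-frame inputs would come from a spatial/motion feature extractor and the recurrent output would feed a prediction head; those stages are omitted here for brevity.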