Contextual Attention Network for Emotional Video Captioning

Abstract:

This paper investigates an emerging and challenging task: emotional video captioning. Formally, given a video, the task aims not only to describe the factual content of the video, but also to discover the emotional clues it conveys. We propose a novel Contextual Attention Network (CANet), which recognizes and describes both the facts and the emotions in a video through semantic-rich context learning. Specifically, at each time step, we first extract visual and textual features from the input video and the previously generated words. We then apply an attention mechanism to these features to capture informative contexts for captioning. We train CANet by jointly optimizing a cross-entropy loss L_CE and a contrastive loss L_CL, where L_CE constrains the semantics of the generated sentence to be close to the human annotation and L_CL encourages discriminative representation learning from positive and negative video-caption pairs. Experiments on two emotional video captioning datasets (i.e., EmVidCap and EmVidCap-S) demonstrate the superiority of CANet over state-of-the-art approaches.
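
As a rough illustration of the joint objective described above, the following sketch combines a token-level cross-entropy loss with a symmetric, InfoNCE-style contrastive loss over matched and mismatched video-caption pairs within a batch. The specific contrastive form, the weighting factor lambda_cl, and the temperature tau are illustrative assumptions and are not specified in the abstract.

    # Hedged sketch of a joint captioning objective: cross-entropy on the
    # generated words plus a contrastive loss over video-caption pairs.
    # lambda_cl, tau, and the InfoNCE form are assumptions for illustration.
    import torch
    import torch.nn.functional as F

    def joint_loss(word_logits, target_ids, video_emb, caption_emb,
                   lambda_cl=1.0, tau=0.07, pad_id=0):
        # L_CE: token-level cross-entropy against the human-annotated caption.
        l_ce = F.cross_entropy(
            word_logits.reshape(-1, word_logits.size(-1)),
            target_ids.reshape(-1),
            ignore_index=pad_id,
        )

        # L_CL: matched video/caption embeddings in the batch are positives,
        # all other pairings serve as negatives.
        v = F.normalize(video_emb, dim=-1)
        c = F.normalize(caption_emb, dim=-1)
        logits = v @ c.t() / tau                      # (B, B) similarity matrix
        labels = torch.arange(v.size(0), device=v.device)
        l_cl = 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

        return l_ce + lambda_cl * l_cl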