Abstract:
Over the past decades, research on facial expression recognition has been restricted to six basic expressions (anger, fear, disgust, happiness, sadness, and surprise). However, these six categories cannot fully describe the richness and diversity of human emotions. To enhance the recognition capabilities of computers, in this paper we focus on fine-grained facial expression recognition in the wild and build a new benchmark, FG-Emotions, to push the research frontier on this topic. The benchmark extends the original six classes to thirty-three more fine-grained classes. FG-Emotions contains 10,371 images and 1,491 video clips annotated with their fine-grained facial expression categories and landmarks. It also provides several features (e.g., LBP features and dense trajectory features) to facilitate related research. Moreover, on top of FG-Emotions, we propose a new end-to-end Multi-Scale Action Unit (AU)-based Network (MSAU-Net) for image-based facial expression recognition, which learns a more powerful facial representation by focusing directly on locating facial action units and using a “zoom in” operation to aggregate distinctive local features. For video-based recognition, we extend MSAU-Net to a two-stream model (TMSAU-Net) by adding an attention-based module and a temporal stream branch to jointly learn spatial and temporal features. (T)MSAU-Net consistently outperforms existing state-of-the-art methods on FG-Emotions and several other datasets, and serves as a strong baseline to drive future research on fine-grained facial expression recognition in the wild.