Typical Facial Expression Network Using a Facial Feature Decoupler and Spatial Temporal Learning

Abstract:

Facial expression recognition (FER) accuracy is often affected by an individual's unique facial characteristics, so recognition performance can be improved if the influence of these characteristics is minimized. Using video instead of a single image for FER yields better results but requires extracting temporal features and the spatial structure of facial expressions in an integrated manner. We propose a new network, the Typical Facial Expression Network (TFEN), to address both challenges. TFEN uses two deep two-dimensional (2D) convolutional neural networks (CNNs) to extract facial and expression features from an input video. A facial feature decoupler separates facial features from expression features to minimize the influence of inter-subject face variations. These networks are combined with a 3D CNN to form a spatial-temporal learning network that jointly explores the spatial-temporal features in a video. A facial recognition network acts as an adversary, refining the facial feature decoupler and overall performance by minimizing the residual influence of facial features after decoupling. The whole network is trained with an adversarial algorithm to improve FER performance. TFEN was evaluated on four popular dynamic FER datasets; experimental results show that it matches or exceeds the recognition accuracy of state-of-the-art approaches.
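The decoupling-plus-adversarial idea above can be illustrated with a minimal toy sketch. This is an assumption-laden simplification, not the paper's architecture: the decoupler is modeled as an element-wise subtraction of identity features from combined features, and the adversarial objective subtracts a weighted identity score so that training both fits the expression target and suppresses residual identity information. All names (`decouple`, `adversarial_objective`, `lam`) are hypothetical.

```python
# Toy sketch of feature decoupling with an adversarial identity penalty.
# The subtraction-based decoupler and all function names are illustrative
# assumptions, not the actual TFEN components.

def decouple(combined, identity):
    """Remove identity components from combined features, element-wise."""
    return [c - i for c, i in zip(combined, identity)]

def expression_loss(expr_feat, target):
    """Toy squared-error loss against a target expression feature vector."""
    return sum((e - t) ** 2 for e, t in zip(expr_feat, target))

def adversarial_objective(expr_feat, target, identity_score, lam=0.5):
    """Fit the expression target while penalizing how well an identity
    network can still score identity from expr_feat, hence the negative
    sign on the identity term (the adversarial part of training)."""
    return expression_loss(expr_feat, target) - lam * identity_score

# Example: expression-branch features mix identity and expression content;
# subtracting the facial-branch features leaves expression-only features.
combined = [9.0, 4.0, 7.0]  # expression branch (identity + expression)
identity = [5.0, 1.0, 2.0]  # facial (identity) branch
expr = decouple(combined, identity)  # -> [4.0, 3.0, 5.0]
```

In a real implementation the two branches would be deep 2D CNNs and the identity score would come from the facial recognition network; the sign flip on the identity term is what makes the decoupler and the recognizer adversaries.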