Abstract:
Facial expression recognition from images is a challenging problem in computer vision. The convolutional neural network (CNN), the state-of-the-art method for many computer vision tasks, has had limited success in predicting expressions from faces under extreme pose, illumination, and occlusion conditions. To mitigate this issue, CNNs are often combined with techniques such as transfer, multitask, or ensemble learning, which deliver high accuracy at the cost of increased computational complexity. In this article, the authors propose a part-based ensemble transfer learning network that models how humans recognize facial expressions: by correlating the visual patterns produced by facial muscles' motor movements with a specific expression. The proposed network performs transfer learning from facial landmark localization to facial expression recognition. It consists of five subnetworks, each of which transfers learning from one of five subsets of facial landmarks (eyebrows, eyes, nose, mouth, or jaw) to expression classification. The network's performance is evaluated on the Cohn-Kanade (CK+), Japanese female facial expression (JAFFE), and static facial expressions in the wild (SFEW) datasets, and it outperforms the benchmark on CK+ and JAFFE by 0.51% and 5.34%, respectively. Additionally, the proposed ensemble network contains only 1.65 M model parameters, ensuring computational efficiency during training and real-time deployment. Gradient-weighted class activation mapping visualizations of the network reveal the complementary nature of its subnetworks, a key design parameter of an effective ensemble. Lastly, cross-dataset evaluation results show that the proposed ensemble has high generalization capacity, making it suitable for real-world use.
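The part-based fusion idea described above can be illustrated with a minimal sketch: five per-region predictors, one per landmark subset, whose class-probability outputs are averaged into an ensemble decision. The linear-plus-softmax "subnetworks", the 16-dimensional region features, the seven-class output, and the averaging fusion rule are all illustrative assumptions, not the authors' actual architecture.

```python
import numpy as np

# Illustrative constants (assumptions, not taken from the paper's code)
REGIONS = ["eyebrows", "eyes", "nose", "mouth", "jaw"]
NUM_CLASSES = 7   # e.g., seven basic expression categories
FEAT_DIM = 16     # toy per-region feature dimension


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def subnetwork_predict(region_features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Stand-in for one region subnetwork: a linear layer + softmax."""
    return softmax(region_features @ weights)


def ensemble_predict(features_by_region: dict, weights_by_region: dict) -> np.ndarray:
    """Fuse the five subnetworks by averaging their class probabilities.

    Probability averaging is one common fusion rule for complementary
    subnetworks; the paper's exact fusion strategy may differ.
    """
    probs = [
        subnetwork_predict(features_by_region[r], weights_by_region[r])
        for r in REGIONS
    ]
    return np.mean(probs, axis=0)


# Toy usage with random features and weights
rng = np.random.default_rng(0)
feats = {r: rng.normal(size=(1, FEAT_DIM)) for r in REGIONS}
w = {r: rng.normal(size=(FEAT_DIM, NUM_CLASSES)) for r in REGIONS}
p = ensemble_predict(feats, w)          # shape (1, NUM_CLASSES)
pred = int(p.argmax(axis=-1)[0])        # ensemble's predicted expression index
```

Averaging probabilities rather than hard labels lets a confident subnetwork (e.g., the mouth branch for a smile) dominate the fused decision, which is the complementarity property the article's Grad-CAM analysis highlights.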