Abstract:
In the era of social big data, sentiment analysis is attracting increasing attention for its capacity to understand individuals' attitudes and feelings. Traditional sentiment analysis methods focus on a single modality and become ineffective as enormous amounts of data with multiple manifestations emerge on social websites. Recently, multimodal learning approaches have been proposed to capture the relations between image and text, but they only stay at the region level and ignore the fact that image channels are also closely correlated with semantic information. In addition, social images on social platforms are closely connected by various types of relations, which are also conducive to sentiment classification but are neglected by most existing works. In this article, we propose an attention-based heterogeneous relational model to improve multimodal sentiment analysis performance by incorporating rich social information. Specifically, we propose a progressive dual attention module to capture the correlations between image and text, and then learn the joint image-text representation from the perspective of content information. A channel attention schema is proposed to highlight semantically rich image channels, and a region attention schema is further designed to highlight the emotional regions based on the attended channels. After that, we construct a heterogeneous relation network and extend the graph convolutional network to aggregate content information from social contexts as complements, so as to learn high-quality representations of social images. Our proposal is thoroughly evaluated on two benchmark datasets, and experimental results demonstrate the superiority of the proposed model.
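For illustration only, the following is a minimal PyTorch-style sketch of how channel attention followed by region attention over CNN image features could be wired, loosely reflecting the progressive dual attention idea summarized above. All module names, dimensions, and the text-conditioning scheme are assumptions for exposition, not the authors' actual implementation.

```python
# Hypothetical sketch: channel attention, then region attention on the
# channel-attended maps, conditioned on a text embedding (assumed design).
import torch
import torch.nn as nn


class ChannelThenRegionAttention(nn.Module):
    def __init__(self, channels: int, text_dim: int):
        super().__init__()
        # Channel attention: weight each feature-map channel by its relevance
        # to the text query (assumed conditioning scheme).
        self.channel_fc = nn.Linear(channels + text_dim, channels)
        # Region attention: score each spatial location of the attended map.
        self.region_fc = nn.Linear(channels + text_dim, 1)

    def forward(self, img_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W) CNN feature maps; text_feat: (B, D) text embedding
        b, c, h, w = img_feat.shape
        # --- channel attention ---
        pooled = img_feat.mean(dim=(2, 3))                            # (B, C) global pooling
        ch_weights = torch.sigmoid(
            self.channel_fc(torch.cat([pooled, text_feat], dim=1)))   # (B, C)
        attended = img_feat * ch_weights.unsqueeze(-1).unsqueeze(-1)  # reweight channels
        # --- region attention on the channel-attended maps ---
        regions = attended.flatten(2).transpose(1, 2)                 # (B, H*W, C)
        text_exp = text_feat.unsqueeze(1).expand(-1, h * w, -1)       # (B, H*W, D)
        scores = self.region_fc(torch.cat([regions, text_exp], dim=-1))
        region_weights = torch.softmax(scores, dim=1)                 # (B, H*W, 1)
        # Weighted sum over regions yields a joint image-text representation.
        return (regions * region_weights).sum(dim=1)                  # (B, C)


if __name__ == "__main__":
    module = ChannelThenRegionAttention(channels=512, text_dim=300)
    img = torch.randn(2, 512, 7, 7)
    txt = torch.randn(2, 300)
    print(module(img, txt).shape)  # torch.Size([2, 512])
```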