A Multimodal Aggregation Network With Serial Self-Attention Mechanism for Micro-Video Multi-Label Classification

Abstract:

Micro-videos have recently attracted increasing attention due to their unique properties and great commercial value. Because micro-videos naturally incorporate multimodal information, a powerful method for learning distinct joint multimodal representations is essential for real-world applications. Inspired by the potential of attention-based neural network architectures across various tasks, we propose a multimodal aggregation network (MANET) with a serial self-attention mechanism for micro-video multi-label classification. Specifically, we first propose a parallel content-dependent graph neural network (CDGNN) module, which explores category-related embeddings of micro-videos by disentangling category relations into modality-specific and modality-shared category dependency patterns. We then introduce a serial self-attention (SSA) module that transmits multimodal information in sequential order, in which an aggregation bottleneck is incorporated to better collect and condense the significant information. Experiments conducted on a large-scale multi-label micro-video dataset demonstrate that our proposed method achieves competitive results compared with several state-of-the-art methods.
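
To make the serial self-attention idea more concrete, the following is a minimal PyTorch sketch of how modality token sequences could be processed in sequential order through a shared self-attention layer, with a small set of learnable bottleneck tokens carrying condensed information from one modality to the next. The class name, dimensions, and the use of nn.TransformerEncoderLayer are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SerialSelfAttentionSketch(nn.Module):
    """Illustrative sketch: attend to each modality in sequence, letting a few
    learnable bottleneck tokens aggregate and pass along salient information."""

    def __init__(self, dim=256, num_heads=4, num_bottleneck_tokens=4):
        super().__init__()
        # Learnable bottleneck tokens shared across the batch (assumption).
        self.bottleneck = nn.Parameter(torch.randn(num_bottleneck_tokens, dim))
        self.attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)

    def forward(self, modality_tokens):
        # modality_tokens: list of tensors, each (batch, seq_len_m, dim),
        # e.g. [visual_tokens, acoustic_tokens, textual_tokens].
        batch = modality_tokens[0].size(0)
        bottleneck = self.bottleneck.unsqueeze(0).expand(batch, -1, -1)
        for tokens in modality_tokens:              # serial (sequential) order
            x = torch.cat([bottleneck, tokens], dim=1)
            x = self.attn(x)                        # joint self-attention
            bottleneck = x[:, :bottleneck.size(1)]  # keep only the bottleneck
        return bottleneck                           # condensed multimodal summary
```

The bottleneck restricts cross-modal exchange to a handful of tokens, which is one plausible way to "collect and condense" significant information as the abstract describes; the resulting summary tokens could then feed a multi-label classification head.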