Memorial GAN With Joint Semantic Optimization for Unpaired Image Captioning

Abstract:

Most image captioning methods are trained under full supervision from paired image–caption data. Because collecting such paired data is expensive, the task of unpaired image captioning has attracted researchers’ attention. In this article, we propose a novel memorial GAN (MemGAN) with joint semantic optimization for unpaired image captioning. The core idea is to explore the implicit semantic correlation between disjoint images and sentences by building a multimodal semantic-aware space (SAS). Concretely, each modality is mapped into a unified multimodal SAS, which includes the semantic vectors of the image I, the visual concepts O, the unpaired sentence S, and the generated caption C. We adopt a memory unit based on multihead attention and a relational gate as the backbone to preserve and transfer crucial multimodal semantics in the SAS for image caption generation and sentence reconstruction. The memory unit is then embedded into a GAN framework to exploit semantic similarity and relevance in the SAS, that is, imposing a joint semantic-aware optimization on the SAS without paired supervision. In summary, the proposed MemGAN learns the latent semantic relevance among the SAS’s modalities in an adversarial manner. Extensive experiments and qualitative results demonstrate the effectiveness of MemGAN, which achieves improvements over state-of-the-art methods on unpaired image captioning benchmarks.
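To make the memory-unit idea concrete, the following is a minimal, hypothetical sketch in PyTorch of one possible realization: multihead attention reads from a set of learned memory slots, and a relational gate decides how much of the retrieved memory to mix into each semantic vector. The slot count, dimensions, and exact gating form are assumptions for illustration, not the authors’ reported design.

```python
import torch
import torch.nn as nn

class MemoryUnit(nn.Module):
    """Illustrative memory unit: attention-based read from shared memory slots
    followed by a gated fusion with the input semantic vectors (assumed design)."""

    def __init__(self, dim=512, num_slots=16, num_heads=8):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_slots, dim))          # learned memory slots
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)                              # relational gate over [input; retrieved]

    def forward(self, x):
        # x: (batch, seq_len, dim) semantic vectors from any SAS modality (I, O, S, or C)
        mem = self.memory.unsqueeze(0).expand(x.size(0), -1, -1)         # share slots across the batch
        retrieved, _ = self.attn(query=x, key=mem, value=mem)            # read memory via multihead attention
        g = torch.sigmoid(self.gate(torch.cat([x, retrieved], dim=-1)))  # per-dimension gate
        return g * retrieved + (1.0 - g) * x                             # gated fusion of memory and input

# Example usage: fuse image-region features with the shared memory before decoding a caption.
unit = MemoryUnit()
out = unit(torch.randn(2, 36, 512))   # e.g., 36 region features per image
print(out.shape)                       # torch.Size([2, 36, 512])
```

In the full model, the same unit would serve both caption generation and sentence reconstruction, while the GAN discriminator operates on the SAS vectors it produces.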