A Survey on Deep Learning Based Image Captioning

Abstract:

Image captioning uses computer vision to extract the semantic content of an image and natural language processing to generate a reasonable textual caption for it. This paper traces the development of deep learning based image captioning and introduces the representative methods of each period. Most current methods adopt an encoder-decoder structure. The paper discusses and summarizes the advantages and disadvantages of methods that improve the encoder or the decoder, add an attention mechanism, or introduce the Transformer architecture. The datasets commonly used in this field, such as MS-COCO and Flickr, and the standard evaluation metrics are also introduced in detail.