Vision Language Transformer and Query Generation for Referring Segmentation

Abstract:

It is challenging for an agent to interpret visual and language information simultaneously and to decide which actions to take in response. The vision-and-language navigation task was recently proposed for this setting: the agent navigates a real 3-D indoor environment based on a language instruction and the visual information visible from its current viewpoint. The key to this task is that the agent must understand both modalities, vision and language, in an unknown environment to navigate effectively. In this study, we capture the alignment between visual features and language features using a cross-modal feature fusion method. The fusion module is built on attention, so that visual features carry language information and language features carry visual information; this allows the model to learn richer feature relationships and improves the success rate (SR) of agent navigation. Given the practical importance of efficient navigation, we aim to shorten the agent's trajectory as much as possible while still ensuring that it reaches the target position. To this end, we employ a reinforcement learning algorithm based on advantage actor-critic to constrain the agent's action selection. To further improve performance and narrow the gap between the agent's performance in known and unknown environments, we propose the data augmentation method Cro-Speaker, together with three training methods based on it: Speaker data augmentation (SD), Cro-Speaker data augmentation (CSD), and Speaker and Cro-Speaker data augmentation (SCSD). We evaluate the proposed method on the Room-to-Room dataset. The results show that the proposed method improves the SR of agent navigation, shortens the navigation trajectory, and generalizes well in both known and unknown environments.
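
To make the idea of attention-based cross-modal fusion concrete, the following is a minimal sketch and not the architecture used in this work. It assumes that visual and language features share the same dimension, omits learned projection matrices, and uses a residual addition to combine each feature with its attended context. The function names `cross_attend` and `cross_modal_fusion` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, context):
    """Scaled dot-product attention from `queries` onto `context`.

    Each query vector is returned as itself plus an attention-weighted
    mix of the context vectors (a residual connection; an assumption
    made for this sketch, with no learned projections).
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    fused = []
    for q in queries:
        weights = softmax([dot(q, c) / scale for c in context])
        mixed = [sum(w * c[i] for w, c in zip(weights, context))
                 for i in range(dim)]
        fused.append([qi + mi for qi, mi in zip(q, mixed)])
    return fused


def cross_modal_fusion(visual, language):
    """Fuse in both directions: vision attends to language and vice versa."""
    visual_fused = cross_attend(visual, language)
    language_fused = cross_attend(language, visual)
    return visual_fused, language_fused


# Toy inputs (hypothetical): three visual features, two word features.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
language = [[0.5, 0.5], [1.0, -1.0]]
visual_fused, language_fused = cross_modal_fusion(visual, language)
```

After fusion, each visual feature contains a language-dependent component and each language feature contains a vision-dependent component, while the number and dimension of features in each modality are unchanged. A practical model would typically add learned query, key, and value projections and multiple attention heads.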