Abstract:
Recently, several studies have applied Unsupervised Domain Adaptation (UDA) methods to detection transformers to improve their cross-domain detection performance. However, most of them directly apply adversarial alignment to the lengthy token sequences, which introduces excessive background information and disturbs the alignment process. To tackle this problem, we propose a domain adaptation method designed for the detection transformer, named Selective and Representative Sequence Feature Alignment (SR-SFA). Specifically, SR-SFA contains two modules: a self-guided weight map generation (SWG) module and a classification-guided domain query generation (CQG) module. The SWG module takes full advantage of the transformer's detection capability to locate the foreground parts of the token sequences for local-level alignment. The CQG module introduces image-level multi-label classification as an auxiliary task to capture representative information of the whole image for global-level alignment. As a result, feature alignment is performed more effectively at both the local and global levels. Experiments on two adaptation scenarios demonstrate that our method achieves better performance than other approaches.
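To make the two ideas named in the abstract more concrete, the minimal PyTorch sketch below illustrates, under assumed shapes, module names, and a standard gradient-reversal discriminator (none of which are taken from the paper itself), how a token-level weight map could down-weight background tokens during local adversarial alignment, and how an image-level multi-label head could pool a representative global feature for global-level alignment.

```python
# Minimal sketch (not the paper's implementation): weighted local alignment
# and classification-guided global pooling, with assumed shapes and names.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Gradient reversal layer commonly used for adversarial feature alignment."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output


class DomainDiscriminator(nn.Module):
    """Token-wise domain classifier applied after gradient reversal."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, tokens):
        return self.net(GradReverse.apply(tokens))


def weighted_local_alignment(tokens, weight_map, discriminator, domain_label=1.0):
    """Local alignment idea: a foreground weight map (assumed to come from the
    detector itself) down-weights background tokens in the adversarial loss."""
    # tokens: (B, N, C); weight_map: (B, N) with values in [0, 1]
    logits = discriminator(tokens).squeeze(-1)                      # (B, N)
    targets = torch.full_like(logits, domain_label)                 # source/target label
    per_token_loss = nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none")
    return (weight_map * per_token_loss).mean()


class ClassificationGuidedQuery(nn.Module):
    """Global alignment idea: an image-level multi-label head whose pooled
    feature acts as a compact, representative query for global alignment."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes)

    def forward(self, tokens):
        pooled = tokens.mean(dim=1)                # (B, C) whole-image summary
        multi_label_logits = self.cls_head(pooled) # auxiliary classification task
        return pooled, multi_label_logits


# Toy usage with assumed dimensions.
B, N, C, K = 2, 100, 256, 8
tokens = torch.randn(B, N, C)
weights = torch.rand(B, N)
local_loss = weighted_local_alignment(tokens, weights, DomainDiscriminator(C))
global_query, cls_logits = ClassificationGuidedQuery(C, K)(tokens)
```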