Abstract:
DEtection TRansformer (DETR) is a recently proposed method that streamlines the detection pipeline and achieves competitive results against two-stage detectors such as Faster R-CNN. DETR models eliminate complex anchor generation and post-processing procedures, thereby making the detection pipeline more intuitive. However, the numerous redundant parameters in transformers make DETR models computation- and storage-intensive, which seriously hinders their deployment on resource-constrained devices. In this paper, to obtain a compact end-to-end detection framework, we propose to deeply compress the transformers with low-rank tensor decomposition. The basic idea of our tensor-based compression method is to represent the large-scale weight matrix in one network layer with a chain of low-order matrices. Furthermore, we show that redundant attention heads hinder the performance of detection transformers. We thus propose a gated multi-head attention (GMHA) module that suppresses redundant attention information by normalizing the attention heads. In GMHA, each attention head has an independent gate that determines how much of its attention value is passed on, thereby down-weighting the uninformative heads. Applying GMHA modules mitigates the accuracy drop of the tensor-compressed DETR models. Lastly, to obtain fully compressed DETR models, we introduce a low-bitwidth quantization technique that further reduces the model storage size. With the proposed methods, we achieve significant reductions in parameter count and model size while maintaining high detection performance. We conduct extensive experiments on the COCO and PASCAL VOC datasets to validate the effectiveness of our tensor-compressed (tensorized) DETR models. The experimental results on the COCO benchmark show that we attain 3.7× full-model compression with 482× feed-forward network (FFN) parameter reduction and an accuracy drop of only 0.6 points.
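To illustrate why replacing one large weight matrix with a chain of small factors saves so many parameters, the sketch below counts parameters for a dense layer and for a chained factorization. It assumes a tensor-train-style format in which factor k has shape (r_{k-1}, m_k, n_k, r_k); the abstract does not name the exact decomposition, so this format is an assumption. The layer size (256 × 2048), the mode splits, and the ranks are likewise hypothetical values chosen for illustration, not the settings behind the reported 482× reduction.

```python
from math import prod


def dense_params(in_modes, out_modes):
    """Parameter count of the full weight matrix, prod(in_modes) x prod(out_modes)."""
    return prod(in_modes) * prod(out_modes)


def chain_params(in_modes, out_modes, ranks):
    """Parameter count of a chain of small factors (assumed tensor-train-style).

    Factor k is assumed to have shape (ranks[k], in_modes[k], out_modes[k], ranks[k + 1]),
    with boundary ranks fixed to 1.
    """
    if len(in_modes) != len(out_modes):
        raise ValueError("in_modes and out_modes must have the same length")
    if len(ranks) != len(in_modes) + 1 or ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("ranks must have length d + 1 and start and end with 1")
    return sum(
        ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1]
        for k in range(len(in_modes))
    )


# Hypothetical example: a 256 x 2048 layer split into three modes per side.
in_modes = (4, 8, 8)       # 4 * 8 * 8 = 256
out_modes = (8, 16, 16)    # 8 * 16 * 16 = 2048
ranks = (1, 4, 4, 1)       # illustrative ranks, not taken from the paper

dense = dense_params(in_modes, out_modes)
compressed = chain_params(in_modes, out_modes, ranks)
print(f"dense: {dense}, chained: {compressed}, ratio: {dense / compressed:.1f}x")
```

Smaller ranks shrink the factors further but leave less capacity to represent the original weights, so in practice the ranks trade compression against accuracy.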