Real Time Object Detection Network in UAV Vision Based on CNN and Transformer

Abstract:

Unmanned aerial vehicles (UAVs) play an important role in automatic patrol inspection of cities, helping to protect residents’ lives and property and to keep cities operating normally. However, UAV inspection poses several challenges: UAV images contain numerous small objects that are difficult to detect, objects are often severely occluded, and detection must run in real time. To address these issues, we propose a real-time object detection network (RTD-Net) for UAV images. To compensate for the limited visual features of small objects, we design a feature fusion module (FFM) that lets features at different levels interact and fuse, improving the feature representation of small objects. To achieve real-time detection, we design a lightweight feature extraction module (LEM) and use it to build the backbone network, keeping the computational cost and parameter count under control. To handle the discontinuous features of occluded objects, we design an efficient convolutional transformer block (ECTB) based on convolutional multihead self-attention (CMHSA), which improves the recognition of occluded objects by extracting their context information. Compared with the multihead self-attention (MHSA) of the traditional transformer, CMHSA replaces the position-wise linear projection with a convolutional projection, which substantially reduces computation without performance loss. Finally, an attention prediction head (APH) based on the attention mechanism is designed to improve the model's ability to extract attention regions in complex scenes. The proposed method reaches a detection accuracy of 86.4% mean average precision (mAP) on our UAV image dataset. In addition, it achieves 86.0% mAP at a detection speed of 33.4 frames/s on the NVIDIA Jetson TX2 embedded device.
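To make the computational argument for a convolutional projection more concrete, the following is a rough cost-counting sketch, not the RTD-Net implementation. The abstract does not specify the CMHSA design, so the sketch assumes a CvT-style projection: a depthwise 3x3 convolution followed by a pointwise 1x1 convolution, with stride 2 applied to the key and value projections only. The function names, the 20x20 feature map, and the 256 channels are all hypothetical values chosen for illustration.

```python
def attention_macs(n_query: int, n_key: int, dim: int) -> int:
    """Multiply-accumulates for Q @ K^T plus attention @ V, summed over all heads."""
    return 2 * n_query * n_key * dim


def linear_mhsa_macs(height: int, width: int, dim: int) -> int:
    """Cost of MHSA with position-wise linear projections for Q, K, and V."""
    n_tokens = height * width
    projection = 3 * n_tokens * dim * dim  # three dim x dim linear layers
    return projection + attention_macs(n_tokens, n_tokens, dim)


def conv_out_size(size: int, kernel: int, stride: int) -> int:
    """Output size of a convolution with padding = kernel // 2."""
    return (size + 2 * (kernel // 2) - kernel) // stride + 1


def conv_mhsa_macs(height: int, width: int, dim: int,
                   kernel: int = 3, kv_stride: int = 2) -> int:
    """Cost under the ASSUMED design: depthwise k x k conv + pointwise 1x1 conv
    per projection, with the key/value projections strided by kv_stride."""
    q_tokens = conv_out_size(height, kernel, 1) * conv_out_size(width, kernel, 1)
    kv_tokens = (conv_out_size(height, kernel, kv_stride)
                 * conv_out_size(width, kernel, kv_stride))
    per_token = kernel * kernel * dim + dim * dim  # depthwise + pointwise
    projection = (q_tokens + 2 * kv_tokens) * per_token
    return projection + attention_macs(q_tokens, kv_tokens, dim)


if __name__ == "__main__":
    # Hypothetical feature map: 20 x 20 spatial positions, 256 channels.
    h, w, c = 20, 20, 256
    print("linear projection MHSA MACs:", linear_mhsa_macs(h, w, c))
    print("conv projection MHSA MACs:  ", conv_mhsa_macs(h, w, c))
```

Under these assumptions the saving comes mainly from the strided key and value projections, which shrink the number of key/value tokens and therefore both the projection cost and the size of the attention matrix. With stride 1 the convolutional projection would cost slightly more than the linear one, so the benefit depends on the downsampling choice.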