Abstract:
This paper presents a novel Convolutional Neural Network (CNN) architecture for 2D human pose estimation from RGB images that balances high pose/skeleton estimation accuracy with rapid inference. It is therefore suitable for safety-critical embedded AI scenarios in autonomous systems, where computational resources are typically limited and fast execution is often required, but accuracy cannot be sacrificed. The architecture consists of a shared feature extraction backbone with two parallel heads attached on top of it: one for 2D human body joint regression and one for modelling global human body structure through Image-to-Image Translation (I2I). A corresponding multitask loss function, which combines a typical 2D body joint regression term with a novel I2I term, allows the unified network to be trained for both tasks. Together with enhanced information flow between the parallel neural heads via skip synapses, this strategy extracts both ample semantic and rich spatial information while using a less complex CNN, thus permitting fast execution. The proposed architecture is evaluated on public 2D human pose estimation datasets, where it achieves the best accuracy-speed ratio among the compared state-of-the-art methods. It is also evaluated on a pedestrian intention recognition task for self-driving cars, where it yields higher accuracy and speed than competing approaches.
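
As a rough illustration of how a multitask loss of this kind might combine its two terms, the sketch below adds a joint regression term to a weighted I2I term. The abstract does not give the actual formulation, so everything here is an assumption made for illustration: the mean squared error for the regression term, the mean absolute pixel error for the I2I term, the function names, and the `i2i_weight` parameter.

```python
from typing import Sequence

# Hypothetical types: a pose is a list of (x, y) joints; an image is a 2D grid
# of pixel intensities. Neither representation is taken from the paper.
Pose = Sequence[Sequence[float]]
Image = Sequence[Sequence[float]]


def joint_regression_loss(pred: Pose, target: Pose) -> float:
    """Mean squared error over 2D joint coordinates (assumed regression term)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("poses must be non-empty and have the same joint count")
    total = 0.0
    for (px, py), (tx, ty) in zip(pred, target):
        total += (px - tx) ** 2 + (py - ty) ** 2
    return total / len(pred)


def i2i_loss(pred: Image, target: Image) -> float:
    """Mean absolute pixel error between the translated and the target
    body-structure image (an assumed stand-in for the paper's I2I term)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("images must be non-empty and have the same height")
    total = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        if len(pred_row) != len(target_row):
            raise ValueError("image rows must have the same width")
        for p, t in zip(pred_row, target_row):
            total += abs(p - t)
            count += 1
    return total / count


def multitask_loss(
    pred_pose: Pose,
    target_pose: Pose,
    pred_img: Image,
    target_img: Image,
    i2i_weight: float = 0.5,  # hypothetical balancing weight, not from the paper
) -> float:
    """Weighted sum of the regression term and the I2I term."""
    return joint_regression_loss(pred_pose, target_pose) + i2i_weight * i2i_loss(
        pred_img, target_img
    )


if __name__ == "__main__":
    pose_pred = [(0.0, 0.0), (1.0, 1.0)]
    pose_true = [(0.0, 1.0), (1.0, 1.0)]
    img_pred = [[0.0, 1.0], [0.0, 0.0]]
    img_true = [[0.0, 0.0], [0.0, 0.0]]
    print(multitask_loss(pose_pred, pose_true, img_pred, img_true))
```

In a shared-backbone network, both terms would be computed from the same forward pass, so gradients from the two heads both update the common feature extractor. The weighting between the terms is a design choice that the abstract does not specify.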