Improving living standards through smart healthcare built on the Internet of Things is one of the most critical goals of smart cities, with Artificial Intelligence serving as the core technology. Many smart services built on wearable sensor-based physical activity (PA) recognition can detect unhealthy daily behaviors and further medical risks early. Numerous approaches have paired shallow handcrafted features with traditional machine learning (ML) techniques, which struggle to model real-world activities. In this work, we propose an intermediate fusion framework for human activity recognition (HAR) that fuses deep features extracted by a Deep Convolutional Neural Network (DCNN) with conventional handcrafted features. By transforming raw signal values to pixel intensity values, a data segment acquired from a multi-sensor system is encoded as an activity image for deep model learning. Built from several novel Residual Triple-Convolutional blocks, the proposed DCNN extracts multi-scale spatiotemporal signal-level and sensor-level correlations simultaneously from the activity image. In the fusion model, the hybrid feature obtained by merging the handcrafted and deep features is learned by a multiclass Support Vector Machine (SVM) classifier. In several performance evaluation experiments, our fusion approach achieves an accuracy of over 96.0% on three public benchmark datasets: Daily and Sport Activities, Daily Life Activities, and RealWorld. Furthermore, the method outperforms several state-of-the-art HAR approaches, demonstrating the superiority of the proposed intermediate fusion model in multi-sensor systems.
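To make the data flow concrete, the following dependency-free Python sketch illustrates two steps named above: encoding a multi-sensor segment as an activity image, and merging handcrafted and deep features into one hybrid vector. It is only an illustration under stated assumptions, not the method as implemented in this work. The function names (`encode_activity_image`, `handcrafted_features`, `fuse`), the per-segment min-max scaling to 8-bit intensities, the choice of statistics, and the placeholder deep feature vector are all hypothetical; the DCNN and the multiclass SVM are not shown.

```python
from statistics import mean, pstdev


def encode_activity_image(segment):
    """Map raw signal values to 8-bit pixel intensities.

    `segment` is a list of sensor channels, each a list of samples, so the
    result has one image row per channel. Min-max scaling over the whole
    segment is an assumption made for this sketch.
    """
    flat = [value for channel in segment for value in channel]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant segment carries no contrast; map it to a black image.
        return [[0 for _ in channel] for channel in segment]
    return [
        [round(255 * (value - lo) / span) for value in channel]
        for channel in segment
    ]


def handcrafted_features(segment):
    """Simple per-channel statistics (an assumed, illustrative feature set)."""
    features = []
    for channel in segment:
        features.extend(
            [mean(channel), pstdev(channel), min(channel), max(channel)]
        )
    return features


def fuse(handcrafted, deep):
    """Intermediate fusion by concatenating the two feature vectors."""
    return list(handcrafted) + list(deep)


# Two hypothetical sensor channels with four samples each.
segment = [
    [0.0, 0.5, 1.0, 0.5],
    [-1.0, 0.0, 1.0, 0.0],
]
image = encode_activity_image(segment)

# Placeholder standing in for features a trained DCNN would extract
# from `image`; the values are arbitrary.
deep = [0.1, 0.2, 0.3]

hybrid = fuse(handcrafted_features(segment), deep)
```

In a full pipeline, vectors like `hybrid` would be collected for every segment and passed to a multiclass classifier such as an SVM. Concatenation is used here because it is the simplest way to merge features at an intermediate stage.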