Thermal and Reliability Management for Commercial Multicore Processors Considering Workload Dependen

Abstract:

This work proposes a dynamic thermal and reliability management framework that uses task mapping and migration to improve the thermal performance and reliability of commercial multicore processors while accounting for workload-dependent thermal hot spot stress. The method is motivated by the observation that different workloads activate different spatial power and thermal hot spots within each core. Existing run-time thermal management relies on on-chip thermal sensors at fixed locations and can therefore produce suboptimal decisions, because the temperatures those sensors report may not correspond to the true hot spots. The new method, called Hot-Trim, uses a machine learning-based approach to characterize the power density hot spots across each core and then applies a new task mapping/migration scheme based on the resulting hot spot stresses. Compared to existing works, the new approach is the first to optimize VLSI reliability by exploiting workload-dependent power hot spots. The advantages of the proposed method over the Linux baseline task mapping and the temperature-based mapping method are demonstrated and validated on real commercial chips. Experiments on a real Intel Core i7 quad-core processor executing PARSEC-3.0 and SPLASH-2 benchmarks show that, compared to the existing Linux scheduler, core and hot spot temperatures can be lowered by 1.15 °C–1.31 °C. In addition, Hot-Trim improves the chip's electromigration (EM), negative bias temperature instability (NBTI), and hot-carrier injection (HCI) related reliability by 30.2%, 7.0%, and 31.1%, respectively, compared to the Linux baseline, without any performance degradation. Furthermore, compared to the conventional temperature-based mapping technique, it improves EM and HCI-related reliability by 29.6% and 19.6%, respectively, while further reducing the temperature by half a degree.
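
To make the idea of stress-aware task mapping concrete, the following is a minimal illustrative sketch and not the Hot-Trim algorithm itself, whose details are not given in the abstract. It assumes that each task already has a scalar predicted hot spot stress score (for example, from a learned power density model) and that each core has an accumulated stress value. The function name `map_tasks_by_stress`, the score values, and the greedy policy are all hypothetical.

```python
from typing import Dict, List


def map_tasks_by_stress(task_stress: Dict[str, float],
                        core_stress: List[float]) -> Dict[str, int]:
    """Greedy sketch: place the most stressful task on the least stressed core.

    task_stress: hypothetical predicted hot spot stress per task.
    core_stress: hypothetical accumulated stress per core (index = core id).
    Returns a mapping from task name to core id.
    """
    load = list(core_stress)  # copy so the caller's list is not modified
    mapping: Dict[str, int] = {}
    # Visit tasks from highest to lowest predicted stress.
    for task, stress in sorted(task_stress.items(),
                               key=lambda kv: kv[1], reverse=True):
        # Pick the core with the lowest accumulated stress so far.
        core = min(range(len(load)), key=load.__getitem__)
        mapping[task] = core
        load[core] += stress
    return mapping


if __name__ == "__main__":
    tasks = {"a": 3.0, "b": 1.0, "c": 2.0}  # made-up stress scores
    cores = [0.5, 0.0]                       # made-up per-core stress
    print(map_tasks_by_stress(tasks, cores))
```

On Linux, a mapping like this could in principle be applied by pinning each task's process to its chosen core with `os.sched_setaffinity`. A real framework would also need to decide when to migrate tasks and how to keep the stress estimates up to date.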