Abstract:
The emerging Resistive RAM (ReRAM) technology significantly boosts the performance and the energy efficiency of the deep learning accelerators (DLAs) via the Computing-in-Memory (CiM) architecture. However, ReRAM-based DLA also suffers a high occurrence rate of memory faults. How to detect and protect against the faults in ReRAM devices poses great challenges to ReRAM-based DLA design. In this work, we propose RRAMedy, an in-situ fault detection and network remedy framework for ReRAM-based DLAs. With the proposed Adversarial Example Testing, which is a lifetime on-device and on-line fault detection technique, it achieves high detection coverage of both hard faults and soft faults at a low run-time cost. In addition, it employs an edge-cloud collaborative model retraining method to tolerate the detected faults by leveraging the inherent fault-adaptive capability of DNNs. Meanwhile, to enable in-situ model remedy when the cloud assistance is absent due to security or overhead issues, we propose to accelerate the fault-masking retraining process on edge devices with parallelized Knowledge Transfer. Our experimental results show that the proposed fault detection technique achieves high fault detection accuracy and delivers real-time testing performance. Meanwhile, the proposed retraining approach greatly alleviates the accuracy degradation problem and achieves excellent performance speedups over the baselines.