R2F: A Remote Retraining Framework for AIoT Processors with Computing Errors

07/07/2021
by   Dawen Xu, et al.

AIoT processors fabricated with newer technology nodes suffer from rising soft-error rates due to shrinking transistor sizes and lower supply voltage. Soft errors on AIoT processors, particularly on deep learning accelerators (DLAs) that perform massive amounts of computation, may cause substantial computing errors. These computing errors are difficult to capture with conventional training on general-purpose processors such as CPUs and GPUs in a server. Directly applying offline-trained neural network models to edge accelerators with errors may lead to considerable loss of prediction accuracy. To address this problem, we propose a remote retraining framework (R2F) for remote AIoT processors with computing errors. It places the remote AIoT processor with soft errors in the training loop, so that the on-site computing errors can be learned together with the application data on the server and the retrained models become resilient to the soft errors. We also propose an optimized partial triple modular redundancy (TMR) strategy to enhance the retraining. According to our experiments, R2F enables elastic design trade-offs between model accuracy and performance penalty. The top-5 model accuracy can be improved by 1.93 with 0 [...] notice that the retraining requires massive data transmission, which even dominates the training time, and propose a sparse increment compression approach to optimize the data transmission, which reduces the retraining time by 38 [...] remote retraining.
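The abstract does not spell out how the sparse increment compression works. One plausible reading, sketched here purely as an assumption, is that only the parameters that changed between two retraining iterations are transmitted, as (index, delta) pairs, instead of the full tensor. The function names `encode_increment` and `apply_increment`, the `threshold` parameter, and the flat-list weight layout are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Increment = List[Tuple[int, float]]


def encode_increment(old: List[float], new: List[float],
                     threshold: float = 0.0) -> Increment:
    """Keep only (index, delta) pairs whose absolute change exceeds threshold.

    Assumption: unchanged or barely changed values need not be re-sent.
    """
    if len(old) != len(new):
        raise ValueError("old and new must have the same length")
    return [(i, n - o)
            for i, (o, n) in enumerate(zip(old, new))
            if abs(n - o) > threshold]


def apply_increment(weights: List[float], increment: Increment) -> List[float]:
    """Rebuild the updated values on the receiving side from the sparse increment."""
    updated = list(weights)
    for i, delta in increment:
        updated[i] += delta
    return updated


# Hypothetical example: only one of four weights changed during retraining.
server_weights = [0.5, -1.25, 2.0, 0.75]
retrained = [0.5, -1.0, 2.0, 0.75]

increment = encode_increment(server_weights, retrained)
device_weights = apply_increment(server_weights, increment)
```

In this toy case a single (index, delta) pair is sent instead of four values. A nonzero `threshold` would drop small updates as well, trading some fidelity for less traffic; whether R2F makes that trade-off, and which data it actually compresses, is not stated in the abstract.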

