CodeNet: Training Large Scale Neural Networks in Presence of Soft-Errors

by Sanghamitra Dutta, et al.

This work proposes the first strategy to make distributed training of neural networks resilient to computing errors, a problem that has remained unsolved despite being first posed in 1956 by von Neumann. He also speculated that the efficiency and reliability of the human brain are obtained by allowing for low-power but error-prone components, with redundancy for error resilience. It is surprising that this problem remains open, even as massive artificial neural networks are being trained on increasingly low-cost and unreliable processing units. Our coding-theory-inspired strategy, "CodeNet," solves this problem by addressing three challenges in the science of reliable computing: (i) providing the first strategy for error-resilient neural network training by encoding each layer separately; (ii) keeping the overheads of coding (encoding, error detection, and decoding) low by obviating the need to re-encode the updated parameter matrices from scratch after each iteration; and (iii) providing a completely decentralized implementation with no central node (which would be a single point of failure), allowing all primary computational steps to be error-prone. We theoretically demonstrate that CodeNet has higher error tolerance than replication, which we leverage to speed up computation time. Simultaneously, CodeNet requires lower redundancy than replication, and incurs equal computational and communication costs in a scaling sense. We first demonstrate the benefits of CodeNet in reducing expected computation time over replication when accounting for checkpointing. Our experiments show that CodeNet achieves the best accuracy-runtime tradeoff compared to both replication and uncoded strategies. CodeNet is a significant step towards biologically plausible neural network training, which could hold the key to orders-of-magnitude efficiency improvements.
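To make point (ii) concrete, the following is a minimal sketch of the general idea. It is not the authors' construction: it assumes a single checksum row (the column sums of a layer's weight matrix), which can only detect an error in one output, whereas the abstract does not specify the codes CodeNet actually uses. All function names here (`encode`, `coded_matvec`, `has_error`, `coded_rank1_update`) are hypothetical. The sketch shows that a rank-1 gradient update can be applied directly to the coded matrix, so the checksum stays valid without re-encoding from scratch.

```python
import random

def encode(W):
    """Append a checksum row holding the column sums of W (assumed toy code)."""
    checksum = [sum(col) for col in zip(*W)]
    return [row[:] for row in W] + [checksum]

def coded_matvec(Wc, x):
    """Multiply the coded matrix by x; the last entry is the checksum output."""
    return [sum(w * v for w, v in zip(row, x)) for row in Wc]

def has_error(yc, tol=1e-9):
    """Detect an error: the data outputs must sum to the checksum output."""
    *y, check = yc
    return abs(sum(y) - check) > tol * max(1.0, abs(check))

def coded_rank1_update(Wc, delta, x, lr):
    """Apply W <- W - lr * delta x^T directly on the coded matrix.

    Extending delta with its own sum keeps the checksum row consistent,
    so W is never re-encoded from scratch.
    """
    delta_c = list(delta) + [sum(delta)]
    return [[w - lr * d * v for w, v in zip(row, x)]
            for row, d in zip(Wc, delta_c)]

random.seed(0)
W = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
x = [random.uniform(-1, 1) for _ in range(4)]
delta = [random.uniform(-1, 1) for _ in range(3)]

Wc = encode(W)
clean = coded_matvec(Wc, x)

corrupted = clean[:]
corrupted[1] += 0.5  # simulate a soft error in one output entry

Wc_updated = coded_rank1_update(Wc, delta, x, lr=0.1)
```

Calling `has_error(clean)` should report no error, `has_error(corrupted)` should flag the injected fault, and `has_error(coded_matvec(Wc_updated, x))` should again report no error, since the update preserved the checksum relation.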


A Unified Coded Deep Neural Network Training Strategy Based on Generalized PolyDot Codes for Matrix Multiplication

This paper has two contributions. First, we propose a novel coded matrix...

Straggler Mitigation in Distributed Optimization Through Data Encoding

Slow running or straggler tasks can significantly reduce computation spe...

Hierarchical Coding for Distributed Computing

Coding for distributed computing supports low-latency computation by rel...

Magnetoresistive RAM for error resilient XNOR-Nets

We trained three Binarized Convolutional Neural Network architectures (L...

ESL-SNNs: An Evolutionary Structure Learning Strategy for Spiking Neural Networks

Spiking neural networks (SNNs) have manifested remarkable advantages in ...

Collage Inference: Achieving low tail latency during distributed image classification using coded redundancy models

Reducing the latency variance in machine learning inference is a key req...

Doubt and Redundancy Kill Soft Errors – Towards Detection and Correction of Silent Data Corruption in Task-based Numerical Software

Resilient algorithms in high-performance computing are subject to rigoro...
