A Theory of I/O-Efficient Sparse Neural Network Inference

01/03/2023
by Niels Gleinig, et al.

As the accuracy of machine learning models increases rapidly, so does their demand for energy and compute resources. At a low level, most of these resources are consumed by data movement between different memory units. Modern hardware architectures contain a form of fast memory (e.g., cache, registers), which is small, and a slow memory (e.g., DRAM), which is larger but expensive to access. We can only process data that resides in fast memory, which incurs data movement (input/output operations, or I/Os) between the two units. In this paper, we provide a rigorous theoretical analysis of the I/Os needed in sparse feedforward neural network (FFNN) inference. We establish bounds that determine the optimal number of I/Os up to a factor of 2, and we present a method that uses a number of I/Os within that range. Much of the I/O complexity is determined by a few high-level properties of the FFNN (the number of inputs, outputs, neurons, and connections), but if we want to get closer to the exact lower bound, the instance-specific sparsity patterns need to be considered. Departing from the 2-optimal computation strategy, we show how to reduce the number of I/Os further with simulated annealing. Complementing this result, we provide an algorithm that constructively generates networks with maximum I/O efficiency for inference. We test the algorithms and empirically verify our theoretical and algorithmic contributions. In our experiments on real hardware, we observe speedups of up to 45× relative to the standard way of performing inference.
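The paper's exact bounds and its 2-optimal computation strategy are developed in the full text; the toy Python sketch below only illustrates the general setup described in the abstract. It assumes a simple model in which fast memory holds a fixed number of values under LRU replacement, counts only reads from slow memory (write-backs of evicted intermediate values are ignored), uses a deliberately crude lower-bound estimate (every input, nonzero weight, and output must cross the memory boundary at least once), and searches over neuron-evaluation orders with simulated annealing by swapping neurons within a layer. All function names and modeling choices here are illustrative assumptions, not the paper's algorithms.

import math
import random


def trivial_io_lower_bound(num_inputs, num_outputs, num_weights):
    # Crude estimate: every input and every nonzero weight (one per connection)
    # must be read from slow memory at least once, and every output written
    # back at least once, assuming the model does not fit in fast memory.
    return num_inputs + num_weights + num_outputs


def flatten(layers):
    # Evaluation order = layers concatenated front to back.
    return [v for layer in layers for v in layer]


def simulate_ios(order, in_edges, fast_mem_size):
    # Count reads from slow memory when neurons are evaluated in `order`,
    # keeping at most `fast_mem_size` values resident under LRU replacement.
    # `in_edges[v]` lists the inputs/neurons whose values neuron v consumes.
    resident = []  # values currently in fast memory, least recently used first
    ios = 0
    for v in order:
        for u in in_edges[v]:
            if u in resident:
                resident.remove(u)      # hit: refresh LRU position
            else:
                ios += 1                # miss: load from slow memory
                if len(resident) >= fast_mem_size:
                    resident.pop(0)     # evict least recently used value
            resident.append(u)
        # The freshly computed value of v also occupies fast memory.
        if len(resident) >= fast_mem_size:
            resident.pop(0)
        resident.append(v)
    return ios


def anneal_order(layers, in_edges, fast_mem_size, steps=2000, t0=5.0):
    # Simulated annealing over evaluation orders: swap two neurons inside one
    # layer and accept if the simulated I/O count drops, or with a
    # temperature-dependent probability otherwise.
    best = [list(layer) for layer in layers]
    cost = simulate_ios(flatten(best), in_edges, fast_mem_size)
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9
        cand = [list(layer) for layer in best]
        swappable = [layer for layer in cand if len(layer) > 1]
        if not swappable:
            break
        layer = random.choice(swappable)
        i, j = random.sample(range(len(layer)), 2)
        layer[i], layer[j] = layer[j], layer[i]
        cand_cost = simulate_ios(flatten(cand), in_edges, fast_mem_size)
        if cand_cost <= cost or random.random() < math.exp((cost - cand_cost) / temp):
            best, cost = cand, cand_cost
    return best, cost


if __name__ == "__main__":
    # Tiny random sparse FFNN: 4 inputs, one hidden layer of 6, 2 outputs.
    random.seed(0)
    inputs = ["i%d" % k for k in range(4)]
    hidden = ["h%d" % k for k in range(6)]
    outputs = ["o%d" % k for k in range(2)]
    in_edges = {v: random.sample(inputs, 2) for v in hidden}
    in_edges.update({v: random.sample(hidden, 3) for v in outputs})
    num_weights = sum(len(e) for e in in_edges.values())
    print("trivial I/O bound:", trivial_io_lower_bound(4, 2, num_weights))
    best, ios = anneal_order([hidden, outputs], in_edges, fast_mem_size=4)
    print("simulated I/Os after annealing:", ios)

Restricting swaps to neurons within the same layer keeps every candidate evaluation order valid for a feedforward network, since a neuron's inputs always come from earlier layers; the annealing then only exploits reuse of values already resident in fast memory.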
