A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition

by   Wei Zhang, et al.

Modern Automatic Speech Recognition (ASR) systems rely on distributed deep learning to for quick training completion. To enable efficient distributed training, it is imperative that the training algorithms can converge with a large mini-batch size. In this work, we discovered that Asynchronous Decentralized Parallel Stochastic Gradient Descent (ADPSGD) can work with much larger batch size than commonly used Synchronous SGD (SSGD) algorithm. On commonly used public SWB-300 and SWB-2000 ASR datasets, ADPSGD can converge with a batch size 3X as large as the one used in SSGD, thus enable training at a much larger scale. Further, we proposed a Hierarchical-ADPSGD (H-ADPSGD) system in which learners on the same computing node construct a super learner via a fast allreduce implementation, and super learners deploy ADPSGD algorithm among themselves. On a 64 Nvidia V100 GPU cluster connected via a 100Gb/s Ethernet network, our system is able to train SWB-2000 to reach a 7.6 the Hub5-2000 Switchboard (SWB) test-set and a 13.2 test-set in 5.2 hours. To the best of our knowledge, this is the fastest ASR training system that attains this level of model accuracy for SWB-2000 task to be ever reported in the literature.


page 1

page 2

page 3

page 4


Distributed Deep Learning Strategies For Automatic Speech Recognition

In this paper, we propose and investigate a variety of distributed deep ...

Distributed Training of Deep Neural Network Acoustic Models for Automatic Speech Recognition

The past decade has witnessed great progress in Automatic Speech Recogni...

Improving Efficiency in Large-Scale Decentralized Distributed Training

Decentralized Parallel SGD (D-PSGD) and its asynchronous variant Asynchr...

Model Accuracy and Runtime Tradeoff in Distributed Deep Learning:A Systematic Study

This paper presents Rudra, a parameter server based distributed computin...

Asynchronous Decentralized Distributed Training of Acoustic Models

Large-scale distributed training of deep acoustic models plays an import...

Submodular Rank Aggregation on Score-based Permutations for Distributed Automatic Speech Recognition

Distributed automatic speech recognition (ASR) requires to aggregate out...

Impact of Experiencing Misrecognition by Teachable Agents on Learning and Rapport

While speech-enabled teachable agents have some advantages over typing-b...

Please sign up or login with your details

Forgot password? Click here to reset