Shrinking Bigfoot: Reducing wav2vec 2.0 footprint

03/29/2021
by Zilun Peng, et al.

Wav2vec 2.0 is a state-of-the-art speech recognition model which maps speech audio waveforms into latent representations. The largest version of wav2vec 2.0 contains 317 million parameters. Hence, the inference latency of wav2vec 2.0 will be a bottleneck in production, leading to high costs and a significant environmental footprint. To improve wav2vec 2.0's applicability to a production setting, we explore multiple model compression methods borrowed from the domain of large language models. Using a teacher-student approach, we distilled the knowledge from the original wav2vec 2.0 model into a student model, which is 2 times faster and 4.8 times smaller than the original model. This increase in performance is accomplished with only a 7% increase in word error rate (WER). Our quantized model is 3.6 times smaller than the original model, with only a 0.1% increase in WER. To the best of our knowledge, this is the first work that compresses wav2vec 2.0.
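
The teacher-student approach mentioned above is knowledge distillation: a small student model is trained to match the output distribution of the large wav2vec 2.0 teacher. Below is a minimal PyTorch sketch of the idea; the KL-based loss, the temperature value, and the shape of the training step are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t**2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * t * t

def train_step(teacher, student, optimizer, waveforms):
    """One hypothetical distillation step: the frozen teacher provides soft
    targets for the student on the same audio batch. Both models are assumed
    to map waveforms directly to logits."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(waveforms)
    student_logits = student(waveforms)
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Softening both distributions with a temperature lets the student learn from the relative probabilities the teacher assigns to all classes, not just its top prediction.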
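
Quantization, the second compression method reported above, stores weights at lower numerical precision. The sketch below uses PyTorch's post-training dynamic quantization as one common way to shrink a transformer like wav2vec 2.0; the checkpoint name and the choice to quantize only Linear layers are assumptions, and the paper's exact quantization scheme may differ.

```python
import torch
from transformers import Wav2Vec2ForCTC  # assumes the Hugging Face port

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")

# Store the weights of all Linear layers (the bulk of the 317M parameters)
# as int8; activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```

Dynamic quantization requires no retraining or calibration data, which makes it a common first choice for shrinking transformer inference.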
