Knowledge Distillation of Russian Language Models with Reduction of Vocabulary

05/04/2022
by Alina Kolesnikova, et al.

Today, transformer language models serve as a core component for the majority of natural language processing tasks. Industrial applications of such models require minimizing computation time and memory footprint. Knowledge distillation is one approach to this goal. Existing methods in this field mainly focus on reducing the number of layers or the dimension of embeddings/hidden representations. An alternative option is to reduce the number of tokens in the vocabulary and, therefore, the embedding matrix of the student model. The main problem with vocabulary minimization is the mismatch between the input sequences and output class distributions of the teacher and student models, which makes it impossible to apply KL-based knowledge distillation directly. We propose two simple yet effective alignment techniques that make knowledge distillation possible for students with a reduced vocabulary. Evaluation of the distilled models on a number of common benchmarks for Russian, such as Russian SuperGLUE, SberQuAD, RuSentiment, ParaPhraser, and Collection-3, demonstrates that our techniques achieve compression from 17× to 49× while maintaining the quality of a 1.7× compressed student that keeps the full-sized vocabulary and only reduces the number of Transformer layers. We make our code and distilled models available.
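
The abstract does not spell out the two alignment techniques, so the sketch below is only a rough illustration of the general idea in plain PyTorch: map teacher tokens onto the reduced student vocabulary with a token-mapping matrix, project the teacher's output distribution through it, and then apply the usual temperature-scaled KL distillation loss. All names here (build_vocab_projection, distillation_loss, the toy token map) are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch only: a generic way to reconcile a teacher/student
# vocabulary mismatch before KL-based distillation. The paper's actual
# alignment techniques may differ; the token map below is hypothetical.
import torch
import torch.nn.functional as F

def build_vocab_projection(teacher_to_student, teacher_vocab, student_vocab):
    """Dense 0/1 matrix P of shape (teacher_vocab, student_vocab) with
    P[t, s] = 1 when teacher token t maps to student token s;
    teacher tokens without a counterpart are simply dropped."""
    P = torch.zeros(teacher_vocab, student_vocab)
    for t, s in teacher_to_student.items():
        P[t, s] = 1.0
    return P

def distillation_loss(teacher_logits, student_logits, projection, temperature=2.0):
    """Temperature-scaled KL(teacher || student), with the teacher distribution
    first projected onto the reduced student vocabulary."""
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)   # (batch, teacher_vocab)
    t_probs = t_probs @ projection                              # (batch, student_vocab)
    t_probs = t_probs / t_probs.sum(dim=-1, keepdim=True).clamp_min(1e-9)  # renormalize dropped mass
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2

# Toy usage with random logits; the mapping keeps only the first 10 teacher tokens.
teacher_vocab, student_vocab = 20, 10
token_map = {t: t for t in range(student_vocab)}  # hypothetical teacher -> student ids
P = build_vocab_projection(token_map, teacher_vocab, student_vocab)
loss = distillation_loss(torch.randn(4, teacher_vocab), torch.randn(4, student_vocab), P)
print(loss.item())
```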
