AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models

01/29/2022
by   Dongkuan Xu, et al.

Knowledge distillation (KD) methods compress large models into smaller students with manually designed student architectures for a pre-specified computational cost. Finding a viable student requires several trials, and the process must be repeated whenever the student or the computational budget changes. We use Neural Architecture Search (NAS) to automatically distill several compressed students with variable cost from a large model. Existing NAS approaches train a single SuperLM containing millions of weight-shared subnetworks, which causes interference between subnetworks of very different sizes. Our framework AutoDistil addresses these challenges with the following steps: (a) it incorporates inductive bias and heuristics to partition the Transformer search space into K compact sub-spaces (K=3 for the typical student sizes of base, small, and tiny); (b) it trains one SuperLM per sub-space with a task-agnostic objective (e.g., self-attention distillation) and weight-sharing among students; (c) it performs a lightweight search for the optimal student without re-training. Fully task-agnostic training and search allow the students to be reused for fine-tuning on any downstream task. Experiments on the GLUE benchmark against state-of-the-art KD and NAS methods show that AutoDistil outperforms leading compression techniques, with up to a 2.7x reduction in computational cost and negligible loss in task performance.
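The abstract describes a three-stage pipeline: partition the search space, train weight-shared SuperLMs with a task-agnostic distillation objective, then search without re-training. The Python sketch below only illustrates how those pieces fit together; the sub-space ranges in SUB_SPACES, the sample_student helper, and the KL-based attention loss are illustrative assumptions, not the paper's exact search space or objective.

```python
# Illustrative sketch of the AutoDistil stages (hypothetical ranges and helpers).
import random
import torch
import torch.nn.functional as F

# (a) Partition the Transformer search space into K = 3 compact sub-spaces.
#     Each sub-space bounds depth, hidden size, attention heads, and MLP ratio.
#     These ranges are placeholders, not the paper's actual bounds.
SUB_SPACES = {
    "base":  {"layers": (10, 12), "hidden": (544, 768), "heads": (9, 12), "mlp_ratio": (3, 4)},
    "small": {"layers": (9, 12),  "hidden": (384, 544), "heads": (6, 9),  "mlp_ratio": (2, 4)},
    "tiny":  {"layers": (5, 8),   "hidden": (128, 384), "heads": (4, 8),  "mlp_ratio": (2, 3)},
}

def sample_student(sub_space: dict) -> dict:
    """Sample one student architecture from a sub-space (weight-shared in the SuperLM)."""
    return {name: random.randint(lo, hi) for name, (lo, hi) in sub_space.items()}

# (b) Task-agnostic training signal: self-attention distillation, sketched here as
#     KL divergence between teacher and sampled-student attention distributions.
def self_attention_distill_loss(teacher_attn: torch.Tensor, student_attn: torch.Tensor) -> torch.Tensor:
    # attention tensors: (batch, heads, seq, seq), rows already softmax-normalized
    return F.kl_div(student_attn.clamp_min(1e-9).log(), teacher_attn, reduction="batchmean")

# Toy demonstration with random attention maps.
teacher_attn = torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)
student_attn = torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)
print(self_attention_distill_loss(teacher_attn, student_attn))

# (c) Lightweight search: score already-trained, weight-shared students without re-training.
#     In practice each candidate would be evaluated with SuperLM weights on held-out data;
#     here we only show the shape of the loop.
for arch in (sample_student(SUB_SPACES["small"]) for _ in range(3)):
    print(arch)
```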


Related research

03/16/2023 · Neural Architecture Search for Effective Teacher-Student Knowledge Transfer in Language Models
Large pre-trained language models have achieved state-of-the-art results...

05/30/2021 · NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search
While pre-trained language models (e.g., BERT) have achieved impressive...

11/05/2021 · AUTOKD: Automatic Knowledge Distillation Into A Student Architecture Family
State-of-the-art results in deep learning have been improving steadily...

09/16/2020 · Collaborative Group Learning
Collaborative learning has successfully applied knowledge transfer to gu...

02/16/2021 · AlphaNet: Improved Training of Supernet with Alpha-Divergence
Weight-sharing neural architecture search (NAS) is an effective techniqu...

06/27/2022 · Revisiting Architecture-aware Knowledge Distillation: Smaller Models and Faster Search
Knowledge Distillation (KD) has recently emerged as a popular method for...

05/26/2023 · Meta-prediction Model for Distillation-Aware NAS on Unseen Datasets
Distillation-aware Neural Architecture Search (DaNAS) aims to search for...
