Designing Effective Sparse Expert Models

02/17/2022
by Barret Zoph, et al.

Scale has opened new frontiers in natural language processing – but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy-efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed-book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).
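
The abstract refers to Mixture-of-Experts routing and to the training instabilities sparse models can exhibit; one of the stabilizers proposed in the ST-MoE paper is an auxiliary router z-loss that penalizes large router logits. The sketch below is a minimal, illustrative NumPy version of Switch-style top-1 routing with such a z-loss term, not the authors' implementation: the layer name, toy ReLU experts, shapes, and the 1e-3 coefficient are assumptions chosen for clarity.

    # Minimal sketch (assumed names and shapes) of Switch-style top-1 MoE routing
    # with a router z-loss term; illustrative only, not the ST-MoE codebase.
    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def switch_layer(tokens, w_router, expert_weights, z_loss_coef=1e-3):
        """Route each token to its top-1 expert and apply that expert's feed-forward block.

        tokens:         [num_tokens, d_model]
        w_router:       [d_model, num_experts] router projection
        expert_weights: list of (w_in [d_model, d_ff], w_out [d_ff, d_model]) pairs
        """
        logits = tokens @ w_router                          # [num_tokens, num_experts]
        probs = softmax(logits)                             # router probabilities
        expert_idx = probs.argmax(axis=-1)                  # top-1 expert per token
        gate = probs[np.arange(len(tokens)), expert_idx]    # probability of the chosen expert

        # Router z-loss: mean squared log-sum-exp of the router logits,
        # which discourages very large logits and helps training stability.
        m = logits.max(axis=-1)
        z = np.log(np.exp(logits - m[:, None]).sum(axis=-1)) + m
        z_loss = z_loss_coef * np.mean(z ** 2)

        out = np.zeros_like(tokens)
        for e, (w_in, w_out) in enumerate(expert_weights):
            mask = expert_idx == e
            if mask.any():
                h = np.maximum(tokens[mask] @ w_in, 0.0)        # ReLU feed-forward expert
                out[mask] = (h @ w_out) * gate[mask][:, None]   # scale output by the gate value
        return out, z_loss

    # Toy usage with made-up sizes.
    d_model, d_ff, num_experts, num_tokens = 8, 16, 4, 10
    tokens = rng.normal(size=(num_tokens, d_model))
    w_router = 0.1 * rng.normal(size=(d_model, num_experts))
    experts = [(0.1 * rng.normal(size=(d_model, d_ff)),
                0.1 * rng.normal(size=(d_ff, d_model))) for _ in range(num_experts)]
    out, z_loss = switch_layer(tokens, w_router, experts)
    print(out.shape, z_loss)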

Related research

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (12/13/2021)
Scaling language models with more data, compute and parameters has drive...

Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts (09/08/2023)
Sparse Mixture-of-Experts models (MoEs) have recently gained popularity ...

A Review of Sparse Expert Models in Deep Learning (09/04/2022)
Sparse expert models are a thirty-year old concept re-emerging as a popu...

Scaling Vision-Language Models with Sparse Mixture of Experts (03/13/2023)
The field of natural language processing (NLP) has made significant stri...

Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production (11/18/2022)
Mixture of Experts (MoE) models with conditional execution of sparsely a...

COMET: Learning Cardinality Constrained Mixture of Experts with Trees and Local Search (06/05/2023)
The sparse Mixture-of-Experts (Sparse-MoE) framework efficiently scales ...

SML: a new Semantic Embedding Alignment Transformer for efficient cross-lingual Natural Language Inference (03/17/2021)
The ability of Transformers to perform with precision a variety of tasks...
