DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

by Pengcheng He, et al.

This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model by replacing masked language modeling (MLM) with replaced token detection (RTD), a more sample-efficient pre-training task. Our analysis shows that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance, because the training losses of the discriminator and the generator pull the token embeddings in different directions, creating "tug-of-war" dynamics. We thus propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics, improving both training efficiency and the quality of the pre-trained model. We have pre-trained DeBERTaV3 using the same settings as DeBERTa to demonstrate its exceptional performance on a wide range of downstream natural language understanding (NLU) tasks. Taking the GLUE benchmark with eight tasks as an example, the DeBERTaV3 Large model achieves a 91.37% average score, which is 1.37% over DeBERTa and 1.91% over ELECTRA, setting a new state of the art (SOTA) among models with a similar structure. Furthermore, we have pre-trained a multi-lingual model, mDeBERTa, and observed a larger improvement over strong baselines compared to the English models. For example, mDeBERTa Base achieves a 79.8% zero-shot cross-lingual accuracy on XNLI, a 3.6% improvement over XLM-R Base, creating a new SOTA on this benchmark. We have made our pre-trained models and inference code publicly available at https://github.com/microsoft/DeBERTa.
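As a rough illustration of the gradient-disentangled embedding sharing the abstract describes, the minimal PyTorch sketch below shows the core mechanism: the discriminator reuses the generator's token embeddings E_G through a stop-gradient and trains only a residual matrix E_Δ, so the RTD loss never pushes gradients into the shared embeddings. The class name `GDESEmbedding`, the method names, and the initialization details are assumptions for illustration, not code from the released DeBERTa repository.

```python
import torch
import torch.nn as nn

class GDESEmbedding(nn.Module):
    """Illustrative sketch of gradient-disentangled embedding sharing.

    The discriminator's embedding matrix is modeled as
        E_D = stop_gradient(E_G) + E_Delta,
    so the RTD loss updates only E_Delta, while the generator's MLM
    loss alone shapes the shared embeddings E_G. (Names are hypothetical.)
    """

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        # E_G: shared token embeddings, trained only by the generator's MLM loss.
        self.generator_embeddings = nn.Embedding(vocab_size, hidden_size)
        # E_Delta: residual matrix, initialized to zero, trained only by the
        # discriminator's RTD loss.
        self.delta = nn.Parameter(torch.zeros(vocab_size, hidden_size))

    def generator_embed(self, input_ids: torch.Tensor) -> torch.Tensor:
        # The generator looks up E_G directly; MLM gradients flow into it.
        return self.generator_embeddings(input_ids)

    def discriminator_embed(self, input_ids: torch.Tensor) -> torch.Tensor:
        # detach() acts as the stop-gradient: RTD gradients reach self.delta
        # but never self.generator_embeddings, avoiding the tug-of-war
        # between the two losses on a single shared matrix.
        e_d = self.generator_embeddings.weight.detach() + self.delta
        return nn.functional.embedding(input_ids, e_d)
```

Because of the `detach()`, a backward pass on the discriminator's loss produces gradients only for `delta`, while the generator's loss continues to update `generator_embeddings`; the sum E_G + E_Δ would then serve as the discriminator's effective embedding matrix after pre-training.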

