New Students on Sesame Street: What Order-Aware Matrix Embeddings Can Learn from BERT

09/17/2021
by Lukas Galke et al.

Large-scale pretrained language models (PreLMs) are revolutionizing natural language processing across all benchmarks. However, their sheer size is prohibitive in low-resource or large-scale applications. While common approaches reduce the size of PreLMs via same-architecture distillation or pruning, we explore distilling PreLMs into more efficient order-aware embedding models. Our results on the GLUE benchmark show that embedding-centric students, which have learned from BERT, yield scores comparable to DistilBERT on QQP and RTE, often match or exceed the scores of ELMo, and only fall behind on detecting linguistic acceptability.
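Since the abstract only sketches the approach, here is a minimal, hypothetical PyTorch sketch of what distilling a BERT teacher into an order-aware matrix embedding student could look like. The CMOW-style student (token matrices composed by matrix multiplication) and the temperature-scaled soft-label objective are illustrative assumptions, not the authors' exact implementation; all class names, dimensions, and hyperparameters below are made up for illustration.

    # Hypothetical sketch: distilling a BERT teacher's task logits into a
    # lightweight order-aware matrix embedding student (assumption, not the
    # paper's exact setup).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MatrixEmbeddingStudent(nn.Module):
        """Order-aware student: each token maps to a d x d matrix, and a
        sequence is encoded by multiplying the token matrices left to right,
        so word order changes the result (matrix products do not commute)."""

        def __init__(self, vocab_size, dim=20, num_labels=2):
            super().__init__()
            self.dim = dim
            self.emb = nn.Embedding(vocab_size, dim * dim)
            # Initialize near the identity so long products stay numerically stable.
            init = torch.eye(dim).flatten().repeat(vocab_size, 1)
            self.emb.weight.data.copy_(init + 0.01 * torch.randn_like(init))
            self.classifier = nn.Linear(dim * dim, num_labels)

        def forward(self, input_ids):
            # (batch, seq_len, dim, dim) token matrices
            mats = self.emb(input_ids).view(*input_ids.shape, self.dim, self.dim)
            enc = mats[:, 0]
            for t in range(1, input_ids.size(1)):
                enc = enc @ mats[:, t]  # order-aware composition
            return self.classifier(enc.flatten(1))

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        """Soft-label distillation: KL divergence between the temperature-scaled
        output distributions of student and teacher."""
        s = F.log_softmax(student_logits / temperature, dim=-1)
        t = F.softmax(teacher_logits / temperature, dim=-1)
        return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

In a setup like this, the teacher logits would come from a BERT model fine-tuned on the respective GLUE task, and the student would be trained on the distillation loss, optionally combined with the standard cross-entropy on the gold labels.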


