Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words

02/24/2022
by Zhangyin Feng, et al.

The standard BERT adopts subword-based tokenization, which may break a word into two or more wordpieces (e.g., converting "lossless" to "loss" and "less"). This brings inconvenience in the following situations: (1) what is the best way to obtain the contextual vector of a word that is divided into multiple wordpieces? (2) how can a word be predicted in a cloze test without knowing the number of wordpieces in advance? In this work, we explore the possibility of developing a BERT-style pretrained model over a vocabulary of words instead of wordpieces. We call such a word-level BERT model WordBERT. We train models with different vocabulary sizes, initialization configurations and languages. Results show that, compared to standard wordpiece-based BERT, WordBERT achieves significant improvements on cloze tests and machine reading comprehension. On many other natural language understanding tasks, including POS tagging, chunking and NER, WordBERT consistently performs better than BERT. Model analysis indicates that the major advantage of WordBERT over BERT lies in the understanding of low-frequency and rare words. Furthermore, since the pipeline is language-independent, we train WordBERT for Chinese and obtain significant gains on five natural language understanding datasets. Lastly, an analysis of inference speed shows that WordBERT has a time cost comparable to BERT's on natural language understanding tasks.
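The splitting described above is the standard greedy longest-match wordpiece procedure. The sketch below is not taken from the paper; it uses hypothetical toy vocabularies purely to contrast wordpiece splitting with the word-level lookup that a model like WordBERT relies on, where "lossless" stays a single token.

# A minimal sketch, not the paper's code: greedy longest-match-first
# wordpiece splitting (BERT-style) vs. a word-level vocabulary lookup.
# Both toy vocabularies here are hypothetical and only for illustration;
# WordBERT's actual word vocabulary contains millions of entries.

WORDPIECE_VOCAB = {"loss", "##less", "less", "[UNK]"}
WORD_VOCAB = {"lossless", "loss", "less", "[UNK]"}

def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first splitting, as in BERT's WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                      # nothing matches: unknown word
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def word_tokenize(word, vocab):
    """Word-level lookup: the word is kept whole or mapped to [UNK]."""
    return [word] if word in vocab else ["[UNK]"]

print(wordpiece_tokenize("lossless", WORDPIECE_VOCAB))  # ['loss', '##less']
print(word_tokenize("lossless", WORD_VOCAB))            # ['lossless']

With the word-level lookup, a masked position always corresponds to exactly one token, which is why cloze-style prediction does not require knowing the number of pieces in advance; the trade-off is the much larger, multi-million-entry output vocabulary.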

