Bridging the Gap for Tokenizer-Free Language Models

08/27/2019
by Dokook Choe et al.

Purely character-based language models (LMs) have been lagging in quality behind word-based models on large-scale datasets, and current state-of-the-art LMs rely on word tokenization. It has been assumed that injecting the prior knowledge of a tokenizer into the model is essential to achieving competitive results. In this paper, we show that, contrary to this conventional wisdom, tokenizer-free LMs with sufficient capacity can achieve competitive performance on a large-scale dataset. We train a vanilla transformer network with 40 self-attention layers on the One Billion Word (lm1b) benchmark and achieve a new state of the art for tokenizer-free LMs, bringing these models on par with their word-based counterparts.
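The abstract's claim is architectural rather than algorithmic: drop the tokenizer and feed raw characters (or bytes) into a sufficiently deep vanilla transformer trained to predict the next symbol. The sketch below illustrates that idea in PyTorch; it is not the authors' code, and the 256-byte vocabulary, layer count, and all dimensions are small placeholder assumptions (the paper itself uses a 40-layer transformer on lm1b).

# Minimal sketch of a tokenizer-free (byte-level) transformer LM in PyTorch.
# Not the paper's implementation: all sizes here are small placeholders;
# the paper trains a 40-layer vanilla transformer on lm1b.
import torch
import torch.nn as nn

class ByteTransformerLM(nn.Module):
    def __init__(self, vocab_size=256, d_model=256, n_heads=4,
                 n_layers=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)  # one embedding per byte
        self.pos = nn.Embedding(max_len, d_model)     # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)     # next-byte logits

    def forward(self, x):  # x: (batch, seq) integer byte ids
        seq = x.size(1)
        # Causal mask so each position attends only to earlier bytes.
        mask = torch.triu(
            torch.full((seq, seq), float("-inf"), device=x.device), diagonal=1)
        h = self.tok(x) + self.pos(torch.arange(seq, device=x.device))
        return self.out(self.blocks(h, mask=mask))

# "Tokenizer-free" means the input pipeline is just raw bytes: no word
# or subword vocabulary is ever built.
text = "Purely character-based language models"
ids = torch.tensor([list(text.encode("utf-8"))])  # (1, seq)
logits = ByteTransformerLM()(ids)                 # (1, seq, 256)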


research
07/20/2017

Syllable-aware Neural Language Models: A Failure to Beat Character-aware Ones

Syllabification does not seem to improve word-level RNN language modeling...
research
11/11/2019

Attending to Entities for Better Text Understanding

Recent progress in NLP witnessed the development of large-scale pre-trained...
research
02/24/2021

When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute

Large language models have become increasingly difficult to train because...
research
12/21/2016

An Empirical Study of Language CNN for Image Captioning

Language Models based on recurrent neural networks have dominated recent...
research
12/15/2016

Reflectance Adaptive Filtering Improves Intrinsic Image Estimation

Separating an image into reflectance and shading layers poses a challenge...
research
02/23/2018

Reusing Weights in Subword-aware Neural Language Models

We propose several ways of reusing subword embeddings and other weights...
research
09/17/2023

A novel approach to measuring patent claim scope based on probabilities obtained from (large) language models

This work proposes to measure the scope of a patent claim as the reciprocal...
