You should evaluate your language model on marginal likelihood over tokenisations

09/06/2021
by Kris Cao, et al.

Neural language models typically tokenise input text into sub-word units to achieve an open vocabulary. The standard approach is to use a single canonical tokenisation at both train and test time. We suggest that this approach is unsatisfactory and may bottleneck our evaluation of language model performance. Using only the one-best tokenisation ignores tokeniser uncertainty over alternative tokenisations, which may hurt model out-of-domain performance. In this paper, we argue that language models should instead be evaluated on their marginal likelihood over tokenisations. We compare different sampling-based estimators for the marginal likelihood, and show that it is feasible to estimate the marginal likelihood with a manageable number of samples. We then evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities, and show that the marginal perplexity can be significantly better than the one-best, especially on out-of-domain data. We link this difference in perplexity to tokeniser uncertainty as measured by tokeniser entropy. We discuss some implications of our results for language model training and evaluation, particularly with regard to tokenisation robustness.
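As a rough illustration of the sampling idea (not the authors' code), the sketch below lower-bounds the marginal likelihood by summing the model's probabilities over the distinct tokenisations found by sampling from a unigram SentencePiece tokeniser; the paper itself compares importance-sampling estimators. The model path, the log_p_lm stub, and the assumption that the language model shares the tokeniser's vocabulary are placeholders, and calculate_entropy requires a recent sentencepiece release.

```python
# Minimal sketch, assuming a unigram SentencePiece tokeniser whose vocabulary
# matches the language model's. "tokeniser.model" is a hypothetical path and
# log_p_lm is a stub to be replaced with a real model.
import math

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokeniser.model")  # hypothetical path


def log_p_lm(token_ids: list[int]) -> float:
    """Log-probability of one tokenisation under the language model.

    Stub: plug in a real LM that shares sp's vocabulary, e.g. a Hugging Face
    causal LM scored via model(ids, labels=ids).
    """
    raise NotImplementedError


def marginal_log_likelihood_lower_bound(text: str, n_samples: int = 64,
                                        alpha: float = 0.3) -> float:
    """Lower-bound log p(text) by summing p(t) over distinct tokenisations t."""
    # Collect distinct tokenisations: the one-best (Viterbi) segmentation
    # plus segmentations sampled from the tokeniser's lattice.
    tokenisations = {tuple(sp.encode(text))}
    for _ in range(n_samples):
        tokenisations.add(tuple(sp.encode(text, enable_sampling=True,
                                          alpha=alpha, nbest_size=-1)))
    # p(text) = sum over all tokenisations t of p(t), so summing the model
    # probabilities of any *distinct* subset lower-bounds the true marginal.
    log_ps = [log_p_lm(list(t)) for t in tokenisations]
    m = max(log_ps)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(lp - m) for lp in log_ps))


def tokeniser_entropy(text: str, alpha: float = 0.3) -> float:
    """Entropy of the tokeniser's segmentation distribution for this text."""
    return sp.calculate_entropy(text, alpha)
```

Because each distinct segmentation contributes its exact model probability exactly once, the estimate can only improve, never overshoot, as more samples are drawn; the gap between this marginal estimate and the one-best perplexity is what the paper relates to the tokeniser entropy computed above.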
