Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

12/20/2021
by Sabrina J. Mielke, et al.

What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating them as discrete, atomic tokens. Starting with byte-pair encoding (BPE), however, subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road character-level modeling or byte-level processing? In this survey, we connect several lines of work from the pre-neural and neural eras by showing how hybrid approaches combining words and characters, as well as subword-based approaches based on learned segmentation, have been proposed and evaluated. We conclude that there is not, and likely never will be, a silver-bullet solution for all applications, and that thinking seriously about tokenization remains important for many applications.
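To make the subword idea concrete, here is a minimal sketch of BPE-style merge learning. It is a toy illustration under simplifying assumptions, not code from the survey: words start as sequences of characters, there is no end-of-word marker, and ties between equally frequent pairs are broken lexicographically. The function names `learn_bpe`, `merge_pair`, and `segment` are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each distinct word is a tuple of symbols, initially single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

For example, learning a single merge from `["ab", "ab", "ab", "cd"]` picks the pair `("a", "b")`, after which `segment("cab", merges)` yields `["c", "ab"]`. Unseen words still decompose into known symbols, which is what makes the vocabulary open.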
