Impact of Tokenization on Language Models: An Analysis for Turkish

by Cagri Toraman, et al.

Tokenization is an important text preprocessing step that prepares input tokens for deep language models. WordPiece and BPE are the de facto methods employed by prominent models such as BERT and GPT. However, the impact of tokenization can differ for morphologically rich languages, such as the Turkic languages, where many words can be generated by adding prefixes and suffixes. We compare five tokenizers at different granularity levels, i.e., their outputs range from the smallest pieces (characters) to the surface forms of words, including a Morphological-level tokenizer. We train these tokenizers and pretrain medium-sized language models using the RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. We then fine-tune our models on six downstream tasks. Our experiments, supported by statistical tests, reveal that the Morphological-level tokenizer performs competitively with the de facto tokenizers. Furthermore, we find that increasing the vocabulary size improves the performance of the Morphological- and Word-level tokenizers more than that of the de facto tokenizers. The ratio of the number of vocabulary parameters to the total number of model parameters can be empirically chosen as 20% for de facto tokenizers and 40% for other tokenizers, as a trade-off between model size and performance.
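The granularity levels compared in the abstract can be illustrated on a single Turkish word. The sketch below is illustrative only: the morpheme split is hand-written for this example rather than produced by a morphological analyzer, and the parameter counts in the ratio calculation are assumed, not taken from the paper.

```python
# Tokenizer granularity on the Turkish word "evlerimizde" ("in our houses").
word = "evlerimizde"

char_tokens = list(word)                    # Character level: smallest pieces
morph_tokens = ["ev", "ler", "imiz", "de"]  # Morphological level (hand-split):
                                            # root + plural + possessive + locative
word_tokens = [word]                        # Word level: the surface form

print(char_tokens)   # ['e', 'v', 'l', 'e', 'r', 'i', 'm', 'i', 'z', 'd', 'e']
print(morph_tokens)  # ['ev', 'ler', 'imiz', 'de']

# Vocabulary-to-model parameter ratio, with hypothetical sizes:
# vocabulary parameters = vocab_size * hidden_size (the embedding matrix).
vocab_size, hidden_size, total_params = 50_000, 768, 125_000_000
ratio = vocab_size * hidden_size / total_params
print(f"vocabulary share of parameters: {ratio:.0%}")  # about 31% here
```

A subword tokenizer such as BPE or WordPiece would fall between the character and word levels, merging frequent character sequences into units that need not align with morpheme boundaries.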




