IT5: Large-scale Text-to-text Pretraining for Italian Language Understanding and Generation

03/07/2022
by Gabriele Sarti et al.

The T5 model and its unified text-to-text paradigm contributed to advancing the state of the art for many natural language processing tasks. While some multilingual variants of the T5 model have recently been introduced, their performance was found to be suboptimal for languages other than English when compared to monolingual variants. Motivated by these findings, we introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on Italian. We perform a thorough cleaning of a web-crawled Italian corpus comprising more than 40 billion words and use it to pretrain three IT5 models of different sizes. The performance of the IT5 models and their multilingual counterparts is then evaluated on a broad range of natural language understanding and generation benchmarks for Italian. We find that the monolingual IT5 models provide the best scale-to-performance ratio across the tested models, consistently outperforming their multilingual counterparts and setting a new state of the art for most Italian conditional language generation tasks.
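For readers who want to try the models, below is a minimal sketch of loading an IT5 checkpoint with the Hugging Face Transformers library and generating text in the T5 text-to-text style. The Hub identifier "gsarti/it5-base" and the "riassumi:" (summarize) task prefix are illustrative assumptions, not details taken from the abstract.

    # Minimal sketch: load an IT5-style checkpoint and generate text with
    # Hugging Face Transformers. The identifier "gsarti/it5-base" and the
    # "riassumi:" task prefix are assumptions for illustration only.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    model_name = "gsarti/it5-base"  # assumed Hub identifier
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # T5-style models cast every task as text-to-text: the input is a plain
    # string (often with a task prefix set during fine-tuning) and the
    # output is generated autoregressively by the decoder.
    inputs = tokenizer("riassumi: <testo da riassumere>", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Note that the useful task prefix, if any, depends on how a given checkpoint was fine-tuned; a pretrained-only checkpoint would first need fine-tuning on a downstream Italian task.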


Related research

06/25/2021
DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders
While pretrained encoders have achieved success in various natural langu...

10/23/2020
BARThez: a Skilled Pretrained French Sequence-to-Sequence Model
Inductive transfer learning, enabled by self-supervised learning, has t...

07/21/2021
The Effectiveness of Intermediate-Task Training for Code-Switched Natural Language Understanding
While recent benchmarks have spurred a lot of new work on improving the ...

08/01/2022
Efficient Long-Text Understanding with Short-Text Models
Transformer-based pretrained language models (LMs) are ubiquitous across...

03/22/2023
MEGA: Multilingual Evaluation of Generative AI
Generative AI models have impressive performance on many Natural Languag...

08/14/2018
R-grams: Unsupervised Learning of Semantic Units in Natural Language
This paper introduces a novel type of data-driven segmented unit that we...

10/13/2020
Probing for Multilingual Numerical Understanding in Transformer-Based Language Models
Natural language numbers are an example of compositional structures, whe...
