RobeCzech: Czech RoBERTa, a monolingual contextualized language representation model

05/24/2021 · by Milan Straka, et al.

We present RobeCzech, a monolingual RoBERTa language representation model trained on Czech data. RoBERTa is a robustly optimized Transformer-based pretraining approach. We show that RobeCzech considerably outperforms equally-sized multilingual and Czech-trained contextualized language representation models, surpasses the current state of the art in all five evaluated NLP tasks, and reaches state-of-the-art results in four of them. The RobeCzech model is released publicly at https://hdl.handle.net/11234/1-3691 and https://huggingface.co/ufal/robeczech-base.
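The released checkpoint at https://huggingface.co/ufal/robeczech-base can be loaded with the Hugging Face transformers library. The following is a minimal sketch under that assumption; the example sentence and the printed tensor shape are illustrative only and not taken from the paper.

# Minimal sketch: loading the publicly released RobeCzech checkpoint
# via the Hugging Face transformers library (assumed to be installed).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModel.from_pretrained("ufal/robeczech-base")

# Illustrative Czech sentence (hypothetical, not from the paper).
sentence = "RobeCzech je jazykový model pro češtinu."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextualized subword embeddings from the final layer,
# shape: (batch_size, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)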


Related research

12/19/2019 · BERTje: A Dutch BERT Model
The transformer-based pre-trained language model BERT has helped to impr...

12/31/2020 · How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
In this work we provide a systematic empirical comparison of pretrained ...

05/04/2021 · HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish
BERT-based models are currently used for solving nearly all Natural Lang...

03/24/2021 · Czert – Czech BERT-like Model for Language Representation
This paper describes the training process of the first Czech monolingual...

07/16/2021 · Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan
Multilingual language models have been a crucial breakthrough as they co...

08/14/2018 · R-grams: Unsupervised Learning of Semantic Units in Natural Language
This paper introduces a novel type of data-driven segmented unit that we...
