MDAPT: Multilingual Domain Adaptive Pretraining in a Single Model

by Rasmus Kær Jørgensen, et al.

Domain adaptive pretraining, i.e. the continued unsupervised pretraining of a language model on domain-specific text, improves the modelling of text for downstream tasks within the domain. Numerous real-world applications are based on domain-specific text, e.g. working with financial or biomedical documents, and these applications often need to support multiple languages. However, large-scale domain-specific multilingual pretraining data for such scenarios can be difficult to obtain, due to regulations, legislation, or simply a lack of language- and domain-specific text. One solution is to train a single multilingual model, taking advantage of the data available in as many languages as possible. In this work, we explore the benefits of domain adaptive pretraining with a focus on adapting to multiple languages within a specific domain. We propose different techniques to compose pretraining corpora that enable a language model to become both domain-specific and multilingual. Evaluation on nine domain-specific datasets (for biomedical named entity recognition and financial sentence classification) covering seven different languages shows that a single multilingual domain-specific model can outperform the general multilingual model and performs close to its monolingual counterpart. This finding holds across two different pretraining methods: adapter-based pretraining and full model pretraining.




