BIOptimus: Pre-training an Optimal Biomedical Language Model with Curriculum Learning for Named Entity Recognition

by   Pavlova Vera, et al.

Using language models (LMs) pre-trained in a self-supervised setting on large corpora and then fine-tuning for a downstream task has helped to deal with the problem of limited label data for supervised learning tasks such as Named Entity Recognition (NER). Recent research in biomedical language processing has offered a number of biomedical LMs pre-trained using different methods and techniques that advance results on many BioNLP tasks, including NER. However, there is still a lack of a comprehensive comparison of pre-training approaches that would work more optimally in the biomedical domain. This paper aims to investigate different pre-training methods, such as pre-training the biomedical LM from scratch and pre-training it in a continued fashion. We compare existing methods with our proposed pre-training method of initializing weights for new tokens by distilling existing weights from the BERT model inside the context where the tokens were found. The method helps to speed up the pre-training stage and improve performance on NER. In addition, we compare how masking rate, corruption strategy, and masking strategies impact the performance of the biomedical LM. Finally, using the insights from our experiments, we introduce a new biomedical LM (BIOptimus), which is pre-trained using Curriculum Learning (CL) and contextualized weight distillation method. Our model sets new states of the art on several biomedical Named Entity Recognition (NER) tasks. We release our code and all pre-trained models


page 1

page 2

page 3

page 4


Formulating Few-shot Fine-tuning Towards Language Model Pre-training: A Pilot Study on Named Entity Recognition

Fine-tuning pre-trained language models has recently become a common pra...

Biomedical Language Models are Robust to Sub-optimal Tokenization

As opposed to general English, many concepts in biomedical terminology h...

Learning Rich Representation of Keyphrases from Text

In this work, we explore how to learn task-specific language models aime...

Stack Over-Flowing with Results: The Case for Domain-Specific Pre-Training Over One-Size-Fits-All Models

Large pre-trained neural language models have brought immense progress t...

A reproducible experimental survey on biomedical sentence similarity: a string-based method sets the state of the art

This registered report introduces the largest, and for the first time, r...

BioALBERT: A Simple and Effective Pre-trained Language Model for Biomedical Named Entity Recognition

In recent years, with the growing amount of biomedical documents, couple...

Recognising Biomedical Names: Challenges and Solutions

The growth rate in the amount of biomedical documents is staggering. Unl...

Please sign up or login with your details

Forgot password? Click here to reset