LinkBERT: Pretraining Language Models with Document Links

03/29/2022
by Michihiro Yasunaga, et al.

Language model (LM) pretraining can learn various kinds of knowledge from text corpora, helping downstream tasks. However, existing methods such as BERT model a single document and do not capture dependencies or knowledge that span across documents. In this work, we propose LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks. Given a text corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then pretrain the LM with two joint self-supervised objectives: masked language modeling and our new proposal, document relation prediction. We show that LinkBERT outperforms BERT on various downstream tasks across two domains: the general domain (pretrained on Wikipedia with hyperlinks) and the biomedical domain (pretrained on PubMed with citation links). LinkBERT is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA), and our biomedical LinkBERT sets new states of the art on various BioNLP tasks (+7% on BioASQ and USMLE). We release our pretrained models, LinkBERT and BioLinkBERT, as well as code and data at https://github.com/michiyasunaga/LinkBERT.
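To make the input-construction idea concrete, here is a minimal Python sketch of pairing segments from linked documents into a single LM context with a document relation prediction (DRP) label. The three relation classes (contiguous, random, linked) follow the paper's description of DRP; the toy graph (`docs`, `links`) and the helper `make_pair` are hypothetical illustrations, not the authors' released pipeline.

```python
import random

# Toy document graph: doc_id -> text segments, and doc_id -> linked doc_ids
# (hyperlink or citation edges). All names here are illustrative.
docs = {
    "A": ["seg A1", "seg A2"],
    "B": ["seg B1", "seg B2"],
    "C": ["seg C1", "seg C2"],
}
links = {"A": ["B"], "B": ["C"], "C": ["A"]}

DRP_LABELS = {"contiguous": 0, "random": 1, "linked": 2}

def make_pair(anchor_doc: str) -> tuple[str, str, int]:
    """Pick segment A from the anchor doc, then a segment B that is
    contiguous (next segment in the same doc), random (an unrelated doc),
    or linked (a hyperlinked doc), with the corresponding DRP label."""
    seg_a = docs[anchor_doc][0]
    relation = random.choice(list(DRP_LABELS))
    if relation == "contiguous":
        seg_b = docs[anchor_doc][1]
    elif relation == "linked":
        seg_b = docs[random.choice(links[anchor_doc])][0]
    else:  # random: any doc that is neither the anchor nor linked to it
        pool = [d for d in docs if d != anchor_doc and d not in links[anchor_doc]]
        seg_b = docs[random.choice(pool)][0]
    return seg_a, seg_b, DRP_LABELS[relation]

seg_a, seg_b, drp_label = make_pair("A")
lm_input = f"[CLS] {seg_a} [SEP] {seg_b} [SEP]"  # MLM masking is applied on top
print(lm_input, drp_label)
```

In the paper's setup, the pretrained model is then trained jointly on masked language modeling over such inputs and on classifying the DRP label from the [CLS] representation. The released checkpoints can be loaded with the standard Hugging Face transformers API (e.g., AutoModel.from_pretrained("michiyasunaga/LinkBERT-base")); see the linked repository for the exact model names.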


