Universal Sentence Representation Learning with Conditional Masked Language Model

12/28/2020
by Ziyi Yang, et al.

This paper presents a novel training method, Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations on large-scale unlabeled corpora. CMLM integrates sentence representation learning into MLM training by conditioning on the encoded vectors of adjacent sentences. Our English CMLM model achieves state-of-the-art performance on SentEval, even outperforming models learned using (semi-)supervised signals. As a fully unsupervised learning method, CMLM can be conveniently extended to a broad range of languages and domains. We find that a multilingual CMLM model co-trained with bitext retrieval (BR) and natural language inference (NLI) tasks outperforms the previous state-of-the-art multilingual models by a large margin. We explore the same-language bias of the learned representations, and propose a principal-component-based approach to remove the language-identifying information from the representation while still retaining sentence semantics.
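The abstract describes two mechanisms that a short sketch can make concrete. First, CMLM conditions standard masked-token prediction on the encoded vector of an adjacent sentence. The snippet below is a minimal illustration, not the authors' code: it assumes a simple conditioning scheme in which the adjacent-sentence vector is projected once and added to each token embedding of the masked sentence (the paper projects the vector into several spaces), and names such as CMLMSketch and ctx_proj are hypothetical.

```python
# Minimal CMLM-style sketch (hypothetical, not the paper's implementation).
# Assumption: the encoded vector of the adjacent sentence is projected and added
# to every token embedding of the masked sentence before MLM prediction.
# Positional encodings and masking utilities are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CMLMSketch(nn.Module):
    def __init__(self, vocab_size=30522, dim=256, n_layers=4, n_heads=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        # Encoder for the adjacent (context) sentence -> sentence vector.
        self.ctx_encoder = nn.TransformerEncoder(layer, n_layers)
        # Encoder for the masked sentence, conditioned on the context vector.
        self.mlm_encoder = nn.TransformerEncoder(layer, n_layers)
        self.ctx_proj = nn.Linear(dim, dim)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, ctx_ids, masked_ids, labels):
        # Sentence vector of the adjacent sentence (mean pooling used here;
        # the actual pooling choice may differ).
        ctx_vec = self.ctx_encoder(self.tok_emb(ctx_ids)).mean(dim=1)   # (B, dim)
        cond = self.ctx_proj(ctx_vec).unsqueeze(1)                      # (B, 1, dim)
        # Condition the masked sentence's token embeddings on the context vector.
        hidden = self.mlm_encoder(self.tok_emb(masked_ids) + cond)      # (B, T, dim)
        logits = self.lm_head(hidden)
        # Standard MLM loss over masked positions (labels = -100 elsewhere).
        return F.cross_entropy(logits.view(-1, logits.size(-1)),
                               labels.view(-1), ignore_index=-100)
```

Second, the principal-component-based removal of language-identifying information can be approximated by stripping the top principal direction(s) of the multilingual embedding matrix, under the assumption that those directions mostly encode language identity rather than sentence semantics. This is a sketch of the general idea, not necessarily the paper's exact procedure:

```python
# Hedged sketch: remove the top-k principal components of an embedding matrix.
import numpy as np

def remove_top_components(embeddings: np.ndarray, k: int = 1) -> np.ndarray:
    """Return mean-centered embeddings with the top-k principal directions removed."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # SVD of the centered matrix; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:k]                               # (k, dim)
    return centered - centered @ top.T @ top   # subtract projection onto top-k directions
```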

