TransAug: Translate as Augmentation for Sentence Embeddings

by Jue Wang et al.

While contrastive learning has greatly advanced sentence embedding representations, it is still limited by the size of existing sentence datasets. In this paper, we present TransAug (Translate as Augmentation), which provides the first exploration of utilizing translated sentence pairs as data augmentation for text, and introduces a two-stage paradigm to advance state-of-the-art sentence embeddings. Instead of adopting an encoder trained on other languages, we first distill a Chinese encoder from a SimCSE encoder (pretrained on English) so that their embeddings are close in semantic space, which can be regarded as implicit data augmentation. Then, we update only the English encoder via cross-lingual contrastive learning and freeze the distilled Chinese encoder. Our approach achieves a new state-of-the-art on standard semantic textual similarity (STS) tasks, outperforming both SimCSE and Sentence-T5, and achieves the best performance in the corresponding tracks on transfer tasks evaluated by SentEval.
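The second stage described above pairs each English sentence with its translation as the positive, while other translations in the batch serve as in-batch negatives. A minimal sketch of such a cross-lingual InfoNCE loss is shown below; this is an illustration under assumed conventions (cosine similarity, a temperature of 0.05 as in SimCSE), not the authors' exact implementation, and the embeddings here would come from the trainable English encoder and the frozen distilled Chinese encoder.

```python
import numpy as np

def cross_lingual_infonce(en_emb, zh_emb, temperature=0.05):
    """InfoNCE loss between aligned English/Chinese sentence embeddings.

    en_emb, zh_emb: (N, d) arrays where row i of each is a translation pair.
    Row i's positive is zh_emb[i]; all other rows act as in-batch negatives.
    In TransAug's second stage, gradients would flow only into the English
    encoder (the Chinese encoder is frozen) -- here we just compute the loss.
    """
    # L2-normalize so the dot product is cosine similarity
    en = en_emb / np.linalg.norm(en_emb, axis=1, keepdims=True)
    zh = zh_emb / np.linalg.norm(zh_emb, axis=1, keepdims=True)
    sim = en @ zh.T / temperature            # (N, N) scaled similarity matrix
    logits = sim - sim.max(axis=1, keepdims=True)  # stabilize the softmax
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    idx = np.arange(len(en))
    return -np.log(probs[idx, idx]).mean()   # -log p(positive) averaged over batch
```

Perfectly aligned pairs drive the diagonal of the similarity matrix toward 1 and the loss toward zero; mismatched or random pairs yield a larger loss, which is the signal that pulls translation pairs together in the shared semantic space.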




Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models

We provide the first exploration of text-to-text transformers (T5) sente...

SimCSE: Simple Contrastive Learning of Sentence Embeddings

This paper presents SimCSE, a simple contrastive learning framework that...

SDA: Simple Discrete Augmentation for Contrastive Sentence Representation Learning

Contrastive learning methods achieve state-of-the-art results in unsuper...

S-SimCSE: Sampled Sub-networks for Contrastive Learning of Sentence Embedding

Contrastive learning has been studied for improving the performance of l...

Identical and Fraternal Twins: Fine-Grained Semantic Contrastive Learning of Sentence Representations

The enhancement of unsupervised learning of sentence representations has...

Leveraging Multi-lingual Positive Instances in Contrastive Learning to Improve Sentence Embedding

Learning multi-lingual sentence embeddings is a fundamental and signific...

Toward Interpretable Semantic Textual Similarity via Optimal Transport-based Contrastive Sentence Learning

Recently, finetuning a pretrained language model to capture the similari...
