LexMAE: Lexicon-Bottlenecked Pretraining for Large-Scale Retrieval

08/31/2022
by Tao Shen, et al.

In large-scale retrieval, the lexicon-weighting paradigm, which learns weighted sparse representations in vocabulary space, has shown promising results with high quality and low latency. Although it deeply exploits the lexicon-representing capability of pre-trained language models, a crucial gap remains between language modeling and lexicon-weighting retrieval: the former prefers certain (low-entropy) words, whereas the latter favors pivot (high-entropy) words. This gap is the main barrier to lexicon-weighting performance in large-scale retrieval. To bridge it, we propose a new pre-training framework, the lexicon-bottlenecked masked autoencoder (LexMAE), to learn importance-aware lexicon representations. Essentially, we place a lexicon-bottlenecked module between a normal language-modeling encoder and a weakened decoder, where a continuous bag-of-words bottleneck is constructed to learn a lexicon-importance distribution in an unsupervised fashion. The pre-trained LexMAE transfers readily to lexicon-weighting retrieval via fine-tuning, achieving 42.6% MRR@10 with 45.83 QPS on a CPU machine on the MS-Marco passage retrieval benchmark. LexMAE also shows state-of-the-art zero-shot transfer capability on the BEIR benchmark with 12 datasets.
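To make the bottleneck idea concrete, here is a minimal PyTorch sketch of a continuous bag-of-words bottleneck in the spirit of the abstract: the encoder's vocabulary-space logits are pooled into a single lexicon-importance distribution, which is then collapsed into one dense vector that a weakened decoder would have to rely on for reconstruction. The module and parameter names (LexiconBottleneck, to_vocab, the max-pooling over positions, tying to the word-embedding matrix) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LexiconBottleneck(nn.Module):
    """Continuous bag-of-words bottleneck over the vocabulary (illustrative sketch)."""

    def __init__(self, vocab_size: int = 30522, hidden: int = 768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)  # assumed shared with the decoder input embeddings
        self.to_vocab = nn.Linear(hidden, vocab_size)     # encoder LM head projecting to vocabulary space

    def forward(self, enc_hidden: torch.Tensor):
        # enc_hidden: (batch, seq_len, hidden) from a BERT-style encoder.
        # Pool token-level vocabulary logits into one sequence-level score per word.
        logits = self.to_vocab(enc_hidden).amax(dim=1)      # (batch, vocab)
        # Lexicon-importance distribution, learned without lexical supervision.
        lex_dist = torch.softmax(logits, dim=-1)            # (batch, vocab)
        # Continuous bag-of-words vector: importance-weighted sum of word embeddings.
        cbow = lex_dist @ self.word_emb.weight              # (batch, hidden)
        return cbow, lex_dist

bottleneck = LexiconBottleneck()
enc_hidden = torch.randn(2, 128, 768)     # stand-in for encoder hidden states
cbow, lex_dist = bottleneck(enc_hidden)   # cbow: (2, 768); lex_dist: (2, 30522)
```

In LexMAE's setup, feeding only such a bottleneck vector (together with the masked input) to a deliberately weak decoder is what pushes the encoder to concentrate probability mass on pivot words; the pooling and weight-sharing choices above are just one plausible way to realize that.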

Related research

02/06/2023
LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval
Image-text retrieval (ITR) is a task to retrieve the relevant images/tex...

08/21/2022
A Contrastive Pre-training Approach to Learn Discriminative Autoencoder for Dense Retrieval
Dense retrieval (DR) has shown promising results in information retrieva...

05/22/2023
Challenging Decoder helps in Masked Auto-Encoder Pre-training for Dense Passage Retrieval
Recently, various studies have been directed towards exploring dense pas...

04/11/2021
Fine-tuning Encoders for Improved Monolingual and Zero-shot Polylingual Neural Topic Modeling
Neural topic models can augment or replace bag-of-words inputs with the ...

05/19/2022
PLAID: An Efficient Engine for Late Interaction Retrieval
Pre-trained language models are increasingly important components across...

04/22/2022
Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction
Dense retrieval has shown promising results in many information retrieva...

04/27/2023
Large Language Models are Strong Zero-Shot Retriever
In this work, we propose a simple method that applies a large language m...
