A Neural Corpus Indexer for Document Retrieval

by Yujing Wang, et al.
University of Illinois at Urbana-Champaign
Tsinghua University

Current state-of-the-art document retrieval solutions mainly follow an index-retrieve paradigm, in which the index is hard to optimize for the final retrieval target. In this paper, we aim to show that an end-to-end deep neural network unifying the training and indexing stages can significantly improve the recall performance of traditional methods. To this end, we propose Neural Corpus Indexer (NCI), a sequence-to-sequence network that generates relevant document identifiers directly for a designated query. To optimize the recall performance of NCI, we invent a prefix-aware weight-adaptive decoder architecture and leverage tailored techniques including query generation, semantic document identifiers, and consistency-based regularization. Empirical studies demonstrate the superiority of NCI on a commonly used academic benchmark, achieving +51.9 over the best baseline.



Ultron: An Ultimate Retriever on Corpus with a Model-based Indexer

Document retrieval has been extensively studied within the index-retriev...

Learning to Tokenize for Generative Retrieval

Conventional document retrieval techniques are mainly based on the index...

Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies

Recently, a new paradigm called Differentiable Search Index (DSI) has be...

How Does Generative Retrieval Scale to Millions of Passages?

Popularized by the Differentiable Search Index, the emerging paradigm of...

Learning Term Discrimination

Document indexing is a key component for efficient information retrieval...

Doc2Query–: When Less is More

Doc2Query – the process of expanding the content of a document before in...
