Term-Sets Can Be Strong Document Identifiers For Auto-Regressive Search Engines
Auto-regressive search engines emerge as a promising paradigm for next-gen information retrieval systems. These methods work with Seq2Seq models, where each query can be directly mapped to the identifier of its relevant document. As such, they are praised for merits like being end-to-end differentiable. However, auto-regressive search engines also confront challenges in retrieval quality, given the requirement for the exact generation of the document identifier. That's to say, the targeted document will be missed from the retrieval result if a false prediction about its identifier is made in any step of the generation process. In this work, we propose a novel framework, namely AutoTSG (Auto-regressive Search Engine with Term-Set Generation), which is featured by 1) the unordered term-based document identifier and 2) the set-oriented generation pipeline. With AutoTSG, any permutation of the term-set identifier will lead to the retrieval of the corresponding document, thus largely relaxing the requirement of exact generation. Besides, the Seq2Seq model is enabled to flexibly explore the optimal permutation of the document identifier for the presented query, which may further contribute to the retrieval quality. AutoTSG is empirically evaluated with Natural Questions and MS MARCO, where notable improvements can be achieved against the existing auto-regressive search engines.
READ FULL TEXT