Complementing Lexical Retrieval with Semantic Residual Embedding
Information retrieval has traditionally relied on lexical matching signals, but lexical matching cannot handle vocabulary mismatch or topic-level matching. Neural embedding-based retrieval models can match queries and documents in a latent semantic space, but they lose the token-level matching information that is critical to IR. This paper presents CLEAR, a deep retrieval model that seeks to complement lexical retrieval with semantic embedding retrieval. Importantly, CLEAR uses a residual-based embedding learning framework, which focuses the embedding on the deep language structures and semantics that lexical retrieval fails to capture. Empirical evaluation demonstrates the advantages of CLEAR over classic bag-of-words retrieval models, recent BERT-enhanced lexical retrieval models, and a BERT-based embedding retrieval model. Full-collection retrieval with CLEAR can be as effective as a BERT-based reranking system, substantially narrowing the gap between full-collection retrieval and cost-prohibitive reranking systems.
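
The abstract does not spell out how the residual-based learning is formulated, so the following is only a minimal sketch of one plausible reading, not the paper's stated method. It assumes a pairwise hinge loss whose margin shrinks when the lexical retriever already ranks the relevant document above the non-relevant one, and a final score that adds a weighted lexical score to an embedding cosine similarity. All names and defaults (`combined_score`, `residual_margin`, `residual_hinge_loss`, `lex_weight`, `base_margin`) are hypothetical and chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_score(lex_score, q_emb, d_emb, lex_weight=0.1):
    """Assumed final ranking score: weighted lexical score plus embedding similarity."""
    return lex_weight * lex_score + cosine(q_emb, d_emb)


def residual_margin(lex_pos, lex_neg, base_margin=1.0, lex_weight=0.1):
    """Assumed residual margin.

    The margin is small when the lexical model already separates the pair
    correctly (lex_pos > lex_neg) and large when it gets the pair wrong.
    """
    return base_margin - lex_weight * (lex_pos - lex_neg)


def residual_hinge_loss(q_emb, pos_emb, neg_emb, lex_pos, lex_neg,
                        base_margin=1.0, lex_weight=0.1):
    """Pairwise hinge loss on embedding similarities using the residual margin."""
    margin = residual_margin(lex_pos, lex_neg, base_margin, lex_weight)
    return max(0.0, margin - cosine(q_emb, pos_emb) + cosine(q_emb, neg_emb))


if __name__ == "__main__":
    q, pos, neg = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
    # Lexical model already ranks the pair correctly: little left to learn.
    print(residual_hinge_loss(q, pos, neg, lex_pos=10.0, lex_neg=0.0))
    # Lexical model ranks the pair wrongly: the embedding must compensate.
    print(residual_hinge_loss(q, pos, neg, lex_pos=0.0, lex_neg=10.0))
```

Under these assumptions, training pairs that the lexical model already handles contribute little or no loss, so the embedding's capacity is spent on the pairs where lexical matching fails, which is the behavior the abstract attributes to the residual framework.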