Refining BERT Embeddings for Document Hashing via Mutual Information Maximization

09/07/2021
by   Zijing Ou, et al.

Existing unsupervised document hashing methods are mostly built on generative models. Because long-range dependency structures are difficult to capture, these methods rarely model raw documents directly and instead model features extracted from them (e.g., bag-of-words (BOW) or TFIDF). In this paper, we propose to learn hash codes from BERT embeddings, motivated by their strong performance on downstream tasks. As a first attempt, we modify existing generative hashing models to accommodate BERT embeddings. However, little improvement is observed over codes learned from BOW or TFIDF features. We attribute this to the reconstruction requirement in generative hashing, which forces the irrelevant information that is abundant in BERT embeddings to be compressed into the codes as well. To remedy this issue, we further propose a new unsupervised hashing paradigm based on the mutual information (MI) maximization principle. Specifically, the method first constructs appropriate global and local codes from the documents and then seeks to maximize their mutual information. Experimental results on three benchmark datasets demonstrate that the proposed method generates hash codes that outperform existing ones learned from BOW features by a substantial margin.
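
To make the idea of maximizing mutual information between global and local codes more concrete, the following is a minimal sketch of an InfoNCE-style lower bound computed over a batch of paired vectors. It is only an illustration under stated assumptions: the abstract does not say which MI estimator the authors use or how the global and local codes are constructed, so the function name `infonce_lower_bound`, the dot-product scoring, and the toy inputs are all hypothetical.

```python
import numpy as np


def infonce_lower_bound(global_codes, local_codes):
    """InfoNCE-style lower bound on MI between paired global and local vectors.

    global_codes: (n, d) array, one global vector per document.
    local_codes:  (n, d) array; row i is assumed to come from the same
                  document as row i of global_codes (the positive pair).
    Both names and the dot-product score are assumptions for illustration.
    """
    # Pairwise similarity between every global and every local vector.
    scores = global_codes @ local_codes.T
    # Subtract the row-wise maximum for numerical stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    # Row-wise log-softmax: how strongly each global vector picks out
    # its own local vector among all candidates in the batch.
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    n = scores.shape[0]
    # The bound never exceeds log(n).
    return np.log(n) + np.mean(np.diag(log_softmax))


# Toy data: four documents whose global and local vectors agree exactly.
aligned_global = 10.0 * np.eye(4)
aligned_local = 10.0 * np.eye(4)
# Mismatched pairs: each global vector is paired with another document's local vector.
shuffled_local = np.roll(aligned_local, 1, axis=0)

bound_aligned = infonce_lower_bound(aligned_global, aligned_local)
bound_shuffled = infonce_lower_bound(aligned_global, shuffled_local)
```

In this toy setting, the bound is higher when each global vector is paired with its own local vector than when the pairs are mismatched. A training objective built on such a bound would push the encoder to raise it.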

research
08/29/2019

Document Hashing with Mixture-Prior Generative Models

Hashing is promising for large-scale information retrieval tasks thanks ...
research
05/13/2021

Unsupervised Hashing with Contrastive Information Bottleneck

Many unsupervised hashing methods are implicitly established on the idea...
research
10/31/2022

Efficient Document Retrieval by End-to-End Refining and Quantizing BERT Embedding with Contrastive Product Quantization

Efficient document retrieval heavily relies on the technique of semantic...
research
03/27/2017

MIHash: Online Hashing with Mutual Information

Learning-based hashing methods are widely used for nearest neighbor retr...
research
01/16/2019

Deep Supervised Hashing leveraging Quadratic Spherical Mutual Information for Content-based Image Retrieval

Several deep supervised hashing techniques have been proposed to allow f...
research
11/20/2020

Shuffle and Learn: Minimizing Mutual Information for Unsupervised Hashing

Unsupervised binary representation allows fast data retrieval without an...
research
05/27/2021

Integrating Semantics and Neighborhood Information with Graph-Driven Generative Models for Document Retrieval

With the need of fast retrieval speed and small memory footprint, docume...
