Compressibility of Distributed Document Representations

10/14/2021
by Blaž Škrlj, et al.

Contemporary natural language processing (NLP) revolves around learning from latent document representations, generated either implicitly by neural language models or explicitly by methods such as doc2vec. One of the key properties of the obtained representations is their dimension. Whilst the commonly adopted dimensions of 256 and 768 offer sufficient performance on many tasks, it is often unclear whether the default dimension is the most suitable choice for subsequent downstream learning tasks. Furthermore, representation dimensions are seldom subject to hyperparameter tuning due to computational constraints. The purpose of this paper is to demonstrate that a surprisingly simple and efficient recursive compression procedure can both significantly compress the initial representation and potentially improve its performance on the task of text classification. Smaller and less noisy representations are desirable during deployment, as models that are orders of magnitude smaller can significantly reduce the computational load and, with it, the deployment costs. We propose CoRe, a straightforward, representation-learner-agnostic framework for representation compression. CoRe's performance is showcased and studied on a collection of 17 real-life corpora from the biomedical, news, social media, and literary domains. We explore CoRe's behavior with contextual and non-contextual document representations, different compression levels, and 9 different compression algorithms. Results based on more than 100,000 compression experiments indicate that recursive Singular Value Decomposition offers a very good trade-off between compression efficiency and performance, making CoRe useful in many existing, representation-dependent NLP pipelines.
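The abstract does not spell out the recursive compression procedure in detail. As a rough illustration only, the sketch below assumes each recursion step reduces the embedding dimension by a fixed factor with truncated SVD until a target dimension is reached; the function name `recursive_svd_compress`, the halving factor, and the toy data are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of recursive SVD-based compression of document embeddings
# (hypothetical; the exact CoRe procedure is not given in the abstract).
import numpy as np
from sklearn.decomposition import TruncatedSVD


def recursive_svd_compress(X, target_dim, factor=2):
    """Recursively compress an (n_docs, dim) embedding matrix.

    At each step the dimension is divided by `factor` via truncated SVD,
    stopping once `target_dim` is reached.
    """
    current = X
    while current.shape[1] // factor >= target_dim:
        next_dim = max(current.shape[1] // factor, target_dim)
        svd = TruncatedSVD(n_components=next_dim, random_state=0)
        current = svd.fit_transform(current)
    return current


if __name__ == "__main__":
    # Toy example: 1000 documents with 768-dimensional representations.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 768))
    X_small = recursive_svd_compress(X, target_dim=96)
    print(X.shape, "->", X_small.shape)  # (1000, 768) -> (1000, 96)
```

The compressed matrix can then be fed to any downstream text classifier in place of the original representation, which is the deployment scenario the abstract motivates.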
