Sparsifying Transformer Models with Differentiable Representation Pooling

09/10/2020
by   Michał Pietruszka, et al.

We propose a novel method to sparsify attention in the Transformer model by learning to select the most informative token representations, thereby leveraging the model's information bottleneck with twofold strength. A careful analysis shows that the contextualization of encoded representations in our model is significantly more effective than in the original Transformer. We achieve a notable reduction in memory usage thanks to an improved differentiable top-k operator, making the model suitable for processing long documents, as demonstrated on a summarization task.
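The abstract centers on a differentiable top-k operator that pools a long sequence down to its k most informative token representations. The paper's exact operator is not given here, so the sketch below is only a minimal illustration in PyTorch: it scores tokens with a learned linear layer, keeps the hard top-k in the forward pass, and lets gradients reach the scorer through a straight-through softmax weighting. The class name `SoftTopKPool` and all design details are assumptions for illustration, not the authors' method.

```python
import torch
import torch.nn as nn


class SoftTopKPool(nn.Module):
    """Keep the k highest-scoring token representations from a sequence.

    Illustrative straight-through relaxation: hard top-k indices are used in
    the forward pass, while gradients flow to the scorer via softmax weights.
    (Not the paper's exact differentiable top-k operator.)
    """

    def __init__(self, d_model: int, k: int):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(d_model, 1)  # per-token informativeness score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scores = self.scorer(x).squeeze(-1)           # (batch, seq_len)
        soft = torch.softmax(scores, dim=-1)          # differentiable weights
        topk = scores.topk(self.k, dim=-1).indices    # hard selection, (batch, k)

        # Gather the selected token representations.
        idx = topk.unsqueeze(-1).expand(-1, -1, x.size(-1))
        selected = x.gather(1, idx)                   # (batch, k, d_model)

        # Straight-through trick: forward value equals the hard selection,
        # but the scorer still receives gradients through the soft weights.
        w = soft.gather(1, topk).unsqueeze(-1)        # (batch, k, 1)
        return selected * (w - w.detach() + 1.0)


if __name__ == "__main__":
    pool = SoftTopKPool(d_model=64, k=16)
    tokens = torch.randn(2, 512, 64)   # a long input sequence
    pooled = pool(tokens)              # (2, 16, 64): shortened sequence
    print(pooled.shape)
```

Because subsequent attention layers operate on the pooled (shorter) sequence, the quadratic cost of self-attention shrinks accordingly, which is the source of the memory savings the abstract describes.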
