Paying More Attention to Self-attention: Improving Pre-trained Language Models via Attention Guiding

by Shanshan Wang et al.

Pre-trained language models (PLMs) have demonstrated their effectiveness for a broad range of information retrieval and natural language processing tasks. As the core component of PLMs, multi-head self-attention is appealing for its ability to jointly attend to information from different positions. However, researchers have found that PLMs often exhibit fixed attention patterns regardless of the input (e.g., attending excessively to [CLS] or [SEP]), which we argue may neglect important information in other positions. In this work, we propose a simple yet effective attention guiding mechanism that improves the performance of PLMs by encouraging attention toward established, desirable patterns. Specifically, we propose two kinds of attention guiding methods: map discrimination guiding (MDG) and attention pattern decorrelation guiding (PDG). The former explicitly encourages diversity among multiple self-attention heads so that they jointly attend to information from different representation subspaces, while the latter encourages self-attention to attend to as many different positions of the input as possible. We conduct experiments with multiple general pre-trained models (i.e., BERT, ALBERT, and RoBERTa) and domain-specific pre-trained models (i.e., BioBERT, ClinicalBERT, BlueBERT, and SciBERT) on three benchmark datasets (i.e., MultiNLI, MedNLI, and Cross-genre-IR). Extensive experimental results demonstrate that the proposed MDG and PDG bring stable performance improvements on all datasets with high efficiency and low cost.
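To make the two guiding objectives concrete, below is a minimal PyTorch sketch of how such attention regularizers could look. This is an illustrative assumption, not the paper's exact formulation: the function names map_discrimination_loss and pattern_decorrelation_loss, the cosine-similarity penalty used for head diversity, the KL-to-uniform penalty used for position coverage, and the weights lambda_mdg / lambda_pdg are all hypothetical.

```python
import torch
import torch.nn.functional as F


def map_discrimination_loss(attn: torch.Tensor) -> torch.Tensor:
    """MDG-style sketch: encourage each head's attention map to be
    distinguishable from the other heads' maps (diversity across heads).

    attn: attention maps of shape (batch, heads, seq_len, seq_len).
    """
    b, h, n, _ = attn.shape
    maps = attn.reshape(b, h, -1)                    # flatten each head's map
    maps = F.normalize(maps, dim=-1)                 # unit norm -> cosine similarity
    sim = torch.matmul(maps, maps.transpose(1, 2))   # (b, h, h) head-to-head similarity
    # Zero out the diagonal (a head is always similar to itself),
    # then penalize similarity between different heads.
    off_diag = sim - torch.diag_embed(torch.diagonal(sim, dim1=1, dim2=2))
    return off_diag.abs().sum(dim=(1, 2)).mean() / (h * (h - 1))


def pattern_decorrelation_loss(attn: torch.Tensor) -> torch.Tensor:
    """PDG-style sketch: discourage all queries from concentrating on the
    same few positions (e.g., [CLS]/[SEP]) by flattening the total
    attention mass each position receives.

    attn: attention maps of shape (batch, heads, seq_len, seq_len).
    """
    b, h, n, _ = attn.shape
    recv = attn.sum(dim=2)                           # (b, h, n): mass received per position
    recv = recv / recv.sum(dim=-1, keepdim=True)     # normalize to a distribution
    uniform = torch.full_like(recv, 1.0 / n)
    # KL divergence to the uniform distribution: zero when attention
    # is spread evenly across all input positions.
    return F.kl_div(recv.clamp_min(1e-12).log(), uniform, reduction="batchmean")


# Hypothetical usage: add both regularizers to the task loss with small weights.
# attn = one layer's attention maps, e.g. from output_attentions=True, (b, h, n, n)
# loss = task_loss + lambda_mdg * map_discrimination_loss(attn) \
#                  + lambda_pdg * pattern_decorrelation_loss(attn)
```

Under these assumptions, the first term pushes heads apart in map space (different representation subspaces), while the second spreads attention over more input positions, which matches the abstract's description of MDG and PDG at a high level.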




Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models

In this paper, we detail the relationship between convolutions and self-...

Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains

Large pre-trained models have achieved great success in many natural lan...

SparseBERT: Rethinking the Importance Analysis in Self-attention

Transformer-based models are popular for natural language processing (NL...

Multi-Head Attention with Disagreement Regularization

Multi-head attention is appealing for the ability to jointly attend to i...

Self-attention Comparison Module for Boosting Performance on Retrieval-based Open-Domain Dialog Systems

Since the pre-trained language models are widely used, retrieval-based o...

Attending to Entities for Better Text Understanding

Recent progress in NLP witnessed the development of large-scale pre-trai...

Alignment Attention by Matching Key and Query Distributions

The neural attention mechanism has been incorporated into deep neural ne...
