Self-Attentive Pooling for Efficient Deep Learning

by Fang Chen, et al.

Efficient custom pooling techniques that can aggressively trim the dimensions of a feature map, and thereby reduce inference compute and memory footprint for resource-constrained computer vision applications, have recently gained significant traction. However, prior pooling works extract only the local context of the activation maps, limiting their effectiveness. In contrast, we propose a novel non-local self-attentive pooling method that can be used as a drop-in replacement for standard pooling layers, such as max/average pooling or strided convolution. The proposed self-attention module uses patch embedding, multi-head self-attention, and spatial-channel restoration, followed by sigmoid activation and exponential soft-max. This self-attention mechanism efficiently aggregates dependencies between non-local activation patches during down-sampling. Extensive experiments on standard object classification and detection tasks with various convolutional neural network (CNN) architectures demonstrate the superiority of our proposed mechanism over state-of-the-art (SOTA) pooling techniques. In particular, we surpass the test accuracy of existing pooling techniques on different variants of MobileNet-V2 on ImageNet by an average of 1.2%. With aggressive down-sampling of the activation maps in the initial layers (providing up to 22x reduction in memory consumption), our approach achieves 1.43% higher test accuracy than SOTA techniques with iso-memory footprints. This enables the deployment of our models on memory-constrained devices, such as micro-controllers (without losing significant accuracy), because the initial activation maps consume a significant amount of on-chip memory for the high-resolution images required by complex vision tasks. Our proposed pooling method also leverages the idea of channel pruning to further reduce memory footprints.
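The pipeline described above (patch embedding, non-local self-attention over patch tokens, restoration to a per-patch importance weight, and sigmoid gating of a down-sampled map) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it uses a single attention head, random projection weights, and a scalar sigmoid gate per patch purely to show the data flow; all dimensions (`d = 16`, patch size 2) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attentive_pool(x, patch=2, d=16, rng=None):
    """Sketch of non-local self-attentive pooling (stride = patch size).

    x: feature map of shape (C, H, W); H and W divisible by `patch`.
    Returns a down-sampled map of shape (C, H // patch, W // patch).
    Weights are random stand-ins for learned parameters.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    C, H, W = x.shape
    h, w = H // patch, W // patch
    # 1) Patch embedding: flatten each patch across channels, project to d dims.
    patches = (x.reshape(C, h, patch, w, patch)
                .transpose(1, 3, 0, 2, 4)
                .reshape(h * w, C * patch * patch))
    W_embed = rng.standard_normal((C * patch * patch, d)) / np.sqrt(C * patch * patch)
    tokens = patches @ W_embed                      # (N, d), N = h*w patch tokens
    # 2) Single-head self-attention over ALL patch tokens -> non-local context.
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    attn = softmax((tokens @ Wq) @ (tokens @ Wk).T / np.sqrt(d), axis=-1)
    ctx = attn @ tokens                             # attended tokens, (N, d)
    # 3) Spatial restoration: project each token back to one scalar per patch,
    #    squash with a sigmoid to get an importance weight in (0, 1).
    w_out = rng.standard_normal((d, 1)) / np.sqrt(d)
    weights = 1.0 / (1.0 + np.exp(-(ctx @ w_out).reshape(h, w)))
    # 4) Gated down-sampling: scale the average-pooled map by the weights.
    pooled = x.reshape(C, h, patch, w, patch).mean(axis=(2, 4))  # (C, h, w)
    return pooled * weights[None, :, :]

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
y = self_attentive_pool(x, patch=2)
print(y.shape)  # (2, 2, 2)
```

Because the attention matrix relates every patch to every other patch, the gate on each output location reflects global (non-local) context, unlike max or average pooling, which only see the pixels inside their own window.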




