Top-Down Visual Attention from Analysis by Synthesis

03/23/2023
by Baifeng Shi et al.

Current attention algorithms (e.g., self-attention) are stimulus-driven and highlight all the salient objects in an image. However, intelligent agents such as humans often guide their attention based on the high-level task at hand, focusing only on task-related objects. This ability of task-guided top-down attention provides task-adaptive representations and helps the model generalize to various tasks. In this paper, we consider top-down attention from the classic Analysis-by-Synthesis (AbS) perspective of vision. Prior work indicates a functional equivalence between visual attention and sparse reconstruction; we show that an AbS visual system that optimizes a similar sparse reconstruction objective modulated by a goal-directed top-down signal naturally simulates top-down attention. We further propose the Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention. On real-world applications, AbSViT consistently improves over baselines on vision-language tasks such as VQA and zero-shot retrieval, where language guides the top-down attention. AbSViT can also serve as a general backbone, improving performance on classification, semantic segmentation, and model robustness.
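
To make the idea of top-down modulation concrete, below is a minimal, hypothetical sketch of a ViT attention layer in which a goal-directed signal (e.g., a pooled language embedding) is projected and added to the value pathway of self-attention, biasing which tokens dominate the output. The module name TopDownSelfAttention, the projection top_down_proj, and the exact additive modulation rule are illustrative assumptions, not the authors' AbSViT implementation.

# Minimal, hypothetical sketch of top-down modulated self-attention.
# Illustrative only; names and the modulation rule are assumptions,
# not the authors' AbSViT code.
from typing import Optional

import torch
import torch.nn as nn


class TopDownSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Assumed projection mapping the goal signal into feature space.
        self.top_down_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor,
                top_down: Optional[torch.Tensor] = None) -> torch.Tensor:
        # x: (batch, tokens, dim); top_down: (batch, dim) task embedding,
        # e.g., a pooled language feature for VQA.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)
        if top_down is not None:
            # Feed the goal signal back into the value pathway so that
            # task-relevant content is amplified in the attention output.
            td = self.top_down_proj(top_down)
            v = v + td.reshape(B, self.num_heads, 1, self.head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        out = attn.softmax(dim=-1) @ v  # (B, heads, N, head_dim)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

With top_down=None the layer reduces to ordinary stimulus-driven self-attention; passing a task embedding shifts the attention output toward task-related tokens, which illustrates the controllable top-down behavior the abstract describes.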

Related research

04/23/2022  Visual Attention Emerges from Recurrent Sparse Reconstruction
Visual attention helps achieve robust perception under noise, corruption...

10/16/2021  A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models
Large pretrained vision-language (VL) models can learn a new task with a...

07/18/2022  Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding
To bridge the gap between supervised semantic segmentation and real-worl...

04/19/2022  Behind the Machine's Gaze: Biologically Constrained Neural Networks Exhibit Human-like Visual Attention
By and large, existing computational models of visual attention tacitly ...

08/18/2020  Linguistically-aware Attention for Reducing the Semantic-Gap in Vision-Language Tasks
Attention models are widely used in Vision-language (V-L) tasks to perfo...

09/15/2022  Towards self-attention based visual navigation in the real world
Vision guided navigation requires processing complex visual information ...

07/25/2021  Improving Robot Localisation by Ignoring Visual Distraction
Attention is an important component of modern deep learning. However, le...
