SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding

07/27/2022
by Mengxue Qu, et al.

In this paper, we investigate how to achieve better visual grounding with modern vision-language transformers, and propose a simple yet powerful Selective Retraining (SiRi) mechanism for this challenging task. In particular, SiRi conveys a significant principle to the research of visual grounding, i.e., a better-initialized vision-language encoder helps the model converge to a better local minimum, advancing the performance accordingly. Concretely, we continually update the parameters of the encoder as training goes on, while periodically re-initializing the rest of the parameters to compel the model to be better optimized on top of the enhanced encoder. SiRi significantly outperforms previous approaches on three popular benchmarks. Specifically, our method achieves 83.04% top-1 accuracy on RefCOCO+ testA, outperforming state-of-the-art approaches (training from scratch) by more than 10.21%. Additionally, we reveal that SiRi performs surprisingly well even with limited training data. We also extend it to transformer-based visual grounding models and other vision-language tasks to verify its validity.
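To make the retraining recipe above concrete, here is a minimal PyTorch sketch of one way the mechanism could be implemented. The `model.encoder` layout, the `reinit_period` schedule, and the `selective_retrain_step` helper are illustrative assumptions, not the authors' released code.

```python
import torch.nn as nn

# Minimal sketch of the SiRi idea from the abstract: keep the continually
# trained encoder weights, and periodically re-initialize everything else.
# NOTE: `model.encoder`, `reinit_period`, and this helper's name are
# assumptions made for illustration, not the paper's implementation.

def selective_retrain_step(model: nn.Module, epoch: int,
                           reinit_period: int = 30) -> None:
    if epoch == 0 or epoch % reinit_period != 0:
        return  # only retrain at the end of each period
    encoder_param_ids = {id(p) for p in model.encoder.parameters()}
    for module in model.modules():
        own_params = list(module.parameters(recurse=False))
        # Skip parameter-free modules and anything belonging to the encoder.
        if not own_params or all(id(p) in encoder_param_ids for p in own_params):
            continue
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()  # re-initialize non-encoder weights in place
```

Called once per epoch inside the training loop (e.g. `selective_retrain_step(model, epoch)`), this keeps the enhanced encoder while the decoder and prediction heads restart from fresh initializations; in practice the optimizer state for the re-initialized parameters would also be rebuilt so stale momentum statistics do not carry over between retraining rounds.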

Related research

06/14/2022
TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer
In this work, we explore neat yet effective Transformer-based frameworks...

06/06/2021
Referring Transformer: A One-step Approach to Multi-task Visual Grounding
As an important step towards visual reasoning, visual grounding (e.g., p...

07/05/2022
Weakly Supervised Grounding for VQA in Vision-Language Transformers
Transformers for visual-language representation learning have been getti...

05/10/2021
Visual Grounding with Transformers
In this paper, we propose a transformer based approach for visual ground...

09/16/2021
Fast-Slow Transformer for Visually Grounding Speech
We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-...

06/02/2019
Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation
Vision-and-Language Navigation (VLN) requires grounding instructions, su...

10/22/2022
HAM: Hierarchical Attention Model with High Performance for 3D Visual Grounding
This paper tackles an emerging and challenging vision-language task, 3D ...
