All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment

07/07/2023
by Chunhui Zhang, et al.

The current mainstream vision-language (VL) tracking framework consists of three parts: a visual feature extractor, a language feature extractor, and a fusion model. To pursue better performance, the natural modus operandi for VL tracking is to employ customized, heavier unimodal encoders and multi-modal fusion models. Albeit effective, existing VL trackers separate feature extraction from feature integration, so the extracted features lack semantic guidance and have limited target-aware capability in complex scenarios such as similar distractors and extreme illumination. In this work, inspired by the recent success of foundation models with unified architectures for both natural language and computer vision tasks, we propose an All-in-One framework that learns joint feature extraction and interaction with a unified transformer backbone. Specifically, we mix raw vision and language signals to generate language-injected vision tokens, which we concatenate before feeding them into the unified backbone. This achieves feature integration within a single backbone, removing the need for carefully designed fusion modules and yielding a more effective and efficient VL tracking framework. To further improve learning efficiency, we introduce a multi-modal alignment module based on cross-modal and intra-modal contrastive objectives, which provides more reasonable representations for the unified All-in-One transformer backbone. Extensive experiments on five benchmarks, OTB99-L, TNL2K, LaSOT, LaSOT_Ext, and WebUAV-3M, demonstrate the superiority of the proposed tracker over existing state-of-the-art VL trackers. Code will be made publicly available.
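The two core ideas in the abstract, injecting language into vision tokens before a single unified backbone, and aligning the modalities with a contrastive objective, can be sketched in PyTorch. This is a minimal illustration under assumed dimensions and an assumed cross-attention mixing scheme; the module names, embedding sizes, and loss form are not taken from the paper.

```python
# Hedged sketch of "language-injected vision tokens" plus a unified backbone,
# as described in the abstract. All dimensions, module choices, and the mixing
# scheme are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllInOneSketch(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        self.vis_embed = nn.Linear(768, dim)   # assumed patch-feature dim
        self.txt_embed = nn.Linear(300, dim)   # assumed word-embedding dim
        # cross-attention that injects language cues into vision tokens (assumed)
        self.inject = nn.MultiheadAttention(dim, heads, batch_first=True)
        # one unified transformer handles both extraction and interaction,
        # replacing a separately designed fusion module
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)

    def forward(self, patch_feats, word_feats):
        v = self.vis_embed(patch_feats)        # (B, P, dim) vision tokens
        t = self.txt_embed(word_feats)         # (B, W, dim) language tokens
        # vision tokens attend to language tokens -> language-injected tokens
        v_inj, _ = self.inject(v, t, t)
        # concatenate injected vision tokens with language tokens, then feed
        # the joint sequence into the single unified backbone
        tokens = torch.cat([v + v_inj, t], dim=1)
        return self.backbone(tokens)           # (B, P + W, dim)

def contrastive_alignment(v_emb, t_emb, tau=0.07):
    """Symmetric InfoNCE over pooled vision/text embeddings: one plausible
    form of the cross-modal contrastive objective (assumed, not verbatim)."""
    v = F.normalize(v_emb, dim=-1)
    t = F.normalize(t_emb, dim=-1)
    logits = v @ t.T / tau                     # pairwise similarities
    labels = torch.arange(v.size(0))           # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```

The design point the abstract makes is visible here: because the concatenated token sequence passes through one shared `TransformerEncoder`, cross-modal interaction happens in every layer of feature extraction rather than in a bolted-on fusion head.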


