Divert More Attention to Vision-Language Object Tracking

07/19/2023
by Mingzhe Guo, et al.

Multimodal vision-language (VL) learning has noticeably advanced progress toward generic intelligence, owing to emerging large foundation models. However, tracking, a fundamental vision problem, has surprisingly benefited little from the recent flourishing of VL learning. We argue the reasons are two-fold: the lack of large-scale vision-language-annotated videos, and the ineffective vision-language interaction learning of current works. These problems motivate us to design a more effective vision-language representation for tracking, and meanwhile to construct a large database with language annotations for model learning. In particular, we first propose a general attribute annotation strategy to annotate videos in six popular tracking benchmarks, yielding a large-scale vision-language tracking database with more than 23,000 videos. We then introduce a novel framework that improves tracking by learning a unified-adaptive VL representation, whose cores are the proposed asymmetric architecture search and modality mixer (ModaMixer). To further improve the VL representation, we introduce a contrastive loss to align the different modalities. To thoroughly demonstrate the effectiveness of our method, we integrate the proposed framework into three trackers with different designs, i.e., the CNN-based SiamCAR, the Transformer-based OSTrack, and the hybrid-structure TransT. Experiments show that our framework significantly improves all baselines on six benchmarks. Beyond the empirical results, we theoretically analyze our approach to show its rationality. By revealing the potential of VL representation, we hope the community will divert more attention to VL tracking, and we expect to open more possibilities for future tracking with diversified multimodal messages.
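The abstract does not detail the contrastive alignment objective, but such losses are typically symmetric InfoNCE-style terms that pull matched vision/language embeddings together and push mismatched pairs apart. The sketch below is an assumption-laden illustration of that general idea (function name, temperature value, and batch layout are all hypothetical, not taken from the paper):

```python
import numpy as np

def contrastive_alignment_loss(vis, txt, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss (a generic sketch, not the
    paper's exact formulation). vis, txt: (N, D) arrays where row i of each
    is a matched vision/language pair."""
    # L2-normalize so the dot product is cosine similarity
    vis = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = vis @ txt.T / temperature  # (N, N): matched pairs on the diagonal

    def cross_entropy_diag(l):
        # softmax cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # Average vision-to-text and text-to-vision directions
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

Perfectly aligned embeddings drive the loss toward zero, while shuffled (mismatched) pairs raise it, which is the behavior an alignment term needs.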


