VLG: General Video Recognition with Web Textual Knowledge

by Jintao Lin, et al.

Video recognition in an open and dynamic world is challenging because a model must handle different settings such as closed-set, long-tail, few-shot, and open-set recognition. By leveraging semantic knowledge from noisy text descriptions crawled from the Internet, we focus on the general video recognition (GVR) problem of solving different recognition tasks within a unified framework. The core contribution of this paper is twofold. First, we build a comprehensive video recognition benchmark, Kinetics-GVR, which includes four sub-task datasets to cover the settings above. To facilitate research on GVR, we propose to utilize external textual knowledge from the Internet and provide multi-source text descriptions for all action classes. Second, inspired by the flexibility of language representation, we present a unified visual-linguistic framework (VLG) that addresses GVR with an effective two-stage training paradigm. VLG is first pre-trained on video and language datasets to learn a shared feature space; we then devise a flexible bi-modal attention head to incorporate high-level semantic concepts under the different settings. Extensive results show that VLG achieves state-of-the-art performance under all four settings, which demonstrates the effectiveness and generalization ability of the proposed framework. We hope our work is a step towards general video recognition and can serve as a baseline for future research. The code and models will be available at https://github.com/MCG-NJU/VLG.



