VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

05/29/2023
by Sihan Chen, et al.

Vision and text have been fully explored in contemporary video-text foundation models, while other modalities in videos, such as audio and subtitles, have not received sufficient attention. In this paper, we establish connections between the multi-modality video tracks, namely Vision, Audio, and Subtitle, and Text by exploring an automatically generated, large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision captioner and an audio captioner to generate vision and audio captions. We then employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts, into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundation model named VAST, which can perceive and process the vision, audio, and subtitle modalities of a video and better supports various tasks, including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning, and QA). Extensive experiments demonstrate the effectiveness of the proposed VAST-27M corpus and the VAST foundation model: VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks. Code, models, and the dataset will be released at https://github.com/TXH-mercury/VAST.
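The abstract describes a two-stage corpus construction: per-modality captioners first produce vision and audio captions, and an off-the-shelf LLM then merges them with subtitles under an instructional prompt to form an omni-modality caption. The sketch below illustrates only that fusion step; the prompt wording, the function names, and the `llm_generate` interface are assumptions for illustration, not the authors' released pipeline.

```python
# Hypothetical sketch of the caption-fusion step described in the abstract.
# The prompt text and the llm_generate callable are illustrative assumptions;
# VAST-27M's actual captioners, LLM, and prompts may differ.

def build_fusion_prompt(vision_caption: str, audio_caption: str, subtitle: str) -> str:
    """Combine per-modality descriptions into one instruction for an LLM."""
    return (
        "Integrate the following descriptions of one video clip into a single "
        "natural caption covering what is seen, heard, and said.\n"
        f"Vision caption: {vision_caption}\n"
        f"Audio caption: {audio_caption}\n"
        f"Subtitle: {subtitle}\n"
        "Omni-modality caption:"
    )

def fuse_captions(llm_generate, vision_caption, audio_caption, subtitle):
    """llm_generate is any text-in/text-out callable wrapping an off-the-shelf LLM."""
    prompt = build_fusion_prompt(vision_caption, audio_caption, subtitle)
    return llm_generate(prompt).strip()

if __name__ == "__main__":
    # Toy stand-in for an LLM call, used only to make the sketch runnable.
    demo_llm = lambda prompt: (
        "A chef chops vegetables while oil sizzles in a pan and she explains the recipe."
    )
    print(fuse_captions(
        demo_llm,
        vision_caption="a woman chops vegetables in a kitchen",
        audio_caption="oil sizzling in a pan and a knife tapping a cutting board",
        subtitle="first, dice the onions finely",
    ))
```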

