Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks

by   Haiyang Xu, et al.

To promote the development of Vision-Language Pre-training (VLP) and multimodal Large Language Model (LLM) in the Chinese community, we firstly release the largest public Chinese high-quality video-language dataset named Youku-mPLUG, which is collected from Youku, a well-known Chinese video-sharing website, with strict criteria of safety, diversity, and quality. Youku-mPLUG contains 10 million Chinese video-text pairs filtered from 400 million raw videos across a wide range of 45 diverse categories for large-scale pre-training. In addition, to facilitate a comprehensive evaluation of video-language models, we carefully build the largest human-annotated Chinese benchmarks covering three popular video-language tasks of cross-modal retrieval, video captioning, and video category classification. Youku-mPLUG can enable researchers to conduct more in-depth multimodal research and develop better applications in the future. Furthermore, we release popular video-language pre-training models, ALPRO and mPLUG-2, and our proposed modularized decoder-only model mPLUG-video pre-trained on Youku-mPLUG. Experiments show that models pre-trained on Youku-mPLUG gain up to 23.1 improvement in video category classification. Besides, mPLUG-video achieves a new state-of-the-art result on these benchmarks with 80.5 video category classification and 68.9 CIDEr score in video captioning, respectively. Finally, we scale up mPLUG-video based on the frozen Bloomz with only 1.7 impressive instruction and video understanding ability. The zero-shot instruction understanding experiment indicates that pretraining with Youku-mPLUG can enhance the ability to comprehend overall and detailed visual semantics, recognize scene text, and leverage open-domain knowledge.


page 4

page 5

page 11

page 12

page 13


Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework

This paper presents a large-scale Chinese cross-modal dataset for benchm...

Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training

The pre-trained neural models have recently achieved impressive performa...

Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval

Passage retrieval is a fundamental task in information retrieval (IR) re...

American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers

Existing full text datasets of U.S. public domain newspapers do not reco...

Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

We study joint video and language (VL) pre-training to enable cross-moda...

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

This paper introduces InternVid, a large-scale video-centric multimodal ...

Prompting Visual-Language Models for Efficient Video Understanding

Visual-language pre-training has shown great success for learning joint ...

Please sign up or login with your details

Forgot password? Click here to reset