Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning

by Ligong Han, et al.

Most methods for conditional video synthesis use a single modality as the condition. This comes with major limitations. For example, it is problematic for a model conditioned on an image to generate a specific motion trajectory desired by the user, since there is no means to provide motion information. Conversely, language information can describe the desired motion but does not precisely define the content of the video. This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately. We leverage recent progress in quantized representations for videos and apply a bidirectional transformer with multiple modalities as inputs to predict a discrete video representation. To improve video quality and consistency, we propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens. We introduce text augmentation to improve the robustness of the textual representation and the diversity of generated videos. Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images. It can generate sequences much longer than those used for training. In addition, our model can extract visual information as suggested by the text prompt, e.g., "an object in image one is moving northeast", and generate corresponding videos. We run evaluations on three public datasets and a newly collected dataset labeled with facial attributes, achieving state-of-the-art generation results on all four.
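The abstract's "improved mask-prediction algorithm for sampling video tokens" is not spelled out here, but the general iterative mask-prediction idea (as in MaskGIT-style decoding) can be sketched: start from a fully masked token sequence, let a bidirectional model fill in every position, keep only the most confident predictions, and re-mask the rest for the next round. The sketch below is a toy illustration under that assumption; `toy_predictor`, `MASK`, and the confidence schedule are hypothetical stand-ins, not the paper's actual components.

```python
import random

MASK = -1               # sentinel for a masked token (hypothetical)
VOCAB = list(range(8))  # toy codebook of discrete video-token ids

def toy_predictor(tokens):
    """Stand-in for the bidirectional transformer: for each masked
    position, return a (token, confidence) guess. Here it guesses
    randomly; the real model would condition on text/image inputs."""
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def mask_predict_sample(length, steps=4):
    """Iterative mask-prediction: begin fully masked, commit the
    most confident predictions each step, re-mask the remainder."""
    tokens = [MASK] * length
    for step in range(steps, 0, -1):
        guesses = toy_predictor(tokens)
        # commit all guesses for masked positions
        for i, (tok, _) in guesses.items():
            tokens[i] = tok
        # re-mask the least confident fraction for the next round;
        # the fraction shrinks linearly until nothing stays masked
        n_remask = (len(guesses) * (step - 1)) // step
        least_conf = sorted(guesses, key=lambda i: guesses[i][1])[:n_remask]
        for i in least_conf:
            tokens[i] = MASK
    return tokens
```

Because every refinement step sees the whole (partially filled) sequence, this style of decoding produces all tokens in a fixed number of passes rather than one token at a time, which is what makes it attractive for long video-token grids.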
