Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation

06/13/2023
by Shuai Yang, et al.

Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework consists of two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures, and colors. The second part propagates the key frames to the remaining frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost, without re-training or optimization. The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally coherent videos.
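The two-stage pipeline lends itself to a compact illustration. Below is a minimal sketch, not the authors' released implementation: key frames are restyled with an off-the-shelf Stable Diffusion + ControlNet (Canny) pipeline from the diffusers library, and in-between frames are filled by naive linear blending of the nearest translated key frames. The model IDs, Canny thresholds, key-frame stride, and the blending rule are illustrative assumptions; the paper's hierarchical cross-frame constraints and temporal-aware patch matching are replaced here by these simpler stand-ins.

```python
# Minimal sketch of the two-stage idea (assumptions noted in comments).
import torch
import numpy as np
import cv2
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Assumption: public checkpoints stand in for the adapted model in the paper.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

def translate_key_frame(frame: Image.Image, prompt: str) -> Image.Image:
    """Stage 1 (simplified): restyle one key frame under edge guidance.
    The paper additionally applies hierarchical cross-frame constraints
    (coherence in shapes, textures, and colors), omitted here."""
    gray = cv2.cvtColor(np.array(frame), cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, 100, 200)  # thresholds are an assumption
    control = Image.fromarray(np.stack([edges] * 3, axis=-1))
    return pipe(prompt, image=control, num_inference_steps=20).images[0]

def propagate(num_frames: int, key_idx: list[int], key_out: dict) -> list:
    """Stage 2 (simplified): fill in-between frames by linearly blending
    the two surrounding translated key frames. The paper instead uses
    temporal-aware patch matching for local texture consistency."""
    out = [None] * num_frames
    for a, b in zip(key_idx[:-1], key_idx[1:]):
        for t in range(a, b + 1):
            w = (t - a) / max(b - a, 1)
            out[t] = Image.blend(key_out[a], key_out[b], w)
    return out

# Usage sketch: translate every 10th frame, then propagate.
# frames = [...]                          # list of PIL RGB frames
# key_idx = list(range(0, len(frames), 10))
# key_out = {i: translate_key_frame(frames[i], "a cartoon fox") for i in key_idx}
# video = propagate(len(frames), key_idx, key_out)
```

Because only the sparse key frames pass through the diffusion model, the per-frame cost stays low, which is what makes the zero-shot, no-retraining setting practical.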


Related research

08/07/2023 · DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis
In recent years, diffusion models have emerged as the most powerful appr...

11/23/2022 · Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation
Generating a video given the first several static frames is challenging ...

07/16/2020 · World-Consistent Video-to-Video Synthesis
Video-to-video synthesis (vid2vid) aims for converting high-level semant...

05/10/2023 · Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models
The proliferation of video content demands efficient and flexible neural...

05/23/2023 · Large Language Models are Frame-level Directors for Zero-shot Text-to-Video Generation
In the paradigm of AI-generated content (AIGC), there has been increasin...

08/19/2023 · MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance
This study introduces an efficient and effective method, MeDM, that util...

08/15/2023 · Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model
The rising demand for creating lifelike avatars in the digital realm has...
