Large Language Models are Frame-level Directors for Zero-shot Text-to-Video Generation

05/23/2023
by   Susung Hong, et al.

In the paradigm of AI-generated content (AIGC), there has been increasing interest in extending pre-trained text-to-image (T2I) models to text-to-video (T2V) generation. Despite their effectiveness, these frameworks struggle to maintain a consistent narrative and to handle rapid shifts in scene composition or object placement when driven by a single user prompt. This paper introduces a new framework, dubbed DirecT2V, which leverages instruction-tuned large language models (LLMs) as frame-level directors to generate frame-by-frame descriptions from a single abstract user prompt. DirecT2V uses the LLM director to divide the user input into separate prompts for each frame, enabling the inclusion of time-varying content and facilitating consistent video generation. To maintain temporal consistency and prevent object collapse, we propose a novel value mapping method and dual-softmax filtering. Extensive experimental results validate the effectiveness of the DirecT2V framework in producing visually coherent and consistent videos from abstract user prompts, addressing the challenges of zero-shot video generation.
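Since the abstract stays high-level, a minimal sketch of the two main ideas may help: (1) an instruction-tuned LLM acting as a frame-level director that expands one abstract prompt into per-frame prompts, and (2) a generic dual-softmax confidence score of the kind used to filter unreliable matches before cross-frame attention. This is an illustration under assumed interfaces, not the authors' released implementation: the llm callable, the prompt wording, the temperature tau, and how the confidence is thresholded are all placeholders.

import torch

# (1) Frame-level directing: ask an instruction-tuned LLM for one prompt per frame.
#     The llm() callable is a hypothetical placeholder for any chat/completion API.
def direct_frame_prompts(user_prompt: str, num_frames: int, llm) -> list[str]:
    instruction = (
        f"Expand this video idea into {num_frames} frame-by-frame descriptions, "
        f"one per line, keeping characters and scene layout consistent:\n{user_prompt}"
    )
    reply = llm(instruction)
    lines = [line.strip() for line in reply.splitlines() if line.strip()]
    return lines[:num_frames]

# (2) Dual-softmax filtering (generic form): softmax a query-key similarity matrix
#     over both axes and multiply, so only mutually confident pairs keep a high
#     score; low-confidence entries can then be masked out of cross-frame attention.
def dual_softmax_confidence(sim: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    p_qk = torch.softmax(sim / tau, dim=-1)   # each query distributed over keys
    p_kq = torch.softmax(sim / tau, dim=-2)   # each key distributed over queries
    return p_qk * p_kq                        # mutual confidence in [0, 1]

How this confidence is injected into the diffusion model's attention (and the accompanying value mapping) follows the paper itself; the sketch only shows the dual-softmax scoring step.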


