Official implementation of VideoDirectorGPT, a novel framework for consistent multi-scene video generation that uses the knowledge of LLMs for video content planning and grounded video generation.
Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal
Illustration of our two-stage framework for long, multi-scene video generation from text:
- In the first stage, we employ an LLM as a video planner to craft a video plan, which provides an overarching plot for a multi-scene video and guides the downstream video generation process. The video plan consists of scene-level text descriptions, a list of the entities and backgrounds involved in each scene, frame-by-frame entity layouts (bounding boxes), and consistency groupings for entities and backgrounds (see the sketch after this list).
- In the second stage, we use Layout2Vid, a grounded video generation module, to render videos based on the video plan generated in the first stage. This module uses the same image and text embeddings to represent identical entities and backgrounds across scenes of the video plan, and allows spatial control over entity layouts through Guided 2D Attention in the spatial attention block.
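
To make the structure concrete, a video plan matching the description above might look like the following Python dictionary (reusing the mouse example further down this page). The field names and schema here are illustrative assumptions for the sketch, not the planner's actual output format.

```python
# Hypothetical sketch of a video plan for a two-scene prompt.
# Field names and structure are illustrative assumptions, not the exact
# schema produced by the LLM planner in this repository.
video_plan = {
    "scenes": [
        {
            "description": "mouse is holding a book and makes a happy face",
            "entities": ["mouse", "book"],
            "background": "cozy living room",
            # Frame-by-frame entity layouts as normalized (x0, y0, x1, y1) boxes.
            "layouts": [
                {"mouse": (0.10, 0.30, 0.55, 0.95), "book": (0.40, 0.45, 0.70, 0.80)},
                {"mouse": (0.12, 0.30, 0.57, 0.95), "book": (0.42, 0.45, 0.72, 0.80)},
                # ... one layout per generated frame
            ],
        },
        {
            "description": "he looks happy and talks",
            "entities": ["mouse"],
            "background": "cozy living room",
            "layouts": [
                {"mouse": (0.20, 0.25, 0.65, 0.95)},
                # ...
            ],
        },
    ],
    # Consistency groupings: entities/backgrounds sharing a group reuse the same
    # image+text embeddings across scenes so their appearance stays consistent.
    "consistency_groups": {
        "mouse": [0, 1],              # the same mouse appears in scenes 0 and 1
        "cozy living room": [0, 1],   # the same background is reused
    },
}
```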
Our proposed VideoDirectorGPT framework substantially improves layout and movement control.
"pushing stuffed animal from left to right" | "pushing pear from right to left" | |
ModelScopeT2V | ||
VideoDirectorGPT (Ours) |
"a pizza is to the left of an elephant" | "four frisbees" | |
ModelScopeT2V | ||
VideoDirectorGPT (Ours) |
"make caraway cakes" | "make peach melba" | |
ModelScopeT2V | ||
VideoDirectorGPT (Ours) |
Our model generates a detailed video plan that properly expands the original text prompt to show the cooking process, places object bounding boxes accurately (overlaid), and keeps the person consistent across scenes. ModelScopeT2V only generates the final dish (caraway cake/peach melba), and that dish is not consistent between scenes.
- Scene 1: mouse is holding a book and makes a happy face.
- Scene 2: he looks happy and talks.
- Scene 3: he is pulling petals off the flower.
- Scene 4: he is ripping a petal from the flower.
- Scene 5: he is holding a flower by his right paw.
- Scene 6: one paw pulls the last petal off the flower.
- Scene 7: he is smiling and talking while holding a flower on his right paw.
| ModelScopeT2V | VideoDirectorGPT (Ours) |
| --- | --- |
Our video plan's object layouts (overlaid) can guide the Layout2Vid module to generate the same mouse across scenes consistently, whereas ModelScopeT2V loses track of the mouse right after the first scene.
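
As a rough illustration of how layout guidance of this kind can work, the sketch below masks cross-attention between image patches and entity embeddings so that each patch only attends to entities whose bounding box covers it. This is a simplified stand-in, assuming a patch-grid latent and per-entity grounding embeddings; it is not the repository's actual Guided 2D Attention implementation.

```python
import torch

def guided_2d_attention(patch_feats, entity_embs, boxes, grid_h, grid_w):
    """Simplified layout-guided cross-attention (illustrative only).

    patch_feats: (H*W, d) spatial features of one frame's latent grid
    entity_embs: (E, d) grounding embeddings, one per entity
    boxes:       list of E normalized (x0, y0, x1, y1) bounding boxes
    """
    hw, d = patch_feats.shape
    scores = patch_feats @ entity_embs.T / d ** 0.5            # (H*W, E)

    # Build a mask so each patch only attends to entities whose box covers it.
    ys = (torch.arange(grid_h).float() + 0.5) / grid_h
    xs = (torch.arange(grid_w).float() + 0.5) / grid_w
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")              # (H, W) patch centers
    mask = torch.zeros(hw, len(boxes), dtype=torch.bool)
    for e, (x0, y0, x1, y1) in enumerate(boxes):
        inside = (xx >= x0) & (xx <= x1) & (yy >= y0) & (yy <= y1)
        mask[:, e] = inside.reshape(-1)

    scores = scores.masked_fill(~mask, float("-inf"))
    attn = torch.softmax(scores, dim=-1)
    attn = torch.nan_to_num(attn, nan=0.0)                      # rows with no covering box
    guided = attn @ entity_embs                                  # (H*W, d) entity guidance
    guided[~mask.any(dim=1)] = 0.0                               # leave uncovered patches unchanged
    return patch_feats + guided                                  # residual update
```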
Users can flexibly provide either text-only or image+text descriptions to place custom entities when generating videos with VideoDirectorGPT. For both text-based and image+text-based entity grounding, the identities of the provided entities are well preserved across multiple scenes.
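
A minimal sketch of how such per-entity grounding embeddings could be computed is shown below, assuming CLIP text and image encoders from the `transformers` library. The checkpoint choice and the fusion-by-averaging step are assumptions for illustration, not the Layout2Vid module's actual implementation.

```python
from typing import Optional

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative sketch: build a grounding embedding per entity from a text
# description and (optionally) a reference image, so the same embedding can be
# reused in every scene the entity appears in.
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def entity_embedding(text: str, image_path: Optional[str] = None) -> torch.Tensor:
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    text_emb = clip.get_text_features(**inputs)                  # (1, d)
    if image_path is None:
        return text_emb[0]
    image = Image.open(image_path).convert("RGB")
    pixels = processor(images=image, return_tensors="pt")
    image_emb = clip.get_image_features(**pixels)                # (1, d)
    # Simple fusion (an assumption): average the text and image embeddings.
    return (text_emb[0] + image_emb[0]) / 2

# Reuse the same embedding for the mouse in every scene it appears in.
mouse_emb = entity_embedding("a small gray cartoon mouse")
```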
If you find our project useful in your research, please cite the following paper:
@article{Lin2023VideoDirectorGPT,
author = {Han Lin and Abhay Zala and Jaemin Cho and Mohit Bansal},
title = {VideoDirectorGPT: Consistent Multi-Scene Video Generation via LLM-Guided Planning},
year = {2023},
}