The official implementation of "GridShow: Omni Visual Generation".
GRID introduces a novel paradigm that reframes visual generation tasks as grid layout problems. Built upon the FLUX.1 architecture, our framework transforms temporal sequences into grid layouts, enabling image generation models to process visual sequences holistically. This approach achieves strong efficiency and versatility across diverse visual generation tasks.
- Efficient Inference: Up to 35× faster inference than specialized models
- Resource Efficient: Requires less than 1/1000 of the computational resources
- Versatile Applications: Supports Text-to-Video, Image-to-Video, Multi-view Generation, and more
- Preserved Capabilities: Maintains strong image generation performance while expanding functionality
Due to GitHub upload limits, the demo videos below are downscaled from 1024×1024 to 256×256. For the full-resolution versions, please refer to:
vid1 vid2 vid3 vid4 vid5 vid6 vid7
From left to right: the input cat video, followed by the edited fox, tiger, and red panda results.
- Python >= 3.10
- NVIDIA GPU with 24GB+ VRAM
- CUDA 11.6+
- PyTorch >= 1.12
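Before installing, a quick environment sanity check can confirm the GPU and VRAM requirements (a convenience snippet, not part of the repo):

```python
import torch

# Verify that CUDA is visible and the GPU has at least 24 GB of VRAM.
assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB, CUDA runtime: {torch.version.cuda}")
assert vram_gb >= 24, "GRID expects 24GB+ of VRAM"
```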
git clone https://github.com/[username]/GRID.git
cd GRID
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
source/
├── train/
│   ├── sequence1/
│   │   └── frame_{1..n}.jpg   # Sequential frames
│   └── sequence2/
│       └── frame_{1..n}.jpg
└── val/
    └── ...
python tools/concat.py \
    --input_dir source/train \
    --output_dir vidgrid \
    --grid_rows 4 \
    --grid_cols 6 \
    --frames_per_grid 24
Data Structure:
vidgrid/
├── vid1.jpg # 4x6 grid containing 24 frames
└── vid2.jpg # Each .jpg is a complete sequence
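Conceptually, tools/concat.py tiles consecutive frames of a sequence into a single grid image. A minimal sketch of the idea (hypothetical code, not the actual script; the real tool may sample frames and size cells differently):

```python
from pathlib import Path
from PIL import Image

def frames_to_grid(frame_dir, rows=4, cols=6, cell=256):
    """Tile the first rows*cols frames of a sequence into one grid image (hypothetical helper)."""
    frames = sorted(Path(frame_dir).glob("frame_*.jpg"))[: rows * cols]
    grid = Image.new("RGB", (cols * cell, rows * cell))
    for i, path in enumerate(frames):
        img = Image.open(path).convert("RGB").resize((cell, cell))
        grid.paste(img, ((i % cols) * cell, (i // cols) * cell))  # row-major order
    return grid

frames_to_grid("source/train/sequence1").save("vidgrid/vid1.jpg")
```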
mkdir -p models
# Download GLM-4V-9B weights
# Option 1: From ModelScope
wget https://modelscope.cn/models/ZhipuAI/glm-4v-9b/resolve/main/pytorch_model.bin -O models/glm-4v-9b.bin
# Option 2: From Hugging Face
wget https://huggingface.co/THUDM/glm-4v-9b/resolve/main/pytorch_model.bin -O models/glm-4v-9b.bin
# Option 3: From WiseModel
wget https://wisemodel.cn/models/ZhipuAI/GLM-4V-9B/resolve/main/pytorch_model.bin -O models/glm-4v-9b.bin
python tools/caption_glm.py
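tools/caption_glm.py generates one caption per grid image. If you prefer to load GLM-4V-9B directly from the Hugging Face Hub instead of a local weight file, a minimal per-image captioning loop along the lines of the model's standard Transformers usage looks roughly like this (a sketch; the prompt wording and generation settings are assumptions, not the script's actual configuration):

```python
import torch
from pathlib import Path
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4v-9b", torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True, trust_remote_code=True,
).to(device).eval()

prompt = "Describe this image grid as a short video caption."  # assumed prompt
for grid_path in sorted(Path("vidgrid").glob("*.jpg")):
    image = Image.open(grid_path).convert("RGB")
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "image": image, "content": prompt}],
        add_generation_prompt=True, tokenize=True,
        return_tensors="pt", return_dict=True,
    ).to(device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    caption = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    grid_path.with_suffix(".txt").write_text(caption.strip())
```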
Final Training Data Structure:
vidgrid/
├── vid1.jpg # Grid image
├── vid1.txt # Corresponding caption
├── vid2.jpg
└── vid2.txt
GRID uses the FLUX.1 architecture for training. You'll need:
- GPU with minimum 24GB VRAM
- FLUX.1-dev model access and license
Accept the model license at black-forest-labs/FLUX.1-dev, then follow the official setup guide in the black-forest-labs/flux repository for deployment and for downloading the model weights.
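If you just want to confirm that your Hugging Face account has accepted the license and can pull the gated weights, a quick check with diffusers works as well (an alternative loading path for verification only, not the one the training code uses):

```python
import torch
from diffusers import FluxPipeline

# Requires `huggingface-cli login` with an account that has accepted the
# FLUX.1-dev license; downloads the gated weights on first run.
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # helps fit within 24GB of VRAM
image = pipe("a photo of a red panda", height=512, width=512).images[0]
image.save("flux_smoke_test.png")
```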
- Copy example config:
cp config/train_lora_4d.yaml config/your_config.yaml
- Edit configuration parameters
- Start training:
python run.py config/your_config.yaml
Training can be interrupted safely (except during checkpoint saving) and will resume from the last checkpoint.
- Text-to-Video Generation
- Image-to-Video Synthesis
- Multi-view Image Generation
- Video Style Transfer
- Video Editing
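Because the model emits a single grid image per sequence, turning a generated result back into individual frames is the inverse of the data-preparation step. A minimal sketch, again assuming a 4×6 grid in row-major order (hypothetical helper and paths):

```python
from PIL import Image

def grid_to_frames(grid_path, rows=4, cols=6):
    """Split a generated grid image back into its frames (hypothetical helper)."""
    grid = Image.open(grid_path).convert("RGB")
    cell_w, cell_h = grid.width // cols, grid.height // rows
    frames = []
    for r in range(rows):
        for c in range(cols):
            box = (c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h)
            frames.append(grid.crop(box))
    return frames

for i, frame in enumerate(grid_to_frames("output/grid.jpg"), start=1):
    frame.save(f"output/frame_{i:03d}.jpg")
# Frames can then be assembled into a clip, e.g.:
#   ffmpeg -framerate 8 -i output/frame_%03d.jpg output/clip.mp4
```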
- Release the paper
- Release the training codes and demo
- Update the project page
- Release the model weights
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.