SmoothVideo

This repository is the official implementation of Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video Tuning.

Setup

This implementation is based on Tune-A-Video.

Requirements

pip install -r requirements.txt

Installing xformers is highly recommended for better memory efficiency and speed on GPUs. To enable it, set enable_xformers_memory_efficient_attention=True (the default).

Weights

[Stable Diffusion] Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images from any text input. The pre-trained Stable Diffusion models can be downloaded from Hugging Face (e.g., Stable Diffusion v1-5).
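Before training, it can help to confirm that the downloaded weights are where the scripts expect them. The sketch below is not part of this repository: `missing_checkpoint_parts` is a hypothetical helper, and the list of subfolders assumes the usual diffusers-format layout of a Stable Diffusion checkpoint. The path is the one used in the inference example.

```python
from pathlib import Path

# Subfolders normally present in a diffusers-format Stable Diffusion checkpoint.
# Assumption: the checkpoint is loaded through diffusers, which expects this layout.
EXPECTED_SUBFOLDERS = ("unet", "vae", "text_encoder", "tokenizer", "scheduler")


def missing_checkpoint_parts(checkpoint_dir):
    """Return the names of expected entries that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    missing = [name for name in EXPECTED_SUBFOLDERS if not (root / name).is_dir()]
    if not (root / "model_index.json").is_file():
        missing.append("model_index.json")
    return missing


if __name__ == "__main__":
    missing = missing_checkpoint_parts("./checkpoints/stable-diffusion-v1-5")
    print("checkpoint looks complete" if not missing else f"missing: {missing}")
```

An empty list means every expected entry was found; anything else names what still needs to be downloaded.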

Usage

Training

To fine-tune the text-to-image diffusion models for text-to-video generation, run this command for the baseline model:

accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml"

Run this command for the baseline model with the proposed smooth loss:

accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml" --smooth_loss

Run this command for the baseline model with the proposed simple smooth loss:

accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml" --smooth_loss --simple_manner

Note: Tuning a 24-frame video usually takes 300 to 500 steps, or about 10 to 15 minutes on one A100 GPU. Reduce n_sample_frames if your GPU memory is limited.
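The three training variants differ only in their flags, so they can be scripted together, for example to run them back to back. The following is a minimal sketch: `build_train_command` is a hypothetical helper, not something shipped with this repository, and the rule that `--simple_manner` requires `--smooth_loss` is an assumption drawn from how the commands above combine the two flags.

```python
import shlex


def build_train_command(config_path, smooth_loss=False, simple_manner=False):
    """Assemble the accelerate training command as a list of arguments."""
    if simple_manner and not smooth_loss:
        # Assumption: --simple_manner is only shown together with --smooth_loss.
        raise ValueError("simple_manner requires smooth_loss")
    cmd = ["accelerate", "launch", "train_tuneavideo.py", f"--config={config_path}"]
    if smooth_loss:
        cmd.append("--smooth_loss")
    if simple_manner:
        cmd.append("--simple_manner")
    return cmd


if __name__ == "__main__":
    variants = ({}, {"smooth_loss": True}, {"smooth_loss": True, "simple_manner": True})
    for kwargs in variants:
        # Print each command in a form that can be pasted into a shell.
        print(shlex.join(build_train_command("configs/man-skiing.yaml", **kwargs)))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` to launch a run from Python.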

Inference

Once the training is done, run inference:

from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid
import torch

# Base Stable Diffusion weights and the output directory of the fine-tuning run
pretrained_model_path = "./checkpoints/stable-diffusion-v1-5"
my_model_path = "./outputs/man-skiing"

# Load the fine-tuned 3D UNet in half precision and plug it into the pipeline
unet = UNet3DConditionModel.from_pretrained(my_model_path, subfolder='unet', torch_dtype=torch.float16).to('cuda')
pipe = TuneAVideoPipeline.from_pretrained(pretrained_model_path, unet=unet, torch_dtype=torch.float16).to("cuda")
pipe.enable_xformers_memory_efficient_attention()  # requires xformers to be installed
pipe.enable_vae_slicing()  # decode in slices to reduce peak memory

prompt = "spider man is skiing"
# DDIM-inverted latents of the source video, stored under the training output directory
ddim_inv_latent = torch.load(f"{my_model_path}/inv_latents/ddim_latent-500.pt").to(torch.float16)
video = pipe(prompt, latents=ddim_inv_latent, video_length=24, height=512, width=512, num_inference_steps=50, guidance_scale=7.5).videos

# Save the generated frames as a GIF named after the prompt
save_videos_grid(video, f"./{prompt}.gif")
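The inference example hard-codes `ddim_latent-500.pt`, which presumably corresponds to a 500-step run. If training stopped at a different step, the file name will differ. The sketch below picks the inverted latent with the highest step number. `latest_inv_latent` is a hypothetical helper, and the `ddim_latent-<step>.pt` naming scheme is an assumption inferred from that single file name.

```python
import re
from pathlib import Path

# Assumed naming scheme, inferred from the example file "ddim_latent-500.pt".
_LATENT_NAME = re.compile(r"ddim_latent-(\d+)\.pt$")


def latest_inv_latent(model_dir):
    """Return the inverted-latent path with the highest step number, or None."""
    best_step, best_path = -1, None
    for path in (Path(model_dir) / "inv_latents").glob("ddim_latent-*.pt"):
        match = _LATENT_NAME.match(path.name)
        # Skip files whose suffix is not a plain step number.
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

The result can replace the hard-coded path, for example `torch.load(latest_inv_latent(my_model_path))`, after checking that it is not `None`.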

We provide comparisons with different baselines, as follows:

Results

Tune-A-Video

Comparisons to Tune-A-Video.

Input video	Tune-A-Video

Input video	Tune-A-Video + smooth loss

A jeep car is moving on the road	A jeep car is moving on the beach	A jeep car is moving on the snow	A jeep car is moving on the road, cartoon style	A sports car is moving on the road

Input video	Tune-A-Video

Input video	Tune-A-Video + smooth loss

A rabbit is eating a watermelon	A tiger is eating a watermelon	A rabbit is eating an orange	A rabbit is eating a pizza	A puppy is eating an orange

Input video	Tune-A-Video

Input video	Tune-A-Video + smooth loss

A man is skiing	Mickey mouse is skiing on the snow	Spider man is skiing on the beach, cartoon style	Wonder woman, wearing a cowboy hat, is skiing	A man, wearing pink clothes, is skiing at sunset

Make-A-Protagonist

Comparisons to Make-A-Protagonist.

Input video	Make-A-Protagonist	Make-A-Protagonist + smooth loss

A jeep driving down a mountain road	A jeep driving down a mountain road in the rain

A man is playing basketball	A man is playing a basketball on the beach, anime style

A man walking down the street at night	A panda walking down the snowy street

A man walking down the street	Elon musk walking down the street

ControlVideo

Comparisons to ControlVideo.

Input video	Condition	ControlVideo	ControlVideo + smooth loss

A person is dancing	Pose condition	Michael Jackson is dancing

A person is dancing	Pose condition	A person is dancing, Makoto Shinkai style

A building	Canny edge condition	A wooden building, at night

A girl	Hed edge condition	A girl, Krenz Cushart style

A girl	Hed edge condition	A girl with rich makeup

Ink diffuses in water	Depth condition	Gentle green ink diffuses in water, beautiful light

Video2Video-zero

Comparisons to training-free methods.

Input video	Instruct Video2Video-zero	Instruct Video2Video-zero + noise constraint	Video InstructPix2Pix	Video InstructPix2Pix + noise constraint

	Instruct: Make it animation
