Custom Diffusion 360


Custom Diffusion 360 lets you control the viewpoint of a new custom object in images generated by text-to-image diffusion models such as Stable Diffusion. Given a 360-degree multiview dataset (~50 images), we fine-tune FeatureNeRF blocks in the intermediate feature space of the diffusion model to condition generation on a target camera pose.

Customizing Text-to-Image Diffusion with Object Viewpoint Control
Nupur Kumari, Grace Su, Richard Zhang, Taesung Park, Eli Shechtman, Jun-Yan Zhu

Results

All of our results are based on the SDXL model. We customize the model on various categories of multiview images, e.g., car, teddybear, chair, toy, motorcycle. For more generations and comparisons with baselines, please refer to our webpage.

Comparison to baselines

Generations with different target camera pose

Method Details

Given multi-view images of an object and their camera poses, our method customizes a text-to-image diffusion model on that concept, with the target camera pose as an additional condition. We make a subset of transformer layers pose-conditioned by adding a new FeatureNeRF block in the intermediate feature space of each such layer. We fine-tune the new weights on the multiview dataset while keeping the pre-trained model weights frozen. As in previous model customization methods, we add a new modifier token V* in front of the category name, e.g., V* car.
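As a small illustration of the modifier-token convention, the sketch below inserts a token in front of the category name in a prompt. The helper `add_modifier_token` is hypothetical and not part of this repository. It assumes the token is written as `<new1>`, which is how the sample command in this README spells it; that `<new1>` plays the role of V* is an inference from that command.

```python
import re


def add_modifier_token(prompt: str, category: str, token: str = "<new1>") -> str:
    """Insert `token` in front of the first whole-word match of `category`.

    Hypothetical helper for illustration; "<new1>" is assumed to stand for V*.
    """
    pattern = re.compile(rf"\b{re.escape(category)}\b")
    if pattern.search(prompt) is None:
        raise ValueError(f"category {category!r} not found in prompt")
    # A function replacement avoids any escape handling in the token string.
    return pattern.sub(lambda m: f"{token} {m.group(0)}", prompt, count=1)
```

For example, `add_modifier_token("a car beside a field of blooming sunflowers.", "car")` yields the prompt used in the sampling command of this README.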

Getting Started

git clone https://github.com/customdiffusion360/custom-diffusion360.git
cd custom-diffusion360
conda create -n pose python=3.8 
conda activate pose
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt

We also use PyTorch3D in our code. Please follow the installation instructions here, or install it from source with the steps below:

conda install -c conda-forge cudatoolkit-dev -y
export CUDA_HOME="$CONDA_PREFIX/pkgs/cuda-toolkit/"
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"

Download the stable-diffusion-xl model checkpoint:

mkdir pretrained-models
cd pretrained-models
wget https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors
wget https://huggingface.co/stabilityai/sdxl-vae/resolve/main/sdxl_vae.safetensors
cd ..
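Interrupted downloads are easy to miss, so a quick check before sampling or training can help. The sketch below only tests that the two files named in the `wget` commands exist and are non-empty. The function name `missing_checkpoints` is hypothetical, and the check assumes it is run from the repository root; it does not validate file contents.

```python
from pathlib import Path

# File names taken from the wget commands above.
EXPECTED_FILES = ("sd_xl_base_1.0.safetensors", "sdxl_vae.safetensors")


def missing_checkpoints(model_dir="pretrained-models", expected=EXPECTED_FILES):
    """Return the expected files that are absent or empty in `model_dir`."""
    root = Path(model_dir)
    missing = []
    for name in expected:
        path = root / name
        # Treat a zero-byte file as missing: it usually means a failed download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

An empty list from `missing_checkpoints()` means both files are in place.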

Inference with provided models

Download pretrained models:

gdown 1LM3Yc7gYXuNmFwr0s1Z-fnH0Ik8ttY8k -O pretrained-models/car0.tar
tar -xvf pretrained-models/car0.tar -C pretrained-models/

We provide all customized models here.

Sample images:

python sample.py --custom_model_dir pretrained-models/car0 --output_dir outputs --prompt "a <new1> car beside a field of blooming sunflowers." 
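To sample several prompts in one go, the command above can be wrapped in a short script. This is a minimal sketch that uses only the three flags shown above (`--custom_model_dir`, `--output_dir`, `--prompt`). The function names are hypothetical, and the script assumes it is run from the repository root. Each call presumably starts a fresh `sample.py` process, so the model is reloaded for every prompt.

```python
import shlex
import subprocess
import sys


def build_sample_command(prompt, custom_model_dir="pretrained-models/car0",
                         output_dir="outputs"):
    """Build the argument list for one sample.py call (flags as in this README)."""
    return [
        sys.executable, "sample.py",
        "--custom_model_dir", custom_model_dir,
        "--output_dir", output_dir,
        "--prompt", prompt,
    ]


def sample_prompts(prompts, dry_run=True):
    """Print each command (dry run) or execute it; return the commands."""
    commands = [build_sample_command(p) for p in prompts]
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))  # shell-quoted, safe to copy and paste
        else:
            subprocess.run(cmd, check=True)  # stop at the first failure
    return commands
```

Calling `sample_prompts([...])` with the default `dry_run=True` only prints the commands; pass `dry_run=False` to run them.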

Training

Dataset:

We share the 14 concepts (part of CO3Dv2 and NAVI) that we used in our paper for easy experimentation. The datasets are redistributed under the same licenses as the original works.

gdown 1GRnkm4xp89bnYAPnp01UMVlCbmdR7SeG
tar -xvzf data.tar.gz

Train:

python main.py --base configs/train_co3d_concept.yaml --name car0 --resume_from_checkpoint_custom pretrained-models/sd_xl_base_1.0.safetensors --no_date --set_from_main --data_category car --data_single_id 0
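The training command can be parameterized by category and instance id when training more than one concept. The sketch below rebuilds the command above from those two values. It assumes the run name follows the `car0` pattern (category followed by id) and that the same config file applies to other categories, neither of which this README confirms. `build_train_command` is a hypothetical helper.

```python
import shlex


def build_train_command(category, single_id,
                        base_config="configs/train_co3d_concept.yaml",
                        checkpoint="pretrained-models/sd_xl_base_1.0.safetensors"):
    """Build the main.py argument list using only the flags shown in this README."""
    return [
        "python", "main.py",
        "--base", base_config,
        "--name", f"{category}{single_id}",  # assumed naming pattern, e.g. car0
        "--resume_from_checkpoint_custom", checkpoint,
        "--no_date",
        "--set_from_main",
        "--data_category", category,
        "--data_single_id", str(single_id),
    ]


def format_command(args):
    """Return a shell-quoted string for printing or pasting into a terminal."""
    return shlex.join(args)
```

`format_command(build_train_command("car", 0))` reproduces the training command shown above.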

Your own multi-view images + COLMAP: to be released soon.

Evaluation: to be released

Referenced GitHub repos

Thanks to the following for releasing their code. Our code builds upon these.

Stable Diffusion-XL, Relpose-plus-plus, GBT

Bibliography

@inproceedings{kumari2024customdiffusion360,
  title={Customizing Text-to-Image Diffusion with Object Viewpoint Control},
  author={Kumari, Nupur and Su, Grace and Zhang, Richard and Park, Taesung and Shechtman, Eli and Zhu, Jun-Yan},
  booktitle = {SIGGRAPH Asia},
  year      = {2024}
}

Acknowledgments

We are thankful to Kangle Deng, Sheng-Yu Wang, and Gaurav Parmar for their helpful comments and discussion and to Sean Liu, Ruihan Gao, Yufei Ye, and Bharath Raj for proofreading the draft. This work was partly done by Nupur Kumari during an internship at Adobe. The work is partly supported by Adobe Research, the Packard Fellowship, the Amazon Faculty Research Award, and NSF IIS-2239076. Grace Su is supported by the NSF Graduate Research Fellowship (Grant No. DGE2140739).