Phillip Y. Lee*, Taehoon Yoon*, Minhyuk Sung (* equal contribution)
This repository contains the official implementation of GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation.
GrounDiT is a training-free method for spatial grounding in text-to-image generation, using Diffusion Transformers (DiT) to generate precise, controllable images based on user-specified bounding boxes.
More results can be viewed on our project page.
We introduce a novel training-free spatial grounding technique for text-to-image generation using Diffusion Transformers (DiT). Spatial grounding with bounding boxes has gained attention for its simplicity and versatility, allowing for enhanced user control in image generation. However, prior training-free approaches often rely on updating the noisy image during the reverse diffusion process via backpropagation from custom loss functions, which frequently struggle to provide precise control over individual bounding boxes. In this work, we leverage the flexibility of the Transformer architecture, demonstrating that DiT can generate noisy patches corresponding to each bounding box, fully encoding the target object and allowing for fine-grained control over each region. Our approach builds on an intriguing property of DiT, which we refer to as semantic sharing. Due to semantic sharing, when a smaller patch is jointly denoised alongside a generatable-size image, the two become "semantic clones". Each patch is denoised in its own branch of the generation process and then transplanted into the corresponding region of the original noisy image at each timestep, resulting in robust spatial grounding for each bounding box. In our experiments on the HRS and DrawBench benchmarks, we achieve state-of-the-art performance compared to previous training-free spatial grounding approaches.
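To make the transplantation step concrete, below is a minimal, illustrative sketch of how a jointly denoised object patch could be copied back into its bounding-box region of the full noisy latent at each timestep. All names here (`transplant_patches`, `object_patches`, the integer latent-space box coordinates) are hypothetical and simplified; this is not the repository's actual pipeline code.

```python
import torch

def transplant_patches(noisy_latent, object_patches, boxes):
    """Illustrative sketch: copy each jointly denoised object patch into its
    bounding-box region of the full noisy latent at the current timestep.

    noisy_latent:   (C, H, W) latent of the full image
    object_patches: list of (C, h_i, w_i) latents, one per bounding box,
                    each denoised in its own branch alongside the full image
    boxes:          list of (top, left, bottom, right) integer latent coords
    """
    out = noisy_latent.clone()
    for patch, (top, left, bottom, right) in zip(object_patches, boxes):
        # The patch is assumed to have been generated at the size of its box.
        assert patch.shape[-2:] == (bottom - top, right - left)
        out[:, top:bottom, left:right] = patch
    return out
```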
- Download PyTorch with CUDA version 11.8. (Any PyTorch version ≥ 2.0.0 would be fine.)
- Install other dependencies via `pip install -r requirements.txt`
A Jupyter Notebook demo is available at `groundit_demo.ipynb`.

Or you can generate images via the following command:

`python main.py`
| Argument | Description |
|---|---|
| `--save_dir` | Directory where the results will be saved. |
| `--model_version` | Model version to use. Options: `512` or `1024`. |
| `--input_config_path` | Path to the input configuration file. |
| `--gpu_id` | GPU ID to use for inference. Default: `0` |
| `--seed` | Random seed. |
| `--num_inference_steps` | Number of inference steps to perform. Default: `50` |
| `--groundit_gamma` | Apply GrounDiT for the initial γ fraction of the denoising steps. Default: `0.5` |
You can find an example of the input data format in the `config.json` file.

Detailed explanation of the input data format:
```json
{
  "0": {
    "prompt": "a wide view picture of an antique living room with a chair, table, fireplace, and a bed",
    "phrases": ["chair", "table", "fireplace", "bed"],
    "bboxes": [[[0.0, 0.4, 0.15, 1.0]], [[0.25, 0.6, 0.45, 1.0]], [[0.475, 0.1, 0.65, 0.9]], [[0.7, 0.5, 1.0, 1.0]]],
    "height": 288,
    "width": 896
  }
}
```
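As a quick sanity check of a configuration file, a small snippet like the following can verify the constraints documented below: every phrase appears in the prompt, the bounding-box list is aligned with the phrase list, and coordinates are normalized. This helper is not part of the repository; it only restates the documented format.

```python
import json

with open("config.json") as f:
    config = json.load(f)

for key, entry in config.items():
    prompt, phrases, bboxes = entry["prompt"], entry["phrases"], entry["bboxes"]
    # Each phrase must appear verbatim in the prompt.
    assert all(p in prompt for p in phrases), f"entry {key}: phrase not in prompt"
    # One list of boxes per phrase, in the same order as `phrases`.
    assert len(bboxes) == len(phrases), f"entry {key}: bboxes/phrases mismatch"
    # Coordinates are normalized [ul_x, ul_y, lr_x, lr_y] in [0, 1].
    for boxes in bboxes:
        for ul_x, ul_y, lr_x, lr_y in boxes:
            assert 0.0 <= ul_x < lr_x <= 1.0 and 0.0 <= ul_y < lr_y <= 1.0
```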
- `prompt`
  - Type: `str`
  - Description: The input text describing the image to be generated.
  - Example: `"a wide view picture of an antique living room with a chair, table, fireplace, and a bed"`
- `phrases`
  - Type: `list[str]`
  - Description: A list of object descriptions (phrases) that you want to position in the image.
  - IMPORTANT: Each phrase must be present in the `prompt`.
  - Notes:
    - Each phrase can contain multiple words (e.g., brown bear).
  - Example: `["chair", "table", "fireplace", "bed"]`
- `bboxes`
  - Type: `list[list[list[float]]]`
  - Description: A list containing bounding box coordinates for each phrase.
  - IMPORTANT: The order of the bounding box lists must match the order of `phrases`.
  - Notes:
    - Each phrase can have multiple bounding boxes.
    - Bounding boxes follow the format `[ul_x, ul_y, lr_x, lr_y]`, where:
      - `ul_x`: x-coordinate of the upper-left corner (0 to 1).
      - `ul_y`: y-coordinate of the upper-left corner (0 to 1).
      - `lr_x`: x-coordinate of the lower-right corner (0 to 1).
      - `lr_y`: y-coordinate of the lower-right corner (0 to 1).
    - A small sketch for converting these normalized coordinates to pixel coordinates is given after this list.
  - Example:
    ```
    "bboxes": [
      [[0.0, 0.4, 0.15, 1.0]],    // Bounding box for "chair"
      [[0.25, 0.6, 0.45, 1.0]],   // Bounding box for "table"
      [[0.475, 0.1, 0.65, 0.9]],  // Bounding box for "fireplace"
      [[0.7, 0.5, 1.0, 1.0]]      // Bounding box for "bed"
    ]
    ```
- `height` and `width`
  - Type: `int`
  - Description: The dimensions of the generated image in pixels.
  - Notes:
    - Use either `height` and `width` or `aspect_ratio`. At least one should be present.
    - Specify both `height` and `width` for exact resolution.
    - Values that deviate significantly from the generatable resolutions may result in implausible images.
  - Example: `"height": 288, "width": 896`
- `aspect_ratio`
  - Type: `float`
  - Description: The aspect ratio of the image (width / height).
  - Notes:
    - Use either `height` and `width` or `aspect_ratio`. At least one should be present.
    - Recommended range: `[0.25, 4.0]`
    - Extreme values may result in unrealistic images.
- You can consult reasonable resolution values in the `ASPECT_RATIO_512_BIN` or `ASPECT_RATIO_1024_BIN` dictionaries, depending on your specified `model_version`, inside the `/groundit/pipeline_groundit.py` file.
- For the details of generatable resolutions, please check Appendix D in our paper.
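As referenced in the `bboxes` notes above, here is a small, illustrative sketch of how the normalized `[ul_x, ul_y, lr_x, lr_y]` coordinates map to pixel coordinates for a given `height` and `width`. The helper `to_pixel_box` is hypothetical; the actual pipeline may round or snap boxes differently (e.g., to the latent grid).

```python
def to_pixel_box(bbox, height, width):
    """Convert a normalized [ul_x, ul_y, lr_x, lr_y] box to integer pixel
    coordinates (left, top, right, bottom) for an image of size width x height."""
    ul_x, ul_y, lr_x, lr_y = bbox
    return (round(ul_x * width), round(ul_y * height),
            round(lr_x * width), round(lr_y * height))

# Example: the "chair" box from config.json at width=896, height=288.
print(to_pixel_box([0.0, 0.4, 0.15, 1.0], height=288, width=896))
# -> (0, 115, 134, 288)
```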
This code is heavily based on the diffusers library, and the official code for PixArt-α and R&B. We sincerely thank the authors for open-sourcing their code.
```bibtex
@inproceedings{lee2024groundit,
  title={GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation},
  author={Lee, Phillip Y. and Yoon, Taehoon and Sung, Minhyuk},
  booktitle={Advances in Neural Information Processing Systems},
  year={2024}
}
```