Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance
by Dongmin Park¹, Sebin Kim², Taehong Moon¹, Minkyu Kim¹, Kangwook Lee¹,³, Jaewoong Cho¹.
¹ KRAFTON AI, ² Seoul National University, ³ University of Wisconsin-Madison
- Rare-to-frequent (R2F) is a training-free framework that unlocks the compositional generation power of SOTA text-to-image diffusion models (e.g., SDXL, SD3, IterComp, and FLUX) by using SOTA LLMs (e.g., GPT-4o and LLaMA3) to identify rare concepts and to guide generation with frequent concepts throughout the diffusion sampling steps.
- R2F works with arbitrary combinations of diffusion backbones and LLM architectures.
- R2F can also be seamlessly integrated with region-guided diffusion approaches, yielding more controllable image synthesis
- First work to apply cross-attention control on SD3!
- Fast 4-step inference with FLUX-schnell integration!
- While SOTA pre-trained T2I models (e.g., SD3 and FLUX) and LLM-grounded T2I approaches (e.g., RPG) struggle to generate images from prompts with rare compositions of concepts (= attribute + object), R2F exhibits superior composition results.
- This may provide a better image generation experience for creators (e.g., designing a new character with unprecedented attributes).
- More generated images are in the `images/` folder.
- When a target rare distribution (deep blue) is difficult for a model to estimate, the score-interpolated distribution (sky blue), created by interpolating the estimated distribution (red) with a relevant yet frequent distribution (green), is much closer to the actual target.
- In other words, the Wasserstein distance of the score-interpolated distribution (sky blue) to the target (deep blue) is smaller than that of the original estimated distribution (red).
- When we generate a rare composition of two concepts (flower-patterned and animal), SD3's naive inference (red line) tends to be inaccurate as the composition becomes rarer (i.e., for animal classes that rarely appear in the LAION dataset).
- However, when we guide the inference with a relatively frequent composition (flower-patterned bear, which is easily generated as bear doll) at the early sampling steps and then turn back to the original prompt, the generation quality is significantly enhanced (blue line).
Therefore, we can unlock the power of diffusion models on rare concepts (even in the tail distribution)!
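To make the score-interpolation argument above concrete, here is a minimal toy sketch with 1D Gaussians. Everything in it is an assumption for illustration: the means and standard deviations are made up (not taken from the paper), the helper names are hypothetical, and the frequent distribution is deliberately placed so that it pulls the estimate toward the target. It relies on two standard facts: the score of N(mu, sigma^2) is -(x - mu) / sigma^2, so a convex combination of two Gaussian scores is the score of another Gaussian, and the 2-Wasserstein distance between 1D Gaussians has a closed form.

```python
import math


def interpolate_gaussian_scores(mu_a, sigma_a, mu_b, sigma_b, alpha):
    """Return (mean, std) of the Gaussian whose score equals
    alpha * score_a + (1 - alpha) * score_b.

    Each Gaussian score is linear in x, so the combination is the score
    of a Gaussian whose precision is the weighted sum of precisions.
    """
    precision = alpha / sigma_a**2 + (1 - alpha) / sigma_b**2
    mean = (alpha * mu_a / sigma_a**2 + (1 - alpha) * mu_b / sigma_b**2) / precision
    return mean, math.sqrt(1.0 / precision)


def wasserstein2_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form 2-Wasserstein distance between two 1D Gaussians."""
    return math.sqrt((mu_p - mu_q) ** 2 + (sigma_p - sigma_q) ** 2)


# Toy (mean, std) pairs chosen for illustration only.
target = (0.0, 1.0)      # actual rare distribution (deep blue)
estimated = (3.0, 1.0)   # model's inaccurate estimate (red)
frequent = (-1.0, 1.0)   # relevant yet frequent distribution (green)

# Score-interpolated distribution (sky blue) with equal weights.
interpolated = interpolate_gaussian_scores(*estimated, *frequent, alpha=0.5)

dist_estimated = wasserstein2_gaussian(*estimated, *target)
dist_interpolated = wasserstein2_gaussian(*interpolated, *target)
print(f"W2(estimated, target)    = {dist_estimated:.2f}")
print(f"W2(interpolated, target) = {dist_interpolated:.2f}")
```

With these toy numbers the interpolated distribution is N(1, 1), so its distance to the target is 1.00 versus 3.00 for the original estimate. The improvement depends on the frequent distribution being well placed relative to the target; a poorly chosen one could make the estimate worse.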
from R2F_Diffusion_xl import R2FDiffusionXLPipeline
from R2F_Diffusion_sd3 import R2FDiffusion3Pipeline
from R2F_Diffusion_flux import R2FFluxPipeline
from diffusers import DPMSolverMultistepScheduler
from gpt.mllm import GPT4_Rare2Frequent, LLaMA3_Rare2Frequent
import torch

api_key = "YOUR_API_KEY"

# Choose the diffusion backbone: "sd3", "sdxl", "flux", or "itercomp"
model = "itercomp"
if model == "sd3":
    pipe = R2FDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium", revision="refs/pr/26")
elif model == "sdxl":
    pipe = R2FDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)
elif model == "flux":
    # R2F's experiments use FLUX.1-schnell, which requires only 4 sampling steps.
    pipe = R2FFluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
elif model == "itercomp":
    pipe = R2FDiffusionXLPipeline.from_pretrained("comin/IterComp", torch_dtype=torch.float16, use_safetensors=True)
pipe.to("cuda")

# Demo
prompt = "A hairy frog"

# Get the r2f prompt from an LLM: "gpt4o" or "llama3.1"
llm = "gpt4o"
if llm == "gpt4o":
    r2f_prompt = GPT4_Rare2Frequent(prompt, key=api_key)
elif llm == "llama3.1":
    r2f_prompt = LLaMA3_Rare2Frequent(prompt, model_id="meta-llama/Llama-3.1-8B-Instruct")
print(r2f_prompt)

# Generate an image guided by the r2f prompt and save it
image = pipe(
    r2f_prompts=r2f_prompt,
    seed=42,  # random seed
).images[0]
image.save(f"{prompt}_test.png")
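As described earlier, the idea is to guide the early sampling steps with a more frequent composition and then return to the original prompt. A minimal sketch of such a per-step prompt schedule follows. It is only an illustration of that idea, not the implementation inside the R2F pipelines, which may schedule prompts differently. The function `build_prompt_schedule`, its `switch_step` parameter, and the frequent prompt "A hairy animal" are hypothetical placeholders, not output from GPT-4o or LLaMA3.

```python
from typing import List


def build_prompt_schedule(
    rare_prompt: str,
    frequent_prompt: str,
    num_steps: int,
    switch_step: int,
) -> List[str]:
    """Return the prompt to condition on at each sampling step.

    Steps before `switch_step` use the frequent prompt; the remaining
    steps use the original (rare) prompt.
    """
    if not 0 <= switch_step <= num_steps:
        raise ValueError("switch_step must lie between 0 and num_steps")
    return [
        frequent_prompt if step < switch_step else rare_prompt
        for step in range(num_steps)
    ]


# Hypothetical example: 4 sampling steps, switching after the first 2.
schedule = build_prompt_schedule(
    rare_prompt="A hairy frog",
    frequent_prompt="A hairy animal",  # placeholder, not an actual LLM response
    num_steps=4,
    switch_step=2,
)
for step, step_prompt in enumerate(schedule):
    print(step, step_prompt)
```

Setting `switch_step=0` in this sketch reduces to plain sampling with the original prompt, which makes it easy to compare against an unguided baseline.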
### Get r2f_prompts from GPT-4o/LLaMA
cd gpt
bash get_r2f_response.sh
### Generate images
cd ../script/
bash inference_r2f.sh
### Get r2fplus_prompts from GPT-4o/LLaMA
cd gpt
bash get_r2fplus_response.sh
### Generate images
cd ../script/
bash inference_r2fplus.sh
- A new evaluation benchmark consisting of prompts with diverse and rare concepts
- See the `test/original_prompt/rarebench/` folder.
- All the r2f_prompts generated by GPT-4o are in the `test/r2f_prompt/` folder.
git clone
cd Rare-to-Frequent
conda create -n R2F python==3.9
conda activate R2F
pip install -r requirements.txt
@article{park2024rare,
title={Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance},
author={Park, Dongmin and Kim, Sebin and Moon, Taehong and Kim, Minkyu and Lee, Kangwook and Cho, Jaewoong},
journal={arXiv preprint arXiv:2410.22376},
year={2024}
}
Our R2F is a general LLM-grounded T2I generation framework built on several solid works. Thanks to RPG, LMD, SAM, and diffusers for their wonderful work and codebase!