# Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance [[PDF](https://arxiv.org/abs/2410.22376)]

by Dongmin Park<sup>1</sup>, Sebin Kim<sup>2</sup>, Taehong Moon<sup>1</sup>, Minkyu Kim<sup>1</sup>, Kangwook Lee<sup>1,3</sup>, Jaewoong Cho<sup>1</sup>.

<sup>1</sup> KRAFTON AI, <sup>2</sup> Seoul National University, <sup>3</sup> University of Wisconsin-Madison

## 🔎Overview

- Rare-to-Frequent (R2F) is a powerful training-free framework that unlocks the compositional generation power of SOTA text-to-image diffusion models (e.g., SDXL, SD3, IterComp, and FLUX) by leveraging SOTA LLMs (e.g., GPT-4o and LLaMA3) as the rare-concept identifier and frequent-concept guide throughout the diffusion sampling steps (see the sketch below this list).
- R2F works with arbitrary combinations of diffusion backbones and LLM architectures.
- R2F can also be seamlessly integrated with region-guided diffusion approaches, yielding more controllable image synthesis.
  - First work to apply cross-attention control on SD3!
- Fast 4-step inference with FLUX-schnell integration!
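
Conceptually, R2F asks an LLM for a frequent surrogate of the rare composition and follows it during the early sampling steps only. The sketch below illustrates that schedule in plain Python; `prompt_schedule`, the surrogate prompt, and the 40% switch point are hypothetical illustrations, not the repository's API (the real LLM entry points are `GPT4_Rare2Frequent` / `LLaMA3_Rare2Frequent` in `gpt/mllm.py`).

```python
# Conceptual sketch of Rare-to-Frequent (R2F) prompt scheduling.
# All names here are hypothetical illustrations of the idea.

def prompt_schedule(rare_prompt: str, frequent_prompt: str,
                    num_steps: int, switch_frac: float = 0.4) -> list[str]:
    """Return one prompt per sampling step: the LLM-proposed frequent
    surrogate for the early steps, the original rare prompt afterwards."""
    switch_step = int(switch_frac * num_steps)  # assumed switch point
    return [frequent_prompt if t < switch_step else rare_prompt
            for t in range(num_steps)]

# An LLM first identifies the rare composition ("furry" + "frog") and
# proposes a relevant yet frequent surrogate, e.g. a furry cat.
print(prompt_schedule("A furry frog warrior", "A furry cat warrior", num_steps=10))
```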

## 🖼Examples

- While SOTA pre-trained T2I models (e.g., SD3 and FLUX) and LLM-grounded T2I approaches (e.g., RPG) struggle to generate images from prompts with rare compositions of concepts (attribute + object), R2F produces clearly superior compositions.
- This can provide a better image-generation experience for creators (e.g., designing a new character with unprecedented attributes).
- More generated images are in the images/ folder.
[Image grid comparing R2F (Ours), FLUX-schnell, SD3, and RPG on the following prompts:]

- Prompt: A furry frog warrior
- Prompt: A mustachioed squirrel is holding an ax-shaped guitar on a stage
- Prompt: A beautiful wigged octopus is juggling three star-shaped apples
- Prompt: A red dragon and a unicorn made of diamond rollerblading through a neon-lit cityscape

## 💡Why does R2F work?

### 1. Theoretical observation

- When a target rare distribution (deep blue) is difficult for a model to estimate, the score-interpolated distribution (sky blue), created by interpolating the estimated distribution (red) with a relevant yet frequent distribution (green), lies much closer to the actual target.
- In other words, the Wasserstein distance from the score-interpolated distribution (sky blue) to the target (deep blue) is smaller than that from the original estimated distribution (red).
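
As a formula (in our own shorthand, not the paper's exact notation): let $p^{\text{rare}}$ be the true rare target, $p_\theta^{\text{rare}}$ the model's estimate of it, and $p_\theta^{\text{freq}}$ the model's estimate of a relevant, frequent distribution. Interpolating the two scores with weight $\alpha$,

```math
\nabla_x \log \tilde{p}(x) \;=\; \alpha \, \nabla_x \log p_\theta^{\text{freq}}(x) \;+\; (1-\alpha)\, \nabla_x \log p_\theta^{\text{rare}}(x), \qquad \alpha \in [0,1],
```

yields a distribution $\tilde{p}$ that is closer to the target in Wasserstein distance, $W(\tilde{p},\, p^{\text{rare}}) < W(p_\theta^{\text{rare}},\, p^{\text{rare}})$, which is exactly the second bullet above.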

### 2. Empirical observation

- When generating a rare composition of two concepts (flower-patterned + animal), SD3's naive inference (red line) becomes less accurate as the composition gets rarer (i.e., as the animal class appears less often in the LAION dataset).
- However, when we guide the early sampling steps with a relatively frequent composition (flower-patterned bear, which is easily generated as a bear doll) and then switch back to the original prompt, generation quality improves significantly (blue line).

Therefore, we can unlock the power of diffusion models on rare concepts, even in the tail of the distribution!
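
In code, this early-steps guidance is just a conditional in the sampling loop. Below is a minimal sketch under assumed names; `denoise_step`, the embeddings, and the 40% switch point are illustrative, not the actual implementation in the R2F pipeline classes.

```python
# Illustrative R2F-style denoising loop (hypothetical helper names).

def r2f_sample(denoise_step, latents, frequent_emb, rare_emb,
               num_steps: int, switch_frac: float = 0.4):
    """Condition early steps on the frequent prompt embedding, then
    switch back to the rare prompt embedding for the remaining steps."""
    switch_step = int(switch_frac * num_steps)  # assumed switch point
    for t in range(num_steps):
        cond = frequent_emb if t < switch_step else rare_emb
        latents = denoise_step(latents, t, cond)
    return latents
```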

## 🧪How to Run

### 1. Playground

```python
from R2F_Diffusion_xl import R2FDiffusionXLPipeline
from R2F_Diffusion_sd3 import R2FDiffusion3Pipeline
from R2F_Diffusion_flux import R2FFluxPipeline

from diffusers import DPMSolverMultistepScheduler

from gpt.mllm import GPT4_Rare2Frequent, LLaMA3_Rare2Frequent
import torch

api_key = "YOUR_API_KEY"

# Choose a diffusion backbone: "sd3", "sdxl", "flux", or "itercomp".
model = "itercomp"
if model == "sd3":
    pipe = R2FDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium", revision="refs/pr/26")
elif model == "sdxl":
    pipe = R2FDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)
elif model == "flux":
    # In R2F, we experiment on FLUX.1-schnell, which requires only 4 sampling steps.
    pipe = R2FFluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
elif model == "itercomp":
    pipe = R2FDiffusionXLPipeline.from_pretrained("comin/IterComp", torch_dtype=torch.float16, use_safetensors=True)
pipe.to("cuda")

# Demo prompt with a rare concept composition.
prompt = "A hairy frog"

# Get the rare-to-frequent (r2f) prompt sequence from an LLM.
llm = "gpt4o"
if llm == "gpt4o":
    r2f_prompt = GPT4_Rare2Frequent(prompt, key=api_key)
elif llm == "llama3.1":
    r2f_prompt = LLaMA3_Rare2Frequent(prompt, model_id="meta-llama/Llama-3.1-8B-Instruct")
print(r2f_prompt)

# Generate an image with the R2F prompt schedule.
image = pipe(
    r2f_prompts=r2f_prompt,
    seed=42,  # random seed
).images[0]
image.save(f"{prompt}_test.png")
```
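
The pipeline consumes the prompt sequence returned by the LLM helper. The exact output format is defined in `gpt/mllm.py`; the value below is only a hypothetical illustration of the frequent-to-rare ordering, not a recorded model output.

```python
# Hypothetical illustration (not actual LLM output): R2F conditions the
# early steps on a frequent surrogate before returning to the rare prompt.
r2f_prompt_example = [
    "A hairy monkey",  # relevant yet frequent surrogate proposed by the LLM
    "A hairy frog",    # original rare prompt, used for the later steps
]
```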

### 2. Running R2F on Benchmark Datasets

```bash
### Get r2f_prompts from GPT-4o/LLaMA
cd gpt
bash get_r2f_response.sh

### Generate images
cd ../script/
bash inference_r2f.sh
```

### 3. Running R2F+ on Benchmark Datasets

```bash
### Get r2fplus_prompts from GPT-4o/LLaMA
cd gpt
bash get_r2fplus_response.sh

### Generate images
cd ../script/
bash inference_r2fplus.sh
```

## 📊RareBench

## ✔Set Environment

```bash
git clone 
cd Rare-to-Frequent
conda create -n r2f python=3.9
conda activate r2f
pip install -r requirements.txt
```

## 📖Citation

```bibtex
@article{park2024rare,
  title={Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance},
  author={Park, Dongmin and Kim, Sebin and Moon, Taehong and Kim, Minkyu and Lee, Kangwook and Cho, Jaewoong},
  journal={arXiv preprint arXiv:2410.22376},
  year={2024}
}
```

## Acknowledgements

Our R2F is a general LLM-grounded T2I generation framework that builds on several solid prior works. Thanks to RPG, LMD, SAM, and diffusers for their wonderful work and codebases!