Rinon Gal, Adi Haviv, Yuval Alaluf, Amit H. Bermano, Daniel Cohen-Or, Gal Chechik (arXiv:2410.01731, project website)
- Abstract
- 1 Introduction
- 2 Related work
- 3 Method
- 4 Experiments
- 5 Analysis
- 6 Limitations
- 7 Conclusions
Prompt-Adaptive Workflows for Text-to-Image Generation (ComfyGen)
Background:
- Evolution of text-to-image generation from monolithic models to complex workflows
- Expertise required for effective workflow design due to component availability, interdependence, and prompt dependency
Problem Statement:
- Need for automated tailoring of workflows based on user prompts (prompt-adaptive workflow generation)
Approaches Proposed:
- Tuning-based method:
- Learns from user-preference data
- Training-free method:
- Uses an LLM to select among existing flows
Benefits:
- Improved image quality compared to monolithic models or generic workflows
- Complementary research direction in text-to-image generation field
Figure 1:
- Standard text-to-image generation flow: monolithic model transforms prompt into an image (top)
- Proposed approach: LLM synthesizes custom workflows based on user’s prompt (bottom)
- LLM chooses components that better match the prompt for improved quality.
Text-to-Image Generation: Advanced Workflows
- Research has shifted from early monolithic text-to-image models (Rombach et al., 2022; Ramesh et al., 2021) toward complex workflows that combine components to enhance image quality
- Components include fine-tuned generative models, LLMs, LoRAs, improved decoders, and super-resolution blocks
- Which workflow is effective depends on the prompt and the image content
- Nature photographs may call for photorealism-focused models, while workflows for images of humans often include negative prompts or dedicated super-resolution models
- Building well-designed workflows requires expertise
Proposed Approach: Learn to build text-to-image generation workflows conditioned on user prompt using LLMs.
Components:
- Prompt: describes desired image
- LLM: interprets the prompt and matches its content with the most appropriate blocks
- Workflow: tailored to the specific prompt for improved image quality
- ComfyUI (comfyanonymous, 2023): stores workflows as JSON files and provides access to many human-created workflows
- Training set: 500 diverse prompts; images generated with each workflow and scored by aesthetic and human-preference estimators
- Two approaches for matching flows to novel prompts: ComfyGen-IC and ComfyGen-FT
- Comparison against baselines: single-model approaches (SDXL model, fine-tunes, DPO optimized version), prompt-independent popular workflows
- Benefits: outperforms all baselines on human-preference and prompt-alignment benchmarks.
Improving Text-to-Image Generation Quality: Related Work
Fine-tuning Pretrained Models:
- Curated datasets and improved captioning techniques used for fine-tuning (Dai et al., 2023; Betker et al., 2023; Segalis et al., 2023)
- Reward models as an alternative: reinforcement learning or differentiable rewards (Kirstain et al., 2023; Wu et al., 2023b; Xu et al., 2024; Lee et al., 2023; Clark et al., 2024; Prabhudesai et al., 2023; Wallace et al., 2024)
- Exploring diffusion input noise space using reward models (Eyring et al., 2024; Qi et al., 2024)
- Self-guidance or frequency-based feature manipulations for more detailed outputs (Hong et al., 2023; Si et al., 2024; Luo et al., 2024)
Leveraging Large Language Models:
- LLMs show significant improvements in reasoning ability and adaptability through fine-tuning, zero-shot prompting, and in-context learning
- LLM agents have been proposed to equip the model with external tools through API tags, documentation, model descriptions, and code samples (Schick et al., 2024; Wang et al., 2024; Surís et al., 2023; Shen et al., 2024; Gupta & Kembhavi, 2023; Wu et al., 2023a)
- Our work focuses on prompt-adaptive pipeline creation, tapping into this under-explored path to improve the quality of downstream generations.
Pipeline Generation:
- Compound systems with multiple models used for state-of-the-art results across various domains (AlphaCode Team, 2024; Trinh et al., 2024; Nori et al., 2023; Yuan et al., 2024)
- Crafting such compound systems is a daunting task due to careful component selection and parameter tuning (Khattab et al., 2023; Zhuge et al., 2024)
- Our work tackles the task of pipeline generation for text-to-image models, focusing on designing compound pipelines that depend on the user's prompt.
ComfyGen Method
Goal: Match input prompt with appropriate text-to-image workflow for improved visual quality and prompt alignment.
Hypothesis: Effective workflows depend on specific topics in the prompt.
Proposed Approach: Use an LLM to reason over the prompt, identify its topics, and select an existing flow or synthesize a new one.
Description of ComfyUI:
- Open-source software for designing and executing generative pipelines
- Users create pipelines by connecting model blocks
- Simple example pipeline: base model, face restoration block, positive/negative prompt (Figure 2a)
- Complex pipelines include LoRAs, ControlNets, IP-Adapters, etc. (Figure 2b,c)
- Pipelines can be exported to JSON files for automation
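
To make the exported JSON concrete, here is a minimal, hypothetical sketch of a ComfyUI-style workflow (base checkpoint, positive and negative prompts, sampler, VAE decode, save) queued through ComfyUI's local HTTP API. The node ids, checkpoint filename, and parameter values are illustrative assumptions, not flows from the paper:

```python
# Minimal sketch: a ComfyUI workflow in API (JSON) format, submitted to a
# locally running ComfyUI server. Values here are illustrative assumptions.
import json
import urllib.request

workflow = {
    # Node outputs are referenced as [node_id, output_index].
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "sd_xl_base_1.0.safetensors"}},
    "2": {"class_type": "CLIPTextEncode",  # positive prompt
          "inputs": {"text": "a photo of a fox in a misty forest", "clip": ["1", 1]}},
    "3": {"class_type": "CLIPTextEncode",  # negative prompt
          "inputs": {"text": "blurry, low quality", "clip": ["1", 1]}},
    "4": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
    "5": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0], "positive": ["2", 0], "negative": ["3", 0],
                     "latent_image": ["4", 0], "seed": 42, "steps": 30, "cfg": 7.0,
                     "sampler_name": "euler", "scheduler": "normal", "denoise": 1.0}},
    "6": {"class_type": "VAEDecode", "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
    "7": {"class_type": "SaveImage",
          "inputs": {"images": ["6", 0], "filename_prefix": "comfygen"}},
}

# Queue the workflow on a local ComfyUI instance (default port 8188).
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```

Because every workflow reduces to such a JSON object, an LLM can emit or edit one directly, which is what makes the automation described here possible.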
Training Data:
- Collect approximately 500 human-generated ComfyUI workflows from popular websites
- Filter out video and control image generation flows, highly complex flows, and community-written blocks appearing in fewer than three flows
- Augment the data by randomly switching models, LoRAs, and samplers, or changing parameters, yielding 310 distinct workflows
- Collect 500 popular prompts from Civitai.com and synthesize images for each prompt with each flow; score the images with an ensemble of quality prediction models: LAION Aesthetic Score, ImageReward, HPS v2.1, and PickScore
- Standardize and sum the scores into a single scalar per prompt-workflow pair (higher scores correlate with better image quality)
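
A minimal sketch of this aggregation step, assuming "standardize" means per-metric z-scoring across all prompt-workflow pairs (the metric values below are made up):

```python
# Sketch: combine four quality metrics into one scalar per prompt-workflow
# pair by z-scoring each metric, then summing. Data is illustrative.
import numpy as np

METRICS = ["laion_aesthetic", "image_reward", "hps_v2.1", "pickscore"]

def aggregate_scores(raw: np.ndarray) -> np.ndarray:
    """raw: (num_pairs, num_metrics) array, one row per prompt-workflow pair."""
    z = (raw - raw.mean(axis=0)) / raw.std(axis=0)  # standardize each metric
    return z.sum(axis=1)                            # one scalar per pair

# Three hypothetical prompt-workflow pairs scored with the four metrics.
raw = np.array([[5.9,  0.31, 27.2, 21.0],
                [6.3,  0.75, 28.9, 22.4],
                [5.1, -0.40, 25.8, 20.1]])
print(aggregate_scores(raw))  # higher = better image quality
```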
Approach to Providing Prompt-Dependent Flows using LLMs:
- Use an in-context solution that leverages a powerful, closed-source LLM
- First step: provide LLM with list of labels for training prompts (object-categories, scene categories, styles)
- Examples: "People", "Wildlife", "Urban", "Nature", "Anime", "Photo-realistic"
- Calculate average quality score of images produced by each flow across all prompts in a label category
- Repeat for all flows and all labels, creating a table of flows and their performance across categories
- Ideally, the LLM would receive the full JSON description of each flow and learn the relationship between components and downstream performance, but context limits make this impractical
- Instead, ComfyGen-IC treats the LLM as a classifier: it parses a new prompt, breaks it down into the relevant categories, and selects the best-matching flow from the table (see the sketch below)
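
A rough sketch of how the flow-by-category score table could be assembled and handed to the LLM; the labels, scores, and prompt template are illustrative assumptions rather than the paper's exact setup:

```python
# Sketch of the ComfyGen-IC setup: average each flow's score per label
# category, then hand the table to an LLM that classifies new prompts and
# picks the best-matching flow. All names and numbers are illustrative.
import pandas as pd

# One row per scored (flow, label) observation from the training prompts.
df = pd.DataFrame({
    "flow_id": ["flow_01", "flow_01", "flow_02", "flow_02"],
    "label":   ["People",  "Anime",   "People",  "Anime"],
    "score":   [1.8,        0.2,       0.9,       2.1],
})

# Flows x categories: mean quality score of each flow on each label.
table = df.pivot_table(index="flow_id", columns="label",
                       values="score", aggfunc="mean")

system_prompt = (
    "Below is a table of text-to-image workflows and their average quality "
    "score per prompt category:\n"
    f"{table.to_string()}\n"
    "Given a user prompt, identify its relevant categories and reply with "
    "the id of the single best-matching workflow."
)
# `system_prompt` plus the user's prompt would then be sent to a powerful
# closed-source LLM (the implementation details below name Claude 3.5 Sonnet).
```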
Approach to Fine-tuning LLM for Predicting High-Quality Workflows:
- Naive formulation: fine-tune the LLM to predict the best-scoring flow for each prompt
- This has significant drawbacks: it reduces the number of training tokens, is sensitive to random score fluctuations, and provides no negative examples
- Proposed alternative: task the LLM with predicting the flow given both the prompt and its associated score
- Increases available data points for training by utilizing all flows instead of just highest scorers
- Reduces impact of random fluctuations by considering a wider range of scores and their associated flows
- Allows learning from negative examples, helping identify ineffective components or combinations
- Inference: provide the LLM with the prompt and a high target score, and have it predict an effective flow for the given prompt
- This fine-tuned variant is named ComfyGen-FT
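
A sketch of this score-conditioned formulation, assuming a simple "prompt + target score → flow JSON" instruction format; the template and example data are hypothetical, while the 0.725 inference-time target comes from the implementation details below:

```python
# Sketch: every (prompt, flow, score) triple becomes a training example,
# not just the top scorer per prompt. Format and data are illustrative.
import json

def make_example(prompt: str, flow_json: dict, score: float) -> dict:
    """One fine-tuning example: predict the flow from prompt + score."""
    return {
        "input": f"prompt: {prompt}\ntarget score: {score:.3f}",
        "output": json.dumps(flow_json),
    }

# Both high- and low-scoring flows are kept, so the model also sees
# negative examples of ineffective component combinations.
scored_pairs = [
    ("a portrait of an astronaut", {"1": {"class_type": "CheckpointLoaderSimple"}}, 0.710),
    ("a portrait of an astronaut", {"1": {"class_type": "KSampler"}}, -0.450),
]
train = [make_example(p, f, s) for p, f, s in scored_pairs]

# Inference: ask for a flow that should reach a high score.
user_prompt = "a watercolor painting of a lighthouse"
query = f"prompt: {user_prompt}\ntarget score: 0.725"
print(query)
```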
Implementation Details:
- ComfyGen-IC is implemented using Claude 3.5 Sonnet
- ComfyGen-FT is built on top of pre-trained Meta Llama 3.1 models (8B and 70B checkpoints)
- Unless otherwise noted, all results in the paper use the 70B model with a target score of 0.725
- Fine-tune for a single epoch using LoRA rank 16 and learning rate 2e−4
- Additional details provided in supplementary materials.
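
As a rough illustration of the stated recipe, here is a minimal fine-tuning sketch using Hugging Face PEFT and TRL. Only the LoRA rank (16), learning rate (2e-4), and single epoch come from the text; lora_alpha, the target modules, batch size, checkpoint name, and the training stack itself are assumptions:

```python
# Sketch: LoRA fine-tuning with the stated hyperparameters. Everything not
# named in the implementation details above is an assumption.
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Tiny stand-in dataset in the score-conditioned format sketched earlier.
train_dataset = Dataset.from_list([
    {"text": 'prompt: a portrait of an astronaut\ntarget score: 0.710\n'
             '{"1": {"class_type": "CheckpointLoaderSimple"}}'},
])

peft_config = LoraConfig(
    r=16,                                                     # stated: LoRA rank 16
    lora_alpha=32,                                            # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="comfygen-ft",
    num_train_epochs=1,              # stated: single epoch
    learning_rate=2e-4,              # stated
    per_device_train_batch_size=1,   # assumed
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-70B",  # assumed checkpoint name
    train_dataset=train_dataset,
    peft_config=peft_config,
    args=args,
)
trainer.train()
```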
Method Showcase:
- Generates higher quality images across diverse domains and styles
- Prompts available in supplementary material
- Examples shown in Figure 3:
- Subject-focused images
- Photo-realistic imagery
- Artistic or abstract creations
Baseline Comparison:
- Two types of alternative approaches:
- Fixed, monolithic models: a pre-trained diffusion model conditioned directly on the prompt (SDXL, JuggernautXL, DreamshaperXL, DPO-SDXL)
- Generic workflows: the same workflow applied to every prompt, regardless of content (SSD-1B, Pixart-Σ)
- Evaluated on:
- GenEval benchmark: Prompt-alignment tasks like single-object generation, counting, and attribute binding
- User study on CivitAI prompts using human preference scores
GenEval Results:
- The tuning-based model outperforms all baselines, despite being trained only with human-preference scores
- The in-context approach underperforms: GenEval's short, simplistic prompts are difficult to match against the label categories
GenEval scores per category (higher is better):

| Model | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Binding | Overall |
|---|---|---|---|---|---|---|---|
| SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
| JuggernautXL | 1.00 | 0.73 | 0.48 | 0.89 | 0.11 | 0.19 | 0.57 |
| DreamShaperXL | 0.99 | 0.78 | 0.45 | 0.81 | 0.17 | 0.24 | 0.57 |
| DPO-SDXL | 1.00 | 0.81 | 0.44 | 0.90 | 0.15 | 0.23 | 0.59 |
| Fixed Flow - Most Popular | 0.95 | – | 0.38 | 0.77 | 0.06 | 0.12 | 0.42 |
| Fixed Flow - 2nd Most Popular | 1.00 | – | 0.65 | 0.86 | 0.13 | 0.34 | 0.59 |
| ComfyGen-IC (ours) | 0.99 | 0.78 | 0.38 | 0.84 | 0.13 | 0.25 | 0.56 |
| ComfyGen-FT (ours) | 0.99 | 0.82 | 0.50 | 0.90 | 0.13 | 0.29 | 0.61 |
CivitAI Prompts Evaluation:
- ComfyGen-FT outperforms all baseline approaches, even though it was tuned with human-preference scores rather than strictly for prompt alignment
Findings of ComfyGen's Performance Analysis
Three aspects examined:
- Originality and diversity of generated flows
- Human-interpretable patterns
- Effect of using target score in ComfyGen-FT prompts
Originality and diversity
- ComfyGen-FT generates novel flows that are not exact copies of the training corpus, though they remain highly similar to it (similarity of 0.9995, where 1.0 would indicate exact copies)
- Its outputs are more diverse than those of ComfyGen-IC, suggesting room for further gains through additional data or parameter-space search
Analyzing chosen flows
- Patterns appear in the models selected per category: some are intuitive, but others are not clearly interpretable
- Future work may involve explaining reasoning behind component selections
Effect of target scores
- ComfyGen-FT learns to associate target scores with flows of varying quality
- An appropriate choice of target score yields performance comparable to ComfyGen-IC
- Training the model to predict only the best-scoring flow, without score conditioning, diminishes performance, highlighting the importance of the score-conditioned formulation
Comparative Analysis:
- Both ComfyGen-FT models (8B and 70B) perform equally well and significantly outperform the baseline SDXL model and ComfyGen-IC in most evaluations.
Limitations of ComfyGen's Approach
Text-to-Image Workflows:
- Current model is limited to text-to-image workflows
- Cannot address more complex editing or control-based tasks
- Potential resolution: Vision-Language Models (VLMs) could be used in the future
Generation Speed and Scalability:
- Generations take on the order of 15 seconds per image
- Creating the training set (500 prompts, roughly 300 flows) required about a month of GPU time
- Scaling up would likely require significant computational resources or more efficient ways (e.g., Reinforcement Learning) to explore the flow parameter space
Drawbacks of Fine-Tuning Approach:
- Cannot easily generalize to new blocks as they become available
- Requires retraining with new flows that include these blocks
Drawbacks of In-Context Approach:
- Can be easily expanded by including new flows in the score table provided to the LLM, but each added flow increases the number of input tokens, making it more expensive to run
- Eventually, added flows saturate the maximum context length
Future Work:
- More advanced retrieval-based approaches or use of collaborative agents could potentially address these limitations.
Conclusions
Summary:
- Presented ComfyGen, a pair of approaches for prompt-adaptive workflow generation
- Demonstrated that such prompt-dependent flows can outperform monolithic models and fixed, user-created flows in image quality
Future Work:
- Explore more prompt-dependent workflow creation methods
- Increase originality and expand scope to image-to-image or video tasks
- Potential for users to collaborate with the language model on creating such flows, providing feedback through instructions or example outputs
- Enable non-expert users to push the boundary of content creation.