Min3Flow is a 3-stage text-to-image generation framework. Its structure is modeled after dalle-flow, but it forgoes the client-server architecture in favor of modularity and configurability. The underlying packages have all been stripped down and optimized for inference, taking design inspiration from min-dalle.
At a high level, both frameworks do the same thing in a similar way:
- Generate an image from a text prompt using DALL·E-Mega weights
- Refine the output via diffusion with GLID-3-XL
- Upsample the 256x256 output images to 1024x1024 with SwinIR
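In other words, each stage feeds the next: prompt → DALL·E-Mega (a grid of 256x256 candidates) → GLID-3-XL (diffusion refinement at 256x256) → SwinIR (4x upscale to 1024x1024).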
A few thousand feet lower and you'll note that:
- Min3Flow uses min-dalle instead of dalle-mini for text-to-image generation. This means the pipeline is entirely PyTorch based, i.e. no flax dependency.
- The diffusion library, GLID-3-XL, has been heavily refactored and extended. It now functions as a standalone module rather than just a command line script, and it supports additional ldm-finetune weights (see the sketch after this list).
- Similar to the Glid3XL treatment, SwinIR is no longer command line bound. (Kudos to SwinIR_wrapper for the inspiration.)
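Because all three stages are importable on their own, you can in principle drive each piece directly. The snippet below is a sketch only: the module paths and argument names (`Glid3XL`, `SwinIR`, `guidance_scale`, `init_image`, `upscale`) are assumptions for illustration, not the verified Min3Flow API.

```python
# Illustrative sketch -- import paths and argument names are assumed,
# not taken from the verified Min3Flow API.
from PIL import Image
from min3flow.min_glid3xl import Glid3XL  # hypothetical module path
from min3flow.min_swinir import SwinIR    # hypothetical module path

init = Image.open('dalle_output.png')     # e.g. a stage-1 result saved to disk

g3xl = Glid3XL(guidance_scale=5.0, steps=200)  # argument names assumed
refined = g3xl.sample('Dali painting of a glider in infrared', init_image=init)

upscaled = SwinIR().upscale(refined)      # 256x256 -> 1024x1024
```

In practice, the Min3Flow wrapper ties the three stages together: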
```python
from min3flow import Min3Flow

mflw = Min3Flow(global_seed=42)

# Stage 1: text-to-image with DALL-E Mega weights
prompt = 'Dali painting of a glider in infrared'
grid_size = 3  # create a grid of (3,3) images
image = mflw.generate(prompt, grid_size=grid_size)
mflw.show_grid(image)

# Stage 2: diffusion refinement of selected grid cells
grid_idx = [0, 5, 8]  # or pass them all: grid_idx=None
img_diff = mflw.diffuse(prompt, image[grid_idx])
mflw.show_grid(img_diff)

# Stage 3: upsample 256x256 -> 1024x1024
grid_idx = [1, 8, 12, 14]  # or pass them all: grid_idx=None
img_up = mflw.upscale(img_diff[grid_idx])
mflw.show_grid(img_up, plot_index=False)
```
Not a fan of minimalism? Have a look at the Full Configuration in Colab.
```sh
git clone https://github.com/Rypo/min-3-flow.git && cd min-3-flow
conda env create -f environment.yml
```
```sh
pip install matplotlib jupyter notebook
pip install torch torchvision

# Glid3XL requirements
pip install transformers==4.3.1 einops

# CLIP requirements
pip install ftfy regex
pip install git+https://github.com/openai/CLIP.git

# SwinIR requirements
pip install timm

# ldm requirements
pip install pytorch-lightning omegaconf
```
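As an optional sanity check (not part of the original instructions), confirm that the dependencies import cleanly:

```python
# Verify that the core dependencies installed above resolve.
import torch, torchvision
import transformers, einops
import clip      # installed from the OpenAI CLIP repo
import timm
import pytorch_lightning, omegaconf

print('CUDA available:', torch.cuda.is_available())
```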
```sh
git clone https://github.com/Rypo/min-3-flow.git && cd min-3-flow

git clone https://github.com/CompVis/latent-diffusion.git && cd latent-diffusion
pip install -e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers
# install latent-diffusion
pip install -e .

cd ..
# install min3flow
pip install -e .
```
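Optionally, smoke-test the editable install from the repo root:

```sh
python -c "import min3flow; print('min3flow OK')"
```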
You may need to add the following lines to the top of your notebooks/scripts if you get a No module named 'ldm' error.
```python
import sys
# make latent-diffusion and taming-transformers importable
sys.path.append('latent-diffusion')
sys.path.append('latent-diffusion/src/taming-transformers/')
```
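Equivalently (standard Python path handling, though untested here), you can set PYTHONPATH once instead of editing each script:

```sh
# run from the min-3-flow repo root
export PYTHONPATH="$PWD/latent-diffusion:$PWD/latent-diffusion/src/taming-transformers:$PYTHONPATH"
```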
For each prompt, a batch of 16 images was generated with seven different configurations (A-G below). The same global seed (42) was used across all prompts and configurations.
- A, B, C are images generated with Glid3XL alone (no initial image) and correspond to three different diffusion weights (finetune.pt, inpaint.pt, and ongo.pt).
- D images are generated by creating an initial image with MinDalle (dtype=float32, supercondition factor=32) and passing that image, along with the prompt, to Glid3XL (classifier guidance=5.0, steps=200, skip rate=0.5). A sketch of this configuration follows the legend below.
- E, F, G are images generated with MinDalle alone using float16 + supercondition factor 16, float32 + supercondition factor 16, and float32 + supercondition factor 32, respectively.
Image counts per configuration:

| | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| Before upsampling (1+ per prompt) | 7 | 4 | 8 | 25 | 16 | 13 | 24 |
| After upsampling (1 per prompt) | 4 | 3 | 5 | 11 | 11 | 4 | 10 |
A: 'glid3xl-cg5-finetune-200step-0.0skip'
B: 'glid3xl-cg5-inpaint-200step-0.0skip'
C: 'glid3xl-cg5-ongo-200step-0.0skip'
D: 'mindalle-f32-sf32 -> glid3xl-cg5-inpaint-200step-0.5skip'
E: 'mindalle-f16-sf16'
F: 'mindalle-f32-sf16'
G: 'mindalle-f32-sf32'
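For concreteness, configuration D maps onto the API shown earlier roughly as follows. This is a sketch: the keyword names (guidance_scale, steps, skip_rate) are inferred from the legend above and may not match the actual Min3Flow signatures, and the inpaint weight selection is omitted.

```python
# Hypothetical reconstruction of configuration D:
#   mindalle-f32-sf32 -> glid3xl-cg5-inpaint-200step-0.5skip
# Keyword names below are assumptions inferred from the legend, not verified API.
from min3flow import Min3Flow

mflw = Min3Flow(global_seed=42)

prompt = 'Dali painting of a glider in infrared'
image = mflw.generate(prompt, grid_size=4)  # batch of 16 initial images

# classifier guidance 5.0, 200 diffusion steps, 0.5 skip rate
img_diff = mflw.diffuse(prompt, image, guidance_scale=5.0, steps=200, skip_rate=0.5)
```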
- Min-Dalle
  - Add optional dependencies for Extended model support
- Glid3XL
  - Further reduce codebase
  - Clean and optimize guided_diffusion or replace its functionality with existing libraries
  - Reintroduce masking and autoedit capabilities
    - Add support for inpaint weights and LAION variants (ongo, erlich, puck)
    - Clean and add mask generation GUI
    - Clean and add autoedit functionality
  - Allow batch sizes greater than 1 in the CLIP guidance function
  - Allow a direct weight path without requiring a models_root
  - Explain generating images from scratch (i.e. without passing DALL·E output as diffusion input)
- SwinIR
  - Further reduce codebase
  - Test whether non-SR tasks are functional and useful; remove them if not
- General
  - Standardize all generation outputs as tensors; convert to Image.Image in the Min3Flow class
  - Update documentation for the new weight path scheme
  - environment.yml and/or requirements.txt
  - Google Colab notebook demo
  - Python 3.7.3 compatibility
  - Add VRAM usage estimates
How to pronounce min-3-flow?
I'm partial to "min-ee-flow", but "min-three-flow" is fair game. My intention with the l337-style "E" was to sound less like some sort of Minecraft auto-clicker (cf. MineFlow).
Why reinvent the wheel?
- I found the client-server paradigm somewhat limiting in terms of parameter tuning; there are many more knobs to turn than DALL·E Flow exposes. In pursuit of this tunability, I ended up adding more functionality than existed in any of the base packages alone, so it became more than just the sum of its parts.
- I couldn't get DocArray to install on my machine. So, why spend an hour debugging when you can spend a month building your own!