A multistage text-to-image framework. Built from an inference-reduced set of min-dalle, glid-3-xl, and SwinIR.

Rypo/min-3-flow

Min3Flow

Min3Flow is a 3-stage text-to-image generation framework. Its structure is modeled after dalle-flow, but it forgoes the client-server architecture in favor of modularity and configurability. The underlying packages have all been stripped down and optimized for inference, taking design inspiration from min-dalle.

Min3Flow vs DALL·E Flow

At a high level, both packages do the same thing in a similar way.

  1. Generate an image from a text prompt using DALL·E-Mega weights
  2. Diffusion refinement with GLID-3-XL
  3. Upsample the 256x256 output images to 1024x1024 with SwinIR

A few thousand feet lower and you'll note that:

  1. Min3Flow uses min-dalle instead of dalle-mini for text-to-image generation. This means the pipeline is entirely PyTorch based, i.e. no flax dependency.
  2. The diffusion library, GLID-3-XL, has been heavily refactored and extended. It now functions as a standalone module, not just a command-line script, and supports additional ldm-finetune weights.
  3. Like Glid3XL, SwinIR is no longer bound to the command line. (Kudos to SwinIR_wrapper for the inspiration.)

Basic Usage

1. Generate an initial set of images from a text prompt

from min3flow import Min3Flow
mflw = Min3Flow(global_seed=42)

prompt = 'Dali painting of a glider in infrared'
grid_size = 3 # Create a grid of (3,3) images
image = mflw.generate(prompt, grid_size=grid_size)
mflw.show_grid(image)

[Generate] Dali painting of a glider in infrared

2. Select and refine your favorite(s) with diffusion

grid_idx = [0,5,8] #  or pass them all: grid_idx=None
img_diff = mflw.diffuse(prompt, image[grid_idx])
mflw.show_grid(img_diff)

[Diffuse] Dali painting of a glider in infrared

3. Select and upsample the images to 1024x1024

grid_idx = [1,8,12,14] #  or pass them all: grid_idx=None
img_up = mflw.upscale(img_diff[grid_idx])
mflw.show_grid(img_up, plot_index=False)

[Upsample] Dali painting of a glider in infrared
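
The three steps above can be chained in a single helper. The following is only a sketch: `run_three_stages` and its `diffuse_idx`/`upscale_idx` parameters are hypothetical names, not part of min3flow, and it assumes each stage's output accepts a list of indices exactly as `image[grid_idx]` does in the snippets above.

```python
def run_three_stages(mflw, prompt, grid_size=3, diffuse_idx=None, upscale_idx=None):
    """Chain generate -> diffuse -> upscale on a Min3Flow-like object.

    `mflw` is any object exposing the generate/diffuse/upscale methods used in
    the snippets above. An index of None passes every image to the next stage.
    """
    def pick(images, idx):
        # Assumption: stage outputs support indexing with a list of ints,
        # mirroring `image[grid_idx]` in the usage example.
        return images if idx is None else images[idx]

    image = mflw.generate(prompt, grid_size=grid_size)          # stage 1
    img_diff = mflw.diffuse(prompt, pick(image, diffuse_idx))   # stage 2
    img_up = mflw.upscale(pick(img_diff, upscale_idx))          # stage 3
    # Return every stage so intermediate grids can still be inspected.
    return image, img_diff, img_up
```

With the objects from the example above, this would be called as `run_three_stages(mflw, prompt, grid_size=3, diffuse_idx=[0,5,8])`. Returning all three stages keeps the interactive "look, then pick" workflow possible, since each result can still be passed to `mflw.show_grid`.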

Not a fan of minimalism? Have a look at the Full Configuration in colab

Install

Conda/Mamba

git clone https://github.com/Rypo/min-3-flow.git && cd min-3-flow
conda env create -f environment.yml

Pip

pip install matplotlib jupyter notebook
pip install torch torchvision

# (Glid3XL requirements)
pip install transformers==4.3.1 einops

# CLIP requirements
pip install ftfy regex
pip install git+https://github.com/openai/CLIP.git
# SwinIR requirements
pip install timm
# ldm requirements
pip install pytorch-lightning omegaconf 

git clone https://github.com/Rypo/min-3-flow.git && cd min-3-flow
git clone https://github.com/CompVis/latent-diffusion.git && cd latent-diffusion

pip install -e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers

# install latent-diffusion
pip install -e .

cd ..
# install min3flow
pip install -e . 

If you get a No module named 'ldm' error, you may need to add the following lines to the top of your notebooks/scripts.

import sys
sys.path.append('latent-diffusion')
sys.path.append('latent-diffusion/src/taming-transformers/')
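
A slightly more defensive variant of the workaround above only touches `sys.path` when the module really cannot be found. This is a sketch, not part of min3flow: `ensure_importable` is a hypothetical helper, the relative paths are resolved against the current working directory, and `taming` is assumed to be the import name of taming-transformers.

```python
import importlib.util
import sys
from pathlib import Path

def ensure_importable(module_name, candidate_dirs):
    """Append candidate_dirs to sys.path only if module_name is not importable.

    Returns the list of directories that were actually added.
    """
    if importlib.util.find_spec(module_name) is not None:
        return []  # already importable, leave sys.path alone
    added = []
    for directory in candidate_dirs:
        path = str(Path(directory).resolve())
        if path not in sys.path:
            sys.path.append(path)
            added.append(path)
    return added

# Same directories as the workaround above, relative to the min-3-flow checkout.
ensure_importable('ldm', ['latent-diffusion'])
ensure_importable('taming', ['latent-diffusion/src/taming-transformers'])
```

Because the paths are relative, this still has to run from the directory that contains the `latent-diffusion` clone.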

Results Gallery

  • a realistic photo of a muddy dog
  • A scientist comparing apples and oranges, by Norman Rockwell
  • an oil painting portrait of the regal Burger King posing with a Whopper
  • Eternal clock powered by a human cranium, artstation
  • another planet amazing landscape
  • The Decline and Fall of the Roman Empire board game kickstarter
  • A raccoon astronaut with the cosmos reflecting on the glass of his helmet dreaming of the stars, digital art
  • A photograph of an apple that is a disco ball, 85 mm lens, studio lighting
  • a cubism painting Donald trump happy cyberpunk
  • oil painting of a hamster drinking tea outside
  • Colossus of Rhodes by Max Ernst
  • landscape with great castle in middle of forest
  • an medieval oil painting of Kanye west feels satisfied while playing chess in the style of Expressionism
  • An oil pastel painting of an annoyed cat in a spaceship
  • dinosaurs at the brink of a nuclear disaster
  • fantasy landscape with medieval city
  • GPU chip in the form of an avocado, digital art
  • a giant rubber duck in the ocean
  • Paddington bear as austrian emperor in antique black & white photograph
  • a rainy night with a superhero perched above a city, in the style of a comic book
  • A synthwave style sunset above the reflecting water of the sea, digital art
  • an oil painting of ocean beach front in the style of Titian
  • an oil painting of Klingon general in the style of Rubens
  • city, top view, cyberpunk, digital realistic art
  • an oil painting of a medieval cyborg automaton made of magic parts and old steampunk mechanics
  • a watercolour painting of a top view of a pirate ship sailing on the clouds
  • a knight made of beautiful flowers and fruits by Rachel ruysch in the style of Syd brak
  • a 3D render of a rainbow colored hot air balloon flying above a reflective lake
  • a teddy bear on a skateboard in Times Square
  • cozy bedroom at night
  • an oil painting of monkey using computer
  • the diagram of a search machine invented by Leonardo da Vinci
  • A stained glass window of toucans in outer space
  • a campfire in the woods at night with the milky-way galaxy in the sky
  • Bionic killer robot made of AI scarab beetles
  • The Hanging Gardens of Babylon in the middle of a city, in the style of Dalí
  • painting oil of Izhevsk
  • a hyper realistic photo of a marshmallow office chair
  • fantasy landscape with city
  • ocean beach front view in Van Gogh style
  • An oil painting of a family reunited inside of an airport, digital art
  • antique photo of a knight riding a T-Rex
  • a top view of a pirate ship sailing on the clouds
  • an oil painting of a humanoid robot playing chess in the style of Matisse
  • a cubism painting of a cat dressed as French emperor Napoleon
  • a husky dog wearing a hat with sunglasses
  • A mystical castle appears between the clouds in the style of Vincent di Fate
  • golden gucci airpods realistic photo

🍒Picking Procedure

For each prompt, a batch of 16 images was generated with each of 7 different configurations (A-G below). The same global seed (42) was used across all prompts and configurations.

  • A,B,C are images generated with Glid3XL alone (no initial image) and correspond to 3 different diffusion weights (finetune.pt, inpaint.pt, and ongo.pt).

  • D is images generated by creating an initial image with MinDalle(dtype=float32, supercondition factor=32) and passing that image, along with the prompt, to Glid3XL(classifier guidance=5.0, steps=200, skip rate=0.5).

  • E,F,G are images generated with MinDalle alone, using float16 + supercondition factor 16, float32 + supercondition factor 16, and float32 + supercondition factor 32, respectively.

Before upsampling (1+ per prompt) 
[(A, 7), (B, 4), (C, 8), (D, 25), (E, 16), (F, 13), (G, 24)]

After upsampling (1 per prompt) 
[(A, 4), (B, 3), (C, 5), (D, 11), (E, 11), (F, 4), (G, 10)]

A: 'glid3xl-cg5-finetune-200step-0.0skip'
B: 'glid3xl-cg5-inpaint-200step-0.0skip'
C: 'glid3xl-cg5-ongo-200step-0.0skip'
D: 'mindalle-f32-sf32 -> glid3xl-cg5-inpaint-200step-0.5skip'
E: 'mindalle-f16-sf16'
F: 'mindalle-f32-sf16'
G: 'mindalle-f32-sf32'
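
To make the pick counts above easier to compare, they can be converted into totals and per-configuration shares. This is just an illustrative tally of the numbers listed above; the `shares` helper is a hypothetical name introduced for the example.

```python
# Pick counts copied from the two lists above.
before = [("A", 7), ("B", 4), ("C", 8), ("D", 25), ("E", 16), ("F", 13), ("G", 24)]
after = [("A", 4), ("B", 3), ("C", 5), ("D", 11), ("E", 11), ("F", 4), ("G", 10)]

def shares(counts):
    """Return (total picks, {config: percentage of total, rounded to 0.1})."""
    total = sum(n for _, n in counts)
    return total, {name: round(100 * n / total, 1) for name, n in counts}

total_before, share_before = shares(before)  # total_before == 97
total_after, share_after = shares(after)     # total_after == 48

for name, _ in after:
    print(f"{name}: {share_before[name]:5.1f}% before, {share_after[name]:5.1f}% after")
```

The "after" total of 48 matches the one-pick-per-prompt rule for the 48 prompts in the gallery.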

TODO

  • Min-Dalle
    • Add optional dependencies for Extended model support
  • Glid3XL
    • Further reduce codebase
      • Clean and optimize guided_diffusion or replace functionality with existing libraries
    • Reintroduce masking and autoedit capabilities
      • Add support for inpaint weight and LAION variants (ongo, erlich, puck)
      • Clean and add mask generation GUI
      • Clean and add autoedit functionality
    • Allow batch sizes greater than 1 in clip guidance function
    • Allow direct weight path without requiring a models_roots
    • Explain generating images from scratch (i.e. not to pass dalle output as diffusion input)
  • SwinIR
    • Test if non-SR tasks are functional and useful, if not remove
  • General
    • Standardize all generation outputs as tensors, convert to Image.Image in Min3Flow class
    • Update documentation for new weight path scheme
    • environment.yml and/or requirements.txt
    • Google Colab notebook demo
      • python 3.7.3 compatibility
    • Add VRAM usage estimates

Q/A

How to pronounce min-3-flow? I'm partial to "min-ee-flow" but "min-three-flow" is fair game.

My intention with the l337 style "E" was to sound less like some sort of Minecraft auto clicker (cf. MineFlow).

Why reinvent the wheel?
  1. I found the client-server paradigm to be somewhat limiting in terms of parameter tuning. There are a lot more knobs that can be tuned than are allowed in DALL·E Flow. In pursuit of this tunability, I ended up adding more functionality than existed in any of the base packages alone, so it became more than just the sum of its parts.

  2. I couldn't get DocArray to install on my machine. So, why spend an hour debugging when you can spend a month building your own!
