# Stable Diffusion Pipeline
This is probably the best end-to-end semi-technical article:
And a detailed look at the diffusion process:
But this is a short look at the pipeline:
1. Encoder / Conditioning
Text (via a tokenizer) or image (via a vision model) mapped to a semantic embedding
(e.g. the CLIP text encoder)
2. Sampler
Provides the initial random noise latent (the starting point) and the update rule that gradually maps it to content
(e.g. k_lms)
3. Diffuser
Predicts the noise in the current latent, conditioned on the semantic embedding
(e.g. the actual Stable Diffusion checkpoint, which holds the denoiser's weights)
4. Autoencoder
Maps between latent and pixel space (actually creates images from latents)
(in Stable Diffusion this is a VAE trained with perceptual and adversarial GAN-style losses, rather than a standalone image-database-trained GAN)
5. Denoising
Each step removes a slice of the predicted noise, turning the random latent into a meaningful one
Basically, blends the diffuser's noise prediction with the sampler's update rule; this happens in latent space, before the autoencoder decodes anything
(e.g. the U-Net, the denoising network inside the checkpoint)
6. Loop and repeat
Steps 3 and 5 repeat for a fixed number of steps, with cross-attention injecting the conditioning into each denoising pass; the autoencoder decode (step 4) runs once at the end
7. Run additional models as needed
- Upscale (e.g. ESRGAN)
- Restore faces (e.g. GFPGAN or CodeFormer)
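The control flow of steps 1-6 can be sketched in miniature. This is a toy sketch only: the real text encoder, U-Net, and VAE are large neural networks, and every function below is a hypothetical stand-in; just the loop structure mirrors the pipeline.

```python
import numpy as np

def text_encoder(prompt):
    # Stand-in for the CLIP text encoder: a deterministic pseudo-embedding
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(8)

def unet(latent, t, cond):
    # Stand-in for the denoising U-Net: "predicts" noise from latent + conditioning
    return 0.1 * latent + 0.01 * cond.mean()

def vae_decode(latent):
    # Stand-in for the VAE decoder: latent -> "pixels"
    return np.tanh(latent)

def sample(prompt, steps=20, seed=0):
    cond = text_encoder(prompt)                  # 1. conditioning
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal((4, 8, 8))      # 2. initial random noise latent
    for t in range(steps, 0, -1):                # 6. loop and repeat
        eps = unet(latent, t, cond)              # 3. noise prediction
        latent = latent - (1.0 / steps) * eps    # 5. denoising update (sampler's rule)
    return vae_decode(latent)                    # 4. decode latent to pixel space

img = sample("a cat")
```

Note how the decode runs once after the loop, not inside it: all the iterative work happens on the small latent, which is what makes latent diffusion cheap.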
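Step 6 mentions cross-attention: inside the U-Net it is the mechanism that lets each image-latent position attend to the text embedding. A minimal numpy sketch of scaled dot-product cross-attention (the real blocks also apply learned Q/K/V projection matrices and multiple heads, omitted here):

```python
import numpy as np

def cross_attention(latent_tokens, text_tokens):
    # Queries come from the image latents; keys and values come from the
    # text embeddings. Learned projections are omitted in this sketch.
    Q, K, V = latent_tokens, text_tokens, text_tokens
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # (n_latent, n_text)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over text tokens
    return weights @ V                                # text info blended per latent token
```

Each output row is a weighted mix of text tokens, which is how the prompt steers every denoising pass.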