PyTorch Implementation of "Denoising Diffusion Probabilistic Models", Ho et al., 2020
This repo is yet another denoising diffusion probabilistic model (DDPM) implementation, and it tries to stick to the original paper as closely as possible.
The straightforward UNet model definition (without any fancy model builders, helpers, etc.) is intentional: behind all the abstraction layers and blocks it can be difficult to get at the original model architecture and see the underlying entities clearly. However, some kind of automated model generation from configuration files is handy while experimenting, so it will be added in the near future.
Some equations are borrowed from this blog post, which demystifies the math behind the diffusion process.
The diffusion process is implemented in a class called DDPMPipeline, which contains both the forward and the backward diffusion processes.
The forward diffusion process applies Gaussian noise to the input image according to a variance schedule.
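The scheduled noising above can be sketched with the closed-form expression $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ from the paper. This is a minimal illustration, not the repo's actual code: the linear beta schedule (1e-4 to 0.02 over 1000 steps) follows the paper, while the function name `forward_diffusion` is hypothetical.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear schedule from the paper
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t = prod_{s<=t} alpha_s

def forward_diffusion(x0, t):
    """Sample x_t ~ q(x_t | x_0) for a batch of integer timesteps t (shape [B])."""
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)  # broadcast over C, H, W
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return xt, eps

x0 = torch.randn(4, 3, 128, 128)           # stand-in for a batch of images
t = torch.randint(0, T, (4,))
xt, eps = forward_diffusion(x0, t)
```

Note that because $\bar{\alpha}_t$ is a cumulative product of numbers below one, the signal fraction shrinks monotonically toward pure noise as `t` grows.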
The backward diffusion process "denoises" an image using model predictions. It is worth mentioning that in this process the UNet model predicts a noise residual, and the final "denoised" image is obtained by applying the following equation:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z$$

Here, $\epsilon_\theta(x_t, t)$ is the noise predicted by the UNet, $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, $z \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (with $z = 0$ at the last step), and $\sigma_t$ is the sampling noise scale.
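A single backward step can be sketched as below. This is an illustration of the sampling equation rather than the repo's code; `reverse_step` and the dummy model are hypothetical names, and $\sigma_t^2 = \beta_t$ is one of the two variance choices discussed in the paper.

```python
import torch

@torch.no_grad()
def reverse_step(eps_model, xt, t, betas, alphas, alphas_bar):
    """One DDPM sampling step x_t -> x_{t-1}; t is a Python int."""
    eps = eps_model(xt, t)
    coef = (1.0 - alphas[t]) / (1.0 - alphas_bar[t]).sqrt()
    mean = (xt - coef * eps) / alphas[t].sqrt()
    if t > 0:
        sigma = betas[t].sqrt()            # sigma_t^2 = beta_t choice
        return mean + sigma * torch.randn_like(xt)
    return mean                            # no noise is added at the last step

# usage with a dummy model standing in for the UNet
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)
dummy_model = lambda x, t: torch.zeros_like(x)
xt = torch.randn(1, 3, 8, 8)
x_prev = reverse_step(dummy_model, xt, T - 1, betas, alphas, alphas_bar)
```

Running this step from `t = T - 1` down to `t = 0` on pure Gaussian noise yields the full sampling loop (Algorithm 2 in the paper).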
As stated in the original paper:
- Our neural network architecture follows the backbone of PixelCNN++, which is a U-Net based on a Wide ResNet.
- We replaced weight normalization with group normalization to make the implementation simpler.
- Our 32×32 models use four feature map resolutions (32×32 to 4×4), and our 256×256 models use six.
- All models have two convolutional residual blocks per resolution level and self-attention blocks at the 16×16 resolution between the convolutional blocks.
- Diffusion time is specified by adding the Transformer sinusoidal position embedding into each residual block.
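The sinusoidal position embedding mentioned in the last bullet can be sketched as follows. The exact frequency constant (10000, as in the Transformer paper) is the usual choice, but implementations differ slightly; the function name here is illustrative.

```python
import math
import torch

def timestep_embedding(t, dim):
    """Transformer-style sinusoidal embedding of diffusion timesteps.

    t: integer tensor of shape [B]; returns a float tensor of shape [B, dim]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]   # [B, half]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

emb = timestep_embedding(torch.arange(4), 128)   # shape [4, 128]
```

In the residual blocks this embedding is typically passed through a small MLP and added to the block's feature maps, which is how each block learns which diffusion step it is operating at.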
This implementation follows the default ResNet block architecture without any channel multiplying factors, for simplicity. The current UNet implementation also works better at 128×128 resolution (see the next sections) and thus has 5 feature map resolutions (128 → 64 → 32 → 16 → 8). It is worth noting that subsequent papers suggest more appropriate and better UNet architectures for the diffusion problem.
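The 5 resolution levels come from halving the spatial size at each downsampling stage, which a short helper (hypothetical, for illustration only) makes explicit:

```python
def resolution_levels(image_size, num_levels):
    """Feature map sizes obtained by halving at each downsampling stage."""
    return [image_size // (2 ** i) for i in range(num_levels)]

levels = resolution_levels(128, 5)   # [128, 64, 32, 16, 8]
```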
Training was performed on two datasets:
- smithsonian-butterflies-subset by HuggingFace
- croupier-mtg-dataset by alcazar90
All 128×128 models were trained for 300 epochs with cosine annealing, an initial learning rate of 2e-4, a batch size of 6, and 1000 diffusion timesteps.
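The schedule above can be set up as in this sketch. The optimizer is an assumption (the text does not name one; Adam is the common DDPM choice), and the one-layer model is a stand-in for the UNet:

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the UNet
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # Adam is an assumption
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    # ... one epoch of noise-prediction MSE training would go here ...
    optimizer.step()
    scheduler.step()                           # anneal once per epoch
```

With `T_max=300` and the default `eta_min=0`, the learning rate follows a half cosine from 2e-4 down to 0 over the 300 epochs.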
300 epochs, 50266 steps
| Epoch 4 | Epoch 99 |
|---|---|
| Epoch 204 | Epoch 300 |
| Sampling from epoch 300 | Sampling from epoch 300 |
300 epochs, 72599 steps
| Epoch 4 | Epoch 99 |
|---|---|
| Epoch 204 | Epoch 300 |
| Sampling from epoch 300 | Sampling from epoch 300 |
All 256×256 models were trained for 300 epochs with cosine annealing, an initial learning rate of 2e-5, a batch size of 6, and 1000 diffusion timesteps.
300 epochs, 50266 steps
| Epoch 4 | Epoch 100 |
|---|---|
| Epoch 205 | Epoch 300 |
[TODO description]