PyTorch Implementation of "Denoising Diffusion Probabilistic Models", Ho et al., 2020
This repo is yet another denoising diffusion probabilistic model (DDPM) implementation, and it tries to stick to the original paper as closely as possible.
The straightforward UNet model definition (without any fancy model builders, helpers, etc.) is intentional: behind all the abstraction layers and blocks it can be difficult to get at the original model architecture and see the underlying entities clearly. However, some kind of automated model generation from configuration files is handy while experimenting, so it will be added in the near future.
Some equations are borrowed from this blog post, which demystifies the math behind the diffusion process.
The diffusion process is implemented in a class called DDPMPipeline, which contains both the forward and the backward diffusion processes.
The forward diffusion process applies Gaussian noise to the input image according to a variance schedule.
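The scheduled noising above can be sketched with the closed-form expression $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ from the paper. This is a minimal illustration, not the repo's actual code: the linear beta schedule (1e-4 to 0.02 over 1000 steps) follows the paper, while the function name `forward_diffusion` is hypothetical.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear schedule from the paper
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t = prod_{s<=t} alpha_s

def forward_diffusion(x0, t):
    """Sample x_t ~ q(x_t | x_0) for a batch of integer timesteps t (shape [B])."""
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)  # broadcast over C, H, W
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return xt, eps

x0 = torch.randn(4, 3, 128, 128)           # stand-in for a batch of images
t = torch.randint(0, T, (4,))
xt, eps = forward_diffusion(x0, t)
```

Note that because $\bar{\alpha}_t$ is a cumulative product of numbers below one, the signal fraction shrinks monotonically toward pure noise as `t` grows.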
The backward diffusion process "denoises" an image using model predictions. It is worth mentioning that in this process the UNet model predicts a noise residual, and the final "denoised" image is obtained by applying the following equation:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z$$

Here, $\epsilon_\theta(x_t, t)$ is the noise predicted by the UNet, $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, $z \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (with $z = 0$ at the last step), and $\sigma_t$ is the sampling noise scale.
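A single backward step can be sketched as below. This is an illustration of the sampling equation rather than the repo's code; `reverse_step` and the dummy model are hypothetical names, and $\sigma_t^2 = \beta_t$ is one of the two variance choices discussed in the paper.

```python
import torch

@torch.no_grad()
def reverse_step(eps_model, xt, t, betas, alphas, alphas_bar):
    """One DDPM sampling step x_t -> x_{t-1}; t is a Python int."""
    eps = eps_model(xt, t)
    coef = (1.0 - alphas[t]) / (1.0 - alphas_bar[t]).sqrt()
    mean = (xt - coef * eps) / alphas[t].sqrt()
    if t > 0:
        sigma = betas[t].sqrt()            # sigma_t^2 = beta_t choice
        return mean + sigma * torch.randn_like(xt)
    return mean                            # no noise is added at the last step

# usage with a dummy model standing in for the UNet
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)
dummy_model = lambda x, t: torch.zeros_like(x)
xt = torch.randn(1, 3, 8, 8)
x_prev = reverse_step(dummy_model, xt, T - 1, betas, alphas, alphas_bar)
```

Running this step from `t = T - 1` down to `t = 0` on pure Gaussian noise yields the full sampling loop (Algorithm 2 in the paper).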
As stated in the original paper:
- Our neural network architecture follows the backbone of PixelCNN++, which is a U-Net based on a Wide ResNet.
- We replaced weight normalization with group normalization to make the implementation simpler.
- Our 32×32 models use four feature map resolutions (32×32 to 4×4), and our 256×256 models use six.
- All models have two convolutional residual blocks per resolution level and self-attention blocks at the 16×16 resolution between the convolutional blocks.
- Diffusion time is specified by adding the Transformer sinusoidal position embedding into each residual block.
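The sinusoidal position embedding mentioned in the last bullet can be sketched as follows. The exact frequency constant (10000, as in the Transformer paper) is the usual choice, but implementations differ slightly; the function name here is illustrative.

```python
import math
import torch

def timestep_embedding(t, dim):
    """Transformer-style sinusoidal embedding of diffusion timesteps.

    t: integer tensor of shape [B]; returns a float tensor of shape [B, dim]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]   # [B, half]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

emb = timestep_embedding(torch.arange(4), 128)   # shape [4, 128]
```

In the residual blocks this embedding is typically passed through a small MLP and added to the block's feature maps, which is how each block learns which diffusion step it is operating at.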
This implementation follows the default ResNet block architecture without any channel multiplying factors, for simplicity. The current UNet implementation also works better at 128×128 resolution (see the next sections) and thus has 5 feature map resolutions (128 → 64 → 32 → 16 → 8). It is worth noting that subsequent papers suggest more appropriate and better UNet architectures for the diffusion problem.
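The 5 resolution levels come from halving the spatial size at each downsampling stage, which a short helper (hypothetical, for illustration only) makes explicit:

```python
def resolution_levels(image_size, num_levels):
    """Feature map sizes obtained by halving at each downsampling stage."""
    return [image_size // (2 ** i) for i in range(num_levels)]

levels = resolution_levels(128, 5)   # [128, 64, 32, 16, 8]
```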
Training was performed on two datasets:
- smithsonian-butterflies-subset by HuggingFace
- croupier-mtg-dataset by alcazar90
All 128×128 models were trained for 300 epochs with cosine annealing, an initial learning rate of 2e-4, a batch size of 6, and 1000 diffusion timesteps.
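The schedule above can be set up as in this sketch. The optimizer is an assumption (the text does not name one; Adam is the common DDPM choice), and the one-layer model is a stand-in for the UNet:

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the UNet
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # Adam is an assumption
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    # ... one epoch of noise-prediction MSE training would go here ...
    optimizer.step()
    scheduler.step()                           # anneal once per epoch
```

With `T_max=300` and the default `eta_min=0`, the learning rate follows a half cosine from 2e-4 down to 0 over the 300 epochs.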
300 epochs, 50266 steps
| Epoch 4 | Epoch 99 |
|---|---|
| Epoch 204 | Epoch 300 |
| Sampling from epoch 300 | Sampling from epoch 300 |
300 epochs, 72599 steps
| Epoch 4 | Epoch 99 |
|---|---|
| Epoch 204 | Epoch 300 |
| Sampling from epoch 300 | Sampling from epoch 300 |
All 256×256 models were trained for 300 epochs with cosine annealing, an initial learning rate of 2e-5, a batch size of 6, and 1000 diffusion timesteps.
300 epochs, 50266 steps
| Epoch 4 | Epoch 100 |
|---|---|
| Epoch 205 | Epoch 300 |
[TODO description]