Welcome to an exciting journey into the world of Variational Autoencoders (VAEs)! In this project, we dive deep into the MNIST dataset to understand and visualize the magic of VAEs with different latent sizes.
Whether you're curious about how deep learning models generate new data or want to explore the connection between latent dimensions and reconstruction performance, this repository is for you! Here's what you'll find:
- Reconstruction Insights: Compare how well different latent spaces can recreate MNIST digits.
- Latent Space Visualization: Peek into the hidden space where numbers come to life!
- Performance Metrics: Track and analyze training and testing results like a pro.
Variational Autoencoders (VAEs) extend traditional Autoencoders (AEs) by introducing a probabilistic framework to the latent space. This enhancement provides better generalization, continuity, and the ability to generate new data samples.
Feature | Standard Autoencoder (AE) | Variational Autoencoder (VAE) |
---|---|---|
Mapping Type | Deterministic: Encodes data to a fixed latent representation. | Probabilistic: Encodes data into a latent distribution (mean and variance). |
Sampling | No Sampling: Reconstructs outputs directly from deterministic encodings. | Supports Sampling: Enables generation of new data points by sampling from the latent space. |
Latent Space Regularization | None: Focuses on reconstruction accuracy only. | Uses KL Divergence to enforce smoothness and continuity in the latent space. |
Generative Capabilities | Limited: Cannot generate new data samples. | Powerful: Can generate diverse and realistic data samples. |
Focus | Solely on reconstruction of input data. | Balances reconstruction and latent space organization for generative tasks. |
Imagine a model that doesn't just memorize input data but creates smooth, continuous representations that can generate new samples: this is what VAEs do! By introducing a probabilistic twist to the traditional autoencoder, VAEs bring us:
- Creativity: Generate new, realistic-looking data points.
- Continuity: Smooth latent spaces mean similar inputs map to nearby latent points.
- Regularization: A structured latent space ensures generalization and interpretability.
Get ready to explore, experiment, and learn with VAEs. Let's unlock the mysteries of latent spaces together! Curious? Let's jump into the details!
This notebook is pre-configured for easy execution on Google Colab, requiring no extra setup. All you need is:
- A Google Account.
- A working internet connection.
Simply click the Open in Colab badge above and start experimenting right away! Colab will automatically install all required libraries and prepare the environment for you.
- **Maximizing Data Likelihood**

  The primary goal of a Variational Autoencoder is to maximize the likelihood of the observed data p(x). This is expressed as:

  $$p_\theta(x) = \int p_\theta(x \mid z)\, p_\theta(z)\, dz$$

  However, solving this integral is intractable because integrating over all possible z is computationally expensive.
- **Bayes' Rule Approximation**

  To address this, Bayes' rule is applied:

  $$p_\theta(x) = \frac{p_\theta(x \mid z)\, p_\theta(z)}{p_\theta(z \mid x)}$$

  But there is a new problem: computing p_theta(z|x) is still challenging because it requires knowledge of the posterior, which is also intractable.
- **Neural Network as an Estimator**

  To approximate p_theta(z|x), we use a neural network q_phi(z|x) to act as the posterior. This is referred to as the variational posterior and makes the computation feasible. Now, instead of directly computing the likelihood p(x), the focus shifts to maximizing a lower bound called the Evidence Lower Bound (ELBO).
- **Decomposing the ELBO**

  Using the new approximation, the logarithm of p(x) can be rewritten as:

  $$\log p_\theta(x) = \underbrace{\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big)}_{\text{ELBO}} + D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$$

  Here:

  - ELBO: Evidence Lower Bound, which we aim to maximize during training.
  - D_KL: Kullback-Leibler divergence between q_phi(z|x) and the true posterior p_theta(z|x).

  Since D_KL >= 0, maximizing the ELBO brings us closer to the true log-likelihood p(x).
- **KL Divergence Loss**

  The Kullback-Leibler (KL) divergence measures the difference between the learned latent distribution q_phi(z|x) (produced by the encoder) and the prior distribution p(z) (usually a standard Gaussian N(0, 1)). For a Gaussian posterior with mean mu and variance sigma^2, it has the closed form:

  $$D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) = -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

  This is a statistical measure to ensure the generated latent space distribution aligns closely with the desired prior distribution.

  - Latent Space Regularization: Ensures that the latent space is smooth, continuous, and well-organized, making it easier to sample meaningful latent vectors.
  - Avoiding Overfitting: Without the KL term, the latent space may overfit the training data, losing generalization to new, unseen inputs.
  - Generative Capabilities: A structured latent space ensures that new samples generated from the prior distribution resemble the training data.

  By enforcing this regularization, the KL divergence encourages the model to learn a meaningful and generative latent space.
- **Final Loss Function**

  Combining these, the VAE objective is to maximize the lower bound:

  $$\mathcal{L}(\theta, \phi) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

  The first term encourages accurate reconstruction of the input data, and the second term regularizes the latent space to align with a standard Gaussian prior (see the code sketch below).
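For concreteness, here is a minimal PyTorch sketch of how these two terms are commonly combined into a training loss (the negative ELBO). The names `vae_loss`, `x_hat`, `mu`, `log_var`, and `kl_weight` are illustrative assumptions rather than identifiers from this repository, and plain MSE stands in for the reconstruction log-likelihood:

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, log_var, kl_weight=1.0):
    """Negative ELBO to minimize: reconstruction term + KL(q_phi(z|x) || N(0, I))."""
    # Reconstruction term: MSE summed over pixels stands in for the
    # log-likelihood; a Bernoulli decoder would use binary cross-entropy instead.
    recon = F.mse_loss(x_hat, x, reduction="sum")

    # Closed-form KL divergence between N(mu, sigma^2) and N(0, 1),
    # summed over latent dimensions and batch elements.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

    return recon + kl_weight * kl


# Toy usage with dummy tensors (latent size 16, MNIST-sized images)
x = torch.rand(8, 1, 28, 28)
x_hat = torch.rand(8, 1, 28, 28)
mu, log_var = torch.zeros(8, 16), torch.zeros(8, 16)
print(vae_loss(x_hat, x, mu, log_var))
```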
The mathematical explanations and formula illustrations in this section were adapted from Justin Johnson's EECS 498-007: Deep Learning for Computer Vision. Credit goes to the original author for the insightful material and visualizations.
The Variational Autoencoder (VAE) implemented in this project features a U-Net-inspired design with DownBlocks, MidBlocks, and UpBlocks, enhanced by self-attention and cross-attention mechanisms for precise feature extraction and reconstruction.
Variational Autoencoder model architecture image taken from this page.
The encoder compresses input images into a latent distribution characterized by:
- Mean (μ) and Variance (σ²), which define the latent space.
- A reparameterization trick for differentiable sampling.
This ensures smooth and continuous latent representations, critical for generation and generalization.
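As an illustration of the encoder output and the reparameterization trick, here is a minimal PyTorch sketch. The fully connected layers, sizes, and names (`GaussianEncoder`, `reparameterize`) are simplifying assumptions, not the U-Net-style encoder actually used in this project:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Toy fully connected encoder: maps an image to (mu, log_var)."""

    def __init__(self, in_dim=28 * 28, latent_dim=16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_log_var = nn.Linear(256, latent_dim)

    def forward(self, x):
        h = self.backbone(x)
        return self.to_mu(h), self.to_log_var(h)


def reparameterize(mu, log_var):
    """z = mu + sigma * eps keeps sampling differentiable w.r.t. mu and log_var."""
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + eps * std


encoder = GaussianEncoder(latent_dim=16)
mu, log_var = encoder(torch.rand(8, 1, 28, 28))
z = reparameterize(mu, log_var)   # shape: (8, 16)
```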
The decoder reconstructs images from the latent vector z, using:
- Upsampling layers to restore resolution.
- Skip connections to retain fine-grained details.
- Self-attention to maintain global coherence.
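Below is a deliberately simplified decoder sketch that only illustrates the upsampling path; the skip connections and attention blocks described above are omitted, and all layer sizes and names are assumptions rather than the project's actual configuration:

```python
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    """Simplified decoder: latent vector -> 28x28 image via transposed convolutions."""

    def __init__(self, latent_dim=16):
        super().__init__()
        self.project = nn.Linear(latent_dim, 64 * 7 * 7)
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),  # 7x7 -> 14x14
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),   # 14x14 -> 28x28
            nn.Sigmoid(),
        )

    def forward(self, z):
        h = self.project(z).view(-1, 64, 7, 7)
        return self.upsample(h)


decoder = ToyDecoder(latent_dim=16)
x_hat = decoder(torch.randn(8, 16))   # shape: (8, 1, 28, 28)
```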
To address the common issue of blurry outputs with pixel-wise losses (e.g., L2), this implementation integrates adversarial feedback and perceptual metrics:
- **PatchGAN Discriminator**

  A discriminator evaluates image patches, encouraging the model to produce sharper, more realistic textures.

- **Perceptual Loss (VGG16)**

  Inspired by Zhang et al. (2018), "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric," perceptual loss compares features extracted by a pre-trained VGG16 model. This improves semantic coherence and enhances visual sharpness by prioritizing high-level details over pixel-level accuracy.
These methods ensure reconstructions are both visually realistic and semantically meaningful.
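As a rough illustration of the perceptual-loss idea (not the exact LPIPS-style loss of Zhang et al. used here), a sketch comparing intermediate VGG16 feature maps might look like this; the class name, layer cutoff, and channel handling are assumptions:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

class VGGPerceptualLoss(torch.nn.Module):
    """Compares feature maps from early VGG16 layers instead of raw pixels."""

    def __init__(self, layer_index=16):
        super().__init__()
        # Frozen, pre-trained feature extractor (ImageNet normalization omitted for brevity).
        self.features = vgg16(weights=VGG16_Weights.DEFAULT).features[:layer_index].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)

    def forward(self, x_hat, x):
        # VGG16 expects 3-channel input, so grayscale MNIST images are repeated.
        if x.shape[1] == 1:
            x, x_hat = x.repeat(1, 3, 1, 1), x_hat.repeat(1, 3, 1, 1)
        return F.mse_loss(self.features(x_hat), self.features(x))


loss_fn = VGGPerceptualLoss()
loss = loss_fn(torch.rand(4, 1, 28, 28), torch.rand(4, 1, 28, 28))
```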
This project is inspired by ExplainingAI-Code/VAE-Pytorch and incorporates ideas from the work of Zhang et al. (2018):
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., & Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric.
Below are the latent space visualizations for different latent sizes. The GIFs illustrate how the latent space evolves during training.
Latent Size | Latent 2 | Latent 4 | Latent 16 |
---|---|---|---|
Latent Space Visualization | ![]() | ![]() | ![]() |
Sample reconstructions of test images at various latent sizes:
Latent Size | Reconstruction GIF |
---|---|
2 | ![]() |
4 | ![]() |
16 | ![]() |
Latent Size | MSE | SSIM | PSNR (dB) |
---|---|---|---|
16 | 0.0013 | 0.9919 | 35.5396 |
4 | 0.0025 | 0.9847 | 32.8133 |
2 | 0.0040 | 0.9803 | 30.5386 |
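For reference, these metrics could be computed per image pair roughly as follows; this is a sketch using scikit-image, and the repository's own evaluation code may differ:

```python
import numpy as np
from skimage.metrics import mean_squared_error, peak_signal_noise_ratio, structural_similarity

def evaluate_pair(original, reconstruction):
    """original, reconstruction: float arrays in [0, 1] with shape (H, W)."""
    mse = mean_squared_error(original, reconstruction)
    psnr = peak_signal_noise_ratio(original, reconstruction, data_range=1.0)
    ssim = structural_similarity(original, reconstruction, data_range=1.0)
    return mse, ssim, psnr


# Toy usage with a noisy copy of a random "image"
rng = np.random.default_rng(0)
x = rng.random((28, 28))
x_noisy = np.clip(x + 0.05 * rng.standard_normal((28, 28)), 0.0, 1.0)
print(evaluate_pair(x, x_noisy))
```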
If you found this project exciting or helpful, please consider starring it on GitHub!
Your support helps inspire more innovative projects and keeps the momentum going.