The Core Problem
Traditional autoencoders compress an image into a single deterministic point in a low-dimensional “latent space.”

A Probabilistic Approach
Instead of mapping an input to a single point, a VAE maps it to a probability distribution (specifically a Gaussian).

- The Encoder: Predicts the parameters of that distribution: the mean ($\mu$) and the variance ($\sigma^2$).
- The Latent Space: By representing data as overlapping “clouds” rather than points, the space becomes continuous.
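As a minimal sketch of the idea above (NumPy, with hypothetical layer sizes), the encoder ends in two linear heads, one predicting the mean and one the log-variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8-dim encoder features, 2-dim latent space.
FEAT_DIM, LATENT_DIM = 8, 2

# Two linear heads on top of shared encoder features.
# Predicting log-variance keeps the implied variance positive.
W_mu = rng.normal(size=(FEAT_DIM, LATENT_DIM))
W_logvar = rng.normal(size=(FEAT_DIM, LATENT_DIM))

def encode(h):
    """Map encoder features h to Gaussian parameters (mu, log sigma^2)."""
    return h @ W_mu, h @ W_logvar

h = rng.normal(size=(4, FEAT_DIM))   # a batch of 4 feature vectors
mu, logvar = encode(h)
print(mu.shape, logvar.shape)        # (4, 2) (4, 2)
```

Each input thus gets its own “cloud” $\mathcal{N}(\mu, \sigma^2)$ rather than a single coordinate.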
The Objective Function: ELBO
To train a VAE, we maximize the Evidence Lower Bound (ELBO). This objective balances reconstruction accuracy against latent-space organization.

1. Reconstruction Loss ($\mathcal{L}_{\text{recon}}$)
This term measures how well the decoded image matches the original. Under a Gaussian assumption, it is typically implemented as Mean Squared Error (MSE):

$$\mathcal{L}_{\text{recon}} = \| x - \hat{x} \|^2$$

2. KL Divergence ($D_{\text{KL}}$)
This term forces the predicted distribution $\mathcal{N}(\mu, \sigma^2)$ to be as close as possible to a standard normal prior $\mathcal{N}(0, 1)$. For a univariate Gaussian, the closed-form solution is:

$$D_{\text{KL}} = -\tfrac{1}{2}\left(1 + \log \sigma^2 - \mu^2 - \sigma^2\right)$$

The Tug-of-War: $\mathcal{L}_{\text{recon}}$ wants to separate data points to ensure accuracy (scattering), while $D_{\text{KL}}$ wants to pull every distribution toward the center (overlapping). This tension creates a smooth, navigable latent space.
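The two ELBO terms can be sketched numerically; here is a minimal NumPy version (function names are illustrative, with the KL term summed over latent dimensions):

```python
import numpy as np

def recon_loss(x, x_hat):
    """Reconstruction term as mean squared error."""
    return np.mean((x - x_hat) ** 2)

def kl_divergence(mu, logvar):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dims."""
    return -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))

# Sanity check: if the encoder already predicts the prior
# (mu = 0, log sigma^2 = 0), the KL penalty vanishes.
print(kl_divergence(np.zeros(2), np.zeros(2)))
```

The total loss minimized in training is then `recon_loss + kl_divergence` (often with a weighting factor on the KL term).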
The Reparameterization Trick
In standard backpropagation, gradients cannot flow through a random sampling operation ($z \sim \mathcal{N}(\mu, \sigma^2)$). To solve this, we move the randomness into an external variable $\epsilon \sim \mathcal{N}(0, 1)$.

Mathematical Deduction

We define the latent vector as a deterministic function:

$$z = \mu + \sigma \odot \epsilon$$

By treating $\epsilon$ as a constant during the backward pass, we can calculate gradients for $\mu$ and $\sigma$ directly: $\partial z / \partial \mu = 1$ and $\partial z / \partial \sigma = \epsilon$.

Capabilities & Trade-offs
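A minimal NumPy sketch of the trick (names are illustrative; a real VAE would do this inside an autodiff framework so gradients flow through $\mu$ and $\sigma$ automatically):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1).

    All randomness lives in eps, so z is a deterministic
    function of (mu, sigma): dz/dmu = 1, dz/dsigma = eps.
    """
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * logvar) * eps

mu, logvar = np.array([1.0, -1.0]), np.array([0.0, 0.0])
z = reparameterize(mu, logvar)
print(z.shape)  # (2,)
```

Averaged over many draws, the samples recover the predicted mean, which is what makes the trick an unbiased way to sample during training.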
Smooth Interpolation
You can “walk” between two latent vectors to seamlessly blend features
(e.g., changing a smile to a frown).
Data Generation
Generate entirely new samples by drawing random vectors from the standard
normal prior.
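Both capabilities reduce to simple operations on latent vectors. In this hedged sketch, `decode` is an identity stand-in for a trained decoder network:

```python
import numpy as np

rng = np.random.default_rng(0)

def decode(z):
    # Placeholder for a trained decoder; identity keeps the sketch runnable.
    return z

# Smooth interpolation: walk the line between two latent codes.
z_a, z_b = rng.standard_normal(2), rng.standard_normal(2)
blends = [decode((1 - t) * z_a + t * z_b) for t in np.linspace(0, 1, 5)]

# Data generation: draw a fresh vector from the standard normal prior.
z_new = rng.standard_normal(2)
sample = decode(z_new)
```

Because the KL term keeps the latent space dense around the origin, points along the interpolation path and fresh prior samples both decode to plausible outputs.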
Limitations
- Blurriness: VAEs tend to produce softer images than GANs, because the MSE reconstruction loss encourages the model to “average” its predictions when uncertain.
- Fidelity: While foundational for models like Stable Diffusion, vanilla VAEs struggle to produce high-resolution, sharp details without advanced modifications such as VQ-VAE.
Resources
- Original Paper: Auto-Encoding Variational Bayes (Kingma & Welling, 2013)
- Concepts: ELBO, Reparameterization Trick, Latent Variables.