In an earlier post, I gave an intuitive (but informal) explanation of VAEs. In this post, I go through the math of VAEs to provide a deeper understanding.

## 1. The Objective

### 1.1 The Ideal Objective

In regular supervised learning, we attempt to optimize parameters $latex \theta $ to maximize the likelihood of obtaining the data:

$latex \max_{\theta} P(Y|X, \theta) $

where $latex X $ is the data and $latex Y $ are the labels.

What, then, is the objective in the case of VAEs? Recall that in the case of autoencoders, we want to learn a model that can generate reasonable samples when it receives codes sampled from the unit Gaussian $latex N(0, I) $. We will denote the codes as the latent variable $latex z $. A “reasonable sample” will have a high probability of belonging to the same distribution as the data $latex X $. Therefore, our objective is to maximize $latex P(X|z, \theta) $ for $latex z $ that are sampled from $latex P(z) $.

Formally, we can write this objective as:

$latex \max_{\theta} \int P(X|z, \theta)P(z) dz = \max_{\theta} P(X|\theta) $

In practice, we maximize the log-likelihood of the data (also known as the evidence):

$latex \max_{\theta} \log P(X|\theta) $

This is our ideal objective that we want to maximize. For the sake of brevity, I will leave the parameter $latex \theta $ out of the equations from here on.

### 1.2 The Real Objective

An astute reader may be a bit confused here: in the previous post, I explained that the encoder produced samples of $latex z $ according to the distribution

$latex Q(z|X) = N(\mu(X), \Sigma(X)) $

where $latex \mu(X) $ and $latex \Sigma(X) $ were dependent on $latex X $. Where is this distribution in the above equation?

Before explaining this, let me explain why directly optimizing $latex \int P(X|z, \theta)P(z) dz $ does not work in practice.

When we try to optimize $latex \int P(X|z, \theta)P(z) dz $, we are optimizing $latex E_{z \sim P(z)}[P(X|z, \theta)] $ (these are equivalent expressions). Since this is impossible to compute analytically, we optimize it by sampling $latex z $ from $latex P(z) $ and computing $latex P(X|z, \theta) $, then performing gradient descent.

The problem is that for each data point $latex X $, **a great majority of $latex z $ is unlikely to produce anything close to it**. This makes learning extremely inefficient, since we will need to sample a large number of $latex z $ to obtain any meaningful output for $latex X $.
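To see just how wasteful sampling from the prior is, here is a minimal NumPy sketch using a made-up one-dimensional model (the model, the observation, and the "useful" threshold are all hypothetical, chosen purely for illustration): even with 100,000 prior samples, only a tiny fraction of $latex z $ land anywhere near the observed data point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D model: z ~ N(0, 1), X | z ~ N(z, 0.1^2).
def likelihood(x, z, sigma=0.1):
    """Density of X given z under a narrow Gaussian decoder."""
    return np.exp(-0.5 * ((x - z) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = 3.0                              # an observation in the tail of the prior
z = rng.standard_normal(100_000)     # samples from the prior P(z)
weights = likelihood(x, z)

# Only z close to x contribute to the Monte Carlo estimate of P(X);
# the vast majority of samples are effectively wasted.
useful = np.mean(weights > 1e-6)
print(f"fraction of useful prior samples: {useful:.4f}")
```

Sampling from a distribution concentrated near the posterior, by contrast, would make nearly every sample count.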

What if, instead of sampling from the prior $latex P(z) $, we sampled from the posterior distribution $latex P(z|X) $ for each data point $latex X $? This would allow us to concentrate on samples that are likely to produce $latex X $ and avoid wasting samples. We would need to slightly change the objective though.

Using Bayes' rule, we can prove that

$latex \log P(X) = E_{z \sim P(z|X)}[\log P(X|z)] - D_{KL}[P(z|X) \| P(z)] $

where $latex D_{KL} $ is the KL-divergence.

From this equation, we see that the only modification we need to make is to subtract a KL-divergence term between the posterior and the prior, and we can optimize the likelihood of $latex X $ by sampling from the posterior.

The only problem is, the posterior $latex P(z|X) $ is impossible to calculate here. Remember, in the VAE, $latex P(X|z) $ is an extremely complicated function computed by a neural network. We have no way of finding the true posterior.

This is where $latex Q(z|X) $ comes in. Instead of sampling from $latex P(z|X) $, we sample from an approximate distribution $latex Q(z|X) $, which is parameterized by a neural network.

Using a similar procedure as the above, we can prove that

$latex \log P(X) - D_{KL}[Q(z|X) \| P(z|X)] = E_{z \sim Q(z|X)}[\log P(X|z)] - D_{KL}[Q(z|X) \| P(z)] $

This is almost entirely the same as the above objective, except for the KL-divergence term on the left-hand side.
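We can sanity-check this identity numerically in a toy linear-Gaussian model (a made-up example, not a real VAE), because there every term has a closed form: the evidence and the true posterior are both Gaussian, and the KL-divergence between univariate Gaussians is known analytically.

```python
import numpy as np

# Tractable toy model where every term is available in closed form:
#   prior      P(z)   = N(0, 1)
#   likelihood P(X|z) = N(z, sigma^2)
sigma = 0.5
x = 1.3                          # an observed data point

# Evidence: P(X) = N(0, 1 + sigma^2)
log_px = -0.5 * np.log(2 * np.pi * (1 + sigma**2)) - x**2 / (2 * (1 + sigma**2))

# True posterior: P(z|X) = N(x / (1 + sigma^2), sigma^2 / (1 + sigma^2))
post_mu = x / (1 + sigma**2)
post_var = sigma**2 / (1 + sigma**2)

# An arbitrary approximate posterior Q(z|X) = N(m, s2)
m, s2 = 0.4, 0.3

def kl_gauss(m1, v1, m2, v2):
    """KL[N(m1, v1) || N(m2, v2)] for univariate Gaussians."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1)

# E_{z~Q}[log P(X|z)] in closed form for a Gaussian likelihood
exp_loglik = -0.5 * np.log(2 * np.pi * sigma**2) - ((x - m) ** 2 + s2) / (2 * sigma**2)

lhs = log_px - kl_gauss(m, s2, post_mu, post_var)   # log P(X) - KL[Q || P(z|X)]
rhs = exp_loglik - kl_gauss(m, s2, 0.0, 1.0)        # E_Q[log P(X|z)] - KL[Q || P(z)]

print(lhs, rhs)   # the two sides agree
```

Note that the equality holds for *any* choice of $latex Q $, no matter how poor; only the tightness of the bound changes.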

This equation is the centerpiece of the mathematics behind VAEs. Let’s look into it a bit further.

## 2. Dissecting the Objective

Anyone who is familiar with variational methods will recognize this objective. It is a commonly occurring equation that underpins many variational Bayesian methods.

Let’s look at the right-hand side. The right-hand side is composed of two terms. The first term

$latex E_{z \sim Q(z|X)}[\log P(X|z)] $

measures how likely the sampled $latex z $ are to generate the data. This basically pulls the model in the direction of explaining the data as well as it can.

The second term

$latex D_{KL}[Q(z|X) \| P(z)] $

measures how far $latex Q(z|X) $ deviates from the prior. This essentially acts as a kind of regularization term, stopping the model from outputting extremely narrow distributions to easily explain the data.
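For the usual choice of a diagonal-Gaussian $latex Q(z|X) $ and a unit-Gaussian prior, this KL term has the well-known closed form $latex \frac{1}{2}\sum_i (\sigma_i^2 + \mu_i^2 - 1 - \log \sigma_i^2) $. A minimal NumPy sketch (the encoder outputs below are made-up values, just to show the computation):

```python
import numpy as np

# Hypothetical diagonal-Gaussian encoder outputs for one data point
mu = np.array([0.5, -1.0, 0.2])        # mu(X)
log_var = np.array([-0.1, 0.3, 0.0])   # log of the diagonal of Sigma(X)

# KL[N(mu, diag(exp(log_var))) || N(0, I)] in closed form
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
print(kl)
```

The term is zero exactly when $latex \mu = 0 $ and $latex \sigma^2 = 1 $, i.e. when the approximate posterior collapses onto the prior.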

The right-hand side is commonly known as the **evidence lower bound (ELBO)**. This is because it sets a lower bound on the evidence (the log-likelihood of the data). To understand further, let’s look at the left-hand side.

The left-hand side of the equation is almost the log-evidence. The only difference is the KL term. Recall that the KL-divergence becomes zero if and only if the two distributions are exactly the same. This means that if $latex Q(z|X) $ is sufficiently close to $latex P(z|X) $, then this term is exactly the log-likelihood. Otherwise, the KL-divergence is always positive, so the right-hand side effectively acts as a lower bound on the log-evidence.

Therefore, **if $latex Q(z|X) $ is sufficiently flexible so that it can approximate $latex P(z|X) $, then we can optimize $latex \log P(X) $ by optimizing the right-hand side of the equation**.

Thankfully, $latex Q(z|X) $ is parametrized by a neural network, so we can expect it to be sufficiently flexible. Since we can sample from $latex Q(z|X) $ and the KL-divergence between Gaussians is analytically computable, the right-hand side can be optimized relatively easily in the case of VAEs.
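Putting the pieces together, the quantity actually optimized per data point can be sketched as a Monte Carlo ELBO estimate. In this sketch the encoder and decoder are stand-in affine functions (real VAEs use neural networks trained by backpropagation), and the decoder is assumed to output the mean of a unit-variance Gaussian; all of these choices are illustrative assumptions, not the canonical setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the encoder and decoder networks.
def encoder(x):
    """Return mu(X) and log-variance of the diagonal Gaussian Q(z|X)."""
    return 0.5 * x, np.full_like(x, -1.0)

def decoder(z):
    """Return the mean of a unit-variance Gaussian P(X|z)."""
    return 2.0 * z

def elbo_estimate(x, n_samples=1):
    mu, log_var = encoder(x)
    std = np.exp(0.5 * log_var)
    total = 0.0
    for _ in range(n_samples):
        # Sample z ~ Q(z|X) via the reparameterization z = mu + std * eps
        z = mu + std * rng.standard_normal(mu.shape)
        x_mean = decoder(z)
        # log P(X|z): first ELBO term, estimated by sampling
        total += np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - x_mean) ** 2)
    # KL[Q(z|X) || N(0, I)]: second ELBO term, in closed form
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return total / n_samples - kl

x = np.array([0.3, -0.8])
print(elbo_estimate(x, n_samples=100))
```

Maximizing this estimate over the encoder and decoder parameters is precisely maximizing the right-hand side of the equation above.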

This is the theoretical backbone of VAEs and why they work.

## 3. Conclusion

Hopefully, this post served to give you a better understanding of VAEs and why they work. Variational methods are applied in a wide variety of areas, so the above reasoning is worth understanding in depth.