In this lecture, we’ll look at generative modeling: the business of training probability models that are too complex to give us an explicit density function over our feature space, but that do allow us to sample points. If we train them well, we get points that look like those in our dataset.
These kinds of methods are often combined with neural nets to produce very complex, high-dimensional objects, for instance images.
Here is the example we gave in the first lecture: a deep neural network from which we can sample highly realistic images of human faces.
source: A Style-Based Generator Architecture for Generative Adversarial Networks, Karras et al.
In the rest of the lecture, we will use the following visual shorthand. The diagram on the left represents any kind of neural network. We don’t care about the precise architecture, whether it has one or a hundred hidden layers and whether it uses fully connected layers or convolutions, we just care about the shape of the input and the output.
The image on the right represents a multivariate normal distribution. Think of this as a contour line for the density function. We’ve drawn it in a 2D space, but we’ll use it as a representation for MVNs in higher dimensional spaces as well. If the MVN is nonstandard, we’ll draw it as an ellipse somewhere else in space.
A plain neural network is purely deterministic. It translates an input to an output and does the same thing every time with no randomness. How do we turn this into a probability distribution?
The first option is to take the output and to interpret it as the parameters of a probability distribution. We simply run the neural network for some input, let it produce some numbers as an output, and then interpret those as the parameters to a probability distribution. This distribution then defines a probability on a separate space. The network plus the probability distribution define a probability distribution conditional on the network input.
If this sounds abstract, note that it is something we've been doing already since the early lectures. For instance: to do binary classification, we defined a neural network with one sigmoid activated output node. We took that output value as the probability that the class was the positive one, but we could also say we're parametrizing a Bernoulli distribution with this value, and the Bernoulli distribution defines probabilities over the space containing the two outcomes "Class=pos" and "Class=neg".
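To make this concrete, here is a minimal sketch (in NumPy, with made-up numbers): interpreting a sigmoid output as the parameter of a Bernoulli distribution makes the negative log-likelihood of the data exactly the familiar binary cross-entropy loss.

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

o = 0.7          # raw network output (logit) for one instance
y = 1            # true class: 1 = "Class=pos", 0 = "Class=neg"

p = sigmoid(o)   # interpret the output as a Bernoulli parameter: P(Class=pos)

# negative log-likelihood of the observed class under Bernoulli(p) --
# this is exactly the binary cross-entropy loss
nll = -(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Maximizing the likelihood of the Bernoulli distribution and minimizing the cross-entropy are the same training objective, just described in different words.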
If we do this with a multiclass problem and a softmax output, we are parametrizing a multinomial distribution.
Another example is regression, either linear or with a neural network.
Here we simply produce a target prediction for x. However, what we saw in the previous lecture is that if we interpret this as the mean of a normal distribution on y, then maximizing the likelihood of this distribution is equivalent to the least squares loss function.
If we build a probability distribution parametrized by a neural network in this way, training it is pretty straightforward. We can easily compute the log-likelihood over our whole data, which then becomes a loss function. With backpropagation and gradient descent, we can train the parameters of the neural network to maximize the likelihood of the data under the model.
For many distributions, this leads to loss functions that we've seen already.
The loss function for a normal output distribution with a mean and variance, is a modification of the squared error. We can set the variance larger to reduce the impact of the squared errors, but we pay a penalty of the logarithm of sigma. If we know we are going to get the output value for instance i exactly right, then we will get almost no squared error and we can set the variance very small, paying little penalty. If we’re less sure, then we expect a sizable squared error and we should increase the variance to reduce the impact a little. This way, we get a neural network that tells us not just what its best guess is, but also how sure it is about that guess.
For a solution that applies to high-dimensional outputs like images, we can use the outputs to parametrize a multivariate normal distribution. Here we'll parametrize both the mean and the covariance matrix.
If we provide both the mean and the variance of an output distribution, it looks like this (for an n-dimensional output space). We simply split the output layer in two parts, and interpret one part as the mean and the other as the covariance matrix.
Since representing a full covariance matrix would grow very big for high-dimensional outputs, we usually assume that the covariance matrix is diagonal (all off-diagonal values are zero). That way the representation of the covariance requires as many arguments as the representations of the mean, and we can simply split the output vector into two halves.
Equivalently, we can think of the output distribution as putting an independent 1D Gaussian on each dimension, with a mean and variance provided for each.
For the mean, we can use a linear activation, since it can have any value, including negative values. However, the values of the covariance matrix need to be positive. To achieve this, we often exponentiate them. We’ll call this an exponential activation. An alternative option is the softplus function ln(1 + e^x), which grows less explosively.
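Here is a sketch of this diagonal-Gaussian output head in NumPy. The function name and layout are illustrative, not from any library: we split the raw output vector into two halves, apply the exponential activation to the second half, and compute the negative log-likelihood of a target under independent 1D Gaussians.

```python
import numpy as np

def gaussian_nll(raw_output, y):
    """Interpret a raw network output vector as the mean and (diagonal)
    covariance of a Gaussian, and return -log N(y | mean, diag(var)).
    A sketch: the splitting convention is an assumption, not a standard."""
    n = len(raw_output) // 2
    mean = raw_output[:n]            # linear activation: any value allowed
    var = np.exp(raw_output[n:])     # "exponential activation": var > 0
    # independent 1D Gaussian per dimension; the NLL sums over dimensions,
    # giving the log-sigma penalty plus the variance-scaled squared error
    return np.sum(0.5 * np.log(2 * np.pi * var)
                  + (y - mean) ** 2 / (2 * var))
```

Note how the two terms trade off exactly as described above: a large variance shrinks the squared-error term but pays a logarithmic penalty.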
Here’s what that looks like when we generate an image. The output distribution gives us a mean value for every channel of every pixel (a 3-tensor) and a corresponding variance for every mean. If we look at what that tells us about the red value of the pixel at coordinate (8, 7) we see that we get a univariate Gaussian with a particular mean and a variance. The mean tells us the network’s best guess for that value, and the variance tells us how certain the network is about this output.
If we want to go all out, we can even make our neural network output the parameters of a Gaussian mixture model. This is called a mixture density network.
All we need to do is make sure that we have one output node for every parameter required, and apply different activations to the different kinds of parameters. The means get a linear activation and the covariances get an exponential activation as before. The component weights need to sum to one, so we need to give them a softmax activation (over just these three output values).
If we want to train with maximum likelihood, we encounter this sum-inside-a-logarithm function again, which is difficult to deal with. But this time, it’s not such a headache. As we noted in the last lecture, we can work out the gradient for this loss; it’s just a very ugly function. Since we are using backpropagation anyway, that needn’t worry us here. All we need to work out are the local derivatives (or backward functions) for functions like the logarithm and the sum, and those are usually provided by systems like Pytorch and Keras anyway.
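Here is a sketch of that sum-inside-a-logarithm loss for a 1D mixture density network, written in NumPy. To keep it numerically stable we compute it via the log-sum-exp trick (subtracting the maximum before exponentiating); all names are illustrative.

```python
import numpy as np

def mdn_nll(weights, means, sigmas, y):
    """Negative log-likelihood of a scalar target y under a 1D Gaussian
    mixture. weights sum to 1 (softmax outputs), sigmas > 0 (exponential
    activation). A sketch of the loss, not a full training routine."""
    # log of each component: log weight plus log Gaussian density
    log_comp = (np.log(weights)
                - 0.5 * np.log(2 * np.pi * sigmas ** 2)
                - (y - means) ** 2 / (2 * sigmas ** 2))
    # -log sum_i exp(log_comp_i): the "sum inside a logarithm",
    # computed stably by factoring out the largest term
    m = log_comp.max()
    return -(m + np.log(np.sum(np.exp(log_comp - m))))
```

With one component of weight 1, this reduces exactly to the single-Gaussian loss from earlier in the lecture.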
The mixture density network may seem like overkill, but it’s actually very useful in regression problems where multiple answers may be valid.
Consider the problem of inverse kinematics in robotics. We have a robot arm with two joints, and we know where in space we want the hand of the arm to be. What angles should we set the two joints to? This is a great application for machine learning: it’s a relatively simple, smooth function. It’s easy to generate examples, and explicit solutions are a pain to write and not robust to noise in the control of the robot. So we can solve it with a neural net.
The inputs are the two coordinates where we want the hand to be (x1, x2), and the outputs are the two angles we should set the joints to (θ1, θ2). The problem we run into, is that for every input, there are two solutions. One with the elbow up, and one with the elbow down. A normal neural network trained with an MSE loss would not pick one or the other, but it would average between the two.
A mixture density network with two components can fix this problem. For each input, it can simply put its components on the two valid solutions.
image source: Mixture Density Networks, Christopher Bishop, 1994
The problem with the robot arm is that the task is uniquely determined in one direction—every configuration of the robot arm leads to a unique point in space for the hand—but not when we try to reason backward from the hand to the configuration of the joints.
Here is a 2D version with that problem. Given x, we can predict t unambiguously. But, if we flip the problem and try to predict x given t, then at values like x=0.5, there are multiple predictions with high density. A network with a single Gaussian head (i.e. what we are implicitly doing when we are using least squares loss), will try to fit its Gaussian over both clusters of the data. This puts the mean, which is our ultimate prediction, between these two clusters, in a region where there is no data at all.
By contrast, the mixture density network can output a distribution with two peaks. This allows it to cover the two groups of points in the output, and so solve the problem in a much more useful way.
The general problem in the middle picture is called mode collapse: we have a problem with multiple acceptable answers, and instead of picking one of the answers at random, the neural network averages over all of them and produces a terrible answer.
If our data is spread out in space in a complex, clustered pattern, and we fit a simple unimodal distribution to it (that is, a distribution with one peak) then the result is a distribution that puts all the probability mass on the average of our points, but very little probability mass where the points actually are.
Mixture density networks go some way towards letting us capture more complex distributions in our neural networks, but when we want to capture something as complex and rich as the distribution on images representing human faces, they’re still insufficient.
A mixture model with k components gives us k modes. So in the space of images, we can pick k images to give high probability and the rest is just a simple Gaussian shape around those k points. The distribution on human faces has infinitely many modes (all possible human faces) that should all be about equally likely. To achieve a distribution this complex, we need to use the power of the neural net, not just to choose a finite set of modes, but to control the whole shape of the probability function.
Letting the neural network pick the parameters of a distribution with a simple shape is only ever going to produce a distribution with a simple shape. We need to change our approach.
Here’s a more powerful idea: we put the probability distribution at the start of the network instead of at the end of it. We sample a point from some straightforward distribution, usually a standard normal distribution, and we feed that point to a neural net. The result of these two steps is a random point, so we’ve defined another probability distribution. We call this construction a generator network.
If we ignore the value of the input, we are now sampling from an unconditional distribution on x.
To see what kind of distributions we might get when we do this, let’s try a little experiment.
We wire up a random network as shown: a two-node input layer, followed by 12 fully connected hidden layers of 100 nodes each with ReLU activations, and a final linear transformation back down to two outputs. We don’t train the network. We just use Glorot initialisation to pick the parameters, and then sample some points. Since the output is 2D, we can easily scatter-plot it.
Here’s a plot of 100k points sampled in this way. Clearly, we’ve defined a highly complex distribution. Instead of having a finite set of single points as modes, we get strands of high probability in space, and sheets of lower, but nonzero probability. Remember, this network hasn’t been trained, so it’s not representing anything meaningful, but it shows that the distributions we can represent in this way form a highly complex family.
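The experiment is easy to reproduce. Here is a NumPy sketch of it (with fewer hidden layers than the 12 in the slides, to keep it short; the Glorot initialisation here is the uniform variant):

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot(n_in, n_out):
    # Glorot/Xavier uniform initialisation: scale set by fan-in + fan-out
    s = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-s, s, size=(n_in, n_out))

# untrained generator: 2 -> 100 -> 100 -> 100 -> 2, ReLU activations
sizes = [2] + [100] * 3 + [2]
layers = [glorot(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

z = rng.standard_normal((10_000, 2))   # sample inputs from a standard MVN
x = z
for W in layers[:-1]:
    x = np.maximum(x @ W, 0.0)         # hidden layer with ReLU
x = x @ layers[-1]                     # final linear transformation

# x now holds 10k samples from an untrained, but already highly
# non-Gaussian, 2D distribution; scatter-plot it to see the strands
```

Plotting `x` with matplotlib’s `scatter` (small point size, some transparency) reproduces the kind of picture shown on the slide.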
We can also use this trick to generate images. A normal convolutional net starts with a low-channel, high resolution image, and slowly decreases the resolution by maxpooling, while increasing the number of channels. Here, we reverse the process. We shape our input into a low resolution image with a large number of channels. We slowly increase the resolution by upsampling layers, and decrease the number of channels.
We can use regular convolution layers, or deconvolutions, which are a kind of upside-down convolution. Both approaches give us effective generator networks for images.
We see that even without training, we have produced a distribution on images that is very complex and non-uniform.
Of course, we can also use both options: we sample the input from a standard MVN, interpret the output as another MVN, and then sample from that.
In these kinds of generator networks, the input is often called z, and the space of inputs is often called the latent space. As we will see later, this maps perfectly onto the hidden variable models of the previous lecture.
So the big question is, how do we train a generator network? Given some data, how do we set the weights of the network so that the sampled outputs start to look like the examples we have in our data?
We’ll start with something that doesn’t work, to help us understand why the problem is difficult. Here is a naive approach: we simply sample a random point x (e.g. a picture) from the data, and sample a point y from the model and train on how close they are.
If we implement this naive approach, we do not get a good probability distribution. Instead, we get mode collapse.
Here is a schematic example of what's happening: the blue points represent the modes (likely points) of the data. The green point is generated by the model. It’s close to one of the blue points, so the model should be rewarded, but it’s also far away from almost all of the other points. During training, there’s no guarantee that we will pair it up with the correct point, and we are likely to compute the loss to a completely different point.
On average the model is punished for generating such points much more often than it is rewarded. The model that generates only the open point in the middle gets a smaller loss (and less variance). Under backpropagation, neural networks tend to converge to a distribution that generates only the open point over and over again.
In other words, the many different modes (areas of high probability) of the data distribution end up being averaged (“collapsing”) into a single point.
Even though we have a probability distribution that is able to represent highly complex, multi-modal outputs, if we train it like this, we still end up producing a unimodal output centered on the mean of our data. If the dataset contains human faces, we get a fuzzy average of all faces, not a sample with individual details.
How do we get the network to imagine details, instead of averaging over all possibilities?
There are two main approaches: GANs and variational autoencoders. We'll give a quick overview of the basic principle of GANs in the next part, and then a more detailed treatment of autoencoders and variational autoencoders in the last two parts.
In the last video, we defined generator networks, and we saw that they represent a very rich family of probability distributions. We also saw that training them can be a tricky business.
In this video we’ll look at one way of training such networks: the method of generative adversarial networks (GANs).
GANs originated just after Convolutional networks were breaking new ground, showing spectacular, sometimes super-human, performance in image labeling. The suggestion arose that perhaps convolutional networks were doing more or less the same as what humans do when they look at something.
To verify this, researchers decided to start investigating what kind of inputs would make a trained convolutional network give a certain output. This is easy to do, you just compute the gradient with respect to the input of the network, and train the input to maximise the activation of a particular label, while keeping the parameters of the network fixed.
You would expect that if you start with a random image, and follow the gradient to maximize the activation of the output node corresponding to the label “bus”, you’d get a picture of a bus. Or at least something that looks a little bit like a bus. What you actually get is something that is indistinguishable from the noise you started with. Only the very tiniest of changes is required to make the network see a bus.
These are called adversarial examples. Instances that are specifically crafted to trip up a given model.
The researchers also found that if they started the search not at a random image, but at an image of another class, all that was needed to turn it into another class (according to the network) was a very small distortion. So small, that to us the image looks unchanged. In short, a tiny change to the image is enough to make a convolutional neural net think that a picture of a bus is a picture of an ostrich.
Adversarial examples are an active area of research (both how to generate them and make models more robust against them).
Even manipulating objects in the physical world can have this effect. A stop sign can be made to look like a different traffic sign by the simple addition of some stickers. Clearly, this has some worrying implications for the development of self-driving cars.
Pretty soon, this bad news was turned into good news by realising that if you can generate adversarial examples automatically, you can also add them to the dataset as negatives and retrain your network to make it more robust. You can simply tell your network that these things are not stop signs. Then, once your network is robust to the original adversarial examples, you can generate some new adversarial examples, and start the whole thing over again.
We can think of this as a kind of iterated 2 player game (or an arms race). The discriminator (our classifier) tries to get good enough to tell fake data from real data and the generator (the algorithm that generates the adversarial examples) tries to get good enough to fool the discriminator.
This is the basic idea of the generative adversarial network.
We’ll look at four different examples of GANs. We’ll call the basic approach the “vanilla GAN”.
Generating adversarial examples by gradient descent is possible, but it’s much nicer if both our generator and our discriminator are separate neural networks. This will lead to a much cleaner approach for training GANs.
We will draw the two components like this. The generator takes an input sampled from a standard MVN and produces an image. This is a generator network as described in the previous video. We don't give it an output distribution (i.e. we're using option 2 from the previous part).
The discriminator takes an image and classifies it as Positive (a real image of the target class) or Negative (a fake image sampled from the generator).
If we have other images that are not of the target class, we can add those to the negative examples as well, but often, the positive class is just our dataset (like a collection of human faces), and the negative class is just the fake images created by the generator.
To train the discriminator, we feed it examples from the positive class, and train it to classify these as Pos.
We also sample images from the generator (whose weights we keep fixed) and train the discriminator to recognize these as negative. At first, these will just be random noise, but there’s little harm in telling our network that such images are not buses (or whatever our positive class is).
Note that since the generator is a neural network, we don’t need to collect a dataset of fake images which we then feed to the discriminator. We can just stick the discriminator on top of the generator, making a single computation graph, and train it by gradient descent to classify the result as negative. We just need to make sure to freeze the weights of the generator, so that gradient descent only updates the discriminator.
Then, to train the generator, we freeze the discriminator and train the weights of the generator to produce images that cause the discriminator to label them as Positive.
This step may take a little time to wrap your head around. If it helps, think of the whole discriminator as a very complicated loss function. Whatever the generator produces, the more likely the discriminator is to call it positive, the lower the loss.
We don’t need to wait for either step to converge. We can just train the discriminator for one batch (i.e. one step of gradient descent) and then train the generator for one batch, and so on.
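This alternating scheme can be sketched in Pytorch as follows. This is a toy setup on 1D “data” (samples from N(3, 1)), not an image GAN; all sizes, names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

# toy networks: 2D latent -> 1D "image"; discriminator: 1D -> real/fake logit
generator = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(10):                    # alternate one batch each
    real = 3.0 + torch.randn(32, 1)       # a batch from the "data"
    z = torch.randn(32, 2)                # a batch from the latent MVN

    # discriminator step: real -> Pos, generated -> Neg.
    # detach() freezes the generator for this step.
    fake = generator(z).detach()
    loss_d = (bce(discriminator(real), torch.ones(32, 1))
              + bce(discriminator(fake), torch.zeros(32, 1)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # generator step: fool the discriminator into saying Pos.
    # the discriminator stays fixed because only opt_g updates weights.
    loss_g = bce(discriminator(generator(z)), torch.ones(32, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

Note how the “freezing” from the slides is done: `detach()` in the discriminator step, and the choice of optimizer in the generator step.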
And this is what we’ll call the vanilla GAN.
Sometimes we want to train the network to map an input to an output, but to generate the output probabilistically. For instance, when we train a network to color in a black-and-white photograph of a flower, it could choose many colors for the flower. We want to avoid mode collapse here: instead of averaging over all possible colors, giving us a brown or gray flower, we want it to pick one color, from all the possibilities.
A conditional GAN lets us train generator networks that can do this.
In a conditional GAN, the generator is a function with an image input, which it maps to an image output. However, it uses randomness to imagine specific details in the output.
In this example, it imagines the photograph corresponding to a line drawing of a shoe. Running this generator twice would result in different shoes that are both “correct” instantiations of the input line drawing.
source: Image-to-Image Translation with Conditional Adversarial Networks (2016), Phillip Isola, Jun-Yan Zhu, et al.
To train a conditional GAN, we give the discriminator pairs of inputs and outputs. If these come from the generator, they should be classified as fake (negative) and if they come from the data, they should be classified as real (positive).
The generator is trained in two ways.
We freeze the weights of the discriminator, as before, and train the generator to produce things that the discriminator will think are real.
We feed it an input from the data, and backpropagate on the corresponding output (using an L1 loss).
The conditional GAN works really well, but only if we have an example of a specific output that corresponds to a specific input. For some tasks, we don’t have paired images. We only have unmatched bags of images in two domains. For instance, we know that a picture of a horse can be turned into a picture of that horse as a zebra (a skilled painter could easily do this), but we don't have a lot of paired images of horses and corresponding zebras. All we have is a large number of horse images and a large number of zebra images.
If we randomly match one zebra image to a horse image, and train a conditional GAN on this, all we get is mode collapse.
CycleGANs solve this problem using two tricks.
First, we train generators to perform the transformation in both directions. We train both a horse-to-zebra generator and a zebra-to-horse generator. Then each horse in our dataset is transformed into a zebra and back again. This gives us a fake zebra picture, which we can use to train a zebra discriminator, together with the real zebra pictures. We do the same thing the other way around: we transform the zebras to fake horses and back again, and use the fake horses together with the real horses to train a horse discriminator.
Second, we add a cycle consistency loss. When we transform a horse to a zebra and back again, we should end up with the same horse again. The more different the final horse picture is from the original, the more we punish the generator networks.
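The cycle consistency loss itself is simple; here is a sketch in NumPy, where the two generator arguments are placeholders for trained networks (any array-to-array function works for illustration):

```python
import numpy as np

def cycle_consistency_loss(x, g_horse2zebra, g_zebra2horse):
    """L1 distance between an image x and its round trip through both
    generators. The generator arguments stand in for trained networks."""
    x_cycled = g_zebra2horse(g_horse2zebra(x))
    return np.mean(np.abs(x - x_cycled))
```

If the two generators are exact inverses of each other, the round trip reproduces the input and the loss is zero; the further the final “horse” drifts from the original, the larger the penalty.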
Here is the whole process in a diagram.
One way to think of this is as the generators practicing steganography: hiding a secret message inside another innocent message. The generators are trying to hide a picture of a horse inside a picture of a zebra. The cycle consistency loss ensures that all the information of the horse picture is in the zebra picture and the horse picture can be fully decoded from the zebra picture. The discriminator's job is to tell which of the zebra pictures it sees have a horse hiding in it.
If we have a strong discriminator and the generator can still fool it, then we get very realistic zebra pictures with horses hidden inside. Since the obvious way to make this transformation is to transform the horse into the zebra in the way we would do it, this is the transformation that the network learns.
The CycleGAN works surprisingly well. Here’s how it maps photographs to impressionist paintings and vice versa.
It doesn’t always work perfectly, though.
Finally, let’s take a look at the StyleGAN, the network that generated the faces we first saw in the introduction. This is basically a Vanilla GAN, with most of the special tricks in the way the generator is constructed. It uses too many tricks to discuss here in detail, so we’ll just focus on one aspect: the idea that the latent vector is fed to the network at each stage of its forward pass.
Since an image generator starts with a coarse (low resolution), high level description of an image, and slowly fills in the details, feeding it the latent vector at every layer (transformed by an affine transformation to fit it to the shape of the data at that stage), allows it to use different parts of the latent vector to describe different aspects of the image (the authors call these “styles”).
The network also receives separate extra random noise per layer, that allows it to make random choices. Without this, all randomness would have to come from the latent vector.
To see how this works, we can try to manipulate the network by changing the latent vector to another one for some of the layers. In this example, the images along the margins are each generated from a single latent vector.
We then re-generate the image for the destination, except that for a few layers (at the bottom, middle or top), we use the source latent vector instead.
As we see, overriding the bottom layers changes things like gender, age and hair length, but not ethnicity. For the middle layers, the age is largely taken from the destination image, but the ethnicity is now overridden by the source. Finally, for the top layers, only surface details are changed.
This kind of manipulation was done during training as well, to ensure that it would lead to faces that fool the discriminator.
Let’s look at the other side of the network: the noise inputs.
If we keep all the latent and noise inputs the same except for the very last noise input, we can see what the noise achieves: the man’s hair is equally messy in each generated example, but exactly in what way it’s messy changes per sample. The network uses the noise to determine the precise orientation of the individual “hairs”.
We’ve given you a high level overview of GANs, which will hopefully give you an intuitive grasp of how they work. However, GANs are notoriously difficult to train, and many other tricks are required to get them to work. Here are some phrases you should Google if you decide to try implementing your own GAN.
In the next video, we’ll look at a completely different approach to training generator networks: autoencoders.
In this part, we'll start to lay the groundwork for Variational Autoencoders. This starts with a completely different abstract task: dimensionality reduction. We'll see that given a dimensionality reduction model, we can often turn it into a generative model with a few hacks. In the next part, we will then develop this type of model in a more grounded and theoretical way.
Before we turn to autoencoders, let's first look at what we can do once we've trained a generator network. We’ll look at four use cases.
The first, of course is that we can generate data that looks like it came from the same distribution as ours.
Another thing we can do is interpolation.
If we take two points in the input space, and draw a line between them, we can pick evenly spaced points on that line and decode them. If the generator is good, this should give us a smooth transition from one point to the other, and each point should result in a convincing example of our output domain.
Remember that in some contexts, we refer to the input of a generator network as its latent space.
We can also draw an interpolation grid; we just map the corners of a square lattice of equally spaced points to four points in our input space, and run all points through the generator network.
If the latent space is high-dimensional, most of the probability mass of the standard MVN is concentrated in a thin spherical shell away from the origin (at a radius of roughly the square root of the number of dimensions), not near the centre as it is in 1, 2 and 3-dimensional MVNs.
For that reason, we get better results if we interpolate along an arc instead of along a straight line. This is called spherical linear interpolation.
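Here is a sketch of spherical linear interpolation (slerp) in NumPy. The formula is the standard one; the function name and signature are ours.

```python
import numpy as np

def slerp(z1, z2, t):
    """Interpolate along the arc between z1 and z2, for t in [0, 1].
    Assumes z1 and z2 are not (anti)parallel, so sin(omega) != 0."""
    cos_omega = np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))  # angle between them
    return (np.sin((1 - t) * omega) * z1
            + np.sin(t * omega) * z2) / np.sin(omega)
```

Unlike straight-line interpolation, the midpoint of two latent vectors of equal norm keeps that norm, so intermediate points stay in the high-probability shell rather than cutting through the low-probability interior.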
What if we want to interpolate between points in our dataset? It’s possible to do this with a GAN trained generator, but to make this work, we first have to find our data points in the input space. Remember, during training the discriminator is the only network that gets to see the actual data. We never explicitly map the data to the latent space.
We can tack a mapping from data to latent space onto the network after training (as was done for these images), but we can also learn such a mapping directly. As it happens, this can help us to train the generator in a much more direct way.
Note that such a mapping would also give us a dimensionality reduction. We can see the latent space representation of the data as a reduced dimensionality representation of the input.
We’ll focus on the perspective of dimensionality reduction for the rest of this video, to set up basic autoencoders. We can get a generator network out of these, but it’s a bit of an afterthought. In the next video, we’ll see how to train generator networks with a data-to-latent-space mapping in a more principled way.
Here’s what a simple autoencoder looks like. It’s a particular type of neural network, shaped like an hourglass. Its job is just to make the output as close to the input as possible, but somewhere in the network there is a small layer that functions as a bottleneck.
After the network is trained, this small layer becomes a compressed low-dimensional representation of the input.
Here’s the picture in detail. We call the bottom half of the network the encoder and the top half the decoder. We feed the autoencoder an instance from our dataset, and all it has to do is reproduce that instance in its output. We can use any loss that compares the output to the original input, and produces a lower loss, the more similar they are. Then, we just backpropagate the loss and train by gradient descent.
We call the blue layer the latent representation of the input. If we train an autoencoder with just two nodes on the latent representation, we can plot what latent representation each input is assigned. If the autoencoder works well, we expect to see similar images clustered together (for instance smiling people vs frowning people, men vs women, etc).
To show what this looks like, we've set up a relatively simple autoencoder consisting of convolutions in the encoder and deconvolutions in the decoder. We train it on a low-res version of the FFHQ dataset of human faces. We give the latent space 256 dimensions.
Here are the reconstructions on a very simple network, with MSE loss on the output after 5 full passes over the data.
After 300 epochs, the autoencoder has pretty much converged. Here are the reconstructions next to the original data. Considering that we've reduced each image to just 256 numbers, it's not too bad.
One thing we can now do is to study the latent space based on the examples that we have. For instance, we can see whether smiling and non-smiling people end up in distinct parts of the latent space.
We just label a small number of instances as smiling or non-smiling (just 20 of each in this case). If we're lucky, these form distinct clusters in our latent space. If we compute the means of these clusters, we can draw a vector between them. We can think of this as a “smiling” vector. The further we push a latent representation along this line, the more the decoded image will smile.
This is one big benefit of autoencoders: we can train them on unlabeled data (which is cheap) and then use only a very small number of labeled examples to “annotate” the latent space. In other words, autoencoders are a great way to do semi-supervised learning.
Once we've worked out what the smiling vector is, we can manipulate photographs to make people smile. We just encode their picture into the latent space, add the smiling vector (times some small scalar to control the effect), and decode the manipulated latent representation. If the autoencoder understands "smiling" well enough, the result will be the same picture but manipulated so that the person will smile.
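A sketch of that manipulation, assuming we already have latent codes for the labeled images. The codes below are random stand-ins rather than real encoder outputs, and `add_smile` is a hypothetical name:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical latent codes; in practice these come from encoder(image).
# We shift the "smiling" cluster so the two groups are separated.
z_smiling = rng.normal(loc=1.0, size=(20, 256))   # 20 labeled smiling faces
z_neutral = rng.normal(loc=0.0, size=(20, 256))   # 20 labeled non-smiling faces

# The smiling vector: the difference between the two cluster means.
smile_vec = z_smiling.mean(axis=0) - z_neutral.mean(axis=0)

def add_smile(z, strength=0.5):
    """Push a latent code along the smiling direction; decoding the result
    should show (roughly) the same face with more of a smile."""
    return z + strength * smile_vec

z = rng.normal(size=256)              # latent code of some input image
z_smiling_more = add_smile(z, strength=1.0)
```

In a real pipeline, `z` would be the encoding of an actual photograph, and `z_smiling_more` would be passed through the decoder to get the manipulated picture.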
Here is what that looks like for our (simple) example model. In the middle we have the decoding of the original data, and to the right we see what happens if we add an increasingly large multiple of the smiling vector.
To the left we subtract the smiling vector, which makes the person frown.
With a somewhat more powerful model, and some face detection, we can see what some famously moody celebrities might look like if they smiled.
What we get out of an autoencoder depends on which part of the model we focus on.
If we keep the encoder and the decoder, we get a network that can help us manipulate data in this way.
If we keep just the encoder, we get a powerful dimensionality reduction method. We can use the latent space representation as the features for a model that does not scale well to too many features (like a non-naive Bayesian classifier).
But this lecture was about generator networks. How do we get a generator out of a trained autoencoder? It turns out we can do this by keeping just the decoder.
We don’t know beforehand where the data will end up in latent space, but after training we can just check. We encode the training data, fit a distribution to this point cloud in our latent space, and then just use this distribution as the input to our decoder to create a generator network.
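As a sketch, with a random point cloud standing in for the encoded training data (the real cloud would come from running the encoder over the whole dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the encoded training data: 1000 latent vectors in 16 dimensions.
latents = rng.normal(size=(1000, 16)) @ rng.normal(size=(16, 16))

# Fit an MVN to the point cloud: just the empirical mean and covariance.
mu = latents.mean(axis=0)
cov = np.cov(latents, rowvar=False)

# Sample new latent points from the fitted MVN; feeding each one to the
# decoder would produce one generated data point.
samples = rng.multivariate_normal(mu, cov, size=400)
```

Each row of `samples` is one input for the decoder, and each decoded output is one generated instance.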
This is the point cloud of the latent representations in our example. We plot the first two of the 256 dimensions, resulting in the blue point cloud.
To these points, we fit an MVN (in 256 dimensions), and we sample 400 new points from it, shown as the red dots.
If we feed these points to the decoder, this is what we get. It's not quite up there with the StyleGAN results, but clearly the model can generate some non-existent people.
This has given us a generator, but we have little control over what the latent space looks like. We just have to hope that it looks enough like a normal distribution that our MVN makes a good fit. In the GAN, we have perfect control over what our distribution on the latent space looks like; we can freely set it to anything. However, there, we have to fit a mapping from data to latent space after the fact.
We’ve also seen that this kind of interpolation works well, but it’s not something we’ve specifically trained the network to do. In the GAN, we should expect all latent space points to decode to something that fools the discriminator, but in the autoencoder, there is nothing that stops the points in between the data points from decoding to garbage.
Moreover, neither the GAN nor the autoencoder is a very principled way of optimizing. Is there a way to train for maximum likelihood directly?
The answer to all of these questions is the variational autoencoder, which we’ll discuss in the next video.
The variational autoencoder is a more principled way to train a generator network using the principles of an autoencoder. This requires a little more math, but we get a few benefits in return.
We’ll start with the maximum log-likelihood objective. We want to choose our parameters θ (the weights of the neural network) to maximise the log-likelihood of the data. We will rewrite this objective step by step until we end up with an autoencoder.
The first insight is that we can view our generator as a hidden variable model. We have a hidden variable z, which is standard normally distributed and then fed to a neural network to produce a variable x, which we observe.
The network computes the conditional distribution of x given z.
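In code, the model is just: draw z from a standard normal, push it through the network. The one-layer `generator` below is a hypothetical stand-in for the decoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, w):
    """Hypothetical stand-in for the decoder network: any deterministic,
    differentiable function of z will do."""
    return np.tanh(z @ w)

w = rng.normal(size=(2, 5))      # the network weights (theta)
z = rng.standard_normal(2)       # hidden: sampled, but never observed
x = generator(z, w)              # observed
```

We only ever observe x; the z that produced it is thrown away, and that is exactly what makes training hard.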
Here we see how the hidden variable causes us problems. If we knew which z was supposed to produce which x, we could feed that z to the network, compute the loss between the output and x, and optimize by backpropagation and gradient descent.
The problem, as in the previous lecture, is that we don’t have the complete data. We don’t know the values of z, only the values of x. There, we introduced a distribution q to approximate the "completion" of the data. We'll do that here as well: we'll introduce a probability distribution that tells us, for a given instance x, which latent representations are most likely.
Before we define q, let's think what it needs to approximate. We want a distribution on the latent space, conditional on some instance x (for instance an image).
In the EM algorithm, we defined q as a big table of numbers (the responsibilities). Because we were approximating a categorical distribution, we could just set the parameters of q(z|x) explicitly for every x. Here, we want to generalize more: we will train a neural network to learn a function that maps a given instance x to a distribution on the latent space. If the network works well the correct latent representation (the one that decodes to x if we feed it to the generator) will get a high probability density under the distribution produced by this network.
This is the network we’ll use. It consumes x and it produces a normal distribution on z. It has its own parameters, phi, which are not related to the parameters theta of p.
In the EM algorithm, for a given choice of parameters θ for p(x|z, θ), we could easily work out the reverse distribution on z, conditioned on x. This gave us a target that q should approximate.
Here, things aren’t so easy. To work out p(z|x, θ) we would need to invert the neural network: work out for a particular output x, which input values z are likely to have caused that output.
This is not impossible to do (we saw something similar at the start of the lecture), but it's a costly and imprecise business. Just like we did with the GANs, it's best to introduce a network that will learn the inversion for us. For this reason, we won't follow the EM logic of alternating approximation. Instead, we'll figure out a way to train p and q together: we'll update the parameters of p to fit the data, and we'll update the parameters of q to keep it a good approximation to the inversion of p.
Since the parameters of our model are the neural network weights, we’ll simplify our notation like this.
This emphasizes that even though these are probabilities, the conditional probabilities shown here are the only ones that we can efficiently compute. We can't reverse the conditional, or marginalize anything out. The price we pay for using the power of neural networks is that we are stuck with these functions and have to build an algorithm on just these.
With one exception. We do know the marginal distribution on z, since that is what we defined as the distribution on the input of our generator neural net: it’s a simple standard normal MVN.
Putting everything together, this is our model. If we feed qv an instance from our data x, we get a normal distribution on the latent space. If we sample a point z from this distribution, and feed it to pw we get a distribution on x. If the networks are both well trained, this should give us a good reconstruction of x.
The neural network pw is our probability distribution conditional on the latent vector. qv is our approximation of the conditional distribution on z.
We can now apply the decomposition we proved in the last lecture. We look at the marginal distribution on x, and break it up into two terms using q.
Before, we had a discrete hidden variable, and now we have a continuous one, but the proof still works (since we wrote everything in terms of expectations).
Here is the decomposition, rewritten in our new notation.
In the EM algorithm, we used this to set up an alternating optimization algorithm: first optimizing one term and then the other. To do that here, we’d need to minimize the KL divergence between q and the inverse of p. This is possible, but expensive. As we said before, the inverse of p is hard to compute.
Instead, we’ll follow a cheaper track: we'll maximize a lower bound to the value we actually want to maximize.
Because we cannot invert pw, we cannot easily compute the KL term (let alone optimise qv to minimise it).
Instead, we focus entirely on the L term. Since it's a lower bound on the quantity we're trying to maximize, anything that increases L will help our model. The better we maximise L, the better our model will do.
Here’s a visualisation of how a lower bound objective works. We’re interested in finding the highest point of the blue line (the maximum likelihood solution), but that’s difficult to compute. Instead, we maximise the orange line (the evidence lower bound). Because it’s guaranteed to be below the blue line everywhere, we may expect to find a high value for the blue line as well. To some extent, pushing up the orange line pushes up the blue line as well.
How well we do on the blue line depends a lot on how tight the lower bound is. The distance between the lower bound and the log likelihood is expressed by the KL divergence between pw(z|x) and qv(z|x). That is, because we cannot easily compute pw(z|x), we introduced an approximation qv(z|x) . The better this approximation, the lower the KL divergence, and the tighter the lower bound.
Right now, we can’t use L as a loss function directly, since it contains functions like p(x, z) that are not easily computed, and because it’s an expectation. We’ll rewrite it step by step until it’s a loss function that can be used directly, cheaply and easily, in a deep learning system.
Since we want to implement a loss function, we will minimize the negative of the L function. We now need to work out what that means for our neural net.
All three probability functions we are left with are ones we can easily compute: q(z|x) is given by the encoder network, p(x|z) is given by the decoder network, and p(z) was chosen when we defined (back in part 1) how the generator works: if we marginalize out x, the distribution on z is a standard multivariate normal.
Let's see if this is a loss function we can implement in a system like pytorch.
The KL term is just the KL divergence between the MVN that the encoder produces for x and the standard normal MVN. This works out as a relatively simple differentiable function of mu and sigma, so we can use it directly in a loss function.
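For a diagonal Gaussian with mean μ and standard deviations σ, this KL divergence has the closed form ½ Σᵢ (σᵢ² + μᵢ² − 1 − log σᵢ²). A NumPy sketch (the function name is ours):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form. It is a simple
    differentiable function of mu and sigma, so it can be a loss term."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

# Zero exactly when the encoder outputs the standard normal MVN:
print(kl_to_standard_normal(np.zeros(4), np.ones(4)))  # 0.0
```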
The second part of our loss function requires a little more work. It’s an expectation for which we don’t have a closed form expression. Instead, we can approximate it by taking some samples, and averaging them. To keep things simple, we just take a single sample (we’ll be computing the network lots of times during training, so overall, we’ll be taking lots of samples).
We almost have a fully differentiable model. Unfortunately, we still have a sampling step in the middle (and sampling is not a differentiable operation).
We can get rid of it by remembering the algorithm for sampling from a given MVN: we sample from the standard MVN, and transform the sample using the parameters of the required MVN.
Since we have, by construction, a diagonal covariance matrix, the sampling algorithm is particularly simple. We just sample a vector from a standard normal distribution, element-wise multiply it by the vector of standard deviations of the target distribution, and add the mean of the target distribution.
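That algorithm is two lines:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_diagonal_mvn(mu, sigma, rng):
    """Sample from N(mu, diag(sigma^2)): draw the randomness first, from a
    standard normal, then scale and shift it. The second step is a simple,
    differentiable, affine operation."""
    eps = rng.standard_normal(mu.shape)  # does not depend on mu or sigma
    return mu + sigma * eps

z = sample_diagonal_mvn(np.array([1.0, -2.0]), np.array([0.5, 0.1]), rng)
```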
Looking at this algorithm we can see two things. First, the sampling part doesn't depend on the parameters produced by q. We can do all the random bits before we know what parameters q has produced for us. Second, the rest of the algorithm is a simple, differentiable and affine operation.
This means that we can basically work the sampling algorithm into the architecture of the network. We provide the network with an extra input: a sample from the standard MVN.
Why does this help us? We’re still sampling, but we’ve moved the random sampling out of the way of the backpropagation. The gradient can now propagate down to the weights of the q function, and the actual randomness is treated as an input, rather than a computation.
And with that, we have a fully differentiable loss function that we can put into a system like Keras or pytorch to train our autoencoder.
The two terms of the loss function are usually called KL loss and reconstruction loss.
The reconstruction loss maximises the probability of the current instances. This is basically the same loss we used for the regular autoencoder: we want the output of the decoder to look like the input.
The KL loss ensures that the latent distributions are clustered around the origin, with variance 1. Over the whole dataset, it ensures that the latent distribution looks like a standard normal distribution.
The formulation of the VAE has three forces acting on the latent space. The reconstruction loss pulls the latent distribution as much as possible towards a single point that gives the best reconstruction. Meanwhile, the KL loss pulls the latent distribution (for all points) towards the standard normal distribution, acting as a regularizer. Finally, the sampling step ensures that not just a single point returns a good reconstruction, but a whole neighbourhood of points does. The effect can be summarized as follows:
The reconstruction loss ensures that there are points in the latent space that decode to the data.
The KL loss ensures that all these points together are laid out like a standard normal distribution.
The sampling step ensures that the points in between these also decode to points that resemble the data.
To define a variational autoencoder, we need to choose the output distribution of our decoder, which will determine the precise form of the reconstruction loss. In these slides, we've used a normal distribution, but for images, that's not usually the best choice.
We can get slightly better results with a Laplace distribution, but convergence will still be slow.
Better results are achieved with the binary cross-entropy. This doesn't correspond to a proper distribution on continuous-valued image tensors, but it's often used anyway because of its fast convergence. To fix this problem, you can use something called a continuous Bernoulli distribution, which will give you fast convergence and a theoretically correct VAE.
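A sketch of that binary cross-entropy reconstruction loss, treating each pixel value in [0, 1] as the parameter of a Bernoulli distribution (the function name and the clipping epsilon are our choices):

```python
import numpy as np

def bce_reconstruction_loss(x, x_hat, eps=1e-7):
    """Binary cross-entropy between the input pixels x (in [0, 1]) and the
    decoder output x_hat, summed over all pixels. The clipping only guards
    against log(0)."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    return -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))
```

A perfect reconstruction of a binary image gives a loss near zero; the further the decoder output drifts from the input, the larger the loss.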
Here are some reconstructions for the regular autoencoder and for the VAE. They perform pretty similarly. There are slight differences if you look closely, but it's hard to tell which is better.
However, if we generate data by providing the generator with a random input, the difference becomes more pronounced. Here we see that the VAE is more likely to generate complete, coherent faces (although both models still struggle with the background).
For completeness, here is the smiling vector, applied to the VAE model.
Here are some examples from a more elaborate VAE.
source: Deep Feature Consistent Variational Autoencoder by Xianxu Hou, Linlin Shen, Ke Sun, Guoping Qiu
![An animated example of latent space interpolation in the DCVAE](https://houxianxu.github.io/assets/dfcvae/combined.gif)
Here is what the algorithm looks like in Pytorch. Load the 5th worksheet to give it a try.
In this worksheet, the VAE is trained on MNIST data, with a 2D latent space. Here is the original data, plotted by their latent coordinates. The colors represent the classes, to which the VAE did not have access.
If you run the worksheet, you’ll end up with this picture (or one similar to it).
While the added value of the VAE is a bit difficult to detect in our example, in other domains it’s more clear.
Here is an example of interpolation on sentences. First using a regular autoencoder, and then using a VAE. Note that the intermediate sentences for the AE are non-grammatical, but the intermediate sentences for the VAE are all grammatical.
source: Generating Sentences from a Continuous Space by Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, Samy Bengio
We see that GANs are in many ways the inverse of autoencoders: GANs have the data space on the inside of the network, while VAEs have it on the outside.