link here
link here

link here

This lecture is a little on the heavy side. We generally pack a lot into the lectures, but this is probably the most dense one in the series. This is mostly to make the story complete and self-contained. We've tried to indicate where you can skip parts to make it easier to get through the whole thing. The aim is to give you a general idea the first time around, but to also ensure that if you do ever find yourself needing to use kernel methods or Lagrange multipliers, you can come back to this lecture and work through the details.

Don't worry if you don't understand all the details in the time you have. If you're struggling with this one, we recommend skimming everything quickly, and then trying the relevant homework exercise. This should tell you which parts to come back to.

The following subjects may appear on the exam as application questions, so these are important to fully understand:

Backpropagation

Lagrange multipliers

The kernel trick

Things like the unconstrained SVM loss and the derivation of the SVM dual objective are good to know but they won't be on the exam as application questions, so you can skip them if you're struggling to find the time to wrap your head around the whole lecture.

link here

A few lectures ago, we saw how we could make a linear model more powerful, and able to learn nonlinear decision boundaries by just expanding our features: we add new features derived from the old ones, and depending on which combinations we add, we can learn new, non-linear decision boundaries or regression functions.

link here

Both models we will see today, neural networks and support vector machines, take this idea and build on it. Neural networks are a big family, but the simplest type, the two-layer feedforward network, functions as a feature extractor followed by a linear model. In this case, we don’t choose the extended features but we learn them, together with the weights of the linear model.

The support vector machine doesn’t learn the expanded features (we still have to choose them manually), but it uses a kernel function that allows us to fit a linear model in a very high-dimensional feature space without having to pay the cost of actually computing all these expanded features.

link here

The layout of today’s lecture will be largely chronological. We will first focus on neural networks, which were very popular in the late eighties and early nineties.

Then, towards the end of the nineties, interest in neural networks died down a little and support vector machines became much more popular.

In the next lecture, we’ll focus on Deep Learning, which sees neural networks make a comeback in a big way.

link here

In this video, we’ll start with the basics of neural networks.


In the very early days of AI (the late 1950s), researchers decided to try a simple approach: the brain is the only truly intelligent system we know, so let’s see what it’s made of, and whether that provides some inspiration for intelligent (and learning) computer systems.


They started with a single brain cell: a neuron. A neuron receives multiple different input signals from other cells through connections called dendrites. It processes these in a relatively simple way, deriving a single new signal, which it sends out through its single axon. The axon branches out so that this single output signal can reach different cells.


image source: http://www.sciencealert.com/scientists-build-an-artificial-neuron-that-fully-mimics-a-human-brain-cell

link here

These ideas needed to be radically simplified to work with computers of that age, but the basic idea was still there: multiple inputs, one output. Doing this yielded one of the first successful machine learning systems: the perceptron. This was the model we saw in action in the video in the first lecture.

The perceptron has a number of inputs, the features in modern parlance, each of which is multiplied by a weight. The result is summed, together with a bias parameter, and the sign of this result is used as the classification.

Of course, we’ve seen this classifier already: it’s just our basic linear classifier. The training algorithm was a little different from gradient descent, but the basic principle was the same.

Note that when we draw the perceptron this way, as a mini network, the bias can be represented as just another input that we fix to always be 1. This is called a bias node.

link here

Of course the brain’s power does not come from the fact that a single neuron is such a powerful mechanism by itself: it’s the composition of many simple parts that allows it to do what it does. We make the output of one neuron the input of another, and build networks of billions of neurons.

And this is where the perceptron turns out to be too simple an abstraction. Because composing perceptrons doesn’t make them more powerful. Consider the graph on the left, with multiple perceptrons composed together.

Writing down the function that this graph represents, we see that we get a simple function, with the first two perceptrons in brackets. If we then multiply out the brackets, we see that the result is a linear function. This means that we can represent this function also as a single perceptron with four inputs. This is always true. No matter how many perceptrons you chain together, the result will never be anything more than a simple linear function over your inputs: a single perceptron.


click image for animation
link here

To create perceptrons that we can chain together in such a way that the result will be more expressive than any single perceptron could be, the simplest trick is to include a non-linearity, also called an activation function.

After all the weighted inputs have been combined, we pass the resulting scalar through a simple non-linear scalar function to produce the output. One popular option, especially in the early days of neural networks, was the logistic sigmoid, which we’ve seen already. Applying a sigmoid means that the sum of the inputs can range from negative infinity to positive infinity, but the output is always in the interval [0, 1].

Another, more recent non-linearity is the linear rectifier, or ReLU nonlinearity. This function just sets every negative input to zero, and keeps everything else the same.

Not using an activation function is also called using a linear activation.

click image for animation
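As a minimal sketch (in NumPy, not the lecture's own code), here is what these two activation functions look like when written out:

    import numpy as np

    def sigmoid(z):
        # squashes any input from (-infinity, infinity) into the interval between 0 and 1
        return 1.0 / (1.0 + np.exp(-z))

    def relu(z):
        # sets every negative input to zero, keeps everything else the same
        return np.maximum(0.0, z)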
link here

Using these nonlinearities, we can arrange single perceptrons into neural networks. Any arrangement of perceptrons makes a neural network, but for ease of training, this arrangement seen here was the most popular for a long time. It’s called the feedforward network or multilayer perceptron (MLP). We arrange a layer of hidden units in the middle, each of which acts as a perceptron with a nonlinearity, connecting to all input nodes. Then we have one or more output nodes, connecting to all hidden units. Note the following points.

There are no cycles, the network feeds forward from input to output.

Nodes in the same layer are not connected to each other, or to any other layer than the next and the previous one.

Each layer is fully connected to the previous layer, every node in one layer connects to every node in the layer before it.

In the 80s and 90s these networks usually had just one hidden layer, because we hadn’t figured out how to train deeper networks.

Note that every line in this picture represents one distinct parameter of the model. The blue lines (those connected to bias nodes) represent biases, and the rest represent weights.

We can use networks like these to do classification or regression.

link here

To build a regression model, all we need is one output node without an activation. This means that our network as a whole, describes a function from the feature space to the real number line.

We can think of the first layer of our network as computing a feature expansion: the same thing we did in the fourth lecture to enable our linear regression to learn non-linear patterns, but this time, we don’t have to come up with the feature expansion ourselves, we simply learn it. The second layer is then just a linear regression in this expanded feature space.

The number of hidden nodes is a hyperparameter. More nodes makes the network more powerful (that is, able to represent more different functions), but also more likely to overfit, more expensive to compute and potentially more difficult to train. The only real advice we can give is that whenever possible, your hidden layer should be wider than the input layer.

After we've computed the output, we can apply any regression loss function we like, such as least-squares loss.

click image for animation
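As a concrete sketch (a hypothetical example, not the lecture's own code), here is such a regression network with one hidden layer and a single linear output node, together with a least-squares loss:

    import numpy as np

    def mlp_regression(x, W1, b1, w2, b2):
        # x: input features; W1, b1: first-layer weights and biases;
        # w2, b2: second-layer weights and bias
        h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))  # hidden layer: a learned feature expansion
        return w2 @ h + b2                        # linear regression on the hidden features

    def least_squares_loss(y, t):
        # half the squared difference between output y and target t
        # (the 1/2 is a common convention that simplifies the gradient)
        return 0.5 * (y - t) ** 2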
link here

To build a binary classifier, we could do what the perceptron did: use the sign of the output as the class. This would be a bit like using our least squares classifier from the second lecture, except with a feature expansion layer below it.

These days, however, it’s much more common to take inspiration from the logistic regression. We apply the logistic sigmoid to the output and interpret the resulting value as the probability that the given input (x) is of the positive class.

The logarithmic loss that we use for logistic regression, can then be applied here as well.

click image for animation
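A minimal sketch of this classification head (assuming the network's linear output is called y_linear, and the true label t is 0 or 1):

    import numpy as np

    def binary_log_loss(y_linear, t):
        p = 1.0 / (1.0 + np.exp(-y_linear))  # sigmoid: probability of the positive class
        # negative log-probability of the true class
        return -(t * np.log(p) + (1 - t) * np.log(1 - p))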
link here

For multiclass classification, we can use something called a softmax activation. We create a single output node for each class, and then ensure that they are all positive and that together they sum to one. This allows us to interpret them as class probabilities.

The softmax function is one way to ensure this property. It simply passes each output node through the exponential function, to ensure that they are all positive, and then divides each by the sum total, to ensure that all outputs together sum to one.

After the softmax we can interpret the value of node y3 as the probability that x has class 3. Given these probabilities, we can apply a simple log loss: the aim is to maximize the logarithm of the probability of the true class.

click image for animation
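Here is a short sketch of the softmax and the corresponding log loss in NumPy (subtracting the maximum before exponentiating is a standard numerical-stability trick, not something the slides discuss):

    import numpy as np

    def softmax(o):
        e = np.exp(o - o.max())  # exponentiate to make all values positive
        return e / e.sum()       # normalize so the outputs sum to one

    def log_loss(o, true_class):
        p = softmax(o)
        # maximizing the log-probability of the true class = minimizing this
        return -np.log(p[true_class])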
link here

Because neural networks can be expensive to compute we tend to use stochastic gradient descent to train them.

Stochastic gradient descent is very similar to the gradient descent we’ve seen already, but we define the loss function over a single example instead of summing over the whole dataset: just use the same loss function, but pretend your data set consists of only one instance. We then loop over all instances, and perform a small gradient descent step for each one based on only the loss for that instance.

Stochastic gradient descent has many advantages, including:

Using a new instance each time adds some noise to the process, since the gradient will be slightly different for each instance, which can help to escape local minima.

Gradient descent works fine if the gradient is not perfect, but still good on average (over many instances). This means that taking many small inaccurate steps is often much better than taking one very accurate big step.

Computing the loss over the whole dataset is expensive. By computing the loss over one instance at a time, we get N steps of stochastic gradient descent for the price of one step of regular gradient descent.
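In pseudocode-like Python, the resulting training loop might look something like this (a sketch; grad_fn stands for whatever computes the gradient of the loss on a single instance, and params is assumed to be a NumPy array):

    import random

    def sgd(params, data, grad_fn, learning_rate=0.01, epochs=10):
        for epoch in range(epochs):
            random.shuffle(data)                         # visit the instances in a random order
            for x, t in data:                            # one instance at a time
                gradient = grad_fn(params, x, t)         # gradient of the loss on this instance only
                params = params - learning_rate * gradient   # one small gradient descent step
        return params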

link here

Apart from this exception, the training of a neural network proceeds in much the same way as the training of linear classifiers we've seen already.

link here

Before we dig into the details, we can get a sense of what neural network training looks like in the tensorflow playground. We suggest you play around a bit with the different datasets, different activations, and try to change the shape of the network.

Note how the shape of the decision boundary changes based on the activation functions we choose (curvy for sigmoid, piecewise linear for ReLU).

Note that adding another layer makes the network much more difficult to train (especially with sigmoid activations).

Try the linear activation (i.e. no activations on the hidden nodes). Note that all you get is a linear decision boundary, no matter how many layers you try.

Try a network on the circular dataset, with hidden layers of 2 units. It should not be possible to solve the circular dataset this way. It can be shown that to create a closed shape like a circle as a decision boundary, at least one hidden layer needs to be strictly bigger than your input layer.

link here

That’s the basic idea of neural networks. So far, it’s hopefully a pretty simple idea. The complexity of neural networks lies in computing the gradients. For such complex models, sitting down at the kitchen table with pen and paper, and working out a symbolic expression for the gradient is no longer feasible. If we manage it at all, we get horrible convoluted expressions that no longer reduce to nice, simple functions, as they did in the case of linear regression and logistic regression.

To help us out, we need the backpropagation algorithm, which we’ll discuss in the next video.

link here
link here


link here

In the last video, we saw what the structure of a very basic neural network was, and we ended on this question. How do we work out the gradient?

For neural networks, the gradients quickly get too complex to work out by hand, so we need to automate this process.

link here

There are three basic flavors of working out derivatives and gradients automatically.

The first is to do it symbolically. What we do on pen and paper, when we work out a derivative, is a pretty mechanical process. It’s not too difficult to program this process and let the computer do it for us. This is what happens when you ask Wolfram Alpha to work out a derivative, for instance. It has its uses, certainly, but it won’t work for us. The symbolic expression of the gradient of a function grows exponentially with the complexity of the original function. That means that as we build bigger and bigger networks, the expression of the gradient would soon grow too big to store in memory, let alone to compute.

An alternative approach is to forget the symbolic form of the function, and just estimate the gradient for a specific input x. We could, for instance, pick some points close to x and fit a hyperplane through the outputs. This would be a pretty good approximation of the tangent hyperplane, so we could just read out the gradient. The problem is that this is a pretty unstable business. It’s quite difficult to be sure that the answer is accurate. It’s also expensive: the more dimensions in your model space, the more points you need to get an accurate estimate of your gradient, and each point requires you to recompute your model for a new input.

Backpropagation is a middle ground: it does part of the work symbolically, and part of the work numerically. We get a very accurate computation of the gradient, and the cost of computation is usually only twice as expensive as computing the output for one input.

click image for animation
link here

Here are the three steps required to implement backpropagation for a given function.

link here

To show that backpropagation is a generic algorithm for working out gradients, not just a method for neural networks, we’ll first show how it works for some arbitrary scalar function: f(x) = 2/sin(e^(-x)).

First we take our function f, and we break it up into a chain of smaller functions, the output of each feeding into the next. Defining the functions a, b, c, and d as shown, we can write f(x) = d(c(b(a(x)))).

The graph on the right is called a computation graph: each node represents a small computer program that receives an input, computes an output and passes it on to another module.

click image for animation
link here

Because we’ve described our function as a composition of modules, we can work out the derivative purely by repeatedly applying the chain rule.

click image for animation
link here

We’ll call the derivative of the whole function with respect to input x the global derivative, and the derivative of each module with respect to its input we will call a local derivative.

click image for animation
link here

The next step is to work out the local derivatives symbolically, using the rules we know.

The difference from what we normally do is that we stop when we have the derivatives of the output of a module in terms of the input. For instance, the derivative ∂c/∂b is cos b. Normally, we would fill in the definition of b and see if we could simplify any further. Here we stop once we know the derivative in terms of b.

click image for animation
link here

Then, once all the local derivatives are known, in symbolic form, we switch to numeric computation. We will take a specific input, in this case -4.499 and compute the gradient only for that.

First we compute the output of the function f given this input. We do this simply by following the computation graph: the input is fed to the first module, and its output is fed to the second module, and so on. This is known as the forward pass. During our computation, we also retain our intermediate values a, b, c and d. These will be useful later on.

click image for animation
link here

Next up is the backward pass. We take the chain-rule derived form of the derivative, and we fill in the intermediate values a, b, c and d.

This gives us a function with no variables, so we can compute the output. The result is that the derivative of this function, for the specific input -4.499, is 0.

Note that we have stopped doing symbolic computations: we fill in the numeric values and work out the numeric result (accepting a small amount of inaccuracy due to floating point imprecisions).

click image for animation
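Here is the whole procedure for this example in a short NumPy sketch. This is one natural way to slice the function into modules (the slides' decomposition may differ in detail): the forward pass stores the intermediate values, and the backward pass multiplies the symbolic local derivatives together with those intermediates filled in.

    import numpy as np

    def f_forward_backward(x):
        # forward pass: break f(x) = 2 / sin(e^(-x)) into modules, keeping the intermediates
        a = -x
        b = np.exp(a)
        c = np.sin(b)
        d = 2.0 / c          # d is the output f(x)

        # backward pass: chain rule, with the stored intermediates filled in
        dd_dc = -2.0 / c**2  # local derivative of d = 2/c
        dc_db = np.cos(b)    # local derivative of c = sin(b)
        db_da = np.exp(a)    # local derivative of b = e^a
        da_dx = -1.0         # local derivative of a = -x
        df_dx = dd_dc * dc_db * db_da * da_dx
        return d, df_dx

    output, gradient = f_forward_backward(-4.499)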
link here

If you're still struggling to see the difference between these three options, consider whether the answer you get is specific to a particular input x, or whether you get an answer that applies to the variable x, regardless of its value.

When we worked out the gradient for least squares regression or for logistic regression, the result was not specific to a particular input. We didn't get a specific vector filled with numbers, we got a symbolic function that told us how to compute the gradient for any model and dataset. In short, if you can do it without knowing what the specific numbers of the input are, it's symbolic computation.

With numeric computation, you don't get this function. You specify the particular dataset and the current model, and you get an estimate of the gradient as a single vector filled with specific numbers. If you change the model or the dataset, you have to do the whole gradient estimation again. If you need to know the specific numbers of the input before you can start, it's numeric computation.

Backpropagation is, as we've said, a middle ground. It works out the derivatives of the modules with respect to their inputs symbolically. You can do this without knowing the specific input we want the gradient for, so this part is symbolic. We then switch to numeric computation. This part can only start if we know the specific input we are computing the gradient for, so this part is numeric.

link here

More fine-grained modules make the local derivatives easier to work out, but may increase the numeric instability. Less fine-grained modules usually result in more accurate gradients, but if we make them too big, we end up with the problem that the symbolic expression of the gradient grows too complex.

link here

Let’s now see how this works for a neural net.

It’s important to remember that we don’t care about the derivative of the output y with respect to the inputs x. The function we’re computing the gradient for is the loss, and the variables we want to compute the gradient for are the parameters of the network. x does end up in our computation, because it’s part of the loss, but only as a constant.

We'll work out what the gradient descent update will look like for the weights in the first and second layer.

click image for animation
link here

Here’s what the local gradients look like for the weight v2.

The line on the bottom shows how we update v2 when we apply a single step of stochastic gradient descent for x (x may not appear in the gradient, but the values y and h2 were computed using x).

click image for animation
link here

To see what this update rule means, we can use an analogy. Think of the neural network as a hierarchical structure like a government trying to make a decision. The output node is the prime minister: he provides the final decision (for instance what the tax on cigarettes should be).

To make this decision, he listens to his ministers. His ministers don’t tell him what to do, they just shout. The louder they shout, the higher they want him to make the output.

If he trusts a particular minister, he will weigh their advice positively, and follow it. If he distrusts the minister, he will do the opposite of what the minister says. The ministers each listen to a bank of civil servants and weigh their opinions in the same way the prime minister weighs theirs. All ministers listen to the same civil servants, but they have their own level of trust for each.


image sources: https://www.government.nl/government/members-of-cabinet/mark-rutte, https://www.government.nl/government/members-of-cabinet/ingrid-van-engelshoven, https://www.rijksoverheid.nl/regering/bewindspersonen/kajsa-ollongren

Photo: Yordan Simeonov (EU2018BG). This file is derived from: Informal JHA meeting (Justice) Arrivals (26031834658).jpg, CC BY-SA 2.0, https://commons.wikimedia.org/w/index.php?curid=70324317

link here

So let’s say the network has produced an output. The prime minister has set a tax on cigarettes y, and based on the consequences realises that he should actually have set a tax of t. He’s now going to adjust his level of trust in each of his subordinates.

Looking at the update rule for weight v2, we can see that he takes two things into account: the error (y-t), how wrong he was, and what minister h2 told him to do.

If the error is positive, he set y too high. If h2 shouted loudly, he will lower his trust in her.

If the error is negative, he set y too low. If h2 shouted loudly, he will increase his trust in her.

If we use a sigmoid activation, the ministers can only provide values between 0 and 1. If we use an activation that allows h2 to be negative, we see that the sign is taken into account as well: for instance, if h2 was negative and the error was positive, the trust in the minister increases (because she argued for a lower output, and the PM should’ve listened to her).

link here

link here

So far, this is no different from gradient descent on a linear model. The real power of the backpropagation algorithm shows when we look at how the error propagates back down the network (hence the name) and is used to update the weights. Let's look at the derivative for weight w12.

click image for animation
link here

To see how much minister h2 needs to adjust her trust in x1, she first looks at the global error. To see how much she contributed to that global error, and whether she contributed negatively or positively, she multiplies by v2, her level of influence over the decision. Then she looks at how much the input from all her subordinates influenced the decision, considering the activation function (that is, if the input was very high, she’ll need a bigger adjustment to make a meaningful difference). Finally she multiplies by x1, to isolate the effect that her trust in x1 had on her decision.

click image for animation
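Putting the two update rules together, here is a sketch of one stochastic gradient descent step for this two-layer network, assuming sigmoid hidden units and a squared-error loss 0.5*(y - t)^2. These assumptions are consistent with the updates described above, but the slides' exact setup may differ in detail.

    import numpy as np

    def sgd_step(x, t, W1, b1, v, c, lr=0.1):
        # forward pass, keeping the intermediate values
        z = W1 @ x + b1                   # pre-activations of the hidden layer
        h = 1.0 / (1.0 + np.exp(-z))      # hidden activations (sigmoid)
        y = v @ h + c                     # linear output
        err = y - t                       # the error (y - t)

        # backward pass: gradients matching the update rules above (under these assumptions)
        grad_v = err * h                  # e.g. dL/dv2 = (y - t) * h2
        grad_c = err
        delta = err * v * h * (1.0 - h)   # error propagated down to the hidden layer
        grad_W1 = np.outer(delta, x)      # e.g. dL/dw12 = (y - t) * v2 * h2 * (1 - h2) * x1
        grad_b1 = delta

        # gradient descent updates
        return (W1 - lr * grad_W1, b1 - lr * grad_b1,
                v - lr * grad_v, c - lr * grad_c)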
link here

link here

These weren’t just reasons not to use neural nets in production. They also slowed down the research on neural nets. SVM researchers were (probably) able to move faster, because once they’d designed a kernel, they could compute the optimal model performance and know, without ambiguity, whether it worked or not. Neural net researchers could design an architecture and spend months tuning the training algorithm without ever knowing whether the architecture would eventually perform.

link here

One important part of building such a framework is to recognise that all of this can easily be described as matrix multiplication/addition, together with the occasional element-wise non-linear operation. This allows us to write down the operation of a neural network very elegantly.

In order to make proper use of this, we should also work out how to do the backpropagation part in terms of matrix multiplications. That’s where we’ll pick up next week in the first deep learning lecture.

click image for animation
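For instance, the full forward pass for a whole batch of instances can be written in a couple of lines. The shapes here are my own convention: X is instances by features, W1 is features by hidden units, W2 is hidden units by outputs.

    import numpy as np

    def forward(X, W1, b1, W2, b2):
        H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))  # matrix multiplication, then element-wise sigmoid
        Y = H @ W2 + b2                           # second matrix multiplication: the linear output layer
        return Y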
link here
link here

link here

In lecture 5, we introduced the logistic regression model, with the logarithmic loss. We saw that it performed very well, but it had one problem: when the data are very well separable, it didn’t have any basis to choose between two models like this: both separate the training data very well. Yet, they’re very different models.

There are some tricks we can add to the logistic regression to deal with this problem, but today we'll look at a loss function that takes this problem as its starting point: the maximum margin hyperplane classifier.

link here

Here is an extreme example of the problem. We have two linearly separable classes and a decision boundary that separates the data perfectly. And yet, if I see a new instance that is very similar to the rightmost red diamond, but with a slightly higher x1 value, it is suddenly classified as a blue disc.

This illustrates the intuition behind the loss function we will introduce in this video. If we see new points near our existing points, they should be classified the same as the existing points. One way to accomplish this is to look at the distance from the decision boundary to the nearest red diamond and blue disc, and to maximize that.

click image for animation
link here

What we are looking for is the hyperplane that separates the classes and has a maximal distance to the nearest positive point and nearest negative point.

link here

We measure the distance m at a right angle to the decision boundary. For the positive class, there is only one point nearest the margin, but for the negative class, there are two at the same distance.

link here

The points closest to the decision boundary are called the support vectors. This name comes from the fact that the support vectors alone are enough to describe the model. If I give you the support vectors, you can work out the hyperplane without seeing the rest of the data.

The distance to the support vectors is called the margin. We’ll assume that the decision boundary is chosen so that the margin is the same on both sides.

link here

So, given a dataset, how do we work out which hyperplane maximizes the margin?

This is a tricky problem, because the support vectors aren’t fixed. If we move the hyperplane around to maximize the distance to one set of support vectors, we may move too close to other points, making them the support vectors.

Surprisingly, there is a way to phrase the maximum margin hyperplane objective as a relatively simple optimization problem.

link here

To work this out, let’s first review how we use a hyperplane to define a linear decision boundary. Here is the 1D case. We have a single feature and we first define a linear function from the feature space to a scalar y.

If the function is positive we assign the positive class, if it is negative, we assign the negative class. Where this function is equal to 0, where it “intersects” the feature space, is the decision boundary (which in this case is just a single point).

Note that by defining the decision boundary this way, we have given ourselves an extra degree of freedom: the same decision boundary can be defined by infinitely many hyperplanes. We’ll use this extra degree to help us define a single hyperplane to optimize.

click image for animation
link here

Here’s the picture for a two dimensional feature space. The decision boundary is the dotted line where the hyperplane intersects the (x1, x2) plane. If we rotate the hyperplane about that dotted line, we get a different hyperplane defining the same decision boundary.

link here

The hyperplane h we will choose is the one that produces y=1 for the positive support vectors and y=-1 for the negative support vectors. Or rather, we will define the support vectors as those points for which the line produces 1 and -1.

For all other negative points, h should produce values below -1 and for all other positive points, h should produce values above 1.

click image for animation
link here

This is the picture we want to end up with in 2 dimensions. The linear function evaluates to -1 for the negative support vectors, and to a lower value for all other negative points. It evaluates to 1 for the positive support vectors and to a higher value for all other positive points.

The trick we use to achieve this is to optimize with a constraint. We first define the margin as the distance from the decision boundary, where h evaluates to zero, to the line where h evaluates to 1, and on the other side to the line where h evaluates to -1. Then we set the constraint that all points should be on the correct side of their respective margins.

link here

Here is our objective, written as precisely as we can manage at the moment. We will make this more precise as we move on.

The quantity that we want to maximize is "2 times the margin": the width of the band separating the negative from the positive support vectors (between the two dotted lines in the previous slide).

The constraints define the support vectors: all positive points should evaluate to 1 or higher. All negative points should evaluate to -1 or lower. Note that if we have N instances in our data, this gives us a problem with N constraints.

Note that this automatically ensures that the support vectors end up at -1 and 1. Why?

link here

Here is a picture of a case where all negative points are strictly less than -1, and all positive points are strictly larger than 1. The constraints are satisfied, but there are no points on the edges of the margin: we have no support vectors.

In this case, we can easily make the margin bigger, pushing it out to coincide with the nearest points. Therefore, we have not hit the maximum yet. This is not an optimal solution to our optimization problem.

Thus, any hyperplane with a maximal margin that satisfies the constraints must have points on the edges of its margin. These points are the support vectors.

link here

Here is the picture in 3D. Just like the hyperplane crosses the plane where y=0 to make the decision boundary, it crosses the y=1 plane to make the positive margin, and it crosses the y=-1 plane to make the negative margin.

Imagine finding a hyperplane that separates the classes, and then angling it so that the margins hit the nearest points.

link here

Here is the picture for a single feature. We want to maximize the distance between the point where the hyperplane hits -1 and where it hits 1, while keeping the negatives below -1 and the positives above 1.

click image for animation
link here

So, how do we work this into a practical optimization objective that we can actually solve?

The first thing we’ll do, is to simplify the two constraints for the two classes into a single constraint.

We introduce a label y_i for each point x_i, which is -1 for negative points and +1 for positive points. Multiplying the left-hand side of the constraint by y_i keeps it the same for positive points and flips the sign for negative points. This means that in both cases, the left-hand side should now be larger than or equal to one.

We now have a problem with the same constraint for every instance in the data.

Next, we need to make the phrase "2x the size of the margin" more precise. We know that our hyperplane, whichever hyperplane we choose, is defined by parameters w and b. Looking at the parameters of a particular hyperplane (good or bad), can we tell what the size of the margin is?

click image for animation
link here

First, let's recall what the parameters mean geometrically. Remember that in the equation w^T x + b, w is the vector pointing orthogonally to the decision boundary. b is how high the hyperplane is at the origin.

link here

This is the value we’re interested in expressing. Twice the margin.

link here

To make the math easier, let’s move the axes around so that the lower dotted line (belonging to the negative support vectors) crosses the origin. Doing this doesn’t change the size of the margin.

We can now imagine a vector from the origin to the upper dotted line, at a right angle. Call this vector a. The length of a is exactly the quantity we’re interested in.

Remember also that the vector w points in the same direction as a, because both are perpendicular to the decision boundary.

click image for animation
link here

Because of the way we’ve moved the hyperplane, we know that the origin (the zero vector) lies on the negative margin, so the hyperplane evaluates to -1 there. We also know that a lies on the positive margin, so the hyperplane evaluates to +1 there.

Subtracting the first from the second, we find that the dot product of a and w must be equal to two.

Since a and w point in the same direction (cos α = 1), their dot product is just the product of their magnitudes (see the geometric definition of the dot product on the right).

Re-arranging, we find that the length of a is 2 over that of w.



click image for animation
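Spelled out step by step, using the definitions above (a lies on the positive margin, the shifted origin on the negative margin):

    w^T 0 + b = -1          (the origin lies on the negative margin)
    w^T a + b = +1          (a lies on the positive margin)
    subtracting:            w^T a = 2
    a and w are parallel:   ||w|| ||a|| = 2
    so:                     ||a|| = 2 / ||w||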
link here

So, the thing we actually want to maximise is 2/||w||. This gives us a precise optimization objective.

Note that almost all the complexity of the loss is in the constraints. Without them we could just let all elements of w go to zero. However, the constraints require the output of our model to be larger than 1 for all positive points and smaller than -1 for all negative points. This will automatically push the margin up to the support vectors, but no further.

link here

Since we prefer to minimize instead of maximize, we take the inverse of this objective, and minimize that. The resulting classifier is called a "hard margin" support vector machine (SVM), since no points are allowed to violate the constraint and end up inside the margin.

The hard margin SVM is nice, but it doesn’t work well when:

We have data that is not linearly separable

We could have a very nice decision boundary if we just ignored a few misclassified points. For instance, when there is a little noise, or a few outliers.

link here

A common alternative is to replace the norm of w by the dot product of w with itself. This just removes the square root from the norm, so it doesn't change the location of the minimum.

This form is easier to work with if we want to work out the gradient explicitly.

click image for animation
link here

To deal with such situations, we can allow a soft margin. In a soft margin, we allow a few points to be on the wrong side of the margin, if it helps us achieve a better fit on the rest of the points. That is, we can trade off a few violations of the constraints against a bigger margin.

link here

To achieve this, we introduce a slack parameter p_i for each point x_i. This parameter indicates the extent to which the constraint on x_i is relaxed. Our learning algorithm can set p_i to whatever it likes. If it sets p_i to zero, the constraint is the same as it was for the hard margin classifier. If it sets p_i higher than zero, the constraint is relaxed and the point x_i can fall inside the margin.

The price we pay is that p_i is added to our minimization objective, so the value we reach there becomes higher if we use more nonzero slack parameters.

Our search algorithm, which we will detail later, does the rest. It automatically makes the tradeoff between how much we want to violate the original constraints and how big we want the margin to be.

C is a hyperparameter, indicating how to balance the tradeoff.

link here

Here is what that looks like in 1D. The open points are the support vectors, and for each class, we have one point on the wrong side of the decision boundary, requiring us to pay the residual p_i as a penalty.

So, the objective function has a penalty added to it, but without this penalty, we would not have been able to satisfy the constraints at all, since the two classes are not separable.

click image for animation
link here

However, even if the classes are linearly separable, it can be good to allow a little slack.

Here, the two points that would be the support vectors of the hard margin objective leave a very narrow margin. By allowing a little slack, we can get a much wider margin that provides a decision boundary that may be more likely to generalise to unseen data.

link here

So, now that we have made our objective precise, how do we find a good solution? We haven’t discussed constrained optimization much yet. It turns out, we don't necessarily need to use constrained optimization methods, although there is a benefit to using them. We'll look at both options.

link here

The first option allows us to use the old familiar gradient descent, without having to worry about constraints.

The other requires us to delve into constrained optimization, which we start to do in the next video. The payoff for that is that it opens the door to the kernel trick.

In the rest of this video, we will work out option one.

If you're in a hurry, and you just want to know the parts that are important for the course, you can skim the rest of this video and focus on the next two.

click image for animation
link here

To get rid of the constraints, let’s look at what we know about the value of p_i.

If the constraint for x_i is violated, we can see that p_i makes up the difference between what y_i(w^T x_i + b) should be (at least 1) and what it actually is.

If the constraint is not violated, p_i becomes zero, and the value we computed above becomes negative (or zero).

We can summarise these two conclusions in a single definition for p_i: it is 1 - y_i(w^T x_i + b) if the constraint is violated and 0 otherwise. This is equal to the value max(0, 1 - y_i(w^T x_i + b)) in both cases.

Since this value is always equal to p_i, we can replace p_i by it everywhere it occurs in the optimization objective.

click image for animation
link here

Doing this, we get a new objective function.

The new constraints are now always true. For the second one, this is easy to see, since the maximum of 0 and something is always larger than or equal to 0.

For the first, note that we worked out max(0, 1 - y_i(w^T x_i + b)) as how far below 1 the value y_i(w^T x_i + b) was. If we move it to the other side, we get

y_i(w^T x_i + b) + max(0, 1 - y_i(w^T x_i + b))

which must therefore be exactly equal to 1 if y_i(w^T x_i + b) is below 1, or larger if y_i(w^T x_i + b) is larger than 1.

Since the constraints are always true, we can remove them.

link here

This gives us an unconstrained loss function we can directly apply to any model. For instance when training a neural network to classify, this makes a solid alternative to logarithmic loss. This is sometimes called the L1-SVM (loss).

We can think of the first term as a regularizer. It doesn’t enforce anything about how well the plane should fit the data. It just ensures that the parameters of the plane don’t grow too big. We'll see more regularization in the next lecture, but this form, where we add the norm of the parameter vector to the loss function, is very common.

The highlighted part of the second term functions as a kind of error (just as we used in least squares classification, but without the square). It computes how far the model output y_i(w^T x_i + b) is away from the desired value (1).

However, unlike the least squares classifier, we only compute this error for points that are sufficiently close to the decision boundary. For any points far from the boundary (i.e. outside the margin), we do not compute any error at all. This is achieved by cutting off any negative values. If the data is linearly separable, we could easily shrink the margin enough to make the error zero for all points, but this usually requires a w with a very high norm, so then the regulariser kicks in and starts increasing so much that we prefer to allow some points inside the margin.

click image for animation
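Written as code, the whole unconstrained objective is just a few lines. This is a sketch: X holds the instances as rows, y the labels in {-1, +1}, and C is the slack trade-off hyperparameter from before; the exact scaling of the regularizer may differ from the slides' formulation.

    import numpy as np

    def l1_svm_loss(w, b, X, y, C):
        margins = y * (X @ w + b)               # y_i (w^T x_i + b) for every instance
        hinge = np.maximum(0.0, 1.0 - margins)  # error only for points on the wrong side of their margin
        return 0.5 * w @ w + C * hinge.sum()    # regularizer plus the slack penalties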
link here

And with that, we have discussed our final classification loss. Let’s review.

link here

Here are all our loss functions in one handy slide.

The error, also known as zero-one loss, is simply the number or proportion of misclassified examples. It’s usually what we’re interested in, but it doesn’t give us a loss surface that is suitable for searching.

The least-squares loss we introduced as a simple illustration of the principle of a proxy loss for the error. In practice it doesn't usually work very well, and is rarely used.

The log loss requires a sigmoid function to be added to the output of the linear function. It assumes that the result is the probability of the positive class applying to the instance, and it maximizes the log likelihood of the classes given the model parameters. Practically this boils down to minimizing the negative log likelihood of the correct class. This can also be derived from the cross-entropy between the true class distribution given by the data and the class distribution given by the model.

Finally, the soft margin SVM loss, which we've introduced today, attempts to maximize the margin between the positive and negative points. It's also known as a maximum margin loss, or the hinge loss (since the error is fed through a maximum function, which looks like a hinge).

link here
link here

link here

In the previous video we introduced the maximum margin loss objective. This was a constrained optimization problem which we hadn't learned how to solve yet. We sidestepped that issue by rewriting it into an unconstrained optimization problem, so that we could solve it with plain gradient descent.

In this video, we will learn a relatively simple trick for attacking constrained optimization problems: the method of Lagrange multipliers. In the next video, we will see what happens if we apply this method to the SVM objective function.

click image for animation
link here

So we start with optimization under constraints.

First let’s make it a little more intuitive what optimization under constraints looks like. Here, we have a simple constrained optimization problem. We are trying to find the lowest point on some surface, but there is a constraint that we also need to satisfy.

In this case, the constraint specifies that the solution must lie on the unit circle (that is, x and y together must make a unit vector). Within that set of points, we want to find the points x and y that result in the lowest value f(x, y).

This is what is called an equality constraint. We have some function of our parameters that needs to be exactly equal to some value for any solution that we will return.

link here

We can now draw the surface of f. In this case, f is a two-dimensional parabola. We see that if we ignore the constraint, the lowest point is somewhere towards the bottom right. If we move to that point, however, we violate the constraint.

To figure out how low we can get while satisfying the constraint, we project the constraint region (the unit circle) onto the function, giving us a deformed circle. The constrained optimization problem asks what the lowest point on this deformed circle is.

link here

The method of Lagrange multipliers is a popular way of dealing with these kinds of problems.

To give you an idea of where we're going, we will describe the recipe first, with no derivation or intuition. We'll just show you the steps you need to follow to arrive at an answer, and we'll walk through a few examples. Then, we will look at why this recipe actually works.

First, we need to rewrite the constraint so that it is equal to zero. This is easily done by just moving everything to the left hand side. We call the objective function (the one we want to minimize) f and the left hand side of this constraint g.

Then, we create a new function L, the Lagrangian. This function has the same parameters as f, plus an additional parameter α. It consists of the function f plus or minus the constraint function g times α. The end result will be the same whether we add or subtract g. The parameter α is called a Lagrange multiplier.

We now take the partial derivative of L with respect to all of its arguments (including α), i.e. we compute its gradient, and we set them all equal to 0. The resulting system of equations describes the solution to the constrained optimization problem. If we're lucky, that system of equations can be solved.

Again, there is no reason you should understand from this description why this should work at all. Let's first see the method in action, and then look at why it works.

click image for animation
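If you want to see the recipe end-to-end without doing the algebra by hand, here is a small sympy sketch. Note that the objective used here, f(x, y) = x + 2y, is a made-up example for illustration, not the function on the slides:

    import sympy as sp

    x, y, alpha = sp.symbols('x y alpha', real=True)

    f = x + 2*y                  # made-up objective, just for illustration
    g = x**2 + y**2 - 1          # the constraint, rewritten so that g(x, y) = 0

    L = f - alpha * g            # step 2: the Lagrangian

    # step 3: take all partial derivatives, set them to zero, and solve the system
    equations = [sp.Eq(sp.diff(L, var), 0) for var in (x, y, alpha)]
    solutions = sp.solve(equations, [x, y, alpha], dict=True)

    # evaluate f at each stationary point to see which is the minimum
    for sol in solutions:
        print(sol, ' f =', f.subs(sol))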
link here

This is our example problem with the constraint rewritten to be equal to 0.

The first thing we do is to define our L-function. This is a function in three variables: the x and y from f and the Lagrange multiplier α we've introduced.

Next we take the three partial derivatives of L with respect to its arguments. We set all of these equal to zero which gives us a system of equations. If we manage to solve this, we work out where the solutions to our problem are.

There's no exact recipe for how to work this out in terms of x and y. Here are a few tricks to look out for:

You can often rewrite the x and y equations to isolate the Lagrange multiplier on the left hand side and then set the right hand sides equal to each other. We'll use this trick in the next slide.

Often, the constraint is the simplest function. Rewrite this to isolate one of the variables, and then look for another equation where you can easily isolate a variable.

If you have access to Wolfram Alpha (e.g. if you're not doing an exam) it's a good idea to put the system of equations in, as well as the minimization as a whole, to see if the solution looks like one that is easy to solve by hand. Often, Alpha will give you a mess of squares and constants, suggesting that the analytic solution is not pretty, and you might as well solve it numerically. There is no guarantee that this method yields an easily solvable system of equations (unless you're doing a homework problem).

click image for animation
link here

We are left with the three equations on the top left. It's not always guaranteed that the Lagrangian method leads to a system of equations that can be neatly solved, but in this case it can. Finding such a solution isn't a purely mechanical process, we can't give you a set of steps that always works, but a good place to start is to rewrite the equations for the parameters to isolate the Lagrange multiplier α on the left hand side.

We then set the right hand sides equal to each other. Moving both denominators to the other sides, we see that the terms 2xy on both sides cancel out. And we are left with y=4x as a condition for our solution.

We now use the constraint x^2 + y^2 = 1 to finish the solution. We put 4x in place of y to give us an equation with only x's to solve, and we put (1/4)y in place of x to give us an equation with only y's to solve.

In both cases, the square means that we get a positive and a negative answer. This gives us four possible solutions. We can check the value of f(x, y) for each to see which is the minimum, or we can look at the plot. The latter option shows that the minimum has a positive x and a negative y, so that must be the solution.

click image for animation
link here

Before we see why this works, we'll look at one more example.

Imagine you are given two investment opportunities by a bank. You can use plan A or plan B. Both plans are guaranteed to make money after a year, but the more money you invest, the less you make proportionally (imagine there's a very strict tax system). The curves for the interest you get are shown top right.

Both are clearly profitable investments, but sadly you don't have any money. What you can do however, is act as a bank yourself. You offer one of the schemes to somebody else. You take their money, and incur a debt. After a year, you'll have to pay them back with the interest, but in the mean time, you can use their money in the other scheme. All we need to do is figure out a point where one scheme makes more money than the other.

We model this trick by saying you can invest positive or a negative amount in either of the schemes. In that case, the debt you incur is the negative investment, minus the interest. This is why the curve is mirrored for negative investments: you grow your debt by offering an investment to others.

We'll label your investments as a dollars in scheme A and b dollars in scheme B. Since you have no money of your own, our constraint is that a + b = 0. Everything you get from the person investing with you, you put into the other plan. We want to maximize the amount of money you make after a year, which is the sum of the investments and the interests.

From the plot, it's clear that plan B pays more than plan A everywhere (this is necessary for a clean solution), so you should probably use plan B yourself, and offer plan A to somebody else. But how much should you invest? The amount you make is proportional to how much you invest, but it also decays, so it's not as simple as just investing as much as you can.

click image for animation
link here

First, we write down our Lagrangian, which is simply the objective function, plus the constraint function times α.

We work out the partial derivatives and set them equal to zero. Remember that the derivative of the absolute function |x| is the sign function sign(x).

We find a solution, again, by isolating the Lagrange multipliers on the left hand side, and setting the right hand sides equal to one another. Then, the constraint tells us very simply that b should be equal to -a, so we fill that in to get an expression in terms of only a, which we can solve. This tells us that |a| = 1, which means we get solutions at a = 1 and a = -1. We can tell from the plot which is the minimum and which is the maximum.

click image for animation
link here

If we make the interest curves cross, the problem becomes a bit more interesting: which plan is better depends on how much we put in. We don't know beforehand which plan to choose and which to offer.

We can also attack this problem with the method of Lagrange multipliers. If you do this, you'll likely get stuck at the equation shown below. This tells us the solution, but it doesn't simplify in a straightforward way to a simple answer. This is often the case with more realistic problems.

Note that this doesn't make the method of Lagrange multipliers useless for such problems. We've still found a solution, we just can't express it better than this. We can easily solve this equation numerically, which would probably give us a much more accurate answer, more quickly than if we'd solved the constrained optimization problem by numerical means in its original form.

link here

This has hopefully given you a good sense of how the method works. If you trust us, you can now just apply it, and with a little common sense, you can usually find your way to a solution.

Still, we haven't discussed why this works. Let's see if we can add a little intuition. To illustrate, we'll return to the parabola we started with.

link here

One way to help us visualize what's happening is to draw contours for the function f. These are lines on our function where the output is the same value. For any given value k, we can highlight all the points where f(x, y) = k, resulting in a curved line on the surface of our function.

If we look down onto the xy plane from above, the z axis disappears, and we get a 2D plot, where the contour lines give us an idea of the height of the function. Note that the function f gets lower towards the top right corner.

click image for animation
link here

Each contour line indicates a constant output value for f. We can tell by this plot what we can achieve while sticking to the constraint.

The output value k1 is very low (the fourth lowest of the contour lines in this plot), so it would make a good solution, but it never meets the circle representing our constraints. That means that we can’t get the output as low as k1 and satisfy the constraints.

The next lowest contour we've drawn, with value k2, does give us a contour line that hits the circle representing our constraints. Therefore, we can satisfy our constraints and get at least as low as k2. However, the fact that it crosses the circle of our constraints means that we can also get lower than k2. This makes sense if you look at the plot: if we drew more contours, we could have one between k1 and k2 that hits the green circle. If we try to get lower and lower without leaving the circle, we see that we would probably end up with the contour that doesn't cross the circle, but just touches it at one point only. A bit like a tangent line, but curved.

So, for this picture we have three conclusions:

Any contour line that doesn't meet the constraint region represents a value that we cannot achieve while satisfying the constraints.

Any contour line that crosses the constraint region represents a value we can achieve, but that isn't the optimum.

Any contour line that just touches the constraint region is a possible optimum.

These certainly seem true here. We can use some basic calculus to show that this is true in general.

click image for animation
link here

We'll work out how these ideas look in hyperplanes, and then translate them to general functions. We can always approximate our functions locally with a hyperplane, so the translation should be simple.

If we have a hyperplane defined by w^T x + b, then we know that w is the direction of steepest ascent, and -w is the direction of steepest descent. This tells us that the direction orthogonal to the line of w is the direction in which the value of the plane doesn’t change: the direction of equal value. If we drew contours on a hyperplane, they would all be lines orthogonal to w.

This means that if we take any point on our function f and work out the tangent hyperplane of f at that point, that is, compute the gradient, the direction orthogonal to the gradient points along the contour line.

In this case, since our contour line crosses the circle of the constraints, the direction of equal value doesn’t point along the circle for the constraints, and we can conclude that the value of f decreases in one direction along the circle and increases in the other. Put simply, we are not at a minimum.

click image for animation
link here

By this logic, the only time we can be sure that there is no lower to go along the circle is when the direction of equal value points along the circle. In other words, when the contour line is tangent to the circle: when it touches it only at one point without crossing it.

How do we work out where this point is? By recognizing that the circle for our constraints is also a contour line: a contour of the function x^2 + y^2, for the constant value 1.

When the gradient of x^2 + y^2 points in the same or opposite direction as that of f, then so do the vectors orthogonal to them, which are the directions of equal value for f and for g respectively. And when that happens, we have a minimum or maximum for our objective.

click image for animation
link here

These are the two basic insights we've just discussed. We look at the contour lines of f and g, and note that the constraint region is just the contour line of g for the value 0.

At any point where they cross, we've shown there can't be a minimum. At any point where they don't touch at all, we're outside the constraint region. The only other option is that they are tangent: that is, they just touch.

To work out where two curves just touch, we note that the vectors that point along the curve must lie on the same line. These are the directions of equal value of f and g respectively, which are the vectors orthogonal to the gradients of f and g. So instead of looking for where the directions of equal value point in the same direction, we can just look where the gradients point in the same direction. This is something that we can write down symbolically.

click image for animation
link here

We are looking for gradients pointing in the same (or opposite) direction, but not necessarily of the same size. To state this formally, we say that there must be some value α, such that the gradient of f is equal to the gradient of g times α.

We rewrite to get something that must be equal to zero. By moving the gradient symbol out in front (the opposite of what we usually do with gradients), we see that what we’re looking for is the point where the gradient of some function is equal to zero. This function, of course, is the Lagrangian.

What we see here is that at the optimum, the derivative of the Lagrangian with respect to the parameters x is zero. This shows why we want to take the derivative of the Lagrangian wrt x, and set it to zero. It doesn't yet tell us why we also take the derivative with respect to α, and set that equal to zero as well.

click image for animation
link here

All we need to do now is to figure out what α is. What we've seen is that at the optimum, α is the ratio of the size of the gradient of f to the size of the gradient of g.

The recipe we've already seen just takes the derivative of L wrt α, the same as we do wrt x, and sets it equal to zero. It's not immediately intuitive that this is the right thing to do.

You may think that by setting ∂L/∂α = 0, we are choosing α to optimize the value of L. But L expresses the difference between f(x) and αg(x), not the difference between their gradients. There is no intuitive reason why we'd want f(x) and αg(x) to be close together in value; we only want their gradients to match. Our problem statement says that g(x) should be 0, and that the gradient of f should match that of αg.

What's more, according to the recipe, we could also have added αg(x) to f(x) instead of subtracting it, and got the same result. So why does setting ∂L/∂α = 0 give us the correct α?

One way to look at this is that this simply recovers the constraint. The multiplier α only appears in front of g(x) so taking the derivative w.r.t. α just isolates g(x) and sets it equal to zero. This also shows why we can add or subtract the Lagrange multiplier: -g(x) = 0 and g(x) = 0 have the same solution.

This should be enough to convince us that we are doing the right thing, but it's worth investigating what this function L actually looks like.

We can ask ourselves: if we fix x, and find the zero of ∂L/∂α this way, aren't we somehow optimizing L? This becomes even more mysterious when we realize that as a function of α, L is simply a 1D linear function (f(x) and g(x) are constant scalars if we take L to be a function of α). The maxima and minima of a linear function are off at positive and negative infinity respectively, so how can we be optimizing a linear function?

The answer is that when the constraint is satisfied, we know that g(x) = 0. This means that L = f(x) - αg(x) = f(x), and L becomes a constant function: a flat horizontal line.

In other words, if the constraint isn't satisfied, L = f(x) - αg(x) is a linear function of α, with no optima, so ∂L/∂α = 0 has no solutions. If and only if the constraint is satisfied does ∂L/∂α = 0 have a solution, so requiring that ∂L/∂α = 0 is the same as requiring that the constraint is satisfied.

link here

This is a slightly complex and subtle point to understand about the shape of the Lagrangian function. Where its gradient is 0, it forms a saddlepoint (a minimum in one direction and a maximum in another), but that's not necessarily because we are minimizing over x and maximizing over α. It's more correct to say that at the optimum for x, L as a function of α is constant (it has the same value for all α). When x is not at its optimum, L is a linear function of α.

On the left we've plotted L as a function of α at and near the optimal values of x and y for our example function. When we move x slightly away from the optimum, the function f(x) - αg(x) becomes some linear function of α. Only when g(x) = 0, do we get a constant function.

On the right, we see the same picture in 3D. We've fixed y at the optimal value and varied x and α. The lines from the plot on the left are highlighted. The solution to the problem is in the exact center of the plot. Note that we have a saddlepoint solution, but it doesn't necessarily have its minimum over x and its maximum over α.

This is also why we can't find the Lagrangian solution with gradient descent. Gradient descent can be used to find minima, but it doesn't settle on saddlepoints like these.

link here

And with that, we have the method of Lagrange multipliers.

We rewrite the problem so that the constraints are some function that needs to be equal to zero. Then we create a new function L, which consists of f with α times g subtracted (or added). For this new function, x and α are both parameters. Then, we solve for both x and α.

This new function L has an optimum where the original function is minimal within the constraints. The new optimum is a saddlepoint. This means we can’t find it easily by basic gradient descent; we have to set the gradient of L equal to zero, and solve analytically.
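To make this recipe concrete, here is a minimal sketch (not from the slides) that applies it to a hypothetical problem: minimizing f(x, y) = x + y on the unit circle. We build the Lagrangian symbolically, set its gradient to zero, and solve analytically.

```python
# A minimal sketch of the Lagrange multiplier recipe on a hypothetical problem:
# minimize f(x, y) = x + y subject to g(x, y) = x^2 + y^2 - 1 = 0.
import sympy as sp

x, y, a = sp.symbols('x y alpha', real=True)
f = x + y
g = x**2 + y**2 - 1
L = f - a * g                      # the Lagrangian

# Set all partial derivatives of L (wrt x, y and alpha) to zero and solve.
solutions = sp.solve([sp.diff(L, v) for v in (x, y, a)], (x, y, a), dict=True)
for s in solutions:
    print(s, 'f =', f.subs(s))
```

Of the two stationary points this returns, one is the constrained minimum and the other the constrained maximum; as noted above, we have to check by hand which is which, since the solution is a saddlepoint of L rather than a minimum.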

click image for animation
link here

To deepen our understanding, and to set up some things that are coming up, we can ask ourselves what happens when we have an inactive constraint. What if the global minimum is inside the constraint region, so the solution would be the same with and without the constraint? Ideally, the Lagrangian method should still work, and give us the global minimum.

In such a case, the gradient of f(x) will be the zero vector, since it's at a global minimum. The gradient of g(x) won't be the zero vector, since we're at a contour of g(x). Is this an optimum if the gradients aren't pointing in the same direction?



click image for animation
link here

They are, if we set α to 0. Then the term α∇g(x) reduces to the zero vector, and it is equal to ∇f(x), which is also the zero vector.

So why doesn't this always work, even when we have an active constraint? Why can't we always set α=0 and collapse the gradient of the constraint function, so that it's always pointing in all directions at once? The answer is that if we set α equal to zero, we are forcing ∇f(x) to be the zero vector, that is, forcing x to the global minimum of f, which normally lies outside the constraint region. If we then attempt to solve ∂L/∂α = 0, we will not find a solution.

link here

Finally, if we have multiple constraints, the method extends very naturally. With two constraints, we get three gradients: one for the objective function and two for the constraints. We want all three to be pointing in the same direction, so we add all gradients together, with a separate multiplier for each constraint. This sum should be equal to zero.

link here

Here's what that looks like for two constraints. We end up adding a term to the Lagrangian for each constraint, each with a new multiplier.

Note that if any of these constraints happens to be inactive, we will simply end up setting their multiplier to zero, and we will very naturally recover the problem with only the active constraints.

link here

Sometimes a problem is too complex to solve with the Lagrangian method. In such cases, you can often still use the method, but instead of solving the problem, you turn one optimization problem into another one. This second problem is called the dual problem of the first. Under the right conditions, the solution to the dual problem also gives you the solution to the original problem.

This is why the Lagrangian method is relevant to the subject of SVMs: we can't solve the SVM problem analytically, but we can rewrite it into a different problem.

Here, we illustrate the principle in its most basic form, on a very simple problem. To see how it works in detail, and most importantly when it does and doesn't work, you'll have to watch the sixth video.

link here

Here's our start and end points. This problem is very easy to solve explicitly of course, but we'll show you how to translate it to give you a sense of the principle. That way, when we make this step with SVMs, you'll hopefully understand the basic idea of what's happening, even if you skip the full derivation.

Note that in the dual problem, the x and y have disappeared and been replaced with α's. These are the Lagrange multipliers: the basic idea is that we set up the Lagrangian, set its derivative equal to zero, and then rewrite everything in terms of the Lagrange multipliers, getting rid of all the original variables.

link here

The first step is the same as before: we set up the Lagrangian, and set its derivative equal to zero.

We then deviate from the standard approach by rewriting these equations to isolate x and y on the left-hand side. We express both as equations of α only (note that we need to be a bit lucky with our problem to be able to do this).

The derivative with respect to α, we don't fill in. We will hold on to this, and use it in a different way.

click image for animation
link here

What we can now do is fill these back into the original Lagrangian. Whatever x and y are at the optimum, the Lagrangian should take this value in terms of the Lagrange multipliers α.

Now, we require a bit of mental gymnastics. We still have the unused bit of knowledge that at the optimum, the derivative of the Lagrangian should be equal to zero. That's still true of this Lagrangian. In this case, we know how to work that out explicitly, but imagine that this was too complicated to do either because the function is too complex, or because there are more constraints active that make things complicated.

Another route we can take is to recognize that the equation ∂L/∂α = 0 describes an optimum of L. We saw earlier that in the 3D space of (x, y, α), the point where ∇L is 0 is a saddlepoint. It turns out that if we rewrite L like this, expressed in terms of only α, and we get a bit lucky, the optimum corresponds to a minimum or a maximum in α. In this case, L is a second-order polynomial in α, so it has exactly one minimum or maximum.

click image for animation
link here

Here's the trick in a nutshell. We rewrite the Lagrangian to express it in terms of α. We are assuming the conditions that hold at the optimum, so this form only holds for the optimal x and y. Then we add the assumption that ∂L/∂α = 0, and we treat this as an optimization objective.

We are essentially doing the opposite of what we normally do. We normally start with an optimization objective and set the function's derivative equal to zero. Here we work out a function, assume its derivative is equal to zero, and suppose that this corresponds to the solution of an optimization objective.

At this point, we don't know whether we'll get a maximum or a minimum, or even a plateau or a saddlepoint. We'll just have to check by hand and hope for the best. In this case, it turns out we get a rather neat maximum, but then this was a particularly simple problem.

link here

And with that, we have our dual problem. A different optimization problem, that we can use to solve the first. We just optimize for α, and use the relations we worked out earlier to translate the optimal α back to the optimal x and y.

This is all a bit handwavy, and if you work out a dual problem in this way, you should always keep your eyes wide open and double check that everything works out as you'd hoped. None of this is guaranteed to work, if you do it like this.

If you want a more grounded and formal approach, we need to work out the dual problem slightly differently. This is a bit too much for a BSc level course, but we've included the basics in the sixth video for the sake of completeness.

click image for animation
link here

Equality constraints are relatively rare. It's more often the case that you'll run into an inequality constraint: some quantity that is allowed to be equal to or larger than 0, for instance. In such cases, the constraint region becomes a filled-in area in which the solution is allowed to lie.

Optimization with inequality constraints is not part of the exam, but it is necessary to derive SVMs. If you're not interested in the details, just remember that it's basically the same approach, except we need a little extra administration. If you want to know the details, you can check out part 6 of this lecture.

link here

In the next video, we will return to our constrained optimization objective and apply the KKT method to work out the Lagrangian dual. As we will see, this will allow us to get rid of all parameters except the KKT multipliers.

link here
link here

Errata: in the video, the optimization objective for the dual is a minimization objective when it should be a maximization objective. In the notes below, we take the negative of this objective.

link here

Here is the original optimization objective again, before we started rewriting. We will use the method of Lagrange multipliers to rewrite this objective to its dual problem.

link here

First, we rewrite the objective function and the constraints a little to make things easier down the line. We turn the norm of w into the dot product of w with itself. This just removes the square root, which doesn't change the location of the minimum.

In the constraints, we move everything to the right, so that all constraints are "greater than or equal to 0."

link here

As we announced already, this view of the SVM follows from working out the dual problem of the soft margin SVM problem.

We've seen this done for a simple problem already: (1) we work out the Lagrangian and set its partial derivatives equal to zero, (2) we use these equations to rewrite the Lagrangian, eliminating all variables except the Lagrange multipliers, (3) we cast the solution back to an optimization problem, optimizing only over the multipliers.

link here

Here, we will skip a step. This derivation is simply too long and complicated for a BSc course. We will just show you the optimization problem at the top, and tell you that if you set up the Lagrangian and work out the dual problem, and do a little rewriting, you end up with the objective at the bottom.


There's an optional sixth video, for if you really want or need to know how this works, but if you don't, you can take my word for it: these two problems lead to the same solution, which is the maximum margin hyperplane.

link here

You may need to stare at the dual problem a little bit to see what you are looking at.


Note the following:

The α's are the Lagrange multipliers that were introduced for the first constraint of each point. That is, we are assigning each instance in the data a number αi. Remember from the previous video that at the optimum, these are 0 if the constraint is inactive, and can only be nonzero if the constraint is active.

The α's are the only parameters of the problem. The x's and y's are simply values from the data.

The second constraint also received a multiplier, βi, but this was removed from the optimization in rewriting.

The first problem sums once over the dataset. The second sums twice, with indices i and j. This means we are essentially looking at two nested loops, looking at all pairs of instances over the data.

For each pair of instances, we compute the product of their multipliers αiαj, the product of their labels yiyj, and their dot product. Summing these all up, we get the quantity that we want to minimize.

Because we started with a problem with inequality constraints, we don't end up with a problem without constraints. Instead we get a problem with constraints over the multipliers.

There is also a kind of penalty term keeping the sum of all alphas down.

The slack parameter C now functions to keep the alphas in a fixed range.

The final line says that the sum of all the alphas on the positive examples must equal the sum of all the alphas on the negative examples.


Finally, note that unlike in the Lagrangian examples, we haven't ended up with anything we can solve analytically. We've just turned one constrained optimization problem into another one. We'll still need a solver that can handle constrained optimization problems. We won't go into the details, but the SMO algorithm is a popular choice for SVMs.
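To make the shape of the dual concrete, here is a small sketch (assuming the standard form of the soft-margin dual) that just evaluates the objective and checks the constraints for a given vector of multipliers. An actual solver, such as SMO, would search for the alphas that minimize this.

```python
# A minimal sketch (assuming the standard form of the SVM dual): evaluate the
# dual objective for multipliers alpha, labels y in {-1, +1}, and data matrix X.
import numpy as np

def dual_objective(alpha, X, y):
    K = X @ X.T                       # matrix of all pairwise dot products
    pair_term = 0.5 * np.sum(np.outer(alpha * y, alpha * y) * K)
    return pair_term - np.sum(alpha)  # second term pushes the alphas up

def dual_feasible(alpha, y, C):
    # the constraints: 0 <= alpha_i <= C, and the alphas on the positive class
    # balance those on the negative class
    return np.all(alpha >= 0) and np.all(alpha <= C) and np.isclose(alpha @ y, 0)
```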

link here

Of course the solution means nothing to us in terms of the Lagrange multipliers, since these are variables that we introduced ourselves. Once we've found the optimal multipliers, we need to translate them back to a form that allows us to make classifications.

The simplest thing to do is to translate them back to the hyperplane parameters w and b. As we saw in the previous video, the relation between the multipliers and the parameters of the original problem usually emerges from setting the Lagrangian derivative equal to zero. From this, we see that the vector w is a weighted sum over the support vectors, each multiplied by its label.

This makes sense if you remember that w is the direction in which the hyperplane ascends the quickest. That is, it's the direction in which our model thinks the points become most likely to be positive. In this sum, we are adding together all the positive support vectors, weighted by their Lagrange multipliers, and subtracting the same sum for the negative support vectors.
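In code, this relation is a one-liner. A minimal sketch (hypothetical names):

```python
# Sketch: recover the hyperplane direction w from the optimal multipliers,
# as the weighted sum of the support vectors: w = sum_i alpha_i y_i x_i.
import numpy as np

def recover_w(alpha, X, y):
    # only the support vectors (alpha_i > 0) contribute to the sum
    return (alpha * y) @ X
```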

click image for animation
link here

Here's a visualization of how the different Lagrange multipliers combine to define w. We have two support vectors for the negative class, and one for the positive class. The weights for both classes need to sum to the same value (the second constraint in the dual problem), so the weights for the negative vectors need to be half that of the weight for the positive vector.

The second term in the objective function tells us that we'd like the multipliers to be as big as possible, and the first constraint suggests that the largest multiplier can be no bigger than C. Assuming we've set C=1, we get the multiplier values shown here.

The relation in the previous slide now tells us that at the optimum, w is the weighted sum of all support vectors, with the negative ones subtracted and the positive ones added.

If we scale the support vectors by the multipliers, we can draw a simple vector addition to show how we arrive at w.

click image for animation
link here

Once we have our solution in terms of the Lagrange multipliers, we need to use them somehow to work out what class to assign to a new point that we haven't seen before.

The first option is simply to compute w from the Lagrange multipliers and use w and b as you normally do in a linear classifier. However, this doesn't work with the method coming up. There, we never want to compute w explicitly because it might be too big. Instead, we can take the definition of w in terms of the multipliers, and fill it into our classification objective.

What we see is that we can classify the new point by computing a weighted sum over the dot products of the new instance xnew with all instances in the data. Or rather, with all support vectors, since the multipliers are zero for the non-support vectors.
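Here is a sketch of that classification rule, using only dot products with the support vectors; how the bias b is recovered from the multipliers is not shown here.

```python
# Sketch: classify a new point using only dot products with the support
# vectors, so that the dot product can later be replaced by a kernel function.
import numpy as np

def classify(x_new, alpha, X, y, b):
    sv = alpha > 0                                      # only support vectors matter
    score = np.sum(alpha[sv] * y[sv] * (X[sv] @ x_new)) + b
    return np.sign(score)
```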

click image for animation
link here

So why did we do all this if we still need to search for a solution? We had a version that worked with gradient descent, and now we have a version that requires constrained optimization. What have we gained?

The main results here are twofold:

First, notice that the hyperplane parameters w and b have disappeared entirely from the objective and its constraints. The only parameters that remain are one αi per instance i in our data, and the hyperparameter C. The alphas function as an encoding of the support vectors: any instance for which the corresponding alpha is not zero is a support vector. Remember that nonzero Lagrange multipliers correspond to active constraints. Only the constraints for the support vectors are active.

Second, note that the algorithm only operates on the dot products of pairs of instances. In other words, if you didn’t have access to the data, but I did give you the full matrix of all dot products of all pairs of instances, you would still be able to find the optimal support vectors. This allows us to use a very special trick.

link here

What if I didn’t give you the actual dot products, but instead gave you a different matrix of values that behaved like a matrix of dot products?

A kernel function is a function of two vectors that behaves like a dot product, but in a higher dimensional feature space. This will take a bit of effort to wrap your head around, so we'll start at the beginning.

link here

Remember, by adding features that are derived from the original features, we can make linear models more powerful. If the number of features we add grows very quickly (like if we add all 5-way cross products), this can become a little expensive (both memory- and time-wise).

The kernel trick is basically a souped-up version of this idea.

It turns out that for some feature expansions, we can compute the dot product between two instances in the expanded features space without explicitly computing all expanded features.

link here

Let’s look at an example. The simplest way we saw to extend the feature space was to add all cross products. This turns a 2D dataset into a 5D dataset. Let's see if we can do this, or something similar, without computing the 5D vectors.

link here

Here are two 2D feature vectors. What if, instead of computing their dot product, we computed the square of their dot product?

It turns out that this is equal to the dot product of two other 3D vectors a’ and b’.

click image for animation
link here

The square of the dot product in the 2D feature space is equivalent to the regular dot product in a 3D feature space. The new features in this 3D space can all be derived from the original features. They're the three cross products, with a small multiplier (√2) on the a1a2 cross product.
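We can check this numerically. A small sketch, using the feature map just described (the squares plus the mixed cross product with a √2 multiplier):

```python
# Numerical check: the squared dot product of two 2D vectors equals an
# ordinary dot product in a 3D space of cross-product features.
import numpy as np

def phi(v):
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((a @ b)**2)       # kernel: square of the 2D dot product
print(phi(a) @ phi(b))  # explicit dot product in the 3D space; same value
```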

click image for animation
link here

That is, this kernel function k doesn't compute the dot product between two instances, but it does compute the dot product in a feature space of expanded features. We could do this already, but before we had to actually compute the new features. Now, all we have to do is compute the dot product in the original feature space and square it.

click image for animation
link here

Since the solution to the SVM is expressed purely in terms of the dot product, we can replace the dot product with this kernel function. We are now fitting a line in a higher-dimensional space, without computing any extra features explicitly.

Note that this only works because we rewrote the optimization objective to get rid of w and b. Since w and b have the same dimensionality as the features, keeping them in means using explicit features.

Saving the trouble of computing a few extra features may not sound like a big saving, but by choosing our kernel function cleverly we can push things a lot further.

link here

For some expansions to a higher-dimensional feature space, we can compute the dot product between two vectors without explicitly expanding the features. Such a function is called a kernel function.

There are many functions that compute the dot product of two vectors in a highly expanded feature space, but don’t actually require you to expand the features.

link here

Taking just the square of the dot product, as we did in our example, we lose the original features. If we take the square of the dot product plus one, it turns out that we retain the original features, and get all cross products.

If we increase the exponent to d we get all d-way cross products. Here we can see the benefit of the kernel trick. Imagine setting d=10 for a dataset with a modest 10 features. Expanding all 10-way cross-products of all features would give each instance 10 trillion expanded features. We wouldn't even be able to fit one instance into memory.

However, if we use the kernel trick, all we need to do is to compute the dot product in the original feature space, add a 1, and raise it to the power of 10.

link here

If ten trillion expanded features sounded like a lot, here is a kernel that corresponds to an infinite-dimensional expanded feature space. We can only approximate this kernel with a finite number of expanded features, getting closer as we add more. Nevertheless, the kernel function itself is very simple to compute.

Gamma is another hyperparameter.

Because this is such a powerful kernel, it is prone to overfitting.
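As a sketch, assuming the usual form of the RBF kernel, exp(-γ‖a - b‖²):

```python
# Sketch of the RBF kernel; gamma is a hyperparameter controlling how quickly
# the similarity between two points falls off with their distance.
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b)**2))
```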

click image for animation
link here

Here’s a plot for the dataset from the first lecture. As you can tell, the RBF kernel massively overfits for these hyperparameters, but it does give us a very nonlinear fit.

link here

One of the most interesting application areas of kernel methods is settings where you can turn a distance metric on your data directly into a kernel, without first extracting any features at all.

For instance, for an email classifier, you don't need to extract word frequencies, as we’ve done so far; you can just design a kernel that operates directly on strings (usually based on the edit distance). Put simply, the fewer operations we need to turn one email into another, the closer we put them together. If you make sure that such a function behaves like a dot product, you can stick it in the SVM optimizer as a kernel. You never need to deal with any features at all. Just the raw data, and their dot products in some feature space that you never compute.

If you’re classifying graphs, there are methods like the Weisfeiler-Lehman algorithm that you can use to define kernels.
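In sklearn, this idea corresponds to a precomputed kernel: you hand the optimizer the matrix of kernel values instead of a feature matrix. A minimal sketch, where my_kernel is a hypothetical similarity function over raw objects (strings, graphs, and so on):

```python
# Sketch: using a kernel over raw objects with sklearn's precomputed-kernel
# interface; my_kernel is a hypothetical function that behaves like a dot product.
import numpy as np
from sklearn.svm import SVC

def kernel_matrix(objects_a, objects_b, my_kernel):
    return np.array([[my_kernel(a, b) for b in objects_b] for a in objects_a])

# K_train = kernel_matrix(train_objects, train_objects, my_kernel)
# clf = SVC(kernel='precomputed').fit(K_train, train_labels)
# K_test = kernel_matrix(test_objects, train_objects, my_kernel)
# predictions = clf.predict(K_test)
```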

link here

Kernel SVMs are complicated beasts to understand, but they're easy to use with the right libraries. Here's a simple recipe for fitting an SVM with an RBF kernel in sklearn.
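Here is what such a recipe might look like; the dataset and hyperparameter values are purely illustrative:

```python
# A minimal sketch: fit an SVM with an RBF kernel on a toy nonlinear dataset.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # toy data
clf = SVC(kernel='rbf', C=1.0, gamma=2.0)                    # illustrative hyperparameters
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy; use a held-out set in practice
```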

link here

Neural nets require a lot of passes over the data, so it takes a big dataset before kN becomes smaller than N², but eventually, we got there. At that point, it became more efficient to train models by gradient descent, and the kernel trick lost its luster.

link here

And when neural networks did come back, they caused a revolution. That’s where we’ll pick things up next lecture.

link here

To make the story complete, we need to know two things that we've skipped over. How to solve problems with inequality constraints, and how to use this method to work out the dual problem for the SVM.

These are explicitly not exam material. We've separated them into this video so that you can watch them if you need the whole story, or if you want to get a sense of what the missing steps look like, but you are entirely free to skip this video.

click image for animation
link here

To start with, let's look at the details of how to handle inequality constraints. For instance, if you want your solution to lie anywhere within the unit circle, instead of on the unit circle.

This method, called the method of KKT multipliers is necessary to understand how we derive the kernel trick in the next video. It won't, however, be an exam or homework question, so you're free to skim the rest of this video if you've reached your limit of math.

link here

Lagrange multipliers work great, and are very useful, but so far we've only seen what to do if the constraint is an equality: if some quantity needs to stay exactly equal to some other quantity. It's more often the case that we have an inequality constraint: for instance, the amount of money we spend needs to stay within our budget.

If the constraint in our problem is not an equality constraint, but an inequality constraint, the same method applies, but we need to keep a few more things in mind.

Here is an example. This time we are not looking for a solution on the unit circle, we are looking for the lowest point anywhere outside the unit circle.

This means that our constraint is inactive. The simplest approach for a particular problem is just to check manually whether the constraint is active. If it isn't, you can just solve the unconstrained problem, and if it is, the solution must be on the boundary of the constraint region, so the problem basically reduces to the standard Lagrangian method.

click image for animation
link here

For this problem, if we search only inside the unit circle, the constraint is active. It stops us from going where we want to go, and we end up on the boundary, just like we would if the constraint were an equality constraint. This means that if we know that we have an active constraint, our solution should coincide with the equivalent problem with an equality constraint. For this reason, we can use almost the same approach. We just have to set it up a little bit more carefully, so that we restrict the allowed solutions a bit more.

We first set the convention that all constraints are rewritten to be “greater than” inequalities, with zero on the right-hand side. This doesn’t change the region we’re constrained to, but note that the function on the left of the inequality sign had a “bowl” shape before, and now has a “hill” shape. In other words, the gradients of this function now point in the opposite direction.

The drawings indicate the 1D equivalent. The places where the two functions intersect (the boundary of our constraint region) are the same, but the constraint function is flipped around. This means that its gradient (the direction of steepest ascent) now points in the opposite direction.

We now know two things: the inequality is always a "greater than" inequality (by convention), and the constraint function is always a "hill" shape and never a "bowl" shape (if it were a bowl shape with a greater than constraint, the constraint would be inactive in this case).

click image for animation
link here

If we are minimizing, we need to make sure that the gradient points into the constrained region, so that the direction of steepest descent points outside. If the direction of steepest descent pointed into the region, we could find a lower point somewhere inside, away from the boundary. Since the gradient of the constraint function points into the region, we need to make sure that the gradient of the objective function points in the same direction.

If we are maximizing, by the same logic, we need to make sure that the gradients point in opposite directions. We want the direction of steepest ascent to point outside the constraint region.

Contrast this with the case of equality constraints. There, we just needed to make sure that the gradients were on the same line, either pointing in the same direction or in opposite directions. Since the constraint was a 1D curve, the gradients and negative gradients always point outside of the constraint region. Now, we need to be a bit more careful. Since we tend to minimize in machine learning, we'll show that version in detail.

link here

This makes the inequality version a little more complicated than the version with an equality constraint: we again set the gradient of the objective function equal to that of the constraint, again with an α to account for the difference in size between the two gradients, but this time around, we need to make sure that α remains positive, since a negative α would cause the gradient of the constraint to point in the wrong direction.

Even though we’ve not removed the constraint, we’ve simplified it a lot: it is now a linear function, even a constant one, instead of a nonlinear function. Linear constraints are much easier to handle, for instance using methods like linear programming, or gradient descent with projection. If you're lucky, you may even be able to solve it analytically still.

click image for animation
link here

All this only works if we check manually whether a constraint is active. Sometimes this isn't practical: we may have too many constraints, or we may want to work out a solution independent of the specifics. For instance, in the SVM problem, we can only check which constraints are active once we know the data. If we want to work out a solution that holds for any dataset (with the data represented by uninstantiated variables), we can't check manually which constraints are active.

Instead, we can work the activity checking into the optimization problem using a condition called complementary slackness.

Remember what we saw for the Lagrangian case: if the constraint is inactive, we can simply set the multiplier to 0 and the problem reduces to the unconstrained problem.

However, if we don't set the multiplier to zero, we need to make sure that the constraint is active, and stays on the boundary. We can achieve this by requiring that g(x) is exactly 0 rather than larger than or equal to zero.

In short, either α is exactly zero, or g(x) is.

link here

We can summarize this requirement by saying that the product of the multiplier and the constraint function should be exactly zero. This condition is called complementary slackness.

If we allow the solution to move away from the constraint boundary to the interior of the constraint region, g(x) will become nonzero (because the boundary is where it is zero), so the α should be zero to satisfy complementary slackness. This will effectively remove the g term from the Lagrangian, forcing us to find the global minimum of f.

If x is on the boundary, so that g(x) is equal to zero, we allow α to be nonzero. This means that the g term in the Lagrangian will be active, and we don't need to find the global minimum of f, where its gradient is zero, only the constrained minimum, where its gradient is equal to α∇g(x).

link here

Here's what the general solution looks like. We start with an optimization objective. We construct a Lagrangian-like function as before, but this time, we don't require that its whole gradient is equal to zero, just the derivatives with respect to the original parameters. In other words, we don't require that the derivative with respect to the multiplier is zero.

This is because the constraint may not be active: in that case the multiplier itself is zero and the gradient can take any value.

This equation may have many solutions, not all of which will be solutions to the optimization problem. The Karush-Kuhn-Tucker (KKT) conditions then tell us which of these solutions also solve the optimization problem.

It may seem a little counter-intuitive that this is actually a step forward. We had a simple minimization objective with a single constraint, and now we have to solve an equation under several constraints, including the original. Are we really better off? There are two answers. In some rare cases, you can actually solve the problem analytically. We'll see an example of that next. In other cases, you can use the KKT conditions to formulate a dual problem. We'll dig into that after the example.

link here

If we have multiple inequality constraints, we just repeat the procedure with fresh multipliers for each. Each constraint gets its own set of three KKT conditions.

link here

We can also mix equality and inequality constraints. In this case, we just treat the equality constraint the same as we did before. We set the KKT conditions for the inequality constraint(s) and for the equality constraint, we only set the condition that the original equality is true. As we saw before, this is equivalent to constructing the Lagrangian and requiring that its derivative with respect to the multiplier of the equality constraint is zero.

click image for animation
link here

Here's an example of when you can solve a KKT problem analytically.

When we introduced entropy, we noted that the cross-entropy of a distribution with itself was the lowest that the cross-entropy could get. The implication was that the average codelength is minimized if we choose a code that corresponds to the source distribution of the elements we are trying to transmit.

For a finite space of outcomes, we can now prove this with the Lagrangian method. Here's how we set up the problem. We will assume that we have n outcomes over which the probabilities are defined. The source the outcomes are drawn from is called p, and it assigns probabilities p1 through pn. We encode messages using a code corresponding to a distribution q, which assigns probabilities q1 through qn, and thus uses codewords of lengths -log q1 through -log qn.

The expected codelength under this scheme is the cross-entropy between p and q. What we want to show is that setting q=p will minimize the expected codelength.

link here

Here's how that looks as a minimization problem. We want to find the n values for qi for which the cross-entropy is minimized. The values of pi are given (we treat them as constants).

The constraints essentially state that the qi values put together are probabilities. They should be non-negative and they should collectively sum to 1. This gives us n inequality constraints and one equality constraint.

The only thing we need to do to put the constraint into the correct form is to move the 1 to the left hand side.

click image for animation
link here

Here is the Lagrangian we get. Note that it has 2n + 1 parameters: the original n parameters qi, the multipliers of the inequalities αi and one more for the multiplier of the equality β.

The additional constraints are only on the multipliers for the inequalities. They tell us that both the αi multipliers and the qi parameters should be non-negative, and that for each i, at least one of αi and qi should be zero.

Remember that we won't be working out the whole gradient of the Lagrangian, only the gradient with respect to the original parameters and the equality constraints.

link here

We start by working out the relevant partial derivatives of the Lagrangian and setting them equal to zero.

On the left we have the equations resulting from this: n equations for the original parameters, and one for the equality constraint. Again, for the multipliers corresponding to inequality constraints, we don't set the derivative equal to zero: for those we rely on the KKT conditions instead.

At this point, there's no standard, mechanical way to proceed. We need to look at what these equations and inequalities are telling us. If we're lucky, there's a way to solve the problem, and if we're even luckier, we're clever enough to find it.

We can start with the following observations:

The parameters qi must be non-negative and sum to one. This means that they can't all be zero.

For those that are nonzero, αi must be zero, due to complementary slackness. So for the nonzero qi, we have pi/qi = - β and thus qi = - pi/β.

All nonzero qi should sum to one, so we have -(1/β)Σpi = 1 (with the sum over those i's corresponding to nonzero qi)

At this point, we run into a snag. We need to know something about those qi's that are zero, but in that case, the first term of the first equation, -pi/qi, becomes undefined. The main thing to note is that this is a problem with our domain, not with the Lagrangian/KKT method. Note that in the cross-entropy, we have a factor log qi which goes to negative infinity as qi goes to zero. The cross-entropy isn't really properly defined for zero probabilities, so it's no surprise that we run into difficulty with the Lagrangian/KKT method for zero probabilities.

The way we deal with this in entropy terms is to say that the length of the codeword for i, -log qi, goes to infinity as the probability goes to zero. If we are absolutely sure that the outcome i won't happen, we can assign it an "infinitely long codeword" by setting qi=0 (allowing us to make the other codewords shorter, by giving them more probability mass). We do, however, have to be absolutely sure that i will never happen, that is, that pi is also 0. If pi is anywhere above 0, no matter how small, and qi=0, the expected codelength (the cross-entropy) becomes infinite, because there is a non-zero probability that we'll need to transmit an infinitely long codeword.

The conclusion of all of this is that if any of our qi are 0 and the corresponding pi aren't, the objective function is infinite and we are as far from our minimum as we can be. Thus, we may conclude that if qi is zero at an optimum, then pi is also zero. Conversely, if pi is zero and qi isn't, we know we can't be at an optimum, because we could move probability mass away from qi to outcomes that have nonzero probability.

All this should convince you that at the optimum, the nonzero qi's correspond exactly to the nonzero pi's. This means that -(1/β)Σpi = 1 = Σpi, so β must be -1, and we have qi = pi for all i.
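A quick numerical sanity check of this result (not a proof): for a fixed p, no randomly chosen q achieves a lower cross-entropy than q = p.

```python
# Sanity check: the cross-entropy H(p, q) = -sum_i p_i log q_i is minimized by q = p.
import numpy as np

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

p = np.array([0.5, 0.3, 0.2])
rng = np.random.default_rng(0)
for _ in range(5):
    q = rng.dirichlet(np.ones(3))    # a random alternative distribution
    assert cross_entropy(p, p) <= cross_entropy(p, q)
print(cross_entropy(p, p))           # equals the entropy of p
```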

link here

The other approach we saw in the earlier video, is not to solve the optimization problem, but to turn it into a different problem: the so called dual problem of the first. We were a bit handwavy in our first explanation, skipping a lot of the details, and being vague about when it works and why. Now, in the context of optimization under inequality constraints, we can be more clear about exactly what we're doing.

link here

We'll explain this first in the context of a basic minimization problem with a single inequality constraint. The approach of dual problems makes the most sense in the context of inequality constraints, so now that we have the machinery for this, we can give it a proper treatment.

We know that for any x that satisfies the constraint, and any non-negative α, the term αg(x) must be non-negative. This means that for any such x, the Lagrangian at x is always less than or equal to the objective function f at x.


click image for animation
link here

For all nonnegative α, and all x satisfying the constraint, this tells us that the Lagrangian is less than or equal to the objective function. This includes the x for which L is minimal (under the constraints).

Since this is always true, regardless of our value of α, we can now choose α to maximize this function.

click image for animation
link here

Here's a visualization. We are looking for the minimum of the red line f(x), within the region where the green line g(x) is larger than 0.

If we construct L and pick some positive value α, we get a line that, within the constraint region, lies below f(x). If we keep x fixed and change α to α' in such a way that L(x, α') is bigger than L(x, α), we move the minimum closer to the optimal value.

click image for animation
link here

Here are the proper definitions of the primal and dual problem. "Primal" is just the opposite of dual: the primal is the problem we started out with.

What we have done is to remove x as a variable, by setting it to the value it has at the minimum within the constraint region. This leaves the Lagrange multipliers as the only free variables to maximize over.

link here

We have not shown that the value we get for the dual problem is actually the same as the one we get for the primal. Only that it is less than or equal. This is called weak duality: the dual serves as a lower bound for the primal. This is always true.

If they are exactly equal, we say that strong duality holds. It can be shown that this is true if and only if all three KKT conditions hold for the solution. Two of them are already included in the dual problem. We're just missing complementary slackness. We can either define the problem with complementary slackness added, or we can figure out the dual without complementary slackness, and then check whether it holds for the solution.

link here

Before we use these principles to work out the dual for the SVM problem, let's see it in action on a slightly simpler problem. We'll use the problem from the fourth part, but with an inequality constraint instead of an equality constraint.

The constraint is active, so the solution will be the same as before, but we'll take you through the formal steps of formulating the dual problem.

link here

As before, we set up the Lagrangian, but this time, we start by setting up the dual function f', which has α as an argument and contains a minimization over x and y.

We then maximize this dual function subject to the KKT conditions. To make this practical, we need to rewrite the dual function to eliminate all references to x and y. We do this by making the assumption that x and y are at the optimum: the derivative of the Lagrangian is zero for them.

link here

We worked this out already in slide 100. Under this assumption, we can express x and y in terms of the value of alpha.

click image for animation
link here

As we saw before, filling in these relations gives us an objective function that is a simple parabola in α. The parabola has its maximum at α=1, so we get weak duality at that value. Do we also get strong duality? We should check the KKT conditions to find out.

The first says that α should be nonnegative, and the second says that it should be larger than or equal to 1. We can ignore the first, but either way our solution satisfies them both.

The complementary slackness is an equation with two solutions α=0 and α=1. The second corresponds to our solution, so we do indeed have strong duality.

Filling α=1 into the equations we derived before will tell us where in the x, y plane we will find this solution.

link here

We are now ready to begin our attack on the SVM objective. This is a much more complicated beast than the problems we've seen so far, but so long as we stick to the plan, and work step by step, we should be fine.

link here

Here are the start and end points of our journey. The objective at the bottom is the dual problem of the one at the top.

click image for animation
link here

First, we define our Lagrangian. We introduce two sets of multipliers: α's for the first type of constraint, and β's for the second type of constraint. If our dataset has n instances, we add n α's and n β's.

click image for animation
link here

Next, we'll rewrite the Lagrangian a little bit to isolate the terms we will be taking the derivative over. Remember, we'll only do this for the parameters of the original problem: w, b and pi.

In this form, the derivatives with respect to these variables should be straightforward to work out.

click image for animation
link here

That makes this our dual function. Before we set up the dual problem, we can rewrite this by finding the minimum, expressing each variable in terms of the multipliers and filling them in to this function.

link here

So, let's take the derivative with respect to the parameters, and set them equal to zero. We'll collect our findings in the box on the right.

We haven’t discussed taking derivatives with respect to vectors, but here we’ll just use two rules that are analogous to the way we multiply scalars.

The derivative of wᵀw with respect to w is 2 times w. This is analogous to the derivative of the square for scalars.

The derivative of w times some constant vector (wrt w) is just that constant vector. This is similar to the constant multiplier rule for scalars.

This gives us an expression for w at the optimum, in terms of alpha, y and x.

link here

If we take the derivative with respect to b, we find a simple constraint: that at the optimum, the sum of all α values, multiplied by their corresponding y’s, should be zero.

link here

Finally, we take the derivative for pi and set that equal to zero.

The result essentially tells us that for any given instance i, the alpha plus the beta must equal C.

If we assume that alpha is between 0 and C, then we can just take beta to be the remainder.

click image for animation
link here

This (in the orange box) is what we have figured out so far about our function at the optimum.

If we fill in the three equalities, our function simplifies a lot. This function describes the optimum, subject to the constraints on the right. These are constraints on variables that appear in our final form, so we need to remember them.

We've eliminated almost all original variables, except w. We have a good expression for w in terms of the α's, so we can just fill this in, and rewrite.

click image for animation
link here

We first replace one of the sums in the first line with w. This is going in the wrong direction, but it allows us to reduce the number of w's in the equation.

Then, we replace the final two occurrences of w, with the sum. This gives us a function purely in terms of alpha, with no reference to w or b. All we need to do is simplify it a little bit.

Note that the square brackets here are just brackets, they have no special meaning.

click image for animation
link here

To simplify, we distribute all dot products over the sums. Note that the dot product distributes over sums in the same way as scalar multiplication: (a + b + c)ᵀd = aᵀd + bᵀd + cᵀd.

It looks a little intimidating with the capital sigma notation, but it’s the same thing as you see on the right, except with dot products instead of scalar multiplication.

click image for animation
link here

Here's what all that gives us for the dual problem. We've simplified the objective function to have only α's, and we've received some constraints in return.

Note that these constraints are not the KKT conditions. They are requirements for the Lagrangian to be at a minimum in the original parameters. All we have so far is weak duality. To get strong duality, we need to either prove that the KKT conditions hold for all solutions like these, or add some of them to the list of constraints. It turns out we can do the former: all KKT conditions hold already for this form of the problem, so we have strong duality.

link here

Here they are for our problem. If we take a solution to our dual problem (a set of alphas), use the expressions for w, b and pi in terms of alpha, and fill them in here, these six statements should hold.

This is easiest to do with the help of the original form of the Lagrangian, reproduced at the bottom of the slide. Note that we may assume that the original parameters are chosen so that this function is at a minimum, and the alphas are chosen to maximize over that.

We don't have an expression for pi in terms of the alphas. This is because when (C - αi - βi) = 0, which is true at the optimum, the Lagrangian is constant irrespective of pi. In some sense, we can set pi to whatever value we like. If we find one value that causes the KKT conditions to be satisfied, we have strong duality. We'll see that this is the case when we set pi to the originally intended value: where necessary, it makes up the difference between the output of the linear function and the edges of the margin.

From top to bottom:

αi is explicitly constrained to be larger than zero in the dual problem

βi is the remainder between αi and C, so must also be positive.

As noted, we are free to set and interpret pi however we like. It must be positive to satisfy the next condition. The original definition states that pi was zero if the first term was 1 or larger and makes up the (positive) difference if not. For this value of pi, the third and fourth constraints are satisfied.

See above.

The complementary slackness states that either αi is zero or the corresponding constraint is. If we use the slack variable pi, the constraint becomes exactly zero, so complementary slackness is satisfied. If we don't use the slack variable (and pi is zero), the left-hand side of the constraint may be nonzero, and we should show that αi is zero. Assume that it isn't, and look at the second term of the Lagrangian. This now consists of two positive factors. If αi is nonzero, we could increase L by reducing it to zero, which tells us that αi must be zero, since αi is chosen to maximize L.

The same argument as in the previous point can be used here.

Therefore, the KKT conditions are satisfied by any solution to the dual problem, and we have strong duality.

link here

The argument we used to prove complementary slackness also tells us what the Lagrange multipliers mean. This is usually a fruitful area of investigation. The multipliers almost always have a meaningful interpretation in the problem domain.

In this case, the alphas are a kind of complement to the slack parameters pi. Where the alphas are zero, the slack parameters aren't used, and so the point xi is on the correct side of the margin. Where alpha is nonzero, the slack parameters are active and we are dealing with points inside the margin.

link here

And there we have the dual problem for SVMs. Note that the dual always gives us a maximization over the Lagrange multipliers (if we start with inequality constraints), but here we've flipped the sign to change it back to a minimization problem.


link here