  • [C2P2] Andrew Ng

    Linear Regression with One Variable

    Linear regression predicts a real-valued output based on an input value. We discuss the application of linear regression to housing price prediction, present the notion of a cost function, and introduce the gradient descent method for learning.

    7 videos, 8 readings

    Video: Model Representation

    Our first learning algorithm will be linear regression. In this video, you'll see
    0:06
    what the model looks like and more importantly you'll see what the overall process of supervised learning looks like. Let's use some motivating example of predicting housing prices. We're going to use a data set of housing prices from the city of Portland, Oregon. And here I'm gonna plot my data set of a number of houses that were different sizes that were sold for a range of different prices. Let's say that given this data set, you have a friend that's trying to sell a house, and let's say your friend's house is 1250 square feet and you want to tell them how much they might be able to sell the house for. Well one thing you could do is fit a model. Maybe fit a straight line to this data. Looks something like that and based on that, maybe you could tell your friend that let's say maybe he can sell the house for around $220,000. So this is an example of a supervised learning algorithm. And it's supervised learning because we're given the, quotes, "right answer" for each of our examples. Namely we're told what the actual price was that each of the houses in our data set were sold for, and moreover, this is an example of a regression problem where the term regression refers to the fact that we are predicting a real-valued output namely the price. And just to remind you the other most common type of supervised learning problem is called the classification problem where we predict discrete-valued outputs such as if we are looking at cancer tumors and trying to decide if a tumor is malignant or benign. So that's a zero-one valued discrete output. More formally, in supervised learning, we have a data set and this data set is called a training set. So for the housing prices example, we have a training set of different housing prices and our job is to learn from this data how to predict prices of the houses. Let's define some notation that we're using throughout this course. We're going to define quite a lot of symbols. It's okay if you don't remember all the symbols right now but as the course progresses it will be useful to have convenient notation. So I'm gonna use lower case m throughout this course to denote the number of training examples. So in this data set, if I have, you know, let's say 47 rows in this table, then I have 47 training examples and m equals 47. Let me use lowercase x to denote the input variables, often also called the features. That would be the x's here, the input features. And I'm gonna use y to denote my output variable, or the target variable, which I'm going to predict, and so that's the second column here. As for notation, I'm going to use (x, y) to denote a single training example. So, a single row in this table corresponds to a single training example, and to refer to a specific training example, I'm going to use the notation (x(i), y(i)). And, we're
    3:25
    going to use this to refer to the ith training example. So this superscript i over here, this is not exponentiation right? This (x(i), y(i)), the superscript i in parentheses, that's just an index into my training set and refers to the ith row in this table, okay? So this is not x to the power of i, y to the power of i. Instead (x(i), y(i)) just refers to the ith row of this table. So for example, x(1) refers to the input value for the first training example so that's 2104. That's this x in the first row. x(2) will be equal to 1416 right? That's the second x and y(1) will be equal to 460. The first, the y value for my first training example, that's what that (1) refers to. So as mentioned, occasionally I'll ask you a question to let you check your understanding, and in a few seconds in this video a multiple-choice question will pop up in the video. When it does, please use your mouse to select what you think is the right answer. So here's how this supervised learning algorithm works. We take a training set, like our training set of housing prices, and we feed that to our learning algorithm. It is the job of the learning algorithm to then output a function which by convention is usually denoted lowercase h, and h stands for hypothesis. And what the job of the hypothesis is, is it's a function that takes as input the size of a house, like maybe the size of the new house your friend's trying to sell, so it takes in the value of x and it tries to output the estimated value of y for the corresponding house. So h is a function that maps from x's to y's. People often ask me, you know, why is this function called hypothesis. Some of you may know the meaning of the term hypothesis, from the dictionary or from science or whatever. It turns out that in machine learning, this is a name that was used in the early days of machine learning and it kinda stuck. 'Cause maybe it's not a great name for this sort of function, for mapping from sizes of houses to the predictions, you know... I think the term hypothesis maybe isn't the best possible name for this, but this is the standard terminology that people use in machine learning. So don't worry too much about why people call it that. When designing a learning algorithm, the next thing we need to decide is how do we represent this hypothesis h. For this and the next few videos, our initial choice for representing the hypothesis will be the following. We're going to represent h as follows. And we will write this as h_theta(x) equals theta0 plus theta1 times x. And as a shorthand, sometimes instead of writing, you know, h subscript theta of x, there's a shorthand, I'll just write h of x. But more often I'll write it with the subscript theta over there. And plotting this in the pictures, all this means is that we are going to predict that y is a linear function of x. Right, so that's the data set, and what this function is doing is predicting that y is some straight line function of x. That's h of x equals theta 0 plus theta 1 x, okay? And why a linear function? Well, sometimes we'll want to fit more complicated, perhaps non-linear functions as well. But since this linear case is the simple building block, we will start with this example first of fitting linear functions, and we will build on this to eventually have more complex models, and more complex learning algorithms. Let me also give this particular model a name.
This model is called linear regression or this, for example, is actually linear regression with one variable, with the variable being x. Predicting all the prices as functions of one variable X. And another name for this model is univariate linear regression. And univariate is just a fancy way of saying one variable. So, that's linear regression. In the next video we'll start to talk about just how we go about implementing this model.
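
    To make the hypothesis concrete, here is a minimal Python sketch (not part of the course materials) of the univariate hypothesis h_theta(x) = theta0 + theta1 * x. The parameter values and the 1250-square-foot query are placeholder numbers chosen to echo the housing example above.

        def hypothesis(theta0, theta1, x):
            """Univariate linear hypothesis: h_theta(x) = theta0 + theta1 * x."""
            return theta0 + theta1 * x

        # Hypothetical parameters picked by hand, purely for illustration.
        theta0, theta1 = 50.0, 0.14               # intercept and slope, prices in $1000s
        print(hypothesis(theta0, theta1, 1250))   # -> 225.0, i.e. roughly $225,000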

    Reading: Model Representation

    Model Representation
    To establish notation for future use, we’ll use (x^{(i)}) to denote the “input” variables (living area in this example), also called input features, and (y^{(i)}) to denote the “output” or target variable that we are trying to predict (price). A pair ((x^{(i)} , y^{(i)} )) is called a training example, and the dataset that we’ll be using to learn—a list of m training examples ((x^{(i)},y^{(i)});i=1,...,m)—is called a training set. Note that the superscript “(i)” in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y to denote the space of output values. In this example, X = Y = ℝ.

    To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is therefore like this:

    When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.

    Video: Cost Function

    In this video we'll define something called the cost function. This will let us figure out how to fit the best possible straight line to our data.
    0:10
    In linear regression, we have a training set like the one I showed here. Remember, our notation m was the number of training examples, so maybe m equals 47. And the form of our hypothesis,
    0:22
    which we use to make predictions is this linear function.
    0:26
    To introduce a little bit more terminology, these theta zero and theta one, these are what I call the parameters of the model. And what we're going to do in this video is talk about how to go about choosing these two parameter values, theta 0 and theta 1. With different choices of the parameters theta 0 and theta 1, we get different hypotheses, different hypothesis functions. I know some of you will probably be already familiar with what I am going to do on the slide, but just for review, here are a few examples. If theta 0 is 1.5 and theta 1 is 0, then the hypothesis function will look like this.
    1:10
    Because your hypothesis function will be h of x equals 1.5 plus 0 times x, which is this constant-valued function which is flat at 1.5. If theta0 = 0, theta1 = 0.5, then the hypothesis will look like this, and it should pass through this point (2, 1), so that you now have h(x). Or really h subscript theta of x, but sometimes I'll just omit theta for brevity. So h(x) will be equal to just 0.5 times x, which looks like that. And finally, if theta zero equals one, and theta one equals 0.5, then we end up with a hypothesis that looks like this. Let's see, it should pass through the point (2, 2). Like so, and this is my new h of x, or my new h subscript theta of x. And just to remind you, I said that this is h subscript theta of x, but as a shorthand, sometimes I'll just write this as h of x.
    2:13
    In linear regression, we have a training set, like maybe the one I've plotted here. What we want to do, is come up with values for the parameters theta zero and theta one so that the straight line we get out of this, corresponds to a straight line that somehow fits the data well, like maybe that line over there.
    2:34
    So, how do we come up with values, theta zero, theta one, that corresponds to a good fit to the data?
    2:42
    The idea is we get to choose our parameters theta 0, theta 1 so that h of x, meaning the value we predict on input x, is at least close to the values y for the examples in our training set, for our training examples. So in our training set, we're given a number of examples where we know the size of the house x and we know the actual price it was sold for. So, let's try to choose values for the parameters so that, at least in the training set, given the x in the training set, we make reasonably accurate predictions for the y values. Let's formalize this. In linear regression, what we're going to do is solve a minimization problem. So I'll write minimize over theta0, theta1. And I want this to be small, right? I want the difference between h(x) and y to be small. And one thing I might do is try to minimize the square difference between the output of the hypothesis and the actual price of a house. Okay. So let's fill in some details. You remember that I was using the notation (x(i), y(i)) to represent the ith training example. So what I want really is to sum over my training set, a sum from i = 1 to m, of the square difference between, this is the prediction of my hypothesis when it is given as input the size of house number i.
    4:22
    Right? Minus the actual price that house number i was sold for, and I want to minimize the sum over my training set, a sum from i equals one through m, of this squared error, the square difference between the predicted price of a house and the price that it was actually sold for. And just to remind you of the notation, m here was the size of my training set, right? So my m there is my number of training examples. Right, that hash sign is the abbreviation for "number" of training examples, okay? And to make the math a little bit easier, I'm actually going to look at 1 over m times that, so let's try to minimize my average error, minimize one over 2m. Putting the constant one-half in front just makes the math a little easier, so minimizing one-half of something, right, should give you the same values of the parameters theta 0, theta 1 as minimizing the original function.
    5:24
    And just to be sure, this equation is clear, right? This expression in here, h subscript theta(x), this is our usual, right?
    5:37
    That is equal to theta zero plus theta one x(i). And this notation, minimize over theta 0, theta 1, means find me the values of theta 0 and theta 1 that cause this expression to be minimized, and this expression depends on theta 0 and theta 1, okay? So just to recap. We're posing this problem as: find me the values of theta zero and theta one so that the average, the 1 over 2m, times the sum of square errors between my predictions on the training set and the actual values of the houses on the training set, is minimized. So this is going to be my overall objective function for linear regression.
    6:22
    And just to rewrite this out a little bit more cleanly, what I'm going to do is, by convention we usually define a cost function,
    6:31
    which is going to be exactly this, that formula I have up here.
    6:37
    And what I want to do is minimize over theta0 and theta1. My function j(theta0, theta1). Just write this out.
    6:53
    This is my cost function.
    6:59
    So, this cost function is also called the squared error function.
    7:06
    It's sometimes called the squared error cost function, and you might wonder why we take the squares of the errors. It turns out that this squared error cost function is a reasonable choice and works well for most regression problems. There are other cost functions that will work pretty well. But the squared error cost function is probably the most commonly used one for regression problems. Later in this class we'll talk about alternative cost functions as well, but this choice that we just had should be a pretty reasonable thing to try for most linear regression problems.
    7:42
    Okay. So that's the cost function.
    7:45
    So far we've just seen a mathematical definition of this cost function. In case this function J of theta zero, theta one seems a little bit abstract, and you still don't have a good sense of what it's doing, in the next video, in the next couple of videos, I'm actually going to go a little bit deeper into what the cost function J is doing and try to give you better intuition about what it is computing and why we want to use it.

    Reading: Cost Function

    Cost Function
    We can measure the accuracy of our hypothesis function by using a cost function. This takes an average difference (actually a fancier version of an average) of all the results of the hypothesis with inputs from x's and the actual output y's.

    \[ J(\theta_0, \theta_1) = \dfrac{1}{2m} \displaystyle\sum_{i=1}^m \left( \hat{y}_{i} - y_{i} \right)^2 = \dfrac{1}{2m} \displaystyle\sum_{i=1}^m \left( h_\theta(x_{i}) - y_{i} \right)^2 \]

    To break it apart, it is \(\frac{1}{2}\bar{x}\) where \(\bar{x}\) is the mean of the squares of \(h_\theta(x_{i}) - y_{i}\), or the difference between the predicted value and the actual value.

    This function is otherwise called the "Squared error function", or "Mean squared error". The mean is halved \(\left(\frac{1}{2}\right)\) as a convenience for the computation of the gradient descent, as the derivative term of the square function will cancel out the \(\frac{1}{2}\) term. The following image summarizes what the cost function does:
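
    As an illustration of the formula above (this snippet is not from the course), the cost can be computed directly in Python; the training data below are placeholder values.

        def cost(theta0, theta1, xs, ys):
            """Squared error cost J(theta0, theta1) = (1 / (2m)) * sum((h(x_i) - y_i)^2)."""
            m = len(xs)
            squared_errors = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
            return squared_errors / (2 * m)

        # Placeholder data: sizes in square feet, prices in $1000s.
        xs = [2104, 1416, 1534, 852]
        ys = [460, 232, 315, 178]
        print(cost(0.0, 0.2, xs, ys))   # cost of the hypothesis h(x) = 0.2 * x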

    Video: Cost Function - Intuition I

    In the previous video, we gave the mathematical definition of the cost function. In this video, let's look at some examples, to get back to the intuition about what the cost function is doing, and why we want to use it. To recap, here's what we had last time. We want to fit a straight line to our data, so we had this form of a hypothesis with these parameters theta zero and theta one, and with different choices of the parameters we end up with different straight line
    0:31
    fits. So the data which are fit like so, and there's the cost function, and that was our optimization objective. [sound] So in this video, in order to better visualize the cost function J, I'm going to work with a simplified hypothesis function, like that shown on the right. So I'm gonna use my simplified hypothesis, which is just theta one times x. We can, if you want, think of this as setting the parameter theta zero equal to 0. So I have only one parameter theta one and my cost function is similar to before except that now h of x is equal to just theta one times x. And I have only one parameter theta one and so my optimization objective is to minimize J of theta one. In pictures, what this means is that if theta zero equals zero, that corresponds to choosing only hypothesis functions that pass through the origin, that pass through the point (0, 0). Using this simplified definition of the hypothesis and cost function, let's try to understand the cost function concept better. It turns out there are two key functions we want to understand. The first is the hypothesis function, and the second is the cost function. So, notice that the hypothesis, right, h of x, for a fixed value of theta one, is a function of x. So the hypothesis is a function of the size of the house x. In contrast, the cost function J is a function of the parameter theta one, which controls the slope of the straight line. Let's plot these functions and try to understand them both better. Let's start with the hypothesis. On the left, let's say here's my training set with three points at (1, 1), (2, 2), and (3, 3). Let's pick a value for theta one, say theta one equals one, and if that's my choice for theta one, then my hypothesis is going to look like this straight line over here. And I'm gonna point out, when I'm plotting my hypothesis function, my horizontal axis is labeled x, is labeled, you know, size of the house over here. Now, temporarily, set theta one equals one. What I want to do is figure out what is J of theta one, when theta one equals one. So let's go ahead and compute what the cost function is for the value one. Well, as usual, my cost function is defined as a sum over my training set of this usual squared error term. And this is therefore equal to the sum of theta one x(i) minus y(i), squared, and if you simplify, this turns out to be zero squared plus zero squared plus zero squared, which is of course just equal to zero. Now, inside the cost function, it turns out each of these terms here is equal to zero. Because for the specific training set I have, my 3 training examples are (1, 1), (2, 2), (3, 3). If theta one is equal to one, then h of x(i) is equal to y(i) exactly. And so h of x minus y, each of these terms is equal to zero, which is why I find that J of one is equal to zero. So, we now know that J of one is equal to zero. Let's plot that. What I'm gonna do on the right is plot my cost function J. And notice, because my cost function is a function of my parameter theta one, when I plot my cost function, the horizontal axis is now labeled with theta one. So I have J of one equals zero, so let's go ahead and plot that. We end up with an x over there. Now let's look at some other examples. Theta one can take on a range of different values. Right? So theta one can take on negative values, zero, and positive values. So what if theta one is equal to 0.5. What happens then? Let's go ahead and plot that.
I'm now going to set theta one equals 0.5, and in that case my hypothesis now looks like this, a line with slope equal to 0.5. And let's compute J of 0.5. So that is going to be one over 2m of my usual cost function. It turns out that the cost function is going to be the sum of the square of the height of this line, plus the square of the height of that line, plus the square of the height of that line, right? 'Cause it's just this vertical distance, that's the difference between, you know, y(i) and the predicted value, h of x(i), right? So the first example is going to be 0.5 minus one, squared, because my hypothesis predicted 0.5, whereas the actual value was one. For my second example, I get one minus two, squared, because my hypothesis predicted one, but the actual housing price was two. And then finally, plus 1.5 minus three, squared. And so that's equal to one over two times three, because m, the training set size, is three training examples, and then, simplifying, the term in the parentheses is 3.5. So that's 3.5 over six, which is about 0.58. So now we know that J of 0.5 is about 0.58. Let's go and plot that, which is maybe about over there. Okay? Now, let's do one more. How about if theta one is equal to zero, what is J of zero equal to? It turns out that if theta one is equal to zero, then h of x is just equal to, you know, this flat line, right, that just goes horizontally like this. And so, measuring the errors, we have that J of zero is equal to one over 2m times one squared plus two squared plus three squared, which is one over six times fourteen, which is about 2.3. So let's go ahead and plot that as well. So it ends up with a value around 2.3, and of course we can keep on doing this for other values of theta one. It turns out that you can have negative values of theta one as well, so if theta one is negative, then h of x would be equal to, say, minus 0.5 times x. Then theta one is minus 0.5, and that corresponds to a hypothesis with a slope of negative 0.5. And you can actually keep on computing these errors. For minus 0.5, this turns out to have a really high error; it works out to be something like 5.25. And so on, and for the different values of theta one, you can compute these things, right? And by computing this range of values, you can actually slowly trace out what the function J of theta one looks like, and that's what J of theta one is. To recap, for each value of theta one, right? Each value of theta one corresponds to a different hypothesis, or to a different straight line fit on the left. And for each value of theta one, we could then derive a different value of J of theta one. And for example, you know, theta one = 1 corresponded to this straight line straight through the data. Whereas theta one = 0.5, and this point shown in magenta, corresponded to maybe that line, and theta one = zero, which is shown in blue, corresponds to this horizontal line. So for each value of theta one, we wound up with a different value of J of theta one, and we could then use this to trace out this plot on the right. Now you remember, the optimization objective for our learning algorithm is we want to choose the value of theta one that minimizes J of theta one. Right? This was our objective function for the linear regression.
Well, looking at this curve, the value that minimizes J of theta one is, you know, theta one equals one. And lo and behold, that is indeed the best possible straight line fit through our data: by setting theta one equals one, and just for this particular training set, we actually end up fitting it perfectly. And that's why minimizing J of theta one corresponds to finding a straight line that fits the data well. So, to wrap up. In this video, we looked at some plots to understand the cost function. To do so, we simplified the algorithm so that it only had one parameter theta one, and we set the parameter theta zero to be zero. In the next video, we'll go back to the original problem formulation and look at some visualizations involving both theta zero and theta one. That is, without setting theta zero to zero. And hopefully that will give you an even better sense of what the cost function J is doing in the original linear regression formulation.

    Reading: Cost Function - Intuition I

    Cost Function - Intuition I
    If we try to think of it in visual terms, our training data set is scattered on the x-y plane. We are trying to make a straight line (defined by \(h_\theta(x)\)) which passes through these scattered data points.

    Our objective is to get the best possible line. The best possible line will be such that the average squared vertical distances of the scattered points from the line will be the least. Ideally, the line should pass through all the points of our training data set. In such a case, the value of \(J(\theta_0, \theta_1)\) will be 0. The following example shows the ideal situation where we have a cost function of 0.

    When \(\theta_1 = 1\), we get a slope of 1 which goes through every single data point in our model. Conversely, when \(\theta_1 = 0.5\), we see the vertical distance from our fit to the data points increase.

    This increases our cost function to 0.58. Plotting several other points yields the following graph:

    Thus as a goal, we should try to minimize the cost function. In this case, \(\theta_1 = 1\) is our global minimum.
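
    The values quoted above can be reproduced with a few lines of Python (a sketch, using the three-point training set (1, 1), (2, 2), (3, 3) from the video):

        xs, ys = [1, 2, 3], [1, 2, 3]      # the three training examples from the video
        m = len(xs)

        def J(theta1):
            """Cost of the simplified hypothesis h(x) = theta1 * x (theta0 fixed at 0)."""
            return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

        print(J(1.0))   # 0.0   -> this line passes through every training point
        print(J(0.5))   # ~0.58 -> (0.25 + 1 + 2.25) / 6
        print(J(0.0))   # ~2.33 -> (1 + 4 + 9) / 6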

    Video: Cost Function - Intuition II

    In this video, let's delve deeper and get even better intuition about what the cost function is doing. This video assumes that you're familiar with contour plots. If you are not familiar with contour plots or contour figures, some of the illustrations in this video may or may not make sense to you, but that's okay, and if you end up skipping this video, or some of it does not quite make sense because you haven't seen contour plots before, that's okay and you will still understand the rest of this course without those parts. Here's our problem formulation as usual, with the hypothesis, parameters, cost function, and our optimization objective. Unlike before, unlike the last video, I'm going to keep both of my parameters, theta zero and theta one, as we generate our visualizations for the cost function. So, same as last time, we want to understand the hypothesis h and the cost function J. So, here's my training set of housing prices, and let's make some hypothesis. You know, like that one; this is not a particularly good hypothesis. But, if I set theta zero = 50 and theta one = 0.06, then I end up with this hypothesis down here, and that corresponds to that straight line. Now given these values of theta zero and theta one, we want to plot the corresponding, you know, cost function on the right. What we did last time was, right, when we only had theta one, in other words, drawing plots that look like this as a function of theta one. But now we have two parameters, theta zero and theta one, and so the plot gets a little more complicated. It turns out that when we had only one parameter, the plots we drew had this sort of bow-shaped function. Now, when we have two parameters, it turns out the cost function also has a similar sort of bow shape. And, in fact, depending on your training set, you might get a cost function that maybe looks something like this. So, this is a 3-D surface plot, where the axes are labeled theta zero and theta one. So as you vary theta zero and theta one, the two parameters, you get different values of the cost function J(theta zero, theta one), and the height of this surface above a particular point of theta zero, theta one, right, that's the vertical axis. The height of the surface at the point indicates the value of J of theta zero, theta one. And you can see it sort of has this bow-like shape. Let me show you the same plot in 3D. So here's the same figure in 3D, horizontal axes theta zero and theta one and vertical axis J(theta zero, theta one), and if I rotate this plot around, you kind of get a sense, I hope, of this bowl-shaped surface, as that's what the cost function J looks like. Now for the purpose of illustration in the rest of this video, I'm not actually going to use these sorts of 3D surfaces to show you the cost function J; instead I'm going to use contour plots, or what I also call contour figures. I guess they mean the same thing. To show you these surfaces. So here's an example of a contour figure, shown on the right, where the axes are theta zero and theta one. And what each of these ovals, what each of these ellipses shows, is a set of points that take on the same value for J(theta zero, theta one). So concretely, for example, you'll take that point and that point and that point. All three of these points that I just drew in magenta, they have the same value for J(theta zero, theta one). Okay.
Where, right, these are the theta zero, theta one axes, but those three points have the same value for J(theta zero, theta one). And if you haven't seen contour plots much before, think of, imagine if you will, a bowl-shaped function that's coming out of my screen, so that the minimum, the bottom of the bowl, is this point right there, right? This middle, the middle of these concentric ellipses. And imagine a bowl shape that sort of grows out of my screen like this, so that each of these ellipses, you know, has the same height above my screen. And the minimum of the bowl, right, is right down there. And so the contour figure is maybe a more convenient way to visualize my function J. [sound] So, let's look at some examples. Over here, I have a particular point, right? And so this is, with, you know, theta zero equals maybe about 800, and theta one equals maybe -0.15. And so this point, right, this point in red corresponds to one set of pair values of theta zero, theta one, and it corresponds, in fact, to that hypothesis, right? Theta zero is about 800, that is, where it intersects the vertical axis is around 800, and this is a slope of about -0.15. Now this line is really not such a good fit to the data, right? This hypothesis, h(x), with these values of theta zero, theta one, it's really not such a good fit to the data. And so you find that its cost is a value that's out here, that's, you know, pretty far from the minimum, right? It's a pretty high cost because this is just not that good a fit to the data. Let's look at some more examples. Now here's a different hypothesis that's, you know, still not a great fit for the data, but maybe slightly better. So here, right, that's my point, those are my parameters theta zero, theta one, and so my theta zero value, right, that's about 360, and my value for theta one is equal to zero. So, you know, let's break it out. Let's take theta zero equals 360, theta one equals zero. And this pair of parameters corresponds to that hypothesis, corresponds to a flat line, that is, h(x) equals 360 plus zero times x. So that's the hypothesis. And this hypothesis again has some cost, and that cost is, you know, plotted as the height of the J function at that point. Let's look at just a couple more examples. Here's one more: at this value of theta zero, and at that value of theta one, we end up with this hypothesis, h(x), and again, not a great fit to the data, and it's actually further away from the minimum. Last example: this is actually not quite at the minimum, but it's pretty close to the minimum. So this is not such a bad fit to the data, where, for a particular value of theta zero and a particular value of theta one, we get a particular h(x). And this is not quite at the minimum, but it's pretty close. And so the sum of squared errors is the sum of squared distances between my training samples and my hypothesis. Really, that's a sum of squared distances, right, of all of these errors. This is pretty close to the minimum even though it's not quite the minimum. So with these figures, I hope that gives you a better understanding of what values of the cost function J correspond to different hypotheses, and how better hypotheses may correspond to points that are closer to the minimum of this cost function J.
Now of course what we really want is an efficient algorithm, right, an efficient piece of software for automatically finding the values of theta zero and theta one that minimize the cost function J, right? And what we don't wanna do is, you know, write software to plot out this figure and then try to manually read off the numbers; that's not a good way to do it. And, in fact, we'll see later that when we look at more complicated examples, we'll have higher dimensional figures with more parameters, and it turns out, we'll see later in this course, examples where this figure, you know, cannot really be plotted, and this becomes much harder to visualize. And so, what we want is to have software to find the values of theta zero, theta one that minimize this function, and in the next video we start to talk about an algorithm for automatically finding the values of theta zero and theta one that minimize the cost function J.

    Reading: Cost Function - Intuition II

    Cost Function - Intuition II
    A contour plot is a graph that contains many contour lines. A contour line of a two variable function has a constant value at all points of the same line. An example of such a graph is the one to the right below.

    Taking any color and going along the 'circle', one would expect to get the same value of the cost function. For example, the three green points found on the green line above have the same value for \(J(\theta_0, \theta_1)\) and as a result, they are found along the same line. The circled x displays the value of the cost function for the graph on the left when \(\theta_0\) = 800 and \(\theta_1\) = -0.15. Taking another h(x) and plotting its contour plot, one gets the following graphs:

    When \(\theta_0\) = 360 and \(\theta_1\) = 0, the value of \(J(\theta_0, \theta_1)\) in the contour plot gets closer to the center thus reducing the cost function error. Now giving our hypothesis function a slightly positive slope results in a better fit of the data.

    The graph above minimizes the cost function as much as possible and consequently, the results for \(\theta_1\) and \(\theta_0\) tend to be around 0.12 and 250 respectively. Plotting those values on our graph to the right seems to put our point in the center of the innermost 'circle'.
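
    A contour plot like the ones described above can be produced by evaluating J(theta_0, theta_1) over a grid of parameter values. The sketch below is illustrative only; the training data are made up, and the plotting call is left as a comment.

        import numpy as np

        # Made-up training data, for illustration only.
        xs = np.array([1.0, 2.0, 3.0, 4.0])
        ys = np.array([2.0, 2.5, 3.5, 4.0])
        m = len(xs)

        theta0_vals = np.linspace(-1.0, 4.0, 100)
        theta1_vals = np.linspace(-1.0, 2.0, 100)
        J_vals = np.zeros((len(theta0_vals), len(theta1_vals)))

        for i, t0 in enumerate(theta0_vals):
            for j, t1 in enumerate(theta1_vals):
                errors = t0 + t1 * xs - ys
                J_vals[i, j] = (errors ** 2).sum() / (2 * m)

        # J_vals can now be handed to a contour-plotting routine, for example
        # matplotlib's plt.contour(theta0_vals, theta1_vals, J_vals.T).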

    Video: Gradient Descent

    We previously defined the cost function J. In this video, I want to tell you about an algorithm called gradient descent for minimizing the cost function J. It turns out gradient descent is a more general algorithm, and is used not only in linear regression. It's actually used all over the place in machine learning. And later in the class, we'll use gradient descent to minimize other functions as well, not just the cost function J for the linear regression.
    0:26
    So in this video, we'll talk about gradient descent for minimizing some arbitrary function J and then in later videos, we'll take this algorithm and apply it specifically to the cost function J that we have defined for linear regression.
    0:41
    So here's the problem setup. We're going to assume that we have some function J(theta 0, theta 1). Maybe it's the cost function from linear regression, maybe it's some other function we wanna minimize. And we want to come up with an algorithm for minimizing that as a function of theta 0, theta 1. Just as an aside, it turns out that gradient descent actually applies to more general functions. So imagine, if you have a function that's a function of theta 0, theta 1, theta 2, up to say some theta n, and you want to minimize over theta 0 up to theta n this J of theta 0 up to theta n. It turns out gradient descent is an algorithm for solving this more general problem. But for the sake of brevity, for the sake of succinctness of notation, I'm just going to pretend I have only two parameters throughout the rest of this video. Here's the idea for gradient descent. What we're going to do is we're going to start off with some initial guesses for theta 0 and theta 1. Doesn't really matter what they are, but a common choice would be we set theta 0 to 0, and set theta 1 to 0, just initialize them to 0. What we're going to do in gradient descent is we'll keep changing theta 0 and theta 1 a little bit to try to reduce J(theta 0, theta 1), until hopefully we wind up at a minimum, or maybe at a local minimum.
    2:09
    So let's see in pictures what gradient descent does. Let's say you're trying to minimize this function. So notice the axes, this is theta 0, theta 1 on the horizontal axes and J is the vertical axis and so the height of the surface shows J and we want to minimize this function. So we're going to start off with theta 0, theta 1 at some point. So imagine picking some value for theta 0, theta 1, and that corresponds to starting at some point on the surface of this function. So whatever value of theta 0, theta 1 gives you some point here. I did initialize them to 0, 0 but sometimes you initialize it to other values as well.
    2:49
    Now, I want you to imagine that this figure shows a hill. Imagine this is like the landscape of some grassy park, with two hills like so, and I want us to imagine that you are physically standing at that point on the hill, on this little red hill in your park. In gradient descent, what we're going to do is we're going to spin 360 degrees around, just look all around us, and ask, if I were to take a little baby step in some direction, and I want to go downhill as quickly as possible, what direction do I take that little baby step in? If I wanna go down, I wanna physically walk down this hill as rapidly as possible.
    3:31
    Turns out, that if you're standing at that point on the hill, you look all around and you find that the best direction to take a little step downhill is roughly that direction. Okay, and now you're at this new point on your hill. You're gonna, again, look all around and say what direction should I step in order to take a little baby step downhill? And if you do that and take another step, you take a step in that direction.
    3:57
    And then you keep going. From this new point you look around, decide what direction would take you downhill most quickly. Take another step, another step, and so on until you converge to this local minimum down here.
    4:11
    Gradient descent has an interesting property.
    4:14
    This first time we ran gradient descent we were starting at this point over here, right? Started at that point over here. Now imagine we had initialized gradient descent just a couple steps to the right. Imagine we'd initialized gradient descent with that point on the upper right. If you were to repeat this process, so start from that point, look all around, take a little step in the direction of steepest descent, you would do that. Then look around, take another step, and so on.
    4:43
    And if you started just a couple of steps to the right, gradient descent would've taken you to this second local optimum over on the right.
    4:54
    So if you had started at this first point, you would've wound up at this local optimum, but if you started just at a slightly different location, you would've wound up at a very different local optimum. And this is a property of gradient descent that we'll say a little bit more about later. So that's the intuition in pictures. Let's look at the math. This is the definition of the gradient descent algorithm. We're going to just repeatedly do this until convergence: we're going to update my parameter theta j by taking theta j and subtracting from it alpha times this term over here, okay? So let's see, there's a lot of detail in this equation so let me unpack some of it. First, this notation here, :=. I'm gonna use := to denote assignment, so it's the assignment operator. So briefly, if I write a := b, what this means is, in a computer, take the value in b and use it to overwrite whatever value is in a. So this means set a to be equal to the value of b, which is assignment. And I can also do a := a + 1. This means take a and increase its value by one. Whereas in contrast, if I use the equal sign and I write a equals b, then this is a truth assertion.
    6:24
    Okay? So if I write a equals b, then I'm asserting that the value of a equals the value of b, right? So the left hand side, that's the computer operation, where we set the value of a to a new value. The right hand side, this is asserting, I'm just making a claim that the values of a and b are the same, and so whereas you can write a := a + 1, that means increment a by 1, hopefully I won't ever write a = a + 1, because that's just wrong. a and a + 1 can never be equal. Okay? So this is the first part of the definition. This alpha here is a number that is called the learning rate.
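
    In most programming languages this distinction shows up as two separate operators. In Python, for example (used here only as an illustration), = plays the role of := (assignment) while == is the truth assertion:

        a = 5                # assignment: "a := 5", overwrite whatever a held before
        a = a + 1            # assignment: increment a, so a is now 6
        print(a == 6)        # truth assertion: True
        print(a == a + 1)    # truth assertion: always False; a never equals a + 1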
    7:08
    And what alpha does is it basically controls how big a step we take downhill with gradient descent. So if alpha is very large, then that corresponds to a very aggressive gradient descent procedure where we're trying to take huge steps downhill, and if alpha is very small, then we're taking little, little baby steps downhill. And I'll come back and say more about this later, about how to set alpha and so on.
    7:32
    And finally, this term here, that's a derivative term. I don't wanna talk about it right now, but I will derive this derivative term and tell you exactly what this is later, okay? And some of you will be more familiar with calculus than others, but even if you aren't familiar with calculus, don't worry about it. I'll tell you what you need to know about this term here.
    7:55
    Now, there's one more subtlety about gradient descent, which is that in gradient descent we're going to update, you know, theta 0 and theta 1, right? So this update takes place for j = 0 and j = 1, so you're gonna update theta 0 and update theta 1. And the subtlety of how you implement gradient descent is that for this expression, for this update equation, you want to simultaneously update theta 0 and theta 1. What I mean by that is that in this equation, we're gonna update theta 0 := theta 0 minus something, and update theta 1 := theta 1 minus something. And the way to implement this is you should compute the right hand sides, right? Compute that thing for theta 0 and theta 1 and then simultaneously, at the same time, update theta 0 and theta 1, okay? So let me say what I mean by that. This is a correct implementation of gradient descent, meaning simultaneous update. So I'm gonna set temp0 equals that, set temp1 equals that, so basically compute the right-hand sides, and then having computed the right-hand sides and stored them into variables temp0 and temp1, I'm gonna update theta 0 and theta 1 simultaneously, because that's the correct implementation.
    9:18
    In contrast, here's an incorrect implementation that does not do a simultaneous update. So in this incorrect implementation, we compute temp0, and then we update theta 0, and then we compute temp1, and then we update theta 1.
    9:34
    And the difference between the right hand side and the left hand side implementations is that if you look down here, you look at this step: if by this time you've already updated theta 0, then you would be using the new value of theta 0 to compute this derivative term. And so this gives you a different value of temp1 than the left-hand side, right? Because you've now plugged the new value of theta 0 into this equation. And so, this on the right-hand side is not a correct implementation of gradient descent, okay? So I'm not going to say much more about why you need to do the simultaneous updates. It turns out that the way gradient descent is usually implemented, which I'll say more about later, it actually turns out to be more natural to implement the simultaneous update. And when people talk about gradient descent, they always mean simultaneous update. If you implement the non-simultaneous update, it turns out it will probably work anyway. But that algorithm isn't right. It's not what people refer to as gradient descent, and it's some other algorithm with different properties. And for various reasons this can behave in slightly stranger ways, and so what you should do is really implement the simultaneous update of gradient descent. So, that's the outline of the gradient descent algorithm. In the next video, we're going to go into the details of the derivative term, which I wrote up but didn't really define. And if you've taken a calculus class before and you're familiar with partial derivatives and derivatives, it turns out that's exactly what that derivative term is, but in case you aren't familiar with calculus, don't worry about it. The next video will give you all the intuitions and will tell you everything you need to know to compute that derivative term, even if you haven't seen calculus, or even if you haven't seen partial derivatives before. And with that, with the next video, hopefully we'll be able to give you all the intuitions you need to apply gradient descent.
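
    The correct and incorrect implementations described above can be sketched in Python. The derivative functions below are placeholders for the terms that the next video derives; only the order of the updates matters here.

        def d_J_d_theta0(t0, t1):
            # Placeholder: derivative of a made-up cost (t0 + t1 - 3)^2 with respect to theta0.
            return 2.0 * (t0 + t1 - 3.0)

        def d_J_d_theta1(t0, t1):
            # Placeholder: derivative of the same made-up cost with respect to theta1.
            return 2.0 * (t0 + t1 - 3.0)

        alpha = 0.1
        theta0, theta1 = 0.0, 0.0

        # Correct: simultaneous update. Both right-hand sides are computed from the
        # old values before either parameter is overwritten.
        temp0 = theta0 - alpha * d_J_d_theta0(theta0, theta1)
        temp1 = theta1 - alpha * d_J_d_theta1(theta0, theta1)
        theta0, theta1 = temp0, temp1

        # Incorrect: non-simultaneous update. theta0 is overwritten first, so the new
        # theta0 leaks into the derivative used to update theta1.
        theta0 = theta0 - alpha * d_J_d_theta0(theta0, theta1)
        theta1 = theta1 - alpha * d_J_d_theta1(theta0, theta1)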

    Reading: Gradient Descent

    Gradient Descent
    So we have our hypothesis function and we have a way of measuring how well it fits into the data. Now we need to estimate the parameters in the hypothesis function. That's where gradient descent comes in.

    Imagine that we graph our hypothesis function based on its parameters \(\theta_0\) and \(\theta_1\) (actually we are graphing the cost function as a function of the parameter estimates). We are not graphing x and y itself, but the parameter range of our hypothesis function and the cost resulting from selecting a particular set of parameters.

    We put \(\theta_0\) on the x axis and \(\theta_1\) on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. The graph below depicts such a setup.

    We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum. The red arrows show the minimum points in the graph.

    The way we do this is by taking the derivative (the tangential line to a function) of our cost function. The slope of the tangent is the derivative at that point and it will give us a direction to move towards. We make steps down the cost function in the direction with the steepest descent. The size of each step is determined by the parameter α, which is called the learning rate.

    For example, the distance between each 'star' in the graph above represents a step determined by our parameter α. A smaller α would result in a smaller step and a larger α results in a larger step. The direction in which the step is taken is determined by the partial derivative of (J( heta_0, heta_1)). Depending on where one starts on the graph, one could end up at different points. The image above shows us two different starting points that end up in two different places.

    The gradient descent algorithm is:

    repeat until convergence:

    \[ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \]

    where

    j=0,1 represents the feature index number.

    At each iteration j, one should simultaneously update the parameters \(\theta_1, \theta_2, ..., \theta_n\). Updating a specific parameter prior to calculating another one on the \(j^{(th)}\) iteration would yield a wrong implementation.

    Video: Gradient Descent Intuition

    In the previous video, we gave a mathematical definition of gradient descent. Let's delve deeper and in this video get better intuition about what the algorithm is doing and why the steps of the gradient descent algorithm might make sense.
    0:15
    Here's the gradient descent algorithm that we saw last time, and just to remind you, this parameter, or this term alpha, is called the learning rate. And it controls how big a step we take when updating my parameter theta j.
    0:31
    And this second term here is the derivative term
    0:39
    And what I wanna do in this video is give you the intuition about what each of these two terms is doing and why, when put together, this entire update makes sense. In order to convey these intuitions, what I want to do is use a slightly simpler example, where we want to minimize a function of just one parameter. So say we have a cost function J of just one parameter, theta one, like we did a few videos back, where theta one is a real number. So we can have 1-D plots, which are a little bit simpler to look at. Let's try to understand what gradient descent would do on this function.
    1:20
    So let's say, here's my function, J of theta 1. And so that's my function, where theta 1 is a real number.
    1:32
    All right? Now, let's say we initialize gradient descent with theta one at this location. So imagine that we start off at that point on my function.
    1:44
    What gradient descent will do is update
    1:49
    Theta one gets updated as theta one minus alpha times d d theta one J of theta one, right? And as an aside, this derivative term, right, if you're wondering why I changed the notation from these partial derivative symbols. If you don't know what the difference is between these partial derivative symbols and the dd theta, don't worry about it. Technically in mathematics you call this a partial derivative and call this a derivative, depending on the number of parameters in the function J. But that's a mathematical technicality. And so for the purpose of this lecture, think of these partial symbols and d, d theta 1, as exactly the same thing. And don't worry about what the real difference is. I'm gonna try to use the mathematically precise notation, but for our purposes these two notations are really the same thing. And so let's see what this equation will do. So we're going to compute this derivative, not sure if you've seen derivatives in calculus before, but what the derivative at this point does, is basically saying, now let's take the tangent to that point, like that straight line, that red line, is just touching this function, and let's look at the slope of this red line. That's what the derivative is, it's saying what's the slope of the line that is just tangent to the function. Okay, the slope of a line is just this height divided by this horizontal thing. Now, this line has a positive slope, so it has a positive derivative. And so my update to theta is going to be theta 1, it gets updated as theta 1, minus alpha times some positive number.
    3:39
    Okay. Alpha, the learning rate, is always a positive number. And so we're going to take theta one and update it as theta one minus something. So I'm gonna end up moving theta one to the left. I'm gonna decrease theta one, and we can see this is the right thing to do cuz I actually wanna head in this direction, you know, to get me closer to the minimum over there.
    4:00
    So, gradient descent so far seems to be doing the right thing. Let's look at another example. So let's take my same function J, let's draw the same function, J of theta 1. And now, let's say I instead initialize my parameter over there on the left. So theta 1 is here, at that point on the surface.
    4:20
    Now my derivative term, d d theta one of J of theta one, when evaluated at this point, we're gonna look at, right, the slope of that line. So this derivative term is the slope of this line. But this line is slanting down, so this line has negative slope.
    4:41
    Right. Or alternatively, I say that this function has negative derivative, which just means negative slope at that point. So this is less than or equal to 0, so when I update theta, theta one gets updated as theta one minus alpha times a negative number.
    5:02
    And so I have theta 1 minus a negative number, which means I'm actually going to increase theta one, because it's minus of a negative number, which means I'm adding something to theta one. And what that means is that I'm going to end up increasing theta one, moving it over here to the right, and increasing theta one again seems like the thing I wanted to do to try to get me closer to the minimum.
    5:26
    So that's the intuition behind what the derivative term is doing. Let's take a look at the learning rate term alpha and see what that's doing.
    5:38
    So here's my gradient descent update rule, that's this equation.
    5:43
    And let's look at what could happen if alpha is either too small or if alpha is too large. So this first example, what happens if alpha is too small? So here's my function J, J of theta one. Let's say we start here. If alpha is too small, then what I'm gonna do is multiply my update by some small number, so I end up taking a baby step like that. Okay, so that's one step. Then from this new point, I'm gonna take another step. But if alpha's too small, I take another little baby step. And so if my learning rate is too small, I'm gonna end up taking these tiny tiny baby steps as I try to get to the minimum. And I'm gonna need a lot of steps to get to the minimum, and so if alpha is too small, gradient descent can be slow because it's gonna take these tiny tiny baby steps, and so it's gonna need a lot of steps before it gets anywhere close to the global minimum.
    6:46
    Now how about if our alpha is too large? So, here's my function J of theta one. It turns out that if alpha is too large, then gradient descent can overshoot the minimum and may even fail to converge, or even diverge. So here's what I mean. Let's say we start theta one there; it's actually close to the minimum. So the derivative points to the right, but if alpha is too big, I'm going to take a huge step. Remember, take a huge step like that. So it ends up taking a huge step, and now my cost function has gotten worse, because it started off with this value, and now my value has gotten worse. Now my derivative points to the left; it says I should decrease theta. But if my learning rate is too big, I may take a huge step going from here all the way out there. So we end up being over there, right? And if my learning rate is too big, we can take another huge step on the next iteration and kind of overshoot and overshoot and so on, until you notice I'm actually getting further and further away from the minimum. So if alpha is too large, it can fail to converge or even diverge. Now, I have another question for you. This is a tricky one, and when I was first learning this stuff it actually took me a long time to figure this out. What if your parameter theta 1 is already at a local minimum? What do you think one step of gradient descent will do?
    8:06
    So let's suppose you initialize theta 1 at a local minimum. So, suppose this is your initial value of theta 1 over here and is already at a local optimum or the local minimum.
    8:19
    It turns out that at the local optimum, your derivative will be equal to zero. So at that slope, at that tangent point, the slope of this line will be equal to zero, and thus this derivative term is equal to zero. And so in your gradient descent update, you have theta one updated as theta one minus alpha times zero. And so what this means is that if you're already at the local optimum, it leaves theta 1 unchanged, because the update is theta 1 equals theta 1. So if your parameters are already at a local minimum, one step of gradient descent does absolutely nothing; it doesn't change your parameters, which is what you want, because it keeps your solution at the local optimum.
    9:05
    This also explains why gradient descent can converge to a local minimum even with the learning rate alpha fixed. Here's what I mean by that; let's look at an example. So here's a cost function J of theta that maybe I want to minimize, and let's say I initialize my algorithm, my gradient descent algorithm, out there at that magenta point. If I take one step of gradient descent, maybe it will take me to that point, because my derivative's pretty steep out there. Right? Now, I'm at this green point, and if I take another step of gradient descent, you notice that my derivative, meaning the slope, is less steep at the green point than it was at the magenta point out there. Because as I approach the minimum, my derivative gets closer and closer to zero. So after one step of descent, my new derivative is a little bit smaller. So when I take another step of gradient descent, I will naturally take a somewhat smaller step from this green point than I did from the magenta point. Now I'm at a new point, the red point, and I'm even closer to the global minimum, so the derivative here will be even smaller than it was at the green point. So I'm gonna take another step of gradient descent.
    10:22
    Now, my derivative term is even smaller, and so the magnitude of the update to theta one is even smaller, so I take a small step like so. And as gradient descent runs, you will automatically take smaller and smaller steps.
    10:41
    Until eventually you're taking very small steps, you know, and you finally converge to the local minimum.
    10:50
    So just to recap, in gradient descent, as we approach a local minimum, gradient descent will automatically take smaller steps. And that's because, as we approach the local minimum, by definition the local minimum is where the derivative is equal to zero. As we approach the local minimum, this derivative term will automatically get smaller, and so gradient descent will automatically take smaller steps. So there's actually no need to decrease alpha over time.
    11:22
    So that's the gradient descent algorithm, and you can use it to try to minimize any cost function J, not just the cost function J that we defined for linear regression. In the next video, we're going to take the function J and set that back to be exactly linear regression's cost function, the squared error cost function that we came up with earlier. And taking gradient descent and this squared error cost function and putting them together, that will give us our first learning algorithm, the linear regression algorithm.
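
    Before moving on, here is a minimal Octave sketch (not from the lecture; the quadratic cost and the values of alpha are toy choices of my own) of the single-parameter update just described, which you can use to watch what a too-small or too-large learning rate does:

    % Gradient descent on J(theta1) = (theta1 - 3)^2, whose derivative is 2*(theta1 - 3).
    % Try alpha = 0.001 (many tiny steps), 0.1 (converges quickly), or 1.1 (diverges).
    alpha  = 0.1;                       % learning rate (toy value)
    theta1 = 10;                        % arbitrary starting point
    for iter = 1:50
      deriv  = 2 * (theta1 - 3);        % d/dtheta1 of J(theta1)
      theta1 = theta1 - alpha * deriv;  % the gradient descent update
    end
    fprintf('theta1 after 50 iterations: %g (the minimum is at theta1 = 3)\n', theta1);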

    Reading: Gradient Descent Intuition

    Gradient Descent Intuition
    In this video we explored the scenario where we used one parameter \(\theta_1\) and plotted its cost function to implement gradient descent. Our formula for a single parameter was:

    Repeat until convergence:

    \[\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)\]

    Regardless of the slope's sign for \(\frac{d}{d\theta_1} J(\theta_1)\), \(\theta_1\) eventually converges to its minimum value. The following graph shows that when the slope is negative, the value of \(\theta_1\) increases, and when it is positive, the value of \(\theta_1\) decreases.

    On a side note, we should adjust our parameter \(\alpha\) to ensure that the gradient descent algorithm converges in a reasonable time. Failure to converge or too much time to obtain the minimum value imply that our step size is wrong.

    How does gradient descent converge with a fixed step size \(\alpha\)?

    The intuition behind the convergence is that \(\frac{d}{d\theta_1} J(\theta_1)\) approaches 0 as we approach the bottom of our convex function. At the minimum, the derivative will always be 0 and thus we get:

    \[\theta_1 := \theta_1 - \alpha \cdot 0\]

    Video: Gradient Descent For Linear Regression

    In previous videos, we talked about the gradient descent algorithm and we talked about the linear regression model and the squared error cost function. In this video we're gonna put together gradient descent with our cost function, and that will give us an algorithm for linear regression, for fitting a straight line to our data.
    0:20
    So this was what we worked out in the previous videos: this gradient descent algorithm, which should be familiar, and here's the linear regression model with our linear hypothesis and our squared error cost function. What we're going to do is apply gradient descent to minimize our squared error cost function. Now in order to apply gradient descent, in order to, you know, write this piece of code, the key term we need is this derivative term over here. So we need to figure out what is this partial derivative term, and plugging in the definition of the cost function J, this turns out to be this.
    1:13
    Sum from i equals 1 through m of this squared error cost function term. And all I did here was I just, you know, plugged in the definition of the cost function there.
    1:27
    And simplifying a little bit more, this turns out to be equal to this: sigma, i equals one through m, of theta zero plus theta one x(i) minus y(i), squared. And all I did there was I took the definition of my hypothesis and plugged it in there. And it turns out we need to figure out what this partial derivative is for two cases, for j equals 0 and j equals 1. So we want to figure out what this partial derivative is for both the theta 0 case and the theta 1 case. And I'm just going to write out the answers. It turns out this first term simplifies to 1/m times the sum over my training set of just that, of h of x(i) minus y(i), and for the partial derivative with respect to theta 1, it turns out I get this term: h of x(i) minus y(i), times x(i). Okay, and computing these partial derivatives, so going from this equation to either of the equations down there, computing those partial derivative terms requires some multivariate calculus. If you know calculus, feel free to work through the derivations yourself and check that if you take the derivatives, you actually get the answers that I got. But if you're less familiar with calculus, don't worry about it, and it's fine to just take these equations that were worked out; you won't need to know calculus or anything like that in order to do the homework. So let's implement gradient descent and get this working.
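
    Written out, the two derivative terms described above are the following (reconstructed here from the spoken description, since the slide itself is not reproduced in this text):

    \[\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)\]

    \[\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}\]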
    3:14
    So armed with these definitions or armed with what we worked out to be the derivatives which is really just the slope of the cost function j
    3:23
    we can now plug them back in to our gradient descent algorithm. So here's gradient descent for linear regression which is gonna repeat until convergence, theta 0 and theta 1 get updated as you know this thing minus alpha times the derivative term.
    3:39
    So this term here.
    3:43
    So here's our linear regression algorithm.
    3:47
    This first term here.
    3:52
    That term is of course just the partial derivative with respect to theta zero that we worked out on a previous slide. And this second term here, that term is just the partial derivative with respect to theta 1 that we worked out on the previous slide. And just as a quick reminder, when implementing gradient descent, there's actually this detail, which is that you should update theta 0 and theta 1 simultaneously.
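
    As a rough Octave sketch of this simultaneous update (the toy data and variable names below are my own, not the lecture's):

    % Batch gradient descent for h(x) = theta0 + theta1*x, with simultaneous updates.
    x = [2104; 1416; 1534; 852];     % toy sizes (square feet)
    y = [460; 232; 315; 178];        % toy prices (thousands of dollars)
    m = length(y);
    alpha = 1e-7;  num_iters = 1000; % tiny learning rate because x is unscaled
    theta0 = 0;  theta1 = 0;
    for iter = 1:num_iters
      h     = theta0 + theta1 * x;                        % predictions on all m examples
      temp0 = theta0 - alpha * (1/m) * sum(h - y);        % uses the old theta0, theta1
      temp1 = theta1 - alpha * (1/m) * sum((h - y) .* x); % uses the old theta0, theta1
      theta0 = temp0;  theta1 = temp1;                    % update both at the same time
    end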
    4:24
    So. Let's see how gradient descent works. One of the issues we saw with gradient descent is that it can be susceptible to local optima. So when I first explained gradient descent, I showed you this picture of it going downhill on the surface, and we saw how, depending on where you initialize it, you can end up at different local optima. You will either wind up here or here. But it turns out that the cost function for linear regression is always going to be a bowl-shaped function like this. The technical term for this is that this is called a convex function.
    5:03
    And I'm not gonna give the formal definition for what is a convex function, C, O, N, V, E, X. But informally, a convex function means a bowl-shaped function, and so this function doesn't have any local optima except for the one global optimum. And gradient descent on this type of cost function, which you get whenever you're using linear regression, will always converge to the global optimum, because there are no other local optima besides the global optimum. So now let's see this algorithm in action.
    5:38
    As usual, here are plots of the hypothesis function and of my cost function J. And so let's say I've initialized my parameters at this value. Usually you would initialize your parameters at zero, zero, theta zero equals zero and theta one equals zero. But for the demonstration, in this particular presentation, I've initialized, you know, theta zero at about 900 and theta one at about -0.1, okay. And so this corresponds to h(x) = -900 - 0.1x [the intercept should be +900], which is this line, and this point out here on the cost function. Now, if we take one step of gradient descent, we end up going from this point out here, over to the down and left, to that second point over there. And you notice that my line changed a little bit, and as I take another step of gradient descent, my line on the left will change.
    6:41
    Right? And I've also moved to a new point on my cost function.
    6:47
    And as I take further steps of gradient descent, I'm going down in cost. So my parameters and such are following this trajectory.
    6:57
    And if you look on the left, this corresponds to hypotheses that seem to be getting better and better fits to the data
    7:08
    until eventually I've now wound up at the global minimum and this global minimum corresponds to this hypothesis, which gets me a good fit to the data.
    7:21
    And so that's gradient descent, and we've just run it and gotten a good fit to my data set of housing prices. And you can now use it to predict; you know, if your friend has a house of size 1250 square feet, you can now read off the value and tell them that, I don't know, maybe they could get $250,000 for their house. Finally, just to give this another name, it turns out that the algorithm that we just went over is sometimes called batch gradient descent. And it turns out in machine learning, I don't know, I feel like us machine learning people are not always great at giving names to algorithms. But the term batch gradient descent refers to the fact that in every step of gradient descent, we're looking at all of the training examples. So in gradient descent, when computing the derivatives, we're computing the sums [INAUDIBLE]. So every step of gradient descent we end up computing something like this that sums over our m training examples, and so the term batch gradient descent refers to the fact that we're looking at the entire batch of training examples. And again, it's really not a great name, but this is what machine learning people call it. And it turns out that there are sometimes other versions of gradient descent that are not batch versions, but instead do not look at the entire training set, and look at small subsets of the training set at a time. And we'll talk about those versions later in this course as well. But for now, using the algorithm we just learned about, or using batch gradient descent, you now know how to implement gradient descent for linear regression.
    9:05
    So that's linear regression with gradient descent. If you've seen advanced linear algebra before, so some of you may have taken a class in advanced linear algebra. You might know that there exists a solution for numerically solving for the minimum of the cost function j without needing to use an iterative algorithm like gradient descent. Later in this course we'll talk about that method as well that just solves for the minimum of the cost function j without needing these multiple steps of gradient descent. That other method is called the normal equations method. But in case you've heard of that method it turns out that gradient descent will scale better to larger data sets than that normal equation method. And now that we know about gradient descent we'll be able to use it in lots of different contexts and we'll use it in lots of different machine learning problems as well.
    9:55
    So congrats on learning about your first machine learning algorithm. We'll later have exercises in which we'll ask you to implement gradient descent and hopefully see these algorithms work for yourselves. But before that, I first want to tell you, in the next set of videos, about a generalization of the gradient descent algorithm that will make it much more powerful. And I guess I'll tell you about that in the next video.

    Reading: Gradient Descent For Linear Regression

    Gradient Descent For Linear Regression
    Note: [At 6:15 "h(x) = -900 - 0.1x" should be "h(x) = 900 - 0.1x"]

    When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function and modify the equation to:
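
    (The equation referenced above appears as an image in the original reading; the following is a reconstruction consistent with the lecture, where both updates are repeated until convergence:)

    \[\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x_{i}) - y_{i} \right)\]

    \[\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x_{i}) - y_{i} \right) x_{i}\]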

    where m is the size of the training set, \(\theta_0\) a constant that will be changing simultaneously with \(\theta_1\) and \(x_{i}, y_{i}\) are values of the given training set (data).

    Note that we have separated out the two cases for \(\theta_j\) into separate equations for \(\theta_0\) and \(\theta_1\); and that for \(\theta_1\) we are multiplying \(x_{i}\) at the end due to the derivative. The following is a derivation of \(\frac{\partial}{\partial \theta_j}J(\theta)\) for a single example:
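
    (The derivation referenced above is also an image in the original reading; a sketch of it for a single example, using the chain rule on the squared error, is:)

    \[\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2}\left( h_\theta(x) - y \right)^2 = \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j}\left( \sum_{k=0}^{n} \theta_k x_k - y \right) = \left( h_\theta(x) - y \right) x_j\]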

    The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis will become more and more accurate.

    So, this is simply gradient descent on the original cost function J. This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global, and no other local, optima; thus gradient descent always converges (assuming the learning rate α is not too large) to the global minimum. Indeed, J is a convex quadratic function. Here is an example of gradient descent as it is run to minimize a quadratic function.

    The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48,30). The x’s in the figure (joined by straight lines) mark the successive values of θ that gradient descent went through as it converged to its minimum.

    Reading: Lecture Slides

    Lecture2.pdf

    Linear Regression with Multiple Variables (Multivariate)

    What if your input has more than one value? In this module, we show how linear regression can be extended to accommodate multiple input features. We also discuss best practices for implementing linear regression.

    8 videos, 16 readings

    Reading: Setting Up Your Programming Assignment Environment

    Setting Up Your Programming Assignment Environment
    The Machine Learning course includes several programming assignments which you’ll need to finish to complete the course. The assignments require the Octave or MATLAB scientific computing languages.

    Octave is a free, open-source application available for many platforms. It has a text interface and an experimental graphical one.
    MATLAB is proprietary software, but a free trial license to MATLAB Online is being offered for the completion of this course.
    FAQ
    Does it cost money?
    While you’re taking the course, both software packages are available free of charge. Octave is distributed under the GNU Public License, which means that it is always free to download and distribute. MATLAB Online licenses are available for completing the programming assignments in the course only. For any other purposes (like your own work after you complete the course), MATLAB can be licensed to individuals or companies from Mathworks directly.

    Is there a difference in quality?
    There are several subtle differences between the two software packages. MATLAB may offer a smoother experience (especially for Mac users), contains a larger number of functions, and can be more robust to failure. However, the functions used in this course are available in both packages, and many students have successfully completed the course using either.

    How do I install one of them?
    See installation instructions for Windows, Mac OS X (10.10 Yosemite and 10.9 Mavericks), other Mac OS X, or GNU/Linux.

    Reading: Accessing MATLAB Online and Uploading the Exercise Files

    Access MATLAB Online
    Access to MATLAB Online is being provided by MathWorks for the duration of this course. MATLAB Online retains most of the features of the desktop program in a web-based interface. No download or installation is required, and the program can be accessed from any computer using a common web browser. Follow these steps to access MATLAB Online:

    1. Create a MathWorks account if you do not already have one.

    2. Click on the Machine Learning license link and provide your MathWorks account credentials if requested.

    3. Click on the blue 'Access MATLAB Online' button and provide your MathWorks account credentials.

    4. Bookmark https://matlab.mathworks.com/ for quicker access to MATLAB Online in the future.

    For help with access or technical issues, see the ‘MATLAB Help’ discussion forum which is monitored by MathWorks.

    Upload the Programming Exercises to MATLAB Online

    The programming exercises consist of code files, data files, and instructions. They are provided as compressed .zip files later in the course. (The first programming exercise is at the end of Week 2). Follow the instructions below to upload and unzip the exercises:

    1. Download the exercise .zip file to your computer.

    2. Log in to MATLAB Online, then drag and drop the exercise .zip file into your Current Folder (or use the ‘Upload’ button in the Home tab).*

    3. Enter and run the following command at the command line to unzip the exercise folder: unzip machine-learning-exn.zip; (replace ‘n’ with the exercise number).

    4. Confirm the exercise folder was unzipped correctly, then delete the zip file.

    5. Right-click the ‘machine-learning-exn’ folder, and select 'Remove from Path -> Selected Folder and Subfolders'.

    *DO NOT unzip the homework files on your computer and upload the files individually!

    Set your folder to the exercise folder

    To work on and submit the programming exercises, your Current Folder must be set to the exn exercise folder. To set your current folder to the exercise folder, right-click the exn exercise folder and select 'Open'. You should then see only the exercise files and 'lib' folder in your Current Folder window. Your MATLAB Online environment should look similar to the example below*:

    *The ex1.mlx Live Script (see below) and warmUpExercise.m function file have been opened in the above image for reference. The submit command has also been entered in the Command Window (you may see a warning the first time you submit a new exercise which can be ignored).

    MATLAB Live Scripts (Optional)

    MATLAB Online users can now use Live Scripts to complete the programming exercises. The Live Scripts combine the exercise instructions (e.g. ex1.pdf ) and exercise scripts (e.g. ex1.m, ex1_multi.m) into a single file (e.g. ex1.mlx ). The Live Scripts combine the rich text, images, and equations from the instructions with the executable code from the scripts. They offer a more convenient way to complete the exercises online, as well as improved handling of text and figure output, and interactive controls.

    The workflow for completing the programming assignments using the Live Scripts differs from the original instructions and the lecture videos. To complete programming assignments without the Live Scripts, ignore this section and follow the instructions provided on the programming assignment page and in the exercise files. To complete a programming assignment with the Live Scripts, follow the instructions below:

    Upload the Live Scripts to MATLAB Online

    1. Download the Live Script .zip folder to your computer:
      machine-learning-live-scripts.zip

    2. Rename the .zip file ‘machine-learning-live-scripts’. (Coursera adds a prefix to the .zip file name that must be removed prior to uploading to MATLAB Online).

    3. Upload the .zip file to MATLAB Online and unzip it using the command ‘unzip machine-learning-live-scripts.zip’

    Complete the programming exercises using Live Scripts

    1. When you reach a programming assignment in the course, upload and unzip the exercise folder as described in the section 'Upload the Programming Exercises to MATLAB Online' above.

    2. Move the Live Script for that exercise into the exercise file folder (e.g. move ex1.mlx into machine-learning-ex1/ex1). (See Fig 1.)

    3. Open the exercise file folder (e.g. right-click machine-learning-ex1/ex1 and select 'Open') and open the Live Script (e.g. right-click ex1.mlx and select 'Open'). The instructions in the Live Script will guide you through the exercise. In general, as you work through the exercises you will be prompted to:

    a. Confirm you are in the correct exercise folder and you have all the necessary files. (See Fig 2.)

    b. Execute sections to load, format, and visualize data.

    c. Open, complete, and save function files, then execute sections to call your functions. (See Fig 3.)

    d. Confirm your output is correct and submit your functions for assessment by entering the submit command in the command window. (See Fig 4.)

    Please use the pinned thread ‘Live Script Help’ in the ‘MATLAB Help’ discussion forum to provide feedback, seek technical help, or report issues with the Live Script exercise files. All other questions about the programming exercises and course material should be directed to the discussion forum for appropriate course week.

    Figure 1: Copy the Live Script into the machine-learning-exn/exn exercise folder.

    Figure 2: Open the Live Script and follow the instructions. Confirm you are in the correct folder and all files are present. There is no need to enter code in the Command Window except when submitting. You should execute code within the Live Script using the Run Section button, which will run the code in the current section.

    Figure 3: Follow the instructions in the Live Script. When prompted, open and complete function files and/or run code sections in the Live Script.

    Figure 4: When prompted, submit your function files for assessment. Enter the submit command in the Command Window. Then provide or confirm your Coursera email and assignment token (found in the assignment course page).

    Reading: Installing Octave on Windows

    Installing Octave on Windows

    Use this link to install Octave for Windows: http://wiki.octave.org/Octave_for_Microsoft_Windows

    Octave on Windows can be used to submit programming assignments in this course but will likely need a patch provided in the discussion forum. Refer to https://www.coursera.org/learn/machine-learning/discussions/vgCyrQoMEeWv5yIAC00Eog? for more information about the patch for your version.

    "Warning: Do not install Octave 4.0.0"; checkout the "Resources" menu's section of "Installation Issues".

    Reading: Installing Octave on Mac OS X (10.10 Yosemite and 10.9 Mavericks and Later)

    Installing Octave on Mac OS X (10.10 Yosemite and 10.9 Mavericks)
    Mac OS X has a feature called Gatekeeper that may only let you install applications from the Mac App Store. You may need to configure it to allow the Octave installer. Visit your System Preferences, click Security & Privacy, and check the setting to allow apps downloaded from Anywhere. You may need to enter your password to unlock the settings page.

    1. Download the Octave 3.8.0 installer or the latest version that isn't 4.0.0. The file is large so this may take some time.

    2. Open the downloaded image, probably named GNU_Octave_3.8.0-6.dmg on your computer, and then open Octave-3.8.0-6.mpkg inside.

    3. Follow the installer’s instructions. You may need to enter the administrator password for your computer.

    4. After the installer completes, Octave should be installed on your computer. You can find Octave-cli in your Mac’s Applications, which is a text interface for Octave that you can use to complete Machine Learning’s programming assignments.

    Octave also includes an experimental graphical interface which is called Octave-gui, also in your Mac’s Applications, but we recommend using Octave-cli because it’s more stable.

    Note: If you use a package manager (like MacPorts or Homebrew), we recommend you follow the package manager installation instructions.

    "Warning: Do not install Octave 4.0.0"; checkout the "Resources" menu's section of "Installation Issues".

    Reading: Installing Octave on Mac OS X (10.8 Mountain Lion and Earlier)

    Installing Octave on Mac OS X (10.8 Mountain Lion and Earlier)

    If you use Mac OS X 10.9, we recommend following the instructions above. For other Mac OS X versions, the Octave project doesn’t distribute installers. We recommend installing Homebrew, a package manager, using their instructions.

    "Warning: Do not install Octave 4.0.0"; checkout the "Resources" menu's section of "Installation Issues".

    Reading: Installing Octave on GNU/Linux

    Installing Octave on GNU/Linux

    We recommend using your system package manager to install Octave.

    On Ubuntu, you can use:

    sudo apt-get update && sudo apt-get install octave
    On Fedora, you can use:

    sudo yum install octave-forge

    Please consult the Octave maintainer’s instructions for other GNU/Linux systems.

    "Warning: Do not install Octave 4.0.0"; checkout the "Resources" menu's section of "Installation Issues".

    Reading: More Octave/MATLAB resources

    Octave Resources
    At the Octave command line, typing help followed by a function name displays documentation for a built-in function. For example, help plot will bring up help information for plotting. Further documentation can be found at the Octave documentation pages.

    MATLAB Resources
    At the MATLAB command line, typing help followed by a function name displays documentation for a built-in function. For example, help plot will bring up help information for plotting. Further documentation can be found at the MATLAB documentation pages.

    Introduction to MATLAB with Onramp
    Made for MATLAB beginners or those looking for a quick refresh, the MATLAB Onramp is a 1-2 hour interactive introduction to the basics of MATLAB programming. Octave users are also welcome to use Onramp (requires creation of a free MathWorks account). To access Onramp:

    1. If you don’t already have one, create a MathWorks account at: https://www.mathworks.com/mwaccount/register

    2. Go to: https://matlabacademy.mathworks.com/ and click on the MATLAB Onramp button to start learning MATLAB!

    MATLAB Programming Tutorials
    These short tutorial videos introduce MATLAB and cover various programming topics used in the assignments. Feel free to watch some now and return to reference them as you work through the programming assignments. Many of the topics below are also covered in MATLAB Onramp. *Indicates content covered in Onramp.

    Get Started with MATLAB and MATLAB Online

    Vectors

    Visualization

    Matrices

    MATLAB Programming

    Troubleshooting

    Video: Multiple Features

    In this video we will start to talk about a new version of linear regression that's more powerful. One that works with multiple variables
    0:08
    or with multiple features.
    0:10
    Here's what I mean.
    0:12
    In the original version of linear regression that we developed, we have a single feature x, the size of the house, and we wanted to use that to predict y, the price of the house, and this was
    0:25
    our form of our hypothesis.
    0:28
    But now imagine, what if we had not only the size of the house as a feature or as a variable with which to try to predict the price, but we also knew the number of bedrooms, the number of floors, and the age of the home in years. It seems like this would give us a lot more information with which to predict the price.
    0:47
    To introduce a little bit of notation, we sort of started to talk about this earlier, I'm going to use the variables X subscript 1 X subscript 2 and so on to denote my, in this case, four features and I'm going to continue to use Y to denote the variable, the output variable price that we're trying to predict.
    1:11
    Let's introduce a little bit more notation.
    1:13
    Now that we have four features
    1:16
    I'm going to use lowercase "n"
    1:19
    to denote the number of features. So in this example we have n equals 4, because we have, you know, one, two, three, four features.
    1:28
    And "n" is different from our earlier notation where we were using "n" to denote the number of examples. So if you have 47 rows "M" is the number of rows on this table or the number of training examples.
    1:45
    So I'm also going to use X superscript "i" to denote the input features of the "i"th training example.
    1:55
    As a concrete example, let's say X2 is going to be a vector of the features for my second training example. And so X2 here is going to be the vector 1416, 3, 2, 40, since those are my four features that I have
    2:17
    to try to predict the price of the second house.
    2:20
    So, in this notation, the
    2:24
    superscript 2 here.
    2:26
    That's an index into my training set. This is not X to the power of 2. Instead, this is, you know, an index that says look at the second row of this table. This refers to my second training example.
    2:39
    With this notation, X2 is a four-dimensional vector. In fact, more generally, this is an n-dimensional feature vector.
    2:51
    With this notation, X2 is now a vector, and so I'm also going to use X superscript i subscript j to denote the value
    3:02
    of feature number j in the ith training example.
    3:07
    So concretely, X2 subscript 3 will refer to feature number three in the X2 vector, which is equal to 2, right? That was a 3 over there; just fixed my handwriting. So X2 subscript 3 is going to be equal to 2.
    3:26
    Now that we have multiple features,
    3:29
    let's talk about what the form of our hypothesis should be. Previously this was the form of our hypothesis, where x was our single feature, but now that we have multiple features, we aren't going to use the simple representation any more.
    3:44
    Instead, a form of the hypothesis in linear regression
    3:49
    is going to be this: h of x will be theta 0 plus theta 1 x1 plus theta 2 x2 plus theta 3 x3
    3:58
    plus theta 4 X4. And if we have N features then rather than summing up over our four features, we would have a sum over our N features.
    4:08
    Concretely for a particular
    4:11
    setting of our parameters we may have H of
    4:17
    X equals 80 + 0.1 X1 + 0.01 X2 + 3 X3 - 2 X4. This would be one
    4:25
    example of a hypothesis, and you remember a hypothesis is trying to predict the price of the house in thousands of dollars, just saying that, you know, the base price of a house is maybe 80,000, plus another 0.1, so that's an extra, what, hundred dollars per square foot, yeah, plus the price goes up a little bit for each additional floor that the house has, X two is the number of floors, and it goes up further for each additional bedroom the house has, because X three was the number of bedrooms, and the price goes down a little bit with each additional year of the age of the house.
    5:08
    Here's the form of a hypothesis rewritten on the slide. And what I'm gonna do is introduce a little bit of notation to simplify this equation.
    5:17
    For convenience of notation, let me define x subscript 0 to be equal to one.
    5:23
    Concretely, this means that for every example i, I have a feature vector X superscript i, and X superscript i subscript 0 is going to be equal to 1. You can think of this as defining an additional zeroth feature. So whereas previously I had n features, namely x1, x2 through xn, I'm now defining an additional sort of zeroth
    5:47
    feature vector that always takes on the value of one.
    5:52
    So now my feature vector X becomes this N+1 dimensional
    5:58
    vector that is zero-indexed.
    6:02
    So this is now an n+1-dimensional feature vector, but I'm gonna index it from 0, and I'm also going to think of my parameters as a vector. So, our parameters here, right, that would be our theta zero, theta one, theta two, and so on, all the way up to theta n. We're going to gather them up into a parameter vector, written theta 0, theta 1, theta 2, and so on, down to theta n. This is another zero-indexed vector; it's indexed from zero.
    6:32
    That is another n plus 1 dimensional vector.
    6:37
    So, my hypothesis can now be written theta 0 x0 plus theta 1 x1 plus up to theta n xn.
    6:48
    And this equation is the same as this on top because, you know, x zero is equal to one.
    6:58
    And I can now take this form of the hypothesis and write it as theta transpose x. Depending on how familiar you are with inner products of vectors, if you write out what theta transpose x is, theta transpose is this: theta zero, theta one, up to theta n. So this thing here is theta transpose, and this is actually an n plus one by one matrix [it should be a 1 by (n+1) matrix]. It's also called a row vector
    7:34
    and you take that and multiply it with the vector X which is X zero, X one, and so on, down to X n.
    7:43
    And so, the inner product, that is theta transpose X, is just equal to this. This gives us a convenient way to write the form of the hypothesis as just the inner product between our parameter vector theta and our feature vector X. And it is this little bit of notation, this notational convention, that lets us write this in this compact form. So that's the form of the hypothesis when we have multiple features. And, just to give this another name, this is also called multivariate linear regression.
    8:15
    And the term multivariate, that's just maybe a fancy term for saying we have multiple features, or multiple variables, with which to try to predict the value Y.
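
    As a small Octave sketch of this inner-product form of the hypothesis (the parameter values are the ones from the example hypothesis in this video, and the feature values are the second training example from the table; combining them this way is my own illustration):

    % Hypothesis as an inner product: h(x) = theta' * x, with x0 = 1 prepended.
    theta = [80; 0.1; 0.01; 3; -2];   % example parameters: 80 + 0.1*x1 + 0.01*x2 + 3*x3 - 2*x4
    x     = [1; 1416; 3; 2; 40];      % x0 = 1, then the four features of one house
    h     = theta' * x                % predicted price in thousands (displayed, no semicolon)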

    Reading: Multiple Features

    Multiple Features
    Note: [7:25 - \(\theta^T\) is a 1 by (n+1) matrix and not an (n+1) by 1 matrix]

    Linear regression with multiple variables is also known as "multivariate linear regression".

    We now introduce notation for equations where we can have any number of input variables.

    The multivariable form of the hypothesis function accommodating these multiple features is as follows:

    \[h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n\]

    In order to develop intuition about this function, we can think about \(\theta_0\) as the basic price of a house, \(\theta_1\) as the price per square meter, \(\theta_2\) as the price per floor, etc. \(x_1\) will be the number of square meters in the house, \(x_2\) the number of floors, etc.

    Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:
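
    (The equation referenced above is an image in the original reading; a reconstruction is:)

    \[h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x\]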

    This is a vectorization of our hypothesis function for one training example; see the lessons on vectorization to learn more.

    Remark: Note that for convenience reasons in this course we assume \(x_{0}^{(i)} = 1 \text{ for } i \in \{ 1,\dots, m \}\). This allows us to do matrix operations with theta and x. Hence making the two vectors '\(\theta\)' and \(x^{(i)}\) match each other element-wise (that is, have the same number of elements: n+1).

    Video: Gradient Descent for Multiple Variables

    In the previous video, we talked about the form of the hypothesis for linear regression with multiple features or with multiple variables. In this video, let's talk about how to fit the parameters of that hypothesis. In particular let's talk about how to use gradient descent for linear regression with multiple features. To quickly summarize our notation, this is our formal hypothesis in multivariable linear regression where we've adopted the convention that x0=1. The parameters of this model are theta0 through theta n, but instead of thinking of this as n separate parameters, which is valid, I'm instead going to think of the parameters as theta where theta here is a n+1-dimensional vector. So I'm just going to think of the parameters of this model as itself being a vector. Our cost function is J of theta0 through theta n which is given by this usual sum of square of error term. But again instead of thinking of J as a function of these n+1 numbers, I'm going to more commonly write J as just a function of the parameter vector theta so that theta here is a vector. Here's what gradient descent looks like. We're going to repeatedly update each parameter theta j according to theta j minus alpha times this derivative term. And once again we just write this as J of theta, so theta j is updated as theta j minus the learning rate alpha times the derivative, a partial derivative of the cost function with respect to the parameter theta j. Let's see what this looks like when we implement gradient descent and, in particular, let's go see what that partial derivative term looks like. Here's what we have for gradient descent for the case of when we had N=1 feature. We had two separate update rules for the parameters theta0 and theta1, and hopefully these look familiar to you. And this term here was of course the partial derivative of the cost function with respect to the parameter of theta0, and similarly we had a different update rule for the parameter theta1. There's one little difference which is that when we previously had only one feature, we would call that feature x(i) but now in our new notation we would of course call this x(i)1 to denote our one feature. So that was for when we had only one feature. Let's look at the new algorithm for we have more than one feature, where the number of features n may be much larger than one. We get this update rule for gradient descent and, maybe for those of you that know calculus, if you take the definition of the cost function and take the partial derivative of the cost function J with respect to the parameter theta j, you'll find that that partial derivative is exactly that term that I've drawn the blue box around. And if you implement this you will get a working implementation of gradient descent for multivariate linear regression. The last thing I want to do on this slide is give you a sense of why these new and old algorithms are sort of the same thing or why they're both similar algorithms or why they're both gradient descent algorithms. Let's consider a case where we have two features or maybe more than two features, so we have three update rules for the parameters theta0, theta1, theta2 and maybe other values of theta as well. If you look at the update rule for theta0, what you find is that this update rule here is the same as the update rule that we had previously for the case of n = 1. 
And the reason that they are equivalent is, of course, because in our notational convention we had this x(i)0 = 1 convention, which is why these two term that I've drawn the magenta boxes around are equivalent. Similarly, if you look the update rule for theta1, you find that this term here is equivalent to the term we previously had, or the equation or the update rule we previously had for theta1, where of course we're just using this new notation x(i)1 to denote our first feature, and now that we have more than one feature we can have similar update rules for the other parameters like theta2 and so on. There's a lot going on on this slide so I definitely encourage you if you need to to pause the video and look at all the math on this slide slowly to make sure you understand everything that's going on here. But if you implement the algorithm written up here then you have a working implementation of linear regression with multiple features.
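
    A hedged Octave sketch of this multivariate update in vectorized form follows (the design matrix, learning rate, and iteration count are toy values of my own; X has a leading column of ones for the x0 = 1 convention):

    % Gradient descent for multivariate linear regression, vectorized.
    X = [1 2104 5; 1 1416 3; 1 1534 3; 1 852 2];  % m-by-(n+1) design matrix (toy values)
    y = [460; 232; 315; 178];                     % m-by-1 vector of targets (toy values)
    m = length(y);
    theta = zeros(size(X, 2), 1);                 % (n+1)-by-1 parameter vector
    alpha = 1e-7;  num_iters = 400;               % tiny alpha because the features are unscaled
    for iter = 1:num_iters
      grad  = (1/m) * X' * (X * theta - y);       % all partial derivatives at once
      theta = theta - alpha * grad;               % simultaneous update of every theta_j
    end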

    Reading: Gradient Descent For Multiple Variables

    Gradient Descent For Multiple Variables

    The gradient descent equation itself is generally the same form; we just have to repeat it for our 'n' features:
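
    (The equation referenced above is an image in the original reading; reconstructed, the rule is repeated until convergence:)

    \[\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \qquad \text{for } j = 0, \dots, n\]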

    In other words:
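
    (Again reconstructing the missing image, the same rule written out per parameter is:)

    \[\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}\]

    \[\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_1^{(i)}\]

    \[\theta_2 := \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_2^{(i)}\]

    \[\cdots\]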

    The following image compares gradient descent with one variable to gradient descent with multiple variables:

    Video: Gradient Descent in Practice I - Feature Scaling

    In this video and in the video after this one, I wanna tell you about some of the practical tricks for making gradient descent work well. In this video, I want to tell you about an idea called feature scaling.
    0:11
    Here's the idea. If you have a problem where you have multiple features, if you make sure that the features are on a similar scale, by which I mean make sure that the different features take on similar ranges of values,
    0:24
    then gradient descents can converge more quickly.
    0:27
    Concretely, let's say you have a problem with two features, where X1 is the size of the house and takes on values between, say, zero and two thousand, and X2 is the number of bedrooms, and maybe that takes on values between one and five. If you plot the contours of the cost function J of theta,
    0:44
    then the contours may look like this, where, let's see, J of theta is a function of parameters theta zero, theta one and theta two. I'm going to ignore theta zero, so let's forget about theta 0 and pretend J is a function of only theta 1 and theta 2. But if x1 can take on a, you know, much larger range of values than x2, it turns out that the contours of the cost function J of theta
    1:09
    can take on this very, very skewed elliptical shape, except that with the 2000 to 5 ratio, it can be even more skewed. So these very, very tall and skinny ellipses, or these very tall skinny ovals, can form the contours of the cost function J of theta.
    1:29
    And if you run gradient descent on this cost function, your gradient descent may end up oscillating back and forth and taking a long time before it can finally find its way to the global minimum.
    1:47
    In fact, you can imagine if these contours are exaggerated even more when you draw incredibly skinny, tall skinny contours,
    1:56
    and if it's even more extreme than that, then gradient descent can just have a much harder time, meandering around; it can take a long time to find its way to the global minimum.
    2:12
    In these settings, a useful thing to do is to scale the features.
    2:17
    Concretely, if you instead define the feature X one to be the size of the house divided by two thousand, and define X two to be maybe the number of bedrooms divided by five, then the contours of the cost function J can become
    2:32
    much, much less skewed, so the contours may look more like circles.
    2:38
    And if you run gradient descent on a cost function like this, then gradient descent,
    2:44
    you can show mathematically, you can find a much more direct path to the global minimum rather than taking a much more convoluted path where you're sort of trying to follow a much more complicated trajectory to get to the global minimum.
    2:57
    So, by scaling the features so that they take on similar ranges of values (in this example, we end up with both features, X one and X two, between zero and one),
    3:09
    you can wind up with an implementation of gradient descent that can converge much faster.
    3:18
    More generally, when we're performing feature scaling, what we often want to do is get every feature into approximately a -1 to +1 range and concretely, your feature x0 is always equal to 1. So, that's already in that range,
    3:34
    but you may end up dividing other features by different numbers to get them to this range. The numbers -1 and +1 aren't too important. So, if you have a feature,
    3:44
    x1 that winds up being between zero and three, that's not a problem. If you end up having a different feature that winds up being between -2 and +0.5, again, this is close enough to minus one and plus one that, you know, that's fine.
    4:00
    It's only if you have a different feature, say X 3, that
    4:05
    ranges from -100 to +100, then this is taking on a very different range of values than minus 1 and plus 1. So this might be a less well-scaled feature, and similarly, if your features take on a very, very small range of values, so if X 4 takes on values between minus 0.0001 and positive 0.0001, then
    4:29
    again this takes on a much smaller range of values than the minus one to plus one range. And again I would consider this feature poorly scaled.
    4:37
    So the range of values, you know, can be bigger than plus one or smaller than plus one, but just not much bigger, like plus 100 here, or much smaller, like 0.0001 over there. Different people have different rules of thumb. But the one that I use is that if a feature takes on a range of values from, say, minus three to plus three, I think that should be just fine, but if it takes on much larger values than plus three or minus three, I might start to worry. And if it takes on values from, say, minus one-third to one-third.
    5:10
    You know, I think that's fine too, or zero to one-third, or minus one-third to zero; I guess these are all typical ranges of values around zero, okay. But if it takes on a much, much tinier range of values, like X 4 here, then again I'd start to worry. So, the take-home message is, don't worry if your features are not exactly on the same scale or exactly in the same range of values; so long as they're all close enough to this, gradient descent should work okay. In addition to dividing by the maximum value when performing feature scaling, sometimes people will also do what's called mean normalization. And what I mean by that is that you want to take a feature Xi and replace it with Xi minus mu i
    5:52
    to make your features have approximately 0 mean.
    5:56
    And obviously we wouldn't apply this to the feature x zero, because the feature x zero is always equal to one, so it cannot have an average value of zero.
    6:06
    But concretely, for other features, if the range of sizes of the house takes on values between 0 and 2000, and if, you know, the average size of a house is equal to 1000, then you might
    6:21
    use this formula.
    6:23
    Set the feature X1 to be the size minus the average value, divided by 2000. And similarly, if your houses have one to five bedrooms, and if
    6:39
    on average a house has two bedrooms then you might use this formula to mean normalize your second feature x2.
    6:49
    In both of these cases, you therefore wind up with features x1 and x2 that can take on values roughly between minus 0.5 and positive 0.5. That's not exactly true (X2 can actually be slightly larger than 0.5), but close enough. And the more general rule is that you might take a feature X1 and replace
    7:08
    it with X1 minus mu1 over S1 where to define these terms mu1 is the average value of x1
    7:19
    in the training sets
    7:22
    and S1 is the range of values of that feature, and by range, I mean, let's say, the maximum value minus the minimum value. Or, for those of you that understand the standard deviation of the variable, setting S1 to be the standard deviation of the variable would be fine, too. But taking, you know, this max minus min would be fine.
    7:44
    And similarly for the second feature, x2, you replace x2 with this sort of
    7:51
    subtract the mean of the feature and divide by the range of values, meaning the max minus min. And this sort of formula will get your features, you know, maybe not exactly, but maybe roughly into these sorts of ranges. And by the way, for those of you that are being super careful, technically, if we're taking the range as max minus min, this five here will actually become a four. So if the max is 5 and the min is 1, then the range of the values is actually equal to 4. But all of these are approximate, and any value that gets the features into anything close to these sorts of ranges will do fine. And the feature scaling doesn't have to be too exact in order to get gradient descent to run quite a lot faster.
    8:34
    So, now you know about feature scaling, and if you apply this simple trick, you can make gradient descent run much faster and converge in a lot fewer iterations.
    8:44
    That was feature scaling. In the next video, I'll tell you about another trick to make gradient descent work well in practice.
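
    Here is a short Octave sketch of the two tricks just described, feature scaling plus mean normalization (the feature matrix is a toy example of my own; note the course's programming exercises divide by the standard deviation rather than the range):

    % Mean-normalize and scale each feature: x_j := (x_j - mu_j) / s_j.
    X  = [2104 5; 1416 3; 1534 3; 852 2];    % toy features: size, number of bedrooms
    mu = mean(X);                            % row vector of per-feature means
    s  = max(X) - min(X);                    % per-feature ranges; std(X) would also work
    X_norm = (X - mu) ./ s;                  % each column now lies roughly in [-0.5, 0.5]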

    Reading: Gradient Descent in Practice I - Feature Scaling

    Gradient Descent in Practice I - Feature Scaling
    Note: [6:20 - The average size of a house is 1000 but 100 is accidentally written instead]

    We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.

    The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:

    −1 ≤ \(x_{(i)}\) ≤ 1

    or

    −0.5 ≤ \(x_{(i)}\) ≤ 0.5

    These aren't exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few.

    Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula:

    \[x_i := \dfrac{x_i - \mu_i}{s_i}\]

    Where \(\mu_i\) is the average of all the values for feature \(i\) and \(s_i\) is the range of values (max - min), or \(s_i\) is the standard deviation.

    Note that dividing by the range, or dividing by the standard deviation, give different results. The quizzes in this course use range - the programming exercises use standard deviation.

    For example, if \(x_i\) represents housing prices with a range of 100 to 2000 and a mean value of 1000, then, \(x_i := \dfrac{price-1000}{1900}\).

    Video: Gradient Descent in Practice II - Learning Rate

    In this video, I want to give you more practical tips for getting gradient descent to work. The ideas in this video will center around the learning rate alpha.
    0:09
    Concretely, here's the gradient descent update rule. And what I want to do in this video is tell you about what I think of as debugging, and some tips for making sure that gradient descent is working correctly. And second, I wanna tell you how to choose the learning rate alpha or at least how I go about choosing it. Here's something that I often do to make sure that gradient descent is working correctly. The job of gradient descent is to find the value of theta for you that hopefully minimizes the cost function J(theta). What I often do is therefore plot the cost function J(theta) as gradient descent runs. So the x axis here is a number of iterations of gradient descent and as gradient descent runs you hopefully get a plot that maybe looks like this.
    0:59
    Notice that the x axis is number of iterations. Previously we were looking at plots of J(theta) where the x axis, the horizontal axis, was the parameter vector theta, but this is not what this is. Concretely, what this point is, is I'm going to run gradient descent for 100 iterations, and whatever value I get for theta after 100 iterations, I'm going to evaluate the cost function J(theta) for the value of theta I get after 100 iterations. And this vertical height is the value of J(theta) for the value of theta I got after 100 iterations of gradient descent. And this point here corresponds to the value of J(theta) for the theta that I get after I've run gradient descent for 200 iterations.
    1:55
    So what this plot is showing is, it's showing the value of your cost function after each iteration of gradient descent. And if gradient descent is working properly, then J(theta) should decrease after every iteration.
    2:17
    And one useful thing that this sort of plot can tell you also is that if you look at the specific figure that I've drawn, it looks like by the time you've gotten out to maybe 300 iterations, between 300 and 400 iterations, in this segment it looks like J(theta) hasn't gone down much more. So by the time you get to 400 iterations, it looks like this curve has flattened out here. And so way out here 400 iterations, it looks like gradient descent has more or less converged because your cost function isn't going down much more. So looking at this figure can also help you judge whether or not gradient descent has converged.
    2:57
    By the way, the number of iterations that gradient descent takes to converge for a particular application can vary a lot. So maybe for one application, gradient descent may converge after just thirty iterations; for a different application, gradient descent may take 3,000 iterations; for another learning algorithm, it may take 3 million iterations. It turns out to be very difficult to tell in advance how many iterations gradient descent needs to converge, and it's usually by plotting this sort of plot, plotting the cost function as we increase the number of iterations, it's usually by looking at these plots that I try to tell if gradient descent has converged. It's also possible to come up with an automatic convergence test, namely to have an algorithm try to tell you if gradient descent has converged. And here's maybe a pretty typical example of an automatic convergence test: such a test may declare convergence if your cost function J(theta) decreases by less than some small value epsilon, some small value 10 to the minus 3, in one iteration. But I find that usually choosing what this threshold is is pretty difficult. And so in order to check whether your gradient descent has converged, I actually tend to look at plots like these, like this figure on the left, rather than rely on an automatic convergence test. Looking at this sort of figure can also tell you, or give you an advance warning, if maybe gradient descent is not working correctly. Concretely, if you plot J(theta) as a function of the number of iterations, then if you see a figure like this, where J(theta) is actually increasing, that gives you a clear sign that gradient descent is not working. And a plot like this usually means that you should be using a smaller learning rate alpha.
    4:48
    If J(theta) is actually increasing, the most common cause for that is if you're trying to minimize a function, that maybe looks like this.
    4:59
    But if your learning rate is too big, then if you start off there, gradient descent may overshoot the minimum and send you there. And if the learning rate is too big, you may overshoot again and it sends you there, and so on. So what you really wanted was for it to start here and slowly go downhill, right? But if the learning rate is too big, then gradient descent can instead keep on overshooting the minimum, so that you actually end up getting worse and worse, ending up with higher and higher values of the cost function J(theta). So you end up with a plot like this, and if you see a plot like this, the fix is usually just to use a smaller value of alpha. Oh, and also, of course, make sure your code doesn't have a bug in it. But usually too large a value of alpha could be a common problem.
    5:49
    Similarly sometimes you may also see J(theta) do something like this, it may go down for a while then go up then go down for a while then go up go down for a while go up and so on. And a fix for something like this is also to use a smaller value of alpha.
    6:04
    I'm not going to prove it here, but under certain assumptions about the cost function J that do hold true for linear regression, mathematicians have shown that if your learning rate alpha is small enough, then J(theta) should decrease on every iteration. So if this doesn't happen, it probably means alpha's too big and you should set it smaller. But of course, you also don't want your learning rate to be too small, because if you do that, then gradient descent can be slow to converge.
    6:31
    And if alpha were too small, you might end up starting out here, say, and end up taking just minuscule baby steps, and taking a lot of iterations before you finally get to the minimum. And so if alpha is too small, gradient descent can make very slow progress and be slow to converge. To summarize, if the learning rate is too small, you can have a slow convergence problem, and if the learning rate is too large, J(theta) may not decrease on every iteration and it may not even converge. In some cases, if the learning rate is too large, slow convergence is also possible, but the more common problem you see is just that J(theta) may not decrease on every iteration. And in order to debug all of these things, often plotting J(theta) as a function of the number of iterations can help you figure out what's going on. Concretely, what I actually do when I run gradient descent is I try a range of values. So just try running gradient descent with a range of values for alpha, like 0.001 and 0.01; these are factor-of-ten differences. And for these different values of alpha, just plot J(theta) as a function of the number of iterations, and then pick the value of alpha that seems to be causing J(theta) to decrease rapidly. In fact, what I actually do isn't these steps of ten. So this is a scale factor of ten for each step up; what I actually do is try this range of values.
    8:06
    And so on, where this is 0.001. I'll then increase the learning rate threefold to get 0.003, and then this step up, from 0.003 to 0.01, is another roughly threefold increase. So I'm roughly trying out gradient descent with each value being about 3x bigger than the previous value. What I'll do is try a range of values until I've found one value that's too small and made sure I've found one value that's too large, and then I'll pick the largest possible value, or just something slightly smaller than the largest reasonable value that I found. When I do that, it usually gives me a good learning rate for my problem. And if you do this too, maybe you'll be able to choose a good learning rate for your implementation of gradient descent.
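    As a rough, hypothetical sketch of this procedure in Octave (the toy data, variable names, and particular alpha values below are assumptions, not course-provided code), you could run gradient descent for several learning rates and plot J(theta) against the iteration number for each:

        % Toy data: X has the extra column of ones, y is roughly linear in x.
        X = [ones(10, 1), (1:10)'];
        y = 2 + 3 * (1:10)';
        m = length(y);
        alphas = [0.001 0.003 0.01 0.03];   % roughly 3x steps, as in the video
        num_iters = 100;
        hold on;
        for a = alphas
          theta = zeros(2, 1);
          J_hist = zeros(num_iters, 1);
          for it = 1:num_iters
            theta = theta - (a / m) * (X' * (X * theta - y));       % one gradient descent step
            J_hist(it) = (1 / (2 * m)) * sum((X * theta - y) .^ 2); % cost after the update
          end
          plot(1:num_iters, J_hist);
        end
        xlabel('Number of iterations'); ylabel('J(theta)');
        legend('0.001', '0.003', '0.01', '0.03');
        hold off;

    The curve that drops fastest and settles at the lowest value suggests a good choice of alpha; a curve that blows up suggests alpha is too large.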

    Reading: Gradient Descent in Practice II - Learning Rate

    Gradient Descent in Practice II - Learning Rate
    Note: [5:20 - the x-axis label in the right graph should be (\theta) rather than No. of iterations]

    Debugging gradient descent. Make a plot with number of iterations on the x-axis. Now plot the cost function, J(θ), over the number of iterations of gradient descent. If J(θ) ever increases, then you probably need to decrease (\alpha).

    Automatic convergence test. Declare convergence if J(θ) decreases by less than E in one iteration, where E is some small value such as (10^{−3}). However in practice it's difficult to choose this threshold value.

    It has been proven that if learning rate (\alpha) is sufficiently small, then J(θ) will decrease on every iteration.

    To summarize:

    If (\alpha) is too small: slow convergence.

    If (\alpha) is too large: may not decrease on every iteration and thus may not converge.
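    A minimal Octave sketch of the automatic convergence test described above, on a made-up data set; the learning rate and the threshold E (here the 10^-3 example from the text) are assumptions:

        X = [ones(5, 1), (1:5)'];  y = [2; 4; 6; 8; 10];
        m = length(y);  alpha = 0.05;  epsilon = 1e-3;
        theta = zeros(2, 1);  J_prev = Inf;
        for it = 1:5000
          theta = theta - (alpha / m) * (X' * (X * theta - y));
          J = (1 / (2 * m)) * sum((X * theta - y) .^ 2);
          if abs(J_prev - J) < epsilon   % declare convergence once J changes by less than epsilon
            break;
          end
          J_prev = J;
        end

    In practice, as the note above says, picking the threshold is the hard part, which is why plotting J(θ) is usually preferred.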

    Video: Features and Polynomial Regression

    You now know about linear regression with multiple variables. In this video, I want to tell you a bit about the choice of features that you have and how you can get different learning algorithms, sometimes very powerful ones, by choosing appropriate features. In particular I also want to tell you about polynomial regression, which allows you to use the machinery of linear regression to fit very complicated, even very non-linear functions. Let's take the example of predicting the price of a house. Suppose you have two features, the frontage of the house and the depth of the house. So here's a picture of the house we're trying to sell. The frontage is basically the width of the lot you own, and the depth is how deep your property goes; so there's a frontage and there's a depth. You might build a linear regression model like this, where frontage is your first feature x1 and depth is your second feature x2, but when you're applying linear regression, you don't necessarily have to use just the features x1 and x2 that you're given. What you can do is actually create new features by yourself. So, if I want to predict the price of a house, what I might do instead is decide that what really determines the size of the house is the area of the land that I own. So I might create a new feature, which I'm just going to call x, which is frontage times depth, because that's the land area that I own, and I might then select my hypothesis using just that one feature, my land area, right? Because the area of a rectangle is the product of its two side lengths. So, depending on what insight you might have into a particular problem, rather than just taking the features x1 and x2 that we happened to start off with, sometimes by defining new features you might actually get a better model. Closely related to the idea of choosing your features is this idea called polynomial regression. Let's say you have a housing price data set that looks like this. Then there are a few different models you might fit to it. One thing you could do is fit a quadratic model: it doesn't look like a straight line fits this data very well, so maybe you want a quadratic model where you think the price is a quadratic function of the size, and maybe that'll give you a fit to the data that looks like that. But then you may decide that your quadratic model doesn't make sense, because a quadratic function eventually comes back down, and we don't think housing prices should go down when the size goes up too high. So then maybe we might choose a different polynomial model and choose to use instead a cubic function, where we now have a third-order term, and when we fit that, maybe we get this sort of model, and maybe the green line is a somewhat better fit to the data because it doesn't eventually come back down. So how do we actually fit a model like this to our data? Using the machinery of multivariate linear regression, we can do this with a pretty simple modification to our algorithm. We know the hypothesis has this form, where we say h of x is theta 0 plus theta 1 x1 plus theta 2 x2 plus theta 3 x3.
And if we want to fit the cubic model that I have boxed in green, what we're saying is that to predict the price of a house, it's theta 0 plus theta 1 times the size of the house, plus theta 2 times the squared size of the house, so this term is equal to that term, and then plus theta 3 times the cube of the size of the house, which gives us that third term. In order to map these two definitions to each other, the natural way to do it is to set the first feature x1 to be the size of the house, set the second feature x2 to be the square of the size of the house, and set the third feature x3 to be the cube of the size of the house. Just by choosing my three features this way and applying the machinery of linear regression, I can fit this model and end up with a cubic fit to my data. I just want to point out one more thing, which is that if you choose your features like this, then feature scaling becomes increasingly important. So if the size of the house ranges from one to a thousand square feet, say, then the size squared of the house will range from one to one million, the square of a thousand, and your third feature x3, which is the size cubed of the house, will range from one to ten to the nine. So these three features take on very different ranges of values, and it's important to apply feature scaling if you're using gradient descent, to get them into comparable ranges of values. Finally, here's one last example of how you really have broad choices in the features you use. Earlier we talked about how a quadratic model like this might not be ideal because, you know, maybe a quadratic model fits the data okay, but the quadratic function eventually comes back down, and we really don't want to predict that housing prices go down as the size of the house increases. But rather than going to a cubic model, you have, maybe, other choices of features, and there are many possible choices. Just to give you another example of a reasonable choice, another reasonable choice might be to say that the price of a house is theta 0 plus theta 1 times the size, and then plus theta 2 times the square root of the size, right? The square root function is this sort of function, and maybe there will be some values of the parameters theta that will let you take this model and fit a curve that looks like that, one that goes up but sort of flattens out and doesn't ever come back down. And so, by having insight into, in this case, the shape of a square root function and the shape of the data, by choosing different features you can sometimes get better models. In this video, we talked about polynomial regression, that is, how to fit a polynomial, like a quadratic function or a cubic function, to your data. I also threw out this idea that you have a choice in what features to use, so that instead of using the frontage and the depth of the house, maybe you can multiply them together to get a feature that captures the land area of the house. In case this seems a little bit bewildering, with all these different feature choices, how do you decide what features to use? Later in this class we'll talk about some algorithms for automatically choosing what features to use, so you can have an algorithm look at the data and automatically choose for you whether to fit a quadratic function, or a cubic function, or something else.
But until we get to those algorithms, I just want you to be aware that you have a choice in what features to use, and by designing different features you can fit more complex functions to your data than just a straight line; in particular, you can fit polynomial functions as well, and sometimes, with appropriate insight into the features, you can get a much better model for your data.
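    Here is a small, hypothetical Octave sketch of the two ideas from this video: combining frontage and depth into a single area feature, and building polynomial features from the size of the house (all numbers and variable names are made up for illustration):

        frontage = [60; 40; 75];                     % toy lot widths
        depth    = [100; 80; 120];                   % toy lot depths
        area     = frontage .* depth;                % new single feature x = frontage * depth

        sz      = [1000; 1500; 3000];                % toy house sizes in square feet
        X_cubic = [ones(3, 1), sz, sz.^2, sz.^3];    % features for the cubic model:
                                                     % h(x) = theta0 + theta1*size + theta2*size^2 + theta3*size^3

    With the features built this way, the ordinary multivariate linear regression machinery (gradient descent or the normal equation) can fit the cubic model unchanged.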

    Reading: Features and Polynomial Regression

    Features and Polynomial Regression
    We can improve our features and the form of our hypothesis function in a couple different ways.

    We can combine multiple features into one. For example, we can combine (x_1) and (x_2) into a new feature (x_3) by taking (x_1⋅x_2).

    Polynomial Regression

    Our hypothesis function need not be linear (a straight line) if that does not fit the data well.

    We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).

    For example, if our hypothesis function is (h_\theta(x) = \theta_0 + \theta_1 x_1) then we can create additional features based on (x_1), to get the quadratic function (h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2) or the cubic function (h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3)

    In the cubic version, we have created new features (x_2) and (x_3) where (x_2 = x_1^2) and (x_3 = x_1^3).

    To make it a square root function, we could do: (h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1})

    One important thing to keep in mind: if you choose your features this way, then feature scaling becomes very important.

    e.g. if (x_1) has range 1 to 1000, then the range of (x_1^2) becomes 1 to 1,000,000 and that of (x_1^3) becomes 1 to 1,000,000,000.
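    As a hypothetical illustration of those ranges and of the scaling step (mean normalization is one common choice; the numbers below are assumptions):

        x1 = linspace(1, 1000, 5)';            % size: range 1 to 1,000
        X_poly = [x1, x1.^2, x1.^3];           % columns range roughly 1-1e3, 1-1e6, 1-1e9
        mu = mean(X_poly);
        sigma = std(X_poly);
        X_scaled = (X_poly - mu) ./ sigma;     % mean-normalized columns with comparable ranges
                                               % (uses Octave's automatic broadcasting)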

    Video: Normal Equation

    In this video, we'll talk about the normal equation, which for some linear regression problems, will give us a much better way to solve for the optimal value of the parameters theta. Concretely, so far the algorithm that we've been using for linear regression is gradient descent where in order to minimize the cost function J of Theta, we would take this iterative algorithm that takes many steps, multiple iterations of gradient descent to converge to the global minimum. In contrast, the normal equation would give us a method to solve for theta analytically, so that rather than needing to run this iterative algorithm, we can instead just solve for the optimal value for theta all at one go, so that in basically one step you get to the optimal value right there.
    0:49
    It turns out the normal equation has some advantages and some disadvantages, but before we get to that and talk about when you should use it, let's get some intuition about what this method does. As a motivating example, let's take a very simplified cost function J of Theta that's just a function of a real number Theta. So, for now, imagine that Theta is just a scalar value, just a real number rather than a vector. Imagine that we have a cost function J that's a quadratic function of this real-valued parameter Theta, so J of Theta looks like that. Well, how do you minimize a quadratic function? For those of you that know a little bit of calculus, you may know that the way to minimize a function is to take its derivative and set the derivative equal to zero. So, you take the derivative of J with respect to the parameter Theta, you get some formula which I am not going to derive, you set that derivative equal to zero, and this allows you to solve for the value of Theta that minimizes J of Theta. That was the simpler case, where Theta was just a real number. In the problem that we are interested in, Theta is no longer just a real number; instead, it is this n+1-dimensional parameter vector, and the cost function J is a function of this vector of values Theta 0 through Theta n. The cost function looks like this, the squared cost function on the right. How do we minimize this cost function J? Calculus tells us that one way to do so is to take the partial derivative of J with respect to every parameter Theta j in turn, and then to set all of these to 0. If you do that, and you solve for the values of Theta 0, Theta 1, up to Theta n, then this would give you the values of Theta that minimize the cost function J. If you actually work through the calculus and work through the solution for the parameters Theta 0 through Theta n, the derivation ends up being somewhat involved, and what I am going to do in this video is actually not go through the derivation, which is kind of long and kind of involved. What I want to do is just tell you what you need to know in order to implement this process, so you can solve for the values of the Thetas that correspond to where the partial derivatives are equal to zero, or equivalently, the values of Theta that minimize the cost function J of Theta. I realize that some of the comments I made may have made more sense only to those of you that are more familiar with calculus. But if you're less familiar with calculus, don't worry about it; I'm just going to tell you what you need to know in order to implement this algorithm and get it to work. For the example that I want to use as a running example, let's say that I have m = 4 training examples.
    3:50
    In order to implement the normal equation, here's what I'm going to do. I'm going to take my data set, so here are my four training examples; in this case let's assume that these four examples are all the data I have. What I am going to do is take my data set and add an extra column that corresponds to my extra feature x0, which always takes on the value of 1. I'm then going to construct a matrix called X that basically contains all of the features from my training data. So here are all my features, and we're going to take all those numbers and put them into this matrix X, okay? Just copy the data over one column at a time. Then I am going to do something similar for the y's: I am going to take the values that I'm trying to predict and construct a vector, like so, and call that the vector y. So X is going to be an
    4:59
    m by (n+1)-dimensional matrix, and y is going to be an m-dimensional vector, where m is the number of training examples and n is the number of features; there are n+1 columns because of the extra feature x0 that I added. Finally, if you take your matrix X and your vector y, and you just compute this, setting theta equal to X transpose X inverse times X transpose y, this would give you the value of theta that minimizes your cost function. There was a lot that happened on that slide, and I worked through it using one specific example of one data set. Let me write this out in a slightly more general form, and then later on in this video let me explain this equation a little bit more.
    5:57
    It may not yet be entirely clear how to do this in general, so let's say we have m training examples, (x(1), y(1)) up to (x(m), y(m)), and n features. So each of the training examples x(i) looks like a vector like this, an n+1-dimensional feature vector. The way I'm going to construct the matrix X, which is also called the design matrix, is as follows. Each training example gives me a feature vector like this, an n+1-dimensional vector, and the way I am going to construct my design matrix X is like this: I take the first training example, which is a vector, take its transpose so it ends up being this long, flat thing, and make x(1) transpose the first row of my design matrix. Then I take my second training example x(2), take the transpose of that, and put it in as the second row of X, and so on, down until my last training example; take the transpose of that, and that's the last row of my matrix X. So that makes my matrix X an m by (n+1)-dimensional matrix. As a concrete example, let's say I have only one feature, really only one feature other than x0, which is always equal to 1. So if my feature vectors x(i) are equal to this 1, which is x0, and then some real feature, like maybe the size of the house, then my design matrix X would be equal to this. For the first row, I'm basically going to take this and take its transpose, so I end up with 1 and then x1 of my first example. For the second row, we end up with 1 and then x1 of the second example, and so on, down to 1 and then x1 of the m-th example. And thus, this will be an m by 2-dimensional matrix. So that's how to construct the matrix X. And the vector y (sometimes I might write an arrow on top to denote that it is a vector, but very often I'll just write it as y, either way) is obtained by taking all the labels, all the correct prices of houses in my training set, and just stacking them up into an m-dimensional vector, and that's y. Finally, having constructed the matrix X and the vector y, we then just compute theta as (X transpose X) inverse times X transpose y. I just want to make sure that this equation makes sense to you and that you know how to implement it. So, concretely, what is (X transpose X) inverse? Well, it's the inverse of the matrix X transpose X. Concretely, if you were to set A to be equal to X transpose times X (X transpose is a matrix, and X transpose times X gives you another matrix, which we call A), then (X transpose X) inverse is just taking this matrix A and inverting it, right? This gives, let's say, A inverse.
    9:26
    And so that's how you compute this thing: you compute X transpose X and then you compute its inverse. We haven't yet talked about Octave; we'll do so in a later set of videos, but in the Octave programming language, or the MATLAB programming language which is very similar, the command to compute this quantity, X transpose X inverse times X transpose y, is as follows. In Octave, X prime is the notation that you use to denote X transpose, and so this expression that's boxed in red is computing X transpose times X. pinv is a function for computing the inverse of a matrix, so this computes (X transpose X) inverse, and then you multiply that by X transpose, and you multiply that by y. So you end up computing that formula, which I didn't prove, but it is possible to show mathematically, even though I'm not going to do so here, that this formula gives you the optimal value of theta, in the sense that if you set theta equal to this, that's the value of theta that minimizes the cost function J of theta for linear regression. One last detail: in an earlier video I talked about feature scaling and the idea of getting features to be on similar ranges of values. If you are using the normal equation method, then feature scaling isn't actually necessary; it is actually okay if, say, some feature x1 ranges from zero to one, some feature x2 ranges from zero to one thousand, and some feature x3 ranges from zero to ten to the minus five. If you are using the normal equation method, this is okay and there is no need to do feature scaling, although of course if you are using gradient descent, then feature scaling is still important. Finally, when should you use gradient descent and when should you use the normal equation method? Here are some of their advantages and disadvantages. Let's say you have m training examples and n features. One disadvantage of gradient descent is that you need to choose the learning rate alpha, and often this means running it a few times with different learning rates and seeing what works best, so that is extra work and extra hassle. Another disadvantage of gradient descent is that it needs many more iterations, so depending on the details, that could make it slower, although there's more to the story, as we'll see in a second. As for the normal equation, you don't need to choose any learning rate alpha, which makes it really convenient and simple to implement; you just run it and it usually just works. And you don't need to iterate, so you don't need to plot J of theta or check convergence or take all those extra steps. So far, the balance seems to favor the normal equation. Here are some disadvantages of the normal equation and some advantages of gradient descent. Gradient descent works pretty well even when you have a very large number of features, so even if you have millions of features you can run gradient descent and it will be reasonably efficient; it will do something reasonable. In contrast, for the normal equation, in order to solve for the parameters theta, we need to compute this term, (X transpose X) inverse. This matrix X transpose X is an n by n matrix if you have n features.
Because if you look at the dimensions of X transpose and the dimensions of X, and figure out what the dimension of the product is, the matrix X transpose X is an n by n matrix, where n is the number of features, and for most computed implementations the cost of inverting a matrix grows roughly as the cube of the dimension of the matrix. So computing this inverse costs roughly order n-cubed time. Sometimes it's slightly faster than n cubed but, you know, close enough for our purposes. So if n, the number of features, is very large,
    13:37
    then computing this quantity can be slow and the normal equation method can actually be much slower. So if n is large, then I might usually use gradient descent, because we don't want to pay this order n-cubed time. But if n is relatively small, then the normal equation might give you a better way to solve for the parameters. What do small and large mean? Well, if n is on the order of a hundred, then inverting a hundred-by-hundred matrix is no problem by modern computing standards. If n is a thousand, I would still use the normal equation method; inverting a thousand-by-thousand matrix is actually really fast on a modern computer. If n is ten thousand, then I might start to wonder. Inverting a ten-thousand-by-ten-thousand matrix starts to get kind of slow, and I might then start to lean in the direction of gradient descent, but maybe not quite; at n equals ten thousand, you can still sort of invert a ten-thousand-by-ten-thousand matrix. But if it gets much bigger than that, then I would probably use gradient descent. So, if n equals ten to the sixth, with a million features, then inverting a million-by-million matrix is going to be very expensive, and I would definitely favor gradient descent if you have that many features. Exactly how large the set of features has to be before you switch to gradient descent is hard to pin down with a strict number, but for me it is usually around ten thousand that I might start to consider switching over to gradient descent, or maybe to some other algorithms that we'll talk about later in this class. To summarize, so long as the number of features is not too large, the normal equation gives us a great alternative method for solving for the parameter theta. Concretely, so long as the number of features is less than a thousand, I would usually use the normal equation method rather than gradient descent. To preview some ideas that we'll talk about later in this course: as we get to more complex learning algorithms, for example when we talk about classification algorithms like the logistic regression algorithm, we'll see that the normal equation method actually does not work for those more sophisticated learning algorithms, and we will have to resort to gradient descent for them. So gradient descent is a very useful algorithm to know, both for linear regression with a large number of features and for some of the other algorithms we'll see in this course, because for them the normal equation method just doesn't apply and doesn't work. But for this specific model of linear regression, the normal equation can give you an alternative
    16:07
    that can be much faster than gradient descent. So, depending on the details of your algorithm, the details of the problem, and how many features you have, both of these algorithms are well worth knowing about.
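    A minimal Octave sketch of the command described in this video, on a made-up four-example data set (the numbers are toy values loosely echoing the housing example):

        x = [2104; 1416; 1534; 852];        % toy house sizes
        y = [460; 232; 315; 178];           % toy prices
        m = length(y);
        X = [ones(m, 1), x];                % design matrix with the extra x0 = 1 column
        theta = pinv(X' * X) * X' * y       % the normal equation: (X'X)^-1 * X' * y
        % X' is Octave notation for X transpose; pinv computes the (pseudo-)inverse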

    Reading: Normal Equation

    Normal Equation
    Note: [8:00 to 8:44 - The design matrix X (in the bottom right side of the slide) given in the example should have elements x with subscript 1 and superscripts varying from 1 to m, because for all m training examples there are only 2 features, (x_0) and (x_1). 12:56 - The X matrix is m by (n+1) and NOT n by n.]

    Gradient descent gives one way of minimizing J. Let’s discuss a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm. In the "Normal Equation" method, we will minimize J by explicitly taking its derivatives with respect to the θj ’s, and setting them to zero. This allows us to find the optimum theta without iteration. The normal equation formula is given below:

    [\theta = (X^TX)^{-1}X^Ty]

    There is no need to do feature scaling with the normal equation.

    The following is a comparison of gradient descent and the normal equation:

    Gradient descent: you need to choose (\alpha); it needs many iterations; but it works well even when the number of features n is large.

    Normal equation: no need to choose (\alpha); no need to iterate; but it must compute ((X^TX)^{-1}), which is slow if n is very large.

    With the normal equation, computing the inversion has complexity (\mathcal{O}(n^3)). So if we have a very large number of features, the normal equation will be slow. In practice, when n exceeds 10,000 it might be a good time to go from a normal solution to an iterative process.

    Video: Normal Equation Noninvertibility

    In this video I want to talk about the normal equation and non-invertibility. This is a somewhat more advanced concept, but it's something that I've often been asked about, and so I want to address it here. Since it is a more advanced concept, feel free to consider this optional material.
    0:18
    There's a phenomenon that you may run into that may be somewhat useful to understand, but even if you don't understand it, the normal equation and linear regression should still work okay.
    0:31
    Here's the issue.
    0:33
    For those of you that are perhaps more familiar with linear algebra, what some students have asked me is: when computing theta equals X transpose X inverse X transpose y, what if the matrix X transpose X is non-invertible? For those of you that know a bit more linear algebra, you may know that only some matrices are invertible; some matrices do not have an inverse, and we call those non-invertible matrices, also known as singular or degenerate matrices. The issue of X transpose X being non-invertible should happen pretty rarely, and in Octave, if you implement this to compute theta, it turns out that it will actually do the right thing. I'm getting a little technical now, and I don't want to go into the details, but Octave has two functions for inverting matrices: one is called pinv, and the other is called inv. The differences between these two are somewhat technical; one computes the pseudo-inverse, the other the inverse. But you can show mathematically that so long as you use the pinv function, this will actually compute the value of theta that you want, even if X transpose X is non-invertible. The specific differences between pinv and inv involve somewhat advanced numerical computing concepts that I don't really want to get into, but I thought in this optional video I'd try to give you a little bit of intuition about what it means for X transpose X to be non-invertible, for those of you that know a bit more linear algebra and might be interested. I'm not going to prove this mathematically, but if X transpose X is non-invertible, there are usually two most common causes. The first cause is if somehow in your learning problem you have redundant features. Concretely, if you're trying to predict housing prices, and if x1 is the size of the house in square feet and x2 is the size of the house in square meters, then, since 1 meter is equal to 3.28 feet (rounded to two decimal places), your two features will always satisfy the constraint x1 equals 3.28 squared times x2. And, for those of you that are somewhat advanced in linear algebra, you can actually show that if your two features are related by a linear equation like this, then the matrix X transpose X will be non-invertible. The second thing that can cause X transpose X to be non-invertible is if you are trying to run the learning algorithm with a lot of features; concretely, if m is less than or equal to n. For example, imagine that you have m = 10 training examples and n = 100 features; then you're trying to fit a parameter vector theta which is n plus one dimensional, so it's 101-dimensional, and you're trying to fit 101 parameters from just 10 training examples.
    3:44
    This sometimes turns out to work, but it isn't always a good idea, because, as we'll see later, you might not have enough data if you only have 10 examples to fit 100 or 101 parameters. We'll see later in this course why this might be too little data to fit this many parameters. But commonly, what we do when m is less than n is to see if we can either delete some features or use a technique called regularization, which is something that we'll talk about later in this class as well, and which will let you fit a lot of parameters, use a lot of features, even if you have a relatively small training set. But regularization will be a later topic in this course. To summarize: if you ever find that X transpose X is singular, or in other words non-invertible, what I would recommend you do is first look at your features and see if you have redundant features like the x1, x2 example here, features that are linearly dependent or are a linear function of each other. If you do have redundant features, you really don't need both of them, and deleting one will solve your non-invertibility problem. So I would first think through my features and check if any are redundant, and if so, keep deleting redundant features until they're no longer redundant. And if your features are not redundant, I would check whether I have too many features; if that's the case, I would either delete some features, if I can bear to use fewer features, or else I would consider using regularization, which is the topic that we'll talk about later.
    5:24
    So that's it for the normal equation and what it means if the matrix X transpose X is non-invertible. This is a problem that you should hopefully run into pretty rarely, and if you just implement it in Octave using the pinv function, which computes the pseudo-inverse, that implementation should just do the right thing even if X transpose X is non-invertible, which should happen pretty rarely anyway. So this should not be a problem for most implementations of linear regression.
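    As a hypothetical illustration of the redundant-feature case from this video (the data and variable names are made up): if one column of X is a multiple of another, X'X becomes singular, yet pinv still returns a usable theta, whereas inv would warn that the matrix is singular.

        sqft = [2104; 1416; 1534; 852];     % size in square feet
        sqm  = sqft / 3.28^2;               % the same size in square meters: x1 = 3.28^2 * x2
        y    = [460; 232; 315; 178];
        X    = [ones(4, 1), sqft, sqm];     % columns 2 and 3 are linearly dependent
        rank(X' * X)                        % 2 rather than 3, so X'X is non-invertible
        theta = pinv(X' * X) * X' * y       % pinv still produces a valid least-squares solution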

    Reading: Normal Equation Noninvertibility

    Normal Equation Noninvertibility
    When implementing the normal equation in Octave we want to use the 'pinv' function rather than 'inv.' The 'pinv' function will give you a value of (\theta) even if (X^TX) is not invertible.

    If (X^TX) is noninvertible, the common causes might be:

    • Redundant features, where two features are very closely related (i.e. they are linearly dependent)
    • Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization" (to be explained in a later lesson).

    Solutions to the above problems include deleting a feature that is linearly dependent with another or deleting one or more features when there are too many features.

    Video: Working on and Submitting Programming Assignments

    In this video, I want to quickly step you through the logistics of how to work on homeworks in this class and how to use the submission system, which will let you verify right away that you got the right answer for your machine learning programming exercise. Here's my Octave window, and let's first go to my desktop. I saved the files for my first exercise, some of the files, on my desktop in this directory, 'ml-class-ex1'. We provide a number of files and ask you to edit some of them; you'll find the details in the PDF file for this programming exercise. One of the files we ask you to edit is this file called warmUpExercise.m, where the exercise is really just to make sure that you're familiar with the submission system, and all you need to do is return the 5x5 identity matrix. So the solution to this exercise, which I just showed you, is to write A = eye(5). That modifies this function to generate the 5x5 identity matrix, and this function warmUpExercise() now returns the 5x5 identity matrix. I'm just going to save it; so I've done the first part of this homework. Going back to my Octave window, let's now go to my directory, 'C:\Users\ang\Desktop\ml-class-ex1'. And if I want to make sure that I've implemented this, I type 'warmUpExercise()' like so, and yup, it returns the 5x5 identity matrix that we just wrote the code to create. And I can now submit the code as follows: I'm going to type 'submit()' in this directory, and I'm ready to submit part 1, so I'm going to enter choice '1'. It then asks me for my email address. I'm going to go to the course website. This is an internal testing site, so your version of the website may look a little bit different. But that's my email address and this is my submission password, and I'm just going to type them in here. So I have ang@cs.stanford.edu and my submission password is 9yC75USsGf. I'm going to hit enter; it connects to the server and submits it, and right away it tells you "Congratulations! You have successfully completed Homework 1 Part 1". This gives you a verification that you got this part right, and if you don't submit the right answer, then it will give you a message indicating that you haven't quite gotten it right yet. You can use this submission password, and you can generate new passwords; it doesn't matter. You can also use your regular website login password, but because this password here is typed in clear text on your monitor, we gave you this extra submission password in case you don't want to type your website's normal password into a window that, depending on your operating system, may or may not appear as text when you type it into the Octave submission script. So, that's how you submit the homeworks after you've done them. Good luck, and when you get around to the homeworks, I hope you get all of them right. Finally, in the next and final Octave tutorial video, I want to tell you about vectorization, which is a way to get your Octave code to run much more efficiently.
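    For reference, a minimal sketch of the completed warmUpExercise.m described in this video would look roughly like this:

        function A = warmUpExercise()
        % WARMUPEXERCISE returns the 5x5 identity matrix.
        A = eye(5);
        end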

    Reading: Programming tips from Mentors

    Thank you to Machine Learning Mentor, Tom Mosher, for compiling this list

    Subject: Confused about "h(x) = theta' * x" vs. "h(x) = X * theta?"

    Text:

    The lectures and exercise PDF files are based on Prof. Ng's feeling that novice programmers will adapt to for-loop techniques more readily than vectorized methods. So the videos (and PDF files) are organized toward processing one training example at a time. The course uses column vectors (in most cases), so h (a scalar for one training example) is theta' * x.

    Lower-case x typically indicates a single training example.

    The more efficient vectorized techniques always use X as a matrix of all training examples, with each example as a row, and the features as columns. That makes X have dimensions of (m x n), where m is the number of training examples. This leaves us with h (a vector of all the hypothesis values for the entire training set) as X * theta, with dimensions of (m x 1).

    X (as a matrix of all training examples) is denoted as upper-case X.

    Throughout this course, dimensional analysis is your friend.
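    A small, hypothetical dimension check in Octave (the sizes and values below are made up) showing the two forms side by side:

        m = 4;  n = 2;
        X = [ones(m, 1), (1:m)', (2:2:2*m)'];   % m x (n+1): x0 = 1 plus two toy features
        theta = [0.5; 1; -1];                   % (n+1) x 1 column vector
        x = X(1, :)';                           % one training example, as a column vector
        h_single = theta' * x;                  % scalar: hypothesis for that one example
        h_all    = X * theta;                   % m x 1: hypotheses for the whole training set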

    Subject: Tips from the Mentors: submit problems and fixing program errors

    Text:

    This post contains some frequently-used tips about the course, and to help get your programs working correctly.

    The Most Important Tip:

    Search the forum before posting a new question. If you've got a question, the chances are that someone else has already posted it, and received an answer. Save time for yourself and the Forum users by searching for topics before posting a new one.

    Running your scripts:

    At the Octave/Matlab command line, you do not need to include the ".m" portion of the script file name. If you include the ".m", you'll get an error message about an invalid indexing operation. So, run the Exercise 1 script by typing just "ex1" at the command line.

    You also do not need to include parenthesis () when using the submit script. Just type "submit".

    You cannot execute your functions by simply typing the name. All of the functions you will work on require a set of parameter values, entered between a set of parentheses. Your three methods of testing your code are:

    1 - use an exercise script, such as "ex1"

    2 - use a Unit Test (see below) where you type-in the entire command line including the parameters.

    3 - use the submit script.

    Making the grader happy:

    The submit grader uses a different test case than what is in the PDF file. These test cases use a different size of data set and are more sensitive to small errors than the ex test cases. Your code must work correctly with any size of data set.

    Your functions must handle the general case. This means:

    - You should avoid using hard-coded array indexes.

    - You should avoid having fixed-length arrays and matrices.

    It is very common for students to think that getting the same answer as listed in the PDF file means they should get full credit from the grader. This is a false hope. The PDF file is just one test case. The grader uses a different test case.

    Also, the grader does not like your code to send any additional outputs to the workspace. So, every line of code should end with a semicolon.

    Getting Help:

    When you want help from the Forum community, please use this two-step procedure:

    1 - Search the Forum for keywords that relate to your problem. Searching by the function name is a good start.

    2 - If you don't find a suitable thread, then do this:

    2a - Find the unit tests for that exercise (see below), and run the appropriate test. Attempt to debug your code.

    2b - Take a screen capture of your whole console workspace (including the command line), and post it to the forum, along with any other useful information (computer type, Octave/Matlab version, other tests you've tried, etc).

    Debugging:
    If your code runs but gives the wrong answers, you can insert a "keyboard" command in your script, just before the function ends. This will cause the program to exit to the debugger, so you can inspect all your variables from the command line. This often is very helpful in analysing math errors, or trying out what commands to use to implement your function.

    There are additional test cases and tutorials listed in pinned threads under "All Course Discussions". The test cases are especially helpful in debugging in situations where you get the expected output in ex but get no points or an error when submitting.

    Unit Tests:
    Each programming assignment has a "Discussions" area in the Forum. In this section you can often find "unit tests". These are additional test cases, which give you a command to type, and provides the expected results. It is always a good idea to test your functions using the unit tests before submitting to the grader.

    If you run a unit test and do not get the correct results, you can most easily get help on the forums by posting a screen capture of your workspace - including the command line you entered, and the results.

    Having trouble submitting your work to the grader?:
    - This section will need to be supplemented with info appropriate to the new submission system. If you run the submit script and get a message that your identity can't be verified, be sure that you have logged-in using your Coursera account email and your Programming Assignment submission password.

    - If you get the message "submit undefined", first check that you are in the working directory where you extracted the files from the ZIP archive. Use "cd" to get there if necessary.

    - If the "submit undefined" error persists, or any other "function undefined" messages appear, try using the "addpath(pwd)" command to add your present working directory (pwd) to the Octave execution path.

    -If the submit script crashes with an error message, please see the thread "Mentor tips for submitting your work" under "All Course Discussions".

    -The submit script does not ask for what part of the exercise you want to submit. It automatically grades any function you have modified.

    Found some errata in the course materials?
    This course material has been used for many previous sessions. Most likely all of the errata has been discovered, and it's all documented in the 'Errata' section under 'Supplementary Materials'. Please check there before posting errata to the Forum.

    Error messages with fmincg()

    The "short-circuit" warnings are due to a change in the syntax for conditional expressions (| and & vs || and &&) in newer versions of Matlab. You can edit the fmincg.m file and the warnings may be resolved.

    Warning messages about "automatic broadcasting"?
    See this link for info.

    Warnings about "divide by zero"
    These are normal in some of the exercises, and do not represent a problem in your function. You can ignore them - Octave senses the issue and substitutes a +Inf or -Inf value so your program continues to execute.

    Reading: Lecture Slides

    Lecture4.pdf

    Logistic Regression

    Logistic regression is a method for classifying data into discrete outcomes. For example, we might use logistic regression to classify an email as spam or not spam. In this module, we introduce the notion of classification, the cost function for logistic regression, and the application of logistic regression to multi-class classification.

    7 videos, 8 readings

    Video: Classification

    In this and the next few videos, I want to start to talk about classification problems, where the variable y that you want to predict is discrete valued. We'll develop an algorithm called logistic regression, which is one of the most popular and most widely used learning algorithms today.
    0:19
    Here are some examples of classification problems. Earlier we talked about email spam classification as an example of a classification problem. Another example would be classifying online transactions. So if you have a website that sells stuff and if you want to know if a particular transaction is fraudulent or not, whether someone is using a stolen credit card or has stolen the user's password. There's another classification problem. And earlier we also talked about the example of classifying tumors as cancerous, malignant or as benign tumors.
    0:55
    In all of these problems the variable that we're trying to predict is a variable y that we can think of as taking on two values, either zero or one: either spam or not spam, fraudulent or not fraudulent, malignant or benign.
    1:10
    Another name for the class that we denote with zero is the negative class, and another name for the class that we denote with one is the positive class. So zero, the negative class, might denote a benign tumor, and one, the positive class, might denote a malignant tumor. The assignment of the two classes, spam versus not spam and so on, to positive and negative, to zero and one, is somewhat arbitrary and it doesn't really matter, but often there is the intuition that the negative class conveys the absence of something, like the absence of a malignant tumor, whereas the positive class conveys the presence of something we may be looking for. But the definition of which is negative and which is positive is somewhat arbitrary and doesn't matter that much.
    2:00
    For now we're going to start with classification problems with just two classes, zero and one. Later on we'll talk about multiclass problems as well, where y may take on, say, four values: zero, one, two, and three. That is called a multiclass classification problem, but for the next few videos let's start with the two-class, or binary, classification problem, and we'll worry about the multiclass setting later. So how do we develop a classification algorithm? Here's an example of a training set for a classification task, for classifying a tumor as malignant or benign. Notice that malignancy takes on only two values: zero, or no, and one, or yes. So one thing we could do, given this training set, is to apply the algorithm that we already know,
    2:51
    linear regression, to this data set and just try to fit a straight line to the data. So if you take this training set and fit a straight line to it, maybe you get a hypothesis that looks like that. Right, so that's my hypothesis, h(x) equals theta transpose x. If you want to make predictions, one thing you could try doing is to threshold the classifier output at 0.5, that is, at a vertical-axis value of 0.5, and if the hypothesis outputs a value that is greater than or equal to 0.5 you predict y = 1, and if it's less than 0.5 you predict y = 0. Let's see what happens if we do that. So 0.5, and so that's where the threshold is, and, using linear regression this way, everything to the right of this point we will end up predicting as the positive class, because the output values are greater than 0.5 on the vertical axis, and everything to the left of that point we will end up predicting as the negative class.
    3:55
    In this particular example, it looks like linear regression is actually doing something reasonable, even though this is a classification task we're interested in. But now let's try changing the problem a bit. Let me extend the horizontal axis out a little bit and say we got one more training example way out there on the right. Notice that that additional training example, this one out here, doesn't actually change anything, right? Looking at the training set, it's pretty clear what a good hypothesis is: everything to the right of somewhere around here we should predict as positive, and everything to the left we should probably predict as negative, because from this training set it looks like all the tumors larger than a certain value around here are malignant, and all the tumors smaller than that are not malignant, at least for this training set.
    4:46
    But once we've added that extra example over here, if you now run linear regression, you instead get a straight-line fit to the data that might look like this.
    4:57
    And if you now threshold the hypothesis at 0.5, you end up with a threshold that's around here, so that everything to the right of this point you predict as positive and everything to the left of that point you predict as negative.
    5:14
    And this seems a pretty bad thing for linear regression to have done, right? Because, you know, these are our positive examples, these are our negative examples, and it's pretty clear we really should be separating the two somewhere around there. But somehow, by adding one example way out here to the right, an example that really isn't giving us any new information (I mean, it should be no surprise to the learning algorithm that the example way out here turns out to be malignant), having that example out there caused linear regression to change its straight-line fit to the data from this magenta line to this blue line over here, and caused it to give us a worse hypothesis.
    5:56
    So, applying linear regression to a classification problem often isn't a great idea. In the first example, before I added this extra training example, previously linear regression was just getting lucky and it got us a hypothesis that worked well for that particular example, but usually applying linear regression to a data set, you might get lucky but often it isn't a good idea. So I wouldn't use linear regression for classification problems.
    6:29
    Here's one other funny thing about what would happen if we were to use linear regression for a classification problem. For classification, we know that y is either zero or one, but if you are using linear regression, the hypothesis can output values that are much larger than one or less than zero, even if all of your training examples have labels y equals zero or one.
    6:53
    And it seems kind of strange that, even though we know the labels should be zero or one, the algorithm can output values much larger than one or much smaller than zero.
    7:09
    So what we'll do in the next few videos is develop an algorithm called logistic regression, which has the property that the output, the predictions of logistic regression are always between zero and one, and doesn't become bigger than one or become less than zero.
    7:26
    And by the way, logistic regression is, and we will use it as, a classification algorithm. It is maybe sometimes confusing that the term regression appears in the name even though logistic regression is actually a classification algorithm, but that's just a name it was given for historical reasons. So don't be confused by that: logistic regression is actually a classification algorithm that we apply to settings where the label y is a discrete value, either zero or one. So hopefully you now know why, if you have a classification problem, using linear regression isn't a good idea. In the next video, we'll start working out the details of the logistic regression algorithm.
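    As a hypothetical numerical illustration of the problem described in this video (all numbers below are made up): fitting linear regression to 0/1 labels and thresholding at 0.5 gives a sensible cutoff at first, but adding one far-right, clearly positive example drags the fitted line down and shifts the cutoff, so previously correct examples get misclassified.

        sizes  = [1; 2; 3; 4; 5; 6];            % toy tumor sizes
        labels = [0; 0; 0; 1; 1; 1];            % 0 = benign, 1 = malignant
        X = [ones(6, 1), sizes];
        theta = pinv(X' * X) * X' * labels;     % linear regression on the 0/1 labels
        cutoff1 = (0.5 - theta(1)) / theta(2)   % size where h(x) = 0.5, about 3.5 here

        X2 = [X; 1, 50];  y2 = [labels; 1];     % add one very large, clearly malignant tumor
        theta2 = pinv(X2' * X2) * X2' * y2;
        cutoff2 = (0.5 - theta2(1)) / theta2(2) % cutoff moves right, past the example at size 4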

    Reading: Classification

    Classification
    To attempt classification, one method is to use linear regression and map all predictions greater than 0.5 as a 1 and all less than 0.5 as a 0. However, this method doesn't work well because classification is not actually a linear function.

    The classification problem is just like the regression problem, except that the values we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem in which y can take on only two values, 0 and 1. (Most of what we say here will also generalize to the multiple-class case.) For instance, if we are trying to build a spam classifier for email, then (x^{(i)}) may be some features of a piece of email, and y may be 1 if it is a piece of spam mail, and 0 otherwise. Hence, y∈{0,1}. 0 is also called the negative class, and 1 the positive class, and they are sometimes also denoted by the symbols “-” and “+.” Given (x^{(i)}), the corresponding (y^{(i)}) is also called the label for the training example.

    Video: Hypothesis Representation

    Let's start talking about logistic regression. In this video, I'd like to show you the hypothesis representation. That is, what is the function we're going to use to represent our hypothesis when we have a classification problem?
    0:15
    Earlier, we said that we would like our classifier to output values that are between 0 and 1. So we'd like to come up with a hypothesis that satisfies this property, that is, predictions are maybe between 0 and 1. When we were using linear regression, this was the form of a hypothesis, where h(x) is theta transpose x. For logistic regression, I'm going to modify this a little bit and make the hypothesis g of theta transpose x. Where I'm going to define the function g as follows. G(z), z is a real number, is equal to one over one plus e to the negative z.
    0:58
    This is called the sigmoid function, or the logistic function, and the term logistic function is what gives rise to the name logistic regression. By the way, the terms sigmoid function and logistic function are basically synonyms and mean the same thing, so the two terms are basically interchangeable, and either term can be used to refer to this function g. And if we take these two equations and put them together, then here's just an alternative way of writing out the form of my hypothesis: I'm saying that h(x) is 1 over 1 plus e to the negative theta transpose x. All I've done is taken this variable z, where z is a real number, and plugged in theta transpose x, so I end up with theta transpose x in place of z there. Lastly, let me show you what the sigmoid function looks like; we're going to plot it on this figure here. The sigmoid function g(z), also called the logistic function, looks like this: it starts off near 0, rises until it crosses 0.5 at the origin, and then it flattens out again, like so. So that's what the sigmoid function looks like. And you notice that the sigmoid function asymptotes at one and asymptotes at zero; here the horizontal axis is z. As z goes to minus infinity, g(z) approaches zero, and as z goes to plus infinity, g(z) approaches one. And because g(z) outputs values between zero and one, we also have that h(x) must be between zero and one. Finally, given this hypothesis representation, what we need to do, as before, is fit the parameters theta to our data. So given a training set, we need to pick a value for the parameters theta, and this hypothesis will then let us make predictions. We'll talk about a learning algorithm later for fitting the parameters theta, but first let's talk a bit about the interpretation of this model.
    3:18
    Here's how I'm going to interpret the output of my hypothesis, h(x).
    3:25
    When my hypothesis outputs some number, I am going to treat that number as the estimated probability that y is equal to one on a new input, example x. Here's what I mean, here's an example. Let's say we're using the tumor classification example, so we may have a feature vector x, which is this x zero equals one as always. And then one feature is the size of the tumor.
    3:52
    Suppose I have a patient come in with some tumor size, and I feed their feature vector x into my hypothesis, and suppose my hypothesis outputs the number 0.7. I'm going to interpret my hypothesis as follows: I'm going to say that this hypothesis is telling me that, for a patient with features x, the probability that y equals 1 is 0.7. In other words, I'm going to tell my patient that the tumor, sadly, has a 70 percent chance, or a 0.7 chance, of being malignant. To write this out slightly more formally, in math, I'm going to interpret my hypothesis output as P of y = 1, given x, parameterized by theta. For those of you that are familiar with probability, this equation may make sense; if you're a little less familiar with probability, here's how I read this expression: this is the probability that y is equal to one, given x, given that my patient has features x, so given that my patient has a particular tumor size represented by my features x, and this probability is parameterized by theta. So I'm basically going to count on my hypothesis to give me estimates of the probability that y is equal to 1. Now, since this is a classification task, we know that y must be either 0 or 1, right? Those are the only two values that y could possibly take on, either in the training set or for new patients that may walk into my office, or into the doctor's office, in the future. So given h(x), we can therefore compute the probability that y = 0 as well, because y must be either 0 or 1; we know that the probability of y = 0 plus the probability of y = 1 must add up to 1. This first equation looks a little bit more complicated; it's basically saying that the probability of y = 0 for a particular patient with features x, and given our parameters theta.
    6:00
    Plus the probability of y=1 for that same patient with features x and given parameters theta must add up to one. If this equation looks a little bit complicated, feel free to mentally imagine it without that x and theta. And this is just saying that the probability of y equals zero plus the probability of y equals one, must be equal to one. And we know this to be true because y has to be either zero or one, and so the chance of y equals zero, plus the chance that y is one. Those two must add up to one. And so if you just take this term and move it to the right hand side, then you end up with this equation. That says probability that y equals zero is 1 minus probability of y equals 1, and thus if our hypothesis h(x) gives us that first term. You can therefore quite simply compute the probability, or compute the estimated probability, that y is equal to 0 as well. So, you now know what the hypothesis representation is for logistic regression and we've seen what the mathematical formula is defining the hypothesis for logistic regression. In the next video, I'd like to try to give you better intuition about what the hypothesis function looks like. And I wanna tell you about something called the decision boundary. And we'll look at some visualizations together to try to get a better sense of what this hypothesis function of logistic regression really looks like.

    Reading: Hypothesis Representation

    Hypothesis Representation
    We could approach the classification problem ignoring the fact that y is discrete-valued, and use our old linear regression algorithm to try to predict y given x. However, it is easy to construct examples where this method performs very poorly. Intuitively, it also doesn’t make sense for \(h_\theta(x)\) to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}. To fix this, let’s change the form for our hypotheses \(h_\theta(x)\) to satisfy \(0 \leq h_\theta(x) \leq 1\). This is accomplished by plugging \(\theta^Tx\) into the Logistic Function.

    Our new form uses the "Sigmoid Function," also called the "Logistic Function":
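
    In symbols, the hypothesis and the sigmoid function described in the video are:

    \[h_\theta(x) = g(\theta^T x), \qquad z = \theta^T x, \qquad g(z) = \frac{1}{1 + e^{-z}}\]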

    The following image shows us what the sigmoid function looks like:

    The function g(z), shown here, maps any real number to the (0, 1) interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification.

    \(h_\theta(x)\) will give us the probability that our output is 1. For example, \(h_\theta(x) = 0.7\) gives us a probability of 70% that our output is 1. Our probability that our prediction is 0 is just the complement of our probability that it is 1 (e.g. if probability that it is 1 is 70%, then the probability that it is 0 is 30%).
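
    As a small, illustrative Octave sketch of this (the parameter and feature values below are made up, not taken from the lecture):

    sigmoid = @(z) 1 ./ (1 + exp(-z));   % the logistic function g(z)
    theta = [-3; 1];                     % illustrative parameters
    x = [1; 2.5];                        % x0 = 1 plus one feature, e.g. tumor size
    h = sigmoid(theta' * x);             % h_theta(x), read as P(y = 1 | x; theta)
    p_y0 = 1 - h;                        % P(y = 0 | x; theta) is the complement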

    Video: Decision Boundary

    In the last video, we talked about the hypothesis representation for logistic regression. What I'd like to do now is tell you about something called the decision boundary, and this will give us a better sense of what the logistic regression's hypothesis function is computing.
    0:17
    To recap, this is what we wrote out last time, where we said that the hypothesis is represented as h of x equals g of theta transpose x, where g is this function called the sigmoid function, which looks like this. It slowly increases from zero to one, asymptoting at one.
    0:38
    What I want to do now is try to understand better when this hypothesis will make predictions that y is equal to 1 versus when it might make predictions that y is equal to 0. And understand better what the hypothesis function looks like, particularly when we have more than one feature. Concretely, this hypothesis is outputting estimates of the probability that y is equal to one, given x and parameterized by theta. So if we wanted to predict is y equal to one or is y equal to zero, here's something we might do. Whenever the hypothesis outputs that the probability of y being one is greater than or equal to 0.5, so this means that if it is more likely that y equals 1 than y equals 0, then let's predict y equals 1. And otherwise, if the probability, the estimated probability of y being 1 is less than 0.5, then let's predict y equals 0. And I chose a greater than or equal to here and a less than here. If h of x is equal to 0.5 exactly, then you could predict positive or negative, but I put a greater than or equal to here, so we default maybe to predicting positive if h of x is 0.5, but that's a detail that really doesn't matter that much.
    1:56
    What I want to do is understand better when is it exactly that h of x will be greater than or equal to 0.5, so that we'll end up predicting y is equal to 1. If we look at this plot of the sigmoid function, we'll notice that the sigmoid function, g of z, is greater than or equal to 0.5 whenever z is greater than or equal to zero. So it's in this half of the figure that g takes on values that are 0.5 and higher. This notch here, that's 0.5, and so when z is positive, g of z, the sigmoid function, is greater than or equal to 0.5.
    2:42
    Since the hypothesis for logistic regression is h of x equals g of theta transpose x, this is therefore going to be greater than or equal to 0.5, whenever theta transpose x is greater than or equal to 0. So that's what we've shown, right, because here theta transpose x takes the role of z.
    3:08
    So what we've shown is that the hypothesis is gonna predict y equals 1 whenever theta transpose x is greater than or equal to 0. Let's now consider the other case of when the hypothesis will predict y is equal to 0. Well, by similar argument, h(x) is going to be less than 0.5 whenever g(z) is less than 0.5 because the range of values of z that cause g(z) to take on values less than 0.5, well, that's when z is negative. So when g(z) is less than 0.5, the hypothesis will predict that y is equal to 0. And by similar argument to what we had earlier, h(x) is equal to g of theta transpose x and so we'll predict y equals 0 whenever this quantity theta transpose x is less than 0.
    4:04
    To summarize what we just worked out, we saw that if we decide to predict whether y=1 or y=0 depending on whether the estimated probability is greater than or equal to 0.5, or whether less than 0.5, then that's the same as saying that when we predict y=1 whenever theta transpose x is greater than or equal to 0. And we'll predict y is equal to 0 whenever theta transpose x is less than 0. Let's use this to better understand how the hypothesis of logistic regression makes those predictions. Now, let's suppose we have a training set like that shown on the slide. And suppose a hypothesis is h of x equals g of theta zero plus theta one x one plus theta two x two.
    4:52
    We haven't talked yet about how to fit the parameters of this model. We'll talk about that in the next video. But suppose that, via a procedure to be specified, we end up choosing the following values for the parameters. Let's say we choose theta 0 equals minus 3, theta 1 equals 1, theta 2 equals 1. So this means that my parameter vector is going to be theta equals minus 3, 1, 1.
    5:24
    So, when given this choice of my hypothesis parameters, let's try to figure out where a hypothesis would end up predicting y equals one and where it would end up predicting y equals zero.
    5:39
    Using the formulas that we were taught on the previous slide, we know that y equals one is more likely, that is the probability that y equals one is greater than or equal to 0.5, whenever theta transpose x is greater than zero. And this formula that I just underlined, -3 + x1 + x2, is, of course, theta transpose x when theta is equal to this value of the parameters that we just chose.
    6:12
    So for any example, for any example with features x1 and x2 that satisfy this equation, that minus 3 plus x1 plus x2 is greater than or equal to 0, our hypothesis will think that y equals 1 is more likely, so it will predict that y is equal to 1.
    6:32
    We can also take -3 and bring this to the right and rewrite this as x1+x2 is greater than or equal to 3, so equivalently, we found that this hypothesis would predict y=1 whenever x1+x2 is greater than or equal to 3.
    6:51
    Let's see what that means on the figure, if I write down the equation, X1 + X2 = 3, this defines the equation of a straight line and if I draw what that straight line looks like, it gives me the following line which passes through 3 and 3 on the x1 and the x2 axis.
    7:16
    So the part of the input space, the part of the X1 X2 plane that corresponds to when X1 plus X2 is greater than or equal to 3, that's going to be this right half plane, that is, everything up and to the upper right portion of this magenta line that I just drew. And so, the region where our hypothesis will predict y = 1, is this region, just really this huge region, this half space over to the upper right. And let me just write that down, I'm gonna call this the y = 1 region. And, in contrast, the region where x1 + x2 is less than 3, that's when we will predict that y is equal to 0. And that corresponds to this region. And that's really a half plane, but that region on the left is the region where our hypothesis will predict y = 0. I wanna give this line, this magenta line that I drew, a name. This line, there, is called the decision boundary.
    8:24
    And concretely, this straight line, X1 plus X2 equals 3, that corresponds to the set of points, the region, where h of x is equal to 0.5 exactly. And the decision boundary, that is this straight line, that's the line that separates the region where the hypothesis predicts y equals 1 from the region where the hypothesis predicts that y is equal to zero. And just to be clear, the decision boundary is a property of the hypothesis
    8:57
    including the parameters theta zero, theta one, theta two. And in the figure I drew a training set, I drew a data set, in order to help the visualization. But even if we take away the data set this decision boundary and the region where we predict y =1 versus y = 0, that's a property of the hypothesis and of the parameters of the hypothesis and not a property of the data set.
    9:22
    Later on, of course, we'll talk about how to fit the parameters and there we'll end up using the training set, using our data. To determine the value of the parameters. But once we have particular values for the parameters theta0, theta1, theta2 then that completely defines the decision boundary and we don't actually need to plot a training set in order to plot the decision boundary.
    9:49
    Let's now look at a more complex example where as usual, I have crosses to denote my positive examples and Os to denote my negative examples. Given a training set like this, how can I get logistic regression to fit the sort of data?
    10:05
    Earlier when we were talking about polynomial regression or when we're talking about linear regression, we talked about how we could add extra higher order polynomial terms to the features. And we can do the same for logistic regression. Concretely, let's say my hypothesis looks like this where I've added two extra features, x1 squared and x2 squared, to my features. So that I now have five parameters, theta zero through theta four.
    10:32
    As before, we'll defer to the next video our discussion on how to automatically choose values for the parameters theta zero through theta four. But let's say that, via a procedure to be specified, I end up choosing theta zero equals minus one, theta one equals zero, theta two equals zero, theta three equals one and theta four equals one.
    10:59
    What this means is that with this particular choice of parameters, my parameter vector theta looks like minus one, zero, zero, one, one.
    11:10
    Following our earlier discussion, this means that my hypothesis will predict that y=1 whenever -1 + x1 squared + x2 squared is greater than or equal to 0. This is whenever theta transpose x, theta transpose times my features, is greater than or equal to zero. And if I take minus 1 and just bring this to the right, I'm saying that my hypothesis will predict that y is equal to 1 whenever x1 squared plus x2 squared is greater than or equal to 1. So what does this decision boundary look like? Well, if you were to plot the curve for x1 squared plus x2 squared equals 1, some of you will recognize that that is the equation for a circle of radius one, centered around the origin. So that is my decision boundary.
    12:10
    And everything outside the circle, I'm going to predict as y=1. So out here is my y equals 1 region, we'll predict y equals 1 out here, and inside the circle is where I'll predict y is equal to 0. So by adding these more complex, or these polynomial terms to my features as well, I can get more complex decision boundaries that don't just try to separate the positive and negative examples with a straight line, so that I can get, in this example, a decision boundary that's a circle.
    12:44
    Once again, the decision boundary is a property, not of the training set, but of the hypothesis and of the parameters. So, so long as we're given the parameter vector theta, that defines the decision boundary, which is the circle. But the training set is not what we use to define the decision boundary. The training set may be used to fit the parameters theta. We'll talk about how to do that later. But, once you have the parameters theta, that is what defines the decision boundary.
    13:14
    Let me put back the training set just for visualization.
    13:18
    And finally let's look at a more complex example.
    13:22
    So can we come up with even more complex decision boundaries than this? If I have even higher order polynomial terms, so things like
    13:32
    x1 squared, x1 squared x2, x1 squared x2 squared, and so on, and have much higher order polynomials, then it's possible to show that you can get even more complex decision boundaries, and logistic regression can be used to find decision boundaries that may, for example, be an ellipse like that, or maybe with a little bit different setting of the parameters you can get instead a different decision boundary which may even look like some funny shape like that.
    14:03
    Or for even more complex examples maybe you can also get decision boundaries that could look like more complex shapes like that, where everything in here you predict y = 1 and everything outside you predict y = 0. So with these higher order polynomial features you can get very complex decision boundaries. So, with these visualizations, I hope that gives you a sense of what's the range of hypothesis functions we can represent using the representation that we have for logistic regression.
    14:34
    Now that we know what h(x) can represent, what I'd like to do next in the following video is talk about how to automatically choose the parameters theta so that given a training set we can automatically fit the parameters to our data.

    Reading: Decision Boundary

    Decision Boundary
    In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:
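
    \[h_\theta(x) \geq 0.5 \rightarrow y = 1\]
    \[h_\theta(x) < 0.5 \rightarrow y = 0\]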

    The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:
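
    \[g(z) \geq 0.5 \quad \text{when} \quad z \geq 0\]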

    Remember:
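
    \[z = 0, \; e^{0} = 1 \Rightarrow g(z) = 1/2\]
    \[z \to \infty, \; e^{-\infty} \to 0 \Rightarrow g(z) = 1\]
    \[z \to -\infty, \; e^{\infty} \to \infty \Rightarrow g(z) = 0\]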

    So if our input to g is \(\theta^T X\), then that means:
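
    \[h_\theta(x) = g(\theta^T x) \geq 0.5 \quad \text{when} \quad \theta^T x \geq 0\]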

    From these statements we can now say:
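
    \[\theta^T x \geq 0 \Rightarrow y = 1\]
    \[\theta^T x < 0 \Rightarrow y = 0\]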

    The decision boundary is the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function.

    Example:
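
    One set of parameter values consistent with the example described below (the specific numbers are illustrative, not taken from the lecture) would be:

    \[\theta = \begin{bmatrix} 5 \\ -1 \\ 0 \end{bmatrix}, \qquad y = 1 \;\text{ if }\; 5 + (-1)x_1 + 0 \cdot x_2 \geq 0, \;\text{ i.e. }\; x_1 \leq 5\]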

    In this case, our decision boundary is a straight vertical line placed on the graph where \(x_1 = 5\), and everything to the left of that denotes y = 1, while everything to the right denotes y = 0.

    Again, the input to the sigmoid function g(z) (e.g. \(\theta^T X\)) doesn't need to be linear, and could be a function that describes a circle (e.g. \(z = \theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2\)) or any shape to fit our data.
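
    As a small, illustrative Octave sketch of the two decision boundaries discussed in the video (the test points below are made up):

    sigmoid = @(z) 1 ./ (1 + exp(-z));

    % Linear boundary: theta = [-3; 1; 1] predicts y = 1 whenever x1 + x2 >= 3.
    theta_lin = [-3; 1; 1];
    x = [1; 2; 2];                                   % x0 = 1, x1 = 2, x2 = 2
    pred_lin = sigmoid(theta_lin' * x) >= 0.5;       % 2 + 2 >= 3, so this predicts 1

    % Circular boundary: theta = [-1; 0; 0; 1; 1] with features [1; x1; x2; x1^2; x2^2]
    % predicts y = 1 whenever x1^2 + x2^2 >= 1.
    theta_circ = [-1; 0; 0; 1; 1];
    x1 = 0.3; x2 = 0.4;
    feat = [1; x1; x2; x1^2; x2^2];
    pred_circ = sigmoid(theta_circ' * feat) >= 0.5;  % 0.25 < 1, so this predicts 0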

    Video: Cost Function

    In this video, we'll talk about how to fit the parameters theta for logistic regression. In particular, I'd like to define the optimization objective, or the cost function, that we'll use to fit the parameters.
    0:15
    Here's the supervised learning problem of fitting a logistic regression model. We have a training set of m training examples and as usual, each of our examples is represented by a feature vector that's n plus one dimensional,
    0:32
    and as usual we have x zero equals one. Our first feature, or our zeroth feature, is always equal to one. And because this is a classification problem, our training set has the property that every label y is either 0 or 1. This is our hypothesis, and the parameters of the hypothesis are these thetas over here. And the question that I want to talk about is given this training set, how do we choose, or how do we fit, the parameters theta? Back when we were developing the linear regression model, we used the following cost function. I've written this slightly differently, where instead of 1 over 2m, I've taken the one-half and put it inside the summation instead. Now I want to use an alternative way of writing out this cost function. Which is that instead of writing out this squared error term here, let's write in here cost of h of x, y, and I'm going to define the total cost of h of x, y to be equal to this. Just equal to this one-half of the squared error. So now we can see more clearly that the cost function is a sum over my training set, which is 1 over m times the sum over my training set of this cost term here.
    1:56
    And to simplify this equation a little bit more, it's going to be convenient to get rid of those superscripts. So just define cost of h of x comma y to be equal to one half of this squared error. And the interpretation of this cost function is that this is the cost I want my learning algorithm to have to pay if it outputs that value, if its prediction is h of x, and the actual label was y. So just cross off the superscripts, right, and no surprise, for linear regression the cost we've defined is that, the cost of this is, one-half times the squared difference between what I predicted and the actual value that we observed for y. Now this cost function worked fine for linear regression. But here, we're interested in logistic regression. If we could minimize this cost function that is plugged into J here, that would work okay. But it turns out that if we use this particular cost function, this would be a non-convex function of the parameters theta. Here's what I mean by non-convex. We have some cost function J of theta, and for logistic regression, this function h here
    3:12
    has a nonlinearity that is one over one plus e to the negative theta transpose x. So this is a pretty complicated nonlinear function. And if you take the function, plug it in here. And then take this cost function and plug it in there and then plot what J of theta looks like. You find that J of theta can look like a function that's like this
    3:33
    with many local optima. And the formal term for this is that this is a non-convex function. And you can kind of tell, if you were to run gradient descent on this sort of function, it is not guaranteed to converge to the global minimum. Whereas in contrast what we would like is to have a cost function J of theta that is convex, that is a single bow-shaped function that looks like this, so that if you run gradient descent on it, we would be guaranteed that it
    4:01
    would converge to the global minimum. And the problem with using this squared cost function is that because of this very nonlinear function that appears in the middle here, J of theta ends up being a non-convex function if you were to define it as a square cost function. So what we'd like to do is instead come up with a different cost function, that is convex, so that we can apply a great optimization algorithm, like gradient descent, and be guaranteed to find the global minimum. Here's the cost function that we're going to use for logistic regression. We're going to say that the cost, or the penalty that the algorithm pays, if it outputs the value h(x), so if it predicts some number like 0.7, and the actual label turns out to be y, the cost is going to be -log(h(x)) if y = 1 and -log(1- h(x)) if y = 0. This looks like a pretty complicated function, but let's plot this function to gain some intuition about what it's doing. Let's start off with the case of y = 1. If y = 1, then the cost function is -log(h(x)). And if we plot that, so let's say that the horizontal axis is h(x), so we know that a hypothesis is going to output a value between 0 and 1. Right, so h(x), that varies between 0 and 1. If you plot what this cost function looks like, you find that it looks like this. One way to see why the plot looks like this is because if you were to plot log z
    5:45
    with z on the horizontal axis, then that looks like that. And it approaches minus infinity, right? So this is what the log function looks like. And this is 0, this is 1. Here, z is of course playing the role of h of x. And so -log z will look like this.
    6:06
    Just flipping the sign, minus log z, and we're interested only in the range of when this function goes between zero and one, so get rid of that. And so we're just left with, you know, this part of the curve, and that's what this curve on the left looks like. Now, this cost function has a few interesting and desirable properties. First, you notice that if y is equal to 1 and h(x) is equal to 1, in other words, if the hypothesis predicts h(x) equals 1 and y is exactly equal to what it predicted, then the cost = 0, right? That corresponds to this point down here: if h(x) = 1, and we're only considering the case of y = 1 here, then the cost is down here, equal to 0. And that's where we'd like it to be, because if we correctly predict the output y, then the cost is 0. But now notice also that as h(x) approaches 0, so as the output of the hypothesis approaches 0, the cost blows up and it goes to infinity. And what this does is this captures the intuition that if the hypothesis outputs 0, that's like the hypothesis saying the chance of y equals 1 is equal to 0. It's kinda like going to our medical patient and saying the probability that you have a malignant tumor, the probability that y=1, is zero. So, it's like absolutely impossible that your tumor is malignant.
    7:55
    But if it turns out that the tumor, the patient's tumor, actually is malignant, so if y is equal to one, even after we told them that the probability of it happening is zero, that it's absolutely impossible for it to be malignant, so if we told them this with that level of certainty and we turn out to be wrong, then we penalize the learning algorithm by a very, very large cost. And that's captured by having this cost go to infinity if y equals 1 and h(x) approaches 0. This slide considered the case of y equals 1. Let's look at what the cost function looks like for y equals 0.
    8:32
    If y is equal to 0, then the cost looks like this, it looks like this expression over here, and if you plot the function, -log(1-z), what you get is the cost function actually looks like this. So it goes from 0 to 1, something like that, and so if you plot the cost function for the case of y equals 0, you find that it looks like this. And what this curve does is it now goes up and it goes to plus infinity as h of x goes to 1, because, as I was saying, if y turns out to be equal to 0, but we predicted that y is equal to 1 with almost certainty, with probability 1, then we end up paying a very large cost. And conversely, if h of x is equal to 0 and y equals 0, then the hypothesis nailed it. It predicted that y is equal to 0, and it turns out y is equal to 0, so at this point, the cost function is going to be 0. In this video, we defined the cost function for a single training example. The topic of convexity analysis is beyond the scope of this course, but it is possible to show that with this particular choice of cost function, this will give a convex optimization problem. The overall cost function J of theta will be convex and local optima free. In the next video we're gonna take these ideas of the cost function for a single training example and develop that further, and define the cost function for the entire training set. And we'll also figure out a simpler way to write it than we have been using so far, and based on that we'll work out gradient descent, and that will give us our logistic regression algorithm.

    Reading: Cost Function

    Cost Function
    We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima. In other words, it will not be a convex function.

    Instead, our cost function for logistic regression looks like:
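
    \[J(\theta) = \frac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})\]
    \[\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x)) \quad \text{if } y = 1\]
    \[\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x)) \quad \text{if } y = 0\]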

    When y = 1, we get the following plot for \(J(\theta)\) vs \(h_\theta(x)\):

    Similarly, when y = 0, we get the following plot for \(J(\theta)\) vs \(h_\theta(x)\):

    If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis function also outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity.

    If our correct answer 'y' is 1, then the cost function will be 0 if our hypothesis function outputs 1. If our hypothesis approaches 0, then the cost function will approach infinity.

    Note that writing the cost function in this way guarantees that J(θ) is convex for logistic regression.

    Video: Simplified Cost Function and Gradient Descent

    In this video, we'll figure out a slightly simpler way to write the cost function than we have been using so far. And we'll also figure out how to apply gradient descent to fit the parameters of logistic regression. So, by the end of this video, you'll know how to implement a fully working version of logistic regression.
    0:22
    Here's our cost function for logistic regression. Our overall cost function is 1 over m times the sum over the training set of the cost of making different predictions on the different examples with labels y(i). And this is the cost of a single example that we worked out earlier. And I just want to remind you that for classification problems in our training sets, and in fact even for examples not in our training set, y is always equal to zero or one, right? That's sort of part of the mathematical definition of y.
    0:55
    Because y is either zero or one, we'll be able to come up with a simpler way to write this cost function. And in particular, rather than writing out this cost function on two separate lines with two separate cases, so y equals one and y equals zero. I'm going to show you a way to take these two lines and compress them into one equation. And this would make it more convenient to write out a cost function and derive gradient descent. Concretely, we can write out the cost function as follows. We say that cost of H(x), y. I'm gonna write this as -y times log h(x)- (1-y) times log (1-h(x)). And I'll show you in a second that this expression, no, this equation, is an equivalent way, or more compact way, of writing out this definition of the cost function that we have up here. Let's see why that's the case.
    2:03
    We know that there are only two possible cases. Y must be zero or one. So let's suppose Y equals one.
    2:11
    If y is equal to 1, then this equation is saying that the cost is equal to, well, if y is equal to 1, then this thing here is equal to 1. And 1 minus y is going to be equal to 0, right. So if y is equal to 1, then 1 minus y is 1 minus 1, which is therefore 0. So the second term gets multiplied by 0 and goes away. And we're left with only this first term, which is -y times log(h(x)). y is 1, so that's equal to -log h(x). And this equation is exactly what we have up here for if y = 1. The other case is if y = 0. And if that's the case, then our writing of the cost function is saying that, well, if y is equal to 0, then this term here would be equal to zero. Whereas 1 minus y, if y is equal to zero, would be equal to 1, because 1 minus y becomes 1 minus zero which is just equal to 1. And so the cost function simplifies to just this last term here, right? Because the first term over here gets multiplied by zero, and so it disappears, and so we're just left with this last term, which is -log (1- h(x)). And you can verify that this term here is just exactly what we had for when y is equal to 0.
    3:40
    So this shows that this definition for the cost is just a more compact way of taking both of these expressions, the cases y =1 and y = 0, and writing them in a more convenient form with just one line. We can therefore write out our overall cost function for logistic regression as follows. It is this: 1 over m times the sum of these cost terms. And plugging in the definition for the cost that we worked out earlier, we end up with this. And we just put the minus sign outside. And why do we choose this particular function, when it looks like there could be other cost functions we could have chosen? Although I won't have time to go into great detail of this in this course, this cost function can be derived from statistics using the principle of maximum likelihood estimation. Which is an idea in statistics for how to efficiently find parameters theta for different models. And it also has a nice property that it is convex. So this is the cost function that essentially everyone uses when fitting logistic regression models. If you don't understand the terms that I just said, if you don't know what the principle of maximum likelihood estimation is, don't worry about it. But it's just a deeper rationale and justification behind this particular cost function than I have time to go into in this class. Given this cost function, in order to fit the parameters, what we're going to do then is try to find the parameters theta that minimize J of theta. So if we try to minimize this, this would give us some set of parameters theta. Finally, if we're given a new example with some set of features x, we can then take the thetas that we fit to our training set and output our prediction as this. And just to remind you, the output of my hypothesis I'm going to interpret as the probability that y is equal to one, given the input x and parameterized by theta. But you can think of this as just my hypothesis estimating the probability that y is equal to one. So all that remains to be done is figure out how to actually minimize J of theta as a function of theta so that we can actually fit the parameters to our training set. The way we're going to minimize the cost function is using gradient descent. Here's our cost function, and if we want to minimize it as a function of theta, here's our usual template for gradient descent where we repeatedly update each parameter by taking, updating it as itself minus learning rate alpha times this derivative term. If you know some calculus, feel free to take this term and try to compute the derivative yourself and see if you can simplify it to the same answer that I get. But even if you don't know calculus don't worry about it.
    6:30
    If you actually compute this, what you get is this equation, and I'll just write it out here. It's the sum from i equals one through m of essentially the error times x subscript j superscript (i). So if you take this partial derivative term and plug it back in here, we can then write out our gradient descent algorithm as follows.
    6:55
    And all I've done is I took the derivative term from the previous slide and plugged it in there. So if you have n features, you would have a parameter vector theta, with parameters theta 0, theta 1, theta 2, down to theta n. And you will use this update to simultaneously update all of your values of theta. Now, if you take this update rule and compare it to what we were doing for linear regression, you might be surprised to realize that, well, this equation was exactly what we had for linear regression. In fact, if you look at the earlier videos, and look at the update rule, the gradient descent rule for linear regression, it looked exactly like what I drew here inside the blue box. So are linear regression and logistic regression different algorithms or not? Well, this is resolved by observing that for logistic regression, what has changed is that the definition for this hypothesis has changed. Whereas for linear regression, we had h(x) equals theta transpose x, now this definition of h(x) has changed. And it is instead now one over one plus e to the negative theta transpose x. So even though the update rule looks cosmetically identical, because the definition of the hypothesis has changed, this is actually not the same thing as gradient descent for linear regression. In an earlier video, when we were talking about gradient descent for linear regression, we had talked about how to monitor gradient descent to make sure that it is converging. I usually apply that same method to logistic regression, too, to monitor gradient descent, to make sure it's converging correctly. And hopefully, you can figure out how to apply that technique to logistic regression yourself.
    8:43
    When implementing logistic regression with gradient descent, we have all of these different parameter values, theta zero down to theta n, that we need to update using this expression. And one thing we could do is have a for loop. So for i equals zero to n, or for i equals one to n plus one, update each of these parameter values in turn. But of course rather than using a for loop, ideally we would also use a vectorized implementation. So that a vectorized implementation can update all of these n plus one parameters all in one fell swoop. And to check your own understanding, you might see if you can figure out how to do the vectorized implementation with this algorithm as well.
    9:31
    So, now you know how to implement gradient descent for logistic regression. There was one last idea that we had talked about earlier, for linear regression, which was feature scaling. We saw how feature scaling can help gradient descent converge faster for linear regression. The idea of feature scaling also applies to gradient descent for logistic regression. And if we have features that are on very different scales, then applying feature scaling can also make gradient descent run faster for logistic regression.
    10:01
    So that's it, you now know how to implement logistic regression, and this is a very powerful, and probably the most widely used, classification algorithm in the world. And you now know how to get it to work for yourself.

    Reading: Simplified Cost Function and Gradient Descent

    Simplified Cost Function and Gradient Descent
    Note: [6:53 - the gradient descent equation should have a 1/m factor]

    We can compress our cost function's two conditional cases into one case:

    \[\mathrm{Cost}(h_\theta(x),y) = - y \; \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))\]

    Notice that when y is equal to 1, then the second term \((1-y)\log(1-h_\theta(x))\) will be zero and will not affect the result. If y is equal to 0, then the first term \(-y \log(h_\theta(x))\) will be zero and will not affect the result.

    We can fully write out our entire cost function as follows:

    \[J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m \left[ y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)})) \right]\]

    A vectorized implementation is:
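
    For example, with \(h = g(X\theta)\), the sum above can be written as:

    \[J(\theta) = \frac{1}{m} \left( -y^{T}\log(h) - (1-y)^{T}\log(1-h) \right)\]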

    Gradient Descent

    Remember that the general form of gradient descent is:
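
    \[\text{Repeat} \; \lbrace \; \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \; \rbrace\]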

    We can work out the derivative part using calculus to get:
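
    \[\text{Repeat} \; \lbrace \; \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \; \rbrace\]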

    Notice that this algorithm is identical to the one we used in linear regression. We still have to simultaneously update all values in theta.

    A vectorized implementation is:

    \(\theta := \theta - \frac{\alpha}{m} X^{T} (g(X\theta) - \vec{y})\)
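
    As a small, illustrative Octave sketch of this vectorized update on a tiny made-up data set (the numbers, learning rate, and iteration count below are arbitrary, chosen only for illustration):

    sigmoid = @(z) 1 ./ (1 + exp(-z));

    X = [1 1; 1 2; 1 3; 1 4];      % m x (n+1) design matrix, first column is x0 = 1
    y = [0; 0; 1; 1];              % labels are 0 or 1
    alpha = 0.1;                   % learning rate
    m = size(X, 1);
    theta = zeros(size(X, 2), 1);  % initialize parameters to zero

    for iter = 1:400
      h = sigmoid(X * theta);                      % hypothesis for every example
      theta = theta - (alpha / m) * X' * (h - y);  % theta := theta - (alpha/m) X'(g(X*theta) - y)
    end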

    Video: Advanced Optimization

    In the last video, we talked about gradient descent for minimizing the cost function J of theta for logistic regression.
    0:07
    In this video, I'd like to tell you about some advanced optimization algorithms and some advanced optimization concepts.
    0:15
    Using some of these ideas, we'll be able to get logistic regression
    0:19
    to run much more quickly than it's possible with gradient descent. And this will also let the algorithms scale much better to very large machine learning problems, such as if we had a very large number of features. Here's an alternative view of what gradient descent is doing. We have some cost function J and we want to minimize it. So what we need to do is, we need to write code that can take as input the parameters theta and that can compute two things: J of theta and these partial derivative terms for, you know, j equals 0, 1 up to n. Given code that can do these two things, what gradient descent does is it repeatedly performs the following update. Right? So given the code that we wrote to compute these partial derivatives, gradient descent plugs in here and uses that to update our parameters theta.
    1:08
    So another way of thinking about gradient descent is that we need to supply code to compute J of theta and these derivatives, and then these get plugged into gradient descents, which can then try to minimize the function for us. For gradient descent, I guess technically you don't actually need code to compute the cost function J of theta. You only need code to compute the derivative terms. But if you think of your code as also monitoring convergence of some such, we'll just think of ourselves as providing code to compute both the cost function and the derivative terms.
    1:42
    So, having written code to compute these two things, one algorithm we can use is gradient descent.
    1:48
    But gradient descent isn't the only algorithm we can use. And there are other algorithms, more advanced, more sophisticated ones, that, if we only provide them a way to compute these two things, then these are different approaches to optimize the cost function for us. So conjugate gradient BFGS and L-BFGS are examples of more sophisticated optimization algorithms that need a way to compute J of theta, and need a way to compute the derivatives, and can then use more sophisticated strategies than gradient descent to minimize the cost function.
    2:21
    The details of exactly what these three algorithms do is well beyond the scope of this course. And in fact you often end up spending, you know, many days, or a small number of weeks, studying these algorithms, if you take a class in advanced numerical computing.
    2:36
    But let me just tell you about some of their properties.
    2:40
    These three algorithms have a number of advantages. One is that, with any of these algorithms, you usually do not need to manually pick the learning rate alpha.
    2:50
    So one way to think of these algorithms is that, given a way to compute the derivative and the cost function, you can think of these algorithms as having a clever inner loop. And, in fact, they have a clever
    3:01
    inner loop called a line search algorithm that automatically tries out different values for the learning rate alpha and automatically picks a good learning rate alpha, so that it can even pick a different learning rate for every iteration. And so then you don't need to choose it yourself.
    3:21
    These algorithms actually do more sophisticated things than just pick a good learning rate, and so they often end up converging much faster than gradient descent, but detailed discussion of exactly what they do is beyond the scope of this course.
    3:45
    In fact, I actually have used these algorithms for a long time, like maybe over a decade, quite frequently, and it was only, you know, a few years ago that I actually figured out for myself the details of what conjugate gradient, BFGS and L-BFGS do. So it is actually entirely possible to use these algorithms successfully and apply them to lots of different learning problems without actually understanding the inner loop of what these algorithms do.
    4:12
    If these algorithms have a disadvantage, I'd say that the main disadvantage is that they're quite a lot more complex than gradient descent. And in particular, you probably should not implement these algorithms - conjugate gradient, L-BFGS, BFGS - yourself unless you're an expert in numerical computing.
    4:30
    Instead, just as I wouldn't recommend that you write your own code to compute square roots of numbers or to compute inverses of matrices, for these algorithms also what I would recommend you do is just use a software library. So, you know, to take a square root what all of us do is use some function that someone else has written to compute the square roots of our numbers.
    4:51
    And fortunately, Octave and the closely related language MATLAB - we'll be using that - Octave has a very good, a pretty reasonable library implementing some of these advanced optimization algorithms. And so if you just use the built-in library, you know, you get pretty good results.
    5:08
    I should say that there is a difference between good and bad implementations of these algorithms. And so, if you're using a different language for your machine learning application, if you're using C, C++, Java, and so on, you might want to try out a couple of different libraries to make sure that you find a good library for implementing these algorithms. Because there is a difference in performance between a good implementation of, you know, conjugate gradient or L-BFGS versus a less good implementation of conjugate gradient or L-BFGS.
    5:43
    So now let's explain how to use these algorithms, I'm going to do so with an example.
    5:48
    Let's say that you have a problem with two parameters
    5:53
    equals theta 1 and theta 2. And let's say your cost function is J of theta equals theta 1 minus five squared, plus theta 2 minus five squared.
    6:02
    So with this cost function, you know the values for theta 1 and theta 2. If you want to minimize J of theta as a function of theta, the value that minimizes it is going to be theta 1 equals 5, theta 2 equals five.
    6:15
    Now, again, I know some of you know more calculus than others, but the derivatives of the cost function J turn out to be these two expressions. I've done the calculus.
    6:26
    So if you want to apply one of the advanced optimization algorithms to minimize the cost function J. So, you know, if we didn't know the minimum was at 5, 5, but if you want to find the minimum numerically using something like gradient descent, but preferably more advanced than gradient descent, what you would do is implement an Octave function like this, so we implement a cost function,
    6:49
    a costFunction(theta) function like that,
    6:52
    and what this does is that it returns two arguments, the first J-val, is how
    6:58
    we would compute the cost function J. And so this says J-val equals, you know, theta one minus five squared plus theta two minus five squared. So it's just computing this cost function over here.
    7:10
    And the second argument that this function returns is gradient. So gradient is going to be a two by one vector,
    7:18
    and the two elements of the gradient vector correspond to the two partial derivative terms over here.
    7:27
    Having implemented this cost function,
    7:29
    you would, you can then
    7:31
    call the advanced optimization
    7:34
    function called the fminunc - it stands for function minimization unconstrained in Octave - and the way you call this is as follows. You set a few options. This is options, a data structure that stores the options you want. So 'GradObj', 'on', this sets the gradient objective parameter to on. It just means you are indeed going to provide a gradient to this algorithm. I'm going to set the maximum number of iterations to, let's say, one hundred. We're going to give it an initial guess for theta. That's a 2 by 1 vector. And then this command calls fminunc. This at symbol represents a pointer to the cost function
    8:13
    that we just defined up there. And if you call this, this will, you know, use one of the more advanced optimization algorithms. And if you want, think of it as just like gradient descent, but automatically choosing the learning rate alpha for you so you don't have to do so yourself. It will then attempt to use these sorts of advanced optimization algorithms, like gradient descent on steroids, to try to find the optimal value of theta for you. Let me actually show you what this looks like in Octave.
    8:40
    So I've written this cost function of theta function exactly as we had it on the previous line. It computes J-val which is the cost function. And it computes the gradient with the two elements being the partial derivatives of the cost function with respect to, you know, the two parameters, theta one and theta two.
    8:59
    Now let's switch to my Octave window. I'm gonna type in those commands I had just now. So, options equals optimset. This is the notation for setting my
    9:09
    parameters on my options, for my optimization algorithm. 'GradObj', 'on', 'MaxIter', 100, so that says 100 iterations, and I am going to provide the gradient to my algorithm.
    9:23
    Let's say initialTheta equals zeros(2,1). So that's my initial guess for theta.
    9:30
    And now I type: optTheta,
    9:32
    functionVal, exitFlag
    9:37
    equals fminunc, with
    9:40
    a pointer to the cost function,
    9:43
    and provide my initial guess,
    9:46
    and the options, like so. And if I hit enter this will run the optimization algorithm.
    9:53
    And it returns pretty quickly. This funny formatting that's because my line, you know, my
    9:59
    code wrapped around. So, this funny thing is just because my command line had wrapped around. But what this says is that, numerically, this algorithm, you know, think of it as gradient descent on steroids, found the optimal value of theta as theta 1 equals 5, theta 2 equals 5, exactly as we're hoping for. The function value at the optimum is essentially 10 to the minus 30. So that's essentially zero, which is also what we're hoping for. And the exit flag is 1, and this shows the convergence status of this. And if you want you can do help fminunc to read the documentation for how to interpret the exit flag. But the exit flag lets you verify whether or not this algorithm has converged.
    10:43
    So that's how you run these algorithms in Octave.
    10:47
    I should mention, by the way, that for the Octave implementation, this value of theta, your parameter vector theta, must be in R^d for d greater than or equal to 2. So if theta is just a real number, if it is not at least a two-dimensional vector or some higher-dimensional vector, this fminunc may not work. And in case you have a one-dimensional function that you want to optimize, you can look in the Octave documentation for fminunc for additional details.
    11:18
    So, that's how we optimize our trivial example of this simple quadratic cost function. How do we apply this to logistic regression?
    11:27
    In logistic regression we have a parameter vector theta, and I'm going to use a mix of octave notation and sort of math notation. But I hope this explanation will be clear, but our parameter vector theta comprises these parameters theta 0 through theta n because octave indexes,
    11:46
    vectors using indexing from 1, you know, theta 0 is actually written theta(1) in Octave, theta 1 is gonna be written theta(2) in Octave, and theta n is gonna be written theta(n+1), right? And that's because Octave indexes vectors starting from index 1 rather than from index 0.
    12:06
    So what we need to do then is write a cost function that captures the cost function for logistic regression. Concretely, the cost function needs to return J-val, which is, you know, some code to compute J of theta, and we also need to give it the gradient. So, gradient 1 is going to be some code to compute the partial derivative with respect to theta 0, the next the partial derivative with respect to theta 1, and so on. Once again, this is gradient
    12:37
    1, gradient 2 and so on, rather than gradient 0, gradient 1, because Octave indexes vectors starting from one rather than from zero.
    12:47
    But the main concept I hope you take away from this slide is, that what you need to do, is write a function that returns
    12:55
    the cost function and returns the gradient.
    12:58
    And so in order to apply this to logistic regression or even to linear regression, if you want to use these optimization algorithms for linear regression.
    13:07
    What you need to do is plug in the appropriate code to compute these things over here.
    13:15
    So, now you know how to use these advanced optimization algorithms.
    13:19
    Because, for these algorithms, you're using a sophisticated optimization library, it makes things just a little bit more opaque and so just maybe a little bit harder to debug. But because these algorithms often run much faster than gradient descent, quite typically, whenever I have a large machine learning problem, I will use these algorithms instead of using gradient descent.
    13:43
    And with these ideas, hopefully, you'll be able to get logistic regression and also linear regression to work on much larger problems. So, that's it for advanced optimization concepts.
    13:55
    And in the next and final video on Logistic Regression, I want to tell you how to take the logistic regression algorithm that you already know about and make it work also on multi-class classification problems.

    Reading: Advanced Optimization

    Advanced Optimization
    Note: [7:35 - '100' should be 100 instead. The value provided should be an integer and not a character string.]

    "Conjugate gradient", "BFGS", and "L-BFGS" are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. We suggest that you should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they're already tested and highly optimized. Octave provides them.

    We first need to provide a function that evaluates the following two functions for a given input value θ:
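
    \[J(\theta)\]
    \[\dfrac{\partial}{\partial \theta_j} J(\theta)\]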

    We can write a single function that returns both of these:

    function [jVal, gradient] = costFunction(theta)
      jVal = [...code to compute J(theta)...];
      gradient = [...code to compute derivative of J(theta)...];
    end
    

    Then we can use octave's "fminunc()" optimization algorithm along with the "optimset()" function that creates an object containing the options we want to send to "fminunc()". (Note: the value for MaxIter should be an integer, not a character string - errata in the video at 7:30)

    options = optimset('GradObj', 'on', 'MaxIter', 100);
    initialTheta = zeros(2,1);
       [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
    

    We give to the function "fminunc()" our cost function, our initial vector of theta values, and the "options" object that we created beforehand.

    Video: Multiclass Classification: One-vs-all

    In this video we'll talk about how to get logistic regression to work for multiclass classification problems. And in particular I want to tell you about an algorithm called one-versus-all classification.
    0:12
    What's a multiclass classification problem? Here are some examples. Let's say you want a learning algorithm to automatically put your email into different folders or to automatically tag your emails, so you might have different folders or different tags for work email, email from your friends, email from your family, and emails about your hobby. And so here we have a classification problem with four classes, which we might assign to the classes y = 1, y = 2, y = 3, and y = 4. And another example, for medical diagnosis, if a patient comes into your office with maybe a stuffy nose, the possible diagnosis could be that they're not ill. Maybe that's y = 1. Or they have a cold, y = 2. Or they have the flu, y = 3.
    0:59
    And a third and final example, if you are using machine learning to classify the weather, you know maybe you want to decide that the weather is sunny, cloudy, rainy, or snow, and so in all of these examples, y can take on a small number of values, maybe one to three, one to four and so on, and these are multiclass classification problems. And by the way, it doesn't really matter whether we index as 0, 1, 2, 3, or as 1, 2, 3, 4. I tend to index my classes starting from 1 rather than starting from 0, but either way works and it really doesn't matter. Whereas previously for a binary classification problem, our data sets look like this. For a multi-class classification problem our data sets may look like this, where here I'm using three different symbols to represent our three classes. So the question is, given the data set with three classes, where this is an example of one class, that's an example of a different class, and that's an example of yet a third class, how do we get a learning algorithm to work for this setting? We already know how to do binary classification using logistic regression. We know how to, you know, maybe fit a straight line to separate the positive and negative classes. Using an idea called one-vs-all classification, we can then take this and make it work for multi-class classification as well. Here's how one-vs-all classification works. And this is also sometimes called one-vs-rest. Let's say we have a training set like that shown on the left, where we have three classes: if y equals 1, we denote that with a triangle, if y equals 2, the square, and if y equals three, then the cross. What we're going to do is take our training set and turn this into three separate binary classification problems. I'll turn this into three separate two class classification problems. So let's start with class one, which is the triangle. We're gonna essentially create a new sort of fake training set where classes two and three get assigned to the negative class. And class one gets assigned to the positive class. We want to create a new training set like that shown on the right, and we're going to fit a classifier which I'm going to call h subscript theta superscript one of x, where here the triangles are the positive examples and the circles are the negative examples. So think of the triangles being assigned the value of one and the circles assigned the value of zero. And we're just going to train a standard logistic regression classifier and maybe that will give us a decision boundary that looks like that. Okay?
    3:34
    This superscript one here stands for class one, so we're doing this for the triangles of class one. Next we do the same thing for class two. We're gonna take the squares and assign the squares as the positive class, and assign everything else, the triangles and the crosses, as the negative class. And then we fit a second logistic regression classifier and call this h of x superscript two, where the superscript two denotes that we're now doing this treating the square class as the positive class. And maybe we get a decision boundary like that. And finally, we do the same thing for the third class and fit a third classifier h superscript three of x, and maybe this will give us a decision boundary like this, that separates the positive and negative examples like that.
    4:22
    So to summarize, what we've done is, we've fit three classifiers. So, for i = 1, 2, 3, we'll fit a classifier h superscript (i) subscript theta of x, thus trying to estimate what is the probability that y is equal to class i, given x and parametrized by theta. Right? So in the first instance, for this first one up here, this classifier was learning to recognize the triangles. So it's thinking of the triangles as the positive class, so h superscript one is essentially trying to estimate what is the probability that y is equal to one, given x and parametrized by theta. And similarly, this is treating the square class as a positive class and so it's trying to estimate the probability that y = 2, and so on. So we now have three classifiers, each of which was trained to recognize one of the three classes. Just to summarize, what we've done is we want to train a logistic regression classifier h superscript i of x for each class i to predict the probability that y is equal to i. Finally, to make a prediction, when we're given a new input x and we want to make a prediction, what we do is we just run all three of our classifiers on the input x and we then pick the class i that maximizes the three probabilities. So we just basically pick the classifier, whichever one of the three classifiers is most confident, or most enthusiastically says that it thinks it has the right class. So whichever value of i gives us the highest probability, we then predict y to be that value.
    6:02
    So that's it for multi-class classification and one-vs-all method. And with this little method you can now take the logistic regression classifier and make it work on multi-class classification problems as well

    Reading: Multiclass Classification: One-vs-all

    Multiclass Classification: One-vs-all
    Now we will approach the classification of data when we have more than two categories. Instead of y = {0,1} we will expand our definition so that y = {0,1...n}.

    Since y = {0,1...n}, we divide our problem into n+1 (+1 because the index starts at 0) binary classification problems; in each one, we predict the probability that 'y' is a member of one of our classes.

    We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.

    The following image shows how one could classify 3 classes:

    To summarize:

    Train a logistic regression classifier \(h_\theta(x)\) for each class to predict the probability that \(y = i\).

    To make a prediction on a new x, pick the class that maximizes \(h_\theta(x)\).
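
    As a rough illustration of this recipe, here is a minimal Octave sketch (not the course's official starter code): it trains one regularized logistic regression classifier per class with fminunc and picks the most confident classifier at prediction time. The names oneVsAll, predictOneVsAll, lrCostFunction, and sigmoid are assumptions made for this sketch, not functions provided by the course.

        % One-vs-all logistic regression -- illustrative sketch only.
        function all_theta = oneVsAll(X, y, num_labels, lambda)
          [m, n] = size(X);
          X = [ones(m, 1) X];                          % add the intercept term x0 = 1
          all_theta = zeros(num_labels, n + 1);
          options = optimset('GradObj', 'on', 'MaxIter', 50);
          for c = 1:num_labels
            initial_theta = zeros(n + 1, 1);
            % lrCostFunction (assumed) returns the regularized cost and its gradient
            theta_c = fminunc(@(t) lrCostFunction(t, X, (y == c), lambda), ...
                              initial_theta, options);
            all_theta(c, :) = theta_c';
          end
        end

        % Predict by picking the class whose classifier is most confident.
        function p = predictOneVsAll(all_theta, X)
          m = size(X, 1);
          X = [ones(m, 1) X];
          [~, p] = max(sigmoid(X * all_theta'), [], 2);  % sigmoid(z) = 1 ./ (1 + exp(-z))
        end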

    Reading: Lecture Slides

    Lecture6.pdf

    Regularization (Solving the Problem of Overfitting)

    Machine learning models need to generalize well to new examples that the model has not seen in practice. In this module, we introduce regularization, which helps prevent models from overfitting the training data.

    4 videos, 5 readings

    Video: The Problem of Overfitting

    By now, you've seen a couple of different learning algorithms, linear regression and logistic regression. They work well for many problems, but when you apply them to certain machine learning applications, they can run into a problem called overfitting that can cause them to perform very poorly. What I'd like to do in this video is explain to you what this overfitting problem is, and in the next few videos after this, we'll talk about a technique called regularization that will allow us to ameliorate or reduce this overfitting problem and get these learning algorithms to maybe work much better. So what is overfitting? Let's keep using our running example of predicting housing prices with linear regression, where we want to predict the price as a function of the size of the house. One thing we could do is fit a linear function to this data, and if we do that, maybe we get that sort of straight-line fit to the data. But this isn't a very good model. Looking at the data, it seems pretty clear that as the size of the housing increases, the housing prices plateau, or kind of flatten out as we move to the right, and so this algorithm does not fit the training data well; we call this problem underfitting, and another term for this is that this algorithm has high bias. Both of these roughly mean that it's just not even fitting the training data very well. The term is kind of a historical or technical one, but the idea is that if we're fitting a straight line to the data, then it's as if the algorithm has a very strong preconception, or a very strong bias, that housing prices are going to vary linearly with their size. Despite the evidence to the contrary, this preconception, this bias, still causes it to fit a straight line, and this ends up being a poor fit to the data. Now, in the middle, we could fit a quadratic function instead, and with this data set, if we fit the quadratic function, maybe we get that kind of curve, and that works pretty well. And, at the other extreme, would be if we were to fit, say, a fourth-order polynomial to the data. So here we have five parameters, theta zero through theta four, and with that, we can actually fit a curve that passes through all five of our training examples. You might get a curve that looks like this.
    2:31
    That, on the one hand, seems to do a very good job fitting the training set and, that is processed through all of my data, at least. But, this is still a very wiggly curve, right? So, it's going up and down all over the place, and, we don't actually think that's such a good model for predicting housing prices. So, this problem we call overfitting, and, another term for this is that this algorithm has high variance.. The term high variance is another historical or technical one. But, the intuition is that, if we're fitting such a high order polynomial, then, the hypothesis can fit, you know, it's almost as if it can fit almost any function and this face of possible hypothesis is just too large, it's too variable. And we don't have enough data to constrain it to give us a good hypothesis so that's called overfitting. And in the middle, there isn't really a name but I'm just going to write, you know, just right. Where a second degree polynomial, quadratic function seems to be just right for fitting this data. To recap a bit the problem of over fitting comes when if we have too many features, then to learn hypothesis may fit the training side very well. So, your cost function may actually be very close to zero or may be even zero exactly, but you may then end up with a curve like this that, you know tries too hard to fit the training set, so that it even fails to generalize to new examples and fails to predict prices on new examples as well, and here the term generalized refers to how well a hypothesis applies even to new examples. That is to data to houses that it has not seen in the training set. On this slide, we looked at over fitting for the case of linear regression. A similar thing can apply to logistic regression as well. Here is a logistic regression example with two features X1 and x2. One thing we could do, is fit logistic regression with just a simple hypothesis like this, where, as usual, G is my sigmoid function. And if you do that, you end up with a hypothesis, trying to use, maybe, just a straight line to separate the positive and the negative examples. And this doesn't look like a very good fit to the hypothesis. So, once again, this is an example of underfitting or of the hypothesis having high bias. In contrast, if you were to add to your features these quadratic terms, then, you could get a decision boundary that might look more like this. And, you know, that's a pretty good fit to the data. Probably, about as good as we could get, on this training set. And, finally, at the other extreme, if you were to fit a very high-order polynomial, if you were to generate lots of high-order polynomial terms of speeches, then, logistical regression may contort itself, may try really hard to find a decision boundary that fits your training data or go to great lengths to contort itself, to fit every single training example well. And, you know, if the features X1 and X2 offer predicting, maybe, the cancer to the, you know, cancer is a malignant, benign breast tumors. This doesn't, this really doesn't look like a very good hypothesis, for making predictions. And so, once again, this is an instance of overfitting and, of a hypothesis having high variance and not really, and, being unlikely to generalize well to new examples. Later, in this course, when we talk about debugging and diagnosing things that can go wrong with learning algorithms, we'll give you specific tools to recognize when overfitting and, also, when underfitting may be occurring. 
But, for now, lets talk about the problem of, if we think overfitting is occurring, what can we do to address it? In the previous examples, we had one or two dimensional data so, we could just plot the hypothesis and see what was going on and select the appropriate degree polynomial. So, earlier for the housing prices example, we could just plot the hypothesis and, you know, maybe see that it was fitting the sort of very wiggly function that goes all over the place to predict housing prices. And we could then use figures like these to select an appropriate degree polynomial. So plotting the hypothesis, could be one way to try to decide what degree polynomial to use. But that doesn't always work. And, in fact more often we may have learning problems that where we just have a lot of features. And there is not just a matter of selecting what degree polynomial. And, in fact, when we have so many features, it also becomes much harder to plot the data and it becomes much harder to visualize it, to decide what features to keep or not. So concretely, if we're trying predict housing prices sometimes we can just have a lot of different features. And all of these features seem, you know, maybe they seem kind of useful. But, if we have a lot of features, and, very little training data, then, over fitting can become a problem. In order to address over fitting, there are two main options for things that we can do. The first option is, to try to reduce the number of features. Concretely, one thing we could do is manually look through the list of features, and, use that to try to decide which are the more important features, and, therefore, which are the features we should keep, and, which are the features we should throw out. Later in this course, where also talk about model selection algorithms. Which are algorithms for automatically deciding which features to keep and, which features to throw out. This idea of reducing the number of features can work well, and, can reduce over fitting. And, when we talk about model selection, we'll go into this in much greater depth. But, the disadvantage is that, by throwing away some of the features, is also throwing away some of the information you have about the problem. For example, maybe, all of those features are actually useful for predicting the price of a house, so, maybe, we don't actually want to throw some of our information or throw some of our features away. The second option, which we'll talk about in the next few videos, is regularization. Here, we're going to keep all the features, but we're going to reduce the magnitude or the values of the parameters theta J. And, this method works well, we'll see, when we have a lot of features, each of which contributes a little bit to predicting the value of Y, like we saw in the housing price prediction example. Where we could have a lot of features, each of which are, you know, somewhat useful, so, maybe, we don't want to throw them away. So, this subscribes the idea of regularization at a very high level. And, I realize that, all of these details probably don't make sense to you yet. But, in the next video, we'll start to formulate exactly how to apply regularization and, exactly what regularization means. And, then we'll start to figure out, how to use this, to make how learning algorithms work well and avoid overfitting.

    Reading: The Problem of Overfitting

    The Problem of Overfitting
    Consider the problem of predicting y from x ∈ R. The leftmost figure below shows the result of fitting \(y = \theta_0 + \theta_1 x\) to a dataset. We see that the data doesn't really lie on a straight line, and so the fit is not very good.

    Instead, if we had added an extra feature \(x^2\), and fit \(y = \theta_0 + \theta_1 x + \theta_2 x^2\), then we obtain a slightly better fit to the data (see middle figure). Naively, it might seem that the more features we add, the better. However, there is also a danger in adding too many features: The rightmost figure is the result of fitting a \(5^{th}\)-order polynomial \(y = \sum_{j=0}^{5} \theta_j x^j\). We see that even though the fitted curve passes through the data perfectly, we would not expect this to be a very good predictor of, say, housing prices (y) for different living areas (x). Without formally defining what these terms mean, we’ll say the figure on the left shows an instance of underfitting—in which the data clearly shows structure not captured by the model—and the figure on the right is an example of overfitting.

    Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features. At the other extreme, overfitting, or high variance, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.

    This terminology is applied to both linear and logistic regression. There are two main options to address the issue of overfitting:

    1. Reduce the number of features:
    • Manually select which features to keep.
    • Use a model selection algorithm (studied later in the course).
    2. Regularization
    • Keep all the features, but reduce the magnitude of the parameters \(\theta_j\).
    • Regularization works well when we have a lot of slightly useful features.
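
    To see the underfit / just-right / overfit contrast described above for yourself, here is a small, self-contained Octave sketch. The five data points are made up purely for illustration; it fits a straight line, a quadratic, and a 4th-order polynomial to the same points with polyfit and plots all three fits.

        % Made-up (size, price) points, roughly plateauing like the housing example.
        x = [1.0 1.5 2.0 2.5 3.0]';        % size (1000s of square feet)
        y = [200 330 370 420 430]';        % price (1000s of dollars)

        p1 = polyfit(x, y, 1);             % straight line: tends to underfit (high bias)
        p2 = polyfit(x, y, 2);             % quadratic: "just right" for this data
        p4 = polyfit(x, y, 4);             % 4th-order: passes through all 5 points (overfits)

        xs = linspace(min(x), max(x), 200)';
        plot(x, y, 'rx', 'MarkerSize', 10); hold on;
        plot(xs, polyval(p1, xs), xs, polyval(p2, xs), xs, polyval(p4, xs));
        legend('training data', 'degree 1', 'degree 2', 'degree 4'); hold off;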

    Video: Cost Function

    In this video, I'd like to convey to you, the main intuitions behind how regularization works. And, we'll also write down the cost function that we'll use, when we were using regularization. With the hand drawn examples that we have on these slides, I think I'll be able to convey part of the intuition. But, an even better way to see for yourself, how regularization works, is if you implement it, and, see it work for yourself. And, if you do the appropriate exercises after this, you get the chance to self see regularization in action for yourself. So, here is the intuition. In the previous video, we saw that, if we were to fit a quadratic function to this data, it gives us a pretty good fit to the data. Whereas, if we were to fit an overly high order degree polynomial, we end up with a curve that may fit the training set very well, but, really not be a, but overfit the data poorly, and, not generalize well. Consider the following, suppose we were to penalize, and, make the parameters theta 3 and theta 4 really small. Here's what I mean, here is our optimization objective, or here is our optimization problem, where we minimize our usual squared error cause function. Let's say I take this objective and modify it and add to it, plus 1000 theta 3 squared, plus 1000 theta 4 squared. 1000 I am just writing down as some huge number. Now, if we were to minimize this function, the only way to make this new cost function small is if theta 3 and data 4 are small, right? Because otherwise, if you have a thousand times theta 3, this new cost functions gonna be big. So when we minimize this new function we are going to end up with theta 3 close to 0 and theta 4 close to 0, and as if we're getting rid of these two terms over there. And if we do that, well then, if theta 3 and theta 4 close to 0 then we are being left with a quadratic function, and, so, we end up with a fit to the data, that's, you know, quadratic function plus maybe, tiny contributions from small terms, theta 3, theta 4, that they may be very close to 0. And, so, we end up with essentially, a quadratic function, which is good. Because this is a much better hypothesis. In this particular example, we looked at the effect of penalizing two of the parameter values being large. More generally, here is the idea behind regularization. The idea is that, if we have small values for the parameters, then, having small values for the parameters, will somehow, will usually correspond to having a simpler hypothesis. So, for our last example, we penalize just theta 3 and theta 4 and when both of these were close to zero, we wound up with a much simpler hypothesis that was essentially a quadratic function. But more broadly, if we penalize all the parameters usually that, we can think of that, as trying to give us a simpler hypothesis as well because when, you know, these parameters are as close as you in this example, that gave us a quadratic function. But more generally, it is possible to show that having smaller values of the parameters corresponds to usually smoother functions as well for the simpler. And which are therefore, also, less prone to overfitting. I realize that the reasoning for why having all the parameters be small. Why that corresponds to a simpler hypothesis; I realize that reasoning may not be entirely clear to you right now. And it is kind of hard to explain unless you implement yourself and see it for yourself. 
But I hope that the example of having theta 3 and theta 4 be small and how that gave us a simpler hypothesis, I hope that helps explain why, at least give some intuition as to why this might be true. Lets look at the specific example. For housing price prediction we may have our hundred features that we talked about where may be x1 is the size, x2 is the number of bedrooms, x3 is the number of floors and so on. And we may we may have a hundred features. And unlike the polynomial example, we don't know, right, we don't know that theta 3, theta 4, are the high order polynomial terms. So, if we have just a bag, if we have just a set of a hundred features, it's hard to pick in advance which are the ones that are less likely to be relevant. So we have a hundred or a hundred one parameters. And we don't know which ones to pick, we don't know which parameters to try to pick, to try to shrink. So, in regularization, what we're going to do, is take our cost function, here's my cost function for linear regression. And what I'm going to do is, modify this cost function to shrink all of my parameters, because, you know, I don't know which one or two to try to shrink. So I am going to modify my cost function to add a term at the end. Like so we have square brackets here as well. When I add an extra regularization term at the end to shrink every single parameter and so this term we tend to shrink all of my parameters theta 1, theta 2, theta 3 up to theta 100.
    5:36
    By the way, by convention the summation here starts from one, so I am not actually going to penalize theta zero for being large. That's sort of the convention: the sum runs from i equals one through n, rather than i equals zero through n. But in practice it makes very little difference; whether you include theta zero or not makes very little difference to the results. By convention, though, usually we regularize only theta one through theta 100. Writing down our regularized optimization objective, our regularized cost function, again: here it is. Here's J of theta, where this term on the right is the regularization term, and lambda here is called the regularization parameter, and what lambda does is it controls a trade-off between two different goals. The first goal, captured by the first term in the objective, is that we would like to fit the training data well. We would like to fit the training set well. And the second goal is, we want to keep the parameters small, and that's captured by the second term, by the regularization term. And what lambda, the regularization parameter, does is control the trade-off between these two goals, between the goal of fitting the training set well and the goal of keeping the parameter values small and therefore keeping the hypothesis relatively simple, to avoid overfitting. For our housing price prediction example, whereas previously, if we had fit a very high-order polynomial, we may have wound up with a very, sort of, wiggly or curvy function like this, if you still fit a high-order polynomial with all the polynomial features in there, but instead you just make sure to use this sort of regularized objective, then what you can get out is in fact a curve that isn't quite a quadratic function, but is much smoother and much simpler, and maybe a curve like the magenta line that, you know, gives a much better hypothesis for this data. Once again, I realize it can be a bit difficult to see why shrinking the parameters can have this effect, but if you implement regularization yourself you will be able to see this effect firsthand.
    8:00
    In regularized linear regression, if the regularization parameter lambda is set to be very large, then what will happen is we will end up penalizing the parameters theta 1, theta 2, theta 3, theta 4 very highly. That is, if our hypothesis is this one down at the bottom, and we end up penalizing theta 1, theta 2, theta 3, theta 4 very heavily, then we end up with all of these parameters close to zero, right? Theta 1 will be close to zero; theta 2 will be close to zero; theta 3 and theta 4 will end up being close to zero. And if we do that, it's as if we're getting rid of these terms in the hypothesis, so that we're just left with a hypothesis that says that, well, housing prices are equal to theta zero, and that is akin to fitting a flat horizontal straight line to the data. And this is an example of underfitting, and in particular this hypothesis, this straight line, just fails to fit the training set well. It's just a flat straight line; it doesn't go anywhere near most of the training examples. And another way of saying this is that this hypothesis has too strong a preconception, or too high a bias, that housing prices are just equal to theta zero, and despite the clear data to the contrary it chooses to fit a sort of flat line, just a flat horizontal line (I didn't draw that very well), just a horizontal flat line to the data. So for regularization to work well, some care should be taken to choose a good value for the regularization parameter lambda as well. And when we talk about model selection later in this course, we'll talk about a variety of ways for automatically choosing the regularization parameter lambda as well. So, that's the idea of regularization and the cost function we use in order to apply regularization. In the next two videos, let's take these ideas and apply them to linear regression and to logistic regression, so that we can then get them to avoid overfitting.

    Reading: Cost Function

    Cost Function
    Note: [5:18 - There is a typo. It should be \(\sum_{j=1}^{n} \theta_j^2\) instead of \(\sum_{i=1}^{n} \theta_j^2\)]

    If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.

    Say we wanted to make the following function more quadratic:

    \(\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4\)

    We'll want to eliminate the influence of \(\theta_3 x^3\) and \(\theta_4 x^4\). Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:

    \(\min_\theta\ \dfrac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000 \cdot \theta_3^2 + 1000 \cdot \theta_4^2\)

    We've added two extra terms at the end to inflate the cost of \(\theta_3\) and \(\theta_4\). Now, in order for the cost function to get close to zero, we will have to reduce the values of \(\theta_3\) and \(\theta_4\) to near zero. This will in turn greatly reduce the values of \(\theta_3 x^3\) and \(\theta_4 x^4\) in our hypothesis function. As a result, we see that the new hypothesis (depicted by the pink curve) looks like a quadratic function but fits the data better due to the extra small terms \(\theta_3 x^3\) and \(\theta_4 x^4\).

    We could also regularize all of our theta parameters in a single summation as:

    \(\min_\theta\ \dfrac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2\)

    The \(\lambda\), or lambda, is the regularization parameter. It determines how much the costs of our theta parameters are inflated.

    Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth out the function too much and cause underfitting. Hence, what would happen if \(\lambda = 0\) or is too small?
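
    As a minimal sketch of the regularized cost above, here is a vectorized Octave function, under the assumption that X already contains a leading column of ones, so theta(1) corresponds to \(\theta_0\) and is excluded from the penalty; the function name regularizedCost is illustrative.

        function J = regularizedCost(theta, X, y, lambda)
          % X: m x (n+1) design matrix (first column all ones)
          % y: m x 1 targets, theta: (n+1) x 1 parameters
          m = length(y);
          h = X * theta;                                       % predictions
          J = (1 / (2 * m)) * sum((h - y) .^ 2) ...            % usual squared-error term
              + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);   % penalty skips theta(1) = theta0
        end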

    Video: Regularized Linear Regression

    For linear regression, we have previously worked out two learning algorithms. One based on gradient descent and one based on the normal equation. In this video, we'll take those two algorithms and generalize them to the case of regularized linear regression.
    0:17
    Here's the optimization objective that we came up with last time for regularized linear regression. This first part is our usual objective for linear regression. And we now have this additional regularization term, where lambda is our regularization parameter, and we like to find parameters theta that minimizes this cost function, this regularized cost function, J of theta. Previously, we were using gradient descent for the original cost function without the regularization term. And we had the following algorithm, for regular linear regression, without regularization, we would repeatedly update the parameters theta J as follows for J equals 0, 1, 2, up through n.
    1:02
    Let me take this and just write the case for theta 0 separately. So I'm just going to write the update for theta 0 separately than for the update for the parameters 1, 2, 3, and so on up to n. And so this is, I haven't changed anything yet, right. This is just writing the update for theta 0 separately from the updates for theta 1, theta 2, theta 3, up to theta n. And the reason I want to do this is you may remember that for our regularized linear regression, we penalize the parameters theta 1, theta 2, and so on up to theta n. But we don't penalize theta 0. So, when we modify this algorithm for regularized linear regression, we're going to end up treating theta zero slightly differently.
    1:48
    Concretely, if we want to take this algorithm and modify it to use the regularized objective, all we need to do is take this term at the bottom and modify it as follows. We'll take this term and add minus lambda over m times theta j. And if you implement this, then you have gradient descent for trying to minimize the regularized cost function, J of theta. And concretely, I'm not gonna do the calculus to prove it, but if you look at this term, this term that I've written in square brackets, if you know calculus it's possible to prove that that term is the partial derivative with respect to theta j of J of theta, using the new definition of J of theta with the regularization term. And similarly, this term up on top, which I'm drawing in the cyan box, that's still the partial derivative with respect to theta zero of J of theta. If you look at the update for theta j, it's possible to show something very interesting. Concretely, theta j gets updated as theta j minus alpha times, and then you have this other term here that depends on theta j. So if you group all the terms together that depend on theta j, you can show that this update can be written equivalently as follows. And all I did was write theta j here as theta j times 1. And this term is, right, lambda over m, and there's also an alpha here, so you end up with alpha lambda over m multiplied into theta j.
    3:30
    And this term here, 1 minus alpha times lambda m, is a pretty interesting term. It has a pretty interesting effect.
    3:42
    Concretely this term, 1 minus alpha times lambda over m, is going to be a number that is usually a little bit less than one, because alpha times lambda over m is going to be positive, and usually if your learning rate is small and if m is large, this is usually pretty small. So this term here is gonna be a number that's usually a little bit less than 1, so think of it as a number like 0.99, let's say. And so the effect of our update to theta j is, we're going to say that theta j gets replaced by theta j times 0.99, right?
    4:16
    So theta j times 0.99 has the effect of shrinking theta j a little bit towards zero. So this makes theta j a bit smaller. And more formally, this makes the square norm of theta j a little bit smaller. And then after that, the second term here, that's actually exactly the same as the original gradient descent update that we had, before we added all this regularization stuff. So, hopefully this gradient descent, hopefully this update makes sense. When we're using a regularized linear regression and what we're doing is on every iteration we're multiplying theta j by a number that's a little bit less then one, so its shrinking the parameter a little bit, and then we're performing a similar update as before. Of course that's just the intuition behind what this particular update is doing. Mathematically what it's doing is it's exactly gradient descent on the cost function J of theta that we defined on the previous slide that uses the regularization term. Gradient descent was just one of our two algorithms for fitting a linear regression model. The second algorithm was the one based on the normal equation, where what we did was we created the design matrix X where each row corresponded to a separate training example. And we created a vector y, so this is a vector, that's an m dimensional vector. And that contained the labels from my training set. So whereas X is an m by (n+1) dimensional matrix, y is an m dimensional vector. And in order to minimize the cost function J, we found that one way to do so is to set theta to be equal to this. Right, you have X transpose X, inverse, X transpose Y. I'm leaving room here to fill in stuff of course. And what this value for theta does is this minimizes the cost function J of theta, when we were not using regularization.
    6:26
    Now that we are using regularization, if you were to derive what the minimum is, and just to give you a sense of how to derive the minimum, the way you derive it is you take partial derivatives with respect to each parameter. Set this to zero, and then do a bunch of math and you can then show that it's a formula like this that minimizes the cost function. And concretely, if you are using regularization, then this formula changes as follows. Inside this parenthesis, you end up with a matrix like this. 0, 1, 1, 1, and so on, 1, until the bottom. So this thing over here is a matrix whose upper left-most entry is 0. There are ones on the diagonals, and then zeros everywhere else in this matrix. Because I'm drawing this rather sloppily. But as a example, if n = 2, then this matrix is going to be a three by three matrix. More generally, this matrix is an (n+1) by (n+1) dimensional matrix. So if n = 2, then that matrix becomes something that looks like this. It would be 0, and then 1s on the diagonals, and then 0s on the rest of the diagonals. And once again, I'm not going to show this derivation, which is frankly somewhat long and involved, but it is possible to prove that if you are using the new definition of J of theta, with the regularization objective, then this new formula for theta is the one that we give you, the global minimum of J of theta.
    8:01
    So finally I want to just quickly describe the issue of non-invertibility. This is relatively advanced material, so you should consider this as optional. And feel free to skip it, or if you listen to it and find it doesn't really make sense, don't worry about it either. But earlier when I talked about the normal equation method, we also had an optional video on the non-invertibility issue. So this is another optional part to this, sort of an add-on to that earlier optional video on non-invertibility. Now, consider a setting where m, the number of examples, is less than or equal to n, the number of features. If you have fewer examples than features, then this matrix, X transpose X, will be non-invertible, or singular. Or the other term for this is the matrix will be degenerate. And if you implement this in Octave anyway and you use the pinv function to take the pseudo inverse, it will kind of do the right thing, but it's not clear that it would give you a very good hypothesis, even though numerically the Octave pinv function will give you a result that kinda makes sense.
    9:13
    But if you were doing this in a different language, and if you were taking just the regular inverse, which in Octave denoted with the function inv, we're trying to take the regular inverse of X transpose X. Then in this setting, you find that X transpose X is singular, is non-invertible, and if you're doing this in different program language and using some linear algebra library to try to take the inverse of this matrix, it just might not work because that matrix is non-invertible or singular. Fortunately, regularization also takes care of this for us. And concretely, so long as the regularization parameter lambda is strictly greater than 0, it is actually possible to prove that this matrix, X transpose X plus lambda times this funny matrix here, it is possible to prove that this matrix will not be singular and that this matrix will be invertible. So using regularization also takes care of any non-invertibility issues of the X transpose X matrix as well. So you now know how to implement regularized linear regression. Using this you'll be able to avoid overfitting even if you have lots of features in a relatively small training set. And this should let you get linear regression to work much better for many problems. In the next video we'll take this regularization idea and apply it to logistic regression. So that you'd be able to get logistic regression to avoid overfitting and perform much better as well.

    Reading: Regularized Linear Regression

    Regularized Linear Regression
    Note: [8:43 - It is said that X is non-invertible if m ≤ n. The correct statement should be that X is non-invertible if m < n, and may be non-invertible if m = n.]

    We can apply regularization to both linear regression and logistic regression. We will approach linear regression first.

    Gradient Descent
    We will modify our gradient descent function to separate out \(\theta_0\) from the rest of the parameters because we do not want to penalize \(\theta_0\).

    The term \(\frac{\lambda}{m}\theta_j\) performs our regularization. With some manipulation our update rule can also be represented as:

    \(\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}\)

    The first term in the above equation, \(1 - \alpha\frac{\lambda}{m}\), will always be less than 1. Intuitively you can see it as reducing the value of \(\theta_j\) by some amount on every update. Notice that the second term is now exactly the same as it was before.
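
    As a minimal Octave sketch of this update (assuming X contains a leading column of ones so that theta(1) plays the role of \(\theta_0\) and is never shrunk; the function name gradientDescentReg is illustrative):

        function theta = gradientDescentReg(X, y, theta, alpha, lambda, num_iters)
          m = length(y);
          for iter = 1:num_iters
            h = X * theta;                          % linear hypothesis
            grad = (1 / m) * (X' * (h - y));        % unregularized gradient for all thetas
            reg = (lambda / m) * theta;
            reg(1) = 0;                             % do not penalize theta0
            theta = theta - alpha * (grad + reg);   % same as theta_j*(1 - alpha*lambda/m) - ...
          end
        end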

    Normal Equation

    Now let's approach regularization using the alternate method of the non-iterative normal equation.

    To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:

    L is a matrix with 0 at the top left and 1's down the diagonal, with 0's everywhere else. It should have dimension (n+1)×(n+1). Intuitively, this is the identity matrix (though we are not including \(x_0\)) multiplied by a single real number \(\lambda\).

    Recall that if m < n, then \(X^TX\) is non-invertible. However, when we add the term \(\lambda \cdot L\), then \(X^TX + \lambda \cdot L\) becomes invertible.
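
    In Octave, the regularized normal equation can be written in a couple of lines; this sketch assumes X is the m×(n+1) design matrix with an intercept column, y is the m×1 target vector, and lambda > 0, and it uses pinv as in the earlier normal-equation videos:

        n_plus_1 = size(X, 2);           % number of parameters, including theta0
        L = eye(n_plus_1);
        L(1, 1) = 0;                     % do not regularize the intercept term
        theta = pinv(X' * X + lambda * L) * (X' * y);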

    Video: Regularized Logistic Regression

    For logistic regression, we previously talked about two types of optimization algorithms. We talked about how to use gradient descent to optimize the cost function J of theta. And we also talked about advanced optimization methods. Ones that require that you provide a way to compute your cost function J of theta and that you provide a way to compute the derivatives.
    0:22
    In this video, we'll show how you can adapt both of those techniques, both gradient descent and the more advanced optimization techniques in order to have them work for regularized logistic regression.
    0:35
    So, here's the idea. We saw earlier that logistic regression can also be prone to overfitting if you fit it with very, sort of, high-order polynomial features like this, where g is the sigmoid function, and in particular you end up with a hypothesis, you know, whose decision boundary is just sort of an overly complex and extremely contorted function that really isn't such a great hypothesis for this training set. And more generally, if you have logistic regression with a lot of features, not necessarily polynomial ones, but just with a lot of features, you can end up with overfitting.
    1:11
    This was our cost function for logistic regression. And if we want to modify it to use regularization, all we need to do is add to it the following term: plus lambda over 2m, times the sum from j equals 1 (rather than the sum from j equals 0) of theta j squared. And this has the effect, therefore, of penalizing the parameters theta 1, theta 2, and so on up to theta n from being too large.
    1:43
    And if you do this,
    1:45
    then it will the have the effect that even though you're fitting a very high order polynomial with a lot of parameters. So long as you apply regularization and keep the parameters small you're more likely to get a decision boundary.
    1:58
    You know, that maybe looks more like this. It looks more reasonable for separating
    2:02
    the positive and the negative examples.
    2:05
    So, when using regularization
    2:08
    even when you have a lot of features, the regularization can help take care of the overfitting problem.
    2:14
    How do we actually implement this? Well, for the original gradient descent algorithm, this was the update we had. We will repeatedly perform the following update to theta J. This slide looks a lot like the previous one for linear regression. But what I'm going to do is write the update for theta 0 separately. So, the first line is for update for theta 0 and a second line is now my update for theta 1 up to theta N. Because I'm going to treat theta 0 separately. And in order to modify this algorithm, to use
    2:46
    a regularized cost function, all I need to do, pretty similar to what we did for linear regression, is actually to just modify this second update rule as follows.
    2:58
    And, once again, this, you know, cosmetically looks identical what we had for linear regression. But of course is not the same algorithm as we had, because now the hypothesis is defined using this. So this is not the same algorithm as regularized linear regression. Because the hypothesis is different. Even though this update that I wrote down. It actually looks cosmetically the same as what we had earlier. We're working out gradient descent for regularized linear regression.
    3:26
    And of course, just to wrap up this discussion, this term here in the square brackets, so this term here, this term is, of course, the new partial derivative for respect of theta J of the new cost function J of theta. Where J of theta here is the cost function we defined on a previous slide that does use regularization.
    3:49
    So, that's gradient descent for regularized logistic regression.
    3:55
    Let's talk about how to get regularized logistic regression to work using the more advanced optimization methods.
    4:03
    And just to remind you for those methods what we needed to do was to define the function that's called the cost function, that takes us input the parameter vector theta and once again in the equations we've been writing here we used 0 index vectors. So we had theta 0 up to theta N. But because Octave indexes the vectors starting from 1. Theta 0 is written in Octave as theta 1. Theta 1 is written in Octave as theta 2, and so on down to theta
    4:36
    N plus 1. And what we needed to do was provide a function. Let's provide a function called cost function that we would then pass in to what we have, what we saw earlier. We will use the fminunc and then you know at cost function,
    4:54
    and so on, right. But fminunc, the "f min unconstrained" function, is what will take the cost function and minimize it for us.
    5:05
    So the two main things that the cost function needed to return were first J-val. And for that, we need to write code to compute the cost function J of theta.
    5:17
    Now, when we're using regularized logistic regression, of course the cost function j of theta changes and, in particular,
    5:24
    now a cost function needs to include this additional regularization term at the end as well. So, when you compute j of theta be sure to include that term at the end.
    5:34
    And then, the other thing that this cost function needs to return is the gradient. So gradient(1) needs to be set to the partial derivative of J of theta with respect to theta zero, gradient(2) needs to be set to that, and so on. Once again, the index is off by one, right, because of the indexing from one that Octave uses.
    5:55
    And looking at these terms.
    5:57
    This term over here. We actually worked this out on a previous slide is actually equal to this. It doesn't change. Because the derivative for theta zero doesn't change. Compared to the version without regularization.
    6:10
    And the other terms do change. And in particular the derivative with respect to theta one, which we worked out on the previous slide as well, is equal to, you know, the original term plus lambda over m times theta 1. Just so we make sure we pass this correctly, we can add parentheses here, right, so the summation doesn't extend. And similarly, you know, this other term here looks like this, with this additional term that we had on the previous slide, that corresponds to the gradient from the regularization objective. So if you implement this cost function and pass this into fminunc or to one of those advanced optimization techniques, that will minimize the new regularized cost function J of theta.
    6:56
    And the parameters you get out
    6:59
    will be the ones that correspond to logistic regression with regularization.
    7:04
    So, now you know how to implement regularized logistic regression.
    7:09
    When I walk around Silicon Valley, I live here in Silicon Valley, there are a lot of engineers that are frankly, making a ton of money for their companies using machine learning algorithms.
    7:19
    And I know we've only been, you know, studying this stuff for a little while. But if you understand linear
    7:26
    regression, the advanced optimization algorithms and regularization, by now, frankly, you probably know quite a lot more machine learning than many, certainly now, but you probably know quite a lot more machine learning right now than frankly, many of the Silicon Valley engineers out there having very successful careers. You know, making tons of money for the companies. Or building products using machine learning algorithms.
    7:50
    So, congratulations.
    7:52
    You've actually come a long ways. And you can actually, you actually know enough to apply this stuff and get to work for many problems.
    7:59
    So congratulations for that. But of course, there's still a lot more that we want to teach you, and in the next set of videos after this, we'll start to talk about a very powerful class of non-linear classifiers. So whereas with linear regression and logistic regression, you know, you can form polynomial terms, it turns out that there are much more powerful nonlinear classifiers than this sort of polynomial regression. And in the next set of videos after this one, I'll start telling you about them, so that you have even more powerful learning algorithms than you have now to apply to different problems.

    Reading: Regularized Logistic Regression

    Regularized Logistic Regression
    We can regularize logistic regression in a similar way that we regularize linear regression. As a result, we can avoid overfitting. The following image shows how the regularized function, displayed by the pink line, is less likely to overfit than the non-regularized function represented by the blue line:

    Cost Function
    Recall that our cost function for logistic regression was:

    \(J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[\, y^{(i)} \log\big(h_\theta(x^{(i)})\big) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \,\right]\)

    We can regularize this equation by adding a term to the end:

    \(J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[\, y^{(i)} \log\big(h_\theta(x^{(i)})\big) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \,\right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2\)

    The second sum, \(\sum_{j=1}^n \theta_j^2\), means to explicitly exclude the bias term, \(\theta_0\). I.e., the θ vector is indexed from 0 to n (holding n+1 values, \(\theta_0\) through \(\theta_n\)), and this sum explicitly skips \(\theta_0\) by running from 1 to n. Thus, when computing the equation, we should continuously update the two following equations:

    \(\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_0^{(i)}\)

    \(\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m}\theta_j\right], \quad j \in \{1, 2, \dots, n\}\)
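
    Putting the regularized cost and its gradient together, here is a rough Octave sketch in the spirit of the course's cost-function interface (the function name costFunctionReg and the variable names are illustrative, not the graded assignment's exact code):

        function [J, grad] = costFunctionReg(theta, X, y, lambda)
          % X: m x (n+1) design matrix with an intercept column; y: m x 1 labels in {0, 1}
          m = length(y);
          h = 1 ./ (1 + exp(-X * theta));                             % sigmoid hypothesis
          J = (-1 / m) * sum(y .* log(h) + (1 - y) .* log(1 - h)) ...
              + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);          % theta0 not penalized
          grad = (1 / m) * (X' * (h - y));
          grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);    % regularize j = 1..n only
        end

        % Example use with an advanced optimizer:
        % options = optimset('GradObj', 'on', 'MaxIter', 400);
        % [theta, cost] = fminunc(@(t) costFunctionReg(t, X, y, lambda), initial_theta, options);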

    Reading: Lecture Slides

    Lecture7.pdf

    Programming: Logistic Regression

    Download the programming assignment here. This ZIP file contains the instructions in a PDF and the starter code. You may use either MATLAB or Octave (>= 3.8.0).

    Neural Networks: Representation

    Neural networks are a model inspired by how the brain works. They are widely used today in many applications: when your phone interprets and understands your voice commands, it is likely that a neural network is helping to understand your speech; when you cash a check, the machines that automatically read the digits also use neural networks.

    7 videos, 6 readings

    Video: Non-linear Hypotheses

    In this and in the next set of videos, I'd like to tell you about a learning algorithm called a Neural Network.
    0:07
    We're going to first talk about the representation and then in the next set of videos talk about learning algorithms for it. Neural networks are actually a pretty old idea that had fallen out of favor for a while. But today, they are the state-of-the-art technique for many different machine learning problems.
    0:23
    So why do we need yet another learning algorithm? We already have linear regression and we have logistic regression, so why do we need, you know, neural networks?
    0:32
    In order to motivate the discussion of neural networks, let me start by showing you a few examples of machine learning problems where we need to learn complex non-linear hypotheses.
    0:43
    Consider a supervised learning classification problem where you have a training set like this. If you want to apply logistic regression to this problem, one thing you could do is apply logistic regression with a lot of nonlinear features like that. So here, g as usual is the sigmoid function, and we can include lots of polynomial terms like these. And, if you include enough polynomial terms then, you know, maybe you can get a hypotheses
    1:11
    that separates the positive and negative examples. This particular method works well when you have only, say, two features - x1 and x2 - because you can then include all those polynomial terms of x1 and x2. But for many interesting machine learning problems would have a lot more features than just two.
    1:30
    We've been talking for a while about housing prediction, and suppose you have a housing classification
    1:38
    problem rather than a regression problem, like maybe if you have different features of a house, and you want to predict what are the odds that your house will be sold within the next six months, so that will be a classification problem.
    1:52
    And as we saw we can come up with quite a lot of features, maybe a hundred different features of different houses.
    2:00
    For a problem like this, if you were to include all the quadratic terms, all of these, even all of the quadratic that is the second or the polynomial terms, there would be a lot of them. There would be terms like x1 squared,
    2:12
    x1x2, x1x3, you know, x1x4
    2:18
    up to x1x100 and then you have x2 squared, x2x3
    2:25
    and so on. And if you include just the second order terms, that is, the terms that are a product of, you know, two of these terms, x1 times x1 and so on, then, for the case of n equals
    2:38
    100, you end up with about five thousand features.
    2:41
    And, asymptotically, the number of quadratic features grows roughly as order n squared, where n is the number of the original features, like x1 through x100 that we had. And its actually closer to n squared over two.
    2:59
    So including all the quadratic features doesn't seem like it's maybe a good idea, because that is a lot of features and you might up overfitting the training set, and it can also be computationally expensive, you know, to
    3:14
    be working with that many features.
    3:16
    One thing you could do is include only a subset of these, so if you include only the features x1 squared, x2 squared, x3 squared, up to maybe x100 squared, then the number of features is much smaller. Here you have only 100 such quadratic features, but this is not enough features and certainly won't let you fit the data set like that on the upper left. In fact, if you include only these quadratic features together with the original x1, and so on, up to x100 features, then you can actually fit very interesting hypotheses. So, you can fit things like, you know, axis-aligned ellipses like these, but
    3:55
    you certainly cannot fit a more complex data set like that shown here.
    3:59
    So 5000 features seems like a lot, and if you were to include the cubic, or third-order, features, the x1 x2 x3, you know, x1 squared x2, x10 x11 x17 and so on, you can imagine there are gonna be a lot of these features. In fact, there are going to be on the order of n cubed such features, and if n is 100 you can compute that you end up with on the order of about 170,000 such cubic features. So including these higher-order polynomial features when your original feature set size n is large really dramatically blows up your feature space, and this doesn't seem like a good way to come up with additional features with which to build nonlinear classifiers when n is large.
    4:49
    For many machine learning problems, n will be pretty large. Here's an example.
    4:55
    Let's consider the problem of computer vision.
    4:59
    And suppose you want to use machine learning to train a classifier to examine an image and tell us whether or not the image is a car.
    5:09
    Many people wonder why computer vision could be difficult. I mean when you and I look at this picture it is so obvious what this is. You wonder how is it that a learning algorithm could possibly fail to know what this picture is.
    5:22
    To understand why computer vision is hard let's zoom into a small part of the image like that area where the little red rectangle is. It turns out that where you and I see a car, the computer sees that. What it sees is this matrix, or this grid, of pixel intensity values that tells us the brightness of each pixel in the image. So the computer vision problem is to look at this matrix of pixel intensity values, and tell us that these numbers represent the door handle of a car.
    5:54
    Concretely, when we use machine learning to build a car detector, what we do is we come up with a label training set, with, let's say, a few label examples of cars and a few label examples of things that are not cars, then we give our training set to the learning algorithm trained a classifier and then, you know, we may test it and show the new image and ask, "What is this new thing?".
    6:17
    And hopefully it will recognize that that is a car.
    6:21
    To understand why we need nonlinear hypotheses, let's take a look at some of the images of cars and maybe non-cars that we might feed to our learning algorithm.
    6:32
    Let's pick a couple of pixel locations in our images, so that's pixel one location and pixel two location, and let's plot this car, you know, at the location, at a certain point, depending on the intensities of pixel one and pixel two.
    6:49
    And let's do this with a few other images. So let's take a different example of the car and you know, look at the same two pixel locations
    6:56
    and that image has a different intensity for pixel one and a different intensity for pixel two. So, it ends up at a different location on the figure. And then let's plot some negative examples as well. That's a non-car, that's a non-car . And if we do this for more and more examples using the pluses to denote cars and minuses to denote non-cars, what we'll find is that the cars and non-cars end up lying in different regions of the space, and what we need therefore is some sort of non-linear hypotheses to try to separate out the two classes.
    7:32
    What is the dimension of the feature space? Suppose we were to use just 50 by 50 pixel images. Now that suppose our images were pretty small ones, just 50 pixels on the side. Then we would have 2500 pixels,
    7:46
    and so the dimension of our feature size will be N equals 2500 where our feature vector x is a list of all the pixel testings, you know, the pixel brightness of pixel one, the brightness of pixel two, and so on down to the pixel brightness of the last pixel where, you know, in a typical computer representation, each of these may be values between say 0 to 255 if it gives us the grayscale value. So we have n equals 2500, and that's if we were using grayscale images. If we were using RGB images with separate red, green and blue values, we would have n equals 7500.
    8:27
    So, if we were to try to learn a nonlinear hypothesis by including all the quadratic features, that is all the terms of the form, you know, Xi times Xj, while with the 2500 pixels we would end up with a total of three million features. And that's just too large to be reasonable; the computation would be very expensive to find and to represent all of these three million features per training example.
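
    A quick back-of-the-envelope check of these feature counts in Octave (50×50 grayscale image, counting every quadratic term x_i · x_j with i ≤ j):

        n = 50 * 50;                        % 2500 pixel-intensity features
        num_quadratic = n * (n + 1) / 2;    % all products x_i * x_j with i <= j, squares included
        printf('%d quadratic features (roughly n^2 / 2)\n', num_quadratic);   % prints 3126250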
    8:55
    So, simple logistic regression together with adding in maybe the quadratic or the cubic features - that's just not a good way to learn complex nonlinear hypotheses when n is large because you just end up with too many features. In the next few videos, I would like to tell you about Neural Networks, which turns out to be a much better way to learn complex hypotheses, complex nonlinear hypotheses even when your input feature space, even when n is large. And along the way I'll also get to show you a couple of fun videos of historically important applications
    9:30
    of Neural networks as well that I hope those videos that we'll see later will be fun for you to watch as well.

    Video: Neurons and the Brain

    Neural Networks are a pretty old algorithm that was originally motivated
    0:05
    by the goal of having machines that can mimic the brain. Now in this class, of course I'm teaching Neural Networks to you because they work really well for different machine learning problems and not, certainly not, because they're biologically motivated.
    0:18
    In this video, I'd like to give you some of the background on Neural Networks. So that we can get a sense of what we can expect them to do. Both in the sense of applying them to modern day machinery problems, as well as for those of you that might be interested in maybe the big AI dream of someday building truly intelligent machines.
    0:37
    Also, how Neural Networks might pertain to that.
    0:42
    The origins of Neural Networks was as algorithms that try to mimic the brain and those a sense that if we want to build learning systems while why not mimic perhaps the most amazing learning machine we know about, which is perhaps the brain. Neural Networks came to be very widely used throughout the 1980's and 1990's and for various reasons as popularity diminished in the late 90's. But more recently, Neural Networks have had a major recent resurgence.
    1:13
    One of the reasons for this resurgence is that Neural Networks are computationally some what more expensive algorithm and so, it was only, you know, maybe somewhat more recently that computers became fast enough to really run large scale Neural Networks and because of that as well as a few other technical reasons which we'll talk about later, modern Neural Networks today are the state of the art technique for many applications.
    1:39
    So, when you think about mimicking the brain while one of the human brain does tell me same things, right? The brain can learn to see process images than to hear, learn to process our sense of touch. We can, you know, learn to do math, learn to do calculus, and the brain does so many different and amazing things. It seems like if you want to mimic the brain it seems like you have to write lots of different pieces of software to mimic all of these different fascinating, amazing things that the brain tell us, but does is this fascinating hypothesis that the way the brain does all of these different things is not worth like a thousand different programs, but instead, the way the brain does it is worth just a single learning algorithm. This is just a hypothesis but let me share with you some of the evidence for this. This part of the brain, that little red part of the brain, is your auditory cortex and the way you're understanding my voice now is your ear is taking the sound signal and routing the sound signal to your auditory cortex and that's what's allowing you to understand my words.
    2:41
    Neuroscientists have done the following fascinating experiments where you cut the wire from the ears to the auditory cortex and you re-wire,
    2:50
    in this case an animal's brain, so that the signal from the eyes to the optic nerve eventually gets routed to the auditory cortex.
    2:58
    If you do this it turns out, the auditory cortex will learn
    3:02
    to see. And this is in every single sense of the word "see" as we know it. So, if you do this to the animals, the animals can perform visual discrimination tasks; that is, they can look at images and make appropriate decisions based on the images, and they're doing it with that piece of brain tissue.
    3:19
    Here's another example.
    3:21
    That red piece of brain tissue is your somatosensory cortex. That's how you process your sense of touch. If you do a similar re-wiring process
    3:30
    then the somatosensory cortex will learn to see. Because of this and other similar experiments, these are called neuro-rewiring experiments.
    3:39
    There's this sense that if the same piece of physical brain tissue can process sight or sound or touch, then maybe there is one learning algorithm that can process sight or sound or touch. And instead of needing to implement a thousand different programs or a thousand different algorithms to do, you know, the thousand wonderful things that the brain does, maybe what we need to do is figure out some approximation to whatever the brain's learning algorithm is, implement that, and let it learn by itself how to process these different types of data.
    4:13
    To a surprisingly large extent, it seems as if we can plug in almost any sensor to almost any part of the brain, and so, within reason, the brain will learn to deal with it.
    4:25
    Here are a few more examples. On the upper left is an example of learning to see with your tongue. The way it works is--this is actually a system called BrainPort undergoing FDA trials now to help blind people see--but the way it works is, you strap a grayscale camera to your forehead, facing forward, that takes the low resolution grayscale image of what's in front of you and you then run a wire
    4:51
    to an array of electrodes that you place on your tongue, so that each pixel gets mapped to a location on your tongue, where maybe a high voltage corresponds to a dark pixel and a low voltage corresponds to a bright pixel. And even as it is today, with this sort of system you and I would be able to learn to see, you know, in tens of minutes, with our tongues. Here's a second example of human echolocation, or human sonar.
    5:19
    So there are two ways you can do this. You can either snap your fingers,
    5:24
    or click your tongue. I can't do that very well. But there are blind people today that are actually being trained in schools to do this and learn to interpret the pattern of sounds bouncing off their environment - that's sonar. So, if you search on YouTube, there are actually videos of this amazing kid who, tragically, because of cancer, had his eyeballs removed, so this is a kid with no eyeballs. But by snapping his fingers, he can walk around and never hit anything. He can ride a skateboard. He can shoot a basketball into a hoop, and this is a kid with no eyeballs.
    6:00
    A third example is the haptic belt, where if you have a strap around your waist with buzzers rigged up on it, and you always have the northmost one buzzing, you can give a human a direction sense similar to maybe how birds can, you know, sense where north is. And, as a somewhat more bizarre example, if you plug a third eye into a frog, the frog will learn to use that eye as well.
    6:27
    So, it's pretty amazing to what extent it is as if you can plug in almost any sensor to the brain, and the brain's learning algorithm will just figure out how to learn from that data and deal with that data.
    6:40
    And there's a sense that if we can figure out what the brain's learning algorithm is, and, you know, implement it or implement some approximation to that algorithm on a computer, maybe that would be our best shot at, you know, making real progress towards the AI, the artificial intelligence dream of someday building truly intelligent machines.
    6:59
    Now, of course, I'm not teaching Neural Networks, you know, just because they might give us a window into this far-off AI dream, even though, personally, that is one of the things I work on in my research life. But the main reason I'm teaching Neural Networks in this class is because they're actually a very effective state of the art technique for modern-day machine learning applications. So, in the next few videos, we'll start diving into the technical details of Neural Networks so that you can apply them to modern-day machine learning applications and get them to work well on problems. But for me, you know, one of the reasons they excite me is that maybe they give us a window into what we might do if we're also thinking about what algorithms might someday be able to learn in a manner similar to humans.

    Video: Model Representation I

    In this video, I want to start telling you about how we represent neural networks. In other words, how we represent our hypothesis or how we represent our model when using neural networks. Neural networks were developed as a way of simulating neurons, or networks of neurons, in the brain. So, to explain the hypothesis representation, let's start by looking at what a single neuron in the brain looks like. Your brain and mine is jam-packed full of neurons like these, and neurons are cells in the brain. Two things to draw attention to: first, the neuron has a cell body, like so, and moreover, the neuron has a number of input wires, and these are called the dendrites. You can think of them as input wires, and these receive inputs from other locations. A neuron also has an output wire called an axon, and this output wire is what it uses to send signals, to send messages, to other neurons. So, at a simplistic level, what a neuron is, is a computational unit that gets a number of inputs through its input wires, does some computation, and then sends its output via its axon to other nodes or other neurons in the brain. Here's an illustration of a group of neurons. The way that neurons communicate with each other is with little pulses of electricity; they are also called spikes, but that just means pulses of electricity. So here is one neuron, and if it wants to send a message, what it does is send a little pulse of electricity via its axon to some different neuron, and here, this axon, that is this output wire, connects to the dendrites of this second neuron over here, which then accepts this incoming message and does some computation. It may, in turn, decide to send out its own message on its axon to other neurons, and this is the process by which all human thought happens. It's these neurons doing computations and passing messages to other neurons as a result of what other inputs they've got. And, by the way, this is how our senses and our muscles work as well. If you want to move one of your muscles, the way that works is that a neuron may send this pulse of electricity to your muscle, and that causes your muscle to contract; and your senses, like your eyes, must send messages to your brain, and they do that by sending pulses of electricity to a neuron in your brain, like so. In a neural network, or rather, in an artificial neural network that we implement on a computer, we're going to use a very simple model of what a neuron does: we're going to model a neuron as just a logistic unit. So, when I draw a yellow circle like that, you should think of that as playing a role analogous to maybe the body of a neuron, and we then feed the neuron a few inputs through its dendrites or input wires.
    3:14
    And the neuron does some computation and outputs some value on this output wire, or, in the biological neuron, the axon. And whenever I draw a diagram like this, what this means is that it represents a computation of h of x equals one over one plus e to the negative theta transpose x, where, as usual, x is our feature vector and theta is our parameter vector, like so.
    3:42
    So this is a very simple, maybe a vastly oversimplified model, of the computations that the neuron does, where it gets a number of inputs, x1, x2, x3 and it outputs some value computed like so.
    3:59
    When I draw a neural network, usually I draw only the input nodes x1, x2, x3. Sometimes when it's useful to do so, I'll draw an extra node for x0.
    4:11
    This x0 node is sometimes called the bias unit or the bias neuron, but because x0 is always equal to 1, sometimes I'll draw it and sometimes I won't, just depending on whatever is more notationally convenient for that example.
    4:31
    Finally, one last bit of terminology: when we talk about neural networks, sometimes we'll say that this is a neuron, or an artificial neuron, with a sigmoid or logistic activation function. "Activation function", in neural network terminology, is just another term for that non-linearity g(z) = 1 over (1 + e to the -z). And whereas so far I've been calling theta the parameters of the model, I'll mostly continue to use that terminology; here they're also the parameters, but in the neural network literature you might sometimes hear people talk about the weights of a model, and weights means exactly the same thing as the parameters of a model. I'll mostly continue to use the terminology "parameters" in these videos, but sometimes you might hear others use the weights terminology.
    5:27
    So, this little diagram represents a single neuron.
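    As a minimal sketch of this single-neuron model in Octave/MATLAB (the environment used for this course's programming exercises), with parameter values made up purely for illustration:

    ```matlab
    % Sigmoid (logistic) activation function, applied element-wise
    g = @(z) 1 ./ (1 + exp(-z));

    % A single artificial neuron: inputs x1, x2, x3 plus the bias unit x0 = 1
    x     = [1; 2.0; 0.5; -1.0];     % [x0; x1; x2; x3], example input values
    theta = [-1.5; 0.3; 1.2; 0.8];   % parameters ("weights"), chosen arbitrarily

    h = g(theta' * x);               % h_theta(x) = 1 / (1 + e^(-theta' * x))
    ```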
    5:34
    What a neural network is, is just a group of these different neurons strung together. Concretely, here we have input units x1, x2, x3, and once again, sometimes we draw this extra node x0 and sometimes not; I just drew it in here. And here we have three neurons, which I've labeled a1, a2, a3 - I'll talk about those indices later. And once again, we can, if we want, add in an a0 and add this extra bias unit there; it always takes on the value of 1. Then finally we have this third node in the final layer, and it's this third node that outputs the value that the hypothesis h(x) computes. To introduce a bit more terminology: in a neural network, the first layer is also called the input layer, because this is where we input our features, x1, x2, x3. The final layer is also called the output layer, because that layer has the neuron, this one over here, that outputs the final value computed by the hypothesis. And layer 2 in between is called the hidden layer. The term hidden layer isn't a great terminology, but the intuition is that, you know, in supervised learning, where you get to see the inputs and get to see the correct outputs, the hidden layer has values you don't get to observe in the training set. It's not x, and it's not y, and so we call those hidden. Later we'll see neural networks with more than one hidden layer, but in this example, we have one input layer, layer 1, one hidden layer, layer 2, and one output layer, layer 3. But basically, anything that isn't an input layer and isn't an output layer is called a hidden layer.
    7:29
    So I want to be really clear about what this neural network is doing. Let's step through the computational steps that are embodied by this diagram. To explain the specific computations represented by a neural network, here's a little bit more notation. I'm going to use a superscript j, subscript i, to denote the activation of neuron i, or of unit i, in layer j. So concretely, this a superscript 2 subscript 1 is the activation of the first unit in layer two, in our hidden layer. And by activation I just mean the value that's computed by and output by a specific node. In addition, our neural network is parameterized by these matrices, theta superscript j, where theta(j) is going to be a matrix of weights controlling the function mapping from one layer to the next - say from the first layer to the second layer, or from the second layer to the third layer.
    8:30
    So here are the computations that are represented by this diagram.
    8:34
    This first hidden unit here has its value computed as follows: a(2)1 is equal to the sigmoid function, or the sigmoid activation function, also called the logistic activation function, applied to this sort of linear combination of the inputs. Then this second hidden unit has its activation value computed as sigmoid of this, and similarly this third hidden unit is computed by that formula. So here we have theta(1), which is the matrix of parameters governing the mapping from our three input units to our three hidden units.
    9:35
    Theta(1) is going to be a 3-by-4-dimensional matrix. More generally, if a network has s_j units in layer j and s_(j+1) units in layer j+1, then the matrix theta(j), which governs the function mapping from layer j to layer j+1, will have dimension s_(j+1) by (s_j + 1). Just to be clear about this notation: the number of rows is s subscript j+1, and the number of columns is s subscript j, and then this whole thing plus 1, that is (s_j + 1), okay?
    10:21
    So that's s subscript j+1 by (s_j + 1), where this plus one is not part of the subscript. Okay, so we talked about what the three hidden units do to compute their values. Finally, in the last layer, we have one more unit which computes h of x; it can also be written as a(3)1, and it's equal to this. And you notice that I've written this with a superscript 2 here, because theta superscript 2 is the matrix of parameters, or the matrix of weights, that controls the function mapping from the hidden units, that is the layer two units, to the one layer three unit, that is, the output unit. To summarize, what we've done is shown how a picture like this over here defines an artificial neural network, which defines a function h that maps from x's, the input values, to hopefully some good predictions for y. And these hypotheses are parameterized by parameters that we denote with a capital theta, so that as we vary theta, we get different hypotheses; we get different functions mapping, say, from x to y.
    11:42
    So this gives us a mathematical definition of how to represent the hypothesis in the neural network. In the next few videos what I would like to do is give you more intuition about what these hypothesis representations do, as well as go through a few examples and talk about how to compute them efficiently.

    Reading: Model Representation I

    Model Representation I
    Let's examine how we will represent a hypothesis function using neural networks. At a very simple level, neurons are basically computational units that take inputs (dendrites) as electrical inputs (called "spikes") that are channeled to outputs (axons). In our model, our dendrites are like the input features \(x_1, \cdots, x_n\), and the output is the result of our hypothesis function. In this model our \(x_0\) input node is sometimes called the "bias unit." It is always equal to 1. In neural networks, we use the same logistic function as in classification, \(\frac{1}{1 + e^{-\theta^T x}}\), yet we sometimes call it a sigmoid (logistic) activation function. In this situation, our "theta" parameters are sometimes called "weights".

    Visually, a simplistic representation looks like:

    Our input nodes (layer 1), also known as the "input layer", go into another node (layer 2), which finally outputs the hypothesis function, known as the "output layer".

    We can have intermediate layers of nodes between the input and output layers called the "hidden layers."

    In this example, we label these intermediate or "hidden" layer nodes \(a_0^{(2)} \cdots a_n^{(2)}\) and call them "activation units."

    If we had one hidden layer, it would look like:

    The values for each of the "activation" nodes is obtained as follows:
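    \[ a_1^{(2)} = g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3) \]
    \[ a_2^{(2)} = g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3) \]
    \[ a_3^{(2)} = g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3) \]
    \[ h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)}) \]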

    This is saying that we compute our activation nodes by using a 3×4 matrix of parameters. We apply each row of the parameters to our inputs to obtain the value for one activation node. Our hypothesis output is the logistic function applied to the sum of the values of our activation nodes, which have been multiplied by yet another parameter matrix \(\Theta^{(2)}\) containing the weights for our second layer of nodes.

    Each layer gets its own matrix of weights, \(\Theta^{(j)}\).

    The dimensions of these matrices of weights is determined as follows:

    If a network has \(s_j\) units in layer \(j\) and \(s_{j+1}\) units in layer \(j+1\), then \(\Theta^{(j)}\) will be of dimension \(s_{j+1} \times (s_j + 1)\).

    The +1 comes from the addition in \(\Theta^{(j)}\) of the "bias nodes," \(x_0\) and \(\Theta_0^{(j)}\). In other words the output nodes will not include the bias nodes while the inputs will. The following image summarizes our model representation:

    Example: If layer 1 has 2 input nodes and layer 2 has 4 activation nodes, the dimension of \(\Theta^{(1)}\) is going to be 4×3, where \(s_j = 2\) and \(s_{j+1} = 4\), so \(s_{j+1} \times (s_j + 1) = 4 \times 3\).
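    A quick way to sanity-check this dimension rule in Octave/MATLAB, using the layer sizes from the example above:

    ```matlab
    s1 = 2;  s2 = 4;                  % layer 1 has 2 input units, layer 2 has 4 units
    Theta1 = zeros(s2, s1 + 1);       % dimension s_{j+1} x (s_j + 1) = 4 x 3
    disp(size(Theta1))                % prints: 4  3
    ```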

    Video: Model Representation II

    In the last video, we gave a mathematical definition of how to represent, or how to compute, the hypotheses used by a Neural Network.
    0:08
    In this video, I'd like to show you how to actually carry out that computation efficiently, that is, to show you a vectorized implementation.
    0:17
    And second, and more importantly, I want to start giving you intuition about why these neural network representations might be a good idea and how they can help us to learn complex nonlinear hypotheses.
    0:28
    Consider this neural network. Previously we said that the sequence of steps we need in order to compute the output of the hypothesis is these equations given on the left, where we compute the activation values of the three hidden units and then use those to compute the final output of our hypothesis h of x. Now, I'm going to define a few extra terms. So, this term that I'm underlining here, I'm going to define to be z superscript 2 subscript 1, so that we have that a(2)1, which is this term, is equal to g of z(2)1. And by the way, these superscript 2's, what they mean is that this z(2) and this a(2) as well, the superscript 2 in parentheses, means that these are values associated with layer 2, that is with the hidden layer in the neural network.
    1:22
    Now this term here I'm going to similarly define as
    1:29
    z(2)2. And finally, this last term here that I'm underlining,
    1:34
    let me define that as z(2)3. So that similarly we have a(2)3 equals g of
    1:44
    z(2)3. So these z values are just a linear combination, a weighted linear combination, of the input values x0, x1, x2, x3 that go into a particular neuron.
    1:57
    Now if you look at this block of numbers,
    2:01
    you may notice that that block of numbers corresponds suspiciously similar
    2:06
    to the matrix vector operation, the matrix vector multiplication of theta(1) times the vector x. Using this observation, we're going to be able to vectorize this computation of the neural network.
    2:21
    Concretely, let's define the feature vector x as usual to be the vector of x0, x1, x2, x3, where x0 as usual is always equal to 1, and let's define z2 to be the vector of these z-values, you know, z(2)1, z(2)2, z(2)3.
    2:38
    And notice that, there, z2 this is a three dimensional vector.
    2:43
    We can now vectorize the computation
    2:48
    of a(2)1, a(2)2, a(2)3 as follows. We can just write this in two steps. We can compute z2 as theta(1) times x, and that would give us this vector z2; and then a2 is g of z2. And just to be clear, z2 here is a three-dimensional vector, a2 is also a three-dimensional vector, and this activation g applies the sigmoid function element-wise to each of z2's elements. And by the way, to make our notation a little more consistent with what we'll do later: in this input layer we have the inputs x, but we can also think of these as the activations of the first layer. So, if I define a1 to be equal to x - so a1 is a vector - I can now take this x here and replace it, writing z2 equals theta(1) times a1, just by defining a1 to be the activations of my input layer.
    3:44
    Now, with what I've written so far, I've gotten myself the values for a1, a2, a3 - and really I should put the superscript 2's there as well. But I need one more value: I also want this a(2)0, which corresponds to the bias unit in the hidden layer that goes into the output there. Of course, there was a bias unit here too; I just didn't draw it in here. To take care of this extra bias unit, what we're going to do is add an extra a0 superscript 2 that's equal to one, and after taking this step, a2 is going to be a four-dimensional feature vector, because we just added this extra, you know, a0, which is equal to 1, corresponding to the bias unit in the hidden layer. And finally,
    4:35
    to compute the actual value output of our hypotheses, we then simply need to compute
    4:42
    z3. So z3 is equal to this term here that I'm just underlining. This inner term there is z3.
    4:53
    And z3 is theta(2) times a2, and finally, my hypothesis's output h of x is a3, that is, the activation of my one and only unit in the output layer - so that's just a real number. You can write it as a3 or as a(3)1, and it's g of z3. This process of computing h of x is also called forward propagation,
    5:19
    and it's called that because we start off with the activations of the input units and then we sort of forward-propagate that to the hidden layer and compute the activations of the hidden layer, and then we sort of forward-propagate that and compute the activations of
    5:37
    the output layer. This process of computing the activations, from the input layer to the hidden layer to the output layer, is what's called forward propagation,
    5:43
    and what we just did is work out a vectorized implementation of this procedure. So, if you implement it using these equations that we have on the right, that would give you an efficient way of computing h of x.
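    Here is a minimal Octave/MATLAB sketch of these forward propagation equations for the network in this example (3 inputs, 3 hidden units, 1 output); the input and weight values are made up just for illustration:

    ```matlab
    g = @(z) 1 ./ (1 + exp(-z));      % sigmoid, applied element-wise

    Theta1 = rand(3, 4);              % maps layer 1 (3 inputs + bias) to layer 2 (3 units)
    Theta2 = rand(1, 4);              % maps layer 2 (3 units + bias) to layer 3 (1 output)

    x  = [0.5; -1.2; 3.0];            % raw input features x1, x2, x3
    a1 = [1; x];                      % add the bias unit: a1 = [x0; x1; x2; x3]

    z2 = Theta1 * a1;                 % z(2) = Theta(1) * a(1)
    a2 = [1; g(z2)];                  % a(2) = g(z(2)), then add the bias unit a0(2) = 1

    z3 = Theta2 * a2;                 % z(3) = Theta(2) * a(2)
    h  = g(z3);                       % h_Theta(x) = a(3) = g(z(3))
    ```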
    5:58
    This forward propagation view also
    6:00
    helps us to understand what Neural Networks might be doing and why they might help us to learn interesting nonlinear hypotheses.
    6:08
    Consider the following neural network, and let's say I cover up the left part of this picture for now. If you look at what's left in this picture, it looks a lot like logistic regression, where what we're doing is using that node, which is just a logistic regression unit, to make a prediction h of x. And concretely, what the hypothesis is outputting is: h of x is going to be equal to g - which is my sigmoid activation function - of theta 0 times a0 (which is equal to 1) plus theta 1 times a1
    6:45
    plus theta 2 times a2 plus theta 3 times a3, where the values a1, a2, a3 are those given by these three hidden units.
    7:01
    Now, to be consistent with my earlier notation, we actually need to, you know, fill in these superscript 2's here everywhere,
    7:12
    and I also have these subscript 1's there because I have only one output unit. But if you focus on the blue parts of the notation, this looks awfully like the standard logistic regression model, except that I now have a capital theta instead of a lowercase theta.
    7:29
    And what this is doing is just logistic regression.
    7:33
    But where the features fed into logistic regression are these values computed by the hidden layer.
    7:41
    Just to say that again, what this neural network is doing is just like logistic regression, except that rather than using the original features x1, x2, x3,
    7:52
    it is using these new features a1, a2, a3. Again, we'll put the superscripts
    7:58
    there, you know, to be consistent with the notation.
    8:02
    And the cool thing about this, is that the features a1, a2, a3, they themselves are learned as functions of the input.
    8:10
    Concretely, the function mapping from layer 1 to layer 2 is determined by some other set of parameters, theta(1). So it's as if the neural network, instead of being constrained to feed the features x1, x2, x3 into logistic regression, gets to learn its own features, a1, a2, a3, to feed into the logistic regression. And as you can imagine, depending on what parameters it chooses for theta(1), you can learn some pretty interesting and complex features, and therefore
    8:43
    you can end up with a better hypothesis than if you were constrained to use the raw features x1, x2, x3, or if you were constrained to, say, choose polynomial terms of x1, x2, x3, and so on. Instead, this algorithm has the flexibility to try to learn whatever features it wants, using these a1, a2, a3, in order to feed into this last unit that's essentially
    9:09
    a logistic regression here. I realize this example is described at a somewhat high level, and so I'm not sure if this intuition of the neural network, you know, having more complex features will quite make sense yet; but if it doesn't, in the next two videos I'm going to go through a specific example of how a neural network can use this hidden layer to compute more complex features to feed into this final output layer, and how that can learn more complex hypotheses. So, in case what I'm saying here doesn't quite make sense, stick with me for the next two videos, and hopefully, after working through those examples, this explanation will make a little bit more sense. But just one more point: you can have neural networks with other types of diagrams as well, and the way that neural networks are connected, that's called the architecture. So the term architecture refers to how the different neurons are connected to each other. This is an example of a different neural network architecture,
    10:07
    and once again you may be able to get this intuition of how the second layer - here we have three hidden units - is computing some complex function, maybe, of the input layer, and then the third layer can take the second layer's features and compute even more complex features, so that by the time you get to the output layer, layer four, you have even more complex features than what you were able to compute in layer three, and so you get very interesting nonlinear hypotheses.
    10:36
    By the way, in a network like this, layer one, this is called an input layer. Layer four is still our output layer, and this network has two hidden layers. So anything that's not an input layer or an output layer is called a hidden layer.
    10:53
    So, hopefully from this video you've gotten a sense of how the feed forward propagation step in a neural network works where you start from the activations of the input layer and forward propagate that to the first hidden layer, then the second hidden layer, and then finally the output layer. And you also saw how we can vectorize that computation.
    11:13
    I realize that some of the intuitions in this video, of how, you know, the later layers are computing complex features of the earlier layers, may still be slightly abstract and kind of high level. And so what I would like to do in the next two videos is work through a detailed example of how a neural network can be used to compute nonlinear functions of the input, and I hope that will give you a good sense of the sorts of complex nonlinear hypotheses we can get out of Neural Networks.

    Reading: Model Representation II

    Model Representation II

    To re-iterate, the following is an example of a neural network:

    In this section we'll do a vectorized implementation of the above functions. We're going to define a new variable \(z_k^{(j)}\) that encompasses the parameters inside our g function. In our previous example, if we replaced everything inside g by the variable z, we would get:

    In other words, for layer j=2 and node k, the variable z will be:

    The vector representation of x and \(z^{(j)}\) is:

    Setting \(x = a^{(1)}\), we can rewrite the equation as:
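    \[ z^{(j)} = \Theta^{(j-1)} a^{(j-1)} \]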

    We are multiplying our matrix \(\Theta^{(j-1)}\), with dimensions \(s_j \times (n+1)\) (where \(s_j\) is the number of our activation nodes), by our vector \(a^{(j-1)}\) with height \(n+1\). This gives us our vector \(z^{(j)}\) with height \(s_j\). Now we can get a vector of our activation nodes for layer j as follows:

    (a^{(j)} = g(z^{(j)}))

    Where our function g can be applied element-wise to our vector (z^{(j)}).

    We can then add a bias unit (equal to 1) to layer j after we have computed (a^{(j)}). This will be element (a_0^{(j)}) and will be equal to 1. To compute our final hypothesis, let's first compute another z vector:

    (z^{(j+1)} = Theta^{(j)}a^{(j)})

    We get this final z vector by multiplying the next theta matrix after (Theta^{(j-1)}) with the values of all the activation nodes we just got. This last theta matrix (Theta^{(j)}) will have only one row which is multiplied by one column (a^{(j)}) so that our result is a single number. We then get our final result with:

    (h_Theta(x) = a^{(j+1)} = g(z^{(j+1)}))

    Notice that in this last step, between layer j and layer j+1, we are doing exactly the same thing as we did in logistic regression. Adding all these intermediate layers in neural networks allows us to more elegantly produce interesting and more complex non-linear hypotheses.
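    As a sketch, this same layer-by-layer recipe can be written as a loop over an arbitrary number of layers in Octave/MATLAB; here Theta is a cell array of per-layer weight matrices, and the sizes are purely illustrative:

    ```matlab
    g = @(z) 1 ./ (1 + exp(-z));

    % One weight matrix per transition, each of size s_{j+1} x (s_j + 1)
    Theta = {rand(5, 4), rand(5, 6), rand(1, 6)};   % e.g. a 4-layer network

    x = [0.1; -0.4; 2.0];        % raw input features
    a = [1; x];                  % a(1) with bias unit

    for j = 1:numel(Theta)
      z = Theta{j} * a;          % z(j+1) = Theta(j) * a(j)
      a = g(z);                  % a(j+1) = g(z(j+1))
      if j < numel(Theta)
        a = [1; a];              % add the bias unit for every layer except the output
      end
    end
    h = a;                       % h_Theta(x)
    ```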

    Video: Examples and Intuitions I

    In this and the next video I want to work through a detailed example showing how a neural network can compute a complex nonlinear function of the input, and hopefully this will give you a good sense of why neural networks can be used to learn complex nonlinear hypotheses. Consider the following problem, where we have features x1 and x2 that are binary values, so either 0 or 1. So, x1 and x2 can each take on only one of two possible values. In this example, I've drawn only two positive examples and two negative examples, but you can think of this as a simplified version of a more complex learning problem where we may have a bunch of positive examples in the upper right and lower left and a bunch of negative examples denoted by the circles. And what we'd like to do is learn a nonlinear decision boundary that we may need in order to separate the positive and negative examples.
    0:53
    So, how can a neural network do this? Rather than working with the example on the right, it's maybe easier to examine the example on the left. Concretely, what this is really computing is a label of the type y equals x1 XOR x2 - or actually, the x1 XNOR x2 function, where XNOR is the alternative notation for NOT (x1 XOR x2). So, x1 XOR x2 is true only if exactly one of x1 or x2 is equal to 1. It turns out this specific example works out a little bit better if we use the XNOR version instead; the two problems are essentially the same, of course. XNOR means NOT (x1 XOR x2), and so we're going to have positive examples, y equals 1, when either both inputs are true or both are false, and we're going to have y equals 0 if exactly one of them is true. And we're going to figure out if we can get a neural network to fit this sort of training set.
    1:59
    In order to build up to a network that fits the XNOR example, we're going to start with a slightly simpler one and show a network that fits the AND function. Concretely, let's say we have inputs x1 and x2 that are again binary, so each is either 0 or 1, and let's say our target label is y = x1 AND x2. This is the logical AND.
    2:30
    So, can we get a one-unit network to compute this logical AND function? In order to do so, I'm going to actually draw in the bias unit as well the plus one unit.
    2:45
    Now let me just assign some values to the weights, or parameters, of this network. I'm going to write down the parameters on this diagram here: -30 here, +20, and +20. And what this means is just that I'm assigning a value of -30 to the parameter associated with x0, this +1 going into this unit, a parameter value of +20 that multiplies into x1, and a value of +20 for the parameter that multiplies into x2. So, concretely, this is saying that the hypothesis is h(x) = g(-30 + 20 x1 + 20 x2). Sometimes it's just convenient to draw these weights, these parameters, up there in the diagram; and of course this -30 is actually theta(1) subscript 10, this is theta(1) subscript 11, and that's theta(1) subscript 12, but it's just easier to think of these parameters as being associated with the edges of the network.
    4:01
    Let's look at what this little single-neuron network will compute. Just to remind you, the sigmoid activation function g(z) looks like this: it starts from 0, rises smoothly, crosses 0.5, and then asymptotes at 1. And to give you some landmarks, if the horizontal axis value z is equal to 4.6, then the sigmoid function equals 0.99, which is very close to 1; and, kind of symmetrically, if it's -4.6, then the sigmoid function there is 0.01, which is very close to 0. Let's look at the four possible input values for x1 and x2 and see what the hypothesis will output in each case. If x1 and x2 are both equal to 0, then the hypothesis outputs g of -30. That is very far to the left of this diagram, so it will be very close to 0. If x1 equals 0 and x2 equals 1, then this formula evaluates to g, that is the sigmoid function, applied to -10, and again that's, you know, to the far left of this plot, and so that's again very close to 0.
    5:17
    This is also g of minus 10: that is, if x1 is equal to 1 and x2 equals 0, this is minus 30 plus 20, which is minus 10. And finally, if x1 equals 1 and x2 equals 1, then you have g of minus 30 plus 20 plus 20, so that's g of positive 10, which is therefore very close to 1.
    5:39
    And if you look down this column, this is exactly the logical AND function. So this is computing h of x approximately equal to x1 AND x2; in other words, it outputs 1 if and only if x1 and x2 are both equal to 1. So, by writing out our little truth table like this, we managed to figure out what logical function our neural network computes. This network shown here computes the OR function. Just to show you how I worked that out: if you write out the hypothesis, you find that it's computing g of -10 + 20 x1 + 20 x2, and so if you fill in these values, you find that's g of minus 10, which is approximately 0, g of 10, which is approximately 1, and so on; these are approximately 1 and approximately 1, and these numbers are essentially the logical OR function.
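    A small Octave/MATLAB sketch of these two single-neuron gates, checking all four input combinations with the weights from the lecture:

    ```matlab
    g = @(z) 1 ./ (1 + exp(-z));

    theta_and = [-30; 20; 20];   % weights for x1 AND x2
    theta_or  = [-10; 20; 20];   % weights for x1 OR x2

    inputs = [0 0; 0 1; 1 0; 1 1];            % all combinations of x1, x2
    for i = 1:4
      x = [1; inputs(i, :)'];                 % prepend the bias unit x0 = 1
      fprintf('x1=%d x2=%d  AND=%.2f  OR=%.2f\n', ...
              inputs(i, 1), inputs(i, 2), g(theta_and' * x), g(theta_or' * x));
    end
    ```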
    6:49
    So, hopefully with this you now understand how single neurons in a neural network can be used to compute logical functions like AND and OR and so on. In the next video we'll continue building on these examples and work through a more complex example. We'll get to show you how a neural network now with multiple layers of units can be used to compute more complex functions like the XOR function or the XNOR function.

    Reading: Examples and Intuitions I

    Examples and Intuitions I
    A simple example of applying neural networks is by predicting (x_1) AND (x_2), which is the logical 'and' operator and is only true if both (x_1) and (x_2) are 1.

    The graph of our functions will look like:

    Remember that (x_0) is our bias variable and is always 1.

    Let's set our first theta matrix as:
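    \[ \Theta^{(1)} = \begin{bmatrix} -30 & 20 & 20 \end{bmatrix} \]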

    This will cause the output of our hypothesis to only be positive if both (x_1) and (x_2) are 1. In other words:

    So we have constructed one of the fundamental operations in computers by using a small neural network rather than using an actual AND gate. Neural networks can also be used to simulate all the other logical gates. The following is an example of the logical operator 'OR', meaning either (x_1) is true or (x_2) is true, or both:

    Where g(z) is the following:

    Video: Examples and Intuitions II

    In this video I'd like to keep working through our example to show how a Neural Network can compute a complex nonlinear hypothesis.
    0:10
    In the last video we saw how a Neural Network can be used to compute the function x1 AND x2, and the function x1 OR x2, when x1 and x2 are binary, that is when they take on the values 0 or 1. We can also have a network compute negation, that is, compute the function NOT x1. Let me just write down the weights associated with this network. We have only one input feature x1 in this case, and the bias unit +1. And if I associate this with the weights plus 10 and -20, then my hypothesis is computing h(x) equals sigmoid(10 - 20 x1). So when x1 is equal to 0, my hypothesis is computing g(10 - 20 times 0), which is just g(10), and so that's approximately 1; and when x1 is equal to 1, this will be g(-10), which is approximately equal to 0. And if you look at what these values are, that's essentially the NOT x1 function. So to include negations, the general idea is to put a large negative weight in front of the variable you want to negate: minus 20 multiplied by x1, and that's the general idea of how you end up negating x1. And so, in an example that I hope you can figure out yourself: if you want to compute a function like NOT x1 AND NOT x2, part of that will probably be putting large negative weights in front of x1 and x2, but it should be feasible to get a neural network with just one output unit to compute this as well. All right, so this logical function, NOT x1 AND NOT x2, is going to be equal to 1 if and only if
    2:06
    x1 equals x2 equals 0. All right since this is a logical function, this says NOT x1 means x1 must be 0 and NOT x2, that means x2 must be equal to 0 as well. So this logical function is equal to 1 if and only if both x1 and x2 are equal to 0 and hopefully you should be able to figure out how to make a small neural network to compute this logical function as well.
    2:33
    Now, taking the three pieces we have put together - the network for computing x1 AND x2, the network for computing NOT x1 AND NOT x2, and one last network for computing x1 OR x2 - we should be able to put these three pieces together to compute this x1 XNOR x2 function. And just to remind you, if this is x1 and this is x2, the function that we want to compute would have negative examples here and here, and positive examples there and there. So clearly this will need a nonlinear decision boundary in order to separate the positive and negative examples.
    3:12
    Let's draw the network. I'm going to take my inputs +1, x1, x2 and create my first hidden unit here. I'm going to call this a(2)1, because that's my first hidden unit of layer two. And I'm going to copy over the weights from the red network, the x1 AND x2 network: so -30, 20, 20. Next, let me create a second hidden unit, which I'm going to call a(2)2, the second hidden unit of layer two. I'm going to copy over the cyan network in the middle, so I'll have the weights 10, -20, -20. And so, let's fill in some of the truth table values. For the red network, we know that it was computing x1 AND x2, and so this will be approximately 0, 0, 0, 1, depending on the values of x1 and x2. And for a(2)2, the cyan network, what do we know? The function NOT x1 AND NOT x2 outputs 1, 0, 0, 0 for the four values of x1 and x2.
    4:18
    Finally, I'm going to create my output node, my output unit, that is a(3)1. This computes the output h(x), and I'm going to copy over the green network for that. I'm going to need a +1 bias unit here, so let me draw that in, and I'm going to copy over the weights from the green network: that's -10, 20, 20, and we saw earlier that this computes the OR function.
    4:46
    So let's fill in the truth table entries.
    4:50
    So the first entry is 0 OR 1, which is 1; next is 0 OR 0, which is 0; then 0 OR 0, which is 0; and finally 1 OR 0, which is 1. And thus h(x) is equal to 1 when either both x1 and x2 are zero or when x1 and x2 are both 1; concretely, h(x) outputs 1 exactly at these two locations and outputs 0 otherwise.
    5:19
    And thus, with this neural network, which has an input layer, one hidden layer, and one output layer, we end up with a nonlinear decision boundary that computes this XNOR function. The more general intuition is that in the input layer we just have our raw inputs; then we have a hidden layer, which computes some slightly more complex functions of the inputs, as shown here; and then by adding yet another layer we end up with an even more complex nonlinear function.
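    Putting the three pieces together in Octave/MATLAB, using the weights from the lecture (AND and NOR in the hidden layer, OR in the output layer):

    ```matlab
    g = @(z) 1 ./ (1 + exp(-z));

    Theta1 = [-30  20  20;      % row 1: x1 AND x2               (red network)
               10 -20 -20];     % row 2: (NOT x1) AND (NOT x2)   (cyan network)
    Theta2 = [-10  20  20];     % output: a1 OR a2               (green network)

    inputs = [0 0; 0 1; 1 0; 1 1];
    for i = 1:4
      a1 = [1; inputs(i, :)'];          % input layer with bias unit
      a2 = [1; g(Theta1 * a1)];         % hidden layer activations with bias unit
      h  = g(Theta2 * a2);              % output: approximately x1 XNOR x2
      fprintf('x1=%d x2=%d  h=%.2f\n', inputs(i, 1), inputs(i, 2), h);
    end
    ```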
    5:50
    And this is the sort of intuition for why neural networks can compute pretty complicated functions: when you have multiple layers, you have relatively simple functions of the inputs in the second layer, but the third layer can build on those to compute even more complex functions, and then the layer after that can compute even more complex functions still.
    6:10
    To wrap up this video, I want to show you a fun example of an application of a neural network that captures this intuition of the deeper layers computing more complex features. I want to show you a video that I got from a good friend of mine, Yann LeCun. Yann is a professor at New York University, NYU; he was one of the early pioneers of neural network research and is sort of a legend in the field now, and his ideas are used in all sorts of products and applications throughout the world.
    6:41
    So I want to show you a video from some of his early work in which he was using a neural network to recognize handwriting, to do handwritten digit recognition. You might remember that at the start of this class I said one of the earliest successes of neural networks was using them to read zip codes, to help the US Postal Service read postal codes. So this is one of the attempts, one of the algorithms, used to try to address that problem. In the video that I'll show you, this area here is the input area that shows a handwritten character shown to the network. This column here shows a visualization of the features computed by the first hidden layer of the network, so this visualization shows the different features - different edges and lines and so on - detected. This is a visualization of the next hidden layer; it's kind of harder to see, harder to understand the deeper hidden layers, and that's a visualization of what the next hidden layer is computing. You probably have a hard time seeing what's going on much beyond the first hidden layer, but then finally all of these learned features get fed to the output layer, and shown over here is the final answer, the final predicted value for what handwritten digit the neural network thinks it is being shown. So let's take a look at the video. [MUSIC]
    9:49
    So I hope you enjoyed the video, and that it gave you some intuition about the sorts of pretty complicated functions neural networks can learn, in which the network takes as input this image, just the raw pixels, and the first hidden layer computes some set of features; the next hidden layer computes even more complex features, and even more complex features after that. And these features can then be used by what is essentially the final layer, a logistic classifier, to make accurate predictions about which digit the network is seeing.

    Reading: Examples and Intuitions II

    Examples and Intuitions II
    The \(\Theta^{(1)}\) matrices for AND, NOR, and OR are:
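    \[ \text{AND: } \Theta^{(1)} = \begin{bmatrix} -30 & 20 & 20 \end{bmatrix} \qquad \text{NOR: } \Theta^{(1)} = \begin{bmatrix} 10 & -20 & -20 \end{bmatrix} \qquad \text{OR: } \Theta^{(1)} = \begin{bmatrix} -10 & 20 & 20 \end{bmatrix} \]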

    We can combine these to get the XNOR logical operator (which gives 1 if (x_1) and (x_2) are both 0 or both 1).

    For the transition between the first and second layer, we'll use a (Theta^{(1)}) matrix that combines the values for AND and NOR:

    For the transition between the second and third layer, we'll use a (Theta^{(2)}) matrix that uses the value for OR:

    Let's write out the values for all our nodes:
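    \[ a^{(2)} = g(\Theta^{(1)} \cdot x) \]
    \[ a^{(3)} = g(\Theta^{(2)} \cdot a^{(2)}) \]
    \[ h_\Theta(x) = a^{(3)} \]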

    And there we have the XNOR operator using a hidden layer with two nodes! The following summarizes the above algorithm:

    Video: Multiclass Classification

    In this video, I want to tell you about how to use neural networks to do multiclass classification, where we may have more than one category that we're trying to distinguish amongst. In the last part of the last video, where we had the handwritten digit recognition problem, that was actually a multiclass classification problem, because there were ten possible categories for recognizing the digits from 0 through 9, and so, if you like, let me fill you in on the details of how to do that.
    0:30
    The way we do multiclass classification
    0:32
    in a neural network is essentially an extension of the one versus all method.
    0:38
    So, let's say that we have a computer vision example, where instead of just trying to recognize cars as in the original example that I started off with, but let's say that we're trying to recognize, you know, four categories of objects and given an image we want to decide if it is a pedestrian, a car, a motorcycle or a truck. If that's the case, what we would do is we would build a neural network with four output units so that our neural network now outputs a vector of four numbers.
    1:09
    So, the output now is actually needing to be a vector of four numbers and what we're going to try to do is get the first output unit to classify: is the image a pedestrian, yes or no. The second unit to classify: is the image a car, yes or no. This unit to classify: is the image a motorcycle, yes or no, and this would classify: is the image a truck, yes or no. And thus, when the image is of a pedestrian, we would ideally want the network to output 1, 0, 0, 0, when it is a car we want it to output 0, 1, 0, 0, when this is a motorcycle, we get it to or rather, we want it to output 0, 0, 1, 0 and so on.
    1:50
    So this is just like the "one versus all" method that we talked about when we were describing logistic regression, and here we have essentially four logistic regression classifiers, each of which is trying to recognize one of the four classes that we want to distinguish amongst. So, rearranging the slide of it, here's our neural network with four output units and those are what we want h of x to be when we have the different images, and the way we're going to represent the training set in these settings is as follows. So, when we have a training set with different images
    2:27
    of pedestrians, cars, motorcycles and trucks, what we're going to do in this example is the following: whereas previously we had written out the labels as y being an integer from 1, 2, 3 or 4, instead of representing y this way, we're going to represent y as follows: namely, y(i)
    2:54
    will be either 1, 0, 0, 0 or 0, 1, 0, 0 or 0, 0, 1, 0 or 0, 0, 0, 1, depending on what the corresponding image x(i) is. And so one training example will be one pair, x(i) comma y(i),
    3:04
    where x(i) is an image with, you know, one of the four objects, and y(i) will be one of these vectors.
    3:10
    And hopefully, we can find a way to get our Neural Network to output some value such that h of x is approximately y. Both h of x and y(i) are going to be, in our example, four-dimensional vectors when we have four classes.
    3:31
    So, that's how you get a neural network to do multiclass classification.
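    A minimal Octave/MATLAB sketch of this label encoding, and of reading off the predicted class from the network's four outputs (the output vector here is made up for illustration):

    ```matlab
    num_labels = 4;                          % pedestrian, car, motorcycle, truck

    % Represent y(i) as one of four unit vectors instead of an integer 1..4
    y_int = 3;                               % e.g. this example's class is "3"
    y_vec = zeros(num_labels, 1);
    y_vec(y_int) = 1;                        % y_vec = [0; 0; 1; 0]

    % Given the network's four output activations, predict the most likely class
    h = [0.05; 0.10; 0.85; 0.02];            % example output vector h_Theta(x)
    [~, prediction] = max(h);                % prediction = 3
    ```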
    3:36
    This wraps up our discussion on how to represent Neural Networks that is on our hypotheses representation.
    3:42
    In the next set of videos, let's start to talk about how to take a training set and automatically learn the parameters of the neural network.

    Reading: Multiclass Classification

    Multiclass Classification

    To classify data into multiple classes, we let our hypothesis function return a vector of values. Say we wanted to classify our data into one of four categories. We will use the following example to see how this classification is done. This algorithm takes as input an image and classifies it accordingly:

    We can define our set of resulting classes as y:

    Each \(y^{(i)}\) represents a different image corresponding to either a car, pedestrian, truck, or motorcycle. The inner layers each provide us with some new information, which leads to our final hypothesis function. The setup looks like:

    Our resulting hypothesis for one set of inputs may look like:

    In which case our resulting class is the third one down, or (h_Theta(x)_3), which represents the motorcycle.

    Reading: Lecture Slides

    Lecture8.pdf

    Programming: Multi-class Classification and Neural Networks

    Download the programming assignment here.

    This ZIP file contains the instructions in a PDF and the starter code. You may use either MATLAB or Octave (>= 3.8.0). To submit this assignment, call the included submit function from MATLAB / Octave. You will need to enter the token provided on the right-hand side of this page.

    Neural Networks: Learning

    In this module, we introduce the backpropagation algorithm that is used to help learn parameters for a neural network. At the end of this module, you will be implementing your own neural network for digit recognition.

    8 videos, 8 readings

    Video: Cost Function

    Neural networks are one of the most powerful learning algorithms that we have today. In this and in the next few videos, I'd like to start talking about a learning algorithm for fitting the parameters of a neural network given a training set. As with the discussion of most of our learning algorithms, we're going to begin by talking about the cost function for fitting the parameters of the network.
    0:22
    I'm going to focus on the application of neural networks to classification problems. So suppose we have a network like that shown on the left, and suppose we have a training set like this, of (x(i), y(i)) pairs, with m training examples.
    0:38
    I'm going to use uppercase L to denote the total number of layers in this network; so for the network shown on the left we would have capital L equals 4. I'm going to use s subscript l to denote the number of units, that is the number of neurons, not counting the bias unit, in layer l of the network. So for example, we would have s1, which is the input layer, equal to 3 units; s2 in my example is 5 units; and the output layer s4, which is also equal to sL because capital L is equal to four - the output layer in my example has four units.
    1:17
    We're going to consider two types of classification problems. The first is binary classification, where the labels y are either 0 or 1. In this case we will have one output unit - this neural network shown on top has four output units, but if we had binary classification we would have only one output unit that computes h(x) - and the output of the neural network, h(x), is going to be a real number. In this case the number of output units, sL, where L is again the index of the final layer (because that's the number of layers we have in the network), that is, the number of units we have in the output layer, is going to be equal to 1. To simplify notation later, I'm also going to set K = 1, so you can think of K as also denoting the number of units in the output layer. The second type of classification problem we'll consider is a multiclass classification problem, where we may have K distinct classes. So our earlier example had this representation for y if we have 4 classes, and in this case we will have capital K output units, and our hypothesis will output vectors that are K-dimensional; the number of output units will be equal to K. And usually we would have K greater than or equal to 3 in this case, because if we had two classes then we don't need to use the one-versus-all method. We use the one-versus-all method only if we have K greater than or equal to 3 classes; if we have only two classes, we need only one output unit.
    3:03
    The cost function we use for the neural network is going to be a generalization of the one that we use for logistic regression. For logistic regression, we minimized the cost function J(theta), which was minus 1/m of this cost term, plus this extra regularization term here, where this was a sum from j = 1 through n, because we do not regularize the bias term theta 0.
    3:31
    For a neural network, our cost function is going to be a generalization of this, where instead of having basically just one logistic regression output unit, we may instead have K of them. So here's our cost function. Our neural network now outputs vectors in R^K, where K might be equal to 1 if we have a binary classification problem. I'm going to use the notation h(x) subscript i to denote the ith output; that is, h(x) is a K-dimensional vector, and this subscript i just selects out the ith element of the vector that is output by my neural network.
    4:08
    My cost function J(Theta) is now going to be the following: minus 1 over m of a sum of a term similar to what we have for logistic regression, except that we also have a sum from k equals 1 through K. This summation is basically a sum over my K output units. So if I have four output units, that is, if the final layer of my neural network has four output units, then this is a sum from k equals one through four of basically the logistic regression algorithm's cost function, summing that cost function over each of my four output units in turn. You notice in particular that this applies to y_k and h_k, because we're basically taking the k-th output unit and comparing that to the value of y_k, which is the component of the label vector saying which class the example should belong to. And finally, the second term here is the regularization term, similar to what we had for logistic regression. This summation term looks really complicated, but all it's doing is summing over these terms Theta(l) j i for all values of i, j, and l, except that we don't sum over the terms corresponding to the bias values, just like for logistic regression. Concretely, we don't sum over the terms where i is equal to 0. That is because, when we're computing the activation of a neuron, we have terms like Theta i0 x0 plus Theta i1 x1 plus, and so on - where I guess we could put a superscript 2 there if this is the first hidden layer - and so the values with a zero there correspond to something that multiplies into an x0 or an a0, and that is kind of like the bias unit. By analogy to what we were doing for logistic regression, we won't sum over those terms in our regularization term, because we don't want to regularize them and shrink their values towards zero. But this is just one possible convention, and even if you were to sum over i equals 0 up to s_l, it would work about the same and doesn't make a big difference. Maybe this convention of not regularizing the bias terms is just slightly more common.
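    As a rough Octave/MATLAB sketch of this cost function, assuming the outputs H (m x K), the one-hot labels Y (m x K), the regularization parameter lambda, and a cell array Theta of weight matrices have already been computed elsewhere:

    ```matlab
    % Unregularized part: sum the logistic cost over all m examples and K output units
    J = (-1 / m) * sum(sum( Y .* log(H) + (1 - Y) .* log(1 - H) ));

    % Regularization: sum the squares of all weights, skipping the bias column (i = 0)
    reg = 0;
    for l = 1:numel(Theta)
      reg = reg + sum(sum( Theta{l}(:, 2:end) .^ 2 ));
    end
    J = J + (lambda / (2 * m)) * reg;
    ```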
    6:33
    So that's the cost function we're going to use for our neural network. In the next video we'll start to talk about an algorithm for trying to optimize the cost function.

    Reading: Cost Function

    Cost Function

    Let's first define a few variables that we will need to use:

    • L = total number of layers in the network
    • (s_l) = number of units (not counting bias unit) in layer l
    • K = number of output units/classes

    Recall that in neural networks, we may have many output nodes. We denote (h_Theta(x)_k) as being a hypothesis that results in the (k^{th}) output. Our cost function for neural networks is going to be a generalization of the one we used for logistic regression. Recall that the cost function for regularized logistic regression was:
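    \[ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[ y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\big(1 - h_\theta(x^{(i)})\big) \Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 \]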

    For neural networks, it is going to be slightly more complicated:
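    \[ J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[ y_k^{(i)}\log\big((h_\Theta(x^{(i)}))_k\big) + (1 - y_k^{(i)})\log\big(1 - (h_\Theta(x^{(i)}))_k\big) \Big] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\big(\Theta_{j,i}^{(l)}\big)^2 \]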

    We have added a few nested summations to account for our multiple output nodes. In the first part of the equation, before the square brackets, we have an additional nested summation that loops through the number of output nodes.

    In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we square every term.

    Note:

    • the double sum simply adds up the logistic regression costs calculated for each cell in the output layer
    • the triple sum simply adds up the squares of all the individual Θs in the entire network.
    • the i in the triple sum does not refer to training example i

    Video: Backpropagation Algorithm

    In the previous video, we talked about a cost function for the neural network. In this video, let's start to talk about an algorithm, for trying to minimize the cost function. In particular, we'll talk about the back propagation algorithm.
    0:13
    Here's the cost function that we wrote down in the previous video. What we'd like to do is try to find parameters theta to minimize J of theta, in order to use either gradient descent or one of the advanced optimization algorithms. What we need to do, therefore, is write code that takes as input the parameters theta and computes J of theta and these partial derivative terms. Remember that the parameters in the neural network are these things, theta superscript l subscript ij; those are real numbers, and so these are the partial derivative terms we need to compute. In order to compute the cost function J of theta, we just use this formula up here, and so what I want to do for most of this video is focus on talking about how we can compute these partial derivative terms. Let's start by talking about the case when we have only one training example. So imagine, if you will, that our entire training set comprises only one training example, which is a pair (x, y). I'm not going to write (x1, y1); I'll just write this one training example as (x, y), and let's step through the sequence of calculations we would do with this one training example.
    1:25
    The first thing we do is apply forward propagation in order to compute what our hypothesis actually outputs given the input. Concretely, we let a(1) be the activation values of this first layer, that is, the input layer. So I'm going to set that to x, and then we're going to compute z(2) equals theta(1) a(1) and a(2) equals g, the sigmoid activation function, applied to z(2), and this gives us our activations for the first hidden layer, that is, for layer two of the network, and we also add those bias terms. Next we apply two more steps of this forward propagation to compute a(3) and a(4), which is also the output of our hypothesis h of x. So this is our vectorized implementation of forward propagation, and it allows us to compute the activation values for all of the neurons in our neural network.
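    As a minimal Octave sketch of that vectorized forward pass (the names x, Theta1, Theta2 and Theta3 are illustrative and assumed to already hold the input and weight matrices; they are not taken from the programming assignment):

    sigmoid = @(z) 1 ./ (1 + exp(-z));   % sigmoid activation function
    a1 = [1; x];                         % input activations, bias unit added
    z2 = Theta1 * a1;
    a2 = [1; sigmoid(z2)];               % layer-2 activations, bias unit added
    z3 = Theta2 * a2;
    a3 = [1; sigmoid(z3)];               % layer-3 activations, bias unit added
    z4 = Theta3 * a3;
    a4 = sigmoid(z4);                    % layer-4 activations, i.e. h(x)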
    2:27
    Next, in order to compute the derivatives, we're going to use an algorithm called back propagation.
    2:34
    The intuition of the back propagation algorithm is that for each node we're going to compute the term delta superscript l subscript j, which is going to somehow represent the error of node j in layer l. So, recall that a superscript l subscript j denotes the activation of the jth unit in layer l, and so this delta term is in some sense going to capture our error in the activation of that neural unit; that is, how much we might wish the activation of that node were slightly different. Concretely, taking the example neural network that we have on the right, which has four layers, and so capital L is equal to 4: for each output unit, we're going to compute this delta term. So, delta for the jth unit in the fourth layer is equal to
    3:23
    just the activation of that unit minus what was the actual value observed in our training example.
    3:29
    So, this term here can also be written h of x subscript j, right? So this delta term is just the difference between what our hypothesis output and what was the value of y in our training set, where y subscript j is the jth element of the vector-valued y in our labeled training set.
    3:56
    And by the way, if you think of delta, a and y as vectors, then you can also take those and come up with a vectorized implementation of this, which is just delta(4) gets set as a(4) minus y. Where here, each of these, delta(4), a(4) and y, is a vector whose dimension is equal to the number of output units in our network.
    4:25
    So we've now computed the error term delta(4) for our network.
    4:31
    What we do next is compute the delta terms for the earlier layers in our network. Here's the formula for computing delta(3): delta(3) is equal to theta(3) transpose times delta(4), dot times g prime of z(3). And this dot times is the element-wise multiplication operation
    4:47
    that we know from MATLAB. So theta(3) transpose times delta(4), that's a vector; g prime of z(3), that's also a vector; and so dot times is an element-wise multiplication between these two vectors.
    5:01
    This term g prime of z(3), formally, is actually the derivative of the activation function g evaluated at the input values given by z(3). If you know calculus, you can try to work it out yourself and see that you can simplify it to the same answer that I get. But I'll just tell you pragmatically what that means. What you do to compute this g prime, these derivative terms, is just a(3) dot times (1 minus a(3)), where a(3) is the vector of activation values for that layer and 1 is the vector of ones. Next you apply a similar formula to compute delta(2), where again that can be computed using a similar formula.
    5:48
    Only now it is a(2), like so. I won't prove it here, but it's possible to prove, if you know calculus, that this expression is mathematically equal to the derivative of the activation function g, which I'm denoting by g prime. And finally, that's it, and there is no delta(1) term, because the first layer corresponds to the input layer, and that's just the features we observed in our training set, so it doesn't have any error associated with it. It's not like we want to try to change those values. And so we have delta terms only for layers 2, 3 and 4 for this example.
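    As a rough Octave sketch of those backward steps for this four-layer example (this ignores the bias-unit bookkeeping a real implementation needs, and uses the fact that g prime of z is a .* (1 - a) for the sigmoid; the variable names continue the forward-propagation sketch above):

    delta4 = a4 - y;                                 % output-layer error
    delta3 = (Theta3' * delta4) .* (a3 .* (1 - a3)); % back-propagate to layer 3
    delta2 = (Theta2' * delta3) .* (a2 .* (1 - a2)); % back-propagate to layer 2
    % no delta1: the input layer has no error term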
    6:30
    The name back propagation comes from the fact that we start by computing the delta term for the output layer, then we go back a layer and compute the delta terms for the third layer, and then we go back another step to compute delta(2). So we're sort of back propagating the errors from the output layer to layer 3 to layer 2, and hence the name back propagation.
    6:51
    Finally, the derivation is surprisingly complicated, surprisingly involved, but if you just do these few steps of computation, it is possible to prove, via a frankly somewhat complicated mathematical proof, that if you ignore regularization, then the partial derivative terms you want
    7:12
    are exactly given by the activations and these delta terms. This is ignoring lambda or alternatively the regularization
    7:23
    term lambda being equal to 0. We'll fix this detail later about the regularization term. But by performing back propagation and computing these delta terms, you can, you know, pretty quickly compute these partial derivative terms for all of your parameters. So this is a lot of detail. Let's take everything and put it all together to talk about how to implement back propagation
    7:46
    to compute derivatives with respect to your parameters.
    7:49
    And for the case when we have a large training set, not just a training set of one example, here's what we do. Suppose we have a training set of m examples, like that shown here. The first thing we're going to do is set these Delta l subscript i j to zero. So this triangular symbol, that's actually the capital Greek letter delta; the symbol we had on the previous slide was the lowercase delta, so the triangle is capital Delta. We're gonna set this equal to zero for all values of l, i, j. Eventually, this capital Delta l i j will be used
    8:26
    to compute the partial derivative terms, the partial derivative with respect to theta l i j of J of theta.
    8:39
    So as we'll see in a second, these deltas are going to be used as accumulators that will slowly add things in order to compute these partial derivatives.
    8:49
    Next, we're going to loop through our training set. So, we'll say for i equals 1 through m, and so for the ith iteration, we're going to be working with the training example (x(i), y(i)).
    9:00
    So the first thing we're going to do is set a(1), which is the activations of the input layer; we set that to be equal to x(i), the inputs for our ith training example. And then we're going to perform forward propagation to compute the activations for layer two, layer three and so on, up to the final layer, layer capital L. Next, we're going to use the output label y(i) from this specific example we're looking at to compute the error term delta(L) for the output layer. So delta(L) is what our hypothesis output minus what the target label was.
    9:41
    And then we're going to use the back propagation algorithm to compute delta(L-1), delta(L-2), and so on down to delta(2). And once again, there is no delta(1), because we don't associate an error term with the input layer.
    9:57
    And finally, we're going to use these capital delta terms to accumulate these partial derivative terms that we wrote down on the previous line.
    10:06
    And by the way, if you look at this expression, it's possible to vectorize this too. Concretely, if you think of capital Delta, indexed by subscripts i j, as a matrix,
    10:19
    then, if Delta(l) is a matrix, we can rewrite this as Delta(l) gets updated as Delta(l) plus
    10:27
    lowercase delta(l+1) times a(l) transpose. So that's a vectorized implementation of this that automatically does an update for all values of i and j. Finally, after executing the body of the for-loop, we then go outside the for-loop and compute the following. We compute capital D as follows, and we have two separate cases for j equals zero and j not equal to zero.
    10:56
    The case of j equals zero corresponds to the bias term, so when j equals zero, that's why we're missing the extra regularization term.
    11:05
    Finally, while the formal proof is pretty complicated, what you can show is that once you've computed these D terms, they are exactly the partial derivatives of the cost function with respect to each of your parameters, and so you can use those in either gradient descent or in one of the advanced optimization
    11:25
    algorithms.
    11:28
    So that's the back propagation algorithm and how you compute derivatives of your cost function for a neural network. I know this looked like a lot of details and a lot of steps strung together. But both in the programming assignment write-up and in a later video, we'll give you a summary of this, so that we can have all the pieces of the algorithm together and you know exactly what you need to implement if you want to implement back propagation to compute the derivatives of your neural network's cost function with respect to those parameters.

    Reading: Backpropagation Algorithm

    Backpropagation Algorithm

    "Backpropagation" is neural-network terminology for minimizing our cost function, just like what we were doing with gradient descent in logistic and linear regression. Our goal is to compute:

    (min_Theta J(Theta))

    That is, we want to minimize our cost function J using an optimal set of parameters in theta. In this section we'll look at the equations we use to compute the partial derivative of (J(Theta)):

    (dfrac{partial}{partial Theta_{i,j}^{(l)}}J(Theta))

    To do so, we use the following algorithm:

    Back propagation Algorithm

    Given training set (lbrace (x^{(1)}, y^{(1)}) cdots (x^{(m)}, y^{(m)}) rbrace)

    • Set (Delta^{(l)}_{i,j} := 0) for all (l,i,j), (hence you end up having a matrix full of zeros)

    For training example t =1 to m:

    1. Set (a^{(1)} := x^{(t)})
    2. Perform forward propagation to compute (a^{(l)}) for l=2,3,…,L

    3. Using (y^{(t)}), compute (delta^{(L)} = a^{(L)} - y^{(t)})
    Where L is our total number of layers and (a^{(L)}) is the vector of outputs of the activation units for the last layer. So our "error values" for the last layer are simply the differences of our actual results in the last layer and the correct outputs in y. To get the delta values of the layers before the last layer, we can use an equation that steps us back from right to left:

    4. Compute (delta^{(L-1)}, delta^{(L-2)},dots,delta^{(2)}) using (delta^{(l)} = ((Theta^{(l)})^T delta^{(l+1)}) .* a^{(l)} .* (1 - a^{(l)}))

    The delta values of layer l are calculated by multiplying the delta values in the next layer with the theta matrix of layer l. We then element-wise multiply that with a function called g', or g-prime, which is the derivative of the activation function g evaluated with the input values given by (z^{(l)}).

    The g-prime derivative terms can also be written out as:
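    (g'(z^{(l)}) = a^{(l)} .* (1 - a^{(l)}))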

    5. (Delta^{(l)}_{i,j} := Delta^{(l)}_{i,j} + a_j^{(l)} delta_i^{(l+1)}) or with vectorization, (Delta^{(l)} := Delta^{(l)} + delta^{(l+1)}(a^{(l)})^T)

    Hence we update our new (Delta) matrix.

    • (D^{(l)}_{i,j} := dfrac{1}{m}(Delta^{(l)}_{i,j} + lambda Theta^{(l)}_{i,j})), if j≠0.
    • (D^{(l)}_{i,j} := dfrac{1}{m}Delta^{(l)}_{i,j}), If j=0

    The capital-delta matrices act as "accumulators", adding up our values as we go along, and are eventually used to compute our partial derivatives. Thus we get (frac partial {partial Theta_{ij}^{(l)}} J(Theta) = D_{ij}^{(l)})
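    Putting the pieces together, here is a minimal Octave sketch of the whole procedure for a network with a single hidden layer. The names (nnCostFunction, X, Y, lambda) are illustrative rather than taken from the programming assignments, and X is assumed to hold one training example per row, with Y holding the matching one-hot label rows:

    function [J, Theta1_grad, Theta2_grad] = nnCostFunction(Theta1, Theta2, X, Y, lambda)
      % Sketch: cost and gradients via backpropagation for one hidden layer.
      sigmoid = @(z) 1 ./ (1 + exp(-z));
      m = size(X, 1);
      Delta1 = zeros(size(Theta1));        % gradient accumulators
      Delta2 = zeros(size(Theta2));
      J = 0;
      for t = 1:m
        % Forward propagation for example t
        a1 = [1; X(t, :)'];
        z2 = Theta1 * a1;   a2 = [1; sigmoid(z2)];
        z3 = Theta2 * a2;   a3 = sigmoid(z3);
        y  = Y(t, :)';
        % Accumulate the (unregularized) cost
        J = J + sum(-y .* log(a3) - (1 - y) .* log(1 - a3));
        % Backpropagation for example t
        delta3 = a3 - y;
        delta2 = (Theta2' * delta3) .* (a2 .* (1 - a2));
        delta2 = delta2(2:end);            % drop the bias-unit entry
        Delta2 = Delta2 + delta3 * a2';
        Delta1 = Delta1 + delta2 * a1';
      end
      J = J / m;
      % Regularization: skip the first column (the bias-unit weights)
      J = J + (lambda / (2*m)) * (sum(sum(Theta1(:, 2:end).^2)) + sum(sum(Theta2(:, 2:end).^2)));
      Theta1_grad = Delta1 / m;
      Theta2_grad = Delta2 / m;
      Theta1_grad(:, 2:end) += (lambda / m) * Theta1(:, 2:end);
      Theta2_grad(:, 2:end) += (lambda / m) * Theta2(:, 2:end);
    end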

    Video: Backpropagation Intuition

    In the previous video, we talked about the backpropagation algorithm. To a lot of people seeing it for the first time, their first impression is often that wow, this is a really complicated algorithm, and there are all these different steps, and I'm not sure how they fit together. And it's kinda this black box of all these complicated steps. In case that's how you're feeling about backpropagation, that's actually okay. Backpropagation, maybe unfortunately, is a less mathematically clean, or less mathematically simple, algorithm compared to linear regression or logistic regression. And I've actually used backpropagation, you know, pretty successfully for many years. And even today I still sometimes don't feel like I have a very good sense of just what it's doing, or intuition about what backpropagation is doing. For those of you that are doing the programming exercises, those will at least mechanically step you through the different steps of how to implement back prop, so you'll be able to get it to work for yourself. And what I want to do in this video is look a little bit more at the mechanical steps of backpropagation, and try to give you a little more intuition about what the mechanical steps of back prop are doing, to hopefully convince you that, you know, it's at least a reasonable algorithm.
    1:13
    In case, even after this video, back propagation still seems very black box, kind of like too many complicated steps and a little bit magical to you, that's actually okay. And even though I've used back prop for many years, sometimes this is a difficult algorithm to understand, but hopefully this video will help a little bit. In order to better understand backpropagation, let's take another closer look at what forward propagation is doing. Here's a neural network with two input units, that is, not counting the bias unit, and two hidden units in this layer, and two hidden units in the next layer. And then, finally, one output unit. Again, these counts two, two, two are not counting these bias units on top. In order to illustrate forward propagation, I'm going to draw this network a little bit differently.
    2:08
    And in particular, I'm going to draw this neural network with the nodes drawn as these very fat ellipses, so that I can write text in them. When performing forward propagation, we might have some particular example, say some example x(i) comma y(i), and it'll be this x(i) that we feed into the input layer. So x(i)1 and x(i)2 are the values we set the input layer to. And when we forward propagate to the first hidden layer here, what we do is compute z(2)1 and z(2)2. So these are the weighted sums of inputs of the input units. And then we apply the sigmoid of the logistic function, the sigmoid activation function, applied to the z values; those are the activation values. So that gives us a(2)1 and a(2)2. And then we forward propagate again to get z(3)1, apply the sigmoid of the logistic function, the activation function, to that to get a(3)1. And similarly, like so, until we get z(4)1. Apply the activation function. This gives us a(4)1, which is the final output value of the neural network.
    3:24
    Let's erase this arrow to give myself some more space. And if you look at what this computation really is doing, focusing on this hidden unit, let's say: we have this weight, shown in magenta there, that's my weight theta(2) 1 0 (the indexing is not important), and this weight here, which I'm highlighting in red, that is theta(2) 1 1, and this weight here, which I'm drawing in cyan, is theta(2) 1 2. So the way we compute this value z(3)1 is: z(3)1 is equal to this magenta weight times this value, so that's theta(2) 1 0 times 1, and then plus this red weight times this value, so that's theta(2) 1 1 times a(2)1, and finally this cyan weight times this value, which is therefore plus theta(2) 1 2 times a(2)2. And so that's forward propagation. And it turns out that, as we'll see later in this video, what backpropagation is doing is a process very similar to this, except that instead of the computations flowing from the left to the right of this network, the computations instead flow from the right to the left of the network, using a very similar computation as this; and I'll say in two slides exactly what I mean by that. To better understand what backpropagation is doing, let's look at the cost function. This is just the cost function that we had for when we have only one output unit. If we have more than one output unit, we just have a summation over the output units indexed by k there; if you have only one output unit, then this is the cost function. And we do forward propagation and backpropagation on one example at a time. So let's just focus on the single example x(i), y(i), and focus on the case of having one output unit, so y(i) here is just a real number. And let's ignore regularization, so lambda equals 0, and this final term, the regularization term, goes away. Now if you look inside the summation, you find that the cost term associated with the training example, that is, the cost associated with the training example x(i), y(i), is going to be given by this expression. So, the cost associated with example i is written as follows. And what this cost function does is play a role similar to the squared error. So, rather than looking at this complicated expression, if you want, you can think of cost(i) as being approximately the squared difference between what the neural network outputs versus what the actual value is. Just as in logistic regression, we actually prefer to use the slightly more complicated cost function using the log, but for the purpose of intuition, feel free to think of the cost function as being sort of the squared error cost function. And so this cost(i) measures how well the network is doing on correctly predicting example i, how close the output is to the actual observed label y(i).
    7:07
    More formally, and this is maybe only for those of you who are familiar with calculus: what the delta terms actually are is this. They're the partial derivatives with respect to z(l)j, that is, these weighted sums of inputs that we're computing, these z terms; partial derivatives of the cost function with respect to these things.
    7:27
    So concretely, the cost function is a function of the label y and of the value h of x, this output value of the neural network. And if we could go inside the neural network and just change those z(l)j values a little bit, then that would affect these values that the neural network is outputting, and that would end up changing the cost function. And again, really, this is only for those of you who are expert in calculus: if you're comfortable with partial derivatives, what these delta terms turn out to be is the partial derivative of the cost function with respect to these intermediate terms that we're computing.
    8:06
    And so they're a measure of how much we would like to change the neural network's weights, in order to affect these intermediate values of the computation, so as to affect the final output of the neural network h(x) and therefore affect the overall cost. In case this last part, this partial derivative intuition, doesn't make sense, don't worry about it; we can do the rest without really talking about partial derivatives. But let's look in more detail at what backpropagation is doing. For the output layer, if we're doing forward propagation and back propagation on this training example i, it first sets this delta term, delta(4)1, as a(4)1 minus y(i). So this is really the error, right? It's the difference between what was the value predicted and the actual value of y, and so we're gonna compute delta(4)1 like so. Next we're gonna propagate these values backwards, I'll explain this in a second, and end up computing the delta terms for the previous layer; we're gonna end up with delta(3)1 and delta(3)2. And then we're gonna propagate this further backward, and end up computing delta(2)1 and delta(2)2. Now the backpropagation calculation is a lot like running the forward propagation algorithm, but doing it backwards. So here's what I mean. Let's look at how we end up with this value of delta(2)2. So we have delta(2)2, and similar to forward propagation, let me label a couple of the weights. So this weight, which I'm going to draw in cyan, let's say that weight is theta(2)1 2, and this one down here, which I'll highlight in red, that is going to be, let's say, theta(2)2 2. So if we look at how delta(2)2 is computed for this node, it turns out that what we're going to do is take this value and multiply it by this weight, and add it to this value multiplied by that weight. So it's really a weighted sum of these delta values, weighted by the corresponding edge strengths. So concretely, let me fill this in: this delta(2)2 is going to be equal to theta(2)1 2, that's the cyan weight, times delta(3)1, plus, the thing I had in red, that's theta(2)2 2, times delta(3)2. So it's really literally this cyan weight times this value, plus this red weight times this value, and that's how we wind up with that value of delta. And just as another example, let's look at this value. How do we get that value? Well, it's a similar process. If this weight, which I'm gonna highlight in green, is equal to, say, theta(3)1 2, then we have that delta(3)2 is going to be equal to that green weight, theta(3)1 2, times delta(4)1. And by the way, so far I've been writing the delta values only for the hidden units, excluding the bias units. Depending on how you define the backpropagation algorithm, or depending on how you implement it, you may end up implementing something that computes delta values for these bias units as well. The bias units always output the value of plus one, and they are just what they are, and there's no way for us to change the value. And so, depending on your implementation of back prop, the way I usually implement it, I do end up computing these delta values, but we just discard them; we don't use them, because they don't end up being part of the calculation needed to compute a derivative. So hopefully that gives you a little better intuition about what back propagation is doing.
In case all of this still seems sort of magical, sort of black box, in a later video, the Putting It Together video, I'll try to give a little bit more intuition about what backpropagation is doing. But unfortunately this is a difficult algorithm to try to visualize and understand what it is really doing. Fortunately, many people have been using it very successfully for many years, and if you implement the algorithm, you can have a very effective learning algorithm, even though the inner workings of exactly how it works can be harder to visualize.

    Reading: Backpropagation Intuition

    Backpropagation Intuition
    Note: [4:39, the last term for the calculation for (z^3_1) (three-color handwritten formula) should be (a^2_2) instead of (a^2_1). 6:08 - the equation for cost(i) is incorrect. The first term is missing parentheses for the log() function, and the second term should be ((1-y^{(i)})log(1-h_theta(x^{(i)}))). 8:50 - (delta^{(4)} = y - a^{(4)}) is incorrect and should be (delta^{(4)} = a^{(4)} - y)]

    Recall that the cost function for a neural network is:

    If we consider simple non-multiclass classification (k = 1) and disregard regularization, the cost is computed with:
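    (cost(i) = -y^{(i)} log(h_Theta(x^{(i)})) - (1 - y^{(i)}) log(1 - h_Theta(x^{(i)})))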

    Intuitively, (delta_j^{(l)}) is the "error" for (a^{(l)}_j) (unit j in layer l). More formally, the delta values are actually the derivative of the cost function:
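    (delta_j^{(l)} = dfrac{partial}{partial z_j^{(l)}} cost(i))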

    Recall that our derivative is the slope of a line tangent to the cost function, so the steeper the slope the more incorrect we are. Let us consider the following neural network below and see how we could calculate some (delta_j^{(l)}):

    In the image above, to calculate (delta_2^{(2)}), we multiply the weights (Theta_{12}^{(2)}) and (Theta_{22}^{(2)}) by their respective (delta) values found to the right of each edge. So we get (delta_2^{(2)} = Theta_{12}^{(2)}*delta_1^{(3)}+Theta_{22}^{(2)}*delta_2^{(3)}). To calculate every single possible (delta_j^{(l)}), we could start from the right of our diagram. We can think of our edges as our (Theta_{ij}). Going from right to left, to calculate the value of (delta_j^{(l)}), you can just take the overall sum of each weight times the (delta) it is coming from. Hence, another example would be (delta_2^{(3)}=Theta_{12}^{(3)}*delta_1^{(4)}).

    Video: Implementation Note: Unrolling Parameters

    In the previous video, we talked about how to use back propagation
    0:03
    to compute the derivatives of your cost function. In this video, I want to quickly tell you about one implementational detail of unrolling your parameters from matrices into vectors, which we need in order to use the advanced optimization routines.
    0:20
    Concretely, let's say you've implemented a cost function that takes as input, you know, the parameters theta and returns the cost function value and returns the derivatives.
    0:30
    Then you can pass this to an advanced optimization algorithm like fminunc, and fminunc isn't the only one, by the way; there are also other advanced optimization algorithms.
    0:39
    But what all of them do is take as input a pointer to the cost function, and some initial value of theta.
    0:47
    And these routines assume that theta, and the initial value of theta, are parameter vectors, maybe in Rn or Rn plus 1, and they also assume that your cost function will return, as a second return value, this gradient, which is also in Rn or Rn plus 1, so also a vector. This worked fine when we were using logistic regression, but now that we're using a neural network, our parameters are no longer vectors; instead they are these matrices, where for a full neural network we would have parameter matrices theta 1, theta 2, theta 3 that we might represent in Octave as these matrices Theta1, Theta2, Theta3. And similarly for these gradient terms that we're expected to return: well, in the previous video we showed how to compute these gradient matrices, which were capital D1, capital D2, capital D3, which we might represent in Octave as matrices D1, D2, D3.
    1:48
    In this video I want to quickly tell you about the idea of how to take these matrices and unroll them into vectors, so that they end up in a format suitable for passing in as theta here, or for getting out as a gradient there.
    2:03
    Concretely, let's say we have a neural network with one input layer with ten units, a hidden layer with ten units and one output layer with just one unit. So s1 is the number of units in layer one, s2 is the number of units in layer two, and s3 is the number of units in layer three. In this case, the dimensions of your matrices theta and D are going to be given by these expressions. For example, theta one is going to be a 10 by 11 matrix and so on.
    2:34
    So if you want to convert between these matrices and vectors, what you can do is take your Theta1, Theta2, Theta3, and write this piece of code, and this will take all the elements of your three theta matrices, take all the elements of theta one, all the elements of theta 2, all the elements of theta 3, and unroll them and put all the elements into a big long vector,
    2:58
    Which is thetaVec and similarly
    3:00
    the second command would take all of your D matrices and unroll them into a big long vector and call them DVec. And finally if you want to go back from the vector representations to the matrix representations.
    3:14
    What you do to get back to theta one say is take thetaVec and pull out the first 110 elements. So theta 1 has 110 elements because it's a 10 by 11 matrix so that pulls out the first 110 elements and then you can use the reshape command to reshape those back into theta 1. And similarly, to get back theta 2 you pull out the next 110 elements and reshape it. And for theta 3, you pull out the final eleven elements and run reshape to get back the theta 3.
    3:48
    Here's a quick Octave demo of that process. So for this example let's set Theta1 equal to ones(10, 11), so it's a matrix of all ones. And just to make this easier to see, let's set Theta2 to be 2 times ones(10, 11), and let's set Theta3 equal to 3 times ones(1, 11). So these are 3 separate matrices: Theta1, Theta2, Theta3. We want to put all of these into a vector: thetaVec equals Theta1; Theta2
    4:28
    theta 3. Right, that's a colon in the middle and like so
    4:35
    and now thetavec is going to be a very long vector. That's 231 elements.
    4:42
    If I display it, I find that this very long vector with all the elements of the first matrix, all the elements of the second matrix, then all the elements of the third matrix.
    4:53
    And if I want to get back my original matrices, I can do reshape thetaVec.
    5:01
    Let's pull out the first 110 elements and reshape them to a 10 by 11 matrix.
    5:06
    This gives me back theta 1. And if I then pull out the next 110 elements. So that's indices 111 to 220. I get back all of my 2's.
    5:18
    And if I go
    5:20
    from 221 up to the last element, which is element 231, and reshape to 1 by 11, I get back theta 3.
    5:30
    To make this process really concrete, here's how we use the unrolling idea to implement our learning algorithm.
    5:38
    Let's say that you have some initial value of the parameters theta 1, theta 2, theta 3. What we're going to do is take these and unroll them into a long vector we're gonna call initial theta to pass in to fminunc as this initial setting of the parameters theta.
    5:56
    The other thing we need to do is implement the cost function.
    5:59
    Here's my implementation of the cost function.
    6:02
    The cost function is going to take as input thetaVec, which is going to be all of my parameters in the form of a vector that's been unrolled.
    6:11
    So the first thing I'm going to do is I'm going to use thetaVec and I'm going to use the reshape functions. So I'll pull out elements from thetaVec and use reshape to get back my original parameter matrices, theta 1, theta 2, theta 3. So these are going to be matrices that I'm going to get. So that gives me a more convenient form in which to use these matrices so that I can run forward propagation and back propagation to compute my derivatives, and to compute my cost function j of theta.
    6:39
    And finally, I can then take my derivatives and unroll them, keeping the elements in the same ordering as I did when I unrolled my thetas. So I'm gonna unroll D1, D2, D3, to get gradientVec, which is now what my cost function can return; it can return a vector of these derivatives.
    6:59
    So, hopefully, you now have a good sense of how to convert back and forth between the matrix representation of the parameters versus the vector representation of the parameters.
    7:09
    The advantage of the matrix representation is that when your parameters are stored as matrices it's more convenient when you're doing forward propagation and back propagation and it's easier when your parameters are stored as matrices to take advantage of the, sort of, vectorized implementations.
    7:26
    Whereas in contrast the advantage of the vector representation, when you have like thetaVec or DVec is that when you are using the advanced optimization algorithms. Those algorithms tend to assume that you have all of your parameters unrolled into a big long vector. And so with what we just went through, hopefully you can now quickly convert between the two as needed.

    Reading: Implementation Note: Unrolling Parameters

    Implementation Note: Unrolling Parameters

    With neural networks, we are working with sets of matrices:

    In order to use optimizing functions such as "fminunc()", we will want to "unroll" all the elements and put them into one long vector:

    thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
    deltaVector = [ D1(:); D2(:); D3(:) ]
    

    If the dimensions of Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11, then we can get back our original matrices from the "unrolled" versions as follows:

    Theta1 = reshape(thetaVector(1:110),10,11)
    Theta2 = reshape(thetaVector(111:220),10,11)
    Theta3 = reshape(thetaVector(221:231),1,11)
    

    To summarize:
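    As an Octave sketch of that summary (the helper forwardAndBackProp is hypothetical, standing in for the forward propagation, cost and back propagation computations described above; the 10x11, 10x11 and 1x11 dimensions are the same example as before):

    % costFunction.m  (illustrative sketch)
    function [jVal, gradientVec] = costFunction(thetaVec)
      Theta1 = reshape(thetaVec(1:110),   10, 11);    % recover the weight matrices
      Theta2 = reshape(thetaVec(111:220), 10, 11);
      Theta3 = reshape(thetaVec(221:231),  1, 11);
      % hypothetical helper: forward prop, cost and back prop
      [jVal, D1, D2, D3] = forwardAndBackProp(Theta1, Theta2, Theta3);
      gradientVec = [D1(:); D2(:); D3(:)];            % unroll the gradients
    end

    % training script  (illustrative sketch)
    initialTheta = [Theta1(:); Theta2(:); Theta3(:)]; % unroll the initial parameters
    options  = optimset('GradObj', 'on', 'MaxIter', 100);
    optTheta = fminunc(@costFunction, initialTheta, options);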

    Video: Gradient Checking

    In the last few videos we talked about how to do forward propagation and back propagation in a neural network in order to compute derivatives. But back prop as an algorithm has a lot of details and can be a little bit tricky to implement. And one unfortunate property is that there are many ways to have subtle bugs in back prop, so that if you run it with gradient descent or some other optimization algorithm, it could actually look like it's working. And your cost function J of theta may end up decreasing on every iteration of gradient descent. But this could be true even though there might be some bug in your implementation of back prop, so that it looks like J of theta is decreasing, but you might just wind up with a neural network that has a higher level of error than you would with a bug-free implementation. And you might just not know that there was this subtle bug that was giving you worse performance. So, what can we do about this? There's an idea called gradient checking that eliminates almost all of these problems. So, today, every time I implement back propagation or a similar gradient computation on a neural network or any other reasonably complex model, I always implement gradient checking. And if you do this, it will help you make sure, and sort of gain high confidence, that your implementation of forward prop and back prop or whatever is 100% correct. And from what I've seen, this pretty much eliminates all the problems associated with a buggy implementation of back prop. And in the previous videos, I asked you to take on faith that the formulas I gave for computing the deltas and the Ds and so on actually do compute the gradients of the cost function. But once you implement numerical gradient checking, which is the topic of this video, you'll be able to verify for yourself that the code you're writing is indeed computing the derivative of the cost function J.
    1:52
    So here's the idea. Consider the following example. Suppose that I have the function J of theta, and I have some value theta, and for this example I'm gonna assume that theta is just a real number. And let's say that I want to estimate the derivative of this function at this point, and so the derivative is equal to the slope of that tangent line.
    2:14
    Here's how I'm going to numerically approximate the derivative, or rather here's a procedure for numerically approximating the derivative. I'm going to compute theta plus epsilon, so now we move it to the right. And I'm gonna compute theta minus epsilon and I'm going to look at those two points, And connect them by a straight line
    2:43
    And I'm gonna connect these two points by a straight line, and I'm gonna use the slope of that little red line as my approximation to the derivative. Which is, the true derivative is the slope of that blue line over there. So, you know it seems like it would be a pretty good approximation.
    2:58
    Mathematically, the slope of this red line is this vertical height divided by this horizontal width. So this point on top is the J of (Theta plus Epsilon). This point here is J (Theta minus Epsilon), so this vertical difference is J (Theta plus Epsilon) minus J of theta minus epsilon and this horizontal distance is just 2 epsilon.
    3:23
    So my approximation is going to be that the derivative respect of theta of J of theta at this value of theta, that that's approximately J of theta plus epsilon minus J of theta minus epsilon over 2 epsilon.
    3:42
    Usually, I use a pretty small value for epsilon; I expect epsilon to be maybe on the order of 10 to the minus 4. There's usually a large range of different values for epsilon that work just fine. And in fact, if you let epsilon become really small, then mathematically this term here actually becomes the derivative; it becomes exactly the slope of the function at this point. It's just that we don't want to use an epsilon that's too, too small, because then you might run into numerical problems. So I usually use epsilon around ten to the minus four. And by the way, some of you may have seen an alternative formula for estimating the derivative, which is this formula.
    4:21
    This one on the right is called a one-sided difference, whereas the formula on the left, that's called a two-sided difference. The two sided difference gives us a slightly more accurate estimate, so I usually use that, rather than this one sided difference estimate.
    4:35
    So, concretely, what you implement in Octave is the following: you implement a call to compute gradApprox, which is going to be our approximation to the derivative, as just this formula: J of theta plus epsilon, minus J of theta minus epsilon, divided by 2 times epsilon. And this will give you a numerical estimate of the gradient at that point. And in this example it seems like it's a pretty good estimate.
    5:01
    Now on the previous slide, we considered the case when theta was a real number. Now let's look at the more general case of when theta is a vector of parameters, so let's say theta is in Rn, and it might be an unrolled version of the parameters of our neural network. So theta is a vector that has n elements, theta 1 up to theta n. We can then use a similar idea to approximate all of the partial derivative terms. Concretely, the partial derivative of the cost function with respect to the first parameter, theta 1, can be obtained by taking J and increasing theta 1, so you have J of theta 1 plus epsilon and so on, minus J of theta 1 minus epsilon, and dividing by 2 epsilon. The partial derivative with respect to the second parameter, theta 2, is again this thing, except that you take J with theta 2 increased by epsilon here, and theta 2 decreased by epsilon there, and so on down to the derivative with respect to theta n, which would have you increase and decrease theta n by epsilon over there.
    6:09
    So, these equations give you a way to numerically approximate the partial derivative of J with respect to any one of your parameters theta i.
    6:23
    Concretely, what you implement is therefore the following.
    6:27
    We implement the following in Octave to numerically compute the derivatives. We say, for i = 1:n, where n is the dimension of our parameter vector theta. And I usually do this with the unrolled version of the parameters, so theta is just a long list of all of my parameters in my neural network, say. I'm gonna set thetaPlus = theta, then increase thetaPlus of the ith element by epsilon. So thetaPlus is equal to theta, except for thetaPlus(i), which is now incremented by epsilon. So thetaPlus is equal to theta 1, theta 2 and so on, then theta i has epsilon added to it, and then we go down to theta n. And similarly these two lines set thetaMinus to something similar, except that instead of theta i plus epsilon, this now becomes theta i minus epsilon.
    7:20
    And then finally, you implement this gradApprox(i), and this would give you your approximation to the partial derivative with respect to theta i of J of theta.
    7:35
    And the way we use this in our neural network implementation is, we would implement this for loop to compute the partial derivative of the cost function with respect to every parameter in our network, and we can then take the gradient that we got from backprop. So DVec was the derivative we got from backprop; backpropagation was a relatively efficient way to compute the derivatives, or partial derivatives, of a cost function with respect to all of our parameters. And what I usually do is then take my numerically computed derivative, that is, this gradApprox that we just had from up here, and make sure that it is equal, or approximately equal up to small values of numerical round-off, that it's pretty close to the DVec that I got from backprop. And if these two ways of computing the derivative give me the same answer, or give me very similar answers, up to a few decimal places, then I'm much more confident that my implementation of backprop is correct. And when I plug these DVec vectors into gradient descent or some advanced optimization algorithm, I can then be much more confident that I'm computing the derivatives correctly, and therefore that hopefully my code will run correctly and do a good job optimizing J of theta.
    8:57
    Finally, I wanna put everything together and tell you how to implement this numerical gradient checking. Here's what I usually do. The first thing I do is implement back propagation to compute DVec. So that's the procedure we talked about in the earlier video to compute DVec, which will be our unrolled version of these matrices. Then what I do is implement numerical gradient checking to compute gradApprox. So this is what I described earlier in this video and on the previous slide.
    9:24
    Then I make sure that DVec and gradApprox give similar values, you know, let's say up to a few decimal places.
    9:32
    And finally and this is the important step, before you start to use your code for learning, for seriously training your network, it's important to turn off gradient checking and to no longer compute this gradApprox thing using the numerical derivative formulas that we talked about earlier in this video.
    9:50
    And the reason for that is the numerical gradient checking code, the stuff we talked about in this video, is a very computationally expensive, very slow way to try to approximate the derivative. Whereas, in contrast, the back propagation algorithm that we talked about earlier, that is, the thing we talked about earlier for computing D1, D2, D3, or DVec: backprop is a much more computationally efficient way of computing the derivatives.
    10:17
    So once you've verified that your implementation of back propagation is correct, you should turn off gradient checking and just stop using that. So just to reiterate, you should be sure to disable your gradient checking code before running your algorithm for many iterations of gradient descent, or for many iterations of the advanced optimization algorithms, in order to train your classifier. Concretely, if you were to run the numerical gradient checking on every single iteration of gradient descent, or if you were to run it in the inner loop of your costFunction, then your code would be very slow. Because the numerical gradient checking code is much slower than the backpropagation algorithm, than the backpropagation method where, you remember, we were computing delta(4), delta(3), delta(2), and so on. That was the backpropagation algorithm. That is a much faster way to compute derivatives than gradient checking. So when you're ready, once you've verified the implementation of back propagation is correct, make sure you turn off or disable your gradient checking code while you train your algorithm, or else your code could run very slowly.
    11:20
    So, that's how you take gradients numerically, and that's how you can verify that your implementation of back propagation is correct. Whenever I implement back propagation or a similar gradient computation for a complicated model, I always use gradient checking, and this really helps me make sure that my code is correct.

    Reading: Gradient Checking

    Gradient Checking

    Gradient checking will assure that our backpropagation works as intended. We can approximate the derivative of our cost function with:


    (dfrac{partial}{partialTheta}J(Theta) approx dfrac{J(Theta + epsilon) - J(Theta - epsilon)}{2epsilon})

    With multiple theta matrices, we can approximate the derivative with respect to (Theta_j) as follows:

    (dfrac{partial}{partialTheta_j}J(Theta) approx dfrac{J(Theta_1, dots, Theta_j + epsilon, dots, Theta_n) - J(Theta_1, dots, Theta_j - epsilon, dots, Theta_n)}{2epsilon})

    A small value for (epsilon) such as (epsilon = 10^{-4}) guarantees that the math works out properly. If the value for (epsilon) is too small, we can end up with numerical problems.

    Hence, we are only adding or subtracting epsilon to the (Theta_j) matrix. In octave we can do it as follows:

    epsilon = 1e-4;
    for i = 1:n,
      thetaPlus = theta;
      thetaPlus(i) += epsilon;
      thetaMinus = theta;
      thetaMinus(i) -= epsilon;
      gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon)
    end;
    

    We previously saw how to calculate the deltaVector. So once we compute our gradApprox vector, we can check that gradApprox ≈ deltaVector.
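    One simple way to make that comparison (an illustrative check, not prescribed by the course) is to look at the relative difference between the two vectors:

    relDiff = norm(gradApprox - deltaVector) / norm(gradApprox + deltaVector);
    % relDiff should be a very small number if backpropagation is correct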

    Once you have verified once that your backpropagation algorithm is correct, you don't need to compute gradApprox again. The code to compute gradApprox can be very slow.

    Video: Random Initialization

    In the previous videos, we've put together almost all the pieces you need in order to implement and train a neural network. There's just one last idea I need to share with you, which is the idea of random initialization.
    0:13
    When you're running an algorithm of gradient descent, or also the advanced optimization algorithms, we need to pick some initial value for the parameters theta. So for the advanced optimization algorithm, it assumes you will pass it some initial value for the parameters theta.
    0:29
    Now let's consider gradient descent. For that, we'll also need to initialize theta to something, and then we can slowly take steps to go downhill using gradient descent, to go downhill, to minimize the function J of theta. So what can we set the initial value of theta to? Is it possible to set the initial value of theta to the vector of all zeros? Whereas this worked okay when we were using logistic regression, initializing all of your parameters to zero actually does not work when you are training a neural network. Consider training the following neural network, and let's say we initialize all the parameters of the network to 0. And if you do that, then what that means is that at initialization, this blue weight, colored in blue, is gonna equal that weight, so they're both 0, and this weight that I'm coloring in red is equal to that weight, colored in red, and also this weight, which I'm coloring in green, is going to equal the value of that weight. And what that means is that both of your hidden units, a1 and a2, are going to be computing the same function of your inputs, and thus you end up with, for every one of your training examples, a(2)1 equals a(2)2.
    1:46
    And moreover, and I'm not going to show this in too much detail, but because these outgoing weights are the same, you can also show that the delta values are also gonna be the same. So concretely, you end up with delta(2)1 equals delta(2)2, and if you work through the math further, what you can show is that the partial derivatives with respect to your parameters will satisfy the following: looking at the partial derivatives of the cost function with respect to these two blue weights in your network, you find that these two partial derivatives are going to be equal to each other.
    2:31
    And so what this means is that even after, say, one gradient descent update, you're going to update this first blue weight with some learning rate times this, and you're gonna update the second blue weight with some learning rate times this. And what this means is that even after one gradient descent update, those two blue weights, those two blue-colored parameters, will end up the same as each other. So there'll be some nonzero value, but this value will equal that value. And similarly, even after one gradient descent update, this value will equal that value; there'll still be some nonzero values, just that the two red values are equal to each other. And similarly the two green weights: well, they'll both change values, but they'll both end up with the same value as each other. So after each update, the parameters corresponding to the inputs going into each of the two hidden units are identical. That's just saying that the two green weights are still the same, the two red weights are still the same, the two blue weights are still the same, and what that means is that even after one iteration of, say, gradient descent, you find that your two hidden units are still computing exactly the same functions of the inputs. You still have a(2)1 equals a(2)2, and so you're back to this case. And as you keep running gradient descent, the two blue weights will stay the same as each other, the two red weights will stay the same as each other, and the two green weights will stay the same as each other.
    3:56
    And what this means is that your neural network really can't compute very interesting functions. Imagine that you had not only two hidden units, but many, many hidden units. Then what this is saying is that all of your hidden units are computing the exact same feature; all of your hidden units are computing the exact same function of the input. And this is a highly redundant representation, because the final logistic regression unit really only gets to see one feature, since all of these are the same. And this prevents your neural network from learning something interesting.
    4:31
    In order to get around this problem, the way we initialize the parameters of a neural network therefore is with random initialization.
    4:41
    Concretely, the problem we saw on the previous slide is something called the problem of symmetric weights, that is, the weights all being the same. So this random initialization is how we perform symmetry breaking. What we do is initialize each value of theta to a random number between minus epsilon and epsilon. So this is notation for numbers between minus epsilon and plus epsilon. So my weights, my parameters, are all going to be randomly initialized between minus epsilon and plus epsilon. The way I write code to do this in Octave is I've said Theta1 should be equal to this. So this rand(10, 11), that's how you compute a random 10 by 11 dimensional matrix; all the values are between 0 and 1, so these are going to be random numbers that take on any continuous value between 0 and 1. And so if you take a number between zero and one, multiply it by two times INIT_EPSILON and then subtract INIT_EPSILON, you end up with a number that's between minus epsilon and plus epsilon.
    5:45
    And by the way, this epsilon here has nothing to do with the epsilon that we were using when we were doing gradient checking. When doing numerical gradient checking, there we were adding some small value epsilon to theta; this is an unrelated value of epsilon. We just write INIT_EPSILON to distinguish it from the value of epsilon we were using in gradient checking. And similarly, if you want to initialize Theta2 to a random 1 by 11 matrix, you can do so using this piece of code here.
    6:16
    So to summarize, to train a neural network, what you should do is randomly initialize the weights to small values close to zero, between minus epsilon and plus epsilon, say, and then implement back propagation, do gradient checking, and use either gradient descent or one of the advanced optimization algorithms to try to minimize J(theta) as a function of the parameters theta, starting from this randomly chosen initial value for the parameters. And by doing symmetry breaking, which is this process, hopefully gradient descent or the advanced optimization algorithms will be able to find a good value of theta.

    Reading: Random Initialization

    Random Initialization

    Initializing all theta weights to zero does not work with neural networks. When we backpropagate, all nodes will update to the same value repeatedly. Instead we can randomly initialize our weights for our (Theta) matrices using the following method:
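    (Theta^{(l)}_{ij} := 2 epsilon * rand - epsilon), where rand is a uniformly random number between 0 and 1.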

    Hence, we initialize each (Theta^{(l)}_{ij}) to a random value between ([-epsilon,epsilon]). Using the above formula guarantees that we get the desired bound. The same procedure applies to all the (Theta)'s. Below is some working code you could use to experiment.

    If the dimensions of Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11.
    
    Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
    Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
    Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
    

    rand(x,y) is just a function in octave that will initialize a matrix of random real numbers between 0 and 1.

    (Note: the epsilon used above is unrelated to the epsilon from Gradient Checking)

    Video: Putting It Together

    So, it's taken us a lot of videos to get through the neural network learning algorithm.
    0:05
    In this video, what I'd like to do is try to put all the pieces together, to give a overall summary or a bigger picture view, of how all the pieces fit together and of the overall process of how to implement a neural network learning algorithm.
    0:21
    When training a neural network, the first thing you need to do is pick some network architecture, and by architecture I just mean the connectivity pattern between the neurons. So we might choose between, say, a neural network with three input units, five hidden units and four output units, versus one with 3 input units, two hidden layers of 5 units each and 4 output units, versus one with 3 input units, then 5, 5, 5 units in each of three hidden layers, and four output units. So these choices of how many hidden units in each layer and how many hidden layers, those are architecture choices. So, how do you make these choices?
    0:59
    Well, first, the number of input units, well, that's pretty well defined. Once you decide on the fixed set of features x, the number of input units will just be the dimension of your features x(i); it is determined by that. And if you are doing multiclass classification, the number of output units will be determined by the number of classes in your classification problem. And just a reminder, if you have a multiclass classification problem where y takes on, say, values between
    1:30
    1 and 10, so that you have ten possible classes.
    1:34
    Then remember to rewrite your output y as one of these vectors. So instead of writing class one as the value 1, you recode it as a vector like that, and for the second class you recode it as a vector like that. So if one of your examples takes on the fifth class, that is y equals 5, then what you're showing to your neural network is not actually the value y equals 5; instead, here at the output layer, which would have ten output units, you will instead feed the vector
    2:07
    with one in the fifth position and a bunch of zeros down here. So the choice of number of input units and number of output units is maybe somewhat reasonably straightforward.
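    As a concrete sketch of this recoding (assuming the labels are stored as an m-by-1 vector y with values in 1..10), one common Octave idiom builds the 0/1 vectors by indexing into an identity matrix:

    % Sketch: recode integer labels 1..10 as 10-dimensional 0/1 vectors.
    num_labels = 10;
    y = [5; 1; 10];                     % a few example labels (hypothetical)
    Y = eye(num_labels)(y, :);          % row i of Y is the 0/1 vector for y(i)
    % e.g. Y(1,:) has a 1 in the fifth position and zeros everywhere else.
    % (In MATLAB, index in two steps instead: I = eye(num_labels); Y = I(y, :);)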
    2:18
    And as for the number of hidden units and the number of hidden layers, a reasonable default is to use a single hidden layer and so this type of neural network shown on the left with just one hidden layer is probably the most common.
    2:34
    Or if you use more than one hidden layer, again the reasonable default will be to have the same number of hidden units in every single layer. So here we have two hidden layers, and each of these hidden layers has the same number, five, of hidden units; and here we have three hidden layers, and each of them also has five hidden units.
    2:57
    That said, the sort of network architecture shown on the left would be a perfectly reasonable default.
    3:04
    And as for the number of hidden units - usually, the more hidden units the better; it's just that if you have a lot of hidden units, it can become more computationally expensive, but very often, having more hidden units is a good thing.
    3:17
    And usually the number of hidden units in each layer will be maybe comparable to the dimension of x, comparable to the number of features. It could be anywhere from the same number of hidden units as input features up to maybe three or four times that. So having a number of hidden units that is comparable to, or somewhat bigger than, the number of input features is often a useful thing to do. Hopefully this gives you one reasonable set of default choices for neural network architecture, and if you follow these guidelines you will probably get something that works well. But in a later set of videos, where I will talk specifically about advice for how to apply algorithms, I will actually say a lot more about how to choose a neural network architecture, and how to make good choices for the number of hidden units, the number of hidden layers, and so on.
    4:10
    Next, here's what we need to implement in order to train a neural network. There are actually six steps; I have four on this slide and two more steps on the next slide. The first step is to set up the neural network and to randomly initialize the values of the weights. And we usually initialize the weights to small values near zero.
    4:31
    Then we implement forward propagation so that we can input any x into the neural network and compute h of x, which is this output vector of the y values.
    4:44
    We then also implement code to compute this cost function j of theta.
    4:49
    And next we implement back-prop, or the back-propagation
    4:54
    algorithm, to compute these partial derivative terms, the partial derivatives of j of theta with respect to the parameters. Concretely, to implement back prop, usually we will do that with a for-loop over the training examples.
    5:09
    Some of you may have heard of advanced, frankly very advanced, vectorization methods where you don't have a for-loop over the m training examples, but the first time you're implementing back prop there should almost certainly be a for-loop in your code, where you're iterating over the examples. So you do forward prop and back prop on the first example (x(1), y(1)), and then in the second iteration of the for-loop you do forward propagation and back propagation on the second example, and so on, until you get through the final example. So there should be a for-loop in your implementation of back prop, at least the first time you're implementing it. There are, frankly, somewhat complicated ways to do this without a for-loop, but I definitely do not recommend trying to do that much more complicated version the first time you try to implement back prop.
    5:59
    So concretely, we have a for-loop over my m training examples
    6:03
    and inside the for-loop we're going to perform forward prop and back prop using just this one example.
    6:09
    And what that means is that we're going to take x(i), and feed that to my input layer, perform forward-prop, perform back-prop
    6:17
    and that will give us all of these activations and all of these delta terms for all of the layers and all of the units in the neural network. Then, still inside this for-loop, let me draw some curly braces just to show the scope of the for-loop; this is in
    6:34
    Octave code of course, though it reads more like pseudocode, and the for-loop encompasses all of this. We're going to compute those delta accumulation terms, using the formula that we gave earlier.
    6:45
    That is, inside the loop we update Delta(l) := Delta(l) plus delta(l+1) times
    6:48
    a(l) transpose. And then finally, outside the for-loop, having computed these accumulation terms, we would then have some other code that allows us to compute the partial derivative terms. And these partial derivative terms have to take into account the regularization term lambda as well; those formulas were given in the earlier video.
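    For reference, the formulas being alluded to here, in the notation of the readings, are the accumulation step (Delta^{(l)} := Delta^{(l)} + delta^{(l+1)} (a^{(l)})^T) performed inside the loop, and then, after the loop, (D^{(l)}_{ij} = (1/m) Delta^{(l)}_{ij} + lambda Theta^{(l)}_{ij}) for (j geq 1) and (D^{(l)}_{ij} = (1/m) Delta^{(l)}_{ij}) for (j = 0), where (D^{(l)}_{ij}) is the partial derivative of (J(Theta)) with respect to (Theta^{(l)}_{ij}). (Depending on the convention used in the earlier videos, the regularization term may instead be scaled as (lambda/m); the key point is that it is added only to the non-bias columns.)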
    7:14
    So, having done that, you now hopefully have code to compute these partial derivative terms.
    7:21
    Next is step five: what I do then is use gradient checking to compare the partial derivative terms that were computed. That is, I compare the versions computed using back propagation versus the partial derivatives computed using numerical
    7:37
    estimates of the derivatives. So, I do gradient checking to make sure that both of these give very similar values.
    7:45
    Having done gradient checking reassures us that our implementation of back propagation is correct. It is then very important that we disable gradient checking, because the gradient checking code is computationally very slow.
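    As a reminder of what that comparison looks like in code, here is a minimal, self-contained Octave sketch (using a simple quadratic cost as a stand-in for the neural network cost; in practice theta would be the unrolled vector of all the weights and costFunction would be your neural network cost routine returning [J, grad]):

    % Sketch: gradient checking with a two-sided numerical estimate.
    costFunction = @(t) deal(0.5 * sum(t .^ 2), t);   % stand-in cost; returns [J, grad]
    theta = [0.3; -1.2; 0.8];                         % stand-in parameter vector

    EPSILON = 1e-4;
    numgrad = zeros(size(theta));
    for i = 1:numel(theta)
      perturb = zeros(size(theta));
      perturb(i) = EPSILON;
      numgrad(i) = (costFunction(theta + perturb) - costFunction(theta - perturb)) ...
                   / (2 * EPSILON);
    end

    [J, grad] = costFunction(theta);                  % analytic gradient (backprop in the real setting)
    disp(norm(numgrad - grad) / norm(numgrad + grad)) % should be very small (around 1e-9 in practice)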
    7:59
    And finally, we then use an optimization algorithm such as gradient descent, or one of the advanced optimization methods such as L-BFGS or conjugate gradient, as embodied in fminunc or other optimization routines. We use these together with back propagation; back propagation is the thing that computes these partial derivatives for us.
    8:21
    And so, we know how to compute the cost function, and we know how to compute the partial derivatives using back propagation, so we can use one of these optimization methods to try to minimize j of theta as a function of the parameters theta. And by the way, for neural networks, this cost function j of theta is non-convex, and so it can theoretically be susceptible to local minima; in fact, algorithms like gradient descent and the advanced optimization methods can, in theory, get stuck in local
    8:55
    optima. But it turns out that in practice this is not usually a huge problem, and even though we can't guarantee that these algorithms will find a global optimum, usually algorithms like gradient descent will do a very good job minimizing this cost function j of theta and will get to a very good local minimum, even if it doesn't reach the global optimum. Finally, gradient descent for a neural network might still seem a little bit magical. So, let me just show one more figure to try to give some intuition about what gradient descent for a neural network is doing.
    9:27
    This is actually similar to the figure that I was using earlier to explain gradient descent. So, we have some cost function, and we have a number of parameters in our neural network. Here I've just written down two of the parameter values. In reality, of course, in the neural network we can have lots of parameters: theta one, theta two, all of these are matrices, right? So we can have very high dimensional parameters, but because of the limitations of the sorts of plots we can draw, I'm pretending that we have only two parameters in this neural network, although obviously we have a lot more in practice.
    9:59
    Now, this cost function j of theta measures how well the neural network fits the training data.
    10:06
    So, if you take a point like this one, down here,
    10:10
    that's a point where j of theta is pretty low, and so this corresponds to a setting of the parameters. There's a setting of the parameters theta, where, you know, for most of the training examples, the output
    10:24
    of my hypothesis may be pretty close to y(i), and if this is true then that's what causes my cost function to be pretty low.
    10:32
    Whereas in contrast, if you were to take a value like that, a point like that corresponds to where, for many training examples, the output of my neural network is far from the actual value y(i) that was observed in the training set. So points like this correspond to where the hypothesis, where the neural network, is outputting values on the training set that are far from y(i); it's not fitting the training set well. Whereas points like this, with low values of the cost function, correspond to where j of theta is low, and therefore correspond to where the neural network happens to be fitting my training set well, because this is what needs to be true in order for j of theta to be small.
    11:15
    So what gradient descent does is we'll start from some random initial point like that one over there, and it will repeatedly go downhill.
    11:24
    And so what back propagation is doing is computing the direction of the gradient, and what gradient descent is doing is it's taking little steps downhill until hopefully it gets to, in this case, a pretty good local optimum.
    11:37
    So, when you implement back propagation and use gradient descent or one of the advanced optimization methods, this picture sort of explains what the algorithm is doing. It's trying to find a value of the parameters where the output values in the neural network closely matches the values of the y(i)'s observed in your training set. So, hopefully this gives you a better sense of how the many different pieces of neural network learning fit together.
    12:07
    In case, even after this video, you still feel like there are a lot of different pieces and it's not entirely clear what some of them do or how all of these pieces come together, that's actually okay.
    12:18
    Neural network learning and back propagation is a complicated algorithm.
    12:23
    And even though I've seen the math behind back propagation for many years, and I've used back propagation, I think, very successfully for many years, even today I still feel like I don't always have a great grasp of exactly what back propagation is doing, or what the optimization process of minimizing J(theta) looks like. This is a much harder algorithm to feel like I have a good handle on, compared to, say, linear regression or logistic regression,
    12:51
    which were mathematically and conceptually much simpler and much cleaner algorithms.
    12:56
    So in case you feel the same way, that's actually perfectly okay. But if you do implement back propagation, hopefully what you'll find is that this is one of the most powerful learning algorithms: if you implement this algorithm, implement back propagation, and implement one of these optimization methods, you'll find that back propagation is able to fit very complex, powerful, non-linear functions to your data, and this is one of the most effective learning algorithms we have today.

    Reading: Putting It Together

    Putting it Together
    First, pick a network architecture; choose the layout of your neural network, including how many hidden units in each layer and how many layers in total you want to have.

    • Number of input units = dimension of features (x^{(i)})
    • Number of output units = number of classes
    • Number of hidden units per layer = usually more the better (must balance with cost of computation as it increases with more hidden units)
    • Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.

    Training a Neural Network

    1. Randomly initialize the weights
    2. Implement forward propagation to get (h_Theta(x^{(i)})) for any (x^{(i)})
    3. Implement the cost function
    4. Implement backpropagation to compute partial derivatives
    5. Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
    6. Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.
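    For step 6, here is a sketch of handing the cost and backprop gradients to a built-in optimizer in Octave (fminunc with the gradient flag enabled). Note that nnCostFunction, initial_nn_params, X, y, and lambda are assumed names: the cost routine is assumed to return [J, grad] for an unrolled parameter vector, and initial_nn_params holds the unrolled, randomly initialized weights.

    % Sketch for step 6: minimize the cost with a built-in optimizer.
    options  = optimset('GradObj', 'on', 'MaxIter', 100);
    costFunc = @(p) nnCostFunction(p, X, y, lambda);   % hypothetical signature
    [nn_params, cost] = fminunc(costFunc, initial_nn_params, options);
    % nn_params can then be reshaped back into the individual Theta matrices.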

    When we perform forward and back propagation, we loop on every training example:

    for i = 1:m,
       Perform forward propagation and backpropagation using example (x(i), y(i))
       (Get activations a(l) and delta terms d(l) for l = 2,...,L)
    end;
    
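    In Octave-like code, that loop might look like the sketch below (assuming a single hidden layer, an m-by-n input matrix X, an m-by-K matrix Y of recoded labels, and randomly initialized Theta1 and Theta2; this is a sketch, not the course's starter code):

    % Sketch: accumulate backpropagation gradients with a loop over the examples.
    sigmoid = @(z) 1 ./ (1 + exp(-z));
    sigmoidGradient = @(z) sigmoid(z) .* (1 - sigmoid(z));

    Delta1 = zeros(size(Theta1));
    Delta2 = zeros(size(Theta2));
    for i = 1:m
      % Forward propagation for example i
      a1 = [1; X(i, :)'];               % input activation, with bias unit
      z2 = Theta1 * a1;
      a2 = [1; sigmoid(z2)];            % hidden activation, with bias unit
      a3 = sigmoid(Theta2 * a2);        % output h_Theta(x^(i))

      % Backpropagation for example i
      delta3 = a3 - Y(i, :)';           % output-layer error
      d2 = Theta2' * delta3;
      delta2 = d2(2:end) .* sigmoidGradient(z2);   % drop the bias row

      % Accumulate the gradient contributions
      Delta1 = Delta1 + delta2 * a1';
      Delta2 = Delta2 + delta3 * a2';
    end
    Theta1_grad = Delta1 / m;           % plus the regularization term on non-bias columns
    Theta2_grad = Delta2 / m;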

    The following image gives us an intuition of what is happening as we are implementing our neural network:

    Ideally, you want (h_Theta(x^{(i)}) approx y^{(i)}). This will minimize our cost function. However, keep in mind that (J(Theta)) is not convex and thus we can end up in a local minimum instead.

    Video: Autonomous Driving

    In this video, I'd like to show you a fun and historically important example of neural network learning: using a neural network for autonomous driving, that is, getting a car to learn to drive itself.
    0:14
    The video that I'll show in a minute is something that I got from Dean Pomerleau, a colleague who works at Carnegie Mellon University, on the east coast of the United States. In part of the video you'll see visualizations like this, and I want to tell you what the visualization looks like before starting the video.
    0:32
    Down here on the lower left is the view seen by the car of what's in front of it. And so here you kinda see a road that's maybe going a bit to the left, and then going a little bit to the right.
    0:44
    And up here on top, this first horizontal bar shows the direction selected by the human driver. The location of the bright white band shows the steering direction selected by the human driver, where far to the left corresponds to steering hard left and far to the right corresponds to steering hard right. So this location, which is a little bit left of center, means that the human driver at this point was steering slightly to the left. This second bar here corresponds to the steering direction selected by the learning algorithm, and again the location of the white band means that the neural network was here selecting a steering direction that's slightly to the left. In fact, before the neural network starts learning, you see that the network outputs a uniform grey band throughout this region; that uniform grey fuzz corresponds to the neural network having been randomly initialized, and initially having no idea how to drive the car, or of what direction to steer in. It is only after it has learned for a while that it starts to output a solid white band in just a small part of the region, corresponding to choosing a particular steering direction. That corresponds to the neural network becoming more confident: rather than outputting a light grey fuzz, it outputs a white band that more confidently selects one steering direction. >> ALVINN is a system of artificial neural networks that learns to steer by watching a person drive. ALVINN is designed to control the NAVLAB 2, a modified Army Humvee fitted with sensors, computers, and actuators for autonomous navigation experiments.
    2:40
    The initial step in configuring ALVINN is creating a network just here.
    2:46
    During training, a person drives the vehicle while ALVINN watches.
    2:55
    Once every two seconds, ALVINN digitizes a video image of the road ahead, and records the person's steering direction.
    3:11
    This training image is reduced in resolution to 30 by 32 pixels and provided as input to ALVINN's three-layer network. Using the back propagation learning algorithm, ALVINN is trained to output the same steering direction as the human driver for that image.
    3:33
    Initially the network steering response is random.
    3:43
    After about two minutes of training the network learns to accurately imitate the steering reactions of the human driver.
    4:02
    This same training procedure is repeated for other road types.
    4:09
    After the networks have been trained the operator pushes the run switch and ALVINN begins driving.
    4:20
    Twelve times per second, ALVINN digitizes the image and feeds it to its neural networks.
    4:33
    Each network, running in parallel, produces a steering direction and a measure of its confidence in its response.
    4:48
    The steering direction from the most confident network, in this case the network trained for the one-lane road, is used to control the vehicle.
    5:07
    Suddenly an intersection appears ahead of the vehicle.
    5:22
    As the vehicle approaches the intersection, the confidence of the one-lane network decreases.
    5:37
    As it crosses the intersection and the two lane road ahead comes into view, the confidence of the two lane network rises.
    5:51
    When its confidence rises, the two-lane network is selected to steer, safely guiding the vehicle into its lane on the two-lane road.
    6:05
    So that was autonomous driving using a neural network. Of course, there have since been more modern attempts at autonomous driving: there are a few projects in the US and Europe and so on that have produced more robust driving controllers than this. But I think it's still pretty remarkable and pretty amazing how a simple neural network trained with backpropagation can actually learn to drive a car somewhat well.

    Reading: Lecture Slides

    Lecture9.pdf

    Programming: Neural Network Learning

    Download the programming assignment here. This ZIP file contains the instructions in a PDF and the starter code. You may use either MATLAB or Octave (>= 3.8.0).
