Week 1:(2)Linear Regression with One Variable

Catalog:

1.Linear Regression Model Part

2.Cost Function

2.1 Cost function formula

2.2 Cost function intuition

2.3 Visualizing the cost function

2.4 Visualization examples

3.Gradient Descent

3.1 Gradient descent

3.2 Implementing gradient descent

3.3 Gradient descent intuition

4.Learning rate

5.Gradient descent for linear regression

6.Running gradient descent

7.Summary


1.Linear Regression Model Part

->We'll look at what the overall process of supervised learning is like. Specifically, you'll see the first model of this course, the linear regression model. That just means fitting a straight line to your data. It's probably the most widely used learning algorithm in the world today. As you get familiar with linear regression, many of the concepts you see here will also apply to other machine learning models that you'll see later in this specialization.

【Eg one — Linear Regression:visualizing this data as a plot】

8d99d030812348329535cabb8906b55a.png

->Let's start with a problem that you can address using linear regression. Say you want to predict the price of a house based on its size. This is the example we saw earlier this week. We're going to use a dataset of house sizes and prices from Portland, a city in the United States. Here we have a graph where the horizontal axis is the size of the house in square feet and the vertical axis is the price of the house in thousands of dollars. Let's plot the data points for the houses in the dataset. Each data point, each of these little crosses, is a house with its size and the price it most recently sold for.

->Now, let's say you're a real estate agent in Portland and you're helping a client sell her house. She asks you, how much do you think I can get for this house? This dataset might help you estimate the price she could get for it. You start by measuring the size of the house, and it turns out to be 1250 square feet. How much do you think this house could sell for? One thing you could do is build a linear regression model from this dataset. Your model will fit a straight line to the data, which might look like this. Based on this straight-line fit, you can see that a house of 1250 square feet intersects the best-fit line over here, and if you trace that to the vertical axis on the left, you can see the price is around here, say about $220,000.

->This is an example of what's called a supervised learning model. We call it supervised learning because you first train the model by giving it data that has the right answers: you give the model examples of houses with both the size of the house and the price that the model should predict for each house. Here the prices, that is, the right answers, are given for every house in the dataset. This linear regression model is a particular type of supervised learning model. It's called a regression model because it predicts numbers as the output, like prices in dollars. Any supervised learning model that predicts a number such as 220,000 or 1.5 or negative 33.2 is addressing what's called a regression problem. Linear regression is one example of a regression model, but there are other models for addressing regression problems too. We'll see some of those later in Course 2 of this specialization.

->Just to remind you, in contrast with the regression model, the other most common type of supervised learning model is called a classification model. A classification model predicts categories, or discrete classes, such as predicting whether a picture is of a cat or a dog, or, given a medical record, predicting whether a patient has a particular disease. You'll see more about classification models later in this course as well. As a reminder about the difference between classification and regression: in classification, there are only a small number of possible outputs. If your model is recognizing cats versus dogs, that's two possible outputs. Or maybe you're trying to recognize any of 10 possible medical conditions in a patient, so there's a discrete, finite set of possible outputs. We call that a classification problem, whereas in regression, there are infinitely many possible numbers that the model could output.

【Eg one — Linear Regression:visualizing this data as a data table 】
391a168648814c08bae2b650a72cfadf.png

->In addition to visualizing this data as a plot here on the left, there's one other useful way of looking at the data, and that's a data table here on the right. The data comprises a set of inputs. This is the size of the house, which is this column here. It also has outputs. You're trying to predict the price, which is this column here. Notice that the horizontal and vertical axes correspond to these two columns, the size and the price. If you have, say, 47 rows in this data table, then there are 47 of these little crosses on the plot on the left, each cross corresponding to one row of the table. For example, the first row of the table is a house of size 2,104 square feet, so that's around here, and this house sold for $400,000, which is around here. This first row of the table is plotted as this data point over here.

【notation for describing the data】
c99e41a289844a4ab301a0991110eeb9.png
->Now, let's look at some notation for describing the data. This is notation that you'll find useful throughout your journey in machine learning. As you get increasingly familiar with machine learning terminology, this is terminology you can use to talk about machine learning concepts with others as well, since a lot of it is quite standard across AI. You'll be seeing this notation multiple times in this specialization, so it's okay if you don't remember everything the first time through. It will naturally become more familiar over time.
->The dataset that you just saw, and that is used to train the model, is called a training set. Note that your client's house is not in this dataset because it's not yet sold, so no one knows what its price is. To predict the price of your client's house, you first train your model to learn from the training set, and that model can then predict your client's house's price.
->In machine learning, the standard notation to denote the input is lowercase x. We call this the input variable, and it is also called a feature or an input feature. For example, for the first house in your training set, x is the size of the house, so x equals 2,104. The standard notation to denote the output variable you're trying to predict, which is also sometimes called the target variable, is lowercase y. Here, y is the price of the house, and for the first training example this is equal to 400, so y equals 400. The dataset has one row for each house, and in this training set there are 47 rows, with each row representing a different training example. We're going to use lowercase m to refer to the total number of training examples, so here m is equal to 47.
->To indicate a single training example, we're going to use the notation (x, y). For the first training example, this pair of numbers is (2104, 400). Now, we have a lot of different training examples, 47 of them in fact. To refer to a specific training example, which corresponds to a specific row in this table on the left, I'm going to use the notation (x^(i), y^(i)), that is, x superscript i in parentheses, y superscript i in parentheses. The superscript tells us that this is the i-th training example, such as the first, second, or third, up to the 47th training example. Here, i refers to a specific row in the table. For instance, here is the first example, when i equals 1 in the training set, so x^(1) is equal to 2104 and y^(1) is equal to 400, and let's add this superscript 1 here as well.
->Just to note, this superscript i in parentheses is not exponentiation. When I write x^(2), this is not x squared. This is not x to the power 2. It just refers to the second training example. This i is just an index into the training set and refers to row i in the table. In the next video, let's look at what it takes to take this training set that you just saw and feed it to a learning algorithm so that the algorithm can learn from this data.
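The notation above maps directly onto how a training set is usually held in code. The following is a minimal sketch, not code from the course: only the first row (2104, 400) comes from the notes, the other two rows are made-up placeholders, and the helper name `training_example` is hypothetical.

```python
# Training set as two parallel lists: x holds inputs (features), y holds targets.
# Only the first row (2104, 400) comes from the notes; the remaining rows are
# made-up placeholders so that the list has more than one entry.
x_train = [2104, 1500, 900]   # house size in square feet
y_train = [400, 300, 180]     # price in thousands of dollars

m = len(x_train)  # m = number of training examples


def training_example(i):
    """Return (x^(i), y^(i)) using the 1-based index i from the notes."""
    return x_train[i - 1], y_train[i - 1]


first = training_example(1)
print(f"m = {m}")
print(f"(x^(1), y^(1)) = {first}")
```

Note that Python lists are 0-indexed, so the i-th training example in the notes lives at position `i - 1` in the list. The superscript is an index, not a power.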

【process of how supervised learning works】
dffe09f9c5db41f5a288c47a594c104d.png
->Let's look at the process of how supervised learning works. A supervised learning algorithm takes a dataset as input, but what exactly does it do, and what does it output? Recall that a training set in supervised learning includes both the input features, such as the size of the house, and the output targets, such as the price of the house. The output targets are the right answers that the model will learn from. To train the model, you feed the training set, both the input features and the output targets, to your learning algorithm. Then your supervised learning algorithm will produce some function. We'll write this function as lowercase f, where f stands for function. Historically, this function used to be called a hypothesis, but I'm just going to call it a function f in this class.
->The job of f is to take a new input x and output an estimate or a prediction, which I'm going to call y-hat, and it's written like the variable y with this little hat symbol on top. In machine learning, the convention is that y-hat is the estimate or the prediction for y. The function f is called the model, x is called the input or the input feature, and the output of the model is the prediction, y-hat. The model's prediction is the estimated value of y. When the symbol is just the letter y, it refers to the target, which is the actual true value in the training set. In contrast, y-hat is an estimate. It may or may not be the actual true value. If you're helping your client sell the house, the true price of the house is unknown until they sell it. Your model f, given the size, outputs the estimate, that is, the prediction of what the true price will be.
->Now, when we design a learning algorithm, a key question is, how are we going to represent the function f? In other words, what is the math formula we're going to use to compute f? For now, let's stick with f being a straight line. Your function can be written as f_w,b(x) = w times x plus b. I'll define w and b soon. For now, just know that w and b are numbers, and the values chosen for w and b will determine the prediction y-hat based on the input feature x. This f_w,b(x) means f is a function that takes x as input and, depending on the values of w and b, outputs some value of a prediction y-hat. As an alternative to writing f_w,b(x), I'll sometimes just write f(x) without explicitly including w and b in the subscript. It's just a simpler notation that means exactly the same thing as f_w,b(x).
->Let's plot the training set on a graph where the input feature x is on the horizontal axis and the output target y is on the vertical axis. Remember, the algorithm learns from this data and generates the best-fit line, like maybe this one here. This straight line is the linear function f_w,b(x) = w times x plus b. Or more simply, we can drop w and b and just write f(x) = wx + b. Here's what this function is doing: it's making predictions for the value of y using a straight-line function of x. You may ask, why are we choosing a linear function, where linear function is just a fancy term for a straight line, instead of some non-linear function like a curve or a parabola? Well, sometimes you want to fit more complex non-linear functions as well, like a curve like this. But since this linear function is relatively simple and easy to work with, let's use a line as a foundation that will eventually help you get to more complex models that are non-linear.
->This particular model has a name: it's called linear regression. More specifically, this is linear regression with one variable, where the phrase one variable means that there's a single input variable or feature x, namely the size of the house. Another name for a linear model with one input variable is univariate linear regression, where uni means one in Latin and variate means variable. Univariate is just a fancy way of saying one variable.
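For reference, the model described in words above can be written compactly as follows. This is only a restatement of the spoken formula, using the parenthesized superscript to index training examples:

```latex
\hat{y}^{(i)} = f_{w,b}\left(x^{(i)}\right) = w\,x^{(i)} + b
```

Here w and b are the two numbers that determine the line, x^(i) is the input feature of the i-th example, and the hat marks the result as a prediction and not the true target y^(i).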

【About optional lab】
->In later videos, you'll also see a variation of regression where you'll want to make a prediction based not just on the size of a house, but on a bunch of other things that you may know about the house, such as the number of bedrooms and other features. By the way, there is another optional lab. You don't need to write any code. Just review it, run the code and see what it does. It will show you how to define a straight-line function in Python. The lab will let you choose the values of w and b to try to fit the training data. That's linear regression. In order to make this work, one of the most important things you have to do is construct a cost function. The idea of a cost function is one of the most universal and important ideas in machine learning, and it is used both in linear regression and in training many of the most advanced AI models in the world.

2.Cost Function

2.1 Cost function formula

eff55defa62940f38883bab78e7b5485.png

->In order to implement linear regression, the first key step is to define something called a cost function. This is something we'll build in this video, and the cost function will tell us how well the model is doing so that we can try to get it to do better. Let's look at what this means. Recall that you have a training set that contains input features x and output targets y. The model you're going to use to fit this training set is this linear function, f_w,b(x) = w times x plus b. To introduce a little bit more terminology, w and b are called the parameters of the model. In machine learning, the parameters of the model are the variables you can adjust during training in order to improve the model. Sometimes you'll also hear the parameters w and b referred to as coefficients or as weights.

b2b271fdde4b410ab7c7dcfb34dbcbff.png

->Now let's take a look at what these parameters w and b do. Depending on the values you've chosen for w and b, you get a different function f(x), which generates a different line on the graph. Remember that we can write f(x) as a shorthand for f_w,b(x). We're going to take a look at some plots of f(x) on a chart. Maybe you're already familiar with drawing lines on charts, but even if this is a review for you, I hope this will help you build intuition on how the parameters w and b determine f.
->When w is equal to 0 and b is equal to 1.5, f looks like this horizontal line. In this case, the function f(x) is 0 times x plus 1.5, so f is always a constant value. It always predicts 1.5 for the estimated value of y. Y-hat is always equal to b, and here b is also called the y-intercept because that's where the line crosses the vertical axis, or the y-axis, on this graph.
->As a second example, if w is 0.5 and b is equal to 0, then f(x) is 0.5 times x. When x is 0, the prediction is also 0, and when x is 2, the prediction is 0.5 times 2, which is 1. You get a line that looks like this, and notice that the slope is 0.5 divided by 1. The value of w gives you the slope of the line, which is 0.5.
->Finally, if w equals 0.5 and b equals 1, then f(x) is 0.5 times x plus 1. When x is 0, f(x) equals b, which is 1, so the line intersects the vertical axis at b, the y-intercept. Also, when x is 2, f(x) is 2, so the line looks like this. Again, the slope is 0.5 divided by 1, so the value of w gives you the slope, which is 0.5.
->Recall that you have a training set like the one shown here. With linear regression, what you want to do is choose values for the parameters w and b so that the straight line you get from the function f somehow fits the data well, like maybe this line shown here. When I say that the line fits the data visually, you can think of this to mean that the line defined by f passes roughly through, or somewhere close to, the training examples, as compared to other possible lines that are not as close to these points. Just to remind you of some notation, a training example like this point here is defined by (x^(i), y^(i)), where y is the target. For a given input x^(i), the function f also makes a predicted value for y, and the value that it predicts is y-hat^(i), shown here. Stated differently, the prediction y-hat^(i) is f_w,b(x^(i)), where for the model we're using, f(x^(i)) is equal to w times x^(i) plus b.
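The three parameter settings discussed above can be checked with a few lines of Python. This is a small illustrative sketch; the function name `predict` is an assumption made here, not a name from the course lab.

```python
def predict(x, w, b):
    """Linear model f_{w,b}(x) = w * x + b."""
    return w * x + b


# The three (w, b) settings walked through in the notes, evaluated at x = 0 and x = 2.
settings = [(0.0, 1.5), (0.5, 0.0), (0.5, 1.0)]
for w, b in settings:
    at_0 = predict(0, w, b)  # value at x = 0 is the y-intercept b
    at_2 = predict(2, w, b)
    slope = (at_2 - at_0) / 2  # rise over run recovers w
    print(f"w={w}, b={b}: f(0)={at_0}, f(2)={at_2}, slope={slope}")
```

In each case the value at x = 0 equals b (the y-intercept) and the rise over run between the two points equals w (the slope).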

979da3b563e141938bf1b3a02a73e639.png

->Now the question is, how do you find values for w and b so that the prediction y-hat^(i) is close to the true target y^(i) for many or maybe all training examples (x^(i), y^(i))? To answer that question, let's first take a look at how to measure how well a line fits the training data. To do that, we're going to construct a cost function. The cost function takes the prediction y-hat and compares it to the target y by taking y-hat minus y. This difference is called the error: we're measuring how far off the prediction is from the target. Next, let's compute the square of this error. We're going to want to compute this term for the different training examples i in the training set, so when measuring the error for example i, we'll compute this squared error term. Finally, we want to measure the error across the entire training set. In particular, let's sum up the squared errors like this. We'll sum from i equals 1, 2, 3, all the way up to m, and remember that m is the number of training examples, which is 47 for this dataset.
->Notice that if we have more training examples, m is larger and the cost function will calculate a bigger number, since it is summing over more examples. To build a cost function that doesn't automatically get bigger as the training set size gets larger, by convention we compute the average squared error instead of the total squared error, and we do that by dividing by m like this. We're nearly there. Just one last thing. By convention, the cost function that machine learning people use actually divides by 2 times m. The extra division by 2 is just meant to make some of our later calculations look neater, but the cost function still works whether you include this division by 2 or not. This expression right here is the cost function, and we're going to write J(w, b) to refer to it. This is also called the squared error cost function, and it's called this because you're taking the square of these error terms.
->In machine learning, different people will use different cost functions for different applications, but the squared error cost function is by far the most commonly used one for linear regression and, for that matter, for all regression problems, where it seems to give good results for many applications. Just as a reminder, the prediction y-hat is equal to the output of the model f at x. We can rewrite the cost function J(w, b) as 1 over 2m times the sum from i equals 1 to m of f(x^(i)) minus y^(i), the quantity squared. Eventually we're going to want to find values of w and b that make the cost function small. But before going there, let's first gain more intuition about what J(w, b) is really computing.
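Written out, the construction described above (error, squared error, sum over the training set, division by 2m) gives the following. This is a transcription of the spoken formula into LaTeX, with no new terms added:

```latex
\begin{aligned}
\text{error on example } i &: \quad \hat{y}^{(i)} - y^{(i)} \\
J(w,b) &= \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^{2}
        = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}\left(x^{(i)}\right) - y^{(i)}\right)^{2}
\end{aligned}
```

The two forms are the same expression, since the prediction is the model output evaluated at x^(i).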

->At this point you might be thinking we've done a whole lot of math to define the cost function. But what exactly is it doing? Let's go on to the next video, where we'll step through one example of what the cost function is really computing. I hope that will help you build intuition about what it means if J(w, b) is large versus if the cost J is small.

2.2 Cost function intuition

8a9048ac59914aa389128be09875a2ac.png

->We've seen the mathematical definition of the cost function. Now, let's build some intuition about what the cost function is really doing. We'll walk through one example to see how the cost function can be used to find the best parameters for your model.

->Here's what we've seen about the cost function so far. You want to fit a straight line to the training data, so you have this model, f_w,b(x) = w times x plus b. Here, the model's parameters are w and b. Depending on the values chosen for these parameters, you get different straight lines like this. You want to find values for w and b so that the straight line fits the training data well. To measure how well a choice of w and b fits the training data, you have a cost function J. What the cost function J does is measure the difference between the model's predictions and the actual true values for y. What you'll see later is that linear regression tries to find values for w and b that make J(w, b) as small as possible. In math, we write it like this: we want to minimize J as a function of w and b.
->Now, in order to better visualize the cost function J, let's work with a simplified version of the linear regression model. We're going to use the model f_w(x) = w times x. You can think of this as taking the original model on the left and getting rid of the parameter b, or setting the parameter b equal to 0. It just goes away from the equation, so f is now just w times x. You now have just one parameter, w, and your cost function J looks similar to what it was before, taking the difference and squaring it, except now f is equal to w times x^(i), and J is now a function of just w. The goal becomes a little bit different as well, because you have just one parameter, w, not w and b. With this simplified model, the goal is to find the value for w that minimizes J(w). To see this visually, if b is set to 0, then f defines a line that looks like this. You see that the line passes through the origin here, because when x is 0, f(x) is 0 too.
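Side by side, the simplified one-parameter setup described above reads as follows (a restatement of the notes in LaTeX, with b fixed at 0):

```latex
\begin{aligned}
\text{model:} \quad & f_{w}(x) = w\,x \\
\text{cost:}  \quad & J(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(w\,x^{(i)} - y^{(i)}\right)^{2} \\
\text{goal:}  \quad & \min_{w}\; J(w)
\end{aligned}
```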
760738c384624c98891e0d6c2d8a0b5c.png

->Now, using this simplified model, let's see how the cost function changes as you choose different values for the parameter w. In particular, let's look at graphs of the model f(x) and the cost function J. I'm going to plot these side by side, and you'll be able to see how the two are related. First, notice that for f_w, when the parameter w is fixed, that is, held at a constant value, f_w is only a function of x, which means that the estimated value of y depends on the value of the input x. In contrast, looking to the right, the cost function J is a function of w, where w controls the slope of the line defined by f_w. The cost defined by J depends on a parameter, in this case the parameter w.
->Let's go ahead and plot these functions, f_w(x) and J(w), side by side so you can see how they are related. We'll start with the model, that is, the function f_w(x), on the left. Here the input feature x is on the horizontal axis and the output value y is on the vertical axis. Here are the plots of three points representing the training set, at positions (1, 1), (2, 2), and (3, 3). Let's pick a value for w. Say w is 1. For this choice of w, the function f_w looks like this straight line with a slope of 1. What you can do next is calculate the cost J when w equals 1. You may recall that the cost function is defined as follows; it is the squared error cost function. If you substitute w times x^(i) for f_w(x^(i)), the cost function looks like this, where this expression is now w times x^(i) minus y^(i).
->For this value of w, it turns out that the error term inside the cost function, w times x^(i) minus y^(i), is equal to 0 for each of the three data points. For this dataset, when x is 1, y is 1. When w is also 1, f(x) equals 1, so f(x) equals y for this first training example, and the difference is 0. Plugging this into the cost function J, you get 0 squared. Similarly, when x is 2, y is 2 and f(x) is also 2. Again, f(x) equals y for the second training example, so in the cost function the squared error for the second example is also 0 squared. Finally, when x is 3, y is 3 and f(3) is also 3, so the third squared error term is also 0 squared. For all three examples in this training set, f(x^(i)) equals y^(i), so f(x^(i)) minus y^(i) is 0. For this particular dataset, when w is 1, the cost J is equal to 0.
->Now, what you can do on the right is plot the cost function J. Notice that because the cost function is a function of the parameter w, the horizontal axis is now labeled w and not x, and the vertical axis is now J and not y. You have J(1) equal to 0. In other words, when w equals 1, J(w) is 0, so let me go ahead and plot that.
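The w = 1 case above can be reproduced in a few lines. This is a minimal sketch of the simplified cost J(w) with b fixed at 0; the function name `compute_cost` and the variable names are assumptions made for illustration.

```python
# Three-point training set from the notes: (1, 1), (2, 2), (3, 3).
x_train = [1.0, 2.0, 3.0]
y_train = [1.0, 2.0, 3.0]


def compute_cost(w, x, y):
    """Squared error cost J(w) = 1/(2m) * sum((w * x_i - y_i)^2), with b = 0."""
    m = len(x)
    total = 0.0
    for x_i, y_i in zip(x, y):
        error = w * x_i - y_i   # prediction minus target
        total += error ** 2
    return total / (2 * m)


# With w = 1 every prediction equals its target, so every error is 0.
errors_at_w1 = [1.0 * x_i - y_i for x_i, y_i in zip(x_train, y_train)]
cost_at_w1 = compute_cost(1.0, x_train, y_train)
print("errors:", errors_at_w1)
print("J(1) =", cost_at_w1)
```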
21a99dbb47124d5f8edcae1bf7dc8aac.png

->Now, let's look at how f and J change for different values of w. The parameter w can take on a range of values: it can be negative, it can be 0, and it can be positive too. What if w is equal to 0.5 instead of 1? What would these graphs look like then? Let's set w equal to 0.5. In this case, the function f(x) now looks like this: it is a line with a slope equal to 0.5. Let's also compute the cost J when w is 0.5. Recall that the cost function is measuring the squared error, or difference, between the estimated value, that is, y-hat^(i), which is f(x^(i)), and the true value, that is, y^(i), for each example i. Visually, you can see that the error is equal to the height of this vertical line here when x is equal to 1, because this line is the gap between the actual value of y and the value that the function f predicted, which is a bit further down here.
->For this first example, when x is 1, f(x) is 0.5. The squared error on the first example is (0.5 minus 1) squared. Remember that the cost function sums over all the training examples in the training set, so let's go on to the second training example. When x is 2, the model is predicting f(x) is 1, and the actual value of y is 2. The error for the second example is equal to the height of this little line segment here, and the squared error is the square of the length of this line segment, so you get (1 minus 2) squared. Let's do the third example. Repeating this process, the squared error here, also shown by this line segment, is (1.5 minus 3) squared. Next, we sum up all of these terms, which turns out to be equal to 3.5. Then we multiply this sum by 1 over 2m, where m is the number of training examples. Since there are three training examples, m equals 3, so the factor is 1 over (2 times 3). If we work out the math, this turns out to be 3.5 divided by 6, so the cost J is about 0.58. Let's go ahead and plot that over there on the right.
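The arithmetic for w = 0.5 on the three training points (1, 1), (2, 2), (3, 3), written out step by step:

```latex
\begin{aligned}
J(0.5) &= \frac{1}{2 \cdot 3}\left[(0.5 - 1)^{2} + (1 - 2)^{2} + (1.5 - 3)^{2}\right] \\
       &= \frac{1}{6}\left[0.25 + 1 + 2.25\right]
        = \frac{3.5}{6} \approx 0.58
\end{aligned}
```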
ce38cc133883425f8970f32a40bf489c.png

->Now, let's try one more value for w. How about w equals 0? What do the graphs for f and J look like then? It turns out that if w is equal to 0, then f(x) is just this horizontal line that lies exactly on the x-axis. The error for each example is a line that goes from each point down to the horizontal line that represents f(x) equals 0. The cost J when w equals 0 is 1 over 2m times the quantity 1^2 plus 2^2 plus 3^2, and that's equal to 1 over 6 times 14, which is about 2.33. Let's plot this point, where w is 0 and J(0) is 2.33, over here.
->You can keep doing this for other values of w. Since w can be any number, it can also be a negative value. If w is negative 0.5, then the line f is a downward-sloping line like this. It turns out that when w is negative 0.5, you end up with an even higher cost, around 5.25, which is this point up here. You can continue computing the cost function for different values of w and plot these. By computing the cost over a range of values, you can slowly trace out what the cost function J looks like, and that's what J is.
->To recap, each value of the parameter w corresponds to a different straight-line fit, f(x), on the graph to the left. For the given training set, that choice of w corresponds to a single point on the graph on the right, because for each value of w you can calculate the cost J(w). For example, when w equals 1, this corresponds to this straight-line fit through the data, and it also corresponds to this point on the graph of J, where w equals 1 and the cost J(1) equals 0. Whereas when w equals 0.5, this gives you this line, which has a smaller slope. This line, in combination with the training set, corresponds to this point on the cost function graph at w equals 0.5. For each value of w you wind up with a different line and its corresponding cost, J(w), and you can use these points to trace out this plot on the right.
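The tracing-out procedure above can be sketched as a short loop over the four values of w used in the notes. This is an illustrative sketch only; `compute_cost` and the variable names are assumptions, and b is fixed at 0 as in the simplified model.

```python
# Three-point training set from the notes: (1, 1), (2, 2), (3, 3).
x_train = [1.0, 2.0, 3.0]
y_train = [1.0, 2.0, 3.0]


def compute_cost(w, x, y):
    """Squared error cost J(w) = 1/(2m) * sum((w * x_i - y_i)^2), with b = 0."""
    m = len(x)
    return sum((w * x_i - y_i) ** 2 for x_i, y_i in zip(x, y)) / (2 * m)


# Each w gives one line f(x) = w * x on the left and one point (w, J(w)) on the right.
w_values = [-0.5, 0.0, 0.5, 1.0]
costs = {w: compute_cost(w, x_train, y_train) for w in w_values}

for w, cost in costs.items():
    print(f"w = {w:5.2f}   J(w) = {cost:.2f}")
```

The printed costs fall steadily as w moves from -0.5 toward 1, which is the left half of the U-shaped curve.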
212648030e954ec9931303daa2d39423.png

->Given this, how can you choose the value of w that results in the function f fitting the data well? Well, as you can imagine, choosing a value of w that causes J(w) to be as small as possible seems like a good bet. J is the cost function that measures how big the squared errors are, so choosing the w that minimizes these squared errors, making them as small as possible, will give us a good model. In this example, if you were to choose the value of w that results in the smallest possible value of J(w), you'd end up picking w equals 1. As you can see, that's actually a pretty good choice. It results in the line that fits the training data very well. That's how, in linear regression, you use the cost function to find the value of w that minimizes J. In the more general case, where we have parameters w and b and not just w, you find the values of w and b that minimize J.
->To summarize, you saw plots of both f and J and worked through how the two are related. As you vary w, or vary w and b, you end up with different straight lines, and when the straight line passes close to the data, the cost J is small. The goal of linear regression is to find the parameters w, or w and b, that result in the smallest possible value for the cost function J. In this video, we worked through an example with a simplified problem using only w. In the next video, let's visualize what the cost function looks like for the full version of linear regression using both w and b.
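To make "pick the w with the smallest J(w)" concrete, here is a brute-force sketch that scans a grid of candidate values. The grid scan is a stand-in chosen for illustration; the notes go on to use gradient descent for this step (section 3). The candidate range and step are arbitrary assumptions.

```python
# Three-point training set from the notes: (1, 1), (2, 2), (3, 3).
x_train = [1.0, 2.0, 3.0]
y_train = [1.0, 2.0, 3.0]


def compute_cost(w, x, y):
    """Squared error cost J(w) = 1/(2m) * sum((w * x_i - y_i)^2), with b = 0."""
    m = len(x)
    return sum((w * x_i - y_i) ** 2 for x_i, y_i in zip(x, y)) / (2 * m)


# Candidate slopes from -1.0 to 3.0 in steps of 0.1 (an arbitrary illustrative grid).
candidates = [i / 10 for i in range(-10, 31)]

# Keep the candidate whose cost is smallest.
best_w = min(candidates, key=lambda w: compute_cost(w, x_train, y_train))
best_cost = compute_cost(best_w, x_train, y_train)
print(f"best w = {best_w}, J(best w) = {best_cost}")
```

A grid scan only works here because there is a single parameter and a tiny dataset. It also only finds the true minimum if that value happens to be on the grid.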

2.3 Visualizing the cost function

d45da2e9d26c49899d77ed02aa02d388.png

->In the last video, you saw one visualization of the cost function J(w) or J(w, b). Let's look at some further, richer visualizations so that you can get an even better intuition about what the cost function is doing. Here is what we've seen so far. There's the model, the model's parameters w and b, the cost function J(w, b), as well as the goal of linear regression, which is to minimize the cost function J(w, b) over the parameters w and b.

0c0a4e20917d4f7c82fe147413615406.png

->In the last video, we temporarily set b to zero in order to simplify the visualizations. Now, let's go back to the original model with both parameters w and b, without setting b equal to 0. Same as last time, we want to get a visual understanding of the model function f(x), shown here on the left, and how it relates to the cost function J(w, b), shown here on the right. Here's a training set of house sizes and prices. Let's say you pick one possible function of x, like this one. Here, I've set w to 0.06 and b to 50, so f(x) is 0.06 times x plus 50. Note that this is not a particularly good model for this training set; it is actually a pretty bad model. It seems to consistently underestimate housing prices. Given these values for w and b, let's look at what the cost function J(w, b) may look like. Recall that last time, when we had only w because we temporarily set b to zero to simplify things, we came up with a plot of the cost function that looked like this, as a function of w only. When we had only one parameter, w, the cost function had this U-shaped curve, shaped a bit like a soup bowl. Now, in the housing price example on this slide, we have two parameters, w and b, so the plot becomes a little more complex.
53e83c2478c942cb835892d539955ce3.png

->It turns out that the cost function also has a similar shape, like a soup bowl, except in three dimensions instead of two. In fact, depending on your training set, the cost function will look something like this. What you see here is a 3D surface plot where the axes are labeled w and b. As you vary w and b, which are the two parameters of the model, you get different values for the cost function J of w, b. This is a lot like the U-shaped curve you saw in the last video, except instead of having one parameter w as input for J, you now have two parameters, w and b, as inputs into this soup bowl or hammock-shaped function J. I just want to point out that any single point on this surface represents some particular choice of w and b. For example, if w was minus 10 and b was minus 15, then the height of the surface above this point is the value of J when w is minus 10 and b is minus 15. Now, in order to look even more closely at specific points, there's another way of plotting the cost function J that would be useful for visualization. Rather than using these 3D surface plots, I like to take this exact same function J, without changing it at all, and plot it using something called a contour plot.
b1fc6aa6cb2648e5884988346381b290.png
->If you've ever seen a topographical map showing how high different mountains are, the contours in a topographical map are basically horizontal slices of the landscape of, say, a mountain. This image is of Mount Fuji in Japan. I still remember my family visiting Mount Fuji when I was a teenager. It's a beautiful sight. If you fly directly above the mountain, that's what this contour map looks like. It shows all the points that are at the same height, for different heights. At the bottom of this slide is a 3D surface plot of the cost function J. I know it doesn't look very bowl-shaped, but it is actually a bowl, just very stretched out, which is why it looks like that. In an optional lab that is shortly to follow, you will be able to see this in 3D and spin around the surface yourself, and it'll look more obviously bowl-shaped there. Next, here on the upper right is a contour plot of this exact same cost function as that shown at the bottom. The two axes on this contour plot are b, on the vertical axis, and w, on the horizontal axis. What each of these ovals, also called ellipses, shows is the set of points on the 3D surface which are at the exact same height. In other words, the set of points which have the same value for the cost function J. To get the contour plots, you take the 3D surface at the bottom and you use a knife to slice it horizontally. You take horizontal slices of that 3D surface and get all the points that are at the same height. Therefore, each horizontal slice ends up being shown as one of these ellipses or one of these ovals. Concretely, if you take that point, and that point, and that point, all three of these points have the same value for the cost function J, even though they have different values for w and b. In the figure on the upper left, you see also that these three points correspond to different functions f, all three of which are actually pretty bad for predicting housing prices in this case. Now, the bottom of the bowl, where the cost function J is at a minimum, is this point right here, at the center of these concentric ovals. If you haven't seen contour plots much before, I'd like you to imagine, if you will, that you are flying high up above the bowl in an airplane or in a rocket ship, and you're looking straight down at it. That is as if you set your computer monitor flat on your desk facing up and the bowl shape is coming directly out of your screen, rising above your desk. Imagine that the bowl shape grows out of your computer screen lying flat like that, so that each of these ovals has the same height above your screen and the minimum of the bowl is right down there in the center of the smallest oval. It turns out that contour plots are a convenient way to visualize the 3D cost function J, but in a way that's plotted in just 2D. In this video, you saw how the 3D bowl-shaped surface plot can also be visualized as a contour plot. Using this visualization tool, in the next video, let's visualize some specific choices of w and b in the linear regression model so that you can see how these different choices affect the straight line you're fitting to the data.
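->To make the idea of "points at the same height" concrete, here is a small Python sketch that evaluates J of w, b on a coarse grid, which is the kind of data a surface or contour plot is drawn from. The training set, the grid ranges, and the two sample points are all assumptions chosen for illustration, not values from the slides.

```python
# Hypothetical training set, chosen so that y = 2x exactly (best fit: w = 2, b = 0).
x_train = [1.0, 2.0, 3.0]
y_train = [2.0, 4.0, 6.0]

def cost(w, b, xs, ys):
    """Squared error cost J(w, b) for the model f(x) = w * x + b."""
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Evaluate J on a coarse grid of (w, b) values.
w_values = [i * 0.5 for i in range(0, 9)]    # 0.0, 0.5, ..., 4.0
b_values = [i * 1.0 for i in range(-3, 4)]   # -3.0, -2.0, ..., 3.0
grid = {(w, b): cost(w, b, x_train, y_train) for w in w_values for b in b_values}

# The lowest grid point plays the role of the center of the smallest oval.
w_min, b_min = min(grid, key=grid.get)
print("lowest cost on the grid at w =", w_min, "and b =", b_min)

# Two different parameter pairs with the same cost sit on the same contour line.
p1, p2 = (2.5, 1.0), (1.5, -1.0)
print("J at", p1, "=", grid[p1])
print("J at", p2, "=", grid[p2])
```

The two sample points have different values of w and b but the same cost, so on a contour plot they would lie on the same oval.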

2.4 Visualization examples

【example 1】
b9b1b13b8c0c44759e4abaed5e5430d2.png
->Let's look at some more visualizations of w and b. Here's one example. Over here, you have a particular point on the graph of J. For this point, w equals about negative 0.15 and b equals about 800. This point corresponds to one pair of values for w and b that yields a particular cost J. In fact, this particular pair of values for w and b corresponds to this function f of x, which is this line you can see on the left. This line intersects the vertical axis at 800 because b equals 800, and the slope of the line is negative 0.15 because w equals negative 0.15. Now, if you look at the data points in the training set, you may notice that this line is not a good fit to the data. For this function f of x, with these values of w and b, many of the predictions for the value of y are quite far from the actual target value of y that is in the training data. Because this line is not a good fit, if you look at the graph of J, the cost of this line is out here, which is pretty far from the minimum. There's a pretty high cost because this choice of w and b is just not that good a fit to the training set.

【example 2】
f3e7a07960794251a49120e450299a54.png
->Now, let's look at another example with a different choice of w and b. Here's another function that is still not a great fit for the data, but maybe slightly less bad. This point here represents the cost for this particular pair of w and b that creates that line. The value of w is equal to 0 and the value of b is about 360. This pair of parameters corresponds to this function, which is a flat line, because f of x equals 0 times x plus 360. I hope that makes sense.

【example 3】
9a4fb965fdf14d929241e8839e278967.png

->Let's look at yet another example. Here's one more choice for w and b, and with these values, you end up with this line f of x. Again, it's not a great fit to the data, and it's actually further away from the minimum compared to the previous example. Remember that the minimum is at the center of that smallest ellipse.

【example 4】
25e2c597a37643fabc059c248a336206.png
->Last example: if you look at f of x on the left, this looks like a pretty good fit to the training set. You can see on the right that this point representing the cost is very close to the center of the smallest ellipse. It's not quite exactly the minimum, but it's pretty close. For these values of w and b, you get this line, f of x. You can see that if you measure the vertical distances between the data points and the predicted values on the straight line, you'd get the error for each data point. The sum of squared errors for all of these data points is pretty close to the minimum possible sum of squared errors among all possible straight line fits. I hope that by looking at these figures, you can get a better sense of how different choices of the parameters affect the line f of x and how this corresponds to different values for the cost J, and hopefully you can see how the better fit lines correspond to points on the graph of J that are closer to the minimum possible cost for this cost function J of w and b.
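->The link between "how well the line fits" and "how high the cost is" can be checked numerically. In the sketch below, the first two parameter pairs are the ones quoted in examples 1 and 2. The training set and the third, good-fitting pair are assumptions invented for illustration, since the actual data behind the slides is not given here.

```python
# Hypothetical training set: house size in square feet, price in thousands of dollars.
sizes = [1000.0, 1500.0, 2000.0, 2500.0]
prices = [210.0, 290.0, 410.0, 490.0]

def cost(w, b, xs, ys):
    """Squared error cost J(w, b) for the model f(x) = w * x + b."""
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# (w, b) pairs: the first two are from the examples above, the third is an assumed good fit.
choices = {
    "example 1 (w=-0.15, b=800)": (-0.15, 800.0),
    "example 2 (w=0, b=360)": (0.0, 360.0),
    "assumed good fit (w=0.2, b=0)": (0.2, 0.0),
}

costs = {name: cost(w, b, sizes, prices) for name, (w, b) in choices.items()}
for name, value in costs.items():
    print(f"{name}: J = {value:.2f}")
```

On this made-up data the downward-sloping line has the highest cost, the flat line is less bad, and the line that follows the trend of the data has by far the lowest cost.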

【About optional lab】
->In the optional lab that follows this video, you'll get to run some code. Remember, all the code is given, so you just need to hit Shift Enter to run it and take a look at it. The lab will show you how the cost function is implemented in code. Given a small training set and different choices for the parameters, you'll be able to see how the cost varies depending on how well the model fits the data. In the optional lab, you can also play with an interactive contour plot. Check this out. You can use your mouse cursor to click anywhere on the contour plot and you will see the straight line defined by the values you chose for the parameters w and b. You'll also see a dot up here on the 3D surface plot showing the cost. Finally, the optional lab also has a 3D surface plot that you can manually rotate and spin around using your mouse cursor to take a better look at what the cost function looks like. I hope you'll enjoy playing with the optional lab. Now in linear regression, manually trying to read a contour plot for the best values of w and b isn't really a good procedure, and it also won't work once we get to more complex machine learning models. What you really want is an efficient algorithm that you can write in code for automatically finding the values of parameters w and b that give you the best fit line, the one that minimizes the cost function J. There is an algorithm for doing this called gradient descent. This algorithm is one of the most important algorithms in machine learning. Gradient descent and variations on gradient descent are used to train not just linear regression, but some of the biggest and most complex models in all of AI.

3.Gradient Descent

3.1 Gradient descent

->In the last video, we saw visualizations of the cost function J and how you can try different choices of the parameters w and b and see what cost value they get you. It would be nice if we had a more systematic way to find the values of w and b that result in the smallest possible cost, J of w, b. It turns out there's an algorithm called gradient descent that you can use to do that. Gradient descent is used all over the place in machine learning, not just for linear regression, but for training, for example, some of the most advanced neural network models, also called deep learning models. Deep learning models are something you'll learn about in the second course. Learning this tool of gradient descent will set you up with one of the most important building blocks in machine learning.
7df7b95f7bda4a3080ed87059d74db9d.png
->Here's an overview of what we'll do with gradient descent. You have the cost function J of w, b right here that you want to minimize. In the example we've seen so far, this is a cost function for linear regression, but it turns out that gradient descent is an algorithm that you can use to try to minimize any function, not just a cost function for linear regression. Just to make this discussion on gradient descent more general, it turns out that gradient descent applies to more general functions, including other cost functions that work with models that have more than two parameters. For instance, if you have a cost function J as a function of w_1, w_2 up to w_n and b, your objective is to minimize J over the parameters w_1 to w_n and b. In other words, you want to pick values for w_1 through w_n and b that give you the smallest possible value of J. It turns out that gradient descent is an algorithm that you can apply to try to minimize this cost function J as well. What you're going to do is just start off with some initial guesses for w and b. In linear regression, it won't matter too much what the initial values are, so a common choice is to set them both to 0. For example, you can set w to 0 and b to 0 as the initial guess. With the gradient descent algorithm, what you're going to do is keep on changing the parameters w and b a bit every time to try to reduce the cost J of w, b, until hopefully J settles at or near a minimum. One thing I should note is that for some functions J that may not be a bowl shape or a hammock shape, it is possible for there to be more than one possible minimum.
【example: cost function of a neural network model】
d588cde6fcbe44c394ddd157b29497b4.png
->Let's take a look at an example of a more complex surface plot J to see what gradient descent is doing. This function is not a squared error cost function. For linear regression with the squared error cost function, you always end up with a bowl shape or a hammock shape. But this is a type of cost function you might get if you're training a neural network model. Notice the axes, that is, w and b on the bottom axes. For different values of w and b, you get different points on this surface, J of w, b, where the height of the surface at some point is the value of the cost function.
->Now, let's imagine that this surface plot is actually a view of a slightly hilly outdoor park or a golf course where the high points are hills and the low points are valleys, like so. I'd like you to imagine, if you will, that you are physically standing at this point on the hill. If it helps you to relax, imagine that there's lots of really nice green grass and butterflies and flowers. It's a really nice hill. Your goal is to start up here and get to the bottom of one of these valleys as efficiently as possible. What the gradient descent algorithm does is this: you're going to spin around 360 degrees, look around, and ask yourself, if I were to take a tiny little baby step in one direction, and I want to go downhill as quickly as possible to one of these valleys, what direction do I choose to take that baby step? Well, if you want to walk down this hill as efficiently as possible, it turns out that if you're standing at this point on the hill and you look around, you will notice that the best direction to take your next step downhill is roughly that direction. Mathematically, this is the direction of steepest descent. It means that when you take a tiny little baby step, this takes you downhill faster than a tiny little baby step you could have taken in any other direction. After taking this first step, you're now at this point on the hill over here. Now let's repeat the process. Standing at this new point, you're going to again spin around 360 degrees and ask yourself, in what direction will I take the next little baby step in order to move downhill? If you do that and take another step, you end up moving a bit in that direction, and you can keep going. From this new point, you can again look around and decide what direction would take you downhill most quickly. Take another step, another step, and so on, until you find yourself at the bottom of this valley, at this local minimum, right here. What you just did was go through multiple steps of gradient descent.
->It turns out gradient descent has an interesting property. Remember that you can choose a starting point on the surface by choosing starting values for the parameters w and b. When you performed gradient descent a moment ago, you had started at this point over here. Now, imagine you try gradient descent again, but this time you choose a different starting point by choosing parameters that place your starting point just a couple of steps to the right, over here. If you then repeat the gradient descent process, you look around, take a little step in the direction of steepest descent, and end up here. Then you again look around, take another step, and so on. If you were to run gradient descent this second time, starting just a couple of steps to the right of where we did it the first time, then you end up in a totally different valley, this different minimum over here on the right. The bottoms of both the first and the second valleys are called local minima. That's because if you start going down the first valley, gradient descent won't lead you to the second valley, and the same is true if you started going down the second valley: you stay in that second minimum and do not find your way into the first local minimum. This is an interesting property of the gradient descent algorithm, and you'll see more about this later. In this video, you saw how gradient descent helps you go downhill. In the next video, let's look at the mathematical expressions that you can implement to make gradient descent work.
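->The "different starting point, different valley" behavior can be reproduced with a much simpler function than the surface on the slide. The sketch below is only an illustration: it uses a hypothetical one-parameter cost J(w) = (w^2 - 1)^2, which has two valleys at w = -1 and w = +1, and an assumed learning rate of 0.05.

```python
def dJ_dw(w):
    """Derivative of the hypothetical cost J(w) = (w**2 - 1)**2, with valleys at w = -1 and w = +1."""
    return 4 * w * (w ** 2 - 1)

def gradient_descent(w_start, alpha=0.05, steps=200):
    """Repeatedly step downhill from w_start and return the final value of w."""
    w = w_start
    for _ in range(steps):
        w = w - alpha * dJ_dw(w)
    return w

# Two starting points that are close together, on either side of the hilltop at w = 0.
end_left = gradient_descent(-0.1)
end_right = gradient_descent(0.1)
print("started at -0.1, ended near", round(end_left, 4))
print("started at +0.1, ended near", round(end_right, 4))
```

Each run settles into the valley on its own side of the hill and never crosses over to the other one.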

3.2 Implementing gradient descent

d57fb7746e864220863a27680dc867a2.png
->Let's take a look at how you can actually implement the gradient descent algorithm. Let me write down the gradient descent algorithm. Here it is. On each step, w, the parameter, is updated to the old value of w minus Alpha times this term, d/dw of the cost function J of w, b. What this expression is saying is: update your parameter w by taking the current value of w and adjusting it a small amount, which is this expression on the right, minus Alpha times this term over here.
【equal notation】
->If you feel like there's a lot going on in this equation, it's okay, don't worry about it. We'll unpack it together. First, this equal notation here. Since I said we're assigning w a value using this equal sign, in this context the equal sign is the assignment operator. Specifically, in this context, if you write code that says a equals c, it means take the value c and store it in your computer, in the variable a. Or if you write a equals a plus 1, it means set the value of a to be equal to a plus 1, that is, increment the value of a by one. The assignment operator in coding is different from truth assertions in mathematics, where if I write a equals c, I'm asserting, that is, I'm claiming, that the values of a and c are equal to each other. Hopefully, I will never write a truth assertion a equals a plus 1, because that just can't possibly be true. In Python and in other programming languages, truth assertions are sometimes written as equals equals, so you may see code that says a equals equals c if you're testing whether a is equal to c. But in math notation, as we conventionally use it, like in these videos, the equal sign can be used for either assignments or truth assertions. I'll try to make sure I'm clear, when I write an equal sign, whether we're assigning a value to a variable or asserting the truth of the equality of two values.
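->In Python the two meanings of the equal sign look like this. This is just a tiny illustration with made-up variable values.

```python
a = 5           # assignment: store the value 5 in the variable a
c = 5           # assignment: store the value 5 in the variable c

print(a == c)   # comparison: tests whether a and c hold equal values

a = a + 1       # assignment again: take the current a, add 1, store the result back in a
print(a)
print(a == c)   # the comparison now gives a different answer, because a has changed
```

The line a = a + 1 would make no sense as a mathematical claim, but as an assignment it simply increments a.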
【expression】
->Now, let's dive more deeply into what the symbols in this equation mean. The symbol here is the Greek letter Alpha. In this equation, Alpha is also called the learning rate. The learning rate is usually a small positive number between 0 and 1, and it might be, say, 0.01. What Alpha does is basically control how big of a step you take downhill. If Alpha is very large, then that corresponds to a very aggressive gradient descent procedure where you're trying to take huge steps downhill. If Alpha is very small, then you'd be taking small baby steps downhill. We'll come back later to dive more deeply into how to choose a good learning rate Alpha. Finally, this term here is the derivative term of the cost function J. Let's not worry about the details of this derivative right now. Later on, you'll get to see more about the derivative term. For now, you can think of this derivative term that I drew a magenta box around as telling you in which direction you want to take your baby step. In combination with the learning rate Alpha, it also determines the size of the steps you want to take downhill.
->Now, I do want to mention that derivatives come from calculus. Even if you aren't familiar with calculus, don't worry about it. Even without knowing any calculus, you'd be able to figure out all you need to know about this derivative term in this video and the next. One more thing. Remember your model has two parameters, not just w, but also b. You also have an assignment operation to update the parameter b that looks very similar. b is assigned the old value of b minus the learning rate Alpha times this slightly different derivative term, d/db of J of w, b. Remember in the graph of the surface plot where you're taking baby steps until you get to the bottom of the valley? Well, for the gradient descent algorithm, you're going to repeat these two update steps until the algorithm converges. By converges, I mean that you reach the point at a local minimum where the parameters w and b no longer change much with each additional step that you take.
->Now, there's one more subtle detail about how to correctly implement gradient descent. You're going to update two parameters, w and b, and this update takes place for both of them. One important detail is that for gradient descent, you want to simultaneously update w and b, meaning you want to update both parameters at the same time. What I mean by that is that in this expression, you're going to update w from the old w to a new w, and you're also updating b from its old value to a new value of b. The way to implement this is to compute the right-hand sides, computing this thing for w and b, and then simultaneously, at the same time, update w and b to the new values. Let's take a look at what this means. Here's the correct way to implement gradient descent, which does a simultaneous update. This sets a variable temp_w equal to that expression, which is w minus that term here. It also sets another variable temp_b to that, which is b minus that term. You compute both right-hand sides, both updates, and store them in the variables temp_w and temp_b. Then you copy the value of temp_w into w, and you also copy the value of temp_b into b. Now, one thing you may notice is that this value of w is from before w gets updated. Notice that the pre-update w is what goes into the derivative term over here. In contrast, here is an incorrect implementation of gradient descent that does not do a simultaneous update. In this incorrect implementation, we compute temp_w, same as before. So far that's okay. Now here's where things start to differ. We then update w with the value in temp_w before calculating the new value for the other parameter, b. Next, we calculate temp_b as b minus that term here, and finally, we update b with the value in temp_b.
The difference between the right-hand side and the left-hand side implementations is that if you look over here, this w has already been updated to this new value, and this is the updated w that actually goes into the cost function J of w, b. It means that this term here on the right is not the same as this term over here that you see on the left. That also means this temp_b term on the right is not quite the same as the temp_b term on the left, and thus this updated value for b on the right is not the same as this updated value for variable b on the left. The way that gradient descent is implemented in code, it actually turns out to be more natural to implement it the correct way, with simultaneous updates. When you hear someone talk about gradient descent, they always mean the gradient descent where you perform a simultaneous update of the parameters. If, however, you were to implement a non-simultaneous update, it turns out it will probably work more or less anyway. But doing it this way isn't really the correct way to implement it; it's actually some other algorithm with different properties. I would advise you to just stick to the correct simultaneous update and not use this incorrect version on the right. That's gradient descent.
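->The difference between the two orderings shows up in a single step. The Python sketch below is an illustration only: it uses a hypothetical cost J(w, b) = w^2 + w*b + b^2 (not the linear regression cost), picked because its derivative with respect to b depends on w, so updating w too early visibly changes the result. The names temp_w and temp_b follow the slide; everything else is assumed.

```python
def dJ_dw(w, b):
    """Derivative with respect to w of the hypothetical cost J(w, b) = w**2 + w*b + b**2."""
    return 2 * w + b

def dJ_db(w, b):
    """Derivative with respect to b of the same hypothetical cost."""
    return w + 2 * b

alpha = 0.1

# Correct: simultaneous update. Both right-hand sides use the old w and the old b.
w, b = 1.0, 1.0
temp_w = w - alpha * dJ_dw(w, b)
temp_b = b - alpha * dJ_db(w, b)
w, b = temp_w, temp_b
w_ok, b_ok = w, b

# Incorrect: w is overwritten before temp_b is computed, so dJ_db sees the new w.
w, b = 1.0, 1.0
temp_w = w - alpha * dJ_dw(w, b)
w = temp_w
temp_b = b - alpha * dJ_db(w, b)
b = temp_b
w_bad, b_bad = w, b

print("simultaneous update:     w =", w_ok, " b =", b_ok)
print("non-simultaneous update: w =", w_bad, " b =", b_bad)
```

Both versions give the same new w, but the second one computes the new b from an already-updated w, so its b differs from the simultaneous result.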
->In the next video, we'll go into details of the derivative term, which you saw in this video but that we didn't really talk about in detail. Derivatives are part of calculus, and again, if you're not familiar with calculus, don't worry about it. You don't need to know calculus at all in order to complete this course or this specialization, and you have all the information you need in order to implement gradient descent. Coming up in the next video, we'll go over derivatives together, and you'll come away with the intuition and knowledge you need to be able to implement and apply gradient descent yourself. I think that'll be an exciting thing for you to know how to implement. Let's go on to the next video to see how to do that.

3.3 Gradient descent intuition

e30b91a654ec4a0aae2b120939f9d742.png
->Now let's dive more deeply into gradient descent to gain better intuition about what it's doing and why it might make sense. Here's the gradient descent algorithm that you saw in the previous video. As a reminder, this variable, this Greek symbol Alpha, is the learning rate. The learning rate controls how big of a step you take when updating the model's parameters, w and b. This term here, this d over dw, is a derivative term. By convention in math, this d is written with this funny font here. In case anyone watching this has a PhD in math or is an expert in multivariate calculus, they may be wondering: that's not the derivative, that's the partial derivative. Yes, they'd be right. But for the purposes of implementing a machine learning algorithm, I'm just going to call it the derivative. Don't worry about these little distinctions.
->What we're going to focus on now is getting more intuition about what this learning rate and this derivative are doing, and why, when multiplied together like this, they result in updates to the parameters w and b that make sense. In order to do this, let's use a slightly simpler example where we work on minimizing just one parameter. Let's say that you have a cost function J of just one parameter w, where w is a number. This means gradient descent now looks like this: w is updated to w minus the learning rate Alpha times d over dw of J of w. You're trying to minimize the cost by adjusting the parameter w.
【example 1】
25d4ff82611d40d49aa1aa74230b6946.png
->This is like our previous example where we had temporarily set b equal to 0. With one parameter w instead of two, you can look at two-dimensional graphs of the cost function J instead of three-dimensional graphs. Let's look at what gradient descent does on just the function J of w. Here on the horizontal axis is the parameter w, and on the vertical axis is the cost J of w. Now let's initialize gradient descent with some starting value for w. Let's initialize it at this location. Imagine that you start off at this point right here on the function J. What gradient descent will do is update w to be w minus the learning rate Alpha times d over dw of J of w. Let's look at what this derivative term here means. A way to think about the derivative at this point on the line is to draw a tangent line, which is a straight line that touches this curve at that point. Well, the slope of this line is the derivative of the function J at this point. To get the slope, you can draw a little triangle like this. If you compute the height divided by the width of this triangle, that is the slope. For example, this slope might be 2 over 1. When the tangent line is pointing up and to the right, the slope is positive, which means that this derivative is a positive number, so it is greater than 0. The updated w is going to be w minus the learning rate times some positive number. The learning rate is always a positive number. If you take w minus a positive number, you end up with a new value for w that's smaller. On the graph, you're moving to the left, you're decreasing the value of w. You may notice that this is the right thing to do if your goal is to decrease the cost J, because when we move towards the left on this curve, the cost J decreases, and you're getting closer to the minimum for J, which is over here. So far, gradient descent seems to be doing the right thing.
【example 2】
e16a6ec01d594a50941202b5d888acf9.png
->Now, let's look at another example. Let's take the same function J of w as above, and now let's say that you initialized gradient descent at a different location, say by choosing a starting value for w that's over here on the left. That's this point on the function J. Now, the derivative term, remember, is d over dw of J of w, and when we look at the tangent line at this point over here, the slope of this line is the derivative of J at this point. But this tangent line is sloping down and to the right. A line sloping down and to the right has a negative slope. In other words, the derivative of J at this point is a negative number. For instance, if you draw a triangle, then the height like this is negative 2 and the width is 1, so the slope is negative 2 divided by 1, which is negative 2, a negative number. When you update w, you get w minus the learning rate times a negative number. This means you subtract a negative number from w. But subtracting a negative number is the same as adding a positive number, and so you end up increasing w. This step of gradient descent causes w to increase, which means you're moving to the right of the graph, and your cost J has decreased down to here. Again, it looks like gradient descent is doing something reasonable: it's getting you closer to the minimum.
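->Both cases can be checked with one update step each. The sketch below is an assumption-laden illustration: the cost J(w) = (w - 2)^2, the two starting points, and the learning rate of 0.1 are made up, chosen so that the slopes come out as +2 and -2 like the triangles in the examples.

```python
def cost(w):
    """Hypothetical bowl-shaped cost with its minimum at w = 2."""
    return (w - 2) ** 2

def dJ_dw(w):
    """Slope of the tangent line to the cost at w."""
    return 2 * (w - 2)

alpha = 0.1

# Start to the right of the minimum: the slope is positive, so w decreases.
w_right = 3.0
new_from_right = w_right - alpha * dJ_dw(w_right)

# Start to the left of the minimum: the slope is negative, so w increases.
w_left = 1.0
new_from_left = w_left - alpha * dJ_dw(w_left)

print("slope at w=3:", dJ_dw(w_right), "-> new w:", new_from_right)
print("slope at w=1:", dJ_dw(w_left), "-> new w:", new_from_left)
print("cost before/after (right start):", cost(w_right), cost(new_from_right))
print("cost before/after (left start): ", cost(w_left), cost(new_from_left))
```

In both cases the sign of the derivative pushes w toward the minimum, and the cost goes down after the step.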
->Hopefully, these last two examples show some of the intuition behind what the derivative term is doing and why this helps gradient descent change w to get you closer to the minimum. I hope this video gave you some sense for why the derivative term in gradient descent makes sense. One other key quantity in the gradient descent algorithm is the learning rate Alpha. How do you choose Alpha? What happens if it's too small, or what happens when it's too big? In the next video, let's take a deeper look at the parameter Alpha to help build intuitions about what it does, as well as how to make a good choice for the value of Alpha in your implementation of gradient descent.

4.Learning rate

->The choice of the learning rate, alpha, will have a huge impact on the efficiency of your implementation of gradient descent. And if alpha, the learning rate, is chosen poorly, gradient descent may not even work at all. In this video, let's take a deeper look at the learning rate. This will also help you choose better learning rates for your implementations of gradient descent.

【alpha is too small】
15036c715ff246f69225f97ddd02a490.png->So here again is the gradient descent rule. W is updated to be W minus the learning rate alpha times the derivative term. To learn more about what the learning rate alpha is doing, let's see what could happen if the learning rate alpha is either too small or too large. Take the case where the learning rate is too small. Here's a graph where the horizontal axis is W and the vertical axis is the cost J. And here's the graph of the function J of W. Let's start gradient descent at this point here. If the learning rate is too small, then what happens is that you multiply your derivative term by some really, really small number. So you're going to be multiplying by a number alpha that's really small, like 0.0000001. And so you end up taking a very small baby step like that. Then from this point you're going to take another tiny, tiny little baby step. But because the learning rate is so small, the second step is also just minuscule. The outcome of this process is that you do end up decreasing the cost J, but incredibly slowly. So, here's another step and another step, another tiny step, until you finally approach the minimum. But as you may notice, you're going to need a lot of steps to get to the minimum. So to summarize, if the learning rate is too small, then gradient descent will work, but it will be slow. It will take a very long time because it's going to take these tiny, tiny baby steps. And it's going to need a lot of steps before it gets anywhere close to the minimum.
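->One way to see the slowdown is to count steps. The sketch below is a rough illustration under assumptions: the cost J(w) = w^2, the starting point, the stopping tolerance, and both learning rates are invented, and 0.001 stands in for the far smaller value quoted above so that the loop finishes quickly.

```python
def steps_to_converge(alpha, w_start=1.0, tol=1e-3, max_steps=100_000):
    """Count gradient descent steps on J(w) = w**2 until |dJ/dw| drops below tol."""
    w = w_start
    for step in range(max_steps):
        grad = 2 * w                 # derivative of w**2 at the current w
        if abs(grad) < tol:
            return step
        w = w - alpha * grad
    return max_steps                 # did not converge within the step budget

steps_moderate = steps_to_converge(alpha=0.1)
steps_tiny = steps_to_converge(alpha=0.001)

print("steps needed with alpha = 0.1:  ", steps_moderate)
print("steps needed with alpha = 0.001:", steps_tiny)
```

Both runs reach the minimum, but the tiny learning rate needs far more steps to get there.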
【alpha is too large】
15036c715ff246f69225f97ddd02a490.png
->Now, let's look at a different case. What happens if the learning rate is too large? Here's another graph of the cost function. And let's say we start gradient descent with W at this value here. So it's actually already pretty close to the minimum, and the derivative points to the right. But if the learning rate is too large, then you update W with a giant step, to be all the way over here. And that's this point here on the function J. So you move from this point on the left all the way to this point on the right. And now the cost has actually gotten worse. It has increased, because it started out at this value here and after one step it actually increased to this value here. Now the derivative at this new point says to decrease W, but when the learning rate is too big, you may take a huge step going from here all the way out here. So now you've gotten to this point here, and again, if the learning rate is too big, then you take another huge step and way overshoot the minimum again. So now you're at this point on the right, and one more time you do another update and end up all the way here, and so you're now at this point here. So as you may notice, you're actually getting further and further away from the minimum. So if the learning rate is too large, then gradient descent may overshoot and may never reach the minimum. Another way to say that is that gradient descent may fail to converge, and may even diverge.
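->The overshooting pattern is easy to reproduce. In this sketch the cost J(w) = w^2, the starting value 0.5, and the learning rate 1.1 are all assumptions chosen so that every step jumps past the minimum at w = 0 and lands further away on the other side.

```python
def cost(w):
    """Hypothetical bowl-shaped cost with its minimum at w = 0."""
    return w ** 2

alpha = 1.1      # deliberately too large for this cost function
w = 0.5          # start fairly close to the minimum
ws = [w]

for _ in range(5):
    w = w - alpha * 2 * w    # 2 * w is the derivative of w**2
    ws.append(w)

costs = [cost(v) for v in ws]
for step, (v, c) in enumerate(zip(ws, costs)):
    print(f"step {step}: w = {v:8.4f}  J(w) = {c:8.4f}")
```

The sign of w flips on every step while its magnitude grows, so the cost increases each time instead of decreasing.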
65c7fd9ca1f544de8acc8fc2dbeae61b.png
->So, here's another question you may be wondering about: what if your parameter W is already at this point here, so that your cost J is already at a local minimum? What do you think one step of gradient descent will do if you've already reached a minimum? Let's work through this together. Let's suppose you have some cost function J. The one you see here isn't a squared error cost function, and it has two local minima corresponding to the two valleys that you see here. Now let's suppose that after some number of steps of gradient descent, your parameter W is over here, say equal to five. And so this is the current value of W. This means that you're at this point on the cost function J, and that happens to be a local minimum. It turns out that if you draw a tangent to the function at this point, the slope of this line is zero, and thus the derivative term here is equal to zero for the current value of W. And so your gradient descent update becomes: W is updated to W minus the learning rate times zero, because the derivative term is equal to zero. This is the same as saying let's set W to be equal to W. So this means that if you're already at a local minimum, gradient descent leaves W unchanged, because it just updates the new value of W to be the exact same old value of W. So concretely, let's say the current value of W is five and alpha is 0.1. After one iteration, you update W as W minus alpha times zero, and it is still equal to five. So if your parameters have already brought you to a local minimum, then further gradient descent steps do absolutely nothing. They don't change the parameters, which is what you want, because it keeps the solution at that local minimum.
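The same arithmetic can be checked in a couple of lines. The cost J(w) = (w² - 25)² used below is a made-up stand-in for the curve on the slide (it is an assumption, chosen only because it has two valleys, at w = -5 and w = 5); the values w = 5 and alpha = 0.1 are the ones from the text.

```python
def dJ_dw(w):
    # Derivative of the assumed toy cost J(w) = (w**2 - 25)**2.
    return 4 * w * (w ** 2 - 25)

w, alpha = 5.0, 0.1
w_new = w - alpha * dJ_dw(w)    # the derivative is zero at the minimum, so nothing is subtracted
print(dJ_dw(w), w_new)
```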
b34ba37731e24fa48816971dc2c10df6.png
->This also explains why gradient descent can reach a local minimum even with a fixed learning rate alpha. Here's what I mean. To illustrate this, let's look at another example. Here's the cost function J of W that we want to minimize. Let's initialize gradient descent up here at this point. If we take one update step, maybe it will take us to that point. And because this derivative is pretty large, gradient descent takes a relatively big step. Now we're at this second point, where we take another step. And you may notice that the slope is not as steep as it was at the first point, so the derivative isn't as large, and so the next update step will not be as large as that first step. Now we're at this third point here, where the derivative is smaller than it was at the previous step, and we'll take an even smaller step. As we approach the minimum, the derivative gets closer and closer to zero. So as we run gradient descent, eventually we're taking very small steps until we finally reach a local minimum.
->So just to recap, as we get nearer to a local minimum, gradient descent will automatically take smaller steps. That's because as we approach the local minimum, the derivative automatically gets smaller, and that means the update steps also automatically get smaller, even if the learning rate alpha is kept at some fixed value. So that's the gradient descent algorithm. You can use it to try to minimize any cost function J, not just the mean squared error cost function that we're using for linear regression.
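A small sketch of this recap, again on an assumed toy cost J(w) = w² with assumed values w = 10 and alpha = 0.1. The learning rate never changes inside the loop, yet the size of each update shrinks because the derivative shrinks.

```python
w, alpha = 10.0, 0.1            # alpha stays fixed for every step
step_sizes = []
for _ in range(5):
    derivative = 2 * w          # slope of the toy cost J(w) = w**2 at the current w
    step = alpha * derivative
    w = w - step
    step_sizes.append(abs(step))

print(step_sizes)               # each entry is smaller than the one before it
```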
->In the next video, we're going to take the function J and set it back to be exactly the linear regression model's cost function, the mean squared error cost function that we came up with earlier. Putting together gradient descent with this cost function will give you your first learning algorithm, the linear regression algorithm.

5.Gradient descent for linear regression

099672d202204fb7a3c8344415d2526a.png

->We're going to pull it all together and use the squared error cost function for the linear regression model with gradient descent. This will allow us to train the linear regression model to fit a straight line to the training data. Here's the linear regression model. To the right is the squared error cost function. Below is the gradient descent algorithm. It turns out that if you calculate these derivatives, these are the terms you would get. The derivative with respect to w is this: 1 over m, times the sum over i equals 1 through m of the error term, that is, the difference between the predicted and the actual values, times the input feature x^(i). The derivative with respect to b is this formula over here, which looks the same as the equation above, except that it doesn't have that x^(i) term at the end. If you use these formulas to compute these two derivatives and implement gradient descent this way, it will work.
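Since the formulas are only shown on the slide image, here they are written out, reconstructed from the verbal description above (model, cost function, and the two derivative terms):

```latex
\begin{align*}
f_{w,b}(x) &= wx + b \\
J(w,b) &= \frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2 \\
\frac{\partial}{\partial w}J(w,b) &= \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)} \\
\frac{\partial}{\partial b}J(w,b) &= \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)
\end{align*}
```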
【Optional】
->(Now, you may be wondering, where did I get these formulas from? They're derived using calculus. If you want to see the full derivation, I'll quickly run through the derivation on the next slide. But if you don't remember or aren't interested in the calculus, don't worry about it. You can skip the materials on the next slide entirely and still be able to implement gradient descent and finish this class and everything will work just fine.)
b319f0fef1bf4090874bc0b65d797b3e.png

->In this slide, which is one of the most mathematical slides of the entire specialization, and again is completely optional, we'll show you how to calculate the derivative terms. Let's start with the first term, the derivative of the cost function J with respect to w. We'll start by plugging in the definition of the cost function J. J of w, b is this: 1 over 2m times the sum of the squared error terms. Now remember also that f of w, b of x^(i) is equal to this term over here, which is w times x^(i) plus b. What we would like to do is compute the derivative, also called the partial derivative, with respect to w of this equation right here on the right. If you've taken a calculus class before, and again it's totally fine if you haven't, you may know that by the rules of calculus, the derivative is equal to this term over here. The two here and the two here cancel out, leaving us with the equation that you saw on the previous slide. This is why we defined the cost function with the 1/2 earlier this week: it makes the partial derivative neater, because it cancels out the two that appears from computing the derivative. The other derivative, with respect to b, is quite similar. I can write it out like this, and once again plug in the definition of f of x^(i), giving this equation. By the rules of calculus, this is equal to this, where there's no x^(i) anymore at the end. The 2's cancel once more, and you end up with this expression for the derivative with respect to b. Now you have these two expressions for the derivatives. You can plug them into the gradient descent algorithm.
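The steps described above can be written out as follows. This is a reconstruction of the optional derivation from the spoken description (the slide itself is only available as an image), using the chain rule on each squared term:

```latex
\begin{align*}
\frac{\partial}{\partial w}J(w,b)
  &= \frac{\partial}{\partial w}\,\frac{1}{2m}\sum_{i=1}^{m}\left(wx^{(i)} + b - y^{(i)}\right)^2 \\
  &= \frac{1}{2m}\sum_{i=1}^{m} 2\left(wx^{(i)} + b - y^{(i)}\right)x^{(i)} \\
  &= \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)} \\[6pt]
\frac{\partial}{\partial b}J(w,b)
  &= \frac{\partial}{\partial b}\,\frac{1}{2m}\sum_{i=1}^{m}\left(wx^{(i)} + b - y^{(i)}\right)^2 \\
  &= \frac{1}{2m}\sum_{i=1}^{m} 2\left(wx^{(i)} + b - y^{(i)}\right) \\
  &= \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)
\end{align*}
```

The factor x^(i) appears in the first result because the inner term wx^(i) + b - y^(i) has derivative x^(i) with respect to w, while its derivative with respect to b is 1.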
【gradient descent algorithm for linear regression】
890f380490c44788b4f27149a0ae725c.png
->Here's the gradient descent algorithm for linear regression. You repeatedly carry out these updates to w and b until convergence. Remember that this f of x is a linear regression model, so it is equal to w times x plus b. This expression here is the derivative of the cost function with respect to w. This expression is the derivative of the cost function with respect to b. Just as a reminder, you want to update w and b simultaneously on each step.
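One way the algorithm above might look in code is sketched below. The function names `compute_gradient` and `gradient_descent`, the two-point toy dataset, the learning rate and the fixed iteration count (used in place of a real convergence check) are all assumptions for illustration, not the course's own implementation.

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Return (dJ/dw, dJ/db) for the squared error cost of f(x) = w*x + b."""
    error = (w * x + b) - y          # prediction minus target, one entry per example
    dj_dw = np.mean(error * x)       # (1/m) * sum(error * x)
    dj_db = np.mean(error)           # (1/m) * sum(error)
    return dj_dw, dj_db

def gradient_descent(x, y, w=0.0, b=0.0, alpha=0.1, num_iters=5000):
    for _ in range(num_iters):
        # Both derivatives are computed from the current (w, b) before either
        # parameter is changed, which is what makes the update simultaneous.
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Made-up toy data: sizes in 1000 sq ft, prices in $1000s, lying exactly on y = 200x + 100.
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])
w_fit, b_fit = gradient_descent(x_train, y_train)
print(w_fit, b_fit)
```

Because the toy points lie exactly on a line, the fitted w and b should end up very close to the slope and intercept of that line.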
【how gradient descent works】
4791fe805a82464697a970d9c22e0d10.png

->Now, let's get familiar with how gradient descent works. One issue we saw with gradient descent is that it can lead to a local minimum instead of a global minimum, where the global minimum means the point that has the lowest possible value of the cost function J among all possible points. You may recall this surface plot. This function has more than one local minimum. Remember, depending on where you initialize the parameters w and b, you can end up at different local minima. You can end up here, or you can end up here. But it turns out that when you're using a squared error cost function with linear regression, the cost function does not and will never have multiple local minima. It has a single global minimum because of this bowl shape. The technical term for this is that this cost function is a convex function. Informally, a convex function is a bowl-shaped function, and it cannot have any local minima other than the single global minimum. When you implement gradient descent on a convex function, one nice property is that so long as your learning rate is chosen appropriately, it will always converge to the global minimum. Congratulations, you now know how to implement gradient descent for linear regression.
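A quick way to see this property is to start gradient descent from two very different initial points and compare where they end up. The sketch below assumes a made-up three-point dataset, a helper named `fit`, and hand-picked values for alpha and the number of iterations.

```python
import numpy as np

def fit(x, y, w, b, alpha=0.1, num_iters=5000):
    """Batch gradient descent for f(x) = w*x + b, starting from the given (w, b)."""
    for _ in range(num_iters):
        error = w * x + b - y
        # The right-hand side is evaluated first, so w and b are updated simultaneously.
        w, b = w - alpha * np.mean(error * x), b - alpha * np.mean(error)
    return w, b

x = np.array([1.0, 2.0, 3.0])
y = np.array([250.0, 480.0, 730.0])   # made-up values, not from the course dataset

result_a = fit(x, y, w=0.0, b=0.0)
result_b = fit(x, y, w=-500.0, b=900.0)
print(result_a)
print(result_b)
```

Both runs should print essentially the same (w, b), which is what having a single global minimum means in practice.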

6.Running gradient descent

26ecc43bad704876989887da9fdf4482.png

->Let's see what happens when you run gradient descent for linear regression. Let's go see the algorithm in action. Here's a plot of the model and data on the upper left, a contour plot of the cost function on the upper right, and at the bottom is the surface plot of the same cost function. Often w and b will both be initialized to 0, but for this demonstration, let's initialize w = -0.1 and b = 900. So this corresponds to f(x) = -0.1x + 900. Now, if we take one step using gradient descent, we end up going from this point on the cost function out here to this point just down and to the right, and notice that the straight line fit has also changed a bit. Let's take another step. The cost has now moved to this third point, and again the function f(x) has also changed a bit. As you take more of these steps, the cost is decreasing at each update. So the parameters w and b are following this trajectory. And if you look on the left, you get this corresponding straight line fit that fits the data better and better until we've reached the global minimum. The global minimum corresponds to this straight line fit, which is a relatively good fit to the data. And so that's gradient descent, and we're going to use this to fit a model to the housing data. You can now use this f(x) model to predict the price of your client's house or anyone else's house. For instance, if your friend's house size is 1250 square feet, you can now read off the value and predict that maybe they could get, I don't know, $250,000 for the house.
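The claim that the cost decreases at each update can be checked by recording J after every step. In this sketch only the starting point w = -0.1, b = 900 comes from the text. The four data points, the rescaling of house size to thousands of square feet, the learning rate and the number of steps are assumptions, so the numbers it prints will not match the lecture's plots.

```python
import numpy as np

def cost(x, y, w, b):
    """Squared error cost J(w, b) = (1 / 2m) * sum((w*x + b - y)**2)."""
    return np.mean((w * x + b - y) ** 2) / 2

x = np.array([1.0, 1.5, 2.0, 2.5])          # made-up sizes, in 1000s of sq ft
y = np.array([300.0, 400.0, 500.0, 600.0])  # made-up prices, in $1000s

w, b, alpha = -0.1, 900.0, 0.1   # starting point from the text; alpha is an assumed value
history = [cost(x, y, w, b)]
for _ in range(100):
    error = w * x + b - y
    w, b = w - alpha * np.mean(error * x), b - alpha * np.mean(error)
    history.append(cost(x, y, w, b))

print(f"cost before training: {history[0]:.1f}, after 100 steps: {history[-1]:.1f}")
```

Plotting `history` against the step number would give the kind of cost-versus-iterations curve that the optional lab shows.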
b788b97d31f645c488a4dbbaf7d8e041.png
->To be more precise, this gradient descent process is called batch gradient descent. The term batch gradient descent refers to the fact that on every step of gradient descent, we're looking at all of the training examples, instead of just a subset of the training data. So in gradient descent, when computing the derivatives, we're computing the sum from i = 1 to m, and batch gradient descent is looking at the entire batch of training examples at each update. I know that batch gradient descent may not be the most intuitive name, but this is what people in the machine learning community call it. If you've heard of the newsletter The Batch, that's published by DeepLearning.AI, the newsletter The Batch was also named for this concept in machine learning. It turns out that there are other versions of gradient descent that do not look at the entire training set, but instead look at smaller subsets of the training data at each update step. But we'll use batch gradient descent for linear regression.
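The difference between "the entire batch" and "a subset" comes down to which examples enter the sum. A minimal sketch, assuming a made-up four-example dataset and a helper named `gradient`:

```python
import numpy as np

def gradient(x, y, w, b):
    """Derivatives of the squared error cost over whatever examples are passed in."""
    error = w * x + b - y
    return np.mean(error * x), np.mean(error)

x = np.array([1.0, 2.0, 3.0, 4.0])   # made-up data
y = np.array([3.0, 5.0, 7.0, 9.0])
w, b = 0.0, 0.0

# Batch gradient descent: the sum runs over all m = 4 training examples.
batch_grad = gradient(x, y, w, b)

# A subset-based variant would use only some of the examples for an update,
# here the first two, which generally gives a different derivative estimate.
subset_grad = gradient(x[:2], y[:2], w, b)

print(batch_grad, subset_grad)
```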
【About optional lab】
->In the optional lab that follows this video, you'll see a review of the gradient descent algorithm as well as how to implement it in code. You'll also see a plot that shows how the cost decreases as you continue training for more iterations. And you'll also see a contour plot showing how the cost gets closer to the global minimum as gradient descent finds better and better values for the parameters w and b. So remember that to do the optional lab, you just need to read and run this code. You won't need to write any code yourself, and I hope you take a few moments to do that and also become familiar with the gradient descent code, because this will help you to implement this and similar algorithms yourself in the future. In addition to the optional labs, if you haven't done so yet, I hope you also check out the practice quizzes, which are a nice way to double-check your own understanding of the concepts. It's also totally fine if you don't get them all right the first time, and you can take the quizzes multiple times until you get the score that you want. You now know how to implement linear regression with one variable, and that brings us to the close of this week.
->Next week, we'll learn to make linear regression much more powerful. Instead of one feature like the size of a house, you'll learn how to get it to work with lots of features. You'll also learn how to get it to fit nonlinear curves. These improvements will make the algorithm much more useful and valuable. Lastly, we'll also go over some practical tips that will really help for getting linear regression to work on practical applications.

7.Summary

  • standard notations:

——training set=The dataset that is used to train the model
——m=the total number of training examples
——x=input variable/feature
——y=output variable /target variable
——(x,y)=single training example 
——(x^(i), y^(i))=the i-th training example (tips: the superscript (i) is just an index into the training set, not exponentiation)
——f (also written h)=hypothesis/function/the model produced by the supervised learning algorithm (tips: the job of f is to take a new input x and output an estimate or prediction, which is called ŷ)
——ŷ=the estimate or the prediction for y/the output of the model
——w, b=the parameters of the model/coefficients/weights (tips: the parameters of the model are the variables you can adjust during training in order to improve the model)
——α=the learning rate
——cost function J: measures the squared error, that is, the difference between the model's estimated values ŷ and the target values y

(1)This part mainly covers the linear regression model, the gradient descent algorithm, and gradient descent for linear regression. The cost function used for the linear regression model is the squared error cost function, which indicates how well the model fits the training data: the smaller its value, the better the fit.

(2)The idea of gradient descent is to move in the direction of steepest descent at each step until a local minimum is reached. The learning rate α in the gradient descent update should be neither too small nor too large, and it does not need to be changed during gradient descent, because the steps shrink automatically as the derivative approaches zero.

(3)Because the squared error cost function for linear regression is convex, gradient descent with a suitably chosen learning rate always converges to one and the same solution, the global minimum. Each step of batch gradient descent needs to traverse all the samples in the training set.