WEBVTT

00:00.890 --> 00:08.810
Now, let us discuss the cost function. We have already discussed what cost is: the

00:08.930 --> 00:14.960
sum of all the errors which are present in the predictions which you have made.

00:15.380 --> 00:21.470
So the cost function is something which we want to minimize, because we want to minimize the error which

00:21.470 --> 00:22.550
is present in the model.

00:23.950 --> 00:29.900
For example, our cost function might be the sum of squared errors over the training set.

00:30.190 --> 00:34.920
So in this case, we will be reducing the sum of squared error.

00:34.930 --> 00:41.530
So we will have to reduce the error of the entire model which we have.

00:41.530 --> 00:48.440
Now, gradient descent is a method for finding the minimum of a function of multiple variables.

00:48.970 --> 00:55.720
So we want to find the minimum value of a function of multiple variables, so we can use gradient descent as a tool

00:55.720 --> 00:57.250
to minimize our cost function.

00:57.640 --> 00:58.930
Now, let us discuss

00:59.140 --> 01:01.640
why we really need gradient descent.

01:02.170 --> 01:10.450
So one way of solving the problem would be this: let's say we have this data, we have this line which

01:10.450 --> 01:15.040
we have to create, and we will have to find out the beta values for this.

01:16.150 --> 01:23.760
So one way would be that I can randomly select some beta values and keep on trying again and again,

01:23.770 --> 01:28.300
again and again until I find a very nice cost value.

01:28.480 --> 01:35.230
So I will put in a lot of beta values and I will compare the different models which I have created, and

01:35.230 --> 01:41.740
I will select the model which has the minimum cost value. Based on that minimum cost value,

01:41.740 --> 01:46.270
whatever beta values I have used, I will keep those values.

01:48.930 --> 01:56.790
So this will not be a feasible solution, because I can use it only if I have

01:56.790 --> 02:03.330
an endless amount of time and a lot of patience, and I can keep on trying a lot of values.

02:03.330 --> 02:11.430
And then, using those beta values, at the point when the cost stops changing and I don't really

02:11.550 --> 02:17.790
have to modify the values to reduce the cost any further, I will stop and I will select those

02:17.790 --> 02:18.540
beta values.

02:18.690 --> 02:20.720
But it is not a feasible solution.

02:20.880 --> 02:23.070
So that is why we need gradient descent.

02:23.280 --> 02:28.840
Now, gradient descent is a completely alien term for us and we don't really know what gradient descent

02:28.860 --> 02:29.070
is.

02:30.020 --> 02:33.710
So let us try to understand it and let us try to get in depth.

02:36.270 --> 02:36.810
So.

02:39.010 --> 02:49.600
When we are finding out the cost value, this G(w) is nothing but the cost of the entire function.

02:49.960 --> 02:56.690
So here w can be the weight, or beta in the simpler terms of linear regression.

02:56.720 --> 03:05.200
We can call it beta or w; we can say either beta or w, it is one and the same thing, and

03:05.200 --> 03:07.200
G is nothing but the cost function.

03:07.510 --> 03:13.150
So we can say G(w), or we can say C, or we can say F of

03:14.240 --> 03:14.840
x.

03:15.500 --> 03:17.720
So it can be called any of these things.

03:19.970 --> 03:29.060
So what will happen is we will randomly select a value for beta,

03:29.240 --> 03:32.920
and based on that value we will find out what the cost is.

03:33.980 --> 03:41.610
And at any point of time, this is the function which gives that cost.

03:41.810 --> 03:43.640
So this is the function.

03:43.640 --> 03:48.380
This is basically the cost or the error which will be

03:49.800 --> 03:51.640
present in a linear regression.

03:51.780 --> 03:57.820
So let's say we have this cost, we have this function and there can be multiple lines.

03:58.470 --> 04:06.360
So as I change the beta values, as I change the slope of the equation, the line will come closer to this

04:06.360 --> 04:07.170
particular line.

04:07.320 --> 04:13.530
So initially, I will be creating this line, which will be a lot different from the correct line. Then

04:13.530 --> 04:15.040
I will create another line.

04:15.420 --> 04:21.540
So when I create this line, the cost will reduce slightly, then I will create another line.

04:22.140 --> 04:25.770
Now the cost will reduce a lot more.

04:26.190 --> 04:32.520
Then, if I continue, at one point of time the cost will be very low.

04:32.580 --> 04:34.350
This is the minimum cost which I can reach.

04:35.640 --> 04:42.570
Then after that, once I just change the values again, then the cost will again increase.

04:43.540 --> 04:46.780
And after that, the cost will keep on increasing.

04:48.130 --> 04:54.880
So these are the different cost values based on the beta values which I can have. The same thing has been

04:54.880 --> 04:57.250
depicted in this particular curve.

04:57.490 --> 05:07.570
So as w changes, with the change in w, the cost will keep on

05:07.570 --> 05:08.090
changing.

05:08.410 --> 05:10.410
So the cost will keep on reducing

05:10.560 --> 05:16.340
initially, up to a point of time when the cost will achieve the minimum value.

05:16.480 --> 05:21.310
After that, when we again change w, the cost will start increasing.

05:22.880 --> 05:31.550
So we want to find out the optimal values and thus we want to stop at the point of time when this cost

05:31.550 --> 05:32.600
is minimum.

05:34.240 --> 05:40.220
So when we look at this particular curve, we can find out the slope of this curve.

05:40.540 --> 05:46.410
So what is the slope if we look at a very small section of this particular curve?

05:48.290 --> 05:58.520
This section of the curve is nothing but a small sloping line. Its vertical part will be depicted by the

05:58.520 --> 06:02.390
change in G, the change in the function.

06:04.110 --> 06:11.090
And this small section, this horizontal section, will be depicted by the change in beta.

06:11.400 --> 06:13.070
This is the change in w.

06:14.410 --> 06:15.760
So let's look further.

06:16.890 --> 06:18.930
So this change in

06:21.390 --> 06:30.690
the function and this change in w: when we divide the change in the function by the change in w, we get dG

06:30.690 --> 06:34.640
by dw. This is the equation of tan theta.

06:36.390 --> 06:43.410
And what is tan theta? Tan theta is nothing but the slope of this line, which is the change in

06:43.410 --> 06:48.690
the function with respect to the change in the w value.

06:51.180 --> 06:59.820
So hence, we can say that the change in the function with respect to w is equal to the slope of the

06:59.820 --> 07:00.190
line.
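
NOTE
Editor's sketch, not part of the lecture: the slope described above (change in the function
divided by change in w) can be estimated numerically. The cost curve (w - 3)**2 and the names
cost, slope and dw are assumptions chosen only for illustration.
```python
# Hypothetical one-variable cost curve, chosen only for illustration.
def cost(w):
    return (w - 3.0) ** 2
def slope(func, w, dw=1e-6):
    # Change in the function divided by the change in w (central difference).
    return (func(w + dw) - func(w - dw)) / (2.0 * dw)
left = slope(cost, 0.0)    # steep and negative, far to the left of the minimum
bottom = slope(cost, 3.0)  # approximately zero at the minimum
right = slope(cost, 5.0)   # positive once we have passed the minimum
print(left, bottom, right)
```
The sign of the slope tells us which way to move w, and its size shrinks as we approach the minimum.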

07:00.390 --> 07:06.200
So what we can say is, if we rearrange this equation, we will get this term

07:08.690 --> 07:11.390
multiplied by this term.

07:12.520 --> 07:17.700
This term is actually the slope of the cost function at any point of time.

07:18.930 --> 07:30.180
And this is the minute change in w, which means that I can find out the gradient at any point of

07:30.180 --> 07:38.010
time. I can find out the change in cost which I want at any point of time by multiplying the slope

07:39.450 --> 07:45.610
of the cost function at that point of time with the change in w.

07:47.320 --> 07:54.250
This is what the gradient is. So now when I see this particular function, I know that whenever I

07:54.250 --> 08:02.370
change w, the value of the cost will change accordingly.

08:03.130 --> 08:07.500
And as you see, the slope is slowly and gradually decreasing.

08:07.840 --> 08:15.190
So at first we see that the slope is very steep, the value of the slope is very large, but after that,

08:15.280 --> 08:17.320
the value of the slope is decreasing.

08:19.450 --> 08:26.470
The value of the slope keeps on decreasing as it reaches the minimum point of

08:27.530 --> 08:35.460
the cost. So at this point, the cost is minimum and the slope is also at its smallest, close to zero.

08:36.260 --> 08:43.040
So when we are changing the values of beta, when we are changing the values of the weight, there

08:43.040 --> 08:44.510
is one more scenario to consider.

08:46.280 --> 08:48.800
So if we change w,

08:49.610 --> 08:56.270
so if we are changing w, what will happen is that initially, when we are changing

08:56.270 --> 08:59.750
w, the slope will be higher.

09:00.860 --> 09:08.830
So we can make faster changes, we can change the value of w faster. So we will make

09:08.840 --> 09:11.420
changes to the values of w very fast.

09:11.670 --> 09:15.020
We will make the changes in the beta values very fast.

09:15.320 --> 09:22.220
But when we reach towards the minimum value, what will happen is if we are making the changes very

09:22.220 --> 09:24.830
fast, then it will go like this.

09:25.200 --> 09:27.410
There will be one change which will be very fast.

09:27.790 --> 09:29.210
Then there would be another change.

09:30.150 --> 09:31.940
Then there would be another change.

09:33.920 --> 09:41.150
Then the last change would actually overshoot this value and then we would try to get back and then

09:41.150 --> 09:46.010
we'd make the same large change again and then we will again overshoot the value.

09:46.310 --> 09:51.660
So what this really causes is that it will overshoot the value again and again.

09:52.130 --> 09:59.480
So what we will have to do is we will have to reduce the change which we are making as we reach the

09:59.570 --> 10:00.560
minimum value.

10:00.800 --> 10:05.970
So now if you make the changes very fast, then there is a chance of overshooting.

10:06.200 --> 10:08.260
Now, let us look at another scenario.

10:09.080 --> 10:12.960
The second scenario would be if we are making changes very slowly.

10:13.160 --> 10:21.170
So if you are making the changes very slowly, then what will happen is that the values will take a

10:21.170 --> 10:23.930
lot of time to reach the minimum value.

10:25.380 --> 10:34.740
So it will be so slow that it might not even reach the global minimum and it might take an infinite

10:34.740 --> 10:37.920
amount of time to reach the minimum value.

10:38.800 --> 10:45.910
So that is the reason why we will have to choose not a very high value or not a very small value, we

10:45.910 --> 10:52.600
have to make sure that the value which we are selecting is a moderate value so that we can reach the

10:52.600 --> 10:54.190
global minimum effectively.
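
NOTE
Editor's sketch, not part of the lecture: the two scenarios above (steps that are too fast and
steps that are too slow) can be reproduced on a toy cost curve. The cost G(w) = w**2, the starting
point 10, the 20 steps and the three step sizes are all assumptions chosen only for illustration.
```python
def descend(learning_rate, start=10.0, steps=20):
    # Hypothetical cost G(w) = w**2, whose slope is 2*w and whose minimum is at w = 0.
    w = start
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w  # step against the slope
    return w
too_fast = descend(1.1)    # each step jumps past the minimum and lands farther away
too_slow = descend(0.001)  # barely moves from the starting point in 20 steps
moderate = descend(0.3)    # ends very close to the minimum at 0
print(too_fast, too_slow, moderate)
```
Only the moderate step size ends near the minimum within the same number of steps.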

10:56.610 --> 10:57.870
So now let's go further.

10:59.950 --> 11:06.320
So let us try to formulate the equation. This is the equation which we will be having.

11:06.580 --> 11:14.580
So we are trying to generate a function of x. The function of x will have to be in the form beta naught

11:15.130 --> 11:19.130
plus beta one x1, plus beta two x2, and so on.

11:20.020 --> 11:25.810
So if we have this particular equation, this x1 value will be one of the input variables,

11:25.810 --> 11:33.400
x2 will be another input variable, and we are expecting the output of this to be y.

11:35.130 --> 11:37.320
We are expecting the output to be the

11:39.200 --> 11:47.090
y value, but the actual output, which is expected, is y, and the output which comes out is y hat.

11:49.270 --> 11:54.520
So the error value will be the expected output

11:56.150 --> 11:58.850
minus the actual output which we get

11:59.830 --> 12:01.800
after putting in the beta values.

12:02.800 --> 12:09.190
So what will be the error value? The error value will be y minus y hat. And y hat is

12:09.190 --> 12:16.180
nothing but beta naught plus beta one x1 plus beta two x2. Now, because beta naught is a constant value,

12:16.540 --> 12:18.250
can I multiply it by one?

12:19.030 --> 12:23.070
Even if I multiply it by one, the value will remain the same.

12:24.410 --> 12:26.120
So I can multiply it with

12:28.620 --> 12:29.400
x naught.

12:30.780 --> 12:38.370
So let me multiply it with x naught and let the value of x naught be one, so I will have an x naught here.

12:42.310 --> 12:44.650
And the value of x naught will be one.

12:45.590 --> 12:52.190
So I will get the value of the error as y minus, in brackets, beta naught x naught

12:52.280 --> 12:56.610
plus beta one x1 plus beta two x2. Beta naught

12:56.720 --> 13:05.750
is the constant value, x naught is one, and beta one and beta two are the slope values.

13:07.660 --> 13:15.670
So what will be the cost? As we discussed, the cost will be the sum of all the error values, so the

13:15.670 --> 13:22.930
cost can be described as the sum of the squares of all the error values, or the

13:25.280 --> 13:27.350
sum of all the absolute errors.

13:28.790 --> 13:30.860
So cost can be any of these two.

13:33.180 --> 13:34.170
So let us look further.

13:35.430 --> 13:36.240
So we can

13:37.560 --> 13:46.200
find the minimum cost value by finding out the beta values, because we already know the x values and we already

13:46.200 --> 13:47.260
know the y values.

13:47.490 --> 13:55.380
The only thing which we do not know here is the value of the betas, so we can find out the value of the cost

13:55.710 --> 14:00.420
by putting in different beta values, as the y and x values are constant.

14:01.860 --> 14:05.820
Hence, we will try to find the beta values such that the cost is minimum.

14:06.800 --> 14:07.910
So what do we get?

14:08.990 --> 14:14.480
This is the matrix X, which has the x naught value as one

14:15.820 --> 14:21.710
and the values of x1, x2, x3, x4 for all the rows of the data.

14:23.920 --> 14:31.960
Similarly, we will have the matrix Y, and Y will hold the value of y for all the rows of data.

14:32.980 --> 14:41.560
Remember, if we talk about the loan data, then the x values will be the amount of the loan, the number

14:41.560 --> 14:49.090
of dependents, the gender of the person, the number of legacy, the salary of the person, so the

14:49.090 --> 14:49.970
number of children.

14:49.990 --> 14:52.180
So these would be the values of X.

14:53.360 --> 15:00.950
And what will be my y? It will be the value which we are trying to predict, which is the interest rate

15:00.950 --> 15:01.190
here.

15:02.040 --> 15:02.380
Right.

15:02.580 --> 15:09.800
So this is what our data will look like, and the beta matrix is what we are trying to find out.

15:11.630 --> 15:16.780
So the equation which we will have here is the sum of the errors squared.

15:17.630 --> 15:18.780
So what will this be?

15:19.040 --> 15:22.000
It will be the sum of the errors squared, where the error

15:22.010 --> 15:25.120
is nothing but y minus beta naught,

15:26.710 --> 15:34.330
minus beta one x1, minus beta two x2. Taking the minus sign inside, because all the values were plus, all the values

15:34.330 --> 15:37.780
will convert to negative, and then we square all of this.

15:38.980 --> 15:39.970
This is the equation.

15:41.350 --> 15:44.260
It is just the same thing as we have written here.

15:45.240 --> 15:50.220
Just the same thing, we have just opened the bracket and put the minus sign inside.

15:54.220 --> 15:57.820
So this will be the value of the cost.
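
NOTE
Editor's sketch, not part of the lecture: the cost just described (the sum over all rows of
y minus beta naught x naught minus beta one x1 minus beta two x2, squared) written as code.
The function names and the three data rows are made up for illustration; x naught is fixed at 1.
```python
def predict(betas, row):
    # row holds [x0, x1, x2] with x0 fixed at 1, so betas[0] acts as the constant term.
    return sum(b * x for b, x in zip(betas, row))
def sum_squared_error(betas, X, y):
    # Sum over all rows of (y - y_hat) squared.
    return sum((target - predict(betas, row)) ** 2 for row, target in zip(X, y))
# Made-up data for illustration: y = 1 + 2*x1 + 3*x2 exactly.
X = [[1.0, 1.0, 2.0], [1.0, 2.0, 0.0], [1.0, 3.0, 1.0]]
y = [9.0, 5.0, 10.0]
exact_cost = sum_squared_error([1.0, 2.0, 3.0], X, y)  # perfect betas, cost 0
worse_cost = sum_squared_error([0.0, 2.0, 3.0], X, y)  # every prediction off by 1, cost 3
print(exact_cost, worse_cost)
```
Since X and y are fixed, the cost changes only when the beta values change.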

15:59.960 --> 16:08.420
So how will we update those values? When we are trying to update the weight, or when we are trying to

16:08.420 --> 16:19.010
update the beta values, we will change the weight to w minus the change

16:19.010 --> 16:23.300
in cost with respect to the change in w, that is, the

16:25.830 --> 16:29.250
change in cost with respect to the change in

16:30.430 --> 16:33.640
the w or the beta values.

16:36.750 --> 16:41.280
So the update rule is: new w is equal to w minus

16:42.300 --> 16:52.090
a learning rate times the change in cost with respect to w. Now, what is this learning rate?

16:52.110 --> 16:52.730
What is this?

16:53.580 --> 16:58.430
This is nothing but how fast we want to go down this slope.

16:58.860 --> 17:06.660
Remember, we just discussed that if we go very fast, then we will overshoot, and if we go very

17:06.660 --> 17:07.100
slow,

17:07.440 --> 17:11.940
then we might not even reach this point in time.

17:12.300 --> 17:15.860
So that is the reason why the value of the learning rate is very important.

17:16.080 --> 17:19.350
So we will have to decide what the value of the learning rate should be.
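
NOTE
Editor's sketch, not part of the lecture: the update rule (new value equals old value minus the
learning rate times the change in cost with respect to that value) applied to a one-variable
linear regression with the sum of squared errors as the cost. The data points, the learning
rate 0.01 and the 2000 steps are assumptions chosen only for illustration.
```python
def gradient_descent(xs, ys, learning_rate=0.01, steps=2000):
    b0, b1 = 0.0, 0.0  # arbitrary starting values for the intercept and slope
    for _ in range(steps):
        errors = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
        # Derivatives of the sum of squared errors with respect to b0 and b1.
        grad_b0 = -2.0 * sum(errors)
        grad_b1 = -2.0 * sum(e * x for e, x in zip(errors, xs))
        # Update rule: new value = old value - learning rate * gradient.
        b0 = b0 - learning_rate * grad_b0
        b1 = b1 - learning_rate * grad_b1
    return b0, b1
# Made-up points lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = gradient_descent(xs, ys)
print(round(b0, 3), round(b1, 3))
```
With this data the loop recovers an intercept near 1 and a slope near 2; a much larger learning
rate would overshoot instead of settling.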

17:22.290 --> 17:27.510
Going further, now, while we are training on the data,

17:28.790 --> 17:35.440
there could be three different scenarios possible. You remember this kind of plot which we have created:

17:35.450 --> 17:45.650
these were the data points which we had, and we fitted a line to them to try to predict these values

17:45.650 --> 17:47.670
based on the line which we have created.

17:48.320 --> 17:53.840
Now, when we are assigning this line, when we are creating this line, this equation of the line,

17:54.200 --> 18:03.290
then this line might not be able to predict these values very properly, because if you see the function

18:03.650 --> 18:07.220
is not really a linear function.

18:08.550 --> 18:15.660
This would have been predicted properly by a line like this if the data points had been

18:15.660 --> 18:16.710
something like this.

18:26.210 --> 18:29.270
And then we would have created a line like this.

18:30.350 --> 18:37.100
Then we could have said that this is a perfect line which we have created for the model, but right

18:37.100 --> 18:40.500
now the data points show a completely different picture.

18:40.670 --> 18:43.930
So this scenario is called underfitting.

18:44.340 --> 18:52.370
When we are not able to predict the values properly, we are not able to learn from all the data points

18:52.370 --> 18:59.150
because the line which we have fitted is very simple in nature, because the function which we have applied

18:59.150 --> 19:00.750
is very simple in nature.

19:00.920 --> 19:05.690
So it is not able to capture the patterns present in the data.

19:08.070 --> 19:16.370
On the other side, here we have the same data, and we have fitted a curve to it. Instead of

19:16.680 --> 19:20.830
a straight line, we have a curved line here.

19:21.030 --> 19:28.300
So when we go through this data, this is able to understand the patterns of the data.

19:29.530 --> 19:36.670
It will know that when the value of X is very low, you have to have some Y value and the value keeps

19:36.670 --> 19:43.660
on decreasing till one point the value of X, and after that, again, the value of Y starts increasing.

19:44.500 --> 19:46.750
But the straight line does not understand this pattern.

19:48.500 --> 19:55.490
And on the other hand, this particular function which we have created is just right for this type

19:55.490 --> 19:55.950
of data.

19:57.080 --> 20:03.590
And on the other hand, when we are fitting a very complex line, a very complex function, to this

20:03.590 --> 20:08.360
data, it is trying to learn from all the data points.

20:08.690 --> 20:11.300
It is trying to learn from all the data points.

20:11.480 --> 20:18.680
So what happens is that when a problem comes and when we are testing the data and when we are actually

20:18.680 --> 20:25.520
using the model in real life, then what happens is if we have a data point here, the model will try

20:25.520 --> 20:30.900
to predict this particular value, which will be very different from the actual value.

20:31.730 --> 20:39.590
But if we use this model and we have the same data point, so the error will be less in comparison to

20:39.980 --> 20:40.550
this one.

20:43.310 --> 20:45.620
And again, if we have a data point here.

20:46.960 --> 20:49.150
Then this might not be.

20:50.280 --> 20:51.370
Predicted properly.

20:51.660 --> 20:55.380
But if we have data point here, then the error will be less.

20:56.040 --> 21:04.080
And if we have a data point here, then again, the error will be a little larger, but it will also

21:04.080 --> 21:04.950
be kind of fine.

21:05.220 --> 21:09.020
So what is the difference between underfitting and overfitting?

21:09.300 --> 21:15.720
Underfitting is when the model does not really learn enough from the data which is present.

21:17.850 --> 21:26.670
And overfitting is when the model tries to learn a lot from the given data. It tries to find out very,

21:26.670 --> 21:32.160
very, very complex patterns from the data so that it creates a complex function.

21:32.460 --> 21:39.780
And because of that complex function, it is not able to predict properly on the testing data,

21:40.830 --> 21:47.420
on the live data, on data different from the training data. It will perform very well on the training

21:47.830 --> 21:52.530
set, but it will not perform well on the real-time or testing data.

21:54.430 --> 22:02.770
So that is the reason why we want to find out a line or we want to find out the function which is able

22:02.770 --> 22:11.410
to predict values and is general in nature, a generalized model and not a very complex

22:11.410 --> 22:11.790
model.

22:12.460 --> 22:14.920
That is why we are looking for a simple model.

22:16.330 --> 22:17.600
A simpler model.

22:18.460 --> 22:18.860
OK.

22:20.470 --> 22:27.430
So this is an example of overfitting, underfitting, and a model that is just right.

22:29.390 --> 22:33.860
Now, let us see how we will actually train the model.

22:36.340 --> 22:38.080
So for training the model,

22:39.360 --> 22:40.440
what we do is

22:41.560 --> 22:45.880
we will divide the data which we have into two parts.

22:47.490 --> 22:52.470
One is the training set and the other part is the testing set.

22:53.730 --> 23:01.140
So what will happen is we will train our model on this training set, so that the model will not see this

23:01.140 --> 23:04.380
particular part, the test set of the data.

23:04.950 --> 23:11.100
So what will happen is, when the model has not seen the test data, we will train the model again and

23:11.100 --> 23:13.860
again and very properly on this training set.

23:13.860 --> 23:19.890
And when we are satisfied with that model, then we will check if the model is performing well on the

23:19.960 --> 23:27.900
test set. If the model performs well on both the training set and the test set, then we will say that

23:27.900 --> 23:35.520
the model is trained properly. If the model performs poorly on the training set,

23:36.390 --> 23:38.520
that means that it is underfitted.

23:40.320 --> 23:44.880
You can see the model will not perform well even on the training data here.

23:46.610 --> 23:48.450
That is why it is called underfitting.

23:49.540 --> 23:53.620
And if the model performs very well on the training data

23:55.270 --> 23:58.350
but does not perform well on the testing data,

23:59.550 --> 24:01.700
then it is going to be overfitting.

24:03.470 --> 24:10.580
And when the model shows the same performance on the training data and testing data, a good performance

24:10.580 --> 24:15.290
on both training and testing data, then we can say that the model is just right.

24:22.950 --> 24:26.040
Now, let us look at another method.

24:27.450 --> 24:34.980
So this method which we have seen, where we divide the data into training and testing sets,

24:34.980 --> 24:36.900
is called the hold-out method.

24:38.820 --> 24:46.620
So the hold-out method is the simplest kind of method. The dataset is separated into two sets, that is, the training

24:46.620 --> 24:47.000
set

24:48.160 --> 24:49.540
and the testing set.

24:50.780 --> 24:57.860
And the model is trained using the training set only, and after training, the model is checked and

24:57.860 --> 24:59.470
validated on the testing set.
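
NOTE
Editor's sketch, not part of the lecture: one way the hold-out split could be written. The
function name hold_out_split, the 20 percent test fraction and the fixed seed are assumptions
chosen only for illustration; the lecture does not state a split ratio.
```python
import random
def hold_out_split(rows, test_fraction=0.2, seed=0):
    # Shuffle a copy so the caller's list is left untouched, then cut it in two.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, testing set)
rows = list(range(10))  # stand-in for ten rows of data
train_set, test_set = hold_out_split(rows)
print(len(train_set), len(test_set))
```
The model would be fitted on train_set only and then scored once on test_set.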

25:00.500 --> 25:08.210
Now, what are the disadvantages of this? The evaluation, when we are doing the evaluation on the testing set,

25:08.690 --> 25:15.830
may depend heavily on which data points end up in the training set and which end up in the test set.

25:16.220 --> 25:25.040
Now, if we have divided the data randomly and somehow the data patterns which we want the model

25:25.050 --> 25:31.680
to learn end up in the testing data, then our model will not really learn much from the training dataset.

25:32.090 --> 25:38.420
So we want to make sure that all the patterns which we have should be present in the training

25:38.420 --> 25:41.630
data also, so that the model can find them and learn from the training data.

25:43.620 --> 25:50.070
Thus the evaluation may be significantly different, depending on how the division is made.

25:51.380 --> 25:52.840
So what do we do here?

25:54.650 --> 26:01.290
To prevent this, we use the cross-validation method. And what is the cross-validation method?

26:01.640 --> 26:09.080
Basically, instead of dividing the data into training data and testing data, we divide

26:09.080 --> 26:10.760
the data into different folds.

26:11.210 --> 26:16.780
Now, the number of folds could be any number: five, seven, ten, 15.

26:16.940 --> 26:18.680
It is completely up to you.

26:20.050 --> 26:32.380
And based on these folds, we allow each and every fold to be the testing data in one iteration.

26:32.770 --> 26:41.170
So in the first iteration, the first split, the first fold, will become the testing data and all the other

26:41.170 --> 26:45.140
folds will become the training data. In the next iteration,

26:45.610 --> 26:52.590
this one, fold two, will become the testing data and all the rest will become the training data.

26:53.140 --> 26:57.400
Then in the next iteration, this one will become the

26:59.650 --> 27:03.610
testing data and all the rest will be the training data.

27:03.790 --> 27:12.850
So once we have done all of this, we can take an average of all of these and

27:12.850 --> 27:16.210
find out what the performance of the model is.

27:16.720 --> 27:21.480
This way, the model will be able to learn from all of the folds.

27:23.890 --> 27:26.950
So it is the same thing here: we have three folds.

27:27.220 --> 27:34.470
So the first time we make the blue one the testing set, next time we make the green one the testing set,

27:34.750 --> 27:40.500
then the third time we make the red one the testing set, and we will use them accordingly.
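
NOTE
Editor's sketch, not part of the lecture: the fold rotation described above, where each fold is
the testing set exactly once and the scores are averaged. The names k_fold_indices,
cross_validate and score_fold are hypothetical, and the scorer used here is a dummy; a real one
would train a model on the train rows and measure its error on the test rows.
```python
def k_fold_indices(n_rows, k):
    # Yield (train_indices, test_indices) once per fold; each row is tested exactly once.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, test
        start += size
def cross_validate(n_rows, k, score_fold):
    # score_fold is a caller-supplied function: train on one part, score on the other.
    scores = [score_fold(train, test) for train, test in k_fold_indices(n_rows, k)]
    return sum(scores) / len(scores)
folds = list(k_fold_indices(9, 3))
# Dummy scorer for illustration only: it just returns the size of the test fold.
average = cross_validate(9, 3, lambda train, test: float(len(test)))
print(folds[0], average)
```
With nine rows and three folds, the first iteration tests on rows 0 to 2 and trains on rows 3 to 8.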

27:41.480 --> 27:43.400
So this is what.

27:44.830 --> 27:52.300
cross-validation is. So now we have learned about linear regression and how we will perform cross-validation

27:52.300 --> 28:01.540
on that. We will look at the code in the next session.