1
00:00:00,110 --> 00:00:07,730
We've just looked at error sanctioning, and that has allowed us to see how this model A, with

2
00:00:07,730 --> 00:00:14,600
parameters m and c, differs from this other model B, with its own parameters m and c.

3
00:00:14,630 --> 00:00:22,880
Let's say, for example, that for this line y, m is zero and c is three.

4
00:00:22,880 --> 00:00:28,460
And then this other m could be two, and c could be, let's say, four.

5
00:00:28,460 --> 00:00:37,250
With model B having a higher loss, or a higher error, compared to model A.

6
00:00:37,580 --> 00:00:42,020
Now the whole idea of training and optimization is simple.

7
00:00:42,020 --> 00:00:47,090
What we want to do is take a model which has been initialized.

8
00:00:47,090 --> 00:00:52,100
Let's take a model whose parameters have been initialized to some random values, like for example,

9
00:00:52,100 --> 00:01:00,290
zero and three, or even zero and zero, or whatever random values, and then update those parameters such that

10
00:01:00,290 --> 00:01:05,360
we get a better-performing model, like this one here with parameters two and four.

11
00:01:05,360 --> 00:01:12,500
So let's suppose that we randomly initialize our model to have m equal to zero and c equal to

12
00:01:12,500 --> 00:01:13,070
three.

13
00:01:13,100 --> 00:01:19,940
Then we want to update these such that zero tends to something close to two, and three tends to something

14
00:01:19,940 --> 00:01:20,810
close to four.

15
00:01:20,810 --> 00:01:29,540
Now, to carry out this transformation, we shall make use of an algorithm known as stochastic

16
00:01:29,540 --> 00:01:31,790
gradient descent.

17
00:01:31,790 --> 00:01:34,220
This algorithm is quite simple.

18
00:01:34,220 --> 00:01:36,410
Suppose we have a weight.

19
00:01:36,530 --> 00:01:38,570
This weight could be m or c.

20
00:01:38,570 --> 00:01:41,180
So we have a parameter.

21
00:01:41,450 --> 00:01:43,430
Let's call this parameter w.

22
00:01:43,430 --> 00:01:45,800
Or rather, let's say this is its initial value.

23
00:01:46,670 --> 00:01:48,200
We are going to subtract.

24
00:01:48,200 --> 00:01:52,550
We're going to take this weight minus the learning rate

25
00:01:53,660 --> 00:01:58,430
times the partial derivative of the loss.

26
00:01:58,460 --> 00:02:05,210
This loss is the same error we computed earlier, and the derivative is taken with respect to that particular weight.

27
00:02:06,060 --> 00:02:07,950
So let's have this.

28
00:02:07,950 --> 00:02:10,320
This is minus all this here.

29
00:02:10,560 --> 00:02:13,440
And here we add this subscript i.

30
00:02:13,470 --> 00:02:17,700
So this initial weight value is here.

31
00:02:17,700 --> 00:02:24,090
So we have w_i minus the learning rate times the partial derivative of L, the loss, with respect to w_i.

32
00:02:24,090 --> 00:02:29,610
And then this now gives us w_f.

33
00:02:29,610 --> 00:02:31,080
That's this final value.

34
00:02:31,080 --> 00:02:34,800
So let's pick out some values.

35
00:02:34,800 --> 00:02:42,990
Let's suppose that our w is initially zero, minus, let's say, a learning rate of 0.1

36
00:02:43,780 --> 00:02:44,860
times

37
00:02:45,310 --> 00:02:53,950
a partial derivative of, let's say, -20. That now gives us a w_f of two.

38
00:02:54,070 --> 00:02:59,530
And that's how we could move from this initial w of zero to this two.
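As a rough illustration of the single update just described, here is a minimal Python sketch. The helper name sgd_step is made up for this example; the numbers are the ones from the walkthrough.
```python
def sgd_step(w, learning_rate, grad):
    """Return the updated weight: w_f = w_i - learning_rate * dL/dw."""
    return w - learning_rate * grad
# Values from the walkthrough: w_i = 0, learning rate = 0.1, dL/dw = -20.
w_f = sgd_step(0.0, 0.1, -20.0)
print(w_f)  # approximately 2.0
```
Subtracting a negative slope increases the weight, which is why w moves up from zero.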

39
00:02:59,560 --> 00:03:04,420
But it should be noted that in practice we don't generally just move from a very poorly performing

40
00:03:04,420 --> 00:03:07,900
model to one that is very well performing in a single step.

41
00:03:07,900 --> 00:03:12,100
So we are going to go through this same algorithm in many steps.

42
00:03:12,100 --> 00:03:14,230
So let's get back.

43
00:03:14,230 --> 00:03:19,870
We could start with a partial derivative of, let's say, negative five.

44
00:03:19,870 --> 00:03:23,680
So here we have 0.1 times five.

45
00:03:24,040 --> 00:03:28,150
0.1 times five is actually one divided by two.

46
00:03:28,180 --> 00:03:29,740
That's 0.5.

47
00:03:29,740 --> 00:03:33,370
So we could go from 0 to 0.5.

48
00:03:33,820 --> 00:03:36,250
You see we go from 0 to 0.5.

49
00:03:36,250 --> 00:03:41,530
And then we take our next step, because now our w has become 0.5.

50
00:03:41,530 --> 00:03:42,130
That's from zero.

51
00:03:42,160 --> 00:03:43,690
We went from 0 to 0.5.

52
00:03:43,690 --> 00:03:46,630
We will replace this zero here with 0.5.

53
00:03:46,630 --> 00:03:51,280
So instead of having zero we would have now 0.5.

54
00:03:51,310 --> 00:03:53,260
The learning rate is still the same.

55
00:03:53,380 --> 00:03:55,690
Although it could change as we keep training.

56
00:03:55,690 --> 00:03:59,140
And then for this partial derivative, we could say, for example, that it is ten.

57
00:03:59,140 --> 00:04:02,080
So this is 0.5.

58
00:04:02,080 --> 00:04:07,570
Let's take this off here and then replace this with a negative ten.

59
00:04:07,570 --> 00:04:09,820
So here we have negative ten.

60
00:04:11,200 --> 00:04:12,640
Now you see we have one.

61
00:04:12,640 --> 00:04:17,290
So we have 0.5 minus or rather 0.5 plus one.

62
00:04:17,290 --> 00:04:21,460
So 0.5 plus one gives us 1.5.

63
00:04:21,460 --> 00:04:23,290
So let's take this off again.

64
00:04:23,290 --> 00:04:27,340
So instead of 0.5 now we have 1.5.

65
00:04:27,940 --> 00:04:28,600
Take this off.

66
00:04:28,600 --> 00:04:30,370
And we have now 1.5.

67
00:04:30,370 --> 00:04:32,110
So you see how we go from zero.

68
00:04:32,110 --> 00:04:34,600
We go from 0 to 0.5.

69
00:04:34,600 --> 00:04:36,610
And then to 1.5.
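The repeated updates above can be sketched as a small loop. This is only an illustration: the two slopes (-5 and -10) are the hypothetical values used in the walkthrough, not gradients computed from real data.
```python
learning_rate = 0.1
# Hypothetical slopes dL/dw for two successive steps, as in the walkthrough.
grads = [-5.0, -10.0]
w = 0.0
trajectory = [w]
for grad in grads:
    w = w - learning_rate * grad  # same update rule, applied repeatedly
    trajectory.append(w)
print(trajectory)  # roughly [0.0, 0.5, 1.5]
```
Each new step starts from the weight produced by the previous one.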

70
00:04:36,610 --> 00:04:43,720
And the reason why we would expect to get better values of w, where better in this case means values of w

71
00:04:43,720 --> 00:04:52,150
such that the straight line would be close enough to this point, as compared to this other poorly

72
00:04:52,150 --> 00:04:53,320
performing model.

73
00:04:53,320 --> 00:04:57,970
And so, as we were saying, the reason why we are sure that we are going to keep moving towards

74
00:04:57,970 --> 00:05:05,830
these better-performing values of w is that we are including the loss in this algorithm.

75
00:05:05,830 --> 00:05:12,820
So when we calculate our loss, we look for this partial derivative with respect to the weights.

76
00:05:12,820 --> 00:05:18,280
And it is this partial derivative that permits us to adjust the weights.

77
00:05:18,310 --> 00:05:25,210
Now this learning rate controls by how much we are going to modify the weights.

78
00:05:25,210 --> 00:05:30,580
So if we had a large learning rate here, or let's say if we had a larger learning rate, not large

79
00:05:30,580 --> 00:05:32,320
because learning rates are relative.

80
00:05:32,320 --> 00:05:36,910
So suppose that, instead of 0.1, we had a larger learning rate of one.

81
00:05:36,910 --> 00:05:38,200
So let's take this off.

82
00:05:38,590 --> 00:05:43,840
If we had one, you would find that right from the first step we would have five.

83
00:05:43,840 --> 00:05:45,970
So we would just jump from 0 to 5.

84
00:05:45,970 --> 00:05:50,860
So this controls by how much we adjust our model at each step.

85
00:05:50,860 --> 00:05:54,460
Now, if you have no background in calculus, you shouldn't worry.

86
00:05:54,460 --> 00:06:00,790
With TensorFlow, all we need to do to make use of this algorithm is specify that our

87
00:06:00,790 --> 00:06:02,320
optimizer is SGD.

88
00:06:02,320 --> 00:06:05,440
So you just need to put in the string and you're good to go.

89
00:06:05,440 --> 00:06:08,170
You don't really need to write all this code.

90
00:06:08,170 --> 00:06:17,560
So, getting back here again, let's plot the loss against the weights.

91
00:06:17,560 --> 00:06:19,690
So we have a plot of loss against weights.

92
00:06:19,690 --> 00:06:26,680
Now our loss here is actually something like y minus y prime,

93
00:06:26,680 --> 00:06:28,600
all of that squared. True,

94
00:06:28,600 --> 00:06:30,280
We have a sum but we're going to ignore that.

95
00:06:30,280 --> 00:06:32,800
So it's going to look something like this.

96
00:06:33,160 --> 00:06:34,750
You see this quadratic function.

97
00:06:34,750 --> 00:06:40,150
And even if you have no background in calculus, you should just understand that when we talk about,

98
00:06:40,150 --> 00:06:47,500
a partial derivative, or a derivative in general, and we have a curve, that has to do with the

99
00:06:47,500 --> 00:06:48,850
slope of that curve.

100
00:06:48,850 --> 00:06:55,360
So if at this point you have w and you have the loss, then at this point

101
00:06:55,360 --> 00:06:58,960
you have a slope which looks like this.

102
00:06:58,990 --> 00:07:03,550
If you draw a tangent and try to calculate the slope,

103
00:07:03,550 --> 00:07:08,020
You find that here the slope is going to be higher than if we draw a tangent at this point.

104
00:07:08,020 --> 00:07:14,500
If you look at this point here and you draw this tangent, well, let's get back and try to draw a tangent.

105
00:07:14,680 --> 00:07:16,840
And we have a slope.

106
00:07:16,840 --> 00:07:19,750
The slope is smaller as compared to this.

107
00:07:19,750 --> 00:07:26,920
So let's call this slope a and let's call this slope b.

108
00:07:27,100 --> 00:07:30,550
So slope a is greater than slope b.

109
00:07:30,670 --> 00:07:37,990
And so what this means is the partial derivative of the loss with respect to the weight at this point

110
00:07:37,990 --> 00:07:42,400
is greater than the partial derivative of the loss with respect to the weight at this other point.

111
00:07:42,400 --> 00:07:47,950
Now let's see how this plays out when it comes to updating the weights. Remember that the

112
00:07:47,950 --> 00:07:51,310
ideal situation is that our loss should be zero.

113
00:07:51,310 --> 00:07:53,500
So what we want to do here is minimize the loss.

114
00:07:53,500 --> 00:08:00,460
So if we get to a point like this, then that's fine because we've gotten to the smallest possible

115
00:08:00,460 --> 00:08:01,870
value of our loss.

116
00:08:01,870 --> 00:08:09,520
And what this means is you may even have two and four as m and c, as the

117
00:08:09,520 --> 00:08:15,160
parameters, but the loss will maybe be somewhere around here.

118
00:08:15,160 --> 00:08:18,310
And so there might still be room for improvement.

119
00:08:18,310 --> 00:08:24,130
And so what happens is we are going to take this weight.

120
00:08:24,130 --> 00:08:31,480
You see, we have the weight, and we subtract the learning rate times this slope, actually.

121
00:08:31,480 --> 00:08:35,440
So when we say partial derivative of the loss with respect to the weight, what we actually are doing

122
00:08:35,440 --> 00:08:38,380
here is we're subtracting that slope.

123
00:08:39,170 --> 00:08:46,730
And so imagine we have this weight here, and that this

124
00:08:46,730 --> 00:08:49,550
weight leads us to this point here.

125
00:08:49,880 --> 00:08:50,570
There we go.

126
00:08:50,570 --> 00:08:56,210
We have this point here where our loss is quite high, which is not what we desire.

127
00:08:56,240 --> 00:09:00,170
Then you would find that the slope is much larger.

128
00:09:00,170 --> 00:09:06,260
Just like with this slope here, the slope is much larger as compared to if we have a weight value around

129
00:09:06,260 --> 00:09:13,040
here where we have a smaller slope. And because the slope is much larger, you would find that

130
00:09:13,040 --> 00:09:19,490
multiplying this larger slope by the learning rate would lead to a larger change in the weight.

131
00:09:19,490 --> 00:09:26,990
So this larger change in the weight will tend to make this weight value

132
00:09:26,990 --> 00:09:29,000
move to the left.

133
00:09:29,680 --> 00:09:34,270
Because now we're taking the weight minus a large value.

134
00:09:34,270 --> 00:09:37,270
And so we'll tend to move now to the left.

135
00:09:37,270 --> 00:09:41,350
Now if we come from this other direction, that is, if we have a weight

136
00:09:41,350 --> 00:09:49,090
value on this other side, then, taking w minus this learning rate times the slope, because

137
00:09:49,090 --> 00:09:51,340
the slope is now negative, we have a negative slope.

138
00:09:51,340 --> 00:09:53,830
We'll have w minus a negative.

139
00:09:53,830 --> 00:09:57,880
That's w plus the learning rate times the absolute value of the slope.

140
00:09:57,880 --> 00:10:00,460
And so now we'll be moving towards the right.

141
00:10:00,460 --> 00:10:08,020
And so you see how graphically the stochastic gradient descent algorithm permits us to move the weight

142
00:10:08,020 --> 00:10:14,500
towards values which will permit us to have a smaller loss.

143
00:10:14,560 --> 00:10:21,310
Another interesting point to note is the fact that as our weight approaches these values where

144
00:10:21,310 --> 00:10:26,020
the corresponding loss is very small, the slope also becomes very small.

145
00:10:26,020 --> 00:10:29,170
And so the change in weight isn't that large.

146
00:10:29,170 --> 00:10:34,480
So if we get to a point like this here, let's say we get to

147
00:10:34,480 --> 00:10:40,540
this point here, which is very close to the optimal weight value, where we're going to have the smallest

148
00:10:40,540 --> 00:10:45,910
possible loss, then this slope here is going to be very small.

149
00:10:45,910 --> 00:10:50,830
And so when you do w minus learning rate times that very small slope, you will find that the value

150
00:10:50,830 --> 00:10:54,100
of w will change just a little bit.

151
00:10:54,100 --> 00:11:01,720
And so what this tells us is when we are far away from that optimal weight value, the slope is large.

152
00:11:01,720 --> 00:11:05,800
And then when we approach that optimal weight value, the slope is small.

153
00:11:05,800 --> 00:11:12,520
And so what happens generally during training is, as we start training, the

154
00:11:12,520 --> 00:11:14,620
loss value drops much faster.

155
00:11:14,620 --> 00:11:21,880
And as we keep getting much better weight values, this loss value continues dropping but much more

156
00:11:21,880 --> 00:11:22,570
slowly.

157
00:11:22,570 --> 00:11:26,110
And this is what we call model convergence.
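To make this convergence behaviour concrete, here is a small sketch in plain Python. It assumes a made-up one-weight loss L(w) = (w - 2)**2, whose minimum sits at w = 2; the real model's loss is a sum over data points, so this is only an illustration of the shape of the process.
```python
def loss(w, target=2.0):
    """Toy quadratic loss with its minimum at w = target."""
    return (w - target) ** 2
def grad(w, target=2.0):
    """Slope dL/dw of the toy loss at w."""
    return 2.0 * (w - target)
w, learning_rate = 0.0, 0.1
losses, step_sizes = [], []
for _ in range(10):
    step = learning_rate * grad(w)  # large far from the minimum, small near it
    w -= step
    step_sizes.append(abs(step))
    losses.append(loss(w))
for i, (s, l) in enumerate(zip(step_sizes, losses), start=1):
    print(f"step {i:2d}: |change in w| = {s:.4f}, loss = {l:.4f}")
```
Both the size of the change in w and the loss shrink at every step, fast at first and then more slowly.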

158
00:11:26,110 --> 00:11:32,620
Now the choice of this learning rate is very important because if our learning rate is too big, then,

159
00:11:32,740 --> 00:11:40,390
multiplying it by a large slope will take this weight over to this other side, which is not

160
00:11:40,420 --> 00:11:41,110
what we want.

161
00:11:41,110 --> 00:11:47,710
And if it's too small, then even if we have a large slope, the change in the weight is going

162
00:11:47,710 --> 00:11:49,060
to be too small as well.

163
00:11:49,060 --> 00:11:51,460
And so the learning process is going to be too slow.

164
00:11:51,460 --> 00:11:57,880
So we want to choose an adequate value for the learning rate so that our training goes fast and

165
00:11:57,880 --> 00:11:58,240
smoothly.
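The effect of the learning rate can be sketched on the same kind of toy loss, L(w) = (w - 2)**2. The three learning rates below are hypothetical values picked for illustration, and final_loss is a made-up helper name.
```python
def final_loss(learning_rate, steps=20, w=0.0, target=2.0):
    """Run gradient descent on L(w) = (w - target)**2 and return the last loss."""
    for _ in range(steps):
        slope = 2.0 * (w - target)      # dL/dw at the current weight
        w = w - learning_rate * slope   # the usual update rule
    return (w - target) ** 2
# Three hypothetical learning rates; the initial loss is (0 - 2)**2 = 4.
results = {name: final_loss(lr) for name, lr in
           [("too big", 1.1), ("too small", 0.001), ("adequate", 0.1)]}
for name, value in results.items():
    print(f"{name:>9}: final loss = {value:.6g}")
```
With the too-big rate the weight keeps jumping to the other side of the minimum and the loss grows, while the too-small rate barely moves the loss in 20 steps.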

166
00:11:58,240 --> 00:12:04,930
Diving back into the code, you see all we need to do is just specify SGD, and we're going to have

167
00:12:04,930 --> 00:12:11,950
the w minus learning rate times the partial derivative of the loss with respect

168
00:12:11,950 --> 00:12:16,060
to the weights, which is going to be calculated under the hood.

169
00:12:16,060 --> 00:12:17,170
So that's it.

170
00:12:17,170 --> 00:12:22,060
Let's get back and print out the shapes of x and y.

171
00:12:22,060 --> 00:12:23,500
And then we could start with the training.

172
00:12:23,500 --> 00:12:25,660
So here we just need to do model.fit.

173
00:12:25,660 --> 00:12:27,370
And then we specify x.

174
00:12:27,370 --> 00:12:30,430
We specify y, and we give the number of epochs.

175
00:12:30,430 --> 00:12:35,620
Remember we said we are not going to start updating these parameters and then,

176
00:12:35,620 --> 00:12:39,790
after one step, just get straight to the optimal values.

177
00:12:39,790 --> 00:12:44,140
Generally we go through several steps, and each full pass over the training data is called an epoch.

178
00:12:44,140 --> 00:12:47,560
So we'll set the number of epochs to, say, ten.

179
00:12:47,560 --> 00:12:51,310
And then verbose, we set to one.

180
00:12:51,310 --> 00:12:53,800
So let's run that and see what we get.

181
00:12:54,460 --> 00:12:58,630
Oh no, this should be epochs. Run that.

182
00:12:59,810 --> 00:13:04,640
And as you can see, the model has trained for ten epochs.

183
00:13:04,640 --> 00:13:08,270
Now you'll notice how this loss value keeps dropping.

184
00:13:08,300 --> 00:13:12,890
See, we go from 308,519.

185
00:13:12,890 --> 00:13:16,280
We drop to 308,516.

186
00:13:16,280 --> 00:13:21,470
If you increase the number of epochs, you would find that these values are going to drop much

187
00:13:21,470 --> 00:13:23,900
more, as compared to what we've seen already.

188
00:13:23,900 --> 00:13:25,280
So you see, it keeps dropping.

189
00:13:25,280 --> 00:13:30,200
Now we get to 308,485.

190
00:13:30,230 --> 00:13:35,840
Now, although we specify this optimizer to be SGD, that's stochastic gradient descent.

191
00:13:36,260 --> 00:13:38,630
We could also have other optimizers.

192
00:13:38,630 --> 00:13:39,860
So you could check out here.

193
00:13:39,860 --> 00:13:48,470
But most of these are just flavors of SGD, like Adam and AdamW, which are among the

194
00:13:48,470 --> 00:13:51,050
most popular optimizers we have today.

195
00:13:51,050 --> 00:13:51,920
So click on this.

196
00:13:51,920 --> 00:13:52,550
Adam.

197
00:13:53,240 --> 00:14:02,420
You see we have the Adam optimizer, which as we said is a flavor of SGD, and it now has many more parameters.

198
00:14:02,420 --> 00:14:06,050
Feel free to check all of this out in the documentation.

199
00:14:06,050 --> 00:14:11,240
So let's just copy this out, or let's just get back to the code and import

200
00:14:11,240 --> 00:14:12,500
Adam.

201
00:14:12,500 --> 00:14:17,240
So we get back to the top, and we have from

202
00:14:18,050 --> 00:14:22,850
tensorflow.keras.optimizers,

203
00:14:22,850 --> 00:14:25,580
we're going to import Adam.

204
00:14:25,790 --> 00:14:26,690
So that's it.

205
00:14:26,690 --> 00:14:27,560
You could import Adam.

206
00:14:27,560 --> 00:14:34,340
You can import AdamW or any other optimizer available in TensorFlow.

207
00:14:34,430 --> 00:14:35,450
So that's it.

208
00:14:35,450 --> 00:14:40,070
Once you import that you could now replace this SGD we had right here.

209
00:14:40,070 --> 00:14:42,440
So instead of this now we just have Adam.

210
00:14:42,680 --> 00:14:45,110
And then you could specify its learning rate.

211
00:14:45,110 --> 00:14:50,360
So we have the learning rate; by default it is given as 0.001.

212
00:14:50,360 --> 00:14:51,560
Yeah that was it.

213
00:14:52,100 --> 00:14:53,900
Take that off. Okay.

214
00:14:53,900 --> 00:14:55,190
So by default we have this.

215
00:14:55,190 --> 00:14:58,610
We also have the values for beta one, beta two and so on and so forth.

216
00:14:58,610 --> 00:15:00,950
But let's just focus on the learning rate.

217
00:15:00,950 --> 00:15:02,450
So by default we have this.

218
00:15:02,450 --> 00:15:08,450
We could change this to, say, 1e-5 or some other value.

219
00:15:08,450 --> 00:15:10,610
So let's just keep the 1e-3.

220
00:15:10,610 --> 00:15:13,280
And then just as before we compile the model.

221
00:15:13,280 --> 00:15:15,530
And then we go ahead and train.

222
00:15:15,530 --> 00:15:20,270
Now while training we did not keep the history of all these losses.

223
00:15:20,270 --> 00:15:21,650
So let's say history.

224
00:15:22,070 --> 00:15:23,810
There we go.

225
00:15:23,990 --> 00:15:25,700
So we're still going to have the same output.

226
00:15:25,700 --> 00:15:33,050
But now, after the training is done, you could do history.history, and you would see that you will

227
00:15:33,050 --> 00:15:35,450
get all these different loss values.

228
00:15:35,780 --> 00:15:38,270
So you get all the different loss values.

229
00:15:38,300 --> 00:15:40,070
See we have this dictionary.

230
00:15:40,070 --> 00:15:44,150
We have loss, and we have all these values right up to the end.

231
00:15:44,150 --> 00:15:49,010
So let's now do history.history, and then we get the loss.

232
00:15:49,460 --> 00:15:54,740
See we have directly all the different loss values from the epoch zero right up to the last epoch.
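To show what such a per-epoch loss record contains, here is a plain-Python sketch. It is not Keras: it runs full-batch gradient descent on made-up data lying on y = 2x + 4, starting from m = 0 and c = 3, and stores one loss value per epoch in a dictionary shaped like the one shown above. The data, learning rate and variable names are all assumptions for illustration.
```python
# Made-up data lying on the line y = 2x + 4 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 4.0 for x in xs]
m, c = 0.0, 3.0            # initial parameters, as in the walkthrough
learning_rate = 0.05
history = {"loss": []}     # mimics the shape of a Keras-style history dict
for epoch in range(10):
    errors = [(m * x + c) - y for x, y in zip(xs, ys)]
    history["loss"].append(sum(e * e for e in errors) / len(xs))  # mean squared error
    grad_m = sum(2.0 * e * x for e, x in zip(errors, xs)) / len(xs)  # dL/dm
    grad_c = sum(2.0 * e for e in errors) / len(xs)                  # dL/dc
    m -= learning_rate * grad_m
    c -= learning_rate * grad_c
print(history["loss"])
```
The list holds one value per epoch, and each value is smaller than the one before it, which is the pattern to look for in the real training output.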

233
00:15:54,740 --> 00:15:59,210
We can now plot out these loss values using matplotlib.

234
00:15:59,210 --> 00:16:03,800
So let's first of all import matplotlib.

235
00:16:03,800 --> 00:16:13,700
We have here import matplotlib.pyplot as plt. Run that.

236
00:16:14,690 --> 00:16:16,730
Well think of this.

237
00:16:17,030 --> 00:16:18,650
Oh and that should be fine.

238
00:16:18,650 --> 00:16:20,360
So we run that again.

239
00:16:20,360 --> 00:16:21,080
That's fine.

240
00:16:21,080 --> 00:16:25,280
We get back and then we try to plot all this.

241
00:16:25,280 --> 00:16:26,930
We'll try to plot out this loss,

242
00:16:26,930 --> 00:16:29,240
so we see that graphically.

243
00:16:29,330 --> 00:16:30,140
There we go.

244
00:16:30,140 --> 00:16:36,230
We have plt.plot of history.history['loss'].

245
00:16:36,230 --> 00:16:37,370
There we go.

246
00:16:37,370 --> 00:16:39,440
We also want to have a title.

247
00:16:39,980 --> 00:16:44,630
So here we have 'model loss'.

248
00:16:45,080 --> 00:16:47,840
Then we want to have a y label.

249
00:16:47,840 --> 00:16:53,240
The y label is simply 'loss'.

250
00:16:53,660 --> 00:16:56,720
And then we have the X label.

251
00:16:56,720 --> 00:17:00,650
The x label is 'epoch'.

252
00:17:00,650 --> 00:17:03,620
And then we have the legend.

253
00:17:03,950 --> 00:17:05,030
Legend.

254
00:17:05,270 --> 00:17:07,400
We have train.

255
00:17:07,400 --> 00:17:08,990
Well let's see train.

256
00:17:08,990 --> 00:17:11,750
Okay, let's put it in a list, and we have 'train'.

257
00:17:12,140 --> 00:17:13,190
That's fine.

258
00:17:13,190 --> 00:17:16,010
And now we do plt.show.

259
00:17:16,010 --> 00:17:16,910
There we go.

260
00:17:16,910 --> 00:17:20,210
You can see our loss, and you see that it's actually dropping.

261
00:17:20,210 --> 00:17:25,940
And if we compare this plot to what we had already on the board, we see that we're still very

262
00:17:25,940 --> 00:17:28,160
far from the model convergence.

263
00:17:28,160 --> 00:17:34,100
So we could still do better with this model, or we could still get much smaller loss

264
00:17:34,100 --> 00:17:34,760
values.