1
00:00:00,150 --> 00:00:05,910
Hello, everyone, and welcome to this new section in which we're going to look at different strategies

2
00:00:05,910 --> 00:00:08,730
of combating overfitting and underfitting.

3
00:00:08,730 --> 00:00:15,570
These strategies will include data augmentation, as you could see here, dropout, regularization,

4
00:00:15,570 --> 00:00:21,840
early stopping, smaller network usage, hyperparameter tuning, and normalization.

5
00:00:21,840 --> 00:00:28,920
So far in this course, we've mentioned the terms overfitting and underfitting without really getting

6
00:00:28,920 --> 00:00:32,040
in depth into what these two actually mean.

7
00:00:32,190 --> 00:00:40,800
Now, for overfitting, we'll consider this plot of the loss versus the number of epochs and then precision

8
00:00:40,800 --> 00:00:44,730
versus number of epochs. With the loss versus number of epochs,

9
00:00:44,730 --> 00:00:48,900
what we'll generally have when a model is overfitting is something like this.

10
00:00:48,900 --> 00:00:57,090
So here we have the validation loss, and here we have the training loss.

11
00:00:57,090 --> 00:01:05,400
So this is just like some general way of looking at this, though this may take different forms.

12
00:01:05,430 --> 00:01:14,730
Now, that said, as you could see, we have these two sets in our loss versus number of epochs plot.

13
00:01:14,730 --> 00:01:23,910
Clearly we could see that the validation and the training curves initially start off with a similar pattern.

14
00:01:23,910 --> 00:01:27,540
Sometimes you may even have the validation which comes up like this.

15
00:01:27,540 --> 00:01:32,940
So you can even have a situation like this where the validation performs even better than the training

16
00:01:32,940 --> 00:01:33,990
set initially.

17
00:01:34,110 --> 00:01:42,600
But generally, when a model overfits, what we'll have will be this kind of plot, where at some point

18
00:01:42,600 --> 00:01:51,660
the model keeps doing well on the training set and starts doing very poorly on the validation

19
00:01:51,660 --> 00:01:52,290
set.

20
00:01:52,500 --> 00:01:56,670
We could replicate this with a precision versus epochs plot.

21
00:01:56,670 --> 00:02:02,070
This could be precision, accuracy, recall, or some other metric which we've chosen.

22
00:02:02,070 --> 00:02:04,910
And what we could have would be something like this.

23
00:02:04,920 --> 00:02:11,250
Obviously, as you saw in practice, it isn't always like this, but something in this sense.

24
00:02:11,250 --> 00:02:19,350
So you would have this here: you see, you have the training and then you have the validation right

25
00:02:19,350 --> 00:02:19,860
here.

26
00:02:19,860 --> 00:02:27,120
So what goes on is your model keeps on doing very well for the training and then with the validation,

27
00:02:27,120 --> 00:02:31,620
at some point it starts doing very, very poorly.

28
00:02:32,370 --> 00:02:38,190
And so the danger with this is if you are considering or working only with a training set, you may

29
00:02:38,190 --> 00:02:42,890
feel like training over more epochs

30
00:02:42,890 --> 00:02:50,240
is a good idea, because you keep getting these great results. Suppose we fix this point here and

31
00:02:50,340 --> 00:02:59,370
your training score is, say, 99%, so you have a 99% precision score on your training data and you're

32
00:02:59,370 --> 00:03:04,050
like, wow, this model will do very well in the real world.

33
00:03:04,050 --> 00:03:12,390
But this isn't generally the case because this model actually has been overfitted on your training

34
00:03:12,390 --> 00:03:12,900
data.

35
00:03:12,900 --> 00:03:21,960
Your model has learned to adjust its weights to fit the training data, instead of being able

36
00:03:21,960 --> 00:03:30,270
to extract useful information or some intelligence from the data which has been used in training this

37
00:03:30,270 --> 00:03:41,820
model. The main cause of overfitting is having a small dataset and a large, complex model which

38
00:03:41,820 --> 00:03:44,460
contains so many parameters.

39
00:03:44,490 --> 00:03:52,920
Now, if a model like, say, a deep neural network has many parameters and you're giving it a small data

40
00:03:52,920 --> 00:04:00,540
set, then obviously it's going to adjust its parameters such that on this small data set, it performs

41
00:04:00,540 --> 00:04:02,130
exceptionally well.

42
00:04:02,130 --> 00:04:12,930
So you may come across a model which has say 99.9% precision or accuracy, just because the data you've

43
00:04:12,930 --> 00:04:17,880
used to train the model was very small.

44
00:04:18,180 --> 00:04:26,310
Now, in another case, you may have, say, a moderately sized data set and then not just this kind of

45
00:04:26,310 --> 00:04:29,010
model, but a very large model.

46
00:04:29,010 --> 00:04:36,120
So at the end of the day, we notice that there has to always be a balance between this data set and

47
00:04:36,120 --> 00:04:37,380
the model size.

48
00:04:37,380 --> 00:04:43,950
So this means that even if you increase the data size and then you also increase the model size, your

49
00:04:43,950 --> 00:04:49,440
model may still risk overfitting. To better understand this concept of overfitting,

50
00:04:49,440 --> 00:04:51,570
let's take this simple example.

51
00:04:51,720 --> 00:04:59,340
Suppose that you have three subjects to master at school, let's say math, English and sports.

52
00:04:59,340 --> 00:04:59,760
But

53
00:04:59,850 --> 00:05:07,080
when kids get to school, they are only taught mathematics, or they are only taught English, or only taught

54
00:05:07,090 --> 00:05:07,700
sports.

55
00:05:07,710 --> 00:05:09,600
Let's choose, for example, mathematics.

56
00:05:09,600 --> 00:05:17,070
So these kids get to school, and from the first year to the last year, they are taught only mathematics.

57
00:05:17,100 --> 00:05:24,420
It's clear that when you evaluate this kid or when you pick a kid at random and evaluate that kid in

58
00:05:24,420 --> 00:05:33,210
mathematics, the kid would tend to have a better or an above average result in mathematics compared

59
00:05:33,210 --> 00:05:35,070
to kids from other schools.

60
00:05:35,850 --> 00:05:42,150
But when tested on subjects like English and sports, that may not be the case.

61
00:05:42,150 --> 00:05:48,000
And this is because English and sports weren't taught at school.

62
00:05:48,240 --> 00:05:54,750
And so if you are evaluated on only what you were taught, you would tend to have very high scores like

63
00:05:54,750 --> 00:05:55,340
this.

64
00:05:55,350 --> 00:05:58,740
And then, what if we start to evaluate on stuff

65
00:05:58,740 --> 00:06:00,210
the kids were not taught?

66
00:06:00,240 --> 00:06:07,980
You'll notice that these kids will then start having poorer results because those kids haven't taken

67
00:06:07,980 --> 00:06:11,310
out some time to master English and sports.

68
00:06:11,400 --> 00:06:19,110
And so you may end up with a kid who is able to show, for example, that sine squared x plus cos squared

69
00:06:19,110 --> 00:06:25,320
x equals one, but can't tell you the past tense of "eat".

70
00:06:25,440 --> 00:06:33,270
Now it's clear that to readjust this situation, the kids have to be taught all the subjects such that

71
00:06:33,270 --> 00:06:36,450
there is the balance that these kids need.

72
00:06:36,450 --> 00:06:44,430
And obviously this balance will come in a way that the kids now perform better in these two subjects.

73
00:06:44,430 --> 00:06:52,290
And because they have had part of the time they used to study maths allocated for English and sports,

74
00:06:52,290 --> 00:07:00,420
they may perform slightly worse than before in maths, but at least what's important to note is this

75
00:07:00,420 --> 00:07:06,570
time around they can now express themselves better in English or practise some sports.

76
00:07:06,600 --> 00:07:08,580
Let's now take this other example.

77
00:07:08,580 --> 00:07:16,710
Suppose we train a model which predicts the presence of a car in an image and then, let's say, you feed

78
00:07:16,710 --> 00:07:19,350
your model with this kind of data.

79
00:07:19,440 --> 00:07:28,140
Then, after training your model, you test that model on this kind of real-world data.

80
00:07:28,260 --> 00:07:36,270
It's clear that even if you had very great results with the previous data, in working with this

81
00:07:36,270 --> 00:07:44,280
or testing on this new data set right here, you wouldn't have those amazing results you had previously.

82
00:07:44,280 --> 00:07:53,130
And so it's important that not only should your data be large enough or get as much useful data as you

83
00:07:53,130 --> 00:08:02,400
can, but you should also ensure that your training data represents or looks like your test data, or looks like what

84
00:08:02,400 --> 00:08:04,290
you're going to be having in production.

85
00:08:04,290 --> 00:08:06,200
And that's really what matters.

86
00:08:06,210 --> 00:08:14,460
What matters is the fact that you have a model in production which works well or which has a high accuracy

87
00:08:14,460 --> 00:08:23,370
and not a model in your notebook, which has, say, 100% accuracy on your training data, whereas on

88
00:08:23,370 --> 00:08:26,700
production it doesn't perform very well.

89
00:08:27,610 --> 00:08:29,770
Now, moving on to underfitting.

90
00:08:29,800 --> 00:08:38,620
It turns out that your model becomes way too simple for it to even be able to extract information from

91
00:08:38,620 --> 00:08:39,360
our data.

92
00:08:39,370 --> 00:08:46,120
So we may have validation data and then training data like this.

93
00:08:46,120 --> 00:08:54,370
And then there is this huge gap between our current loss and the minimum possible loss.

94
00:08:54,550 --> 00:08:58,960
We could also have this at the level of, say, accuracy.

95
00:08:58,960 --> 00:09:08,200
So yeah, we could have accuracy and then we have our validation and then training and we have say,

96
00:09:08,200 --> 00:09:09,490
accuracy 100%.

97
00:09:09,490 --> 00:09:19,000
Let's say we have one right here, and then our model is so simple that we just end up at, say, 0.6

98
00:09:19,000 --> 00:09:29,680
or 60% accuracy. In this kind of situation, our data, or rather the relative size of our data as compared to

99
00:09:29,680 --> 00:09:31,990
the model, may be too large.

100
00:09:31,990 --> 00:09:38,290
So you could even have a situation where this data is smaller than what we had here.

101
00:09:38,290 --> 00:09:41,200
So it could be small like this.

102
00:09:41,200 --> 00:09:50,440
But if your model is way too simple, say we have just this very small model, then you may face this

103
00:09:50,440 --> 00:09:52,150
problem of underfitting.

104
00:09:52,720 --> 00:09:59,770
It also turns out that sometimes you may have a situation where you have even a very complex model,

105
00:09:59,770 --> 00:10:02,200
but that model still underfits.

106
00:10:02,200 --> 00:10:10,060
And that's because that model hasn't been built in a way that it could extract useful information from

107
00:10:10,060 --> 00:10:11,170
this data.

108
00:10:11,200 --> 00:10:17,410
Now, if you can recall, in the section where we were predicting the car price, we had a situation where

109
00:10:17,500 --> 00:10:26,200
our model used a simple dense layer, that is, a single dense layer with just two parameters,

110
00:10:26,560 --> 00:10:30,100
and we had this fixed data set.

111
00:10:30,610 --> 00:10:40,360
But once we stacked up more dense layers, we found out that we were able to get better

112
00:10:40,360 --> 00:10:44,930
training and validation mean absolute error values.

113
00:10:44,950 --> 00:10:49,280
There are several ways in which we could mitigate this problem of overfitting.

114
00:10:49,300 --> 00:10:53,170
The very first one we'll look at is that of collecting more data.

115
00:10:53,170 --> 00:10:59,820
So it's important to lay hands on as much data as you can.

116
00:10:59,830 --> 00:11:05,680
This data has to be representative of what the model will see in real life, and this data should be

117
00:11:05,680 --> 00:11:07,480
as diverse as possible.

118
00:11:08,470 --> 00:11:15,570
Even after collecting more data to solve this problem of overfitting, we could use data augmentation.

119
00:11:15,580 --> 00:11:17,980
Now, what is data augmentation all about?

120
00:11:18,310 --> 00:11:26,710
Supposing we have this image cell right here, the cell image we have here, which happens to be parasitized.

121
00:11:27,130 --> 00:11:35,980
Now, instead of having just this in our dataset, we could have this image right here modified such

122
00:11:35,980 --> 00:11:38,860
that we now have more data to train on.

123
00:11:39,070 --> 00:11:48,310
So this means that in the case where initially we had, say, 20,000 images, so we have 20,000 images

124
00:11:48,310 --> 00:11:55,210
now after doing data augmentation, after modifying each and every image, we now have a dataset of

125
00:11:55,210 --> 00:11:56,630
80,000.

126
00:11:56,650 --> 00:12:02,830
Now we're considering 80,000 because we're supposing that each image is going to be flipped as we've

127
00:12:02,830 --> 00:12:03,970
just done right here.

128
00:12:03,970 --> 00:12:10,510
So we take this image, we rotate it, we have this other image, with the same label, obviously still parasitized,

129
00:12:10,510 --> 00:12:13,360
and then we flip again and get this image.

130
00:12:13,360 --> 00:12:19,660
We flip, get this image, flip, get this image. We see that we now have, instead of just this one,

131
00:12:19,660 --> 00:12:21,580
we have four others.

132
00:12:21,580 --> 00:12:25,810
This actually means we're multiplying this by five.

133
00:12:25,810 --> 00:12:28,960
So this is 100,000, since we now have

134
00:12:29,860 --> 00:12:31,990
one, two, three, four, five

135
00:12:31,990 --> 00:12:32,830
examples

136
00:12:32,830 --> 00:12:36,160
for this single example we had initially.
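As a rough sketch of this counting, assuming a tiny image stored as a nested Python list: the helper names `flip_horizontal`, `flip_vertical`, `transpose` and `augment` are hypothetical, and the four transformations chosen are only one possible reading of the flips and rotations described here. Each original image yields itself plus four variants with the same label, so 20,000 images become 100,000.
```python
def flip_horizontal(image):
    # Mirror each row left to right.
    return [row[::-1] for row in image]
def flip_vertical(image):
    # Reverse the order of the rows (top to bottom).
    return image[::-1]
def transpose(image):
    # Swap rows and columns.
    return [list(column) for column in zip(*image)]
def augment(image, label):
    # Return the original plus four modified copies, all keeping the same label.
    variants = [
        image,
        flip_horizontal(image),
        flip_vertical(image),
        flip_vertical(flip_horizontal(image)),  # equivalent to a 180-degree rotation
        transpose(image),
    ]
    return [(variant, label) for variant in variants]
cell = [[1, 2], [3, 4]]  # stand-in for a real cell image (assumption)
augmented = augment(cell, "parasitized")
dataset_size_after = 20_000 * len(augmented)  # five examples per original image
```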

137
00:12:36,550 --> 00:12:43,840
It should also be noted that there are many other data augmentation strategies for this kind of image

138
00:12:43,840 --> 00:12:44,400
data.

139
00:12:44,410 --> 00:12:50,770
So apart from flipping as we've just done, we could crop just a portion, we could add some noise to

140
00:12:50,770 --> 00:12:56,890
this data, we could modify the contrast, we could modify the brightness and carry out so many other

141
00:12:56,890 --> 00:12:57,880
operations.

142
00:12:58,270 --> 00:13:04,630
And there is no particular data augmentation strategy which works for all problems.

143
00:13:04,640 --> 00:13:11,440
This means that when you have a particular problem, you will have to try different augmentation strategies

144
00:13:11,440 --> 00:13:15,400
and then be able to select the one which works for your data.

145
00:13:15,790 --> 00:13:24,490
Now, that said, we have dropout. To better understand this notion of dropout, we'll consider this simple

146
00:13:24,490 --> 00:13:26,200
neural network right here.

147
00:13:26,470 --> 00:13:27,130
Now,

148
00:13:27,210 --> 00:13:33,390
if you recall, the reason why we have models which overfit is because we are working with

149
00:13:33,390 --> 00:13:36,930
very complex models with many parameters.

150
00:13:37,110 --> 00:13:44,190
Now, in order to reduce the complexity of this neural network, what we could do is take off, for

151
00:13:44,190 --> 00:13:50,720
example, this interaction between this neuron and all those previous neurons right here.

152
00:13:50,730 --> 00:13:58,200
So this means that when training our model, we are only going to consider that we have in this hidden

153
00:13:58,200 --> 00:14:02,660
layer just these two neurons right here.

154
00:14:02,670 --> 00:14:05,610
So all those connections here become useless.

155
00:14:05,640 --> 00:14:11,940
Now, this has the effect of simplifying our network, as what we have as output

156
00:14:11,970 --> 00:14:17,310
now, that's after carrying out the dropout operation, looks like this.

157
00:14:17,310 --> 00:14:21,270
So we have now connections which look like this.

158
00:14:21,780 --> 00:14:26,190
Now, this particular case is an example of dropout

159
00:14:26,190 --> 00:14:30,480
with a dropout rate equal to 0.3.

160
00:14:30,900 --> 00:14:40,830
Or rather, this is 0.333, as we're dropping out exactly one third of all the connections, or rather one third of

161
00:14:40,830 --> 00:14:42,750
all our neurons right here.

162
00:14:43,440 --> 00:14:47,550
And if the rate equals two thirds, what we'll be left with will be this.

163
00:14:47,550 --> 00:14:54,030
So we'll take this off and we'll be left just with this neural network to train right here. We see

164
00:14:54,030 --> 00:15:04,170
that we could move from this very complex model to a simplified model via this dropout operation.

165
00:15:04,170 --> 00:15:09,000
And this has an overall effect of mitigating overfitting.
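A minimal sketch of the dropout operation in plain Python, under a few assumptions: the function name `dropout` and the example activations are made up for illustration, each neuron is dropped independently with probability `rate` (so a rate of one third removes one neuron out of three only on average), and the survivors are rescaled by 1 / (1 - rate), which is the usual "inverted dropout" convention.
```python
import random
def dropout(activations, rate, seed=None):
    # Drop each activation independently with probability `rate`.
    # Survivors are scaled by 1 / (1 - rate) so the expected output is unchanged.
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    rng = random.Random(seed)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]
hidden = [0.5, 1.0, 1.5]  # outputs of three hidden neurons (made-up values)
dropped_a = dropout(hidden, rate=1 / 3, seed=0)
dropped_b = dropout(hidden, rate=1 / 3, seed=0)  # same seed, so the same neurons are dropped
```
Dropout is only applied during training; at prediction time all the neurons are kept.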

166
00:15:10,110 --> 00:15:16,260
The next step we could take is that of regularization. To better understand regularization,

167
00:15:16,260 --> 00:15:28,380
suppose we have this model with weights w_j, so we have, say, n weights, and these weights are free to take

168
00:15:28,380 --> 00:15:31,890
up any value as we've seen previously.

169
00:15:31,890 --> 00:15:41,430
The fact that these weights can take up just any value may lead to overfitting as now these weights can

170
00:15:41,430 --> 00:15:47,350
be adjusted to fit on the training data in a very perfect manner.

171
00:15:47,370 --> 00:15:52,860
So this means that we could have a model which picks out each and every point like this.

172
00:15:52,860 --> 00:15:57,810
So we have this kind of model and we have this.

173
00:15:57,810 --> 00:16:00,900
So we have this model which picks up every point like that.

174
00:16:01,320 --> 00:16:02,460
Whereas,

175
00:16:04,520 --> 00:16:05,960
if we restrain

176
00:16:05,960 --> 00:16:13,460
these weights to stay in a given range, then we may end up with something like this.

177
00:16:13,460 --> 00:16:20,420
So we may end up with a model which looks simplified like this because it doesn't have as much freedom

178
00:16:20,420 --> 00:16:22,430
as this other model.

179
00:16:22,790 --> 00:16:29,660
Now, the problem with this model is that if you now put in this new data, you will find that

180
00:16:29,660 --> 00:16:36,020
this one will try to pull out like this and then your prediction will be somewhere around here.

181
00:16:37,280 --> 00:16:43,880
And so if, for example, we have horsepower on the x axis, and then you want to predict the price

182
00:16:43,880 --> 00:16:44,750
of a car.

183
00:16:45,890 --> 00:16:50,630
Let's do the same here: we have the price and then the horsepower.

184
00:16:51,900 --> 00:17:00,570
If it then happens that we have this car with this very high horsepower, this model, because it is

185
00:17:00,570 --> 00:17:04,290
overfitted on this training data, will predict this very low price.

186
00:17:04,290 --> 00:17:13,800
Whereas this model, which has generalized on this training data, will tend to predict a more reasonable

187
00:17:13,800 --> 00:17:14,640
price.

188
00:17:15,540 --> 00:17:21,660
Now, it should be noted here that because this model was able to go through each and every point, it would have

189
00:17:21,660 --> 00:17:24,550
a training loss of almost zero.

190
00:17:24,570 --> 00:17:29,490
Whereas now when we give it this new data, it doesn't perform well.

191
00:17:29,490 --> 00:17:33,010
Whereas with this we wouldn't have a training loss of zero.

192
00:17:33,030 --> 00:17:39,270
But when given new data, at least we're going to have reasonable predictions.

193
00:17:39,570 --> 00:17:48,340
So coming back to regularization, our aim is to restrain these weights, because this model can be

194
00:17:48,340 --> 00:17:53,670
represented as this function, and this function is made up of these weights.

195
00:17:54,240 --> 00:18:01,130
And so, since our aim when training is to minimize the loss,

196
00:18:01,140 --> 00:18:06,630
we could include these weights in the computation of the loss.

197
00:18:06,720 --> 00:18:13,260
This means that we have our loss, which is now equal to the loss we would have normally, plus the

198
00:18:13,260 --> 00:18:19,530
regularization constant times the sum of each and every weight squared.

199
00:18:19,560 --> 00:18:26,460
Now this is known as L2 regularization, whereas here we have L1 regularization, where we're summing

200
00:18:26,460 --> 00:18:29,040
up the absolute value of each and every weight.
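The two penalties can be sketched in a few lines of plain Python. The function names, the example weights, and the data loss of 0.5 are assumptions made up for illustration; `rate` plays the role of the regularization constant.
```python
def l2_penalty(weights, rate):
    # Regularization constant times the sum of the squared weights.
    return rate * sum(w * w for w in weights)
def l1_penalty(weights, rate):
    # Regularization constant times the sum of the absolute values of the weights.
    return rate * sum(abs(w) for w in weights)
def regularized_loss(data_loss, weights, rate, kind="l2"):
    # Total loss = the loss we would have normally + the penalty term.
    penalty = l2_penalty(weights, rate) if kind == "l2" else l1_penalty(weights, rate)
    return data_loss + penalty
weights = [3.0, -4.0]  # made-up weights
loss_l2 = regularized_loss(0.5, weights, rate=0.01, kind="l2")  # 0.5 + 0.01 * (9 + 16)
loss_l1 = regularized_loss(0.5, weights, rate=0.01, kind="l1")  # 0.5 + 0.01 * (3 + 4)
```
Because the L2 term squares each weight, a large weight is penalized far more heavily than a small one.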

201
00:18:29,070 --> 00:18:36,210
For now, we're just going to explain how regularization helps in mitigating that problem of overfitting

202
00:18:36,210 --> 00:18:39,240
by restraining those weights in a given range.

203
00:18:39,630 --> 00:18:43,650
So let's have here the loss equal to

204
00:18:44,600 --> 00:18:46,190
L, that's the initial loss,

205
00:18:46,190 --> 00:18:48,680
plus, let's call this, R.

206
00:18:49,460 --> 00:18:59,090
Now, if our aim is to minimize this loss, then obviously this L will be minimized and the R will

207
00:18:59,090 --> 00:19:00,150
be minimized.

208
00:19:00,170 --> 00:19:08,390
And so if we're trying to minimize this here, or this sum in general, then it would have that overall

209
00:19:08,390 --> 00:19:17,810
effect of restraining these weights in a given range, especially as we know that when we square very

210
00:19:17,810 --> 00:19:21,410
large values, these values become even larger.

211
00:19:21,410 --> 00:19:23,120
And so, to avoid this,

212
00:19:24,320 --> 00:19:30,290
our weights will tend to take up smaller values which fall in a smaller range.

213
00:19:30,830 --> 00:19:35,180
This L2 regularization is also known as weight decay.

214
00:19:35,750 --> 00:19:41,360
It should be noted that the main difference between this L2 regularization and L1 regularization

215
00:19:41,360 --> 00:19:49,700
is that in trying to restrain the range of values which these weights can take up, the L1 regularization

216
00:19:49,700 --> 00:19:58,380
has that side effect of making many of these weights take up values around zero,

217
00:19:58,400 --> 00:20:03,890
that is, values that are very, very small, or exactly zero.

218
00:20:03,890 --> 00:20:09,710
So this will lead to sparse models as compared to the L2 regularization.

219
00:20:09,710 --> 00:20:13,970
And that's why in practice we generally use the L2 regularization.

220
00:20:14,930 --> 00:20:21,100
That said, we move on to early stopping, which we've seen already. In early stopping, as we have seen,

221
00:20:21,110 --> 00:20:28,790
if we have a model, let's say we have the validation precision and then the training precision

222
00:20:28,790 --> 00:20:34,550
which keeps on increasing and then we have this limit of one or 100%.

223
00:20:34,550 --> 00:20:37,050
So we have the precision and we have our epochs.

224
00:20:37,070 --> 00:20:45,080
And then after a while this starts dropping and this is dropping simply because our model is now trying

225
00:20:45,080 --> 00:20:49,520
to overfit on the data it has been trained on.

226
00:20:49,520 --> 00:20:58,900
And so we have to stop training once we notice that the validation performance isn't improving any longer.

227
00:20:58,910 --> 00:21:03,170
So this means that after a certain number of epochs we are going to stop the training.

228
00:21:03,170 --> 00:21:04,100
And we've seen this

229
00:21:04,100 --> 00:21:10,370
already in the previous section: we've seen it both theoretically, like this, and practically.
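A minimal sketch of this stopping rule in plain Python, assuming we track the validation loss (lower is better) rather than the precision. The function name, the made-up loss values, and the `patience` argument (how many epochs without improvement we tolerate) are all assumptions for illustration.
```python
def early_stopping_epoch(val_losses, patience=2):
    # Return (stop_epoch, best_epoch), both as 0-based epoch indices.
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch  # stop here; the best weights came from best_epoch
    return len(val_losses) - 1, best_epoch  # rule never triggered: trained to the end
val_losses = [1.0, 0.8, 0.6, 0.65, 0.7, 0.9, 1.2]  # made-up validation losses
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
```
With these values the validation loss is lowest at epoch index 2, and training stops two epochs later, at index 4.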

230
00:21:11,060 --> 00:21:18,830
Then another thing to do is to reduce the size of the network or use a less complex network.

231
00:21:19,010 --> 00:21:23,030
Our next step will be to properly tune our hyperparameters.

232
00:21:23,030 --> 00:21:30,470
Hyperparameters like batch size, dropout rate, regularization rate, and learning rate can affect

233
00:21:30,470 --> 00:21:35,840
our model and dictate whether this model will overfit or not.

234
00:21:36,020 --> 00:21:43,160
Now, if we look at the batch size, training with a larger batch size may speed up our training process,

235
00:21:43,160 --> 00:21:51,770
but working with smaller batch sizes has a regularization effect, which helps reduce overfitting.

236
00:21:52,280 --> 00:22:00,320
And so, according to Yann LeCun, friends don't let friends use mini-batches larger than 32. For the dropout

237
00:22:00,320 --> 00:22:00,760
rate,

238
00:22:00,770 --> 00:22:01,940
we've seen this already.

239
00:22:01,940 --> 00:22:10,040
Increasing the dropout rate means we are making the model simpler. As for the regularization rate,

240
00:22:10,040 --> 00:22:15,500
increasing the regularization rate means we're reducing the effect of overfitting.

241
00:22:15,500 --> 00:22:21,620
And then finally, picking too small of a learning rate may lead to overfitting.

242
00:22:21,860 --> 00:22:28,490
So in general, we have some hyperparameters to tune and they are not only limited to this, as you

243
00:22:28,490 --> 00:22:32,780
may have many other hyperparameters depending on your problem.

244
00:22:33,380 --> 00:22:41,990
Now, the fact that normalization introduces extra parameters, mu and sigma, which bring in some noise in

245
00:22:41,990 --> 00:22:47,830
the model, has that regularization effect which helps reduce overfitting.

246
00:22:47,840 --> 00:22:57,620
So if you include batch norm in your model, then you could feel free to reduce the dropout rate

247
00:22:58,400 --> 00:23:03,440
since this normalization layer already brings in that regularization effect.

248
00:23:03,680 --> 00:23:08,540
From here we look at ways of mitigating the problem of underfitting.

249
00:23:08,540 --> 00:23:14,420
So with underfitting, we could use more complex models.

250
00:23:15,020 --> 00:23:17,300
We could also collect more data.

251
00:23:17,300 --> 00:23:22,970
We see that this solution applies to the two, that's both overfitting and

252
00:23:22,970 --> 00:23:32,030
underfitting. It is always a good thing to collect more data, or cleaner and more representative data.

253
00:23:32,510 --> 00:23:36,090
So from here, you could also increase the training time.

254
00:23:36,110 --> 00:23:43,910
Now note that you could have this model, like, let's take this, and then here you've trained

255
00:23:43,910 --> 00:23:46,400
and then you have the validation.

256
00:23:46,610 --> 00:23:53,600
So here you have the val and then you have the train, and then you've been training for over a thousand

257
00:23:53,600 --> 00:23:59,420
epochs and then you feel like the model may not perform any better.

258
00:23:59,630 --> 00:24:03,860
Now, several scientists have reported that many times

259
00:24:03,860 --> 00:24:10,340
they've given up on a model and then come back later after forgetting to stop the training process and

260
00:24:10,340 --> 00:24:15,050
noticed that this model kept on performing much better.

261
00:24:15,050 --> 00:24:18,530
So sometimes you don't have to give up on your model.

262
00:24:18,530 --> 00:24:22,430
So you could train or you could increase this training time.

263
00:24:22,430 --> 00:24:23,420
Then again, we

264
00:24:23,560 --> 00:24:28,840
have hyperparameter tuning, which could help in making your model more performant.

265
00:24:28,840 --> 00:24:35,770
And we have normalization which stabilizes the training process and leads to better performance in the

266
00:24:35,770 --> 00:24:36,430
model.

267
00:24:36,700 --> 00:24:41,360
We'll now see practically how dropout could be implemented with TensorFlow.

268
00:24:41,380 --> 00:24:48,850
So here we have this Dropout layer, which takes as arguments the rate (the dropout rate), the noise

269
00:24:48,850 --> 00:24:58,150
shape, and the seed. The reason why we need to use seeding is simply for the case

270
00:24:58,150 --> 00:25:01,440
where we want reproducible experiments.

271
00:25:01,450 --> 00:25:09,340
So if we want to apply dropout in this layer with a dropout rate of 0.2, then we'll be taking off, on average, one

272
00:25:09,340 --> 00:25:14,440
neuron out of these five neurons to make our model simpler and avoid overfitting.

273
00:25:14,470 --> 00:25:21,040
Now, in doing so, we may take this one or this or this one or this or this other one.

274
00:25:21,040 --> 00:25:22,670
So it's a random choice.

275
00:25:22,690 --> 00:25:29,110
Now, if we want to fix this choice so that this experiment can be reproducible, then we can set this

276
00:25:29,110 --> 00:25:34,300
seed so that each time we run the experiment, it's going to be exactly the same neuron which is going

277
00:25:34,300 --> 00:25:35,860
to be taken off.

278
00:25:36,070 --> 00:25:42,580
Getting back to the code, the way we could use this is by importing the Dropout layer.

279
00:25:42,580 --> 00:25:46,510
So here we just have this dropout, we run that and it's fine.

280
00:25:46,510 --> 00:25:49,690
And then we could include this dropout right here.

281
00:25:49,690 --> 00:25:59,630
So we could have here, let's say, dropout_rate equal to 0.2.

282
00:25:59,650 --> 00:26:01,480
Now, you could always increase this rate.

283
00:26:01,480 --> 00:26:02,800
So let's have that.

284
00:26:02,800 --> 00:26:10,670
And then we have the Dropout layer, with rate equal to dropout_rate.

285
00:26:10,690 --> 00:26:11,500
That's fine.

286
00:26:11,500 --> 00:26:13,270
We could place this out here.

287
00:26:13,270 --> 00:26:20,410
But we wouldn't want dropout here. Anyway, you could always include dropout at this level,

288
00:26:20,410 --> 00:26:26,110
depending on how the model responds to this dropout which has been added here.

289
00:26:26,110 --> 00:26:32,350
So let's add this dropout and then add the dropout right here, bearing in mind that we could always

290
00:26:32,350 --> 00:26:37,990
add more dropout layers and increase the dropout rate or even reduce this rate.

291
00:26:37,990 --> 00:26:38,890
So that's fine.

292
00:26:38,890 --> 00:26:40,780
We run our model.

293
00:26:40,780 --> 00:26:46,900
And as you could see right here, this dropout has no parameters. From this,

294
00:26:46,900 --> 00:26:48,460
we'll look at regularizers.

295
00:26:48,460 --> 00:26:52,930
We'll see how to implement the L2 regularizer and the L1 regularizer.

296
00:26:53,350 --> 00:26:58,750
So here we have tf.keras, and then we have these regularizers right here.

297
00:26:58,960 --> 00:27:02,220
So if you select L2, you should land on this page.

298
00:27:02,230 --> 00:27:09,580
Now, once you get this, you see you have this tf.keras.regularizers.L2, and then you specify

299
00:27:09,580 --> 00:27:11,530
the regularization rate.

300
00:27:11,560 --> 00:27:18,350
So let's copy this out and then get back to our model, the level of our model.

301
00:27:18,370 --> 00:27:22,440
Let's come back to the definition of the Conv2D.

302
00:27:22,450 --> 00:27:30,220
So here we have this kernel_regularizer, and this kernel_regularizer is what we use in carrying out

303
00:27:30,220 --> 00:27:31,390
regularization.

304
00:27:31,390 --> 00:27:38,590
So here we have this Conv2D and then we simply specify the kernel_regularizer.

305
00:27:38,710 --> 00:27:46,510
We have kernel_regularizer, which is equal to this regularizer right here.

306
00:27:46,690 --> 00:27:49,000
Now let's take this off and there we go.

307
00:27:49,000 --> 00:27:52,720
We have this kernel_regularizer, which is now our L2 regularizer.

308
00:27:52,720 --> 00:27:54,790
You could always take this off.

309
00:27:54,790 --> 00:28:08,890
And then from here you have: from tensorflow.keras.regularizers import L2. So that's

310
00:28:08,890 --> 00:28:08,950
it.

311
00:28:08,950 --> 00:28:12,220
You could also import L1, so run that.

312
00:28:13,150 --> 00:28:13,630
Let's correct that:

313
00:28:13,630 --> 00:28:20,620
that's regularizers. Regularizers, OK. So that's fine.

314
00:28:21,400 --> 00:28:26,050
We get back to our model right here and then we have just L2.

315
00:28:26,260 --> 00:28:32,620
Now you could have this taken up from here and then you add it out here.

316
00:28:32,620 --> 00:28:37,090
So padding valid, activation ReLU, and that.

317
00:28:37,090 --> 00:28:42,390
Okay, so we have this and then we add our regularizer and that's fine.

318
00:28:42,400 --> 00:28:45,010
Now you could also do this for the dense layers.

319
00:28:45,010 --> 00:28:49,930
So just right here you could have kernel_regularizer and that should be fine.

320
00:28:49,930 --> 00:28:54,350
So this is how we implement the weight decay with TensorFlow.
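Putting the last few steps together, a sketch of what such a model definition could look like with the Keras API in TensorFlow. The layer sizes, the 32x32x3 input shape, and the variable names `dropout_rate` and `regularization_rate` are assumptions for illustration and are not the exact model built in this course.
```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, Input, MaxPool2D
from tensorflow.keras.regularizers import L2
dropout_rate = 0.2
regularization_rate = 0.01
model = Sequential([
    Input(shape=(32, 32, 3)),  # assumed input size
    Conv2D(filters=6, kernel_size=3, strides=1, padding="valid", activation="relu",
           kernel_regularizer=L2(regularization_rate)),  # L2 penalty on the conv kernel
    MaxPool2D(pool_size=2),
    Dropout(rate=dropout_rate),  # randomly zeroes activations, during training only
    Flatten(),
    Dense(10, activation="relu", kernel_regularizer=L2(regularization_rate)),
    Dense(1, activation="sigmoid"),
])
# The Dropout layer adds no trainable parameters to the model.
dropout_layers = [layer for layer in model.layers if isinstance(layer, Dropout)]
dropout_param_count = dropout_layers[0].count_params()
```
Swapping `L2` for `L1` from the same module would give the L1 penalty instead.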

321
00:28:54,370 --> 00:28:56,500
Now you could always modify those parameters.

322
00:28:56,500 --> 00:29:09,310
So let's have the regularization rate equal to 0.01.

323
00:29:09,310 --> 00:29:10,150
So that's it.

324
00:29:10,150 --> 00:29:11,950
And then we run our model.

325
00:29:12,280 --> 00:29:13,240
That should be fine.

326
00:29:13,240 --> 00:29:17,050
And we could go ahead and retrain this model.

327
00:29:17,290 --> 00:29:19,030
That's it for this section.

328
00:29:19,030 --> 00:29:22,090
Thank you for getting up to this point and see you next.