1
00:00:00,360 --> 00:00:04,890
Now, let's take a look at another regularization technique called dropout.

2
00:00:05,670 --> 00:00:07,110
So what is dropout?

3
00:00:07,140 --> 00:00:13,980
Well, it's a different regularization technique, and unlike L1 and L2 regularization, dropout doesn't modify

4
00:00:13,980 --> 00:00:14,790
the loss function.

5
00:00:15,150 --> 00:00:17,880
Instead, it modifies the network itself.

6
00:00:18,330 --> 00:00:25,140
It works by randomly dropping out parts of a layer, by setting to zero a number of nodes in the feature

7
00:00:25,140 --> 00:00:26,670
maps of that layer during training.

8
00:00:27,210 --> 00:00:30,590
So let's take a look at how this algorithm is implemented.

9
00:00:30,660 --> 00:00:33,930
So here is an illustration of the dropout process.

10
00:00:34,410 --> 00:00:36,210
This is a neural network here.

11
00:00:36,210 --> 00:00:37,170
We have all the nodes.

12
00:00:37,440 --> 00:00:39,020
These are just fully connected nodes here

13
00:00:39,030 --> 00:00:39,810
in this example.

14
00:00:40,350 --> 00:00:46,470
And what dropout does during training time: we start by randomly choosing half of our nodes, or whatever

15
00:00:46,470 --> 00:00:47,260
value we set.

16
00:00:47,310 --> 00:00:53,670
This is a parameter; half is what we'll use for now, as the fraction of nodes in our network to delete or turn off temporarily.

17
00:00:54,090 --> 00:01:00,270
So what that means is that when we forward propagate our mini-batch through the model and backpropagate

18
00:01:00,270 --> 00:01:08,070
the gradients, we're only doing it for a modified half of the network. Now, for the next

19
00:01:08,070 --> 00:01:08,610
mini-batch,

20
00:01:08,910 --> 00:01:14,340
we then restore all the nodes and randomly delete another set of nodes for that mini-batch.

21
00:01:14,760 --> 00:01:20,460
So as you can see, in every training cycle, or mini-batch, different nodes are being trained and different

22
00:01:20,460 --> 00:01:21,660
nodes are being turned off.
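The per-mini-batch masking just described can be sketched in NumPy. This is only a minimal illustration, not how any particular library implements it: the helper name `dropout_mask`, the layer width of 8, and the rate of 0.5 are assumptions made up for the example. Each node is dropped independently, so about half are turned off on average rather than exactly half.

```python
import numpy as np

def dropout_mask(num_nodes, drop_rate, rng):
    """Return a 0/1 mask; each node is dropped independently with probability drop_rate."""
    return (rng.random(num_nodes) >= drop_rate).astype(np.float64)

rng = np.random.default_rng(seed=0)
activations = np.ones(8)  # hypothetical activations of one fully connected layer

# A fresh mask is drawn for every mini-batch, so a different
# subset of nodes is switched off each time.
for step in range(3):
    mask = dropout_mask(activations.size, 0.5, rng)
    dropped = activations * mask  # dropped nodes output zero for this mini-batch
    print(step, dropped)
```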

23
00:01:22,380 --> 00:01:24,180
Well, so what effect does this have?

24
00:01:24,660 --> 00:01:30,630
Well, firstly, the fraction of nodes that dropout drops depends on the dropout rate,

25
00:01:30,630 --> 00:01:33,120
which is basically a parameter we set the dropout to.

26
00:01:33,150 --> 00:01:39,000
Usually, it's 0.3, which means that in every training mini-batch we're dropping 0.3, or 30

27
00:01:39,000 --> 00:01:40,590
percent of our nodes here.

28
00:01:41,160 --> 00:01:43,510
But how does dropout achieve this?

29
00:01:43,530 --> 00:01:47,100
That is, how does it achieve this generalization and reduction of overfitting?

30
00:01:47,610 --> 00:01:53,280
Well, it forces the network to learn more robust and reliable features because it acts like we trained

31
00:01:53,280 --> 00:01:56,040
several different networks during the training process.

32
00:01:56,430 --> 00:02:02,580
So intuitively, it's a very, very good way of getting a model to generalize quite well without

33
00:02:02,580 --> 00:02:07,950
putting too much emphasis on certain features, which would cause a model to overfit.

34
00:02:08,580 --> 00:02:14,190
But it effectively doubles the number of iterations required to make our model converge.

35
00:02:14,670 --> 00:02:20,010
When we say converge, we basically mean our model finds a global minimum, or at least gets

36
00:02:20,010 --> 00:02:20,850
very close to it.

37
00:02:21,390 --> 00:02:27,350
So the longer it takes, the more training time we need: more GPU power and more actual time.

38
00:02:27,750 --> 00:02:34,410
So it's not always a good thing, but generally I would always use dropout when training a model.

39
00:02:34,410 --> 00:02:38,940
In testing, just remember this: we use all the activations, that is, all the nodes,

40
00:02:39,330 --> 00:02:44,250
but reduce them by a factor of p to account for the missing activations during training.

41
00:02:44,430 --> 00:02:45,540
That's important to note.
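To make the test-time scaling concrete, here is a small NumPy sketch. It assumes p means the probability of keeping a node (one minus the dropout rate), and the names `dropout_train`, `dropout_test`, and `keep_prob` are made up for this illustration. The point is that scaling by p at test time brings the average activation in line with what the next layer saw during training.

```python
import numpy as np

def dropout_train(x, keep_prob, rng):
    """Training: zero each activation independently with probability 1 - keep_prob."""
    mask = rng.random(x.shape) < keep_prob
    return x * mask

def dropout_test(x, keep_prob):
    """Testing: keep every node, but scale the activations by keep_prob."""
    return x * keep_prob

rng = np.random.default_rng(seed=0)
x = np.full(100_000, 2.0)  # hypothetical activations, all equal to 2.0
keep_prob = 0.7            # assumed: corresponds to a dropout rate of 0.3

train_mean = dropout_train(x, keep_prob, rng).mean()
test_mean = dropout_test(x, keep_prob).mean()

# The two averages should be close to each other (both near 2.0 * 0.7).
print(train_mean, test_mean)
```

Many libraries instead do the rescaling during training (often called inverted dropout), dividing the kept activations by the keep probability so that nothing has to change at test time.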

42
00:02:45,540 --> 00:02:52,870
It's not really that important to be aware of, because behind the scenes the deep learning libraries

43
00:02:53,250 --> 00:02:55,590
are taking care of all of those things for us.

44
00:02:56,130 --> 00:02:59,140
But nevertheless, this is an important thing to know.

45
00:02:59,140 --> 00:03:01,830
And that's just in case your mind is curious about it.

46
00:03:04,430 --> 00:03:11,180
Next, we'll take a look at a very, very useful regularization technique, which is called data augmentation.

47
00:03:11,660 --> 00:03:13,520
So I'll see you in the next lesson.

48
00:03:13,680 --> 00:03:13,910
Thank you.