1
00:00:02,980 --> 00:00:06,610
So now let's take a look at a final regularization technique.

2
00:00:07,030 --> 00:00:12,130
And this one's called batch normalization, which is a very, very good technique.

3
00:00:12,160 --> 00:00:17,980
However, it's a bit tricky to understand, but it's also very simple to implement, so it's give and

4
00:00:17,980 --> 00:00:18,310
take.

5
00:00:19,000 --> 00:00:24,370
So firstly, before I explain batch normalization, let's talk about some deep learning problems.

6
00:00:24,640 --> 00:00:30,670
Firstly, training deep learning networks is slow and requires a lot of tweaking, parameter adjustments,

7
00:00:30,670 --> 00:00:36,610
little adjustments, a lot of experimentation in general, and obviously lots of data.

8
00:00:37,450 --> 00:00:44,070
So remember, models are updated layer by layer, backwards towards the input, right to left, using back

9
00:00:44,080 --> 00:00:44,740
propagation.

10
00:00:45,490 --> 00:00:50,080
This assumes that the weights in the prior layers are fixed.

11
00:00:50,950 --> 00:00:55,270
So here's a quote I took from the Deep Learning Handbook, page 317.

12
00:00:55,840 --> 00:00:56,860
I'll read it out for you.

13
00:00:57,400 --> 00:01:03,520
Remember this: very deep models involve the composition of several functions or layers.

14
00:01:04,090 --> 00:01:09,800
The gradient tells how to update each parameter under the assumption that the other layers do not change.

15
00:01:09,820 --> 00:01:12,250
That's what I'm saying here. In practice,

16
00:01:12,250 --> 00:01:15,070
we update all the layers simultaneously.

17
00:01:15,580 --> 00:01:17,200
So that's a fallacy in a way.

18
00:01:18,070 --> 00:01:21,220
So take a look at, remember, how backpropagation works.

19
00:01:21,790 --> 00:01:25,270
This is just a quick recap to show you that it's moving right to left.

20
00:01:25,630 --> 00:01:31,330
So it's challenging during training, as the inputs change when the parameters of the previous layers themselves

21
00:01:31,330 --> 00:01:31,870
change.

22
00:01:32,500 --> 00:01:39,060
This results in slower training, as it requires smaller learning rates and careful parameter initialization.

23
00:01:41,210 --> 00:01:43,420
And we have to go through all of this again, but that's OK.

24
00:01:44,020 --> 00:01:47,710
Just a nice little graphical illustration of back propagation.

25
00:01:48,970 --> 00:01:50,650
So what is batch normalization?

26
00:01:50,740 --> 00:01:52,150
Well, batch norm for short.

27
00:01:52,150 --> 00:01:58,960
It is a technique that helps coordinate the update of multiple layers in a model. It standardizes,

28
00:01:58,960 --> 00:01:59,960
by rescaling,

29
00:01:59,970 --> 00:02:02,530
so we have a mean of zero and a standard deviation of one,

30
00:02:03,010 --> 00:02:10,270
the activations of the prior layer, thus scaling the output of the layer. It reparameterizes the model to

31
00:02:10,270 --> 00:02:13,180
make some units always be standardized by definition.

32
00:02:13,720 --> 00:02:16,480
This allows it to reduce internal covariate shift.

33
00:02:16,930 --> 00:02:20,770
Now this is getting a bit complicated, so I don't expect you guys to follow all of this.

34
00:02:21,100 --> 00:02:26,350
However, if you want to get a deeper understanding or have a deeper understanding of batch normalization,

35
00:02:26,800 --> 00:02:30,010
I'd encourage you to read this paper here called batch normalization.

36
00:02:30,700 --> 00:02:31,880
You can find it at this link.

37
00:02:31,900 --> 00:02:37,930
This URL here, and this is how it defines what internal covariate shift is.

38
00:02:38,350 --> 00:02:44,500
We define internal covariate shift as the change in the distribution of network activations due to the

39
00:02:44,500 --> 00:02:47,740
change in network parameters during training. A bit confusing,

40
00:02:47,750 --> 00:02:53,020
I admit. However, just remember batch norm is a technique that helps coordinate the update of multiple

41
00:02:53,020 --> 00:02:53,950
layers in a model.

42
00:02:54,490 --> 00:03:02,260
And it does that by actually standardizing the outputs of the activations of

43
00:03:02,260 --> 00:03:03,100
the previous layer.

44
00:03:03,400 --> 00:03:04,630
That's basically what it does.

45
00:03:05,080 --> 00:03:07,870
So let's take a look at how we implement batch normalization.

46
00:03:08,680 --> 00:03:13,300
So recall that the output of a convolutional layer is four-dimensional.

47
00:03:13,660 --> 00:03:19,180
It has batch size, feature map height, feature map width, and the number of channels, meaning the

48
00:03:19,180 --> 00:03:26,590
depth. So batch norm calculates the mean and the standard deviation of each input variable associated

49
00:03:26,590 --> 00:03:30,700
with that layer per mini-batch and uses this to perform the standardization.

50
00:03:31,240 --> 00:03:33,700
In CNNs, remember, filter weights are shared.

51
00:03:34,150 --> 00:03:37,240
That's the same filter applied to all parts of the image, which is a good thing.

52
00:03:37,900 --> 00:03:43,600
Therefore, this means that the mean and standard deviation are taken for each filter per mini-batch,

53
00:03:43,600 --> 00:03:48,220
giving us either three means and standard deviations, or one if it's a grayscale image.
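
As a rough sketch of that per-channel calculation, assuming NumPy and a channels-last layout of batch size, height, width, channels. The function name `batch_norm_per_channel` and the `eps` value are just illustrative assumptions, and the learnable scale and shift, plus the running averages that real batch norm layers keep for inference, are left out:

```python
import numpy as np

def batch_norm_per_channel(x, eps=1e-5):
    """Standardize a (batch, height, width, channels) array per channel.

    Illustrative helper only: no learnable scale/shift, no running averages.
    """
    # One mean and one variance per channel, pooled over the batch
    # and both spatial axes, since the filter is shared across the image.
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    # eps avoids division by zero when a channel has no variance.
    x_hat = (x - mean) / np.sqrt(var + eps)
    return x_hat, mean.reshape(-1), var.reshape(-1)

rng = np.random.default_rng(0)

# A mini-batch of 8 three-channel 4x4 inputs: three means, three variances.
rgb = rng.normal(loc=5.0, scale=3.0, size=(8, 4, 4, 3))
rgb_hat, rgb_mean, rgb_var = batch_norm_per_channel(rgb)

# A single-channel (grayscale) mini-batch: one mean, one variance.
gray = rng.normal(loc=5.0, scale=3.0, size=(8, 4, 4, 1))
gray_hat, gray_mean, gray_var = batch_norm_per_channel(gray)
```

After this, each channel of `rgb_hat` has a mean close to zero and a standard deviation close to one across the mini-batch.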

54
00:03:49,630 --> 00:03:52,630
So here's some advice on how to use batch norm properly.

55
00:03:53,020 --> 00:03:54,910
Batch norm is actually super easy to implement.

56
00:03:55,510 --> 00:04:00,550
It's as easy as just adding a batch norm layer in Keras, and similarly in PyTorch as well.

57
00:04:01,690 --> 00:04:06,910
So when we're using it with dropout, this is the recommended order in which we do these things: we have

58
00:04:06,910 --> 00:04:09,380
a conv layer, batch norm, ReLU, dropout,

59
00:04:09,820 --> 00:04:12,970
then another conv layer, if you want to continue having more layers.

60
00:04:13,900 --> 00:04:19,270
So generally it's used between the conv layer and the activation function, as you can see here.

61
00:04:20,200 --> 00:04:21,670
So we'll stop there for now.

62
00:04:22,180 --> 00:04:27,580
And next, we'll take a look at how and when we use regularization.

63
00:04:27,700 --> 00:04:30,670
So, I mean, I'll tell you generally, you should always use it.

64
00:04:31,000 --> 00:04:35,020
But there are some specific cases where you may want to use it or may not want to use it.

65
00:04:35,470 --> 00:04:37,520
So we'll take a look at that in the next section.

66
00:04:37,600 --> 00:04:38,080
Thank you.