1
00:00:04,020 --> 00:00:10,890
Hi, and welcome to our summary chapter on CNNs. This one should help you put it all together so that

2
00:00:10,890 --> 00:00:16,740
you can come away from these lessons having a very good understanding of what convolutional neural networks

3
00:00:16,740 --> 00:00:21,210
do, how they're trained, and how they work in classifying images.

4
00:00:22,440 --> 00:00:23,490
So let's get started.

5
00:00:24,390 --> 00:00:27,030
So firstly, let's do a recap.

6
00:00:27,360 --> 00:00:29,010
Remember what convolutional layers were?

7
00:00:29,460 --> 00:00:34,800
They were the filters, or kernels, that we basically applied to the image as a sliding window to get

8
00:00:34,800 --> 00:00:36,030
the feature maps output.

9
00:00:36,480 --> 00:00:37,200
So convolution

10
00:00:37,200 --> 00:00:42,600
operations occur when we convolve our filters with the input image, sliding them over the image

11
00:00:42,600 --> 00:00:45,510
with a given stride and padding, which we'll go over shortly.

12
00:00:46,260 --> 00:00:48,720
And it produces the feature maps, which is what's here.

13
00:00:48,930 --> 00:00:49,740
That's what we get.

14
00:00:50,160 --> 00:00:55,080
And remember, there's a formula, which you will see in the next slide, that tells you how we

15
00:00:55,080 --> 00:00:59,160
get the size, how we determine the size of the feature map that we get.

16
00:00:59,970 --> 00:01:05,250
So remember, the size depends a lot on stride and padding as well.

17
00:01:05,790 --> 00:01:07,530
So we have this whole formula here.

18
00:01:07,890 --> 00:01:13,500
So you remember that we just padded zeros around the image initially; that allows us to maintain the

19
00:01:13,500 --> 00:01:17,970
size so we don't actually lose image size if we have padding equal to one.

20
00:01:18,630 --> 00:01:24,990
And by having a stride, remember when we increase the stride, we decrease the feature map size.

21
00:01:25,440 --> 00:01:30,690
So keeping the stride at one will generally maintain the size of the image, assuming that

22
00:01:30,690 --> 00:01:32,310
the padding is one as well.
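
As a rough sketch of the size formula mentioned above, here is the usual relationship between input size, filter size, stride, and padding along one dimension. The function name and the 28-pixel input are illustrative assumptions, not taken from the slides.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Feature map size along one dimension: (W - K + 2P) / S + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1


# Hypothetical 28-pixel-wide image with a 3x3 filter.
same_size = conv_output_size(28, 3, stride=1, padding=1)   # padding 1 keeps the size
no_padding = conv_output_size(28, 3, stride=1, padding=0)  # no padding shrinks it
strided = conv_output_size(28, 3, stride=2, padding=1)     # larger stride shrinks it more

print(same_size, no_padding, strided)
```

With a 3x3 filter, stride one and padding one keep the output the same size as the input, which matches the point made above.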

23
00:01:33,840 --> 00:01:39,390
Next, we looked at ReLU. ReLU was the activation function that basically allows the neural network,

24
00:01:39,390 --> 00:01:46,410
or CNN, to introduce nonlinearity to its learning capacity. And the ReLU activation function

25
00:01:46,410 --> 00:01:53,100
basically tells us that when we apply it to a feature map, all the values that are less than

26
00:01:53,370 --> 00:01:53,850
zero,

27
00:01:54,030 --> 00:02:01,140
the negative values, are clamped to zero, and all the values that are positive, above zero, are left alone.

28
00:02:01,410 --> 00:02:04,080
So that's how we end up with this rectified feature map here.
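
A minimal sketch of that clamping step, assuming the feature map is a plain list of rows. The values are made up for illustration.

```python
def relu(feature_map):
    """Clamp negative values to zero and leave positive values unchanged."""
    return [[max(0, value) for value in row] for row in feature_map]


# Hypothetical 2x3 feature map with a mix of negative and positive values.
feature_map = [[-3, 5, 0],
               [2, -1, 7]]
rectified = relu(feature_map)
print(rectified)  # [[0, 5, 0], [2, 0, 7]]
```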

29
00:02:05,880 --> 00:02:07,680
Next, we looked at max pooling.

30
00:02:08,370 --> 00:02:13,890
Remember, the max pooling operation is a way we can actually downsample or reduce the feature map size

31
00:02:14,220 --> 00:02:16,760
and still retain valuable information.

32
00:02:16,770 --> 00:02:23,580
It doesn't lose much in this process, and it reduces the number of parameters in our CNN, which is

33
00:02:23,580 --> 00:02:29,220
a good benefit, because the more parameters we have, the slower it is to train and the more tendency it has

34
00:02:29,250 --> 00:02:29,790
to overfit.

35
00:02:30,150 --> 00:02:32,310
It just becomes a more difficult problem at that point.

36
00:02:32,430 --> 00:02:39,780
So by having these downsampling max pooling operations, we can improve our training process considerably.

37
00:02:40,530 --> 00:02:47,130
And you can see with the max pooling here, we just pick the largest value in this square, in this square,

38
00:02:47,130 --> 00:02:48,360
in this square, and in this square.

39
00:02:48,720 --> 00:02:49,630
And then we put it here.

40
00:02:49,650 --> 00:02:51,450
And again, I still have the same error here.

41
00:02:51,840 --> 00:02:53,600
These have to be the largest values.

42
00:02:53,790 --> 00:02:56,100
This 167 should be 253.

43
00:02:56,400 --> 00:02:57,330
Apologies for that.
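
To make the "largest value in each square" idea concrete, here is a small sketch of 2x2 max pooling with a stride of two. The window size, the stride, and the numbers are assumptions for illustration; they are not the values from the slide.

```python
def max_pool_2x2(feature_map):
    """Keep the largest value from each non-overlapping 2x2 square."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            window = [feature_map[r][c], feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1]]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4x4 feature map; the output is 2x2.
feature_map = [[12, 20, 30, 0],
               [8, 12, 2, 0],
               [34, 70, 37, 4],
               [112, 100, 25, 12]]
pooled = max_pool_2x2(feature_map)
print(pooled)  # [[20, 30], [112, 37]]
```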

44
00:02:58,530 --> 00:03:00,990
Next, we took a look at the fully connected layer.

45
00:03:01,350 --> 00:03:08,280
The fully connected layer basically just flattens all of the feature map outputs from the max pool operation

46
00:03:08,790 --> 00:03:13,320
into one vector, a one-dimensional vector, one long vector.

47
00:03:13,320 --> 00:03:16,710
You can think of it as a layer where every node is connected to every node in the next layer.

48
00:03:17,130 --> 00:03:19,140
So these, one, two, three, are nodes here.

49
00:03:19,500 --> 00:03:23,640
And of course, it connects to this node, also to this node and vice versa.

50
00:03:24,180 --> 00:03:27,210
So if you have a lot of nodes here, it's a lot of connections going on.

51
00:03:28,260 --> 00:03:34,770
And then finally, we took a look at the softmax layer, which is the end layer of the CNN.

52
00:03:35,310 --> 00:03:39,270
And basically, this converts our scores to probabilities.

53
00:03:39,540 --> 00:03:40,950
It does so by using this formula here.
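
The formula on the slide is not captured in this transcript, so what follows is the standard softmax definition as a sketch: each score is exponentiated and divided by the sum of all exponentiated scores. The example scores are made up.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to one."""
    largest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three classes.
probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)       # roughly [0.66, 0.24, 0.10]
print(sum(probabilities))  # sums to one, up to floating point rounding
```

The largest score gets the largest probability, and the ordering of the classes is preserved.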

54
00:03:42,030 --> 00:03:45,690
So this is a full CNN put together.

55
00:03:46,110 --> 00:03:51,720
We have an input image, we have the conv layers, we have the max pool operations, we have the flatten function,

56
00:03:52,170 --> 00:03:55,470
which is connected to the fully connected layer here.

57
00:03:55,920 --> 00:04:01,620
And then we have the output nodes. The output nodes are basically the final nodes, where each output

58
00:04:01,620 --> 00:04:03,540
corresponds to a class score.

59
00:04:03,990 --> 00:04:07,440
And you can see that in the diagram we have here.

60
00:04:08,040 --> 00:04:13,080
This is another way to represent the CNN. I actually prefer this way because you can actually

61
00:04:13,080 --> 00:04:15,270
see and visualize the filter sizes here.

62
00:04:15,690 --> 00:04:19,200
You can see the 3D size of the feature maps as they're produced.

63
00:04:19,650 --> 00:04:22,350
And it looks nice as well.

64
00:04:22,530 --> 00:04:23,130
It's pretty cool.

65
00:04:23,760 --> 00:04:29,280
And these are the calculations to determine the number of parameters in each of these layers.

66
00:04:29,850 --> 00:04:35,190
And there are no trainable parameters here, in the max pool, flatten, and ReLU layers; those are just functions.

67
00:04:35,190 --> 00:04:38,670
And essentially the fully connected, or dense, layer,

68
00:04:38,670 --> 00:04:39,960
it's often called dense

69
00:04:40,410 --> 00:04:46,950
in some libraries and literature, that's where the bulk of the parameters are in this

70
00:04:46,950 --> 00:04:50,410
case, as well as in the second conv layer.

71
00:04:51,270 --> 00:04:57,330
And then we have the final output layer here with these parameters as well, summarized in this

72
00:04:57,330 --> 00:04:58,350
table right here.
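
The table itself is not in this transcript, so here is a sketch of how such parameter counts are usually computed. The layer sizes below (a 28x28 grayscale input, two 3x3 conv layers with 32 and 64 filters and no padding, a 2x2 max pool, a 128-unit dense layer, and 10 output classes) are assumptions chosen for illustration, not necessarily the network on the slide.

```python
def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Weights per filter plus one bias, times the number of filters."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(inputs, units):
    """One weight per input plus one bias, for every unit."""
    return (inputs + 1) * units


# Assumed architecture: 28x28x1 -> conv 3x3x32 -> conv 3x3x64 -> max pool 2x2.
conv1 = conv_params(3, 3, 1, 32)
conv2 = conv_params(3, 3, 32, 64)
# 28 -> 26 -> 24 after the two unpadded convs, then 12 after pooling.
flattened = 12 * 12 * 64
dense = dense_params(flattened, 128)
output = dense_params(128, 10)
total = conv1 + conv2 + dense + output

# Max pool, flatten, and ReLU contribute no trainable parameters.
print(conv1, conv2, dense, output, total)
```

Under these assumptions the dense layer holds the bulk of the parameters, with the second conv layer a distant second.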

73
00:05:00,210 --> 00:05:02,310
So this is the training process.

74
00:05:02,930 --> 00:05:04,910
Let's take a look at this.

75
00:05:05,330 --> 00:05:07,070
So we have a batch of images here.

76
00:05:07,490 --> 00:05:10,190
These images are fed into the network.

77
00:05:10,670 --> 00:05:11,370
Now, we can feed them

78
00:05:11,390 --> 00:05:12,710
in one by one.

79
00:05:12,740 --> 00:05:18,780
However, we can also feed them in as a batch like that, and by feeding in the batch,

80
00:05:18,830 --> 00:05:20,180
we get these outputs here.

81
00:05:20,660 --> 00:05:27,920
Now, remember, we use these output scores to calculate a loss using the cross-entropy loss function.

82
00:05:28,550 --> 00:05:30,620
And now we have that quantified loss.
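
As a sketch of how that loss is quantified, assuming the outputs have already been turned into probabilities and each label is the index of the correct class: the cross-entropy for one image is the negative log of the probability given to the correct class, averaged over the batch. The probabilities and labels below are made up.

```python
import math


def cross_entropy(probabilities, true_class):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probabilities[true_class])


def batch_loss(batch_probabilities, batch_labels):
    """Average the per-image cross-entropy over the batch."""
    losses = [cross_entropy(probs, label)
              for probs, label in zip(batch_probabilities, batch_labels)]
    return sum(losses) / len(losses)


# Hypothetical batch of two images and three classes.
batch_probabilities = [[0.7, 0.2, 0.1],
                       [0.1, 0.1, 0.8]]
batch_labels = [0, 2]
loss = batch_loss(batch_probabilities, batch_labels)
print(loss)
```

A confident correct prediction gives a loss near zero, while a confident wrong prediction gives a large loss.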

83
00:05:30,980 --> 00:05:36,440
We then use that loss to apply backpropagation to update the weights.

84
00:05:36,950 --> 00:05:39,800
So let's take a look at the overview of the training process.

85
00:05:40,220 --> 00:05:47,450
So remember, I said we have the loss function values from the forward-propagated batch of

86
00:05:47,450 --> 00:05:54,850
images, and then we use backpropagation to propagate that loss back and update the weights using the gradients as it

87
00:05:54,860 --> 00:06:01,740
moves from right to left, going back into the network. And then the process by which we apply backpropagation

88
00:06:01,760 --> 00:06:08,570
is gradient descent. Gradient descent effectively encompasses backpropagation, but it basically

89
00:06:08,570 --> 00:06:14,900
speaks about the algorithms that optimize how we update, I mean, how much we update the weights by,

90
00:06:15,260 --> 00:06:17,240
and that's the lambda parameter, which we saw.

91
00:06:17,750 --> 00:06:23,000
So in this overview, we took a look at how a standard model is designed and defined,

92
00:06:23,690 --> 00:06:29,660
how the weights are initialized with random values, and how batches of images, typically eight to 256, depending

93
00:06:29,660 --> 00:06:35,600
on whatever can fit into GPU or system RAM, are then forward propagated through the CNN model.

94
00:06:36,110 --> 00:06:43,040
Then we use backpropagation with, typically, mini-batch gradient descent or something like Adam, or any

95
00:06:43,040 --> 00:06:47,390
one of those optimization algorithms, and we update the weights right to left.

96
00:06:48,020 --> 00:06:53,660
And then we basically use the updates to produce weights.

97
00:06:54,110 --> 00:06:56,360
The updates produce weights that have a lower loss.

98
00:06:56,990 --> 00:06:59,660
And this is done for the entire dataset.

99
00:07:00,020 --> 00:07:01,670
So we keep forward

100
00:07:01,670 --> 00:07:07,880
propagating these images continuously into the network until we've completed the entire dataset.

101
00:07:08,180 --> 00:07:12,860
And that's called one epoch, and we typically train for five to 50 epochs.

102
00:07:13,280 --> 00:07:15,370
You can train for more; you can train for 500.

103
00:07:16,340 --> 00:07:21,230
It's basically diminishing returns at some point, because eventually the loss stops decreasing.

104
00:07:21,860 --> 00:07:25,040
So you tend to use 50 epochs as a good ballpark figure.

105
00:07:25,310 --> 00:07:30,320
Sometimes you can get away with five, if the dataset is quite simple and the network is well

106
00:07:30,320 --> 00:07:30,860
optimized.

107
00:07:31,550 --> 00:07:33,980
So let's go over quickly

108
00:07:33,980 --> 00:07:39,830
what batches, mini-batches, iterations, and epochs are, because these terms confuse a lot of beginners.

109
00:07:40,670 --> 00:07:43,580
So, we can feed images in just one at a time.

110
00:07:43,730 --> 00:07:49,820
But as we saw previously, using batches allows for a much faster training process.

111
00:07:50,660 --> 00:07:52,340
But what are mini-batches?

112
00:07:52,670 --> 00:07:55,790
Well, a mini-batch is effectively batching itself.

113
00:07:56,270 --> 00:07:56,720
It isn't.

114
00:07:56,720 --> 00:07:57,830
It isn't anything different.

115
00:07:57,840 --> 00:08:04,360
It's just what we call the process by which we batch data and feed it in to be forward propagated

116
00:08:04,400 --> 00:08:06,550
into the neural net. That's a mini-batch.

117
00:08:06,560 --> 00:08:08,240
Then, iteration.

118
00:08:08,690 --> 00:08:09,870
What is an iteration?

119
00:08:09,890 --> 00:08:17,960
Well, the number of iterations is the number of mini-batches that it takes to complete one epoch, and

120
00:08:17,960 --> 00:08:23,420
one epoch is basically where we finish passing the entire data set to the network.
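
A small arithmetic sketch of how these terms relate, using a made-up dataset size and batch size: the number of iterations in one epoch is the dataset size divided by the batch size, rounded up so the last partial batch still counts.

```python
import math


def iterations_per_epoch(dataset_size, batch_size):
    """Number of mini-batches needed to pass the whole dataset once."""
    return math.ceil(dataset_size / batch_size)


# Hypothetical dataset of 60,000 images with mini-batches of 128.
per_epoch = iterations_per_epoch(60000, 128)
total_iterations = per_epoch * 50  # training for 50 epochs
print(per_epoch, total_iterations)  # 469 23450
```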

121
00:08:24,950 --> 00:08:28,310
So what are some advantages of CNNs, just to recap?

122
00:08:28,830 --> 00:08:30,120
They're scale invariant.

123
00:08:30,140 --> 00:08:36,380
Remember how max pooling basically kind of ensures that we don't lose that invariance as we

124
00:08:36,380 --> 00:08:44,030
downsample. The filters then allow for parameter sharing, because of the highly correlated pixels in

125
00:08:44,030 --> 00:08:44,540
an image.

126
00:08:45,050 --> 00:08:51,050
This allows for parameter sharing with weights, and also, because we're using the same filters

127
00:08:51,440 --> 00:08:52,850
for all parts of the image.

128
00:08:53,270 --> 00:08:55,730
that basically enables parameter sharing as well

129
00:08:55,730 --> 00:09:01,130
in the conv filters, which is a very good thing. It reduces the number of parameters

130
00:09:01,130 --> 00:09:07,130
we need to learn from the images. And then the sparse connections point basically expands on that.

131
00:09:07,580 --> 00:09:12,710
That's also because of the highly correlated pixels and because we're also sharing parameters.

132
00:09:12,710 --> 00:09:12,920
We do.

133
00:09:13,190 --> 00:09:14,840
We need a lot fewer connections.

134
00:09:15,230 --> 00:09:16,640
So overall, it's a good thing.

135
00:09:17,570 --> 00:09:19,400
And you remember these are the assumptions.

136
00:09:19,610 --> 00:09:25,120
CNNs make: low-level features are local, so features that are grouped close together are local.

137
00:09:25,510 --> 00:09:27,170
So that's what local means.

138
00:09:27,530 --> 00:09:29,060
Local features are basically simple

139
00:09:29,060 --> 00:09:35,420
features. Features are translation invariant, which means they can be anywhere on the image, and high-level

140
00:09:35,420 --> 00:09:39,360
features are made up of low-level features, combinations of them.

141
00:09:40,160 --> 00:09:42,620
So that's the recap of CNNs.

142
00:09:43,280 --> 00:09:46,580
So hopefully it solidifies everything we've done so far.

143
00:09:47,810 --> 00:09:51,860
We'll stop there for now, and what we'll do now is take a look

144
00:09:52,100 --> 00:09:55,920
at, basically, a history lesson of deep learning and AI.

145
00:09:56,240 --> 00:09:56,690
Thank you.