1
00:00:11,600 --> 00:00:16,940
In this lecture, we are going to extend our understanding of convolution so that it looks a little

2
00:00:16,940 --> 00:00:19,220
more like what goes on in a neural network.

3
00:00:20,090 --> 00:00:25,370
First, remember that in general, a neural network accepts as input color images.

4
00:00:25,940 --> 00:00:29,630
We know that these are three-dimensional objects: height by width by color.

5
00:00:30,290 --> 00:00:36,200
But for all the examples of convolution we just looked at, both the image and the filter were two-dimensional.

6
00:00:36,680 --> 00:00:39,200
So how can we reconcile this discrepancy?

7
00:00:44,300 --> 00:00:50,810
Previously, we conceptualized an image as a square and the filter also as a square. Then we would

8
00:00:50,810 --> 00:00:55,080
slide this little square along the big square and do multiply add.

9
00:00:55,370 --> 00:00:56,570
And that's convolution.

10
00:00:57,290 --> 00:01:01,040
So if our image has three dimensions, we just go with the same analogy.

11
00:01:01,430 --> 00:01:05,180
Our color image is a big box and our filter is a little box.

12
00:01:05,570 --> 00:01:11,570
Then we slide this little box along the big box, and at each position we do a multiply add and that's

13
00:01:11,570 --> 00:01:12,290
a convolution.

14
00:01:17,380 --> 00:01:19,820
Here are the details. At each position,

15
00:01:19,840 --> 00:01:26,020
we used to sum over the height and width of the kernel because the kernel was 2D, but now since the

16
00:01:26,020 --> 00:01:29,500
kernel is 3D, we're going to sum over the color channel as well.

17
00:01:30,190 --> 00:01:34,210
Note that both the kernel and the image now have three color channels.

18
00:01:34,360 --> 00:01:36,520
If you recall, that's red, green, blue.

19
00:01:38,640 --> 00:01:43,560
The way that you can think of this is that instead of the filter being a grayscale pattern finder,

20
00:01:43,860 --> 00:01:45,540
it's now a color pattern finder.

21
00:01:46,260 --> 00:01:51,510
Suppose the filter is looking for a red circle, so there's a circle in the red channel.

22
00:01:52,350 --> 00:01:55,050
Now it's only going to match red circles.

23
00:01:55,140 --> 00:01:57,540
It won't match blue circles or green circles.

24
00:01:58,020 --> 00:02:01,200
This is opposed to the black and white filter, which has no color at all.

25
00:02:01,500 --> 00:02:04,200
If it saw a circle, then the pattern would be found.

26
00:02:04,320 --> 00:02:07,470
There's no concept of color in a black and white image.

27
00:02:12,630 --> 00:02:17,850
Ultimately, this picture is still not complete. Using what we just learned, this is what would happen.

28
00:02:18,510 --> 00:02:20,850
Our input image is a three dimensional object.

29
00:02:20,910 --> 00:02:22,380
Height by width by three.

30
00:02:22,950 --> 00:02:26,460
Our kernel is also a three dimensional object, K by K by three.

31
00:02:27,390 --> 00:02:30,660
The output of this convolution is only two dimensional.

32
00:02:30,960 --> 00:02:34,050
Height minus K plus one by width minus K plus one.

33
00:02:34,080 --> 00:02:37,830
And this is because we're summing over all three dimensions.
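To make the "sum over all three dimensions" idea concrete, here is a minimal sketch in plain Python. It assumes valid-mode convolution written in the cross-correlation form (no filter flipping), and the function name `conv2d_single_filter` is just a hypothetical name for illustration. The point to notice is that an H by W by C image and a K by K by C kernel give back a 2D result.

```python
def conv2d_single_filter(image, kernel):
    """Valid-mode convolution (cross-correlation form, an assumption) of an
    H x W x C image with a single K x K x C kernel, both as nested lists.
    Returns a 2D list of shape (H - K + 1) x (W - K + 1)."""
    H, W, C = len(image), len(image[0]), len(image[0][0])
    K = len(kernel)
    out = [[0.0] * (W - K + 1) for _ in range(H - K + 1)]
    for i in range(H - K + 1):
        for j in range(W - K + 1):
            total = 0.0
            # Multiply-add over kernel height, kernel width, AND color channel.
            for di in range(K):
                for dj in range(K):
                    for c in range(C):
                        total += image[i + di][j + dj][c] * kernel[di][dj][c]
            out[i][j] = total
    return out


# Toy data: a 4 x 5 x 3 "color image" and a 2 x 2 x 3 kernel of ones.
image = [[[1.0, 2.0, 3.0] for _ in range(5)] for _ in range(4)]
kernel = [[[1.0, 1.0, 1.0] for _ in range(2)] for _ in range(2)]
result = conv2d_single_filter(image, kernel)
print(len(result), len(result[0]))  # output height and width; no channel axis
```

The color axis disappears because it is summed over, exactly like the two spatial axes of the kernel.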

34
00:02:39,330 --> 00:02:43,350
And so the third dimension kind of gets summed out.

35
00:02:43,360 --> 00:02:46,680
Now, there's nothing necessarily wrong with the output being two-dimensional,

36
00:02:47,100 --> 00:02:50,100
although this should offend your sensibilities to a certain extent.

37
00:02:50,880 --> 00:02:56,700
Remember this idea of the uniformity of a neural network and how we have repeating structures.

38
00:02:57,570 --> 00:03:03,510
The feedforward ANN works because you can add any number of layers: one dense layer followed by another,

39
00:03:03,510 --> 00:03:04,440
followed by another.

40
00:03:05,220 --> 00:03:08,760
This is OK because the input and the output are of the same type.

41
00:03:09,390 --> 00:03:15,150
The input is a vector and the output is also a vector, which can then be the input into the next layer.

42
00:03:15,930 --> 00:03:21,540
But with our convolution so far, this doesn't seem to work. If the input is three-dimensional but the

43
00:03:21,540 --> 00:03:22,920
output is only two-dimensional,

44
00:03:23,280 --> 00:03:24,870
then we do not have this uniformity.

45
00:03:25,260 --> 00:03:30,750
We can't plug in the same kind of convolution right after the first one since the input is not the same

46
00:03:30,750 --> 00:03:31,380
as the output.

47
00:03:36,450 --> 00:03:37,950
But here's a question to consider.

48
00:03:38,790 --> 00:03:44,070
Remember that the filter can be thought of as a pattern finder, so it's looking for a particular pattern

49
00:03:44,160 --> 00:03:46,680
or, in other words, looking for a particular feature.

50
00:03:47,490 --> 00:03:52,860
But we know that in machine learning, we have multiple features, just like how each hidden unit in

51
00:03:52,860 --> 00:03:56,310
a neural network represents a different feature based on the inputs.

52
00:03:57,000 --> 00:03:58,140
That kind of makes sense.

53
00:03:58,740 --> 00:04:03,510
If we're looking at an image, we would want to identify multiple features in the image.

54
00:04:04,080 --> 00:04:07,950
For example, let's say you're building a facial recognition network.

55
00:04:08,520 --> 00:04:12,780
One filter might be looking for an eye and another filter might be looking for a nose.

56
00:04:13,380 --> 00:04:19,440
So it makes sense, then, that if we want to find multiple features, we should have multiple filters

57
00:04:19,440 --> 00:04:20,130
to find them.

58
00:04:25,260 --> 00:04:31,050
So consider for a second what would happen if you had two filters acting on the same input image. Call

59
00:04:31,050 --> 00:04:36,360
the input image A, call the first filter W1, and call the second filter W2.

60
00:04:37,530 --> 00:04:43,580
The output of A convolved with W1 is B1, and the output of A convolved with W2 is

61
00:04:43,580 --> 00:04:50,270
B2. Let's say for simplicity's sake that we're using same-mode convolution so that B1 and B2

62
00:04:50,510 --> 00:04:58,820
have the same height and width as A. So if A is H by W by 3, then B1 is H by W and B2 is

63
00:04:58,820 --> 00:05:00,020
also H by W.

64
00:05:00,920 --> 00:05:05,510
Now we have two outputs, B1 and B2, but with different features found in each.

65
00:05:06,290 --> 00:05:07,280
What should we do with them?

66
00:05:08,150 --> 00:05:10,310
Well, how about we simply stack them together?

67
00:05:11,270 --> 00:05:15,710
Now, all of a sudden, our output is three-dimensional: it's height by width by two.

68
00:05:16,310 --> 00:05:20,480
And in fact, we can add any number of filters to find any number of features.

69
00:05:21,020 --> 00:05:24,650
So nose, eyes, lips, eyebrows, chin, so on.

70
00:05:29,760 --> 00:05:31,770
Now, our input looks like our output.

71
00:05:31,920 --> 00:05:33,420
They are both three dimensional.

72
00:05:34,050 --> 00:05:39,780
This means we can have multiple layers of convolution, one followed by the other in a repeating pattern,

73
00:05:39,780 --> 00:05:40,950
just like we do with an ANN.

74
00:05:41,940 --> 00:05:47,070
But of course, just like with the ANN and dense layer, we want to vectorize this operation so that

75
00:05:47,070 --> 00:05:52,380
it can be done all in one go rather than doing, say, 10 different convolutions and then stacking them

76
00:05:52,380 --> 00:05:52,860
together.

77
00:05:57,970 --> 00:06:04,030
So here is the formula that represents how to do a complete convolution in a deep neural network so

78
00:06:04,030 --> 00:06:07,870
that both the input and the output can be three dimensional objects.

79
00:06:09,940 --> 00:06:12,010
Now, this is a lot to look at all at once.

80
00:06:12,400 --> 00:06:15,370
So let's try to consider each part one step at a time.

81
00:06:16,390 --> 00:06:18,400
First, let's consider the shape of everything.

82
00:06:19,120 --> 00:06:25,600
We know that A will be H by W by C1. You can think of C1 as the number of color channels the image

83
00:06:25,810 --> 00:06:26,320
has.

84
00:06:27,070 --> 00:06:34,480
Then we have the filter W, which has the shape C1 by K by K by C2. Note that this is now a four

85
00:06:34,480 --> 00:06:35,500
dimensional tensor.

86
00:06:37,120 --> 00:06:41,500
Then we have the output image B. Assuming that we're doing same-mode convolution,

87
00:06:41,740 --> 00:06:46,870
the height and width dimensions will be the same, H by W, but the third dimension will be C2.

88
00:06:47,650 --> 00:06:52,120
So in our equation for the convolution, we can see that it's pretty much the same as before,

89
00:06:52,480 --> 00:07:01,420
just with more indices. B(i, j, c) represents the image pixel value at row i, column j, and color channel c.

90
00:07:02,290 --> 00:07:06,670
Note that I use the term color loosely here because it doesn't really represent color anymore.

91
00:07:07,180 --> 00:07:12,730
We can have an arbitrary number of features, so C2 can be 10, 100, or 1,000.

92
00:07:13,330 --> 00:07:19,210
Conceptually, however, we're still just doing what we talked about previously, just individual convolutions,

93
00:07:19,210 --> 00:07:24,880
which output two dimensional images and then stacking all those images together to get a three dimensional

94
00:07:24,880 --> 00:07:25,480
object.
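Since the formula on the slide isn't captured in this transcript, here is a minimal plain-Python sketch of what it describes, under some assumptions: same-mode convolution with zero padding, an odd filter size K, and the cross-correlation form that deep learning libraries typically use. The function name `conv_layer` is hypothetical. A has shape H by W by C1, W has shape C1 by K by K by C2, and B comes out as H by W by C2.

```python
def conv_layer(A, W):
    """Same-mode convolution (zero padding, odd K assumed) of an H x W x C1
    input A with a C1 x K x K x C2 filter tensor W, both as nested lists.
    Returns B with shape H x W x C2: one 2D feature map per filter, stacked."""
    H, Wd, C1 = len(A), len(A[0]), len(A[0][0])
    K = len(W[0])
    C2 = len(W[0][0][0])
    pad = K // 2
    B = [[[0.0] * C2 for _ in range(Wd)] for _ in range(H)]
    for i in range(H):
        for j in range(Wd):
            for c in range(C2):              # one output channel per filter
                total = 0.0
                for di in range(K):
                    for dj in range(K):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < H and 0 <= jj < Wd:   # outside = zero padding
                            for cp in range(C1):           # sum over input channels
                                total += A[ii][jj][cp] * W[cp][di][dj][c]
                B[i][j][c] = total
    return B


# Toy data: a 4 x 4 x 3 input of ones and two 3 x 3 filters
# (filter 0 is all ones, filter 1 is all twos).
A = [[[1.0, 1.0, 1.0] for _ in range(4)] for _ in range(4)]
W = [[[[1.0, 2.0] for _ in range(3)] for _ in range(3)] for _ in range(3)]
B = conv_layer(A, W)
print(len(B), len(B[0]), len(B[0][0]))  # H, W, C2
```

Looping over `c` is the "individual convolutions, then stack" view; a vectorized library call does the same arithmetic in one go.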

95
00:07:30,650 --> 00:07:33,980
So let's just summarize what we learned in this lecture, since that was a bit long.

96
00:07:35,060 --> 00:07:41,360
First, we started with realizing that we only defined convolution for grayscale images and with corresponding

97
00:07:41,360 --> 00:07:42,650
two dimensional filters.

98
00:07:44,330 --> 00:07:49,580
This is, of course, not flexible enough for CNNs, which must accept color images as input.

99
00:07:50,300 --> 00:07:55,700
So then we said, OK, let's just extend that by doing a three dimensional dot product between a color

100
00:07:55,700 --> 00:07:59,930
image and a three dimensional filter with the same number of color channels.

101
00:08:00,530 --> 00:08:02,210
So we're dotting over the color channel.

102
00:08:03,230 --> 00:08:08,660
Next, we realized that this breaks uniformity so we can't stack layers together to build a deep neural

103
00:08:08,660 --> 00:08:09,180
network.

104
00:08:09,620 --> 00:08:11,490
And this is deep learning, after all.

105
00:08:11,540 --> 00:08:13,370
So we kind of need to be able to do that.

106
00:08:14,060 --> 00:08:20,030
And this is because when you convolve a 3D image with a 3D filter, you still get a 2D image as output.

107
00:08:21,800 --> 00:08:26,390
This is just like when you dot a 1D vector with another 1D vector, you get a scalar.

108
00:08:27,260 --> 00:08:32,570
So we extended this by realizing that we don't just want one filter that's going to find one single

109
00:08:32,570 --> 00:08:32,990
feature.

110
00:08:33,350 --> 00:08:35,090
We need to find multiple features.

111
00:08:35,720 --> 00:08:40,940
Therefore, we need multiple filters, which will output multiple 2-D images, which of course, can

112
00:08:40,940 --> 00:08:43,010
be stacked into a 3D image.

113
00:08:43,730 --> 00:08:49,220
Then we vectorized this operation so that both the input and output are three-dimensional.

114
00:08:54,290 --> 00:09:00,380
Now, you might be wondering: is it correct to call the final dimension color? That may be the case

115
00:09:00,380 --> 00:09:05,390
for the input into the neural network, which is always height by width by three, but this breaks down

116
00:09:05,390 --> 00:09:06,260
at every other layer.

117
00:09:07,040 --> 00:09:12,230
For example, if you have height by width by 10, then the 10 dimensions don't really represent color.

118
00:09:13,070 --> 00:09:17,690
But as we discussed conceptually, we're just finding different features in an image.

119
00:09:18,200 --> 00:09:23,780
So our terminology changes to cover this more general case: we call these feature maps.

120
00:09:24,410 --> 00:09:29,870
This makes sense because you can think of each 2-D image as a map, as in a literal map you can look

121
00:09:29,870 --> 00:09:35,240
at, which is the result of a convolution with some filter, which is looking for some feature.

122
00:09:36,630 --> 00:09:41,100
So this map is telling us where in the original image this feature can be found.

123
00:09:41,400 --> 00:09:46,470
So just like a map, the final output is just a stack of all these different maps.

124
00:09:46,860 --> 00:09:48,420
One map for each feature.

125
00:09:49,380 --> 00:09:54,990
And so the size of the final dimension is referred to as the number of channels or the number of feature

126
00:09:54,990 --> 00:09:56,730
maps rather than colors.

127
00:10:00,880 --> 00:10:06,130
Finally, I want to talk about how convolution is going to look as a neural network layer.

128
00:10:07,290 --> 00:10:11,100
If you recall, you can think of convolution as just matrix multiplication,

129
00:10:11,370 --> 00:10:17,880
but with shared weights. And as you know, in a feedforward ANN, we don't just have matrix multiplication

130
00:10:17,880 --> 00:10:18,600
at each layer.

131
00:10:18,960 --> 00:10:20,400
We do some other things as well.

132
00:10:20,970 --> 00:10:25,200
In particular, we can add a bias term and apply an activation function.

133
00:10:26,360 --> 00:10:28,740
Well, a convolution layer works the same way.

134
00:10:29,730 --> 00:10:35,130
We're going to add a bias term and apply an activation function, because remember, this allows us

135
00:10:35,130 --> 00:10:40,710
to find nonlinear features or nonlinear functions of the input, which are more powerful than simple

136
00:10:40,710 --> 00:10:41,790
linear functions.

137
00:10:46,920 --> 00:10:51,960
Importantly, we have to consider what the shape of the bias term should be, given that the output

138
00:10:51,960 --> 00:10:56,520
of the convolution is a three-dimensional image. In a dense layer,

139
00:10:56,790 --> 00:11:01,620
if we have an output vector of size M, then the bias term is also a vector of size M.

140
00:11:02,340 --> 00:11:05,640
In other words, there is one scalar for each element in the result.

141
00:11:06,390 --> 00:11:12,960
But for convolution, the bias term will not have the same shape as W convolved with X, which is a

142
00:11:12,960 --> 00:11:14,190
three dimensional image.

143
00:11:15,800 --> 00:11:22,010
Now, you might think, isn't this equation invalid if b does not have the same shape as W convolved with

144
00:11:22,010 --> 00:11:22,490
X?

145
00:11:23,300 --> 00:11:26,720
Technically, this shouldn't be allowed by the rules of matrix arithmetic.

146
00:11:27,140 --> 00:11:30,770
If we want to add two matrices together, they should have the same shape.

147
00:11:31,670 --> 00:11:37,310
Technically, this is true, but this is conceptual only, and this will actually work in code due to

148
00:11:37,310 --> 00:11:38,420
the rules of broadcasting.

149
00:11:39,350 --> 00:11:44,390
So the bias vector b is only one-dimensional, and it has size C2

150
00:11:44,780 --> 00:11:49,670
if W convolved with X has shape H by W by C2.

151
00:11:50,390 --> 00:11:55,550
In other words, the same bias term applies to every pixel for each feature map.
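Here is a small plain-Python sketch of what "the same bias for every pixel in a feature map" means. It spells out by hand what broadcasting does for you in a library like NumPy, where adding an array of shape (C2,) to one of shape (H, W, C2) works directly. The ReLU activation and the name `add_bias_and_relu` are assumptions for illustration; the lecture only says "an activation function".

```python
def add_bias_and_relu(Z, b):
    """Z is the H x W x C2 result of the convolution (nested lists) and b is a
    length-C2 bias vector. The same b[c] is added at every pixel of feature
    map c, then ReLU (an assumed choice of activation) is applied."""
    return [
        [[max(0.0, z + b[c]) for c, z in enumerate(pixel)] for pixel in row]
        for row in Z
    ]


# Toy 2 x 2 x 2 convolution output and one bias per feature map.
Z = [[[1.0, -2.0], [0.5, 0.5]],
     [[-1.0, 3.0], [2.0, -4.0]]]
b = [0.5, -1.0]
out = add_bias_and_relu(Z, b)
print(out)
```

Note that b has only C2 numbers in it, no matter how large H and W are.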

152
00:11:57,050 --> 00:12:02,570
As a side note, convolution is generally a commutative operation, so it doesn't really matter if you

153
00:12:02,570 --> 00:12:05,360
write W star X or X star W.

154
00:12:05,900 --> 00:12:11,510
Although this is not entirely true in the deep learning case, since W is four-dimensional and X is three

155
00:12:11,510 --> 00:12:12,050
dimensional.

156
00:12:12,800 --> 00:12:18,470
Nonetheless, this is just conceptual only, so let's pretend that W comes first so that it looks more

157
00:12:18,470 --> 00:12:19,850
like the dense layer equation.

158
00:12:24,980 --> 00:12:30,110
All right, so now that you understand how convolution and a full convolution layer work in a neural

159
00:12:30,110 --> 00:12:32,240
network, let's consider our savings.

160
00:12:33,080 --> 00:12:38,840
Remember that one way to look at convolution is that it's parameter sharing. Sharing parameters means

161
00:12:38,840 --> 00:12:43,730
that we use the same weights in multiple places, reducing the number of total weights we need.

162
00:12:44,480 --> 00:12:47,420
So let's do a quick calculation to determine our savings.

163
00:12:48,260 --> 00:12:51,890
Suppose our input image is of size 32 by 32 by three.

164
00:12:52,160 --> 00:12:53,540
A modest size image.

165
00:12:54,410 --> 00:12:58,700
Now, let's suppose the size of the filter is three by five by five by 64.

166
00:12:59,180 --> 00:13:01,490
So there are 64 output feature maps.

167
00:13:02,920 --> 00:13:07,120
Then our output image will have a shape 28 by 28 by 64.

168
00:13:07,750 --> 00:13:10,750
This is because 32 minus five plus one is 28.

169
00:13:11,230 --> 00:13:14,020
And let's just assume we're using valid mode convolution.

170
00:13:15,070 --> 00:13:20,560
The number of parameters we needed to compute this output is just three times five times five times

171
00:13:20,560 --> 00:13:22,990
64, which is 4,800.
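The arithmetic above is quick to check in Python. This is only a sketch of the calculation from the lecture; the helper name `valid_output_size` is hypothetical, and the count covers the filter weights only (the bias would add one more number per output feature map).

```python
def valid_output_size(n, k):
    """Output length along one axis for valid-mode convolution."""
    return n - k + 1


H, W, C1 = 32, 32, 3   # input image: 32 x 32 x 3
K, C2 = 5, 64          # filter: 3 x 5 x 5 x 64

out_shape = (valid_output_size(H, K), valid_output_size(W, K), C2)
conv_weights = C1 * K * K * C2

print(out_shape)       # shape of the output image
print(conv_weights)    # number of weights in the filter
```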

172
00:13:28,040 --> 00:13:33,260
Now, let's compare that to the case where we would use a feedforward neural network with a flattened image

173
00:13:33,500 --> 00:13:40,400
like we did in the previous section. Then our input would be of size 32 times 32 times 3, which is

174
00:13:40,400 --> 00:13:41,480
3,072.

175
00:13:43,580 --> 00:13:50,390
Our output again would be of size 28 by 28 by 64, which is 50,176.

176
00:13:51,170 --> 00:13:57,110
So if we used a dense layer, we would have a weight matrix of size 3,072 by

177
00:13:57,110 --> 00:14:02,510
50,176, which is approximately 154 million.

178
00:14:03,050 --> 00:14:06,470
Now that's a lot of parameters for just one layer of a neural network.

179
00:14:07,160 --> 00:14:12,890
In fact, compared to using convolution, we have about 32,000 times more parameters.
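The dense-layer side of the comparison can be worked through the same way. This sketch just redoes the lecture's numbers; the variable names are my own, and biases are left out on both sides.

```python
# Flattened input and output sizes for the equivalent dense layer.
dense_inputs = 32 * 32 * 3      # flattened 32 x 32 x 3 image
dense_outputs = 28 * 28 * 64    # flattened 28 x 28 x 64 output
dense_weights = dense_inputs * dense_outputs

# Weights in the 3 x 5 x 5 x 64 convolution filter, for comparison.
conv_weights = 3 * 5 * 5 * 64

ratio = dense_weights / conv_weights

print(dense_inputs, dense_outputs)
print(dense_weights)            # roughly 154 million
print(round(ratio))             # roughly 32,000 times more
```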

180
00:14:13,610 --> 00:14:18,350
As you can imagine, this would require a huge amount of RAM and a huge amount of compute.

181
00:14:19,070 --> 00:14:24,500
It would take longer and would perform suboptimally because, as we stated earlier, convolution is a

182
00:14:24,500 --> 00:14:25,280
pattern finder.

183
00:14:25,820 --> 00:14:31,010
We use the same filter to find the same pattern in multiple places. Without shared weights,

184
00:14:31,340 --> 00:14:34,460
the weights need to learn the pattern separately at every location

185
00:14:34,610 --> 00:14:37,100
the pattern could appear, which is suboptimal.

186
00:14:38,450 --> 00:14:40,460
This would be bad for generalization.

187
00:14:45,530 --> 00:14:52,270
Using this concept of convolution just being part of a neural network layer, it's easy to see how

188
00:14:52,280 --> 00:14:57,770
we're going to find these convolution filters through the process of training. At this point, we've

189
00:14:57,770 --> 00:15:03,110
come quite a long way since our first introduction to convolution, where we had the perspective that

190
00:15:03,110 --> 00:15:05,480
convolution is just an image modifier.

191
00:15:06,880 --> 00:15:09,460
The examples we used were blurring and edge detection.

192
00:15:10,180 --> 00:15:13,930
We took an image and then convolution modified it in some way.

193
00:15:14,710 --> 00:15:17,470
But now we have a deeper understanding of convolution.

194
00:15:17,980 --> 00:15:22,540
It's in fact a pattern finder and a shared parameter matrix multiplication.

195
00:15:23,170 --> 00:15:24,880
It's just a feature transformer.

196
00:15:25,660 --> 00:15:29,890
In this way, it's just like the neural network layers we'd seen earlier in the course.

197
00:15:30,340 --> 00:15:36,100
And so having this perspective, it becomes pretty obvious how these filters or feature transformers

198
00:15:36,460 --> 00:15:38,410
will be found during the training process.

199
00:15:38,980 --> 00:15:43,300
The answer is that it's all going to happen automatically when we call model.fit.

200
00:15:43,990 --> 00:15:49,300
This is going to end up doing gradient descent on our new weights and biases, which for convolution

201
00:15:49,630 --> 00:15:53,680
just happen to be a different shape than what we had before in the dense layer.