1
00:00:11,580 --> 00:00:16,950
In this lecture, we are going to extend our understanding of convolution so that it looks a little

2
00:00:16,950 --> 00:00:19,470
more like what goes on in a neural network.

3
00:00:20,070 --> 00:00:25,590
First, remember that in general, a neural network accepts as input color images.

4
00:00:25,800 --> 00:00:29,850
We know that these are three dimensional objects: height by width by color.

5
00:00:30,150 --> 00:00:36,390
But for all the examples of convolution we just looked at, both the image and the filter were two dimensional.

6
00:00:36,480 --> 00:00:39,390
So how can we reconcile this discrepancy?

7
00:00:44,220 --> 00:00:49,680
Previously, we conceptualized an image as a square and the filter also as a square.

8
00:00:50,070 --> 00:00:56,790
Then we would slide this little square along the big square and do a multiply add, and that's convolution.

9
00:00:57,150 --> 00:01:01,290
So if our image has three dimensions, we just go with the same analogy.

10
00:01:01,350 --> 00:01:05,420
Our color image is a big box and our filter is a little box.

11
00:01:05,430 --> 00:01:11,550
Then we slide this little box along the big box, and at each position we do a multiply add and that's

12
00:01:11,550 --> 00:01:12,480
a convolution.

13
00:01:17,300 --> 00:01:18,650
Here are the details.

14
00:01:18,890 --> 00:01:24,440
At each position we used to sum over the height and width of the kernel because the kernel was 2D.

15
00:01:24,860 --> 00:01:29,720
But now, since the kernel is 3D, we're going to sum over the color channel as well.

16
00:01:30,080 --> 00:01:34,310
Note that both the kernel and the image for now have three color channels.

17
00:01:34,310 --> 00:01:36,740
If you recall, that's red, green, blue.

18
00:01:38,660 --> 00:01:43,790
The way that you can think of this is that instead of the filter being a grayscale pattern finder,

19
00:01:43,820 --> 00:01:45,740
it's now a color pattern finder.

20
00:01:46,130 --> 00:01:49,100
Suppose the filter is looking for a red circle.

21
00:01:49,190 --> 00:01:51,710
So there is a circle in the red channel.

22
00:01:52,250 --> 00:01:55,170
Now it's only going to match red circles.

23
00:01:55,190 --> 00:01:57,770
It won't match blue circles or green circles.

24
00:01:57,860 --> 00:02:01,350
This is opposed to the black and white filter, which has no color at all.

25
00:02:01,370 --> 00:02:04,280
If it saw a circle, then the pattern would be found.

26
00:02:04,310 --> 00:02:07,700
There's no concept of color in a black and white image.

27
00:02:12,640 --> 00:02:16,860
Ultimately, this picture is still not complete. Using what we just learned,

28
00:02:16,870 --> 00:02:18,070
this is what would happen.

29
00:02:18,430 --> 00:02:20,870
Our input image is a three dimensional object.

30
00:02:20,890 --> 00:02:22,570
Height by width by three.

31
00:02:22,900 --> 00:02:25,300
Our kernel is also a three dimensional object.

32
00:02:25,300 --> 00:02:26,650
K by K by three.

33
00:02:27,280 --> 00:02:34,060
The output of this convolution is only two dimensional: height minus K plus one by width minus K plus one.

34
00:02:34,060 --> 00:02:38,050
And this is because we're summing over all three dimensions.

35
00:02:39,280 --> 00:02:42,100
And so the third dimension kind of gets dotted out.
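
To make the shapes concrete, here is a minimal NumPy sketch of a single K by K by 3 filter sliding over a height by width by 3 image. The function name `conv_valid` and the random test data are made up for illustration, and the sketch assumes the filter is not flipped (the usual deep learning convention), which may differ from the definition used earlier in the course.

```python
import numpy as np

def conv_valid(image, kernel):
    """Valid-mode multiply-add of an H x W x C image with one K x K x C kernel.
    Assumes no kernel flip. Returns a 2D array."""
    H, W, C = image.shape
    K = kernel.shape[0]
    assert kernel.shape == (K, K, C)
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(H - K + 1):
        for j in range(W - K + 1):
            patch = image[i:i + K, j:j + K, :]   # K x K x C box under the filter
            out[i, j] = np.sum(patch * kernel)   # sum over height, width, and color
    return out

rng = np.random.default_rng(0)
image = rng.random((8, 10, 3))    # hypothetical small color image
kernel = rng.random((3, 3, 3))    # one 3 x 3 x 3 filter
out = conv_valid(image, kernel)
print(out.shape)  # (6, 8): the color dimension has been summed away
```

The output has only two dimensions because every multiply-add collapses the whole box, color channels included, into one number.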

36
00:02:42,880 --> 00:02:47,890
Now, there's nothing necessarily wrong with the output being two dimensional, although this should

37
00:02:47,890 --> 00:02:50,320
offend your sensibilities to a certain extent.

38
00:02:50,800 --> 00:02:56,920
Remember this idea of the uniformity of a neural network and how we have repeating structures?

39
00:02:57,490 --> 00:03:03,130
The feedforward ANN works because you can add any number of layers: one dense layer followed by

40
00:03:03,130 --> 00:03:04,630
another, followed by another.

41
00:03:05,080 --> 00:03:09,070
This is okay because the input and the output are of the same type.

42
00:03:09,340 --> 00:03:15,340
The input is a vector and the output is also a vector which can then be the input into the next layer.

43
00:03:15,820 --> 00:03:19,000
But for convolution, so far this doesn't seem to work.

44
00:03:19,270 --> 00:03:24,160
If the input is three dimensional, but the output is only two dimensional, then we do not have this

45
00:03:24,160 --> 00:03:25,110
uniformity.

46
00:03:25,150 --> 00:03:30,490
We can't plug in the same kind of convolution right after the first one, since the input is not the

47
00:03:30,490 --> 00:03:31,510
same as the output.

48
00:03:36,350 --> 00:03:38,180
But here's a question to consider.

49
00:03:38,750 --> 00:03:41,720
Remember that the filter can be thought of as a pattern finder.

50
00:03:41,960 --> 00:03:46,880
So it's looking for a particular pattern, or, in other words, looking for a particular feature.

51
00:03:47,360 --> 00:03:52,940
But we know that in machine learning we have multiple features, just like how each hidden unit in a

52
00:03:52,940 --> 00:03:56,540
neural network represents a different feature based on the inputs.

53
00:03:56,990 --> 00:03:58,400
That kind of makes sense.

54
00:03:58,610 --> 00:04:03,710
If we're looking at an image, we would want to identify multiple features in the image.

55
00:04:03,950 --> 00:04:09,710
For example, let's say you're building a facial recognition network. One filter might be looking

56
00:04:09,710 --> 00:04:13,010
for an eye and another filter might be looking for a nose.

57
00:04:13,250 --> 00:04:19,610
So it makes sense then that if we want to find multiple features, we should have multiple filters to

58
00:04:19,610 --> 00:04:20,270
find them.

59
00:04:25,180 --> 00:04:30,730
So consider for a second what would happen if you had two filters acting on the same input image.

60
00:04:30,760 --> 00:04:36,520
Call the input image A, call the first filter W1, and call the second filter W2.

61
00:04:37,680 --> 00:04:44,340
The output of A convolved with W1 is B1, and the output of A convolved with W2 is B2.

62
00:04:44,970 --> 00:04:50,790
Let's say for simplicity's sake that we're using same mode convolution so that B1 and B2 have the

63
00:04:50,790 --> 00:04:59,640
same height and width as A. So if A is H by W by three, then B1 is H by W and B2 is also H by

64
00:04:59,640 --> 00:05:00,060
W.

65
00:05:00,840 --> 00:05:05,760
Now we have two outputs, B1 and B2, but with different features found in each.

66
00:05:06,180 --> 00:05:07,500
What should we do with them?

67
00:05:08,010 --> 00:05:10,530
Well, how about we simply stack them together?

68
00:05:11,160 --> 00:05:14,220
Now all of a sudden our output is three dimensional.

69
00:05:14,250 --> 00:05:15,930
It's height by width by two.

70
00:05:16,200 --> 00:05:20,670
And in fact, we can add any number of filters to find any number of features.

71
00:05:20,910 --> 00:05:24,810
So nose, eyes, lips, eyebrows, chin, so on.
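
As a rough sketch of the stacking idea, the snippet below runs two filters over the same image in same mode and stacks the two 2D results into one 3D output. The helper name `conv_same` and the random data are hypothetical, and the sketch assumes an odd filter size, zero padding, and no filter flip.

```python
import numpy as np

def conv_same(image, kernel):
    """Same-mode multiply-add of an H x W x C image with one K x K x C kernel.
    Assumes odd K, zero padding, and no kernel flip. Returns an H x W array."""
    H, W, C = image.shape
    K = kernel.shape[0]
    p = K // 2
    padded = np.pad(image, ((p, p), (p, p), (0, 0)))  # zero-pad height and width only
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + K, j:j + K, :] * kernel)
    return out

rng = np.random.default_rng(0)
A = rng.random((6, 7, 3))      # input image, H x W x 3
W1 = rng.random((3, 3, 3))     # first filter
W2 = rng.random((3, 3, 3))     # second filter

B1 = conv_same(A, W1)          # H x W
B2 = conv_same(A, W2)          # H x W
B = np.stack([B1, B2], axis=-1)
print(B1.shape, B2.shape, B.shape)  # (6, 7) (6, 7) (6, 7, 2)
```

Adding a third or a hundredth filter just adds another slice along the last axis.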

72
00:05:29,690 --> 00:05:31,830
Now our input looks like our output.

73
00:05:31,850 --> 00:05:33,590
They are both three dimensional.

74
00:05:33,980 --> 00:05:39,770
This means we can have multiple layers of convolution, one followed by the other in a repeating pattern,

75
00:05:39,770 --> 00:05:41,750
just like we do with ANNs.

76
00:05:41,750 --> 00:05:47,060
But of course, just like with the ANN dense layer, we want to vectorize this operation so that

77
00:05:47,060 --> 00:05:52,250
it can be done all in one go rather than doing, say, ten different convolutions and then stacking

78
00:05:52,250 --> 00:05:53,000
them together.

79
00:05:57,860 --> 00:06:04,040
So here is the formula that represents how to do a complete convolution in a deep neural network so

80
00:06:04,040 --> 00:06:08,120
that both the input and the output can be three dimensional objects.

81
00:06:09,860 --> 00:06:10,220
Now,

82
00:06:10,220 --> 00:06:12,220
this is a lot to look at all at once.

83
00:06:12,230 --> 00:06:15,590
So let's try to consider each part one step at a time.

84
00:06:16,280 --> 00:06:18,680
First, let's consider the shape of everything.

85
00:06:19,040 --> 00:06:22,380
We know that A will be H by W by C1.

86
00:06:22,400 --> 00:06:26,540
You can think of C1 as the number of color channels the image A has.

87
00:06:26,960 --> 00:06:34,460
Then we have the filter W, which has the shape C1 by K by K by C2. Note that this is now a four

88
00:06:34,460 --> 00:06:35,660
dimensional tensor.

89
00:06:37,010 --> 00:06:38,390
Then we have the output image,

90
00:06:38,420 --> 00:06:43,730
B. Assuming that we're doing same mode convolution, the height and width dimensions will be the same,

91
00:06:43,730 --> 00:06:47,090
H by W, but the third dimension will be C2.

92
00:06:47,540 --> 00:06:52,970
So in our equation for the convolution, we can see that it's pretty much the same as before just with

93
00:06:52,970 --> 00:06:54,140
more indices.

94
00:06:54,470 --> 00:07:01,610
B(i, j, c) represents the image pixel value at row i, column j, and color channel c.

95
00:07:02,270 --> 00:07:06,860
Note that I use the term color loosely here because it doesn't really represent color anymore.

96
00:07:07,040 --> 00:07:12,920
We can have an arbitrary number of features, so C2 can be 10, 100, or 1000.

97
00:07:13,280 --> 00:07:19,160
Conceptually, however, we're still just doing what we talked about previously, just individual convolutions

98
00:07:19,160 --> 00:07:24,920
which output two dimensional images and then stacking all those images together to get a three dimensional

99
00:07:24,920 --> 00:07:25,670
object.
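
Here is one way the full operation might look in NumPy, as a sketch rather than the exact formula from the slide. It assumes the index convention B[i, j, c] = sum over i', j', c' of A_padded[i + i', j + j', c'] * W[c', i', j', c], with W shaped C1 by K by K by C2 as described above, same mode with zero padding, an odd K, and no filter flip. The function name `conv_layer_same` is made up for illustration.

```python
import numpy as np

def conv_layer_same(A, W):
    """Same-mode convolution of an H x W x C1 input with a C1 x K x K x C2 filter.
    Assumes odd K, zero padding, and no kernel flip. Returns an H x W x C2 array."""
    H, Wd, C1 = A.shape
    C1w, K, K2, C2 = W.shape
    assert C1 == C1w and K == K2 and K % 2 == 1
    p = K // 2
    Ap = np.pad(A, ((p, p), (p, p), (0, 0)))  # zero-pad height and width only
    B = np.zeros((H, Wd, C2))
    for i in range(H):
        for j in range(Wd):
            patch = Ap[i:i + K, j:j + K, :]   # K x K x C1
            # Sum over i', j', c' for all C2 output feature maps at once.
            B[i, j, :] = np.einsum('ijc,cijk->k', patch, W)
    return B

rng = np.random.default_rng(0)
A = rng.random((6, 7, 3))        # H x W x C1
W = rng.random((3, 3, 3, 4))     # C1 x K x K x C2 (here C1 and K both happen to be 3)
B = conv_layer_same(A, W)
print(B.shape)  # (6, 7, 4)
```

Each slice `B[:, :, c]` is the same 2D result you would get from convolving A with the single filter `W[:, :, :, c]`, so this is the stacking idea computed in one go.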

100
00:07:30,580 --> 00:07:34,240
So let's just summarize what we learned in this lecture, since that was a bit long.

101
00:07:34,960 --> 00:07:41,350
First, we started with realizing that we only defined convolution for grayscale images and with corresponding

102
00:07:41,380 --> 00:07:42,820
two dimensional filters.

103
00:07:44,280 --> 00:07:49,740
This is, of course, not flexible enough for CNNs, which must accept color images as input.

104
00:07:50,130 --> 00:07:55,680
So then we said, okay, let's just extend that by doing a three dimensional dot product between a color

105
00:07:55,680 --> 00:08:00,150
image and a three dimensional filter with the same number of color channels.

106
00:08:00,390 --> 00:08:02,400
So we're dotting over the color channel.

107
00:08:03,120 --> 00:08:08,670
Next, we realized that this breaks uniformity, so we can't stack layers together to build a deeper neural

108
00:08:08,670 --> 00:08:09,390
network.

109
00:08:09,480 --> 00:08:11,480
And this is deep learning, after all.

110
00:08:11,490 --> 00:08:13,560
So we kind of need to be able to do that.

111
00:08:13,950 --> 00:08:20,190
And this is because when you convolve a 3D image with a 3D filter, you still get a 2D image as output.

112
00:08:21,750 --> 00:08:26,550
This is just like when you dot a 1D vector with another 1D vector: you get a scalar.

113
00:08:27,120 --> 00:08:32,550
So we extended this by realizing that we don't just want one filter that's going to find one single

114
00:08:32,550 --> 00:08:33,200
feature.

115
00:08:33,210 --> 00:08:35,310
We need to find multiple features.

116
00:08:35,610 --> 00:08:41,070
Therefore, we need multiple filters, which will output multiple 2D images, which of course can be

117
00:08:41,070 --> 00:08:43,230
stacked into a 3D image.

118
00:08:43,560 --> 00:08:49,380
Then we vectorized this operation so that both the input and the output are three dimensional.

119
00:08:54,270 --> 00:08:58,770
Now you might be wondering, is it correct to call the final dimension color?

120
00:08:59,130 --> 00:09:04,140
That may be the case for the input into the neural network, which is always height by width by three.

121
00:09:04,230 --> 00:09:06,480
But this breaks down at every other layer.

122
00:09:06,900 --> 00:09:12,450
For example, if you have height by width by ten, then the ten dimensions don't really represent color.

123
00:09:12,960 --> 00:09:17,970
But as we discussed conceptually, we're just finding different features in an image.

124
00:09:18,000 --> 00:09:21,840
So our terminology changes to cover this more general case.

125
00:09:22,110 --> 00:09:24,000
We call these feature maps.

126
00:09:24,300 --> 00:09:29,880
This makes sense because you can think of each 2D image as a map, as in a literal map you can look

127
00:09:29,880 --> 00:09:35,460
at which is the result of a convolution with some filter which is looking for some feature.

128
00:09:36,520 --> 00:09:41,260
So this map is telling us where in the original image that this feature can be found.

129
00:09:41,290 --> 00:09:47,980
So just like a map, the final output is just a stack of all these different maps, one map for each

130
00:09:47,980 --> 00:09:48,610
feature.

131
00:09:49,240 --> 00:09:54,970
And so the size of the final dimension is referred to as the number of channels or the number of feature

132
00:09:54,970 --> 00:09:56,920
maps rather than colors.

133
00:10:00,860 --> 00:10:04,560
Finally, I want to talk about how convolution is going to look

134
00:10:04,580 --> 00:10:06,290
as a neural network layer.

135
00:10:07,240 --> 00:10:12,910
If you recall, you can think of convolution as just matrix multiplication, but with shared weights.

136
00:10:13,060 --> 00:10:18,820
And as you know, in a feedforward ANN, we don't just have matrix multiplication at each layer,

137
00:10:18,820 --> 00:10:20,620
we do some other things as well.

138
00:10:20,860 --> 00:10:25,420
In particular, we can add a bias term and apply an activation function.

139
00:10:26,200 --> 00:10:28,930
Well, a convolution layer works the same way.

140
00:10:29,720 --> 00:10:33,260
We're going to add a bias term and apply an activation function.

141
00:10:33,260 --> 00:10:39,290
Because remember, this allows us to find nonlinear features, or nonlinear functions of the input, which

142
00:10:39,290 --> 00:10:41,990
are more powerful than simple linear functions.

143
00:10:46,940 --> 00:10:51,950
Importantly, we have to consider what the shape of the bias term should be, given that the output

144
00:10:51,950 --> 00:10:56,680
of the convolution is a three dimensional image. In a dense layer,

145
00:10:56,690 --> 00:11:01,310
if we have a vector of size M, then our bias term is also a vector of size

146
00:11:01,310 --> 00:11:08,360
M. In other words, there is one scalar for each element in the result. But for convolution, the bias

147
00:11:08,360 --> 00:11:14,480
term will not have the same shape as W convolved with X, which is a three dimensional image.

148
00:11:15,840 --> 00:11:18,720
Now you might think, isn't this equation invalid

149
00:11:18,750 --> 00:11:22,680
if B does not have the same shape as W convolved with X?

150
00:11:23,340 --> 00:11:26,950
Technically, this shouldn't be allowed by the rules of matrix arithmetic.

151
00:11:26,970 --> 00:11:31,020
If we want to add two matrices together, they should have the same shape.

152
00:11:31,710 --> 00:11:37,350
Technically this is true, but this is conceptual only, and this will actually work in code due to

153
00:11:37,350 --> 00:11:38,820
the rules of broadcasting.

154
00:11:39,210 --> 00:11:44,580
So the bias vector B is only one dimensional, and it has the size C2

155
00:11:44,610 --> 00:11:49,890
if W convolved with X has the shape H by W by C2.

156
00:11:50,250 --> 00:11:55,740
In other words, the same bias term applies to every pixel for each feature map.
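
A small NumPy sketch of how broadcasting makes this addition work. The shapes and values are made up for illustration, `conv_out` is just a stand-in for W convolved with X, and ReLU is only one example of an activation function.

```python
import numpy as np

H, Wd, C2 = 4, 5, 3
conv_out = np.zeros((H, Wd, C2))    # stand-in for W convolved with X, shape H x W x C2
b = np.array([1.0, 2.0, 3.0])       # bias vector of size C2, one scalar per feature map

Z = conv_out + b                    # b is broadcast across the height and width axes
activation = np.maximum(Z, 0)       # ReLU, chosen here only as an example

print(Z.shape)        # (4, 5, 3)
print(Z[0, 0], Z[3, 4])  # every pixel gets the same bias for each feature map
```

NumPy lines up the trailing axis of size C2 with the bias vector and repeats it over every pixel, so no explicit reshaping is needed.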

157
00:11:57,000 --> 00:12:02,580
As a side note, convolution is generally a commutative operation, so it doesn't really matter if you

158
00:12:02,580 --> 00:12:10,110
write W * X or X * W, although this is not entirely true in the deep learning case, since W

159
00:12:10,110 --> 00:12:15,300
is four dimensional and X is three dimensional. Nonetheless, this is conceptual only.

160
00:12:15,300 --> 00:12:20,010
So let's pretend that W comes first so that it looks more like the dense layer equation.

161
00:12:24,960 --> 00:12:25,350
All right.

162
00:12:25,350 --> 00:12:30,660
So now that you understand how convolution and a full convolution layer works in a neural network,

163
00:12:30,660 --> 00:12:32,460
let's consider our savings.

164
00:12:33,060 --> 00:12:36,990
Remember that one way to look at convolution is that it's parameter sharing.

165
00:12:37,500 --> 00:12:42,990
Sharing parameters means that we use the same weights in multiple places, reducing the number of total

166
00:12:42,990 --> 00:12:43,950
weights we need.

167
00:12:44,340 --> 00:12:47,640
So let's do a quick calculation to determine our savings.

168
00:12:48,090 --> 00:12:53,790
Suppose our input image is of size 32 by 32 by three, a modest size image.

169
00:12:54,330 --> 00:12:58,890
Now let's suppose the size of the filter is three by five by five by 64.

170
00:12:59,040 --> 00:13:01,710
So there are 64 output feature maps.

171
00:13:02,840 --> 00:13:07,340
Then our output image will have the shape 28 by 28 by 64.

172
00:13:07,700 --> 00:13:10,940
This is because 32 minus five plus one is 28.

173
00:13:11,060 --> 00:13:14,210
And let's just assume we're using valid mode convolution.

174
00:13:14,990 --> 00:13:20,540
The number of parameters we needed to compute this output is just three times five times five times

175
00:13:20,540 --> 00:13:23,180
64, which is 4800.

176
00:13:27,990 --> 00:13:32,820
Now let's compare that to the case where we would use a feedforward neural network with a flattened

177
00:13:32,820 --> 00:13:35,370
image like we did in the previous section.

178
00:13:35,910 --> 00:13:41,670
Then our input would be 32 times 32 times three, which is 3072.

179
00:13:43,530 --> 00:13:50,580
Our output again would be of size 28 by 28 by 64, which is 50,176.

180
00:13:51,060 --> 00:13:59,760
So if we use the dense layer, we would have a weight matrix of size 3072 by 50,176, which

181
00:13:59,760 --> 00:14:02,670
is approximately 154 million.

182
00:14:02,910 --> 00:14:06,720
Now, that's a lot of parameters for just one layer of a neural network.

183
00:14:07,030 --> 00:14:13,080
In fact, compared to using convolution, we have about 32,000 times more parameters.
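
The arithmetic above can be checked in a few lines of Python. This counts weights only, as in the lecture, so bias terms are left out of both totals; the variable names are just for illustration.

```python
# Convolution layer: filter of shape 3 x 5 x 5 x 64 (weights only, no biases)
conv_params = 3 * 5 * 5 * 64

# Dense layer on the flattened image producing the same number of outputs
dense_in = 32 * 32 * 3             # flattened 32 x 32 x 3 input
dense_out = 28 * 28 * 64           # flattened 28 x 28 x 64 output
dense_params = dense_in * dense_out

ratio = dense_params / conv_params

print(conv_params)    # 4800
print(dense_in)       # 3072
print(dense_out)      # 50176
print(dense_params)   # 154140672
print(round(ratio))   # 32113
```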

184
00:14:13,500 --> 00:14:18,570
As you can imagine, this would require a huge amount of RAM and a huge amount of compute.

185
00:14:18,930 --> 00:14:24,840
It would take longer and would perform suboptimally because, as we stated earlier, convolution is a pattern

186
00:14:24,840 --> 00:14:25,500
finder.

187
00:14:25,680 --> 00:14:31,220
We use the same filter to find the same pattern in multiple places. Without shared weights,

188
00:14:31,230 --> 00:14:34,560
every weight needs to learn the pattern at every location

189
00:14:34,560 --> 00:14:37,230
the pattern could appear, which is suboptimal.

190
00:14:38,470 --> 00:14:40,630
This would be bad for generalization.

191
00:14:45,460 --> 00:14:52,240
Using this concept of convolution just being a part of a neural network layer, it's easy to see how

192
00:14:52,240 --> 00:14:56,320
we are going to find these convolution filters through the process of training.

193
00:14:56,710 --> 00:15:02,230
At this point, we've come quite a long way since our first introduction to convolution, where we had

194
00:15:02,230 --> 00:15:05,680
the perspective that convolution is just an image modifier.

195
00:15:06,790 --> 00:15:13,060
The examples we used were blurring and edge detection. We took an image and then convolution modified

196
00:15:13,060 --> 00:15:14,170
it in some way.

197
00:15:14,590 --> 00:15:17,650
But now we have a deeper understanding of convolution.

198
00:15:17,860 --> 00:15:22,750
It's, in fact, a pattern finder and a shared parameter matrix multiplication.

199
00:15:23,050 --> 00:15:25,120
It's just a feature transformer.

200
00:15:25,540 --> 00:15:30,140
In this way, it's just like the neural network layers we'd seen earlier in the course.

201
00:15:30,160 --> 00:15:36,310
And so having this perspective, it becomes pretty obvious how these filters or feature transformers

202
00:15:36,310 --> 00:15:38,620
will be found during the training process.

203
00:15:38,830 --> 00:15:43,480
The answer is that it's all going to happen automatically when we call model.fit.

204
00:15:43,870 --> 00:15:49,510
This is going to end up doing gradient descent on our new weights and biases, which for convolution

205
00:15:49,510 --> 00:15:53,830
just happen to be a different shape than what we had before in the dense layer.
