1
00:00:11,570 --> 00:00:17,900
In this lecture, we are going to discuss convolution in the preparation for building convolutional

2
00:00:17,900 --> 00:00:23,720
neural networks, of course, given the name, it's pretty obvious that a convolutional neural network

3
00:00:23,930 --> 00:00:26,410
is a neural network with convolution.

4
00:00:26,840 --> 00:00:32,420
And so in order to understand CNN, we must first understand convolution.

5
00:00:37,550 --> 00:00:43,640
First, I want to mention that people sometimes treat convolution as this mysterious thing, but as

6
00:00:43,640 --> 00:00:48,690
with our discussion of machine learning, we want to take the dumb as possible approach.

7
00:00:49,190 --> 00:00:51,500
Sure, convolution can be complex.

8
00:00:51,800 --> 00:00:55,310
It's a central operation in a signal processing and computer vision.

9
00:00:55,820 --> 00:00:57,310
But in fact, it's quite simple.

10
00:00:57,830 --> 00:00:59,510
There are only two requirements.

11
00:00:59,810 --> 00:01:01,220
First, can you add.

12
00:01:01,790 --> 00:01:03,450
Second, can you multiply?

13
00:01:04,100 --> 00:01:10,670
If your answer to these two questions is yes, then you understand convolution, believe it or not,

14
00:01:10,670 --> 00:01:13,130
convolution is just adding and multiplying.

15
00:01:18,220 --> 00:01:20,530
So let's start by not doing any math.

16
00:01:20,740 --> 00:01:25,150
Let's just look at convolution qualitatively with convolution.

17
00:01:25,180 --> 00:01:27,460
There are three objects you want to pay attention to.

18
00:01:28,000 --> 00:01:29,530
First is the input image.

19
00:01:30,070 --> 00:01:31,660
Second, there is the filter.

20
00:01:32,230 --> 00:01:35,170
Note that another name for the filter is the kernel.

21
00:01:35,170 --> 00:01:38,080
So we're going to use those two terms interchangeably.

22
00:01:39,280 --> 00:01:41,110
Finally, there's the output image.

23
00:01:41,680 --> 00:01:47,020
The output image is what you get when you convolve the input image with the filter.

24
00:01:47,650 --> 00:01:54,010
In other words, if I perform the convolution operation on the input, image and the filter, I get

25
00:01:54,010 --> 00:01:54,940
the output image.

26
00:01:55,780 --> 00:01:59,680
The mathematical symbol for convolution is a star or asterisk.

27
00:01:59,950 --> 00:02:05,150
Not to be confused with the asterisk that we use in computer programming for multiplication.

28
00:02:05,830 --> 00:02:11,770
So if you're running convolution in code, it will not be an asterisk because in code we already use

29
00:02:11,770 --> 00:02:13,390
the asterisk for multiplication.

30
00:02:18,400 --> 00:02:23,050
It's helpful to think of some examples of convolution to understand what it can do.

31
00:02:23,890 --> 00:02:25,660
There are two examples I really like.

32
00:02:26,410 --> 00:02:28,000
The first example is blurring.

33
00:02:28,660 --> 00:02:34,630
So the input image is just the original image and the output image is a blurred version of that image.

34
00:02:35,230 --> 00:02:38,920
You might recognize this operation from applications such as Photoshop.

35
00:02:44,110 --> 00:02:50,800
The second example I really like is edge detection as input, we have the original image and as output

36
00:02:50,800 --> 00:02:56,670
we get white lines where there are edges in the original image and we get black where there are no edges.

37
00:02:57,160 --> 00:03:01,420
So the output image highlights all the edges in the original input image.

38
00:03:06,570 --> 00:03:12,450
So that's your first lesson on how to view convolution, this perspective is that convolution is an

39
00:03:12,450 --> 00:03:13,760
image modifier.

40
00:03:14,340 --> 00:03:20,630
The input is the original image and the output is a modified or transformed version of that image.

41
00:03:21,210 --> 00:03:25,940
In other words, you might think of this as a feature transformation on the input image.

42
00:03:26,670 --> 00:03:29,850
You know, that sounds curiously like what a neuron that work does.

43
00:03:30,510 --> 00:03:34,080
In any case, what makes a blurring and edge detection actually work?

44
00:03:34,290 --> 00:03:39,390
If they are both convolution, what makes one convolution different from another convolution?

45
00:03:40,200 --> 00:03:41,640
Well, the answer is the filter.

46
00:03:42,330 --> 00:03:45,390
When you use a Gaussian filter, this blurs the image.

47
00:03:45,810 --> 00:03:49,020
When you use an edge detection filter, you get edge detection.

48
00:03:49,020 --> 00:03:51,420
You can think of that as sharpening the image.

49
00:03:52,380 --> 00:03:56,960
Your next question might be how do we find or design these filters in the first place?

50
00:03:57,630 --> 00:03:59,780
That's something we'll discuss later in the session.

51
00:04:00,090 --> 00:04:05,520
But the answer, as you might have come to expect, is just another dumb as possible approach.

52
00:04:10,600 --> 00:04:13,940
OK, so let's get into the nitty gritty of how convolution works.

53
00:04:14,440 --> 00:04:18,250
I promised you that all you need to know is how to add and multiply.

54
00:04:18,790 --> 00:04:19,840
Let's see if that's true.

55
00:04:20,830 --> 00:04:21,610
We're going to use it.

56
00:04:21,610 --> 00:04:27,400
Very tiny images for this example, although in reality, the actual images we work with will be much

57
00:04:27,400 --> 00:04:27,820
bigger.

58
00:04:28,480 --> 00:04:31,660
This is just so we can feasibly do these calculations by hand.

59
00:04:32,710 --> 00:04:39,970
So let's start with an image zero 10, 10, zero, 20, 30, 30, 20, 10, 20, 2010 and zero five

60
00:04:39,970 --> 00:04:40,560
five zero.

61
00:04:40,960 --> 00:04:43,790
And we also have the filter one zero zero two.

62
00:04:44,470 --> 00:04:47,230
So how do we convolve this image with this filter?

63
00:04:52,290 --> 00:04:59,700
Let's imagine overlaying the filter at the upper left corner of the image, then all we need to do is

64
00:04:59,700 --> 00:05:03,530
eliminate Y's multiplication and add up all the results.

65
00:05:03,990 --> 00:05:13,270
So yet one time zero plus zero times 10 plus zero times 20 plus 30 times two, that is equal to 60.

66
00:05:13,950 --> 00:05:17,210
This gives us our first output in the output image.

67
00:05:17,760 --> 00:05:21,720
And as promised, all you needed to do was multiply and add.

68
00:05:26,820 --> 00:05:33,270
Now, let's move our filter over one space to the right and then let's do our element Y's multiplication

69
00:05:33,270 --> 00:05:41,550
and add up the results again so we get one times ten plus zero times ten plus zero times thirty plus

70
00:05:41,550 --> 00:05:42,500
two times thirty.

71
00:05:42,630 --> 00:05:43,600
That's 70.

72
00:05:44,280 --> 00:05:48,510
This gives us our second output, which goes to the right of the first output.

73
00:05:53,560 --> 00:05:59,350
Now, let's move the filter over one more space to the right and do the same operation again, we get

74
00:05:59,500 --> 00:06:06,390
one times ten plus zero zero plus zero times 30 plus two times 20, and that's 50.

75
00:06:07,060 --> 00:06:11,320
This gives us our third output, which goes to the right of the first to upwards.

76
00:06:16,380 --> 00:06:21,900
Now you realize that there's no more space to move our filter to the right anymore, so let's instead

77
00:06:21,900 --> 00:06:27,540
zigzag back to the left, but go down one row and then let's repeat our calculation.

78
00:06:28,140 --> 00:06:34,920
We get one times 20 plus zero times 30 plus zero times ten plus two times twenty.

79
00:06:34,950 --> 00:06:36,020
That's equal to 60.

80
00:06:36,840 --> 00:06:39,750
This gives us our first output in the second row.

81
00:06:41,340 --> 00:06:46,470
So you can see that the output location corresponds to where we've placed the filter along the original

82
00:06:46,470 --> 00:06:46,980
image.

83
00:06:52,090 --> 00:06:56,440
So we're not going to go through the rest of the calculation, since that would be a bit redundant,

84
00:06:56,830 --> 00:07:02,080
but what I want you to do is if you don't understand this, do the rest of the calculations by hand

85
00:07:02,470 --> 00:07:04,330
to make sure that these results are correct.

86
00:07:06,310 --> 00:07:11,200
So we have 60, 70, 50, 60, 70, 50, and then 20, 30, 20.

87
00:07:16,270 --> 00:07:23,140
One helpful exercise to get a better idea of how Convolution works is to write pseudocode and even real

88
00:07:23,140 --> 00:07:25,940
code to implement the algorithm we just described.

89
00:07:26,530 --> 00:07:30,520
This will help us uncover a few hidden details you may have not considered.

90
00:07:31,840 --> 00:07:33,150
Just by starting the code.

91
00:07:33,160 --> 00:07:37,240
You realize one thing you need to take care of, which is what size?

92
00:07:37,450 --> 00:07:39,850
Should I initialize the output arrays?

93
00:07:40,510 --> 00:07:47,410
If you recall in our example, the input image headline for while the kernel had length to the output,

94
00:07:47,430 --> 00:07:48,230
haling three.

95
00:07:49,000 --> 00:07:49,990
So what's the pattern?

96
00:07:50,830 --> 00:07:56,920
I claim that if you have an array of length NP and a filter of length K, then there are an A minus

97
00:07:56,920 --> 00:08:01,550
K plus one distinct possible positions you can put the filter into.

98
00:08:02,530 --> 00:08:06,550
You might want to draw this on paper if you don't see right away why this is true.

99
00:08:11,570 --> 00:08:14,910
So the first step in our pseudocode is to initialize the output array.

100
00:08:15,800 --> 00:08:21,680
But here's another detail you may have not considered the output height will be the input height minus

101
00:08:21,680 --> 00:08:22,790
Col height plus one.

102
00:08:23,270 --> 00:08:26,900
The output with will be the input with minus Kohona with plus one.

103
00:08:27,710 --> 00:08:31,130
But in our example, both our image and kernel were square.

104
00:08:31,850 --> 00:08:34,300
So you might be wondering, is this always the case?

105
00:08:34,520 --> 00:08:35,490
What's the convention?

106
00:08:36,170 --> 00:08:39,630
And the answer is that four images, usually these are not square.

107
00:08:40,310 --> 00:08:42,550
This just has to do with cameras and screens.

108
00:08:43,070 --> 00:08:49,520
Most screens like your computer screen and your TV screen are not square and therefore cameras do not

109
00:08:49,520 --> 00:08:50,770
take square pictures.

110
00:08:51,590 --> 00:08:57,740
Therefore, images we find in the wild that we want to use as data are typically also not square.

111
00:08:58,670 --> 00:09:01,980
Some narrow networks do use square images for convenience.

112
00:09:02,150 --> 00:09:04,920
So when you build the data set, you make them square.

113
00:09:05,330 --> 00:09:08,210
And one example of this is Imust, which we've already seen.

114
00:09:08,930 --> 00:09:12,980
On the other hand, kernels are almost always square by convention.

115
00:09:18,010 --> 00:09:24,100
Let's move on, the next step is just to fill in the output array by performing the convolution algorithm.

116
00:09:25,300 --> 00:09:29,800
So first we loop through zero up to output height with the index I.

117
00:09:31,870 --> 00:09:36,590
Inside that, we loop through zero up to output with with the index J.

118
00:09:37,570 --> 00:09:40,330
So the pair will always index our output.

119
00:09:41,470 --> 00:09:47,910
Then we lived through each position of the colonel, so we have I going from zero up to Colonel Height

120
00:09:47,950 --> 00:09:51,190
and we have JJ going from zero up to Colonel Width.

121
00:09:52,090 --> 00:09:58,540
Finally, inside all these loops, we have our main calculation, which, as promised, is just multiplication.

122
00:09:58,540 --> 00:10:07,030
And addition, we multiply the input image at position I plus I j plus JJ by the colonel, that position

123
00:10:07,030 --> 00:10:08,220
I j j.

124
00:10:08,710 --> 00:10:12,640
And we add this result to the output image at IJA.

125
00:10:13,750 --> 00:10:19,480
Now it's OK to accumulate using plus equals because we initialize the output image to be all zeros.

126
00:10:20,770 --> 00:10:26,080
As an exercise, you might want to try and put this into code so that you can confirm that it works

127
00:10:26,080 --> 00:10:27,650
and returns the expected result.

128
00:10:32,700 --> 00:10:38,680
The inner part of the pseudocode is key because it helps us understand the equation that defines convolution.

129
00:10:39,600 --> 00:10:47,370
So here we can see that the eigth entry of the output A involved with W is the sum over I prime and

130
00:10:47,370 --> 00:10:55,590
the sum of a gay pride of a add i plus I prime J plus J prime times w at my prime J Prime.

131
00:10:56,790 --> 00:11:01,650
Now you might ask, why are we looking at these complicated equations of Tenzer flow already?

132
00:11:01,650 --> 00:11:02,830
Does all this work for us.

133
00:11:03,450 --> 00:11:09,420
And the answer is that this will help you immensely in understanding the different perspectives on convolution

134
00:11:09,690 --> 00:11:11,180
that we are going to discuss later.

135
00:11:16,220 --> 00:11:22,100
Moreover, if you're a curious person and you go on Wikipedia to read about convolution, you'll see

136
00:11:22,100 --> 00:11:27,060
something very similar to this and this example you can think of as the filter.

137
00:11:27,230 --> 00:11:29,810
Why is the input image and zella's the output image?

138
00:11:34,880 --> 00:11:39,560
Now, you might notice something weird about this equation from Wikipedia, which is that instead of

139
00:11:39,560 --> 00:11:41,400
plus signs, we have minus signs.

140
00:11:41,750 --> 00:11:42,370
Why is that?

141
00:11:42,710 --> 00:11:44,280
Is the lazy programmer wrong?

142
00:11:45,080 --> 00:11:48,200
And the answer is, in fact, all of deep learning is wrong.

143
00:11:48,860 --> 00:11:51,710
But for better or worse, that's just the way we do things.

144
00:11:52,270 --> 00:11:57,290
In the end, it doesn't make a difference because the filters we use will be learned that using gradient

145
00:11:57,290 --> 00:11:59,240
descent, in other words, automatically.

146
00:12:00,560 --> 00:12:05,780
So the process of finding the filter will be done automatically using gradient descent, which will

147
00:12:05,780 --> 00:12:08,100
find the best values that optimize our loss function.

148
00:12:08,270 --> 00:12:11,210
So it doesn't matter if the filter is reversed or not.

149
00:12:16,170 --> 00:12:21,690
In fact, if you use a library like Zippi, you'll notice that it already has a function called Convolve

150
00:12:21,690 --> 00:12:22,120
Tuti.

151
00:12:23,070 --> 00:12:27,800
The problem is if you use this function as is, you'll get a totally different answer than we did.

152
00:12:29,010 --> 00:12:33,180
And as a side note, you might want to try that yourself to confirm that what I'm saying is true.

153
00:12:34,500 --> 00:12:40,320
Now, this is because Convolve Tuti does a proper convolution and not the deep learning version of convolution

154
00:12:40,590 --> 00:12:42,240
with plus instead of minus.

155
00:12:42,960 --> 00:12:49,080
In order to make Zippy's convolution work the same way, we have to flip the filter both horizontally

156
00:12:49,080 --> 00:12:52,470
and vertically and set the mode argument equal to valid.

157
00:12:53,960 --> 00:13:00,410
Now, as a side note, convolution is a commutative operation, a convolve with W is the same as W convulse

158
00:13:00,410 --> 00:13:04,310
with a, therefore it doesn't matter which input we flip.

159
00:13:09,390 --> 00:13:13,810
In fact, what we are doing in deep learning is actually known as the cross correlation.

160
00:13:14,460 --> 00:13:20,400
I actually think this is a much more helpful and descriptive name compared to convolution convolution

161
00:13:20,400 --> 00:13:21,210
in and of itself.

162
00:13:21,210 --> 00:13:25,020
Probably doesn't mean much to you, but the word correlation does.

163
00:13:25,560 --> 00:13:28,840
You probably think of the word correlated as sameness.

164
00:13:29,370 --> 00:13:35,100
So if I say X and Y are correlated, that means to you that there's some degree of similarity between

165
00:13:35,100 --> 00:13:35,760
X and Y.

166
00:13:36,420 --> 00:13:41,370
Therefore, you might think of CNN as a correlation neuron network rather than a convolutional known

167
00:13:41,370 --> 00:13:41,860
network.

168
00:13:42,450 --> 00:13:48,570
The only difference between correlation and convolution is that convolution reverses the orientation

169
00:13:48,570 --> 00:13:51,060
of the filter, whereas correlation does not.

170
00:13:56,110 --> 00:14:02,050
The final topic I want to talk about in this lecture is this argument to understand this, it's helpful

171
00:14:02,050 --> 00:14:06,020
to look at these animations which kind of summarize how evolution works.

172
00:14:06,760 --> 00:14:11,160
Basically, you're sliding the filter across every possible position in the input image.

173
00:14:11,860 --> 00:14:16,600
And this animation, the motion of the filter is bounded by the edges of the image.

174
00:14:17,170 --> 00:14:20,800
Because of this output, image is always smaller than the input image.

175
00:14:25,860 --> 00:14:31,650
But you might wonder, what if I want the output image to be the same size as the input image in this

176
00:14:31,650 --> 00:14:33,660
case, what you can do is add padding.

177
00:14:34,380 --> 00:14:39,660
This is equivalent to adding a virtual array of zeros around the input image so that the filter can

178
00:14:39,660 --> 00:14:41,250
extend out to those values.

179
00:14:42,760 --> 00:14:46,960
The reason I say it's virtual is because you wouldn't really want to allocate the space in code.

180
00:14:47,500 --> 00:14:52,270
There's no reason to, since you already know that anything multiplied by zero is still zero.

181
00:14:52,960 --> 00:14:58,090
So in other words, to quote unquote add padding, you can just pretend there are zeros surrounding

182
00:14:58,090 --> 00:15:04,630
the input image as many zeros as you need to ensure that the output size is equal to the input size.

183
00:15:09,790 --> 00:15:13,870
You may have noticed, however, that even with padding, we still lose some information.

184
00:15:14,530 --> 00:15:17,560
What I mean by that is there are outputs we could calculate.

185
00:15:17,830 --> 00:15:21,980
If we extended the padding even further, there would be non-zero outputs.

186
00:15:22,600 --> 00:15:28,150
So another thing you can do, which is not as common these days, is to extend the padding further so

187
00:15:28,150 --> 00:15:30,550
that you can catch all these non-zero outputs.

188
00:15:31,450 --> 00:15:34,630
This results in an output size of end plus K minus one.

189
00:15:34,900 --> 00:15:38,350
If your input image has lengthened and your filter has length K.

190
00:15:38,980 --> 00:15:42,880
Again, you should draw this out on paper if you're not convinced this is true.

191
00:15:47,980 --> 00:15:53,140
To summarize the three modes of convolution, let's look at this table, the first one we discussed

192
00:15:53,140 --> 00:15:54,550
was called valid convolution.

193
00:15:55,060 --> 00:16:00,760
This applies when the kernel can only touch the original input image, the output size is and a minus

194
00:16:00,760 --> 00:16:01,510
K plus one.

195
00:16:02,500 --> 00:16:05,170
The second one we discussed was called same convolution.

196
00:16:05,740 --> 00:16:11,380
In this scenario, we had some padding just enough so that the output size is the same as the input

197
00:16:11,380 --> 00:16:12,220
size N.

198
00:16:13,710 --> 00:16:19,230
The third mode we discussed is called full convolution in this scenario, we extend the filter out as

199
00:16:19,230 --> 00:16:25,010
far as possible so that at least one point on the filter overlaps with one point on the input image.

200
00:16:25,500 --> 00:16:28,230
The output size is N plus K minus one.

201
00:16:29,250 --> 00:16:33,090
Normally in deep learning, we use valid or same convolutions.