1
00:00:11,600 --> 00:00:17,900
In this lecture, we are going to discuss convolution in the preparation for building convolutional

2
00:00:17,900 --> 00:00:18,800
neural networks.

3
00:00:19,430 --> 00:00:25,070
Of course, given the name, it's pretty obvious that a convolutional neural network is a neural network

4
00:00:25,070 --> 00:00:26,390
with convolution.

5
00:00:26,840 --> 00:00:32,390
And so in order to understand CNNs, we must first understand convolution.

6
00:00:37,550 --> 00:00:42,620
First, I want to mention that people sometimes treat convolution as this mysterious thing.

7
00:00:43,220 --> 00:00:48,680
But as with our discussion of machine learning, we want to take the dumb as possible approach.

8
00:00:49,190 --> 00:00:51,500
Sure, convolution can be complex.

9
00:00:51,800 --> 00:00:55,280
It's a central operation in signal processing and computer vision.

10
00:00:55,820 --> 00:00:57,290
But in fact, it's quite simple.

11
00:00:57,800 --> 00:00:59,480
There are only two requirements.

12
00:00:59,810 --> 00:01:02,380
First, can you add? Second,

13
00:01:02,450 --> 00:01:03,440
can you multiply?

14
00:01:04,160 --> 00:01:08,810
If your answer to these two questions is yes, then you understand convolution.

15
00:01:09,770 --> 00:01:13,100
Believe it or not, convolution is just adding and multiplying.

16
00:01:18,250 --> 00:01:20,470
So let's start by not doing any math.

17
00:01:20,710 --> 00:01:25,120
Let's just look at convolution qualitatively. With convolution,

18
00:01:25,180 --> 00:01:27,460
there are three objects you want to pay attention to.

19
00:01:28,010 --> 00:01:29,530
First is the input image.

20
00:01:30,080 --> 00:01:35,710
Second, there's the filter. Note that another name for the filter is the kernel, so we're going to

21
00:01:35,710 --> 00:01:38,080
use those two terms interchangeably.

22
00:01:39,310 --> 00:01:41,050
Finally, there's the output image.

23
00:01:41,680 --> 00:01:46,990
The output image is what you get when you convolve the input image with the filter.

24
00:01:47,680 --> 00:01:54,100
In other words, if I perform the convolution operation on the input image and the filter, I get the

25
00:01:54,100 --> 00:01:54,910
output image.

26
00:01:55,780 --> 00:01:59,650
The mathematical symbol for convolution is a star or asterisk.

27
00:01:59,950 --> 00:02:05,140
Not to be confused with the asterisk that we use in computer programming for multiplication.

28
00:02:05,830 --> 00:02:11,770
So if you're writing convolution in code, it will not be an asterisk because in code, we already use

29
00:02:11,770 --> 00:02:13,360
the asterisk for multiplication.

30
00:02:18,430 --> 00:02:23,020
It's helpful to think of some examples of convolution to understand what it can do.

31
00:02:23,860 --> 00:02:25,630
There are two examples I really like.

32
00:02:26,410 --> 00:02:27,970
The first example is blurring.

33
00:02:28,660 --> 00:02:34,600
So the input image is just the original image, and the output image is a blurred version of that image.

34
00:02:35,200 --> 00:02:38,890
You might recognize this operation from applications such as Photoshop.

35
00:02:44,080 --> 00:02:48,200
The second example I really like is edge detection. As input,

36
00:02:48,220 --> 00:02:53,770
we have the original image, and as output, we get white lines where there are edges in the original

37
00:02:53,770 --> 00:02:56,630
image and we get black where there are no edges.

38
00:02:57,190 --> 00:03:01,390
So the output image highlights all the edges in the original input image.

39
00:03:06,570 --> 00:03:12,420
So that's your first lesson on how to view convolution. This perspective is that convolution is an

40
00:03:12,420 --> 00:03:19,800
image modifier. The input is the original image, and the output is a modified or transformed version

41
00:03:19,800 --> 00:03:20,580
of that image.

42
00:03:21,240 --> 00:03:25,890
In other words, you might think of this as a feature transformation on the input image.

43
00:03:26,670 --> 00:03:29,820
You know, that sounds curiously like what a neural network does.

44
00:03:30,540 --> 00:03:35,760
In any case, what makes the blurring and edge detection actually work if they are both convolution?

45
00:03:36,390 --> 00:03:39,360
What makes one convolution different from another convolution?

46
00:03:40,170 --> 00:03:43,950
Well, the answer is the filter. When you use a Gaussian filter,

47
00:03:44,100 --> 00:03:47,700
this blurs the image. When you use an edge detection filter,

48
00:03:47,730 --> 00:03:48,990
you get edge detection.

49
00:03:49,020 --> 00:03:51,390
You can think of that as sharpening the image.

50
00:03:52,380 --> 00:03:56,910
Your next question might be how do we find or design these filters in the first place?

51
00:03:57,630 --> 00:03:59,760
That's something we'll discuss later in the section.

52
00:04:00,090 --> 00:04:05,490
But the answer, as you might have come to expect, is just another dumb as possible approach.

53
00:04:10,660 --> 00:04:13,930
OK, so let's get into the nitty gritty of how convolution works.

54
00:04:14,440 --> 00:04:18,250
I promised you that all you need to know is how to add and multiply.

55
00:04:18,790 --> 00:04:19,810
Let's see if that's true.

56
00:04:20,800 --> 00:04:26,230
We're going to use very tiny images for this example, although in reality, the actual images

57
00:04:26,230 --> 00:04:27,790
we work with will be much bigger.

58
00:04:28,450 --> 00:04:31,630
This is just so we can feasibly do these calculations by hand.

59
00:04:32,710 --> 00:04:39,940
So let's start with a 4-by-4 image. Its rows are 0, 10, 10, 0; 20, 30, 30, 20; 10, 20, 20, 10; and 0, 5,

60
00:04:39,940 --> 00:04:40,540
5, 0.

61
00:04:40,990 --> 00:04:43,780
And we also have the 2-by-2 filter with rows 1, 0 and 0, 2.

62
00:04:44,500 --> 00:04:47,200
So how do we convolve this image with this filter?

63
00:04:52,320 --> 00:04:57,000
Let's imagine overlaying the filter at the upper left corner of the image.

64
00:04:57,750 --> 00:05:03,480
Then all we need to do is element-wise multiplication and add up all the results.

65
00:05:03,990 --> 00:05:11,500
So yeah, 1 times 0 plus 0 times 10 plus 0 times 20 plus 2 times 30.

66
00:05:11,910 --> 00:05:13,260
That is equal to 60.

67
00:05:13,950 --> 00:05:17,190
This gives us our first output in the output image.

68
00:05:17,760 --> 00:05:21,690
And as promised, all you needed to do was multiply and add.

69
00:05:26,820 --> 00:05:33,240
Now, let's move our filter over one space to the right, and then let's do our element wise multiplication

70
00:05:33,240 --> 00:05:34,680
and add up the results again.

71
00:05:35,250 --> 00:05:42,510
So we get 1 times 10 plus 0 times 10 plus 0 times 30 plus 2 times 30.

72
00:05:42,660 --> 00:05:43,590
That's 70.

73
00:05:44,280 --> 00:05:48,480
This gives us our second output, which goes to the right of the first output.

74
00:05:53,530 --> 00:05:58,210
Now, let's move the filter over one more space to the right and do the same operation again.

75
00:05:58,840 --> 00:06:05,260
We get 1 times 10 plus 0 times 0 plus 0 times 30 plus 2 times 20.

76
00:06:05,410 --> 00:06:06,400
That's 50.

77
00:06:07,090 --> 00:06:11,260
This gives us a third output, which goes to the right of the first two outputs.

78
00:06:16,410 --> 00:06:21,810
Now you realize that there's no more space to move our filter to the right anymore, so let's instead

79
00:06:21,900 --> 00:06:29,250
zig zag back to the left, but go down one row, and then let's repeat our calculation. We get 1 times

80
00:06:29,250 --> 00:06:34,910
20 plus 0 times 30 plus 0 times 10 plus 2 times 20.

81
00:06:34,920 --> 00:06:36,030
That's equal to 60.

82
00:06:36,840 --> 00:06:39,750
This gives us our first output in the second row.

83
00:06:41,340 --> 00:06:46,440
So you can see that the output location corresponds to where we've placed the filter along the original

84
00:06:46,440 --> 00:06:46,950
image.

85
00:06:52,120 --> 00:06:56,410
So we're not going to go through the rest of the calculations, since that would be a bit redundant.

86
00:06:56,830 --> 00:07:02,080
But what I want you to do is if you don't understand this, do the rest of the calculations by hand

87
00:07:02,470 --> 00:07:04,300
to make sure that these results are correct.

88
00:07:06,310 --> 00:07:11,200
So we have 60, 70, 50, 60, 70, 50 and then 20, 30, 20.

89
00:07:16,270 --> 00:07:23,110
One helpful exercise to get a better idea of how convolution works is to write pseudocode and even real

90
00:07:23,110 --> 00:07:25,930
code to implement the algorithm we just described.

91
00:07:26,530 --> 00:07:33,130
This will help us uncover a few hidden details you may not have considered. Just by starting the code,

92
00:07:33,130 --> 00:07:39,820
you realize one thing you need to take care of, which is: what size should I initialize the output array to?

93
00:07:40,540 --> 00:07:47,380
If you recall, in our example, the input image had length 4, while the kernel had length 2. The output

94
00:07:47,380 --> 00:07:48,220
had length 3.

95
00:07:49,030 --> 00:07:49,990
So what's the pattern?

96
00:07:50,860 --> 00:07:56,890
I claim that if you have an array of length N and a filter of length K, then there are N minus

97
00:07:56,890 --> 00:07:59,950
K plus one distinct possible positions

98
00:08:00,190 --> 00:08:01,570
you can put the filter into.

99
00:08:02,530 --> 00:08:06,520
You might want to draw this on paper if you don't see right away why this is true.
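If drawing it on paper doesn't do it for you, here is a minimal sketch that counts the positions by brute force. The function name `count_positions` is made up for this illustration; it just tries every starting index and keeps the ones where the filter still fits inside the array.

```python
def count_positions(n, k):
    """Count the start indices where a length-k filter fits in a length-n array."""
    # The filter starting at `start` covers indices start .. start + k - 1,
    # so it fits only when start + k <= n.
    positions = [start for start in range(n) if start + k <= n]
    return len(positions)


# Compare the brute-force count against the claimed formula N - K + 1.
for n, k in [(4, 2), (5, 3), (7, 1), (6, 6)]:
    print(n, k, count_positions(n, k), n - k + 1)
```

For the lecture's example, N is 4 and K is 2, so the count is 3, which matches the length of the output.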

100
00:08:11,600 --> 00:08:14,930
So the first step in our pseudocode is to initialize the output array.

101
00:08:15,830 --> 00:08:18,140
But here's another detail you may have not considered.

102
00:08:19,010 --> 00:08:25,190
The output height will be the input height minus kernel height plus one. The output width will be the input

103
00:08:25,190 --> 00:08:26,870
width minus kernel width plus one.

104
00:08:27,680 --> 00:08:31,160
But in our example, both our image and kernel were square.

105
00:08:31,880 --> 00:08:34,280
So you might be wondering, is this always the case?

106
00:08:34,520 --> 00:08:35,480
What's the convention?

107
00:08:36,169 --> 00:08:39,620
And the answer is that for images, usually these are not square.

108
00:08:40,340 --> 00:08:42,530
This just has to do with cameras and screens.

109
00:08:43,070 --> 00:08:49,130
Most screens, like your computer screen and your TV screen, are not square, and therefore cameras

110
00:08:49,130 --> 00:08:50,720
do not take square pictures.

111
00:08:51,590 --> 00:08:57,740
Therefore, images we find in the wild that we want to use as data are typically also not square.

112
00:08:58,670 --> 00:09:01,970
Some neural networks do use square images for convenience.

113
00:09:02,180 --> 00:09:04,940
So when you build the data set, you make them square.

114
00:09:05,360 --> 00:09:08,210
And one example of this is MNIST, which we've already seen.

115
00:09:08,960 --> 00:09:12,950
On the other hand, kernels are almost always square by convention.

116
00:09:18,010 --> 00:09:24,070
Let's move on. The next step is just to fill in the output array by performing the convolution algorithm.

117
00:09:25,330 --> 00:09:29,800
So first, we loop through zero up to output height with the index i.

118
00:09:31,870 --> 00:09:36,580
Inside that, we loop through zero up to output width with the index j.

119
00:09:37,600 --> 00:09:40,300
So the pair (i, j) will always index our output.

120
00:09:41,470 --> 00:09:43,930
Then we loop through each position of the kernel.

121
00:09:44,770 --> 00:09:51,160
So we have ii going from zero up to kernel height, and we have jj going from zero up to kernel width.

122
00:09:52,090 --> 00:09:53,820
Finally, inside all these loops,

123
00:09:53,830 --> 00:09:59,230
we have our main calculation, which, as promised, is just multiplication and addition.

124
00:09:59,920 --> 00:10:08,200
We multiply the input image at position (i + ii, j + jj) by the kernel at position (ii, jj).

125
00:10:08,740 --> 00:10:12,640
And we add this result to the output image at (i, j).

126
00:10:13,780 --> 00:10:19,450
Now it's OK to accumulate using plus equals because we initialize the output image to be all zeros.

127
00:10:20,800 --> 00:10:26,050
As an exercise, you might want to try and put this into code so that you can confirm that it works

128
00:10:26,050 --> 00:10:27,640
and returns the expected result.
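One way the pseudocode just described might look in plain Python is sketched below. The function name `conv2d_valid` and the use of nested lists are choices made for this illustration, not something the lecture prescribes; the loop indices i, j, ii, jj follow the description above.

```python
def conv2d_valid(image, kernel):
    """Deep-learning-style convolution: slide the kernel, multiply, and add."""
    input_h, input_w = len(image), len(image[0])
    kernel_h, kernel_w = len(kernel), len(kernel[0])

    # Output size: input size minus kernel size plus one, per dimension.
    output_h = input_h - kernel_h + 1
    output_w = input_w - kernel_w + 1

    # Initialize to zeros so that accumulating with += is safe.
    output = [[0] * output_w for _ in range(output_h)]

    for i in range(output_h):              # (i, j) indexes the output
        for j in range(output_w):
            for ii in range(kernel_h):     # (ii, jj) indexes the kernel
                for jj in range(kernel_w):
                    output[i][j] += image[i + ii][j + jj] * kernel[ii][jj]
    return output


# The 4-by-4 image and 2-by-2 filter from the worked example.
image = [
    [0, 10, 10, 0],
    [20, 30, 30, 20],
    [10, 20, 20, 10],
    [0, 5, 5, 0],
]
kernel = [
    [1, 0],
    [0, 2],
]

result = conv2d_valid(image, kernel)
for row in result:
    print(row)
# [60, 70, 50]
# [60, 70, 50]
# [20, 30, 20]
```

The printed rows are the same 60, 70, 50, 60, 70, 50, 20, 30, 20 that we got by hand.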

129
00:10:32,700 --> 00:10:38,670
The inner part of the pseudocode is key because it helps us understand the equation that defines convolution.

130
00:10:39,630 --> 00:10:47,430
So here we can see that the (i, j)-th entry of the output, A convolved with W, is the sum over i prime and the

131
00:10:47,430 --> 00:10:55,560
sum over j prime of A at (i plus i prime, j plus j prime) times W at (i prime, j prime).
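Written out, the spoken equation looks roughly like this. The summation limits and the symbols K_h and K_w (kernel height and kernel width) are assumptions added here for completeness; the lecture only says "sum over i prime and j prime".

```latex
(A * W)_{ij} = \sum_{i'=0}^{K_h - 1} \sum_{j'=0}^{K_w - 1} A(i + i',\, j + j')\, W(i',\, j')
```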

132
00:10:56,820 --> 00:11:01,620
Now you might ask, why are we looking at these complicated equations if TensorFlow already

133
00:11:01,620 --> 00:11:02,790
does all this work for us?

134
00:11:03,480 --> 00:11:09,390
And the answer is that this will help you immensely in understanding the different perspectives on convolution

135
00:11:09,690 --> 00:11:11,160
that we are going to discuss later.

136
00:11:16,220 --> 00:11:22,070
Moreover, if you're a curious person and you go on Wikipedia to read about convolution, you'll see

137
00:11:22,070 --> 00:11:23,570
something very similar to this.

138
00:11:24,260 --> 00:11:29,780
In this example, you can think of x as the filter, y as the input image, and z as the output image.

139
00:11:34,910 --> 00:11:39,560
Now, you might notice something weird about this equation from Wikipedia, which is that instead of

140
00:11:39,560 --> 00:11:41,390
plus signs, we have minus signs.

141
00:11:41,750 --> 00:11:42,350
Why is that?

142
00:11:42,740 --> 00:11:44,270
Is the Lazy Programmer wrong?

143
00:11:45,080 --> 00:11:48,170
And the answer is, in fact, all of deep learning is wrong.

144
00:11:48,830 --> 00:11:51,680
But for better or worse, that's just the way we do things.

145
00:11:52,280 --> 00:11:57,260
In the end, it doesn't make a difference because the filters we use will be learned using gradient

146
00:11:57,260 --> 00:11:59,210
descent, in other words, automatically.

147
00:12:00,560 --> 00:12:05,750
So the process of finding the filter will be done automatically using gradient descent, which will

148
00:12:05,750 --> 00:12:08,090
find the best values that optimize our loss function.

149
00:12:08,270 --> 00:12:11,180
So it doesn't matter if the filter is reversed or not.
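To make the plus-versus-minus point concrete, here are the two forms side by side. This is a sketch in the A and W notation used earlier in this lecture, not the exact symbols from the Wikipedia page, and the summation ranges are left implicit.

```latex
\begin{align*}
\text{Textbook convolution:}\quad
  (A * W)_{ij} &= \sum_{i'} \sum_{j'} A(i - i',\, j - j')\, W(i',\, j') \\
\text{Deep learning version:}\quad
  (A * W)_{ij} &= \sum_{i'} \sum_{j'} A(i + i',\, j + j')\, W(i',\, j')
\end{align*}
```

Up to a shift in indexing, replacing W(i', j') with W(-i', -j') turns one form into the other, which is just the filter reversed along both axes. A learned filter can absorb that reversal.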

150
00:12:16,200 --> 00:12:22,140
In fact, if you use a library like SciPy, you'll notice that it already has a function called convolve2d.

151
00:12:23,040 --> 00:12:27,780
The problem is if you use this function as is, you'll get a totally different answer than we did.

152
00:12:29,040 --> 00:12:33,180
And as a side note, you might want to try that yourself to confirm that what I'm saying is true.

153
00:12:34,500 --> 00:12:36,060
Now this is because convolve2d

154
00:12:36,060 --> 00:12:42,180
does a proper convolution, and not the deep learning version of convolution with plus instead of minus.

155
00:12:42,990 --> 00:12:49,080
In order to make SciPy's convolution work the same way, we have to flip the filter both horizontally

156
00:12:49,080 --> 00:12:52,440
and vertically and set the mode argument equal to valid.

157
00:12:53,760 --> 00:13:00,380
And as a side note, convolution is a commutative operation: A convolved with W is the same as W convolved

158
00:13:00,380 --> 00:13:00,860
with A.

159
00:13:01,340 --> 00:13:04,250
Therefore, it doesn't matter which input we flip.
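Here is a small sketch of the filter-flipping trick. To stay dependency-free, it does not call SciPy; instead it hand-rolls a textbook "valid" convolution (the one with minus signs) next to the deep learning version. All three function names are made up for this illustration.

```python
def correlate2d_valid(a, w):
    """Deep learning version: plus signs, no flipping."""
    out_h = len(a) - len(w) + 1
    out_w = len(a[0]) - len(w[0]) + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for ii in range(len(w)):
                for jj in range(len(w[0])):
                    out[i][j] += a[i + ii][j + jj] * w[ii][jj]
    return out


def convolve2d_valid(a, w):
    """Textbook convolution (minus signs), restricted to the 'valid' region."""
    kh, kw = len(w), len(w[0])
    out_h = len(a) - kh + 1
    out_w = len(a[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for ii in range(kh):
                for jj in range(kw):
                    # The image index runs backwards relative to the kernel index.
                    out[i][j] += a[i + kh - 1 - ii][j + kw - 1 - jj] * w[ii][jj]
    return out


def flip2d(w):
    """Flip a filter both vertically and horizontally."""
    return [row[::-1] for row in w[::-1]]


image = [
    [0, 10, 10, 0],
    [20, 30, 30, 20],
    [10, 20, 20, 10],
    [0, 5, 5, 0],
]
kernel = [[1, 0], [0, 2]]

ours = correlate2d_valid(image, kernel)
proper = convolve2d_valid(image, kernel)
proper_flipped = convolve2d_valid(image, flip2d(kernel))

print(ours == proper)          # False: the unflipped proper convolution differs
print(ours == proper_flipped)  # True: flipping the filter recovers our answer
```

With SciPy installed, the analogous check would presumably use `scipy.signal.convolve2d` with `mode='valid'` and the flipped kernel.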

160
00:13:09,420 --> 00:13:13,770
In fact, what we are doing in deep learning is actually known as the cross-correlation.

161
00:13:14,490 --> 00:13:18,660
I actually think this is a much more helpful and descriptive name compared to convolution.

162
00:13:19,620 --> 00:13:24,960
Convolution in and of itself probably doesn't mean much to you, but the word correlation does.

163
00:13:25,560 --> 00:13:28,800
You probably think of the word correlated as sameness.

164
00:13:29,400 --> 00:13:35,070
So if I say X and Y are correlated, that means to you that there's some degree of similarity between

165
00:13:35,070 --> 00:13:35,730
X and Y.

166
00:13:36,420 --> 00:13:41,340
Therefore, you might think of a CNN as a correlation neural network rather than a convolutional neural

167
00:13:41,340 --> 00:13:41,820
network.

168
00:13:42,420 --> 00:13:48,570
The only difference between correlation and convolution is that convolution reverses the orientation

169
00:13:48,570 --> 00:13:51,030
of the filter, whereas correlation does not.

170
00:13:56,080 --> 00:14:01,750
The final topic I want to talk about in this lecture is this mode argument. To understand this, it's

171
00:14:01,750 --> 00:14:05,980
helpful to look at these animations, which kind of summarize how convolution works.

172
00:14:06,760 --> 00:14:11,130
Basically, you're sliding the filter across every possible position in the input image.

173
00:14:11,890 --> 00:14:16,570
In this animation, the motion of the filter is bounded by the edges of the image.

174
00:14:17,170 --> 00:14:20,770
Because of this, the output image is always smaller than the input image.

175
00:14:25,830 --> 00:14:31,620
But you might wonder, what if I want the output image to be the same size as the input image? In this

176
00:14:31,620 --> 00:14:33,660
case, what you can do is add padding.

177
00:14:34,350 --> 00:14:39,630
This is equivalent to adding a virtual array of zeros around the input image so that the filter can

178
00:14:39,630 --> 00:14:41,220
extend out to those values.

179
00:14:42,760 --> 00:14:46,930
The reason I say it's virtual is because you wouldn't really want to allocate the space in code.

180
00:14:47,500 --> 00:14:52,210
There's no reason to since you already know that anything multiplied by zero is still zero.

181
00:14:52,960 --> 00:14:58,090
So in other words, to quote unquote add padding, you can just pretend there are zeros surrounding

182
00:14:58,090 --> 00:15:04,600
the input image as many zeros as you need to ensure that the output size is equal to the input size.
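As a rough sketch of "virtual" padding, the code below never allocates the zeros. It simply skips any position that falls outside the image, which has the same effect as multiplying by zero. The function name `conv2d_same` is made up for this illustration, and it assumes odd kernel sizes so that the padding is the same on both sides.

```python
def conv2d_same(image, kernel):
    """Deep-learning-style convolution with virtual zero padding ('same' size)."""
    input_h, input_w = len(image), len(image[0])
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    if kernel_h % 2 == 0 or kernel_w % 2 == 0:
        # Assumption of this sketch: odd sizes give symmetric padding.
        raise ValueError("this sketch only handles odd kernel sizes")

    pad_h, pad_w = kernel_h // 2, kernel_w // 2
    output = [[0] * input_w for _ in range(input_h)]  # same size as the input

    for i in range(input_h):
        for j in range(input_w):
            for ii in range(kernel_h):
                for jj in range(kernel_w):
                    row = i + ii - pad_h
                    col = j + jj - pad_w
                    # Outside the image we pretend there is a zero, so skip it.
                    if 0 <= row < input_h and 0 <= col < input_w:
                        output[i][j] += image[row][col] * kernel[ii][jj]
    return output


small = [[1, 2], [3, 4]]
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(conv2d_same(small, box))  # [[10, 10], [10, 10]]
```

Every 3-by-3 window centered on this 2-by-2 image covers all four pixels, so each output is 1 + 2 + 3 + 4 = 10, and the output is the same size as the input.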

183
00:15:09,760 --> 00:15:13,810
You may have noticed, however, that even with padding, we still lose some information.

184
00:15:14,470 --> 00:15:19,810
What I mean by that is there are outputs we could calculate if we extended the padding even further.

185
00:15:20,140 --> 00:15:21,970
There would be non-zero outputs.

186
00:15:22,600 --> 00:15:28,120
So another thing you can do, which is not as common these days, is to extend the padding further so

187
00:15:28,120 --> 00:15:30,520
that you can catch all these non-zero outputs.

188
00:15:31,390 --> 00:15:34,600
This results in an output size of N plus K minus one,

189
00:15:34,900 --> 00:15:38,350
If your input image has length N and your filter has length K.

190
00:15:38,980 --> 00:15:42,880
Again, you should draw this out on paper if you're not convinced this is true.

191
00:15:47,950 --> 00:15:51,280
To summarize the three modes of convolution, let's look at this table.

192
00:15:52,000 --> 00:15:54,490
The first one we discussed was called valid convolution.

193
00:15:55,030 --> 00:15:58,510
This applies when the kernel can only touch the original input image.

194
00:15:59,020 --> 00:16:01,510
The output size is N minus K plus one.

195
00:16:02,470 --> 00:16:05,140
The second one we discussed was called same convolution.

196
00:16:05,770 --> 00:16:11,380
In this scenario, we add some padding, just enough so that the output size is the same as the input

197
00:16:11,380 --> 00:16:12,190
size, N.

198
00:16:13,680 --> 00:16:19,200
The third mode we discussed is called full convolution. In this scenario, we extend the filter out as

199
00:16:19,200 --> 00:16:24,960
far as possible so that at least one point on the filter overlaps with one point on the input image.

200
00:16:25,500 --> 00:16:28,200
The output size is N plus K minus one.

201
00:16:29,250 --> 00:16:33,060
Normally, in deep learning, we use valid or same convolutions.
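The table of modes can be summarized in a few lines of code. This is just a sketch of the three formulas along one dimension; the function name `output_length` and the mode strings are choices made for this illustration.

```python
def output_length(n, k, mode):
    """Output length along one dimension for input length n and filter length k."""
    if mode == "valid":
        return n - k + 1   # filter stays entirely inside the input
    if mode == "same":
        return n           # just enough padding to keep the size
    if mode == "full":
        return n + k - 1   # at least one filter point overlaps the input
    raise ValueError(f"unknown mode: {mode}")


# The lecture's example: input length 4, filter length 2.
for mode in ("valid", "same", "full"):
    print(mode, output_length(4, 2, mode))
# valid 3
# same 4
# full 5
```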