1
00:00:11,580 --> 00:00:17,610
Now that you understand how exactly a convolution layer works, including the bias term and activation function,

2
00:00:17,940 --> 00:00:21,440
we can now consider the architecture of a convolutional neural network.

3
00:00:21,450 --> 00:00:22,620
And why it is that way.

4
00:00:23,460 --> 00:00:29,430
So as a little bit of a history lesson, modern CNNs essentially all originated from the same model,

5
00:00:29,520 --> 00:00:30,060
LeNet.

6
00:00:30,780 --> 00:00:36,630
This is named after Yann LeCun, one of the original deep learning pioneers, along with Geoffrey Hinton

7
00:00:36,630 --> 00:00:37,800
and Yoshua Bengio.

8
00:00:38,520 --> 00:00:43,740
What we're going to do in this lecture is just start with what the architecture is and then go through

9
00:00:43,740 --> 00:00:47,310
it piece by piece so that you understand why it is the way it is.

10
00:00:52,420 --> 00:00:54,460
OK, so let's look at a typical CNN.

11
00:00:55,180 --> 00:00:57,400
A typical CNN has two stages.

12
00:00:57,940 --> 00:01:00,970
The first stage is a series of convolutional layers.

13
00:01:01,600 --> 00:01:06,910
Importantly, these convolutional layers are usually followed by pooling layers, which is another type

14
00:01:06,910 --> 00:01:08,290
of layer we'll discuss shortly.

15
00:01:09,810 --> 00:01:14,100
The second stage is a series of dense layers, also called the fully connected layers.

16
00:01:14,460 --> 00:01:18,510
These are just a regular feedforward neural network, as you saw in the previous section.

17
00:01:19,710 --> 00:01:24,390
If you recall, I said in the previous section that neural networks have this hierarchical structure.

18
00:01:24,930 --> 00:01:28,290
Each layer is always looking at features that were found in the previous layer.

19
00:01:29,730 --> 00:01:33,840
So you can even think of the first stage as just another feature transformer.

20
00:01:34,470 --> 00:01:39,090
It's a feature transformer that specifically works on images and finds image features.

21
00:01:39,840 --> 00:01:44,850
Then once it's found those image features, it passes those into a neural network and then the neural

22
00:01:44,850 --> 00:01:48,960
network does what it does best: nonlinear classification or regression.

23
00:01:54,140 --> 00:01:56,750
So let's have a closer look at the convolutional stage.

24
00:01:57,500 --> 00:02:01,070
Now I just added a new layer without really explaining it: the pooling layer.

25
00:02:01,790 --> 00:02:07,460
Why would we want to do that? At a high level, pooling acts as a downsampling operation.

26
00:02:08,240 --> 00:02:09,259
What is downsampling?

27
00:02:09,979 --> 00:02:12,800
That's when you make a smaller image out of a bigger image.

28
00:02:13,580 --> 00:02:16,190
So imagine your image is 100 by 100.

29
00:02:16,910 --> 00:02:21,050
After pooling, if the pool size is two, then you would get a 50 by 50.

30
00:02:21,140 --> 00:02:22,760
That's downsampling by two.

31
00:02:23,600 --> 00:02:28,340
Let's go over mechanically how this works, and then we'll explain why it works and why we would

32
00:02:28,340 --> 00:02:30,650
want to include it in our neural network.

33
00:02:35,770 --> 00:02:41,770
So there are two kinds of pooling: max pooling and average pooling. Choosing which one to use is, as

34
00:02:41,770 --> 00:02:44,110
you probably guessed, a hyperparameter choice.

35
00:02:44,800 --> 00:02:48,580
Let's talk about max pooling first, since I think this one's a bit more common.

36
00:02:49,600 --> 00:02:51,940
The best way to see how it works is by example.

37
00:02:52,750 --> 00:02:58,120
So if we start with an image, let's say it's four by four and contains these numbers: one, two, three,

38
00:02:58,120 --> 00:03:02,620
four, five, six, seven, eight, 16, 15, 14, 13, 12, 11, 10, nine.

39
00:03:03,010 --> 00:03:06,880
So just the numbers one to 16, but organized non-sequentially.

40
00:03:08,420 --> 00:03:14,060
What max pooling will do is take each two-by-two square, of which there are four, and just return the

41
00:03:14,060 --> 00:03:16,070
maximum value in each of those squares.

42
00:03:17,920 --> 00:03:24,550
So the max of one, two, five and six is six, the max of three, four, seven and eight is eight, the max

43
00:03:24,550 --> 00:03:30,730
of 16, 15, 12 and 11 is 16, and the max of 14, 13, 10 and nine is 14.

44
00:03:32,730 --> 00:03:35,580
So the output image is six, eight, 16, 14.

45
00:03:40,620 --> 00:03:45,150
As you might have guessed, average pooling is similar, but instead of taking the max, we take the

46
00:03:45,150 --> 00:03:45,690
average.

47
00:03:46,350 --> 00:03:51,660
So for the same input image, we have the following: the average of one, two, five and six is three point

48
00:03:51,660 --> 00:03:52,140
five.

49
00:03:52,770 --> 00:03:55,770
The average of three, four, seven and eight is five point five.

50
00:03:56,370 --> 00:03:59,790
The average of 16, 15, 12 and 11 is thirteen point five.

51
00:04:00,120 --> 00:04:03,600
And the average of 13, 14, 10 and nine is eleven point five.

52
00:04:04,140 --> 00:04:09,060
So the output image is three point five, five point five, thirteen point five, eleven point five.
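
The two worked examples above can be reproduced with a few lines of NumPy. This is a minimal sketch, assuming a single-channel image and a pool size and stride of two; the helper name `pool2x2` is made up for illustration and does not come from any library.

```python
import numpy as np

# The 4x4 example image from the lecture.
image = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [16, 15, 14, 13],
    [12, 11, 10, 9],
], dtype=float)

def pool2x2(x, reduce_fn):
    """Non-overlapping 2x2 pooling (pool size 2, stride 2) on a 2D array.

    Hypothetical helper for illustration; assumes even height and width.
    """
    h, w = x.shape
    # Axes after reshape: (block row, row in block, block col, col in block).
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    # Reduce over the two "inside the block" axes.
    return reduce_fn(blocks, axis=(1, 3))

max_pooled = pool2x2(image, np.max)   # max pooling
avg_pooled = pool2x2(image, np.mean)  # average pooling

print(max_pooled)
print(avg_pooled)
```

Swapping `np.max` for `np.mean` is the only difference between the two kinds of pooling here, which mirrors how the lecture describes them.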

53
00:04:14,240 --> 00:04:17,510
Now that you know how pooling works, let's talk about why we should use it.

54
00:04:18,290 --> 00:04:20,390
The first advantage of pooling is practical.

55
00:04:20,959 --> 00:04:27,650
If we downsample the image, the image shrinks and therefore we have less data to process, less multiplications

56
00:04:27,650 --> 00:04:28,060
to do.

57
00:04:28,070 --> 00:04:29,690
And of course, that will speed things up.

58
00:04:30,500 --> 00:04:34,640
But the second and more important advantage has to do with image processing itself.

59
00:04:35,420 --> 00:04:38,000
Recall this idea of translational invariance.

60
00:04:40,460 --> 00:04:45,650
This is the idea that I don't care where in an image the feature occurred, I just care that it occurred.

61
00:04:46,190 --> 00:04:50,240
For example, both you and I can recognize this as an A.

62
00:04:51,290 --> 00:04:53,330
We also recognize this as an A.

63
00:04:53,990 --> 00:04:59,720
It doesn't matter where I put the A on the screen. So translational invariance is important from

64
00:04:59,720 --> 00:05:02,810
a biological point of view and how we as humans learn.

65
00:05:03,350 --> 00:05:07,280
We don't have to learn how to recognize A's at each point in our field of vision.

66
00:05:07,820 --> 00:05:12,230
Once we know what something looks like, we can recognize it, no matter if it's up, down, left or

67
00:05:12,230 --> 00:05:12,530
right.

68
00:05:17,610 --> 00:05:18,930
So what is max pooling doing?

69
00:05:19,590 --> 00:05:21,780
Well, consider a small two by two box.

70
00:05:22,380 --> 00:05:25,980
Remember that convolution tells us whether or not a feature has been found.

71
00:05:26,100 --> 00:05:27,090
It's a pattern finder.

72
00:05:27,750 --> 00:05:32,700
So imagine that in this two-by-two box, I have these flags: pattern found, not found, not found,

73
00:05:32,730 --> 00:05:33,300
not found.

74
00:05:34,080 --> 00:05:37,530
So it's a high number and a small number and a small number and a small number.

75
00:05:38,280 --> 00:05:40,710
The pattern being found returns a higher number.

76
00:05:41,130 --> 00:05:44,190
So when I take the max, it says the pattern has been found.

77
00:05:44,850 --> 00:05:49,980
That's good because it tells me that the pattern has been found without caring where it was found.

78
00:05:50,640 --> 00:05:55,290
Average pooling is the same idea, but I think max pooling is a little more intuitive in this regard.

79
00:06:00,470 --> 00:06:06,050
As a side note, keep in mind that in this lecture, I assume that we're using a pool size of two and

80
00:06:06,050 --> 00:06:07,340
that at each pooling layer,

81
00:06:07,550 --> 00:06:09,980
we select boxes which are two spaces apart.

82
00:06:11,480 --> 00:06:14,210
Now, pooling layers have some flexibility in this regard.

83
00:06:14,990 --> 00:06:18,620
First, it's not necessary for the pool size to be the same in both directions.

84
00:06:19,100 --> 00:06:24,680
For example, you can look at a two by three box or a three by two box, although generally that's unconventional.

85
00:06:25,610 --> 00:06:28,760
Second is that it's possible for the boxes to overlap.

86
00:06:29,420 --> 00:06:32,270
The hyperparameter that controls this is called the stride.

87
00:06:32,840 --> 00:06:36,650
It tells us how many spaces to move the pooling box for each output.

88
00:06:38,670 --> 00:06:43,740
Previously, we looked at the situation where we had a pool size of two by two and the stride was also

89
00:06:43,740 --> 00:06:44,130
two.

90
00:06:44,700 --> 00:06:46,290
This is a pretty common scenario.

91
00:06:47,040 --> 00:06:50,310
If you had a stride of one, then our two-by-two boxes would overlap.

92
00:06:50,910 --> 00:06:52,290
This is not really too common.

93
00:06:52,980 --> 00:06:57,900
So while you have these options, which I want to make you aware of, they are usually not changed

94
00:06:57,900 --> 00:06:58,380
too often.

95
00:06:59,100 --> 00:07:02,070
So what you see in this lecture is what we do most often.

96
00:07:02,490 --> 00:07:08,730
We use a pool size of two in each direction and we have a stride of two so that we don't have any overlapping

97
00:07:08,730 --> 00:07:09,360
boxes.
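
To make the roles of pool size and stride concrete, here is a small NumPy sketch. It assumes a square pooling box and no padding; the function name `max_pool` and its parameters are hypothetical, chosen only for this illustration.

```python
import numpy as np

def max_pool(x, pool_size=2, stride=2):
    """Max pooling on a 2D array with a square box and no padding.

    Illustrative helper only; real libraries also handle batches,
    channels, padding, and non-square boxes.
    """
    h, w = x.shape
    # Number of positions the box can take along each direction.
    out_h = (h - pool_size) // stride + 1
    out_w = (w - pool_size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride  # top-left corner of the box
            out[i, j] = x[r:r + pool_size, c:c + pool_size].max()
    return out

image = np.arange(16).reshape(4, 4)

# Stride equal to the pool size: boxes do not overlap, 4x4 -> 2x2.
non_overlapping = max_pool(image, pool_size=2, stride=2)

# Stride of one: neighboring boxes overlap, 4x4 -> 3x3.
overlapping = max_pool(image, pool_size=2, stride=1)

print(non_overlapping.shape, overlapping.shape)
```

With a stride of one the output barely shrinks, which is one way to see why the non-overlapping setting is the usual choice when the goal is downsampling.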

98
00:07:14,560 --> 00:07:17,170
So now that you know what pooling does and why it works.

99
00:07:17,470 --> 00:07:21,970
Let's discuss why the convolutional part of the CNN is organized the way it is.

100
00:07:22,570 --> 00:07:27,430
Why should we have a sequential pattern consisting of a conv layer followed by a pooling layer, followed

101
00:07:27,430 --> 00:07:30,280
by a conv layer followed by another pooling layer and so forth?

102
00:07:31,060 --> 00:07:36,730
If you recall, I told you earlier that one cool thing researchers discovered is that CNNs are able

103
00:07:36,730 --> 00:07:41,290
to learn features hierarchically, so the initial layers end up learning basic strokes.

104
00:07:41,710 --> 00:07:46,210
The next layer ends up learning individual facial features such as nose, eyes and lips.

105
00:07:46,840 --> 00:07:49,000
The next layer ends up learning whole faces.

106
00:07:54,070 --> 00:07:59,650
The key point to keep in mind is that after each pooling, the image shrinks, but the filter sizes

107
00:07:59,920 --> 00:08:02,630
generally stay around the same.

108
00:08:02,680 --> 00:08:06,910
Common filter sizes are three by three, five by five and seven by seven.

109
00:08:07,840 --> 00:08:10,570
Generally speaking, they are much smaller than the input image.

110
00:08:11,200 --> 00:08:14,920
But what happens as the image shrinks as it passes through each layer?

111
00:08:15,640 --> 00:08:21,430
Well, let's suppose we start with a 32 by 32 image and we have four convolutional and pooling layers.

112
00:08:22,120 --> 00:08:26,110
Let's assume we do convolution in same mode, so it doesn't change the size of the image.

113
00:08:27,280 --> 00:08:30,580
And let's assume all our convolution filters are size three by three.

114
00:08:30,940 --> 00:08:32,710
Although this doesn't come into play just yet.

115
00:08:34,169 --> 00:08:39,780
In this case, after the first convolution and pooling, the image shrinks down to 16 by 16.

116
00:08:40,530 --> 00:08:44,730
After the second convolution and pooling, the image shrinks down to eight by eight.

117
00:08:45,480 --> 00:08:48,930
After the third convolution and pooling, the image shrinks down to four by four.

118
00:08:49,410 --> 00:08:51,780
And finally, after a fourth convolution and pooling,

119
00:08:52,110 --> 00:08:53,760
the image shrinks down to two by two.

120
00:08:55,950 --> 00:08:58,080
What happens after this is still a mystery.

121
00:08:58,110 --> 00:09:01,020
So let's just leave that for now and focus on the image shrinking.

122
00:09:06,140 --> 00:09:12,620
As you can see, if you take a three by three filter and place it over a 32 by 32 image, it occupies

123
00:09:12,620 --> 00:09:14,570
only a very small portion of the image.

124
00:09:15,200 --> 00:09:18,110
In other words, it's looking for very tiny patterns to match.

125
00:09:18,470 --> 00:09:21,320
Like, for example, simple edges and strokes.

126
00:09:22,190 --> 00:09:25,100
Then the image becomes size 16 by 16.

127
00:09:25,970 --> 00:09:31,070
Now, all of a sudden, the three by three filter takes up four times as much space in the image.

128
00:09:31,610 --> 00:09:35,360
It takes up twice as much space in each direction, and two times two is four.

129
00:09:36,020 --> 00:09:40,790
In any case, it occupies a much larger portion of the image relative to the first filter.

130
00:09:42,590 --> 00:09:48,080
So now it's looking for patterns which take up more space on the image, patterns like nose, eyes and

131
00:09:48,080 --> 00:09:48,530
lips.

132
00:09:50,840 --> 00:09:56,300
Next, we get to an eight-by-eight image. A three-by-three filter occupies almost a quarter of the

133
00:09:56,300 --> 00:09:56,780
image.

134
00:09:58,530 --> 00:10:04,290
After that convolution and pooling, we get a four-by-four image. Now a three-by-three filter takes

135
00:10:04,290 --> 00:10:05,490
up most of the image.
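
The shrinking image and the growing relative footprint of a fixed three-by-three filter can be tabulated directly. This is a rough sketch under the lecture's assumptions (same-mode convolution, pooling by two, four layers), measuring the footprint by area; the variable names are made up for illustration. By that measure the filter covers about 14 percent of an eight-by-eight image and about 56 percent of a four-by-four image.

```python
# Track the image size seen by each of the four conv layers and the
# share of that image (by area) covered by a single 3x3 filter.
size = 32          # starting image is 32x32
filter_size = 3    # every conv layer uses a 3x3 filter

sizes = [size]     # image side length before each layer, then the final size
fractions = []     # filter area / image area at the input of each conv layer

for layer in range(4):
    fractions.append(filter_size ** 2 / size ** 2)
    size //= 2     # same-mode convolution keeps the size; pooling halves it
    sizes.append(size)

for s, f in zip(sizes, fractions):
    print(f"{s}x{s} input: filter covers {f:.1%} of the image")
print(f"final output: {sizes[-1]}x{sizes[-1]}")
```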

136
00:10:10,700 --> 00:10:11,750
So what have we learned?

137
00:10:12,380 --> 00:10:17,540
We've learned that in a convolutional neural network, convolution and pulling results in two things

138
00:10:17,540 --> 00:10:19,040
which are corollaries of each other.

139
00:10:19,700 --> 00:10:21,620
First, the image input shrinks.

140
00:10:22,250 --> 00:10:27,890
Second, since filters generally stay the same size, the filter looks for larger and larger patterns

141
00:10:27,890 --> 00:10:28,610
in the image.

142
00:10:29,120 --> 00:10:33,170
This is what leads to the CNN learning hierarchical features of the input.

143
00:10:33,710 --> 00:10:37,880
It first learns things that are relatively small compared to the total size of the image.

144
00:10:38,330 --> 00:10:41,570
Then it learns to find bigger patterns and bigger patterns and so forth.

145
00:10:46,580 --> 00:10:51,920
The last concept regarding the convolutional layers we have to think about is, are we not losing information

146
00:10:51,920 --> 00:10:52,700
at every layer

147
00:10:52,940 --> 00:10:58,880
if we shrink the image? And the answer is yes, that's true. We are losing spatial information because

148
00:10:58,880 --> 00:10:59,990
the image keeps shrinking.

149
00:11:00,650 --> 00:11:05,720
We're saying we don't care where in the image the feature was found, just that it was found somewhere

150
00:11:05,750 --> 00:11:06,470
in the image.

151
00:11:06,860 --> 00:11:09,560
So we're losing information about where it was.

152
00:11:10,310 --> 00:11:15,110
But there's another component of the convolution process we haven't considered yet, which is the number

153
00:11:15,110 --> 00:11:15,950
of feature maps.

154
00:11:16,670 --> 00:11:22,280
Again, we have a general pattern to follow, which is that while the size of the image generally decreases,

155
00:11:22,610 --> 00:11:25,130
the number of feature maps generally increases.

156
00:11:25,700 --> 00:11:27,990
In other words, I don't care where the feature is found.

157
00:11:28,190 --> 00:11:29,690
I just care that it was found.

158
00:11:31,130 --> 00:11:34,340
But I do care about different possible features I could find.

159
00:11:35,590 --> 00:11:40,780
So while you're losing spatial information, you're gaining information in terms of what features were

160
00:11:40,780 --> 00:11:41,770
found in the image.

161
00:11:46,850 --> 00:11:49,280
One last note on this convolution and pooling pattern.

162
00:11:50,090 --> 00:11:54,920
One of the hardest things to grasp for new students in deep learning is that we have so many choices.

163
00:11:55,070 --> 00:12:00,350
We call these hyperparameters. Previously, for ANNs, we looked at the learning rate, the number of

164
00:12:00,350 --> 00:12:02,420
hidden layers and the number of hidden units.

165
00:12:02,870 --> 00:12:05,750
But now with CNNs, we have a ton more choices.

166
00:12:06,410 --> 00:12:09,050
With convolution, we have to choose the filter size.

167
00:12:09,320 --> 00:12:11,150
We have to choose the number of feature maps.

168
00:12:11,390 --> 00:12:13,700
We have to choose the pool size and so forth.

169
00:12:14,300 --> 00:12:19,100
But luckily, I think this is one instance where there's a general pattern that most people follow,

170
00:12:19,340 --> 00:12:21,320
so you're not going in completely blind.

171
00:12:22,920 --> 00:12:27,570
So stick with these guidelines, and you can be sure that you're at least doing things per accepted

172
00:12:27,570 --> 00:12:34,200
convention. That is, choosing small filters relative to the image, like three by three, five by five or

173
00:12:34,200 --> 00:12:34,950
seven by seven.

174
00:12:35,880 --> 00:12:39,780
Another guideline is to repeat the pattern: convolution followed by pooling.

175
00:12:40,560 --> 00:12:44,460
Another guideline is to increase the number of feature maps at each convolution.

176
00:12:45,000 --> 00:12:50,160
So you might start with 32 and then 64 and 128, and maybe 128 again.

177
00:12:50,970 --> 00:12:56,310
The best way to learn about how other people are doing it is to read lots of papers and check out which

178
00:12:56,310 --> 00:12:57,660
hyperparameters they selected.

179
00:12:58,530 --> 00:13:02,880
Generally, you'll find that this is the pattern followed by most convolutional networks.

180
00:13:07,970 --> 00:13:13,400
Next, I want to mention something a little funny, but in fact, pooling is sometimes not what we are

181
00:13:13,400 --> 00:13:15,290
going to end up using in our CNNs.

182
00:13:16,130 --> 00:13:20,600
Researchers have found that we can do something similar, but which is more efficient and sometimes

183
00:13:20,600 --> 00:13:23,040
works just as well.

184
00:13:23,060 --> 00:13:28,610
In particular, recall that for pooling, we have this concept of stride that tells us how far apart each box should

185
00:13:28,610 --> 00:13:30,260
be when we take the max.

186
00:13:30,980 --> 00:13:33,920
In fact, convolution has a stride option as well.

187
00:13:35,280 --> 00:13:39,120
So in this animation, you can see how convolution works with stride.

188
00:13:39,750 --> 00:13:45,660
Remember, convolution just means multiply and add with a sliding window. When we have convolution with

189
00:13:45,660 --> 00:13:46,650
a stride parameter,

190
00:13:46,980 --> 00:13:50,100
the stride tells us how far apart each window should be.

191
00:13:50,850 --> 00:13:56,310
And of course, our intuition tells us that if we use a stride of two, then the output image length

192
00:13:56,310 --> 00:13:58,140
will be half of what the input was.

193
00:13:59,560 --> 00:14:05,320
In other words, we get the same reduction in size by using a strided convolution instead of convolution

194
00:14:05,320 --> 00:14:06,190
followed by pooling.
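
As a rough illustration of strided convolution, the sketch below implements "multiply and add with a sliding window" in NumPy, with zero padding so that a stride of one preserves the image size. It assumes a single-channel image and a square, odd-sized filter; the function `conv2d_same` is hypothetical, and its symmetric padding is a simplification that is not meant to match any particular library's same-mode convention.

```python
import numpy as np

def conv2d_same(image, kernel, stride=1):
    """Sliding-window multiply-and-add with zero padding.

    Illustrative only: single channel, square odd-sized kernel, and no
    kernel flipping (as is usual in deep learning).
    """
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, pad)           # zero padding on all sides
    h, w = image.shape
    out_h = -(-h // stride)               # ceil(h / stride)
    out_w = -(-w // stride)
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride  # the stride sets the window spacing
            out[i, j] = np.sum(padded[r:r + k, c:c + k] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))
kernel = np.ones((3, 3)) / 9.0            # simple averaging filter

full = conv2d_same(image, kernel, stride=1)     # 32x32 output
strided = conv2d_same(image, kernel, stride=2)  # 16x16 output

print(full.shape, strided.shape)
```

In this sketch the stride-two output is exactly the stride-one output with every other row and column dropped, which is the "skipping" the lecture goes on to describe.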

195
00:14:11,250 --> 00:14:16,800
The intuition behind why this works is that if you consider what an image looks like, an image is just

196
00:14:16,800 --> 00:14:18,090
large patches of stuff.

197
00:14:18,840 --> 00:14:24,330
What I mean by that is each pixel is likely to have a similar value to its neighboring pixels.

198
00:14:24,930 --> 00:14:28,770
For example, suppose I'm looking at a red car, I find a red pixel.

199
00:14:29,400 --> 00:14:32,670
Now ask yourself, is the pixel above that also red?

200
00:14:33,210 --> 00:14:33,750
Yes.

201
00:14:34,020 --> 00:14:36,390
Is the pixel below that also probably red?

202
00:14:36,630 --> 00:14:37,200
Yes.

203
00:14:37,980 --> 00:14:40,770
The same can be said of the pixel to the left and to the right.

204
00:14:41,220 --> 00:14:42,810
That's just the nature of images.

205
00:14:43,950 --> 00:14:46,770
Pixels near each other are often very highly correlated.

206
00:14:47,430 --> 00:14:52,290
You won't have this random jumping around of pixel values because something like that wouldn't look

207
00:14:52,290 --> 00:14:53,220
like an image at all.

208
00:14:53,520 --> 00:14:55,170
That would probably just be noise.

209
00:14:55,740 --> 00:15:00,690
And so using a stride of two simply means having your filter skip every other pixel.

210
00:15:01,290 --> 00:15:07,140
It's saying we don't care what those two pixels are because they probably are very close to the pixels

211
00:15:07,140 --> 00:15:08,310
we were already looking at.

212
00:15:13,410 --> 00:15:19,410
So to summarize, the convolutional part of our convolutional neural network now looks like this: we

213
00:15:19,410 --> 00:15:25,170
can have a series of strided convolutions, all with approximately the same filter size, with increasing

214
00:15:25,170 --> 00:15:26,450
feature maps at each layer.

215
00:15:27,120 --> 00:15:32,670
Or we can have a series of convolutions, followed by pooling again, all with approximately the same

216
00:15:32,670 --> 00:15:36,390
filter size and with increasing numbers of feature maps at each layer.

217
00:15:41,390 --> 00:15:46,010
The last part of this lecture will focus on the second half of the CNN, which is the feedforward neural

218
00:15:46,010 --> 00:15:46,640
network part.

219
00:15:48,080 --> 00:15:53,000
Now, there's not much to discuss here, except that we have to consider that the shape of an image is not

220
00:15:53,000 --> 00:15:55,490
appropriate for a dense feedforward neural network.

221
00:15:56,270 --> 00:16:01,130
Remember that when an image comes out of a convolution, it's going to be three-dimensional: height by

222
00:16:01,130 --> 00:16:02,630
width by number of feature maps.

223
00:16:03,230 --> 00:16:07,730
But a feedforward neural network takes in a feature vector, a one-dimensional object.

224
00:16:08,600 --> 00:16:11,000
Luckily, we've already discussed how this works.

225
00:16:12,380 --> 00:16:17,900
We can turn a three-dimensional object into a one-dimensional object using the Flatten layer in Keras.

226
00:16:23,050 --> 00:16:27,730
One alternative to the Flatten layer, which I really like, is the global max pooling layer.

227
00:16:28,390 --> 00:16:31,690
This is a little different from the regular kind of pooling we discussed earlier.

228
00:16:32,680 --> 00:16:36,370
Consider the question: what if we have different sized images?

229
00:16:36,970 --> 00:16:39,220
The largest source of images is the internet.

230
00:16:40,300 --> 00:16:44,620
Generally speaking, images we find on the internet are not all the same size.

231
00:16:45,220 --> 00:16:50,650
So it makes sense to wonder: is it possible to build a convolutional neural network that is capable of

232
00:16:50,650 --> 00:16:52,240
handling different image sizes?

233
00:16:52,720 --> 00:16:53,920
And the answer is yes.

234
00:16:54,280 --> 00:16:55,930
Using a global max pooling layer.

235
00:17:00,990 --> 00:17:04,050
Let's consider first why a simple Flatten layer would not work.

236
00:17:04,890 --> 00:17:10,050
Suppose again, we have an input image of size 32 by 32 and four convolutions.

237
00:17:11,700 --> 00:17:15,280
Assume again, we use convolution in same mode, and we have a stride of two.

238
00:17:16,089 --> 00:17:19,150
So after the first convolution, we get a 16 by 16.

239
00:17:19,510 --> 00:17:21,880
After the second convolution, we get an eight by eight.

240
00:17:22,270 --> 00:17:26,560
After the third convolution, we get a four by four and after the fourth convolution, we get a two

241
00:17:26,560 --> 00:17:27,040
by two.

242
00:17:29,110 --> 00:17:32,620
But now let's say we have an input image of size 64 by 64.

243
00:17:33,280 --> 00:17:39,520
In this case, the output after four convolutions would be four by four. Let's say, for simplicity's

244
00:17:39,520 --> 00:17:41,980
sake, the number of output feature maps is 100.

245
00:17:43,460 --> 00:17:49,370
Then for the 32 by 32 image, doing a flattening operation would yield an output of size two by two

246
00:17:49,370 --> 00:17:55,280
by 100, that is, two times two times 100, which is 400. For the 64 by 64

247
00:17:55,310 --> 00:18:01,370
image, doing a flatten operation would yield an output of size four times four times 100, which is

248
00:18:01,370 --> 00:18:02,180
1600.

249
00:18:02,780 --> 00:18:08,150
And of course, a feedforward neural network is not capable of handling vectors of different sizes.

250
00:18:13,320 --> 00:18:16,050
Now, let's consider how global max pooling works.

251
00:18:16,650 --> 00:18:24,600
The idea is simple: if our input image is H by W by C, then the output is one by one by C, or equivalently

252
00:18:24,600 --> 00:18:25,980
just a vector of size C.

253
00:18:26,670 --> 00:18:32,640
In other words, it takes the max over the entire image along the spatial dimensions, over each feature

254
00:18:32,640 --> 00:18:33,060
map.

255
00:18:33,750 --> 00:18:39,540
Since we have C feature maps, we end up with what is essentially a one dimensional vector of size C

256
00:18:39,930 --> 00:18:41,640
since the dimensions of size one are redundant.

257
00:18:43,690 --> 00:18:47,560
In fact, TensorFlow and Keras get rid of the redundant dimensions for us.

258
00:18:48,370 --> 00:18:53,680
This follows the same general idea of pooling where we're saying I don't care where in the image the

259
00:18:53,680 --> 00:18:56,350
feature was found, just that it was found somewhere.

260
00:18:57,250 --> 00:19:02,380
Global max pooling is the extreme of this, saying that the feature can be found anywhere in the entire

261
00:19:02,380 --> 00:19:07,350
image, across the entire height and width. Since the output is always a vector of size C,

262
00:19:08,170 --> 00:19:13,720
it does not depend on the size of the image, meaning that it allows the network to handle images of

263
00:19:13,720 --> 00:19:14,530
any size.

264
00:19:16,240 --> 00:19:21,340
As a side note, we also have global average pooling, which is just the global version of regular average

265
00:19:21,340 --> 00:19:21,730
pooling.
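
The difference between flattening and global pooling can be checked with plain NumPy arrays standing in for the conv-stage outputs. This is a sketch under the lecture's assumptions (100 feature maps, a 2x2 map from the 32x32 input and a 4x4 map from the 64x64 input); the arrays hold random values used purely as placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the conv-stage output, shaped (height, width, feature maps).
small = rng.normal(size=(2, 2, 100))  # from a 32x32 input
large = rng.normal(size=(4, 4, 100))  # from a 64x64 input

# Flattening: the vector length depends on the spatial size.
flat_small = small.reshape(-1)        # length 2 * 2 * 100
flat_large = large.reshape(-1)        # length 4 * 4 * 100

# Global max pooling: max over height and width, one value per feature map.
gmp_small = small.max(axis=(0, 1))
gmp_large = large.max(axis=(0, 1))

# Global average pooling: same idea with the mean instead of the max.
gap_large = large.mean(axis=(0, 1))

print(flat_small.shape, flat_large.shape)  # different lengths
print(gmp_small.shape, gmp_large.shape)    # same length for both inputs
```

Because both globally pooled vectors have length 100, the same dense layers can follow either input size, which is the point made above.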

266
00:19:26,930 --> 00:19:32,060
The only exception to this is if your images are too small, in which case you'll just get an error

267
00:19:32,060 --> 00:19:33,230
when you try to run your code.

268
00:19:34,470 --> 00:19:39,630
For example, if your image starts out as a two by two, then it's not possible to reduce the dimensions

269
00:19:39,630 --> 00:19:41,280
by half four times.

270
00:19:41,880 --> 00:19:46,590
You'll notice this happens if you do something like add too many convolutions to your neural network.

271
00:19:51,820 --> 00:19:56,980
All right, so let's summarize the basic architecture of a modern convolutional neural network.

272
00:19:57,850 --> 00:20:03,670
The first stage is a series of strided convolutions, or optionally convolution followed by pooling,

273
00:20:03,670 --> 00:20:08,980
and repeating that pattern. In the interface between convolution layers and dense layers,

274
00:20:09,010 --> 00:20:09,820
we have a Flatten layer.

275
00:20:10,600 --> 00:20:15,970
We also optionally have a global max pooling that reduces the image to just the max for each feature

276
00:20:15,970 --> 00:20:16,360
map.

277
00:20:18,470 --> 00:20:23,930
Finally, once we've flattened our data into a one-dimensional vector, we have a regular feedforward neural

278
00:20:23,930 --> 00:20:27,230
network made up of a series of dense layers as usual.

279
00:20:28,100 --> 00:20:34,250
And as you learned in the previous section, the activation and number of nodes for the final layer depend

280
00:20:34,250 --> 00:20:35,690
on the type of task being done.

281
00:20:36,380 --> 00:20:41,630
So if you're doing scalar regression, then you'll only have one output node with no activation function.

282
00:20:42,200 --> 00:20:46,550
If you're doing binary classification, you could have one output node with a sigmoid.

283
00:20:47,240 --> 00:20:53,480
However, for K equals two or more classes, you can have K output nodes with a softmax activation.
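
The three output-layer choices just described can be written out as small NumPy functions. This is only a sketch of the activations themselves, not of a trained network; the input values are arbitrary numbers picked for illustration.

```python
import numpy as np

def sigmoid(z):
    """Squash a single score into a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turn K scores into K probabilities that sum to one."""
    shifted = z - np.max(z)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Scalar regression: one output node, no activation (the raw value is used).
regression_output = 0.73      # arbitrary example value

# Binary classification: one output node with a sigmoid.
binary_prob = sigmoid(0.73)

# K-class classification (here K = 3): K output nodes with a softmax.
class_probs = softmax(np.array([2.0, 1.0, 0.1]))

print(regression_output, binary_prob, class_probs)
```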

