1
00:00:11,580 --> 00:00:17,070
Now that you understand how exactly a convolution layer works, including the bias term and activation

2
00:00:17,070 --> 00:00:22,290
function, we can now consider the architecture of a convolutional neural network and why it's that

3
00:00:22,290 --> 00:00:22,640
way.

4
00:00:23,460 --> 00:00:29,040
So as a little bit of a history lesson, modern CNNs essentially all originated from the same

5
00:00:29,040 --> 00:00:30,060
model, LeNet.

6
00:00:30,780 --> 00:00:36,660
This is named after Yann LeCun, one of the original deep learning pioneers, along with Geoffrey Hinton

7
00:00:36,660 --> 00:00:37,830
and Yoshua Bengio.

8
00:00:38,550 --> 00:00:43,770
What we're going to do in this lecture is just start with what the architecture is and then go through

9
00:00:43,770 --> 00:00:47,370
it piece by piece so that you understand why it is the way it is.

10
00:00:52,390 --> 00:00:57,430
OK, so let's look at a typical CNN. A typical CNN has two stages.

11
00:00:57,970 --> 00:01:00,980
The first stage is a series of convolutional layers.

12
00:01:01,570 --> 00:01:06,910
Importantly, these convolutional layers are usually followed by pooling layers, which is another type

13
00:01:06,910 --> 00:01:08,290
of layer we'll discuss shortly.

14
00:01:09,840 --> 00:01:14,100
The second stage is a series of dense layers, also called fully connected layers.

15
00:01:14,460 --> 00:01:18,510
These are just a regular feedforward neural network, as you saw in the previous section.

16
00:01:19,680 --> 00:01:24,410
If you recall, I said in the previous section that neural networks have this hierarchical structure.

17
00:01:24,930 --> 00:01:28,320
Each layer is always looking at features that were found in the previous layer.

18
00:01:29,730 --> 00:01:33,850
So you can even think of the first stage as just another feature transformer.

19
00:01:34,440 --> 00:01:39,120
It's a feature transformer that specifically works on images and finds image features.

20
00:01:39,840 --> 00:01:44,850
Then once it's found those image features, it passes those into a neural network and then the neural

21
00:01:44,850 --> 00:01:48,990
network does what it does best, nonlinear classification or regression.

22
00:01:54,110 --> 00:01:56,790
So let's have a closer look at the convolutional stage.

23
00:01:57,500 --> 00:02:01,080
Now, I just added a new layer without really explaining it, the pooling layer.

24
00:02:01,820 --> 00:02:07,490
Why would we want to do that? At a high level, pooling is a downsampling operation.

25
00:02:08,270 --> 00:02:09,270
What is downsampling?

26
00:02:10,010 --> 00:02:12,830
That's when you make a smaller image out of a bigger image.

27
00:02:13,580 --> 00:02:17,890
So imagine your image is 100 by 100. After pooling,

28
00:02:17,900 --> 00:02:21,000
if the pool size is two, then you would get a 50 by 50.

29
00:02:21,140 --> 00:02:22,780
That's downsampling by two.

30
00:02:23,600 --> 00:02:28,520
Let's go over mechanically how this works and then we'll explain why it works and why we would want

31
00:02:28,520 --> 00:02:30,680
to include it in our neural network.

32
00:02:35,750 --> 00:02:41,510
So there are two kinds of pooling: max pooling and average pooling. Choosing which one to use is,

33
00:02:41,540 --> 00:02:44,150
as you probably guessed, a hyperparameter choice.

34
00:02:44,810 --> 00:02:50,420
Let's talk about max pooling first, since I think this one's a bit more common. The best way to see

35
00:02:50,420 --> 00:02:51,930
how it works is by example.

36
00:02:52,730 --> 00:02:57,950
So if we start with an image, let's say it's a four by four and contains these numbers, one, two,

37
00:02:57,950 --> 00:03:02,600
three, four, five, six, seven, eight, 16, 15, 14, 13, 12, 11, 10, nine.

38
00:03:02,990 --> 00:03:06,740
So just the numbers one to 16, but organized nonsequentially.

39
00:03:08,420 --> 00:03:14,480
What max pooling will do is take each two by two square, of which there are four, and just return the maximum

40
00:03:14,480 --> 00:03:16,100
value in each of those squares.

41
00:03:17,920 --> 00:03:24,310
So the max of one, two, five, six is six, the max of three, four, seven and eight is eight, the

42
00:03:24,310 --> 00:03:30,760
max of 16, 15, 12, 11 is 16 and the max of 14, 13, 10, nine is 14.

43
00:03:32,730 --> 00:03:35,610
So the output image is six, eight, 16, 14.
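The max pooling example above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the function name `max_pool_2x2` is made up for illustration, and it assumes a pool size of two with a stride of two, as in the lecture.

```python
def max_pool_2x2(image):
    """Max pooling with a 2x2 pool size and a stride of 2 (non-overlapping boxes)."""
    pooled = []
    for r in range(0, len(image) - 1, 2):
        row = []
        for c in range(0, len(image[0]) - 1, 2):
            # Collect the four values inside the current 2x2 box.
            box = [image[r][c], image[r][c + 1],
                   image[r + 1][c], image[r + 1][c + 1]]
            row.append(max(box))
        pooled.append(row)
    return pooled


# The 4x4 image from the lecture: the numbers 1 to 16, organized nonsequentially.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [16, 15, 14, 13],
    [12, 11, 10, 9],
]

print(max_pool_2x2(image))  # [[6, 8], [16, 14]]
```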

44
00:03:40,590 --> 00:03:45,180
As you might have guessed, average pooling is similar, but instead of taking the max, we take the

45
00:03:45,180 --> 00:03:45,740
average.

46
00:03:46,320 --> 00:03:51,690
So for the same input image, we have the following: the average of one, two, five and six is three point

47
00:03:51,690 --> 00:03:52,170
five.

48
00:03:52,770 --> 00:03:55,830
The average of three, four, seven and eight is five point five.

49
00:03:56,400 --> 00:04:01,350
The average of sixteen, fifteen, twelve, eleven is thirteen point five and the average of thirteen,

50
00:04:01,350 --> 00:04:03,620
fourteen, ten, nine is eleven point five.

51
00:04:04,110 --> 00:04:09,090
So the output image is three point five, five point five, thirteen point five, eleven point five.
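The same arithmetic for average pooling, again as a rough sketch in plain Python. The helper name `average_pool_2x2` is an illustrative assumption, and the pool size and stride are both fixed at two.

```python
def average_pool_2x2(image):
    """Average pooling with a 2x2 pool size and a stride of 2."""
    pooled = []
    for r in range(0, len(image) - 1, 2):
        row = []
        for c in range(0, len(image[0]) - 1, 2):
            # Sum the four values inside the current 2x2 box, then divide by 4.
            total = (image[r][c] + image[r][c + 1]
                     + image[r + 1][c] + image[r + 1][c + 1])
            row.append(total / 4)
        pooled.append(row)
    return pooled


# Same 4x4 input image as in the max pooling example.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [16, 15, 14, 13],
    [12, 11, 10, 9],
]

print(average_pool_2x2(image))  # [[3.5, 5.5], [13.5, 11.5]]
```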

52
00:04:14,240 --> 00:04:17,560
Now that you know how pooling works, let's talk about why we should use it.

53
00:04:18,290 --> 00:04:20,410
The first advantage of pooling is practical.

54
00:04:20,930 --> 00:04:27,680
If we downsample the image, the image shrinks and therefore we have less data to process, fewer multiplications

55
00:04:27,680 --> 00:04:28,050
to do.

56
00:04:28,070 --> 00:04:29,740
And of course, that will speed things up.

57
00:04:30,530 --> 00:04:34,700
But the second and more important advantage has to do with image processing itself.

58
00:04:35,420 --> 00:04:38,060
Recall this idea of translational invariance.

59
00:04:40,490 --> 00:04:45,680
This is the idea that I don't care where in an image the feature occurred, I just care that it occurred.

60
00:04:46,190 --> 00:04:50,250
For example, both you and I can recognize this as an A.

61
00:04:51,290 --> 00:04:53,340
We also recognize this as an A.

62
00:04:53,960 --> 00:04:56,330
It doesn't matter where I put the A on the screen.

63
00:04:56,960 --> 00:05:02,830
So translational invariance is important from a biological point of view and how we as humans learn.

64
00:05:03,380 --> 00:05:07,300
We don't have to learn how to recognize A's at each point in our field of vision.

65
00:05:07,850 --> 00:05:12,560
Once we know what something looks like, we can recognize it no matter if it's up, down, left or right.

66
00:05:17,580 --> 00:05:18,960
So what is Max Pooling doing?

67
00:05:19,620 --> 00:05:25,200
Well, consider a small two by two box. Remember that convolution tells us whether or not a feature

68
00:05:25,200 --> 00:05:26,070
has been found.

69
00:05:26,100 --> 00:05:27,110
It's a pattern finder.

70
00:05:27,720 --> 00:05:29,790
So imagine that in this two by two box,

71
00:05:29,790 --> 00:05:31,680
I have these flags: pattern found,

72
00:05:31,680 --> 00:05:32,250
not found,

73
00:05:32,250 --> 00:05:32,720
not found,

74
00:05:32,730 --> 00:05:33,340
not found.

75
00:05:34,050 --> 00:05:36,210
So it's a high number and a small number

76
00:05:36,210 --> 00:05:36,840
and a small number

77
00:05:36,840 --> 00:05:37,540
and a small number.

78
00:05:38,280 --> 00:05:40,740
The pattern being found returns a higher number.

79
00:05:41,130 --> 00:05:44,250
So when I take the max, it says the pattern has been found.

80
00:05:44,830 --> 00:05:49,990
That's good, because it tells me that the pattern has been found without caring where it was found.

81
00:05:50,610 --> 00:05:52,290
Average pooling is the same idea.

82
00:05:52,410 --> 00:05:55,320
But I think max pooling is a little more intuitive in this regard.

83
00:06:00,440 --> 00:06:06,080
As a side note, keep in mind that in this lecture, I assumed that we're using a pool size of two and

84
00:06:06,080 --> 00:06:09,980
that at each pooling layer we select boxes which are two spaces apart.

85
00:06:11,480 --> 00:06:14,240
Now, pooling layers have some flexibility in this regard.

86
00:06:14,990 --> 00:06:18,680
First, it's not necessary for the pool size to be the same in both directions.

87
00:06:19,130 --> 00:06:23,870
For example, you could look at a two by three box or a three by two box, although generally that's

88
00:06:23,870 --> 00:06:24,650
unconventional.

89
00:06:25,610 --> 00:06:28,790
Second is that it's possible for the boxes to overlap.

90
00:06:29,420 --> 00:06:32,330
The hyperparameter that controls this is called the stride.

91
00:06:32,840 --> 00:06:36,670
It tells us how many spaces to move the pooling box for each output.

92
00:06:38,660 --> 00:06:43,760
Previously, we looked at the situation where we had a pool size of two by two and the stride was also

93
00:06:43,760 --> 00:06:44,150
two.

94
00:06:44,720 --> 00:06:46,310
This is a pretty common scenario.

95
00:06:47,030 --> 00:06:50,340
If you had a stride of one, then our two by two blocks would overlap.

96
00:06:50,900 --> 00:06:52,290
This is not really too common.

97
00:06:52,970 --> 00:06:58,100
So while you have these options, which I want to make you aware of, they are usually not changed too

98
00:06:58,100 --> 00:06:58,420
often.

99
00:06:59,090 --> 00:07:02,090
So what you see in this lecture is what we do most often.

100
00:07:02,510 --> 00:07:08,750
We use a pool size of two in each direction and we have a stride of two so that we don't have any overlapping

101
00:07:08,750 --> 00:07:09,410
boxes.
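To make the pool size and stride options concrete, here is a small sketch of a general pooling function in plain Python. The name `pool2d` and its parameters are assumptions made for illustration, not the API of any library. It assumes no padding, so the output length along each axis is (input - pool) // stride + 1.

```python
def pool2d(image, pool_size=(2, 2), stride=2, reduce_fn=max):
    """Slide a pool_h x pool_w box over the image and reduce each box to one value."""
    pool_h, pool_w = pool_size
    # Without padding, this is how many boxes fit along each axis.
    out_h = (len(image) - pool_h) // stride + 1
    out_w = (len(image[0]) - pool_w) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            box = [image[top + di][left + dj]
                   for di in range(pool_h) for dj in range(pool_w)]
            row.append(reduce_fn(box))
        output.append(row)
    return output


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [16, 15, 14, 13],
    [12, 11, 10, 9],
]

# Stride of two: the boxes do not overlap and the 4x4 image becomes 2x2.
print(pool2d(image, stride=2))  # [[6, 8], [16, 14]]

# Stride of one: the boxes overlap and the 4x4 image becomes 3x3.
print(pool2d(image, stride=1))  # [[6, 7, 8], [16, 15, 14], [16, 15, 14]]
```

Passing a different `pool_size`, such as `(2, 3)`, gives the non-square boxes mentioned above.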

102
00:07:14,530 --> 00:07:19,540
So now that you know what pooling does and why it works, let's discuss why the convolutional part of

103
00:07:19,540 --> 00:07:22,010
the CNN is organized the way it is.

104
00:07:22,570 --> 00:07:27,460
Why should we have a sequential pattern consisting of a conv layer followed by a pooling layer, followed

105
00:07:27,460 --> 00:07:30,290
by a conv layer, followed by another pooling layer and so forth?

106
00:07:31,030 --> 00:07:36,760
If you recall, I told you earlier that one cool thing researchers discovered is that CNNs are able

107
00:07:36,760 --> 00:07:41,320
to learn features hierarchically. So the initial layers end up learning basic strokes.

108
00:07:41,680 --> 00:07:46,240
The next layer ends up learning individual facial features such as nose, eyes and lips.

109
00:07:46,810 --> 00:07:49,030
The next layer ends up learning whole faces.

110
00:07:54,100 --> 00:07:59,680
The key point to keep in mind is that after each pooling, the image shrinks, but the filter sizes

111
00:07:59,950 --> 00:08:02,710
generally stay around the same. Common

112
00:08:02,710 --> 00:08:06,940
filter sizes are three by three, five by five and seven by seven.

113
00:08:07,870 --> 00:08:10,590
Generally speaking, they are much smaller than the input image.

114
00:08:11,200 --> 00:08:14,920
But what happens as the image shrinks, as it passes through each layer?

115
00:08:15,670 --> 00:08:21,460
Well, let's suppose we start with a 32 by 32 image and we have four convolutional and pooling layers.

116
00:08:22,150 --> 00:08:26,140
Let's assume we do convolution in same mode so it doesn't change the size of the image.

117
00:08:27,280 --> 00:08:32,020
And let's assume all our convolution filters are size three by three, although this doesn't come into

118
00:08:32,020 --> 00:08:32,730
play just yet.

119
00:08:34,150 --> 00:08:40,990
In this case, after the first convolution and pooling, the image shrinks down to 16 by 16. After the

120
00:08:40,990 --> 00:08:44,740
second convolution and pooling, the image shrinks down to eight by eight.

121
00:08:45,460 --> 00:08:48,960
After the third convolution and pooling, the image shrinks down to four by four.

122
00:08:49,390 --> 00:08:53,770
And finally, after a fourth convolution and pooling, the image shrinks down to two by two.
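The size progression just described can be checked with a short sketch. It assumes, as the lecture does, that each convolution is in same mode (so it leaves the size unchanged) and that each pooling uses a pool size and stride of two. The function name `conv_pool_sizes` is hypothetical.

```python
def conv_pool_sizes(size, num_blocks, pool_size=2):
    """Return the image side length after each convolution + pooling block."""
    sizes = [size]
    for _ in range(num_blocks):
        # Same-mode convolution keeps the size; pooling divides it by pool_size.
        size = size // pool_size
        sizes.append(size)
    return sizes


print(conv_pool_sizes(32, 4))  # [32, 16, 8, 4, 2]
```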

123
00:08:55,970 --> 00:08:58,080
What happens after this is still a mystery.

124
00:08:58,100 --> 00:09:01,070
So let's just leave that for now and focus on the image shrinking.

125
00:09:06,200 --> 00:09:12,620
As you can see, if you take a three by three filter and place it over a 32 by 32 image, it occupies

126
00:09:12,620 --> 00:09:14,610
only a very small portion of the image.

127
00:09:15,170 --> 00:09:20,750
In other words, it's looking for very tiny patterns to match, like, for example, simple edges and

128
00:09:20,750 --> 00:09:21,370
strokes.

129
00:09:22,190 --> 00:09:25,130
Then the image becomes size 16 by 16.

130
00:09:25,970 --> 00:09:31,100
Now, all of a sudden, the three by three filter takes up four times as much space in the image.

131
00:09:31,580 --> 00:09:35,360
It takes up twice as much space in each direction and two times two is four.

132
00:09:35,990 --> 00:09:40,790
In any case, it occupies a much larger portion of the image relative to the first filter.

133
00:09:42,590 --> 00:09:48,110
So now it's looking for patterns which take up more space on the image, patterns like nose, eyes and

134
00:09:48,110 --> 00:09:48,560
lips.

135
00:09:50,840 --> 00:09:56,300
Next, we get to an eight by eight image. A three by three filter occupies almost a quarter of the

136
00:09:56,300 --> 00:09:56,840
image.

137
00:09:58,540 --> 00:10:04,300
After that convolution and pooling, we get a four by four image. Now, a three by three filter takes

138
00:10:04,300 --> 00:10:05,530
up most of the image.

139
00:10:10,670 --> 00:10:11,790
So what have we learned?

140
00:10:12,380 --> 00:10:17,570
We've learned that in a convolutional neural network, convolution and pooling results in two things

141
00:10:17,570 --> 00:10:19,050
which are corollaries of each other.

142
00:10:19,700 --> 00:10:21,670
First, the image input shrinks.

143
00:10:22,250 --> 00:10:27,890
Second, since filters generally stay the same size, the filter looks for larger and larger patterns

144
00:10:27,890 --> 00:10:28,640
in the image.

145
00:10:29,150 --> 00:10:33,190
This is what leads to the CNN learning hierarchical features of the input.

146
00:10:33,680 --> 00:10:37,910
It first learns things that are relatively small compared to the total size of the image.

147
00:10:38,350 --> 00:10:41,590
Then it learns to find bigger patterns and bigger patterns and so forth.

148
00:10:46,580 --> 00:10:51,950
The last concept regarding the convolutional layers we have to think about is: are we not losing information

149
00:10:51,950 --> 00:10:54,220
at every layer if we shrink the image?

150
00:10:54,710 --> 00:10:56,450
And the answer is yes, that's true.

151
00:10:56,930 --> 00:11:00,020
We are losing spatial information because the image keeps shrinking.

152
00:11:00,650 --> 00:11:05,750
We're saying we don't care where in the image the feature was found, just that it was found somewhere

153
00:11:05,750 --> 00:11:06,520
in the image.

154
00:11:06,860 --> 00:11:09,580
So we're losing information about where it was.

155
00:11:10,310 --> 00:11:15,140
But there's another component of the convolution process we haven't considered yet, which is the number

156
00:11:15,140 --> 00:11:15,960
of feature maps.

157
00:11:16,640 --> 00:11:22,310
Again, we have a general pattern to follow, which is that while the size of the image generally decreases,

158
00:11:22,610 --> 00:11:25,160
the number of feature maps generally increases.

159
00:11:25,700 --> 00:11:29,720
In other words, I don't care where the feature is found, I just care that it was found.

160
00:11:31,140 --> 00:11:34,380
But I do care about different possible features I could find.

161
00:11:35,590 --> 00:11:40,810
So while you're losing spatial information, you're gaining information in terms of what features were

162
00:11:40,810 --> 00:11:41,800
found in the image.

163
00:11:46,850 --> 00:11:52,370
One last note on this convolution and pooling pattern. One of the hardest things to grasp for new students

164
00:11:52,370 --> 00:11:55,050
in deep learning is that we have so many choices.

165
00:11:55,070 --> 00:11:56,570
We call these hyperparameters.

166
00:11:57,140 --> 00:12:01,520
Previously, for instance, we looked at the learning rate, the number of hidden layers and the number

167
00:12:01,520 --> 00:12:02,440
of hidden units.

168
00:12:02,870 --> 00:12:07,480
But now with CNNs, we have a ton more choices. With convolution,

169
00:12:07,520 --> 00:12:09,080
we have to choose the filter size.

170
00:12:09,320 --> 00:12:13,790
We have to choose the number of feature maps, we have to choose the pool size and so forth.

171
00:12:14,300 --> 00:12:19,080
But luckily, I think this is one instance where there's a general pattern that most people follow.

172
00:12:19,340 --> 00:12:21,410
So you're not going in completely blind.

173
00:12:22,890 --> 00:12:28,230
So stick with these guidelines and you can be sure that you're at least doing things per accepted convention.

174
00:12:28,860 --> 00:12:34,620
That is, choosing small filters relative to the image, like three by three, five by five or seven by

175
00:12:34,620 --> 00:12:34,980
seven.

176
00:12:35,850 --> 00:12:39,780
Another guideline is to repeat the pattern: convolution followed by pooling.

177
00:12:40,530 --> 00:12:44,510
Another guideline is to increase the number of feature maps at each convolution.

178
00:12:45,000 --> 00:12:49,750
So you might start with 32 and then 64 and 128 and maybe 128

179
00:12:49,770 --> 00:12:55,980
again. The best way to learn about how other people are doing it is to read lots of papers and check

180
00:12:55,980 --> 00:12:57,720
out which hyperparameters they selected.

181
00:12:58,530 --> 00:13:02,940
Generally, you'll find that this is the pattern followed by most convolutional networks.

182
00:13:07,970 --> 00:13:13,430
Next, I want to mention something a little funny, but in fact, pooling is sometimes not what we are

183
00:13:13,430 --> 00:13:15,070
going to end up using in our CNN.

184
00:13:16,130 --> 00:13:20,630
Researchers have found that we can do something similar, but which is more efficient and sometimes

185
00:13:20,630 --> 00:13:21,550
works just as well.

186
00:13:22,220 --> 00:13:28,010
In particular, recall that for pooling, we have this concept of stride that tells us how far apart

187
00:13:28,010 --> 00:13:30,290
each box should be when we take the max.

188
00:13:30,980 --> 00:13:33,910
In fact, Convolution has a stride option as well.

189
00:13:35,300 --> 00:13:41,300
So in this animation, you can see how convolution works with stride. Remember, convolution just means

190
00:13:41,300 --> 00:13:43,420
multiplying and adding with a sliding window.

191
00:13:44,240 --> 00:13:49,790
When we have convolution with a stride parameter, the stride tells us how far apart each window should

192
00:13:49,790 --> 00:13:50,110
be.

193
00:13:50,840 --> 00:13:56,330
And of course, our intuition tells us that if we use a stride of two, then the output image length

194
00:13:56,330 --> 00:13:58,190
will be half of what the input was.

195
00:13:59,530 --> 00:14:05,350
In other words, we get the same reduction in size by using a strided convolution instead of convolution

196
00:14:05,350 --> 00:14:06,220
followed by pooling.
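As a rough illustration of "multiplying and adding with a sliding window" when a stride is involved, here is a one-dimensional sketch in plain Python. The function name `strided_conv1d` is made up for this example. It assumes no padding (valid mode rather than the same mode used elsewhere in the lecture) and does not flip the kernel, which is the usual deep learning convention.

```python
def strided_conv1d(signal, kernel, stride=1):
    """Multiply-and-add over a sliding window, moving `stride` steps each time."""
    k = len(kernel)
    output = []
    for start in range(0, len(signal) - k + 1, stride):
        window = signal[start:start + k]
        output.append(sum(x * w for x, w in zip(window, kernel)))
    return output


signal = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 1]

# Stride of one: every window position is used.
print(strided_conv1d(signal, kernel, stride=1))  # [3, 5, 7, 9, 11, 13, 15]

# Stride of two: every other window is skipped, so 8 inputs give 4 outputs.
print(strided_conv1d(signal, kernel, stride=2))  # [3, 7, 11, 15]
```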

197
00:14:11,250 --> 00:14:16,830
The intuition behind why this works is that if you consider what an image looks like, an image is just

198
00:14:16,830 --> 00:14:18,160
large patches of stuff.

199
00:14:18,840 --> 00:14:24,340
What I mean by that is each pixel is likely to have a similar value to its neighboring pixels.

200
00:14:24,930 --> 00:14:28,770
For example, suppose I'm looking at a red car, I find a red pixel.

201
00:14:29,370 --> 00:14:32,680
Now, ask yourself, is the pixel above that also red?

202
00:14:33,210 --> 00:14:33,760
Yes.

203
00:14:34,020 --> 00:14:36,390
Is the pixel below that also probably red?

204
00:14:36,610 --> 00:14:37,230
Yes.

205
00:14:37,980 --> 00:14:40,800
The same can be said of the pixel to the left and to the right.

206
00:14:41,190 --> 00:14:42,840
That's just the nature of images.

207
00:14:43,980 --> 00:14:46,770
Pixels near each other are often very highly correlated.

208
00:14:47,430 --> 00:14:52,320
You won't have this random jumping around of pixel values because something like that wouldn't look

209
00:14:52,320 --> 00:14:53,230
like an image at all.

210
00:14:53,490 --> 00:14:55,180
That would probably just be noise.

211
00:14:55,740 --> 00:15:00,710
And so using a stride of two simply means having your filter skip every other pixel.

212
00:15:01,260 --> 00:15:07,290
It's saying we don't care what those pixels are because they probably are very close to the pixels we

213
00:15:07,290 --> 00:15:08,310
were already looking at.

214
00:15:13,380 --> 00:15:19,440
So to summarize, the convolutional part of our convolutional neural network now looks like this: we

215
00:15:19,440 --> 00:15:25,200
can have a series of strided convolutions, all with approximately the same filter size, with increasing

216
00:15:25,200 --> 00:15:26,460
feature maps at each layer.

217
00:15:27,120 --> 00:15:32,670
Or we can have a series of convolutions followed by pooling again, all with approximately the same

218
00:15:32,670 --> 00:15:36,390
filter size and with increasing numbers of feature maps at each layer.

219
00:15:41,390 --> 00:15:46,040
The last part of this lecture will focus on the second half of the CNN, which is the feedforward neural

220
00:15:46,040 --> 00:15:46,640
network part.

221
00:15:48,060 --> 00:15:53,010
Now, there's not much to discuss here except that we have to consider that the shape of an image is not

222
00:15:53,010 --> 00:15:55,500
appropriate for a dense feedforward neural network.

223
00:15:56,220 --> 00:16:01,140
Remember that when an image comes out of a convolution, it's going to be three dimensional: height by

224
00:16:01,140 --> 00:16:07,110
width by number of feature maps. But a feedforward network takes in a feature vector, a one dimensional

225
00:16:07,110 --> 00:16:07,750
object.

226
00:16:08,580 --> 00:16:11,040
Luckily, we've already discussed how this works.

227
00:16:12,410 --> 00:16:17,960
We can turn a three dimensional object into a one dimensional object using the Flatten layer in Keras.

228
00:16:23,050 --> 00:16:28,810
One alternative to the Flatten layer, which I really like, is the global max pooling layer. This is

229
00:16:28,810 --> 00:16:31,730
a little different from the regular kind of pooling we discussed earlier.

230
00:16:32,680 --> 00:16:36,410
Consider the question, what if we have different sized images?

231
00:16:37,000 --> 00:16:39,270
The largest source of images is the Internet.

232
00:16:40,330 --> 00:16:44,650
Generally speaking, images we find on the Internet are not all the same size.

233
00:16:45,200 --> 00:16:51,130
So it makes sense to wonder is it possible to build a convolutional network that is capable of handling

234
00:16:51,130 --> 00:16:52,280
different image sizes?

235
00:16:52,690 --> 00:16:53,950
And the answer is yes.

236
00:16:54,310 --> 00:16:55,960
Using a global max pooling layer.

237
00:17:01,000 --> 00:17:04,090
Let's consider first why a simple Flatten layer would not work.

238
00:17:04,900 --> 00:17:10,090
Suppose again, we have an input image of size 32 by 32 and four convolutions.

239
00:17:11,640 --> 00:17:17,640
Assume again we use convolution in same mode and we have a stride of two. So after the first convolution,

240
00:17:17,640 --> 00:17:19,140
we get a 16 by 16.

241
00:17:19,500 --> 00:17:21,870
After the second convolution, we get an eight by eight.

242
00:17:22,230 --> 00:17:24,530
After the third convolution, we get a four by four.

243
00:17:24,810 --> 00:17:27,060
And after the fourth convolution, we get a two by two.

244
00:17:29,110 --> 00:17:32,630
But now let's say we have an input image of, say, 64 by 64.

245
00:17:33,250 --> 00:17:37,110
In this case, the output after four convolutions would be four by four.

246
00:17:38,380 --> 00:17:42,010
Let's say for simplicity's sake, the number of output feature maps is one hundred.

247
00:17:43,460 --> 00:17:49,400
Then for the 32 by 32 image, doing a flatten operation would yield an output of size two by two

248
00:17:49,400 --> 00:17:56,750
by one hundred: two times two times 100, which is four hundred. For a 64 by 64 image, doing a flatten

249
00:17:56,750 --> 00:18:02,210
operation would yield an output of size four times four times 100, which is sixteen hundred.
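The flatten arithmetic can be verified with a short sketch. It assumes four same-mode convolutions with a stride of two and one hundred output feature maps, as in the example above. The helper names `output_side` and `flatten_length` are hypothetical.

```python
def output_side(side, num_strided_convs, stride=2):
    """Side length of the image after a series of same-mode strided convolutions."""
    for _ in range(num_strided_convs):
        side //= stride
    return side


def flatten_length(side, feature_maps):
    """Length of the vector produced by flattening a side x side x feature_maps output."""
    return side * side * feature_maps


for input_side in (32, 64):
    side = output_side(input_side, num_strided_convs=4)
    print(input_side, "->", side, "->", flatten_length(side, feature_maps=100))
# 32 -> 2 -> 400
# 64 -> 4 -> 1600
```

The two flattened lengths differ, which is why the dense layers after a flatten cannot accept both image sizes.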

250
00:18:02,750 --> 00:18:08,210
And of course, a feedforward neural network is not capable of handling vectors of different sizes.

251
00:18:13,320 --> 00:18:16,050
Now, let's consider how global max pooling works.

252
00:18:16,680 --> 00:18:24,630
The idea is simple: if our input image is H by W by C, then the output is one by one by C, or equivalently

253
00:18:24,630 --> 00:18:25,990
just a vector of size C.

254
00:18:26,670 --> 00:18:32,670
In other words, it takes the max over the entire image, along the spatial dimensions, over each feature

255
00:18:32,670 --> 00:18:33,080
map.

256
00:18:33,750 --> 00:18:40,230
Since we have C feature maps, we end up with what is essentially a one dimensional vector of size C, since

257
00:18:40,230 --> 00:18:41,670
the one dimensions are redundant.

258
00:18:43,660 --> 00:18:49,990
In fact, TensorFlow and Keras get rid of the redundant dimensions for us. This follows the same general

259
00:18:49,990 --> 00:18:55,240
idea of pooling where we're saying I don't care where in the image the feature was found, just that

260
00:18:55,240 --> 00:18:56,410
it was found somewhere.

261
00:18:57,250 --> 00:19:02,410
Global max pooling is the extreme of this, saying that the feature can be found anywhere in the entire

262
00:19:02,410 --> 00:19:04,750
image across the entire height and width.

263
00:19:05,380 --> 00:19:11,410
Since the output is always a vector of size C, it does not depend on the size of the image, meaning

264
00:19:11,410 --> 00:19:14,560
that it allows the network to handle images of any size.
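Here is a minimal sketch of global max pooling in plain Python, using nested lists indexed as height, width, feature map. The function name `global_max_pool` is an assumption for illustration and is not the Keras layer itself.

```python
def global_max_pool(feature_maps):
    """Take the max over all spatial positions, separately for each feature map.

    feature_maps[h][w][c] is the value at row h, column w, feature map c.
    The result is a plain list of length C.
    """
    num_channels = len(feature_maps[0][0])
    return [
        max(pixel[c] for row in feature_maps for pixel in row)
        for c in range(num_channels)
    ]


# A 2 x 2 "image" with C = 3 feature maps.
small = [
    [[1, 9, 0], [4, 2, 5]],
    [[7, 3, 8], [6, 0, 2]],
]

# A 3 x 1 "image" with the same C = 3 feature maps.
tall = [
    [[0, 1, 2]],
    [[3, 0, 1]],
    [[2, 2, 9]],
]

print(global_max_pool(small))  # [7, 9, 8]
print(global_max_pool(tall))   # [3, 2, 9]
```

Both outputs have length three even though the spatial sizes differ, which is the property that lets the dense layers handle images of different sizes.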

265
00:19:16,210 --> 00:19:21,370
As a side note, we also have global average pooling, which is just the global version of regular average

266
00:19:21,370 --> 00:19:21,760
pooling.

267
00:19:26,960 --> 00:19:32,060
The only exception to this is if your images are too small, in which case you'll just get an error

268
00:19:32,060 --> 00:19:33,260
when you try to run your code.

269
00:19:34,470 --> 00:19:39,630
For example, if your image starts out as a two by two, then it's not possible to reduce the dimensions

270
00:19:39,630 --> 00:19:41,290
by half four times.

271
00:19:41,850 --> 00:19:46,620
You'll notice this happens if you do something like add too many convolutions to your neural network.

272
00:19:51,250 --> 00:19:56,560
Now, for most of this lecture we've been discussing CNNs in the context of images, which is more

273
00:19:56,560 --> 00:19:59,210
natural and it's also where CNNs originated.

274
00:19:59,770 --> 00:20:04,920
We could use these for time series, but a more direct way is to use one D convolutions.

275
00:20:06,400 --> 00:20:12,960
So just for completeness' sake, let's look at the shape progression for a time series using one D convolutions.

276
00:20:14,350 --> 00:20:19,920
So suppose that we start with some multivariate or univariate time series of shape T by D.

277
00:20:20,440 --> 00:20:23,170
If it's univariate, then D is just equal to one.

278
00:20:24,400 --> 00:20:28,170
Let's also suppose for simplicity that T is equal to one twenty eight.

279
00:20:28,960 --> 00:20:31,910
The next step is to pass this through some convolution layer.

280
00:20:32,800 --> 00:20:37,060
Suppose that we use a filter of length three with thirty two output feature maps.

281
00:20:37,570 --> 00:20:41,370
Let's also suppose for simplicity that we are using same mode convolution.

282
00:20:42,070 --> 00:20:47,920
In this case, the output of this convolutional layer will be of size T by thirty two or one, twenty

283
00:20:47,920 --> 00:20:49,000
eight by thirty two.

284
00:20:49,870 --> 00:20:55,030
The next step is to pass this through a pooling layer with a pool size of two and a stride of two.

285
00:20:55,690 --> 00:21:01,210
This will cut the time dimension in half, so the result is now sixty four by thirty two.

286
00:21:02,770 --> 00:21:05,740
The next step is to pass this through another convolution.

287
00:21:06,220 --> 00:21:12,760
Suppose that this again is same mode convolution with sixty four feature maps. The output of this

288
00:21:12,760 --> 00:21:15,220
layer will be sixty four by sixty four.

289
00:21:16,790 --> 00:21:20,870
We again pass this through a pooling layer, which reduces the time dimension by half

290
00:21:21,050 --> 00:21:25,340
once again. The output of this is thirty two by sixty four.

291
00:21:27,290 --> 00:21:32,450
Let's suppose that we have one final convolution layer with one hundred twenty eight feature maps and

292
00:21:32,450 --> 00:21:37,910
everything else the same as before. The output of this layer will be thirty two by one twenty eight.

293
00:21:38,690 --> 00:21:42,490
Now, let's suppose at this point we're ready to apply a dense neural network.

294
00:21:42,890 --> 00:21:46,700
So instead of regular pooling, we're going to apply a global max pooling.

295
00:21:47,420 --> 00:21:51,520
As you recall, global max pooling reduces the time dimension to one.

296
00:21:52,280 --> 00:21:57,540
So the output of this will be one by one twenty eight or equivalently, just one twenty eight.

297
00:21:58,850 --> 00:22:04,310
After this, we pass the 128 dimensional feature vector as normal through a regular ANN.

298
00:22:04,900 --> 00:22:06,650
OK, so hopefully that was easy.

299
00:22:06,920 --> 00:22:10,490
It's the same as a CNN for images, just with one less dimension.
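The one D shape progression above can be traced with a small sketch that only tracks shapes, not actual values. The helper names are hypothetical. It assumes same-mode convolutions, a pool size and stride of two, T equal to 128 and, for illustration, a univariate series with D equal to one.

```python
def conv1d_same(shape, feature_maps):
    """Same-mode 1-D convolution: time length unchanged, channels become feature_maps."""
    time_steps, _ = shape
    return (time_steps, feature_maps)


def pool1d(shape, pool_size=2):
    """Pooling with stride equal to pool_size: time length is divided by pool_size."""
    time_steps, channels = shape
    return (time_steps // pool_size, channels)


def global_max_pool1d(shape):
    """Global max pooling: the time dimension disappears, leaving one value per channel."""
    _, channels = shape
    return (channels,)


shape = (128, 1)                    # T by D
shapes = [shape]
for feature_maps in (32, 64):
    shape = conv1d_same(shape, feature_maps)
    shapes.append(shape)
    shape = pool1d(shape)
    shapes.append(shape)
shape = conv1d_same(shape, 128)     # final convolution, no regular pooling after it
shapes.append(shape)
shape = global_max_pool1d(shape)
shapes.append(shape)

print(shapes)
# [(128, 1), (128, 32), (64, 32), (64, 64), (32, 64), (32, 128), (128,)]
```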

300
00:22:15,200 --> 00:22:20,460
All right, so let's summarize the basic architecture of a modern convolutional neural network.

301
00:22:21,320 --> 00:22:27,110
The first stage is a series of strided convolutions, or optionally convolution followed by pooling

302
00:22:27,110 --> 00:22:32,430
and repeating that pattern. In the interface between convolution layers and dense layers,

303
00:22:32,450 --> 00:22:38,810
we have a flatten, or optionally a global max pooling that reduces the image to just the max

304
00:22:38,810 --> 00:22:39,820
for each feature map.

305
00:22:41,910 --> 00:22:47,370
Finally, once we've flattened our data into a one dimensional vector, we have a regular feedforward neural

306
00:22:47,390 --> 00:22:50,670
network made up of a series of dense layers, as usual.

307
00:22:51,510 --> 00:22:57,690
And as you learned in the previous section, the activation and number of nodes for the final layer depends

308
00:22:57,690 --> 00:22:59,130
on the type of task being done.

309
00:22:59,790 --> 00:23:05,070
So if you're doing scalar regression, then you'll only have one output node with no activation function.

310
00:23:05,610 --> 00:23:07,620
If you're doing binary classification,

311
00:23:07,890 --> 00:23:09,970
you could have one output node with a sigmoid.

312
00:23:10,650 --> 00:23:13,440
However, for K equals two or more classes,

313
00:23:13,650 --> 00:23:16,980
you can have K output nodes with a softmax activation.