1
00:00:11,670 --> 00:00:17,070
Now that you understand how exactly a convolution layer works including the bias term and activation

2
00:00:17,070 --> 00:00:23,450
function, we can now consider the architecture of a convolutional neural network and why it's that way.

3
00:00:23,490 --> 00:00:29,520
So as a little bit of a history lesson, modern CNNs essentially all originated from the same model,

4
00:00:29,580 --> 00:00:36,210
LeNet. This is named after Yann LeCun, one of the original deep learning pioneers, along with Geoffrey

5
00:00:36,210 --> 00:00:38,280
Hinton and Yoshua Bengio.

6
00:00:38,610 --> 00:00:43,770
What we're going to do in this lecture is just start with what the architecture is and then go through

7
00:00:43,770 --> 00:00:47,690
it piece by piece so that you understand why it is the way it is.

8
00:00:52,510 --> 00:00:54,880
Okay so let's look at a typical CNN.

9
00:00:55,210 --> 00:00:57,900
A typical CNN has two stages.

10
00:00:58,000 --> 00:01:01,680
The first stage is a series of convolutional layers.

11
00:01:01,690 --> 00:01:06,910
Importantly, these convolutional layers are usually followed by pooling layers, which are another type

12
00:01:06,910 --> 00:01:09,760
of layer we'll discuss shortly.

13
00:01:09,880 --> 00:01:14,500
The second stage is a series of dense layers, also called fully connected layers.

14
00:01:14,500 --> 00:01:16,740
These are just a regular feedforward neural network,

15
00:01:16,750 --> 00:01:19,610
as you saw in the previous section.

16
00:01:19,780 --> 00:01:24,840
If you recall I said in the previous section that neural networks have this hierarchical structure.

17
00:01:25,000 --> 00:01:30,750
Each layer is always looking at features that were found in the previous layer so you can even think

18
00:01:30,750 --> 00:01:34,500
of the first stage as just another feature transformer.

19
00:01:34,560 --> 00:01:39,800
It's a feature transformer that specifically works on images and finds image features.

20
00:01:39,900 --> 00:01:44,850
Then, once it's found those image features, it passes them into a neural network, and then the neural

21
00:01:44,850 --> 00:01:48,990
network does what it does best: nonlinear classification or regression.

22
00:01:54,200 --> 00:01:57,580
So let's have a closer look at the convolutional stage.

23
00:01:57,590 --> 00:02:00,220
Now I just added a new layer without really explaining it.

24
00:02:00,260 --> 00:02:01,530
The pooling layer.

25
00:02:01,880 --> 00:02:05,050
Why would we want to do that? At a high level,

26
00:02:05,060 --> 00:02:09,820
pooling is a downsampling operation. What is downsampling?

27
00:02:10,040 --> 00:02:12,860
That's when you make a smaller image out of a bigger image.

28
00:02:13,640 --> 00:02:16,630
So imagine your image is 100 by 100.

29
00:02:17,000 --> 00:02:17,930
After pooling,

30
00:02:17,930 --> 00:02:21,190
if the pool size is two, then you would get a 50 by 50 image.

31
00:02:21,200 --> 00:02:23,520
That's downsampling by 2.

32
00:02:23,660 --> 00:02:28,520
Let's go over mechanically how this works and then we'll explain why it works and why we would want

33
00:02:28,520 --> 00:02:30,680
to include it in our neural network.

34
00:02:35,840 --> 00:02:41,870
So there are two kinds of pooling: max pooling and average pooling. Choosing which one to use is, as you

35
00:02:41,870 --> 00:02:44,790
probably guessed, a hyperparameter choice.

36
00:02:44,870 --> 00:02:49,570
Let's talk about max pooling first, since I think this one's a bit more common.

37
00:02:49,670 --> 00:02:52,820
The best way to see how it works is by example.

38
00:02:52,820 --> 00:02:58,160
So if we start with an image, let's say it's 4 by 4 and contains these numbers: 1, 2, 3,

39
00:02:58,160 --> 00:03:03,840
4, 5, 6, 7, 8, 16, 15, 14, 13, 12, 11, 10, 9. So just the numbers

40
00:03:03,860 --> 00:03:11,580
1 to 16, but organized non-sequentially. What max pooling will do is take each 2 by 2 square,

41
00:03:11,850 --> 00:03:19,160
of which there are four, and just return the maximum value in each of those squares. So the max of 1, 2,

42
00:03:19,160 --> 00:03:28,010
5, 6 is 6; the max of 3, 4, 7, and 8 is 8; the max of 16, 15, 12, 11 is 16; and the max

43
00:03:28,010 --> 00:03:35,580
of 14, 13, 10, 9 is 14. So the output image is 6, 8, 16, 14.
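The max pooling example just described can be reproduced with a small framework-free sketch. The function name `max_pool_2x2` is made up for illustration, and the code assumes the image has even height and width.

```python
def max_pool_2x2(image):
    """Non-overlapping max pooling with pool size 2 and stride 2."""
    rows, cols = len(image), len(image[0])
    return [
        [
            # Maximum over one 2 x 2 square whose top-left corner is (r, c).
            max(image[r][c], image[r][c + 1],
                image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


# The 4 x 4 image from the lecture: the numbers 1 to 16, non-sequentially.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [16, 15, 14, 13],
    [12, 11, 10, 9],
]

pooled = max_pool_2x2(image)
print(pooled)  # [[6, 8], [16, 14]]
```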

44
00:03:40,710 --> 00:03:45,740
As you might have guessed, average pooling is similar, but instead of taking the max we take the average.

45
00:03:46,410 --> 00:03:51,690
So for the same input image we have the following: the average of 1, 2, 5, and 6 is 3.5;

46
00:03:51,690 --> 00:03:59,340
the average of 3, 4, 7, and 8 is 5.5; the average of 16, 15, 12, 11 is 13.5;

47
00:03:59,340 --> 00:04:03,990
and the average of 14, 13, 10, 9 is 11.5.

48
00:04:04,200 --> 00:04:09,030
So the output image is 3.5, 5.5, 13.5, 11.5.
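The same idea works for average pooling. This is a minimal sketch in plain Python; `avg_pool_2x2` is a hypothetical helper name, and an even-sized image is assumed.

```python
def avg_pool_2x2(image):
    """Non-overlapping average pooling with pool size 2 and stride 2."""
    rows, cols = len(image), len(image[0])
    return [
        [
            # Mean of the four values in one 2 x 2 square.
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [16, 15, 14, 13],
    [12, 11, 10, 9],
]

averaged = avg_pool_2x2(image)
print(averaged)  # [[3.5, 5.5], [13.5, 11.5]]
```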

49
00:04:14,330 --> 00:04:15,880
Now that you know how pooling works,

50
00:04:15,890 --> 00:04:20,740
let's talk about why we should use it. The first advantage of pooling is practical.

51
00:04:21,020 --> 00:04:27,680
If we downsample the image, the image shrinks, and therefore we have less data to process and fewer multiplications

52
00:04:27,680 --> 00:04:33,020
to do, and of course that will speed things up. But the second and more important advantage has to do

53
00:04:33,020 --> 00:04:41,710
with image processing itself. Recall this idea of translational invariance. This is the idea that I

54
00:04:41,710 --> 00:04:44,120
don't care where in an image the feature occurred.

55
00:04:44,200 --> 00:04:47,230
I just care that it occurred. For example,

56
00:04:47,410 --> 00:04:54,050
both you and I can recognize this as an A. We also recognize this as an A.

57
00:04:54,050 --> 00:04:59,780
It doesn't matter where I put the A on the screen. So translational invariance is important from a

58
00:04:59,780 --> 00:05:03,170
biological point of view and how we as humans learn.

59
00:05:03,440 --> 00:05:07,740
We don't have to learn how to recognize A's at each point in our field of vision.

60
00:05:07,910 --> 00:05:10,470
Once we know what something looks like, we can recognize it

61
00:05:10,490 --> 00:05:17,650
no matter if it's up, down, left, or right.

62
00:05:17,670 --> 00:05:19,090
So what is max pooling doing?

63
00:05:19,680 --> 00:05:22,440
Well consider a small two by two box.

64
00:05:22,440 --> 00:05:26,160
Remember that convolution tells us whether or not a feature has been found.

65
00:05:26,160 --> 00:05:27,560
It's a pattern finder.

66
00:05:27,810 --> 00:05:32,970
So imagine that in this 2 by 2 box I have these flags: pattern found, not found, not found, and not

67
00:05:32,970 --> 00:05:34,140
found.

68
00:05:34,140 --> 00:05:38,260
So it's a high number and a small number and a small number and a small number.

69
00:05:38,340 --> 00:05:39,550
The pattern being found

70
00:05:39,570 --> 00:05:41,120
returns a higher number.

71
00:05:41,190 --> 00:05:44,870
So when I take the max it says the pattern has been found.

72
00:05:44,910 --> 00:05:50,650
That's good because it tells me that the pattern has been found without caring where it was found.

73
00:05:50,730 --> 00:05:55,340
Average pooling is the same idea, but I think max pooling is a little more intuitive in this regard.

74
00:06:00,560 --> 00:06:06,050
As a side note keep in mind that in this lecture I've assumed that we're using a pool size of two and

75
00:06:06,050 --> 00:06:13,480
that at each pooling layer we select boxes which are two spaces apart. Now, pooling layers have some flexibility

76
00:06:13,480 --> 00:06:14,940
in this regard.

77
00:06:15,010 --> 00:06:19,140
First it's not necessary for the pool size to be the same in both directions.

78
00:06:19,150 --> 00:06:25,660
For example, you could look at a 2 by 3 box or a 3 by 2 box, although generally that's unconventional.

79
00:06:25,660 --> 00:06:31,510
Second, it's possible for the boxes to overlap. The hyperparameter that controls this is called

80
00:06:31,510 --> 00:06:32,810
the stride.

81
00:06:32,920 --> 00:06:40,440
It tells us how many spaces to move the pooling box for each output. Previously we looked at the situation

82
00:06:40,470 --> 00:06:44,720
where we had a pool size of 2 by 2 and the stride was also two.

83
00:06:44,760 --> 00:06:46,860
This is a pretty common scenario.

84
00:06:47,130 --> 00:06:50,910
If you had a stride of one, then our 2 by 2 boxes would overlap.

85
00:06:50,940 --> 00:06:52,320
This is not really too common.

86
00:06:53,040 --> 00:06:59,030
So while you have these options which I want to make you aware of they are usually not changed too often.

87
00:06:59,160 --> 00:07:05,190
So what you see in this lecture is what we do most often: we use a pool size of two in each direction

88
00:07:05,640 --> 00:07:14,500
and we have a stride of two so that we don't have any overlapping boxes.
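To make the pool size and stride options concrete, here is a rough sketch of max pooling with both as parameters. The function `max_pool` and its argument names are illustrative assumptions, not any library's API; windows that would run past the edge of the image are simply dropped.

```python
def max_pool(image, pool_h=2, pool_w=2, stride=2):
    """Max pooling with a configurable pool size and stride (no padding)."""
    rows, cols = len(image), len(image[0])
    out = []
    # Slide the top-left corner of the box by `stride` in each direction.
    for r in range(0, rows - pool_h + 1, stride):
        out_row = []
        for c in range(0, cols - pool_w + 1, stride):
            window = [image[r + i][c + j]
                      for i in range(pool_h) for j in range(pool_w)]
            out_row.append(max(window))
        out.append(out_row)
    return out


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [16, 15, 14, 13],
    [12, 11, 10, 9],
]

no_overlap = max_pool(image, stride=2)  # the common case: 2 x 2 boxes, stride 2
overlap = max_pool(image, stride=1)     # stride 1: neighbouring boxes overlap

print(no_overlap)  # [[6, 8], [16, 14]]
print(overlap)     # [[6, 7, 8], [16, 15, 14], [16, 15, 14]]
```

With a stride of one the 4 by 4 image only shrinks to 3 by 3, which is one reason the non-overlapping setting is the usual choice when the goal is to halve the image.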

89
00:07:14,590 --> 00:07:19,630
So now that you know what pooling does and why it works, let's discuss why the convolutional part of the

90
00:07:19,630 --> 00:07:22,600
CNN is organized the way it is.

91
00:07:22,660 --> 00:07:27,460
Why should we have a sequential pattern consisting of a conv layer followed by a pooling layer, followed

92
00:07:27,460 --> 00:07:31,120
by a conv layer, followed by another pooling layer, and so forth?

93
00:07:31,120 --> 00:07:36,910
If you recall, I told you earlier that one cool thing researchers discovered is that CNNs are able to

94
00:07:36,910 --> 00:07:41,770
learn features hierarchically. So the initial layers end up learning basic strokes.

95
00:07:41,770 --> 00:07:46,900
The next layer ends up learning individual facial features such as nose, eyes, and lips.

96
00:07:46,900 --> 00:07:49,030
The next layer ends up learning whole faces.

97
00:07:54,150 --> 00:08:00,450
The key point to keep in mind is that after each pooling the image shrinks, but the filter sizes generally

98
00:08:00,450 --> 00:08:07,890
stay around the same. Common filter sizes are 3 by 3, 5 by 5, and 7 by 7.

99
00:08:07,890 --> 00:08:13,290
Generally speaking, they are much smaller than the input image. But what happens as the image shrinks

100
00:08:13,410 --> 00:08:14,940
as it passes through each layer?

101
00:08:15,720 --> 00:08:22,170
Well, let's suppose we start with a 32 by 32 image and we have four convolution and pooling layers.

102
00:08:22,200 --> 00:08:28,010
Let's assume we do convolution in same mode, so it doesn't change the size of the image, and let's assume

103
00:08:28,040 --> 00:08:30,700
all our convolution filters are size 3 by 3,

104
00:08:31,010 --> 00:08:32,800
although this doesn't come into play just yet.

105
00:08:34,250 --> 00:08:40,260
In this case, after the first convolution and pooling, the image shrinks down to 16 by 16.

106
00:08:40,610 --> 00:08:45,220
After the second convolution and pooling, the image shrinks down to 8 by 8.

107
00:08:45,560 --> 00:08:49,490
After the third convolution and pooling, the image shrinks down to 4 by 4.

108
00:08:49,490 --> 00:08:56,580
And finally, after a fourth convolution and pooling, the image shrinks down to 2 by 2. What happens

109
00:08:56,580 --> 00:08:58,140
after this is still a mystery.

110
00:08:58,140 --> 00:09:01,050
So let's just leave that for now and focus on the image shrinking.

111
00:09:06,210 --> 00:09:12,840
As you can see, if you take a 3 by 3 filter and place it over a 32 by 32 image, it occupies only

112
00:09:12,840 --> 00:09:15,260
a very small portion of the image.

113
00:09:15,270 --> 00:09:22,260
In other words, it's looking for very tiny patterns to match, like, for example, simple edges and strokes.

114
00:09:22,260 --> 00:09:26,020
Then the image becomes size 16 by 16.

115
00:09:26,040 --> 00:09:32,040
Now all of a sudden the 3 by 3 filter takes up four times as much space in the image. It takes

116
00:09:32,040 --> 00:09:36,110
up twice as much space in each direction, and two times two is four.

117
00:09:36,120 --> 00:09:42,560
In any case it occupies a much larger portion of the image relative to the first filter.

118
00:09:42,640 --> 00:09:50,840
So now it's looking for patterns which take up more space in the image, patterns like nose, eyes, and lips.

119
00:09:50,890 --> 00:09:56,830
Next we get to an 8 by 8 image. A 3 by 3 filter now covers 9 of the 64 pixels, about a seventh of the image.

120
00:09:58,630 --> 00:10:00,060
After that convolution and pooling,

121
00:10:00,060 --> 00:10:10,620
we get a 4 by 4 image. Now a 3 by 3 filter takes up most of the image.
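The shrinking image sizes and the growing share of the image covered by one 3 by 3 filter can be tabulated with a short sketch. It assumes, as in the lecture, a 32 by 32 input, same-mode convolution, and pooling with size 2 and stride 2; the variable names are only illustrative.

```python
filter_size = 3
size = 32  # the input image is 32 x 32
coverage = []  # (image size seen by each conv layer, fraction one filter covers)

for layer in range(4):
    # Same-mode convolution keeps the size, so the filter sees a size x size image.
    fraction = filter_size ** 2 / size ** 2
    coverage.append((size, fraction))
    print(f"conv {layer + 1}: image {size} x {size}, filter covers {fraction:.1%}")
    size //= 2  # pooling with size 2 and stride 2 halves each dimension

final_size = size
print(f"after the fourth pooling: {final_size} x {final_size}")
```

Each halving of the image multiplies the covered fraction by four, which is the "two times two" argument made above.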

122
00:10:10,750 --> 00:10:12,280
So what have we learned?

123
00:10:12,460 --> 00:10:17,530
We've learned that in a convolutional neural network, convolution and pooling result in two things

124
00:10:17,530 --> 00:10:22,200
which are corollaries of each other. First, the input image shrinks.

125
00:10:22,300 --> 00:10:28,010
Second, since filters generally stay the same size, the filters look for larger and larger patterns in

126
00:10:28,030 --> 00:10:29,170
the image.

127
00:10:29,170 --> 00:10:33,640
This is what leads to the CNN learning hierarchical features of the input.

128
00:10:33,790 --> 00:10:38,410
It first learns things that are relatively small compared to the total size of the image.

129
00:10:38,410 --> 00:10:41,560
Then it learns to find bigger patterns and bigger patterns and so forth.

130
00:10:46,620 --> 00:10:51,930
The last concept regarding the convolutional layers we have to think about is: are we not losing information

131
00:10:51,930 --> 00:10:53,040
at every layer

132
00:10:53,040 --> 00:10:57,000
if we shrink the image? And the answer is yes, that's true.

133
00:10:57,000 --> 00:11:00,720
We are losing spatial information because the image keeps shrinking.

134
00:11:00,720 --> 00:11:05,700
We're saying we don't care where in the image the feature was found just that it was found somewhere

135
00:11:05,820 --> 00:11:06,930
in the image.

136
00:11:06,930 --> 00:11:09,600
So we're losing information about where it was.

137
00:11:10,350 --> 00:11:15,120
But there's another component of the convolution process we haven't considered yet which is the number

138
00:11:15,120 --> 00:11:16,740
of feature maps.

139
00:11:16,740 --> 00:11:22,320
Again we have a general pattern to follow which is that while the size of the image generally decreases

140
00:11:22,650 --> 00:11:25,770
the number of feature maps generally increases.

141
00:11:25,800 --> 00:11:28,230
In other words I don't care where the feature is found.

142
00:11:28,260 --> 00:11:33,610
I just care that it was found. But I do care about the different possible features

143
00:11:33,610 --> 00:11:35,490
I could find.

144
00:11:35,660 --> 00:11:40,790
So while you're losing spatial information you're gaining information in terms of what features were

145
00:11:40,790 --> 00:11:41,810
found in the image.

146
00:11:46,910 --> 00:11:50,150
One last note on this convolution and pooling pattern.

147
00:11:50,150 --> 00:11:55,140
One of the hardest things to grasp for new students of deep learning is that we have so many choices.

148
00:11:55,160 --> 00:11:58,510
We call these hyperparameters. Previously, for ANNs,

149
00:11:58,520 --> 00:12:02,710
we looked at the learning rate, the number of hidden layers, and the number of hidden units.

150
00:12:02,900 --> 00:12:07,570
But now with CNNs we have a ton more choices. With convolution,

151
00:12:07,580 --> 00:12:09,410
we have to choose the filter size.

152
00:12:09,410 --> 00:12:13,960
We have to choose the number of feature maps, we have to choose the pool size, and so forth.

153
00:12:14,330 --> 00:12:19,320
But luckily I think this is one instance where there is a general pattern that most people follow.

154
00:12:19,370 --> 00:12:25,800
So you're not going in completely blind. Stick with these guidelines and you can be sure that you're

155
00:12:25,800 --> 00:12:32,550
at least doing things per accepted convention. That is, choose small filters relative to the image, like

156
00:12:32,580 --> 00:12:35,950
3 by 3, 5 by 5, or 7 by 7.

157
00:12:35,970 --> 00:12:40,420
Another guideline is to repeat the pattern convolution followed by pooling.

158
00:12:40,620 --> 00:12:45,030
Another guideline is to increase the number of feature maps at each convolution.

159
00:12:45,030 --> 00:12:50,950
So you might start with 32 and then 64 then 128 and maybe 128 again.

160
00:12:51,030 --> 00:12:56,340
The best way to learn about how other people are doing it is to read lots of papers and check out which

161
00:12:56,340 --> 00:12:58,590
hyper parameters they selected.

162
00:12:58,590 --> 00:13:02,940
Generally you'll find that this is the pattern followed by most convolutional networks.
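As a rough illustration of these guidelines, the sketch below tracks the output shape after each convolution and pooling block. It assumes a 32 by 32 input, same-mode convolution, pooling that halves each side, and the 32, 64, 128, 128 feature map counts mentioned above; none of these numbers is a requirement.

```python
# Assumed setup: 32 x 32 input, each conv + pool block halves height and width,
# and the feature map counts follow the 32, 64, 128, 128 example.
size = 32
feature_maps = [32, 64, 128, 128]

shapes = []
for channels in feature_maps:
    size //= 2  # same-mode convolution keeps the size, pooling halves it
    shapes.append((size, size, channels))

for height, width, channels in shapes:
    total = height * width * channels
    print(f"{height} x {width} x {channels} = {total} values")
```

The spatial size drops while the number of feature maps grows, so the network trades "where" information for "what" information.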

163
00:13:08,060 --> 00:13:08,610
Next,

164
00:13:08,630 --> 00:13:13,700
I want to mention something a little funny: in fact, pooling is sometimes not what we are going to

165
00:13:13,700 --> 00:13:16,090
end up using in our CNN.

166
00:13:16,160 --> 00:13:20,870
Researchers have found that we can do something similar but which is more efficient and sometimes works

167
00:13:20,870 --> 00:13:22,150
just as well.

168
00:13:22,340 --> 00:13:28,220
In particular recall that for pooling we have this concept of stride that tells us how far apart each

169
00:13:28,220 --> 00:13:31,070
box should be when we take the max.

170
00:13:31,070 --> 00:13:38,450
In fact, convolution has a stride option as well. So in this animation you can see how convolution works

171
00:13:38,450 --> 00:13:39,680
with stride.

172
00:13:39,830 --> 00:13:45,750
Remember, convolution just means multiply and add with a sliding window. When we have convolution with a

173
00:13:45,750 --> 00:13:46,880
stride parameter,

174
00:13:47,010 --> 00:13:50,700
the stride tells us how far apart each window should be.

175
00:13:50,970 --> 00:13:56,460
And of course our intuition tells us that if we use a stride of two, then the output image length will

176
00:13:56,460 --> 00:13:59,640
be half of what the input was.

177
00:13:59,650 --> 00:14:05,350
In other words, we get the same reduction in size by using a strided convolution instead of convolution

178
00:14:05,350 --> 00:14:06,190
followed by pooling.
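Strided convolution can be sketched in one dimension to keep the code short. This is a simplified illustration, not the lecture's implementation: `conv1d` is a made-up helper, it does the deep-learning style "multiply and add" without flipping the kernel, and a zero padding of 1 is assumed so that a length-3 filter preserves the length at stride 1.

```python
def conv1d(signal, kernel, stride=1, pad=0):
    """Multiply-and-add with a sliding window (no kernel flip), zero padded."""
    padded = [0] * pad + list(signal) + [0] * pad
    k = len(kernel)
    out = []
    # Move the window `stride` positions at a time.
    for start in range(0, len(padded) - k + 1, stride):
        out.append(sum(padded[start + i] * kernel[i] for i in range(k)))
    return out


signal = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 0, -1]

full = conv1d(signal, kernel, stride=1, pad=1)     # same length as the input
strided = conv1d(signal, kernel, stride=2, pad=1)  # half the length

print(len(full), len(strided))  # 8 4
```

With a stride of two the filter is evaluated at every other position, so the strided output is just every second element of the stride-one output.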

179
00:14:11,300 --> 00:14:16,820
The intuition behind why this works is that, if you consider what an image looks like, an image is just

180
00:14:16,820 --> 00:14:18,780
large patches of stuff.

181
00:14:18,920 --> 00:14:24,980
What I mean by that is each pixel is likely to have a similar value to its neighboring pixels.

182
00:14:24,980 --> 00:14:27,440
For example suppose I'm looking at a red car.

183
00:14:27,620 --> 00:14:29,450
I find a red pixel.

184
00:14:29,450 --> 00:14:33,280
Now ask yourself: is the pixel above that also red?

185
00:14:33,290 --> 00:14:34,080
Yes.

186
00:14:34,100 --> 00:14:36,710
Is the pixel below that also probably red?

187
00:14:36,710 --> 00:14:37,690
Yes.

188
00:14:38,030 --> 00:14:41,210
The same can be said of the pixel to the left and to the right.

189
00:14:41,270 --> 00:14:47,430
That's just the nature of images: pixels near each other are often very highly correlated.

190
00:14:47,490 --> 00:14:52,320
You won't have this random jumping around of pixel values because something like that wouldn't look

191
00:14:52,320 --> 00:14:53,550
like an image at all.

192
00:14:53,580 --> 00:14:59,850
That would probably just be noise. And so using a stride of two simply means having your filter skip

193
00:14:59,850 --> 00:15:01,380
every other pixel.

194
00:15:01,380 --> 00:15:06,750
It's saying we don't care what those skipped pixels are because they probably are very close to the

195
00:15:06,750 --> 00:15:08,310
pixels we were already looking at.

196
00:15:13,460 --> 00:15:19,310
So to summarize, the convolutional part of our convolutional neural network now looks like this.

197
00:15:19,310 --> 00:15:25,160
We can have a series of strided convolutions, all with approximately the same filter size, with increasing

198
00:15:25,160 --> 00:15:27,200
feature maps at each layer.

199
00:15:27,200 --> 00:15:32,960
Or we can have a series of convolutions followed by pooling, again all with approximately the same filter

200
00:15:32,960 --> 00:15:36,380
size and with increasing numbers of feature maps at each layer.

201
00:15:41,450 --> 00:15:46,030
The last part of this lecture will focus on the second half of the CNN, which is the feedforward neural

202
00:15:46,040 --> 00:15:48,100
network part.

203
00:15:48,150 --> 00:15:53,580
Now there's not much to discuss here, except that we have to consider that the shape of an image is not appropriate

204
00:15:53,610 --> 00:15:56,340
for a dense feedforward neural network.

205
00:15:56,340 --> 00:16:01,110
Remember that when an image comes out of a convolution, it's going to be three-dimensional: height by

206
00:16:01,110 --> 00:16:07,110
width by number of feature maps. But a feedforward neural network takes in a feature vector, a one-dimensional

207
00:16:07,110 --> 00:16:08,660
object.

208
00:16:08,670 --> 00:16:12,060
Luckily we've already discussed how this works.

209
00:16:12,060 --> 00:16:16,950
We can turn a 3D object into a 1D object using the view function in PyTorch.

210
00:16:23,140 --> 00:16:28,420
One alternative to the flatten layer, which I really like, is the global max pooling layer.

211
00:16:28,450 --> 00:16:33,770
This is a little different from the regular kind of pooling we discussed earlier. Consider the question:

212
00:16:33,790 --> 00:16:36,900
What if we have different sized images?

213
00:16:37,050 --> 00:16:39,320
The largest source of images is the internet.

214
00:16:40,370 --> 00:16:46,550
Generally speaking images we find on the Internet are not all the same size so it makes sense to wonder

215
00:16:46,730 --> 00:16:52,750
is it possible to build a convolutional neural network that is capable of handling different image sizes?

216
00:16:52,790 --> 00:16:54,240
And the answer is yes,

217
00:16:54,350 --> 00:16:55,940
using a global max pooling layer.

218
00:17:01,060 --> 00:17:04,940
Let's consider first why a simple flatten layer would not work.

219
00:17:04,960 --> 00:17:13,330
Suppose again we have an input image of size 32 by 32 and four convolutions. Assume again we use convolution

220
00:17:13,330 --> 00:17:14,110
in same mode

221
00:17:14,110 --> 00:17:16,130
and we have a stride of two.

222
00:17:16,150 --> 00:17:19,570
So after the first convolution we get a 16 by 16.

223
00:17:19,600 --> 00:17:21,910
After the second convolution we get an 8 by 8.

224
00:17:22,330 --> 00:17:24,790
After the third convolution we get a 4 by 4.

225
00:17:24,910 --> 00:17:28,830
And after the fourth convolution we get a 2 by 2.

226
00:17:29,170 --> 00:17:33,370
But now let's say we have an input image of size 64 by 64.

227
00:17:33,370 --> 00:17:38,430
In this case the output after four convolutions would be four by four.

228
00:17:38,440 --> 00:17:44,710
Let's say for simplicity's sake the number of output feature maps is 100. Then for the 32 by

229
00:17:44,710 --> 00:17:46,010
32 image,

230
00:17:46,030 --> 00:17:51,460
doing a flatten operation would yield an output of size 2 by 2 by 100, or 2 times 2 times

231
00:17:51,470 --> 00:17:56,080
100, which is 400. For the 64 by 64 image,

232
00:17:56,080 --> 00:18:02,750
doing a flatten operation would yield an output of size 4 times 4 times 100, which is 1600.

233
00:18:02,860 --> 00:18:13,380
And of course a feedforward neural network is not capable of handling vectors of different sizes.
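The arithmetic behind this mismatch can be checked with a few lines. The helper `flattened_size` is hypothetical, and it assumes four same-mode convolutions with a stride of two and 100 output feature maps, as in the example above.

```python
def flattened_size(input_size, num_convs=4, stride=2, feature_maps=100):
    """Length of the flattened vector for a square input image."""
    size = input_size
    for _ in range(num_convs):
        size //= stride  # each strided convolution halves height and width
    return size * size * feature_maps


small = flattened_size(32)  # 2 * 2 * 100
large = flattened_size(64)  # 4 * 4 * 100

print(small, large)  # 400 1600
```

A dense layer built for 400 inputs has a fixed weight matrix, so it cannot accept the 1600-element vector produced by the larger image.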

234
00:18:13,410 --> 00:18:16,680
Now let's consider how global max pooling works.

235
00:18:16,710 --> 00:18:24,570
The idea is simple: if our input image is H by W by C, then the output is 1 by 1 by C, or equivalently

236
00:18:24,630 --> 00:18:26,710
just a vector of size C.

237
00:18:26,760 --> 00:18:33,530
In other words it takes a max over the entire image along the spatial dimensions over each feature map.

238
00:18:33,810 --> 00:18:40,230
Since we have C feature maps, we end up with what is essentially a one-dimensional vector of size C, since

239
00:18:40,230 --> 00:18:41,660
the dimensions of size 1 are redundant.

240
00:18:43,610 --> 00:18:48,890
This follows the same general idea of pooling where we're saying I don't care where in the image the

241
00:18:48,890 --> 00:18:52,520
feature was found just that it was found somewhere.

242
00:18:52,520 --> 00:18:57,590
Global max pooling is the extreme of this, saying that the feature can be found anywhere in the entire

243
00:18:57,590 --> 00:19:04,370
image, across the entire height and width. Since the output is always a vector of size C, it does not depend

244
00:19:04,400 --> 00:19:11,310
on the size of the image, meaning that it allows the network to handle images of any size.
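A minimal sketch of global max pooling in plain Python follows. It stores the data channel-first, as a list of C feature maps that are each H by W, which is an assumption made here for convenience; `global_max_pool` is an illustrative name.

```python
def global_max_pool(feature_maps):
    """feature_maps[c][h][w] -> list of C maxima, one per feature map."""
    return [max(max(row) for row in fmap) for fmap in feature_maps]


# Two feature maps of size 2 x 2.
small = [
    [[1, 5], [3, 2]],
    [[0, -1], [4, 2]],
]

# Two feature maps of size 3 x 4: a different image size, same number of maps.
large = [
    [[1, 2, 3, 4], [9, 0, 1, 2], [3, 3, 3, 3]],
    [[7, 7, 7, 7], [1, 8, 1, 1], [0, 0, 0, 0]],
]

small_vec = global_max_pool(small)
large_vec = global_max_pool(large)

print(small_vec)  # [5, 4]
print(large_vec)  # [9, 8]
```

Both inputs produce a vector of length C = 2, regardless of their height and width, which is what lets the dense layers that follow have a fixed input size.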

245
00:19:11,540 --> 00:19:16,550
As a side note, we also have global average pooling, which is just the global version of regular average

246
00:19:16,550 --> 00:19:22,090
pooling.

247
00:19:22,180 --> 00:19:27,430
The only exception to this is if your images are too small in which case you'll just get an error when

248
00:19:27,430 --> 00:19:28,480
you try to run your code.

249
00:19:29,700 --> 00:19:34,830
For example if your image starts out as a two by two then it's not possible to reduce the dimensions

250
00:19:34,830 --> 00:19:37,140
by half four times.

251
00:19:37,140 --> 00:19:47,040
You'll notice this happens if you do something like add too many convolutions to your neural network.

252
00:19:47,090 --> 00:19:53,240
All right, so let's summarize the basic architecture of a modern convolutional neural network. The

253
00:19:53,240 --> 00:19:59,540
first stage is a series of strided convolutions, or optionally convolution followed by pooling, repeating

254
00:19:59,540 --> 00:20:04,250
that pattern. In the interface between the convolutional layers and the dense layers,

255
00:20:04,250 --> 00:20:05,530
we have a flatten.

256
00:20:05,870 --> 00:20:11,180
We also optionally have a global max pooling that reduces the image to just the max for each feature

257
00:20:11,180 --> 00:20:11,600
map.

258
00:20:13,730 --> 00:20:19,170
Finally, once we've flattened our data into a one-dimensional vector, we have a regular feedforward neural

259
00:20:19,170 --> 00:20:25,850
network made up of a series of dense layers, as usual. And as you learned in the previous section, the

260
00:20:25,850 --> 00:20:30,950
activation and number of nodes for the final layer depend on the type of task being done.

261
00:20:31,640 --> 00:20:37,350
So if you're doing scalar regression then you'll only have one output node with no activation function.

262
00:20:37,460 --> 00:20:42,380
If you're doing binary classification you could have one output node with a sigmoid.

263
00:20:42,500 --> 00:20:48,680
However, for K equals two or more classes, you can have K output nodes with a softmax activation.
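The two output activations mentioned here can be written out directly. This is a small sketch using only the standard library; the input values are made-up scores for illustration, not outputs of any real network.

```python
import math


def sigmoid(z):
    """Squash one score into (0, 1): used with one output node for binary classification."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn K scores into K probabilities that sum to 1: used with K output nodes."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


binary_prob = sigmoid(0.0)              # a score of 0 maps to probability 0.5
class_probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for K = 3 classes

print(binary_prob)
print(class_probs)
```

For scalar regression no such function is applied, and the single output node's value is used as is.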