1
00:00:00,210 --> 00:00:02,400
We are now ready to build our model.

2
00:00:02,400 --> 00:00:06,660
And up to now, neural networks have been performing quite well.

3
00:00:06,780 --> 00:00:15,690
We saw previously that if we have three neurons, that's one, two, three, in the input, and in the output

4
00:00:15,690 --> 00:00:24,630
we have three neurons too, then there'll be nine different connections here and hence nine different

5
00:00:24,630 --> 00:00:26,520
weights plus the bias.

6
00:00:27,240 --> 00:00:29,580
But for this example, let's consider only the weights.

7
00:00:29,580 --> 00:00:34,830
So we have nine different parameters for three inputs, three outputs.

8
00:00:35,640 --> 00:00:41,000
And if here we have five, considering only the weights, we would have three times five.

9
00:00:41,010 --> 00:00:44,160
That's 15 different parameters.

10
00:00:44,370 --> 00:00:54,690
Now, if we're dealing with an image like this one, so let's consider this image, which is 224 by

11
00:00:54,720 --> 00:01:03,310
224 by three image, three for the three different red, green and blue channels.

12
00:01:03,330 --> 00:01:06,180
So we have this input image now.

13
00:01:06,180 --> 00:01:15,030
So unlike previous examples where we would have features or specifically in this case input features

14
00:01:15,030 --> 00:01:22,230
which do not contain as many elements as this, we now have this case where if we want to count the number

15
00:01:22,230 --> 00:01:33,550
of features or the number of pixels, we would have 150,528 different input features to take into consideration.

16
00:01:33,570 --> 00:01:44,700
Hence, instead of having a total number of three right here, we are going to have 150,528 different

17
00:01:44,700 --> 00:01:45,360
values.

18
00:01:45,720 --> 00:01:52,500
And if we want to do the same computation which got us this number of parameters, we would have 150,000

19
00:01:52,500 --> 00:01:53,540
times three.

20
00:01:53,550 --> 00:01:58,500
That's approximately 450,000 different parameters.

21
00:01:59,340 --> 00:02:07,500
Now, what if we modify this number of neurons right here and take a thousand neurons in this output

22
00:02:07,770 --> 00:02:08,730
right here?

23
00:02:08,730 --> 00:02:18,510
We'll see that we'll move from 150,000, or rather from 450,000, to 150 million different parameters.

24
00:02:19,400 --> 00:02:24,920
Where each and every parameter has to be trained and optimized.
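
The parameter counts quoted above can be checked with a few lines of Python. This is only a minimal sketch: `dense_weight_count` is a hypothetical helper name, and it counts weights only, ignoring biases, as the example above does.

```python
def dense_weight_count(n_inputs: int, n_outputs: int) -> int:
    """Number of weights in a fully connected layer, ignoring biases."""
    return n_inputs * n_outputs


# A 224 x 224 image with 3 colour channels, flattened into input features.
n_pixels = 224 * 224 * 3

print(n_pixels)                              # 150528
print(dense_weight_count(3, 3))              # 9
print(dense_weight_count(3, 5))              # 15
print(dense_weight_count(n_pixels, 3))       # 451584
print(dense_weight_count(n_pixels, 1000))    # 150528000
```

The exact figures are 451,584 and 150,528,000, which are the "approximately 450,000" and "150 million" mentioned above.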

25
00:02:25,370 --> 00:02:34,370
It becomes clear that deep neural networks like this, or better still, stacks of fully connected

26
00:02:34,370 --> 00:02:36,980
layers aren't scalable.

27
00:02:37,400 --> 00:02:44,330
Since when we increase the number of features, the total number of parameters also increases considerably.

28
00:02:44,360 --> 00:02:53,600
Hence, we need to build a type of layer, unlike this one where each neuron isn't connected to all

29
00:02:53,600 --> 00:03:00,500
the previous neurons, and this layer happens to be the convolutional layer.

30
00:03:00,590 --> 00:03:07,150
In order to better visualize this, we'll use this demo platform from Ryerson University.

31
00:03:07,160 --> 00:03:15,230
So here we'll put in a figure, let's say a four, and then we'll see exactly how these convnets work.

32
00:03:15,230 --> 00:03:19,250
So we have this and then we have this input right here.

33
00:03:19,460 --> 00:03:26,170
So we get this input, we have some weights and then these output features.

34
00:03:26,180 --> 00:03:31,730
Note how, to obtain this particular pixel right here,

35
00:03:31,970 --> 00:03:37,520
only a few of these inputs are used.

36
00:03:37,520 --> 00:03:41,670
And so here we call this the receptive field.

37
00:03:41,690 --> 00:03:45,110
As you could see, we have the receptive field right here.

38
00:03:46,040 --> 00:03:51,740
And if we take this other example, you see its own receptive field. You see, to get this value, only these

39
00:03:51,740 --> 00:03:57,920
four values you see below are used: this one, this, this and this.

40
00:03:57,950 --> 00:04:00,240
So you could see here, let's scroll back.

41
00:04:00,270 --> 00:04:00,770
Okay.

42
00:04:00,770 --> 00:04:08,900
So only those four values have a role to play when it comes to giving us the corresponding value we

43
00:04:08,900 --> 00:04:09,800
have here.

44
00:04:10,250 --> 00:04:17,030
And so unlike with a dense layer where to obtain this value, we needed to link this to each and every

45
00:04:17,030 --> 00:04:18,060
previous neuron.

46
00:04:18,080 --> 00:04:27,530
Now, only the neurons in this neuron's receptive field play a role in getting this value.

47
00:04:27,600 --> 00:04:34,550
Another great tool for better understanding the convolutional layer is this CNN Explainer.

48
00:04:34,550 --> 00:04:37,740
Actually, CNN stands for Convolutional Neural Networks.

49
00:04:37,760 --> 00:04:45,630
This was created by Jay Wang, Robert Turko, Omar Shaikh, Haekyu Park, Nilaksh Das, Fred Hohman, Minsuk Kahng and Polo Chau.

50
00:04:45,650 --> 00:04:47,970
We are really grateful for this tool.

51
00:04:47,990 --> 00:04:52,160
Before getting back to the explainer, let's take this example right here.

52
00:04:52,160 --> 00:04:56,530
We have a 4x4 image, so we have 16 different pixels.

53
00:04:56,540 --> 00:05:03,260
If we flatten out those pixels, that is, if we lay those pixels out like this, such that we have those inputs

54
00:05:03,260 --> 00:05:11,210
and then in the output we have four different neurons, then here we would have 16 by four connections

55
00:05:11,210 --> 00:05:17,030
that is, we will have 64 different parameters, excluding the bias.

56
00:05:17,180 --> 00:05:27,170
But with the conv layer, we could go from this 4x4 to just a two by two. So we could go from this

57
00:05:27,170 --> 00:05:29,660
4x4 to this two by two

58
00:05:30,460 --> 00:05:34,420
with just nine parameters.

59
00:05:34,420 --> 00:05:39,870
So we'll have what we call a kernel here or filter.

60
00:05:39,880 --> 00:05:45,640
So we have this filter which is three by three, which actually corresponds to our weights, which we've

61
00:05:45,640 --> 00:05:46,570
seen already.

62
00:05:46,570 --> 00:05:56,170
This kernel here, with kernel size three, will produce this output of two by two, which when we flatten

63
00:05:56,170 --> 00:05:59,450
out, can give us this output right here.

64
00:05:59,470 --> 00:06:04,450
And so instead of working with 64 parameters, we're working with nine parameters.
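
The 64 versus 9 comparison can be written out as a small calculation. This is a sketch of the arithmetic only, using the sizes from the example above; the variable names are illustrative.

```python
image_size = 4        # 4 x 4 input image, 16 pixels
kernel_size = 3       # 3 x 3 kernel (filter)
dense_outputs = 4     # four output neurons in the dense version

# Dense layer: every flattened pixel is connected to every output neuron.
dense_weights = (image_size * image_size) * dense_outputs

# Conv layer: one shared kernel is reused at every position of the image.
conv_weights = kernel_size * kernel_size

# The kernel fits in (4 - 3 + 1) positions along each axis.
output_size = image_size - kernel_size + 1

print(dense_weights)   # 64
print(conv_weights)    # 9
print(output_size)     # 2, a 2 x 2 output that flattens to four values
```

Both layers produce four output values here, but the conv layer shares the same nine weights across all positions.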

65
00:06:04,450 --> 00:06:11,320
If we want to replicate that same example in the CNN Explainer, here is what we get: we have this input

66
00:06:11,320 --> 00:06:14,290
right here which produces this output.

67
00:06:14,320 --> 00:06:20,680
Now notice how we specify that the kernel size equals three, and because the kernel size equals three,

68
00:06:20,680 --> 00:06:22,660
we are able to get this output.

69
00:06:22,660 --> 00:06:25,020
But how is this output obtained?

70
00:06:25,280 --> 00:06:32,140
You'll see that at this top left corner we are going to fit our kernel, which is of size three by

71
00:06:32,140 --> 00:06:32,620
three.

72
00:06:32,620 --> 00:06:40,690
So we put in our kernel right here and then we take each and every value of our kernel and multiply

73
00:06:40,690 --> 00:06:45,670
it with the corresponding value in the input to obtain the first value.

74
00:06:45,700 --> 00:06:49,150
You see how we pass this kernel on the input.

75
00:06:49,150 --> 00:06:52,000
So at this top left position with the kernel.

76
00:06:52,030 --> 00:06:58,300
Notice how we have a three by three kernel which is passed over this input, and we have the output

77
00:06:58,300 --> 00:06:59,070
right here.

78
00:06:59,080 --> 00:07:03,370
Then, to get this next output,

79
00:07:03,370 --> 00:07:09,910
you see how the kernel is passed over this next part of the input.

80
00:07:09,910 --> 00:07:14,920
And the way we get this next part is by simply sliding through the image.

81
00:07:14,920 --> 00:07:16,390
You see how we moved from this.

82
00:07:16,390 --> 00:07:18,610
We slide that through the image and we got this.

83
00:07:18,610 --> 00:07:19,960
We got this next output.

84
00:07:19,960 --> 00:07:25,210
And since we've gotten to the end, we move to the next position, which is this.

85
00:07:25,210 --> 00:07:29,410
We get this next value right here.

86
00:07:29,410 --> 00:07:33,760
And then from here we slide again to this end and we finally get this value.

87
00:07:33,760 --> 00:07:41,140
So that's how we get all these outputs from the inputs and the kernel, or the filter.
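
The sliding procedure just described can be sketched in plain Python. This is a minimal illustration, not how a deep learning library implements it: `conv2d_valid` is a hypothetical name, the image and kernel values are made up, and the image is a list of rows with no padding and a stride of one.

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over a 2D image (no padding, stride 1)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for top in range(out_h):            # move down one row at a time
        row = []
        for left in range(out_w):       # slide right one column at a time
            total = 0
            for i in range(k):
                for j in range(k):
                    # Multiply each kernel value by the input value under it.
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

print(conv2d_valid(image, kernel))  # [[18, 21], [30, 33]]
```

The top-left output, 18, comes from 1 + 6 + 11, the three input values that sit under the ones of the kernel at the first position.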

88
00:07:41,680 --> 00:07:44,200
We can now go ahead and increase this input size.

89
00:07:44,200 --> 00:07:46,150
Let's take, for example, seven.

90
00:07:46,150 --> 00:07:53,500
You see we have an input size of seven, a seven by seven input image, and then we have this output, five by

91
00:07:53,500 --> 00:07:54,130
five.

92
00:07:54,400 --> 00:07:58,240
The reason why we have this five by five is because we have this kernel size of three.

93
00:07:58,240 --> 00:08:02,220
If we increase the kernel size to five, you see our output reduces.

94
00:08:02,230 --> 00:08:05,670
You see when the kernel size gets to six, our output reduces.

95
00:08:05,680 --> 00:08:08,350
Let's take this back to three.

96
00:08:08,350 --> 00:08:13,930
So we take it back to three, and this input of seven by seven gives an output of five by five.

97
00:08:13,930 --> 00:08:17,380
We see how, when we get to this, we have that output. Move,

98
00:08:17,380 --> 00:08:25,420
Slide, slide, slide, slide, move to the next here, slide, slide, slide and so on and so forth

99
00:08:25,600 --> 00:08:27,400
right up to the end.

100
00:08:27,400 --> 00:08:35,770
So from here we see how the size of the receptive field of each of these outputs is equal to three, which

101
00:08:35,770 --> 00:08:37,720
is actually our kernel size.

102
00:08:37,990 --> 00:08:47,770
The next thing to notice is that reducing the kernel size permits us to extract more features from the inputs.

103
00:08:48,010 --> 00:08:53,350
You will see that since we have this input seven by seven with a kernel size of three, we have the

104
00:08:53,350 --> 00:08:54,520
output five by five.

105
00:08:54,520 --> 00:09:01,330
So we've extracted many more features from this input as compared to when we use a kernel size of six

106
00:09:01,330 --> 00:09:05,710
here, where we extract fewer features from the inputs.

107
00:09:05,800 --> 00:09:15,850
Now, while using a smaller kernel size permits us to extract much more complex information or complex

108
00:09:15,850 --> 00:09:24,430
features from the inputs, working with larger kernels permits us to extract larger input features.

109
00:09:24,430 --> 00:09:33,010
One logical question which may come to your mind is how do I get the size of this output feature map

110
00:09:33,010 --> 00:09:33,970
right here?

111
00:09:34,420 --> 00:09:44,250
To get this, we'll use this formula, where the output width is equal to the input width minus the filter

112
00:09:44,260 --> 00:09:45,580
size plus one.

113
00:09:46,030 --> 00:09:53,860
If we take this example, we have an input width of seven minus a filter size of three plus one.

114
00:09:53,860 --> 00:09:55,480
This gives us five.

115
00:09:55,480 --> 00:09:57,370
So that's how we obtain this right here.

116
00:09:57,670 --> 00:10:05,530
Now if we take, say, kernel size, let's take the kernel size to be four for example.

117
00:10:06,100 --> 00:10:16,210
In that case we have seven minus the kernel size, four, which is three, plus one, which gives us an

118
00:10:16,210 --> 00:10:17,230
output of four.

119
00:10:17,230 --> 00:10:19,750
And that's how we obtain this output right here.
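
Written out, the formula used in the last two examples is as follows. The symbols are shorthand introduced here: W_in for the input width, F for the filter (kernel) size and W_out for the output width, with no padding and a stride of one.

```latex
W_{\text{out}} = W_{\text{in}} - F + 1

% Worked examples from above:
% F = 3:  W_{\text{out}} = 7 - 3 + 1 = 5
% F = 4:  W_{\text{out}} = 7 - 4 + 1 = 4
```

Rearranging gives F = W_in - W_out + 1, which is how a kernel size can be picked for a desired output width.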

120
00:10:20,140 --> 00:10:27,190
Now, in a case where you're designing a convolutional layer and you want to get a particular feature

121
00:10:27,190 --> 00:10:28,060
map size.

122
00:10:28,060 --> 00:10:29,920
So for example, in this case, if you

123
00:10:30,000 --> 00:10:39,240
want to get an output size of three, then all you need to do is specify the kernel size to be equal to

124
00:10:39,240 --> 00:10:41,880
five and you should get this output.

125
00:10:42,330 --> 00:10:50,790
Now, in some cases you may want to have a particular feature map shape, but using just the kernel

126
00:10:50,790 --> 00:10:53,310
size, you wouldn't be able to get that.

127
00:10:53,700 --> 00:11:00,390
And so, to match up with the particular output shape, we would include padding.

128
00:11:00,540 --> 00:11:02,010
Let's look at this padding.

129
00:11:02,040 --> 00:11:04,260
You see, we have a seven by seven.

130
00:11:04,260 --> 00:11:11,250
And when we set the padding to one, you'll see that we go from seven by seven to now nine by nine.

131
00:11:11,280 --> 00:11:12,460
Notice how it's written here.

132
00:11:12,480 --> 00:11:14,460
After padding, we have nine by nine.

133
00:11:14,460 --> 00:11:16,920
So we go from this seven by seven.

134
00:11:16,920 --> 00:11:20,160
Let's take this in this box right here, this internal box.

135
00:11:20,160 --> 00:11:22,500
And then the padding gives us nine by nine.

136
00:11:22,530 --> 00:11:30,060
Notice how, after doing the padding, if we play around with our kernel size, the output goes from one by one to

137
00:11:30,060 --> 00:11:31,320
eight by eight.

138
00:11:31,320 --> 00:11:38,580
So we go from one by one to eight by eight. But take zero padding and an input size of seven.

139
00:11:39,480 --> 00:11:42,780
You see we go to three, four and six.

140
00:11:42,780 --> 00:11:44,600
So we cannot go up to eight by eight.

141
00:11:44,610 --> 00:11:48,690
We're not going to have an output of eight by eight when the padding is zero.

142
00:11:48,690 --> 00:11:55,770
Whereas when we increase the padding by one here, increase the padding by one, that's one.

143
00:11:55,980 --> 00:12:04,110
You see that we could get an output of, say, three, or we could get an

144
00:12:04,110 --> 00:12:12,060
output of seven, of eight, you see, all the way down to one.

145
00:12:12,210 --> 00:12:14,220
So that's it.

146
00:12:14,220 --> 00:12:16,920
Now we understand how this padding works.

147
00:12:16,920 --> 00:12:25,250
Basically, we just have our input and then we add these surrounding elements right here.

148
00:12:25,260 --> 00:12:29,080
We just pad our input image.
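
Padding can be sketched in a few lines of plain Python. This assumes zero padding with the same amount added on every side; `zero_pad` is a hypothetical helper name and the image is a list of rows.

```python
def zero_pad(image, pad):
    """Surround a 2D image (list of rows) with `pad` rings of zeros."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]            # rows of zeros on top
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)  # zeros left and right
    padded.extend([0] * width for _ in range(pad))        # rows of zeros below
    return padded


image = [[1] * 7 for _ in range(7)]        # a 7 x 7 input filled with ones
padded = zero_pad(image, 1)

print(len(padded), len(padded[0]))         # 9 9
print(padded[0])                           # [0, 0, 0, 0, 0, 0, 0, 0, 0]
print(padded[1])                           # [0, 1, 1, 1, 1, 1, 1, 1, 0]
```

A padding of one turns the seven by seven input into nine by nine, as in the demo.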

149
00:12:29,100 --> 00:12:36,240
Now the padding we generally use is zero padding. There are other padding methods, but zero

150
00:12:36,240 --> 00:12:42,870
padding is one of the most common since it's easy to use and it's computationally less expensive to

151
00:12:42,870 --> 00:12:43,590
work with.

152
00:12:43,620 --> 00:12:45,870
So that's our padding.

153
00:12:45,870 --> 00:12:52,110
And then we also have another advantage of working with padding, which is that of ensuring that

154
00:12:52,110 --> 00:12:59,250
the corner pixels have an influence on the output features which are generated.

155
00:12:59,280 --> 00:13:01,370
Now let's take this back to zero padding.

156
00:13:01,380 --> 00:13:05,550
We have zero padding and we have this input image right here.

157
00:13:06,210 --> 00:13:12,720
If we're dealing with an image in which most of the information or most of the relevant information is centered,

158
00:13:12,720 --> 00:13:21,990
then there is actually no issue since this filter right here will go through each and every pixel we

159
00:13:21,990 --> 00:13:24,990
have where we have this person in the image.

160
00:13:25,200 --> 00:13:31,620
Now, if we modify this image such that, let's suppose, we have a person's face here

161
00:13:31,620 --> 00:13:38,580
just at the corner of the image, you'll see that unlike with this image where this person was centered

162
00:13:38,580 --> 00:13:47,190
and our filter was able to pass through each and every part of this image,

163
00:13:47,190 --> 00:13:51,090
with this one, we have a different scenario.

164
00:13:51,900 --> 00:13:55,200
We've just modified this so it looks similar to what we had here.

165
00:13:55,290 --> 00:14:05,670
Now, if we monitor the number of times this filter passes over the person's head, we would

166
00:14:05,670 --> 00:14:14,040
see that here we have one time, because our filter passes here once. The next time, after the sliding,

167
00:14:14,040 --> 00:14:15,480
we'll have this.

168
00:14:15,480 --> 00:14:21,420
So that's the second time. The next time, after this other sliding, and so on, leads to four times.

169
00:14:21,630 --> 00:14:26,370
So it again passes through the head, and then, yeah, we have five times.

170
00:14:26,370 --> 00:14:28,950
No, we have one, two, three, four times actually.

171
00:14:28,950 --> 00:14:36,750
But in this case, since we are on the borders, you see we have this one time and that's practically

172
00:14:36,750 --> 00:14:37,150
all.

173
00:14:37,170 --> 00:14:42,120
So we see that in this example, or in this case where we are on the borders,

174
00:14:42,120 --> 00:14:53,790
this region influences the outputs in a smaller way, or exerts less influence on the values we get in the feature

175
00:14:53,790 --> 00:14:55,320
maps which are generated.
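
The "one time at the border, four times in the middle" observation can be counted per pixel. This is a minimal sketch assuming a square input, a stride of one and no padding; `coverage_counts` is a hypothetical helper name, and the 4 by 4 input with a 2 by 2 filter is chosen only for illustration.

```python
def coverage_counts(size, kernel_size):
    """For each pixel, count how many kernel positions cover it."""
    positions = size - kernel_size + 1
    counts = [[0] * size for _ in range(size)]
    for top in range(positions):
        for left in range(positions):
            # Mark every pixel under the kernel at this position.
            for i in range(kernel_size):
                for j in range(kernel_size):
                    counts[top + i][left + j] += 1
    return counts


for row in coverage_counts(4, 2):
    print(row)
# [1, 2, 2, 1]
# [2, 4, 4, 2]
# [2, 4, 4, 2]
# [1, 2, 2, 1]
```

Corner pixels are covered once and interior pixels four times. If the same image is zero padded by one, it becomes 6 by 6, the original corner pixel sits at row 1, column 1, and `coverage_counts(6, 2)[1][1]` is 4.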

176
00:14:55,620 --> 00:15:02,040
So let's suppose an example where all this wasn't there, and where our

177
00:15:02,040 --> 00:15:06,480
information lies mostly here. You'll find that it would have been better

178
00:15:08,410 --> 00:15:14,410
to at least pass through this head region, just as we did with this one here, where we passed through the

179
00:15:14,410 --> 00:15:15,700
head region four times.

180
00:15:15,700 --> 00:15:23,050
And we're able to extract very useful information from this image because our filter goes through this

181
00:15:23,050 --> 00:15:24,310
many more times.

182
00:15:24,520 --> 00:15:27,400
Now, to remedy the situation, we have padding.

183
00:15:27,400 --> 00:15:29,170
You see that when we increase the padding.

184
00:15:29,170 --> 00:15:30,880
So let's take this to one.

185
00:15:30,880 --> 00:15:32,260
We've increased the padding.

186
00:15:32,260 --> 00:15:36,730
Let's increase the padding to, oh, here it's fixed to just one anyway.

187
00:15:36,730 --> 00:15:40,840
We just increase the padding to one and then we'll retake that example.

188
00:15:41,560 --> 00:15:43,450
Now we're retaking the example.

189
00:15:43,480 --> 00:15:51,040
We'll see that if we maintain our filter size of two, this filter here will at least touch the hair,

190
00:15:51,040 --> 00:15:52,150
although slightly.

191
00:15:52,150 --> 00:15:55,540
And then the second one does this too, so twice.

192
00:15:55,840 --> 00:16:02,140
So in this case, our filter touches the head twice as compared to previously, where it touches the

193
00:16:02,140 --> 00:16:11,170
head only once and hence this useful information has more influence on the output features which are

194
00:16:11,350 --> 00:16:16,870
being generated right here, which is very important, as practically that's what we're trying to do.

195
00:16:16,870 --> 00:16:21,850
We're trying to extract information from this input and pass it to the output.

196
00:16:22,960 --> 00:16:26,560
Another hyperparameter which we could look at is the stride.

197
00:16:27,010 --> 00:16:31,270
Now note that for now we've looked at this one, this and this.

198
00:16:31,270 --> 00:16:34,090
So these are what we call the hyperparameters.

199
00:16:34,750 --> 00:16:38,680
We have this stride and we'll understand actually how this works.

200
00:16:38,680 --> 00:16:45,490
Up to now, we've been dealing with a stride of one, simply because you just slide through one step before

201
00:16:45,490 --> 00:16:46,360
going to the next.

202
00:16:46,360 --> 00:16:50,080
So you just slide one, one, one, and so on and so forth.

203
00:16:50,110 --> 00:16:56,230
Now, if you move this to two, we start from this position right here, fix that, you start from this

204
00:16:56,230 --> 00:16:59,890
position, there is a problem.

205
00:17:00,730 --> 00:17:04,310
We get back to one and back to two.

206
00:17:04,420 --> 00:17:08,520
You notice that as we go to one, this turns blue and then to two it turns red.

207
00:17:08,530 --> 00:17:16,390
And the reason why it turns red is that there is no valid output; that is, there is no whole number which,

208
00:17:16,690 --> 00:17:19,870
with a stride of two, can be produced as the output size.

209
00:17:19,870 --> 00:17:23,980
So in fact, what we're going to do is we modify this input right here.

210
00:17:23,980 --> 00:17:26,140
So we modify this input, and it works now.

211
00:17:26,140 --> 00:17:30,720
So it's possible for us to go from this input of six to this output with a stride of two.

212
00:17:30,730 --> 00:17:32,860
So let's now understand how strides work.

213
00:17:32,860 --> 00:17:37,900
We start with this. Notice now how we're going to move two steps instead of just one step.

214
00:17:37,900 --> 00:17:44,470
So as we go, you see we move two steps. Notice: two, two, and that's it.

215
00:17:44,470 --> 00:17:49,450
So we move to the next, and now even moving downward, you see, it's not like with a stride of one, where

216
00:17:49,450 --> 00:17:52,000
we just had one step, like we just tried with one.

217
00:17:52,000 --> 00:17:57,790
We just did this one step, but with a stride of two, we're going two steps down.

218
00:17:57,790 --> 00:17:58,420
So that's it.

219
00:17:58,690 --> 00:18:09,370
Now, that said, increasing the size of our stride actually reduces the size of our output and hence

220
00:18:09,370 --> 00:18:13,930
reduces the amount of information we extract from the inputs.

221
00:18:14,740 --> 00:18:20,560
And so in general, we get better results by working with smaller kernels and smaller stride values.

222
00:18:20,560 --> 00:18:25,510
Since we're able to extract more information from our inputs.

223
00:18:26,290 --> 00:18:29,350
Generally the kernel size used is three.

224
00:18:29,350 --> 00:18:33,640
The kernel size of three is generally used in practice and a stride of one.

225
00:18:33,760 --> 00:18:37,180
And we may decide to use padding or not.

226
00:18:38,440 --> 00:18:44,890
And the new formula, when we include the padding and the stride, is given as such.

227
00:18:44,890 --> 00:18:53,860
So we have the output equal to the input size minus the filter size plus two times the padding, all divided by

228
00:18:53,860 --> 00:18:57,460
the stride plus one.

229
00:18:57,670 --> 00:19:04,780
In this case, if we have an input size of six, so here we have six, we have six minus the filter size,

230
00:19:04,780 --> 00:19:05,500
three.

231
00:19:06,310 --> 00:19:07,690
Plus two times

232
00:19:07,690 --> 00:19:09,700
the padding, and our padding is one.

233
00:19:09,700 --> 00:19:10,930
So it's plus two.

234
00:19:11,380 --> 00:19:13,810
Divide that by the stride.

235
00:19:13,900 --> 00:19:16,980
The stride is one, and then plus one.

236
00:19:16,990 --> 00:19:17,680
There we go.

237
00:19:17,680 --> 00:19:22,990
We have an answer of three plus two, which is five, and then plus one, which is six.

238
00:19:22,990 --> 00:19:26,200
So that's how we obtain this output size right here.
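
The general formula can be sketched as a small function. The name `conv_output_size` is hypothetical, and returning `None` when the division is not exact is a choice made here to mirror the demo turning red; many libraries round down instead.

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Output size = (input - filter + 2 * padding) / stride + 1.

    Returns None when the result is not a whole number.
    """
    span = input_size - filter_size + 2 * padding
    if span < 0 or span % stride != 0:
        return None
    return span // stride + 1


print(conv_output_size(6, 3, padding=1, stride=1))  # 6
print(conv_output_size(7, 3))                       # 5
print(conv_output_size(7, 4))                       # 4
print(conv_output_size(7, 3, stride=2))             # 3
print(conv_output_size(8, 3, stride=2))             # None
```

The first call is the worked example above: six minus three plus two is five, divided by one, plus one gives six.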

239
00:19:27,100 --> 00:19:34,060
Also note that one good thing with working with a library like TensorFlow is when you don't know the

240
00:19:34,060 --> 00:19:41,560
exact padding to use such that you have a particular output size, you could specify the padding to be

241
00:19:41,560 --> 00:19:42,190
same.

242
00:19:42,190 --> 00:19:48,220
Once you specify the padding to be same, TensorFlow automatically calculates the padding for you such

243
00:19:48,220 --> 00:19:51,970
that, with a stride of one, the output size matches the input size.
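
In TensorFlow's Keras API, the `padding` argument of `Conv2D` takes `"valid"`, meaning no padding, or `"same"`, meaning the input is padded so that with a stride of one the output keeps the input's height and width. The sketch below does not call TensorFlow; it only reproduces the output-size rules documented for the two modes, and `tf_style_output_size` is a hypothetical name.

```python
import math


def tf_style_output_size(input_size, filter_size, stride=1, padding="valid"):
    """Output size along one axis for TensorFlow-style padding modes."""
    if padding == "same":
        # Padding is chosen automatically; the filter size does not matter.
        return math.ceil(input_size / stride)
    if padding == "valid":
        # No padding at all.
        return math.ceil((input_size - filter_size + 1) / stride)
    raise ValueError("padding must be 'valid' or 'same'")


print(tf_style_output_size(7, 3, padding="valid"))            # 5
print(tf_style_output_size(7, 3, padding="same"))             # 7
print(tf_style_output_size(7, 3, stride=2, padding="valid"))  # 3
print(tf_style_output_size(7, 3, stride=2, padding="same"))   # 4
```

With `"same"` and a stride of one, a seven by seven input stays seven by seven whatever the kernel size.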

244
00:19:52,150 --> 00:19:58,030
Up to this point, we've been supposing that our input image is two dimensional.

245
00:19:58,030 --> 00:20:00,500
That is, we have a height and a width.

246
00:20:00,520 --> 00:20:09,070
Now what if we use the kinds of images we have in real life, that is, 3D images where we have the

247
00:20:09,070 --> 00:20:14,740
red channel, the green channel, and the blue channel.

248
00:20:14,740 --> 00:20:22,450
So if we have this RGB image right here, we'll see how we get the output.

249
00:20:22,660 --> 00:20:25,990
Now, the way this is done is quite straightforward.

250
00:20:25,990 --> 00:20:29,170
So what we have is, in this case, we include this padding.

251
00:20:29,170 --> 00:20:30,700
So you see zero padding.

252
00:20:30,700 --> 00:20:35,320
We've included this padding, and then we have our kernel right here.

253
00:20:35,350 --> 00:20:43,960
Now notice how we have this kernel, and then it goes through this first part of this top left corner.

254
00:20:43,960 --> 00:20:47,860
And what we do is we multiply each value right here.

255
00:20:47,860 --> 00:20:54,040
We have negative one times zero plus negative one times zero, plus negative one times zero, and so

256
00:20:54,040 --> 00:20:54,790
on and so forth.

257
00:20:54,820 --> 00:20:57,040
Up to plus negative one times zero.

258
00:20:57,040 --> 00:20:59,290
Right here you have negative one times four.

259
00:20:59,290 --> 00:21:02,740
So basically what we're doing here is kind of like a dot product.

260
00:21:02,740 --> 00:21:07,960
So we're taking all these elements, multiplying them by the corresponding elements, and then adding all this

261
00:21:07,960 --> 00:21:08,410
up.

262
00:21:08,410 --> 00:21:11,560
Once all this is added up, we should have 41.

263
00:21:11,560 --> 00:21:15,550
So you could take this as a simple exercise and then you should be able to get 41.

264
00:21:15,550 --> 00:21:20,110
And then we have this other channel right here where we repeat the same process.

265
00:21:20,110 --> 00:21:24,430
We have this and then we have this kernel, we have this.

266
00:21:24,430 --> 00:21:30,970
So here what we do is we take this and obtain this value, take this and obtain this value,

267
00:21:30,970 --> 00:21:32,080
this and obtain this value.

268
00:21:32,080 --> 00:21:35,800
And then we add all this up to get this output of 41.

269
00:21:35,800 --> 00:21:38,860
Now note that we also have the bias which we've included.

270
00:21:38,860 --> 00:21:40,720
So for now we have this 41.

271
00:21:40,720 --> 00:21:45,340
Now we move to the next step, so we slide to the next step.

272
00:21:45,340 --> 00:21:47,410
Here, the stride is equal to one, actually.

273
00:21:47,410 --> 00:21:51,010
From here we do the same process negative one times zero.

274
00:21:51,010 --> 00:21:54,460
This negative one times zero, added up, and all of that.

275
00:21:54,460 --> 00:21:57,490
So all that added up, all this added up, all this added up.

276
00:21:57,490 --> 00:21:58,600
We have 12.

277
00:21:58,600 --> 00:22:01,570
We repeat the same process right up to the end.

278
00:22:01,570 --> 00:22:04,480
So that's how we obtain our output right here.
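
The multi-channel step can be sketched as follows: each channel has its own kernel slice, the per-channel dot products are summed, and a bias is added. The function name `conv_at` and all numbers below are made up for illustration; they are not the values shown on screen.

```python
def conv_at(channels, kernels, top, left, bias=0):
    """One output value: sum over channels of (patch * kernel), plus bias."""
    total = bias
    for channel, kernel in zip(channels, kernels):
        for i, kernel_row in enumerate(kernel):
            for j, weight in enumerate(kernel_row):
                total += channel[top + i][left + j] * weight
    return total


# Hypothetical 2 x 2 red, green and blue channels with one 2 x 2 kernel each.
red = [[1, 2], [3, 4]]
green = [[0, 1], [1, 0]]
blue = [[2, 2], [2, 2]]

kernel_red = [[1, 0], [0, 1]]      # contributes 1 + 4 = 5
kernel_green = [[1, 1], [1, 1]]    # contributes 0 + 1 + 1 + 0 = 2
kernel_blue = [[-1, 0], [0, 0]]    # contributes -2

value = conv_at([red, green, blue], [kernel_red, kernel_green, kernel_blue],
                top=0, left=0, bias=1)
print(value)  # 6
```

Sliding `top` and `left` over a larger input, as with a single channel, fills in the rest of the output.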

279
00:22:04,750 --> 00:22:11,230
You could also check out this other, more visually appealing example on the CNN Explainer website.

280
00:22:11,230 --> 00:22:16,840
So here we double click on this and then we see exactly what's going on.

281
00:22:18,210 --> 00:22:18,850
We have the.

282
00:22:19,350 --> 00:22:25,470
You see how we're forming this output by sliding the kernels over the inputs.

283
00:22:26,310 --> 00:22:32,040
So here, I click on this, you see it slides through this, and you see how all these values are multiplied

284
00:22:32,040 --> 00:22:35,340
and then added together to form the output.

285
00:22:35,640 --> 00:22:36,330
So that's it.

286
00:22:37,410 --> 00:22:41,610
From this point, we move to this Explained Visually

287
00:22:41,610 --> 00:22:43,650
project by Setosa.

288
00:22:44,870 --> 00:22:48,400
Filtering is a core part of image processing.

289
00:22:48,410 --> 00:22:54,860
And since we're dealing with image data here, it's important for us to better understand how this works.

290
00:22:54,860 --> 00:23:02,210
So right here we'll choose a kernel or filter. Let's pick this sharpen filter right here, this

291
00:23:02,210 --> 00:23:03,380
sharpen kernel.

292
00:23:03,410 --> 00:23:09,540
Notice how, once we pick a kernel, these values change; that is, the values of the filter actually change.

293
00:23:09,560 --> 00:23:11,000
Let's take this outline.

294
00:23:11,120 --> 00:23:12,100
Notice the change.

295
00:23:12,110 --> 00:23:22,100
Now let's keep this outline and then we notice that every time we hover over this input image, we have

296
00:23:22,100 --> 00:23:26,150
the values right here that's to the right.

297
00:23:26,150 --> 00:23:28,070
So for now, we have this.

298
00:23:28,070 --> 00:23:34,140
And then let's notice that here we have this negative one; that's negative one times the value at this position.

299
00:23:34,160 --> 00:23:43,610
Notice at this position we have the value 255, so 255 times negative one, plus 255 times negative one, plus

300
00:23:43,640 --> 00:23:46,010
no value and none here, because we are at the border.

301
00:23:46,010 --> 00:23:55,130
So there is no value, and then we have 249 times negative one, 255 times eight, 233 times negative one, 255 times negative

302
00:23:55,130 --> 00:23:55,700
one.

303
00:23:55,700 --> 00:24:02,030
And then all this sums up to give us a given value. Since we have these unknown values here, the result is unknown,

304
00:24:02,030 --> 00:24:03,260
so we have no value there.

305
00:24:03,260 --> 00:24:09,680
But if we change this position and then we fit around the eye region, you see we have a value of -533.

306
00:24:09,680 --> 00:24:20,030
So we notice that no matter the image we put in, we will always get an output where the outlines

307
00:24:20,540 --> 00:24:22,340
are being highlighted.

308
00:24:23,150 --> 00:24:29,810
So if you take out this image which we just imported of Elon Musk, you see that the output is this

309
00:24:31,040 --> 00:24:34,820
image right here which highlights the edges.

310
00:24:35,120 --> 00:24:41,960
And so, as is explained here, this kernel highlights large differences in pixel values, which generally occur

311
00:24:41,960 --> 00:24:42,620
at the edges.

312
00:24:42,620 --> 00:24:49,310
So around this region you see these large differences have been outlined, as compared to this zone here

313
00:24:49,310 --> 00:24:50,390
where there is no difference.

314
00:24:50,390 --> 00:24:54,470
And since there is no difference, we just have this black region.
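
The behaviour of the outline kernel can be checked on two small patches. This is a minimal sketch: the kernel values are the ones read out in the lecture (eight in the centre, negative one elsewhere), while the patches and the helper name `apply_kernel` are made up for illustration.

```python
OUTLINE = [
    [-1, -1, -1],
    [-1, 8, -1],
    [-1, -1, -1],
]


def apply_kernel(patch, kernel):
    """Multiply a patch by a kernel element by element and sum the result."""
    return sum(p * w
               for patch_row, kernel_row in zip(patch, kernel)
               for p, w in zip(patch_row, kernel_row))


flat_patch = [[255, 255, 255]] * 3      # no difference between pixels
edge_patch = [[0, 255, 255]] * 3        # dark column next to bright columns

print(apply_kernel(flat_patch, OUTLINE))  # 0
print(apply_kernel(edge_patch, OUTLINE))  # 765
```

A uniform region sums to zero because the eight in the centre cancels the eight neighbours, which is why flat zones come out black, while a patch that straddles an edge gives a large value.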

315
00:24:55,520 --> 00:24:58,940
So here again you see that we have the filter.

316
00:24:58,940 --> 00:24:59,870
That's it.

317
00:24:59,960 --> 00:25:02,300
This is the exact same value we have right here.

318
00:25:02,300 --> 00:25:03,620
We have the filter.

319
00:25:03,620 --> 00:25:09,560
And then this example shows us how this output is gotten.

320
00:25:10,880 --> 00:25:16,790
Now, the major difference between what we're doing right here and the convolutional neural networks

321
00:25:16,790 --> 00:25:21,200
is that with this we know these kernel values.

322
00:25:21,200 --> 00:25:25,850
So we know that we have this matrix: negative one, negative one, negative one, negative one, eight,

323
00:25:25,850 --> 00:25:30,260
negative one, negative one, negative one, negative one, which is an edge detector.

324
00:25:30,260 --> 00:25:34,220
So we know this and then we know the obvious output.

325
00:25:34,460 --> 00:25:42,650
But with a convolutional neural network or with a convolution layer to be specific, what we do is initially

326
00:25:42,650 --> 00:25:52,580
we just initialize these values and we let the model learn them automatically during training.

327
00:25:52,580 --> 00:25:57,650
So these values are learned by the model during training automatically.

328
00:25:58,250 --> 00:26:06,410
One of the very first convolutional neural networks, called LeNet, was built by Yann LeCun in 1989.

329
00:26:07,610 --> 00:26:13,580
And right here we have the structure of this ConvNet known as the LeNet.

330
00:26:13,850 --> 00:26:16,460
This LeNet takes in an image.

331
00:26:16,460 --> 00:26:20,480
Here we have a 28 by 28 by one image.

332
00:26:20,480 --> 00:26:23,240
Here we just have one channel, a black and white image.

333
00:26:23,240 --> 00:26:29,330
And then we pass this to a convolutional layer, which we've just seen. After passing through the convolutional

334
00:26:29,330 --> 00:26:32,240
layer, we have the sigmoid.

335
00:26:32,240 --> 00:26:38,210
But also note that here we have a five by five kernel and a padding of two.

336
00:26:38,210 --> 00:26:47,570
So since it's zero padding, we add zeros around our input and then add another layer, or another group of

337
00:26:47,570 --> 00:26:49,250
zeros around our features.

338
00:26:49,250 --> 00:26:53,720
Then from here we have this output 28 by 28 by six.

339
00:26:53,720 --> 00:26:57,320
So shortly we'll see exactly how this output is gotten.

340
00:26:57,320 --> 00:27:02,810
So we've seen that we have the sigmoid activation. From here, we have the pooling layer.

341
00:27:02,810 --> 00:27:06,590
Now this is a subsampling layer and we'll understand how it works.

342
00:27:06,590 --> 00:27:15,290
So we have a two by two average pooling kernel with a stride of two. From this we have another convolutional

343
00:27:15,290 --> 00:27:17,900
layer with five by five kernel.

344
00:27:17,930 --> 00:27:23,870
Here there is no padding, and we have this output, then activation, pooling, and flattening. With flattening,

345
00:27:23,870 --> 00:27:32,750
all the features are rearranged, so we go from this 3D tensor to a 1D tensor.

346
00:27:32,870 --> 00:27:39,800
And then we have this dense, fully connected layer followed by a sigmoid, followed by another dense,

347
00:27:39,800 --> 00:27:41,930
fully connected layer followed by a sigmoid.

348
00:27:41,930 --> 00:27:44,060
And finally, we have this

349
00:27:44,520 --> 00:27:48,270
layer with ten output neurons.

350
00:27:48,930 --> 00:27:57,510
And the reason why we have this output of ten neurons, since we have ten classes,

351
00:27:57,510 --> 00:28:03,210
is that we are predicting which digit an input is, for example a one.

352
00:28:03,210 --> 00:28:08,760
So these inputs are images of handwritten digits.

353
00:28:08,760 --> 00:28:16,080
So we want to predict whether the handwritten digit is a one or two or three up to nine.

354
00:28:16,410 --> 00:28:20,340
And so here we have this and then we also have a zero.

355
00:28:20,340 --> 00:28:21,440
So that makes ten.

356
00:28:21,450 --> 00:28:24,870
So 0 to 9 gives us ten possibilities.

357
00:28:24,870 --> 00:28:27,450
And that's why we have ten different classes right here.

358
00:28:27,600 --> 00:28:29,370
Now for AlexNet:

359
00:28:30,700 --> 00:28:39,670
it was built to classify an input image into one of a thousand classes in the

360
00:28:39,670 --> 00:28:41,860
ImageNet dataset.

361
00:28:41,890 --> 00:28:47,920
So here we have this AlexNet, with a different architecture, as we can see right here.

362
00:28:49,090 --> 00:28:54,640
And now we'll go ahead and understand how all these outputs are obtained.

363
00:28:55,660 --> 00:29:03,370
And so we'll be rebuilding the LeNet architecture, but this time around, considering an input of 64 by

364
00:29:03,370 --> 00:29:04,520
64 by three.

365
00:29:04,540 --> 00:29:18,550
So right here we have the three channels: we have R, G and B. If we now pass this input through this

366
00:29:18,550 --> 00:29:22,060
convolutional layer, we are going to have this output.

367
00:29:22,060 --> 00:29:23,860
And how do we get this output?

368
00:29:24,100 --> 00:29:29,080
To get this output, we have to take into consideration these parameters which have been given to us

369
00:29:29,080 --> 00:29:30,730
for a convolutional layer.

370
00:29:31,060 --> 00:29:34,370
That is, the filter size is equal to five.

371
00:29:34,390 --> 00:29:37,030
So we have a filter size of five.

372
00:29:37,060 --> 00:29:38,840
There is no padding.

373
00:29:38,860 --> 00:29:40,780
The stride equals one.

374
00:29:40,780 --> 00:29:46,870
And then the number of filters equals six. We will also calculate the number of parameters.

375
00:29:47,470 --> 00:29:51,430
By the way, these are the four most important parameters.

376
00:29:52,000 --> 00:29:56,830
So here we have this, and this is what we call our filter.

377
00:29:56,830 --> 00:29:58,270
We have this filter.

378
00:29:58,270 --> 00:30:05,170
You see that we have the dot products which are computed to get the outputs as usual.

379
00:30:05,170 --> 00:30:12,610
And so we take this, like this, for the R channel; then we follow through with this, and then

380
00:30:12,610 --> 00:30:13,690
we follow with this.

381
00:30:13,690 --> 00:30:20,680
So we take this and compute the dot products, and likewise with the other two.

382
00:30:20,680 --> 00:30:31,300
And then from this we add all this up to obtain each and every value for this very first feature map

383
00:30:31,300 --> 00:30:32,170
right here.

384
00:30:32,200 --> 00:30:35,590
So to obtain this feature map, we are using this filter.

385
00:30:36,190 --> 00:30:44,470
Now, this shows us clearly that when we specify that the number of filters equals six, it doesn't mean

386
00:30:44,470 --> 00:30:47,290
we actually have just six of these.

387
00:30:47,290 --> 00:30:55,510
You may feel that a number of filters equal to six means we have six filters stacked like this.

388
00:30:55,510 --> 00:30:57,970
Here we have five and then six.

389
00:30:57,970 --> 00:31:02,560
So we feel like we have just six stacked like this, which is not actually the case.

390
00:31:02,560 --> 00:31:12,040
What happens is we have these three channels, and then each of our filters also has the

391
00:31:12,040 --> 00:31:13,930
three channels, as you could see.

392
00:31:14,320 --> 00:31:20,800
And so what we call a filter here is this three, or rather this five by five by three

393
00:31:20,800 --> 00:31:24,890
volume right here.

394
00:31:24,940 --> 00:31:30,160
Now, we've done the computation for this filter, and for the next one we repeat the same process.

395
00:31:30,160 --> 00:31:36,700
So we take this for the R, we repeat this, we repeat this, and then we obtain this next one.

396
00:31:36,700 --> 00:31:39,910
We do all this the same way up to this position.

397
00:31:39,910 --> 00:31:41,800
And then we have this right here.

398
00:31:41,800 --> 00:31:47,230
And that's why when we have six filters, we're always going to have six channels.

399
00:31:48,340 --> 00:31:52,780
Then you compute five by five by three, and all that times six.

400
00:31:52,780 --> 00:32:04,030
So five times five times three is 75, and 75 times six gives you a total of 450.

401
00:32:04,300 --> 00:32:11,350
And since for each of these filters we add a bias, that is, one extra bias

402
00:32:11,350 --> 00:32:12,490
for each of these filters,

403
00:32:12,490 --> 00:32:15,550
we have one, two, three, four, five, six biases.

404
00:32:15,550 --> 00:32:23,860
So adding this plus six, we have 456 parameters to be trained.

405
00:32:23,860 --> 00:32:27,760
So here we have 450 weights and six biases.
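
The parameter count above can be checked with a few lines of Python. This is only a sketch of the arithmetic from the lecture; `conv_param_count` is a helper name introduced here, not a library function.

```python
def conv_param_count(kernel_size, in_channels, num_filters, use_bias=True):
    """Trainable parameters of a 2D convolution with square kernels.

    Each filter spans all input channels, so it holds
    kernel_size * kernel_size * in_channels weights, plus one optional bias.
    """
    weights = kernel_size * kernel_size * in_channels * num_filters
    biases = num_filters if use_bias else 0
    return weights + biases


# First convolutional layer of the lecture: 5x5 filters, 3 input channels, 6 filters.
print(conv_param_count(5, 3, 6, use_bias=False))  # 450 weights
print(conv_param_count(5, 3, 6))                  # 456 with the six biases
```

Note that the input height and width (64 by 64) do not appear anywhere in this count: the same filters are reused at every position of the image.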

406
00:32:28,540 --> 00:32:32,500
That said, we've understood why we have these six channels.

407
00:32:32,500 --> 00:32:34,510
Now, how do we obtain this 60 by 60?

408
00:32:34,510 --> 00:32:38,650
The way we obtain the 60 by 60 is by applying this formula, which we've seen already.

409
00:32:38,650 --> 00:32:45,910
So here we just have the input size equal to 64 and the filter size five.

410
00:32:45,910 --> 00:32:49,120
So we have minus five. Padding: no padding.

411
00:32:49,120 --> 00:32:51,010
So the padding is zero and the stride is one.

412
00:32:51,010 --> 00:32:58,570
So we have 64 minus five, and then plus one. This gives us 64 minus four, which equals 60.

413
00:32:58,570 --> 00:33:01,420
And that's how we have this output right here.
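
The formula just applied can be written as a small function. This is a sketch under the assumption of square inputs and filters with the same padding on every side; `conv_output_size` is a name introduced here for illustration.

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Output width (or height) of a convolution along one dimension.

    Implements (X - F + 2P) / S + 1, using integer division because a
    window that does not fully fit is dropped.
    """
    return (input_size - filter_size + 2 * padding) // stride + 1


# Lecture example: 64x64 input, 5x5 filter, no padding, stride 1.
print(conv_output_size(64, 5, padding=0, stride=1))  # 60

# LeNet example from earlier: 28x28 input, 5x5 filter, padding 2, stride 1.
print(conv_output_size(28, 5, padding=2, stride=1))  # 28
```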

414
00:33:03,720 --> 00:33:07,770
From here we move to the subsampling pooling layer.

415
00:33:08,010 --> 00:33:14,790
For the pooling layer, we have these two parameters. That is,

416
00:33:14,790 --> 00:33:18,360
we have the filter size, not the number of filters, the filter size,

417
00:33:18,360 --> 00:33:24,220
and then we have the stride. To obtain the output dimension for the pooling,

418
00:33:24,240 --> 00:33:29,550
the formula is slightly different, because here we don't have the padding, so we have

419
00:33:29,550 --> 00:33:37,980
X minus F, divided by S, plus one, and then we go from this to this feature map right here.

420
00:33:38,010 --> 00:33:41,730
Notice how we still maintain the number of channels, but our

421
00:33:42,510 --> 00:33:49,360
input feature map has been subsampled. For the particular case of max pooling,

422
00:33:49,380 --> 00:33:54,210
if you want to understand how this works, let's say we have this position. Let's take this

423
00:33:54,210 --> 00:33:54,780
position.

424
00:33:54,780 --> 00:33:56,710
Let's take a position where we have some values.

425
00:33:56,730 --> 00:34:01,310
Notice how as we pick this value, you see the max is 0.02.

426
00:34:01,320 --> 00:34:03,330
So if you look around, look at this.

427
00:34:03,330 --> 00:34:08,470
So we see here the max is 0.08, and so on and so forth.

428
00:34:08,490 --> 00:34:13,170
So basically what we're doing is we're just simply sliding through the whole image.

429
00:34:13,170 --> 00:34:20,310
And then, since our kernel size is equal to two, we have a two by two kernel here.

430
00:34:20,310 --> 00:34:26,700
So for every four values, we are going to pick out one to represent them.

431
00:34:26,700 --> 00:34:30,150
And in this case, this one is the max.

432
00:34:30,150 --> 00:34:35,670
In some cases, we will instead take the average of these values; that's known as average pooling.

433
00:34:35,670 --> 00:34:42,960
But for this max pooling, which is the most commonly used, we take the max of all these different

434
00:34:42,960 --> 00:34:46,410
values. That said, to obtain this,

435
00:34:46,410 --> 00:34:50,280
we have X; that is, in this case we have X equals 60.

436
00:34:50,280 --> 00:34:51,900
So let's take this off.

437
00:34:52,110 --> 00:34:53,520
X equals 60,

438
00:34:55,250 --> 00:35:05,030
minus F, with F equal to two, divided by the stride, and our stride is two, and then plus one.

439
00:35:05,030 --> 00:35:12,690
So 60 minus 2 is 58, divided by two is 29, plus one gives us 30.

440
00:35:12,710 --> 00:35:14,000
That's how we get this.

441
00:35:14,000 --> 00:35:16,220
We still maintain the number of channels.

442
00:35:17,240 --> 00:35:23,540
Here's another example showing even more clearly how this max pooling works, or how pooling in general works.

443
00:35:23,540 --> 00:35:28,520
So if we have this input, suppose we are picking out just one

444
00:35:28,520 --> 00:35:33,290
of these channels. We are going to have our kernel, two by two.

445
00:35:33,320 --> 00:35:34,010
That's it.

446
00:35:34,010 --> 00:35:37,220
And then what we get as output is zero,

447
00:35:37,220 --> 00:35:41,580
since the max of all these zeros is zero. We have a stride of two.

448
00:35:41,600 --> 00:35:46,490
So notice how we've shifted by two positions, stride two, and then the max here is two.

449
00:35:46,490 --> 00:35:48,050
We move again.

450
00:35:48,060 --> 00:35:48,860
Stride two.

451
00:35:48,860 --> 00:35:49,940
Max now is two.

452
00:35:49,970 --> 00:35:53,000
We move, we see the max is one. We move,

453
00:35:53,000 --> 00:35:54,050
the max is zero.

454
00:35:54,050 --> 00:35:57,110
And then you go to the next one: the max is three, and so on and so forth.

455
00:35:57,110 --> 00:36:01,350
So that's how we obtain this new feature map right here.

456
00:36:01,370 --> 00:36:08,720
Notice how this input is ten by ten, and then here we have five by five.

457
00:36:08,720 --> 00:36:11,930
So we go from 10 by 10 to 5 by 5.
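
The sliding-window procedure just walked through can be sketched in plain Python. The 4 by 4 grid below is a hypothetical example, not the grid shown on screen, and `max_pool_2d` is a helper written for this illustration rather than a library call.

```python
def max_pool_2d(feature_map, pool_size=2, stride=2):
    """Max pooling over one channel given as a list of rows (no padding)."""
    height, width = len(feature_map), len(feature_map[0])
    out_h = (height - pool_size) // stride + 1
    out_w = (width - pool_size) // stride + 1
    return [
        [
            # Keep the largest value inside the window starting at (r*stride, c*stride).
            max(feature_map[r * stride + i][c * stride + j]
                for i in range(pool_size) for j in range(pool_size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical 4x4 channel: every 2x2 block is replaced by its maximum.
grid = [[0, 0, 1, 2],
        [0, 0, 0, 1],
        [3, 1, 0, 0],
        [0, 2, 0, 0]]
print(max_pool_2d(grid))  # [[0, 2], [3, 0]]
```

Replacing `max` with an average over the window would give average pooling, with the same output shape.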

458
00:36:13,040 --> 00:36:16,250
After this subsampling layer, we have the activation.

459
00:36:16,250 --> 00:36:18,860
We've already looked at activation in the previous section.

460
00:36:18,890 --> 00:36:24,740
We've seen the sigmoid, we've seen the ReLU, the tanh, and the leaky ReLU.

461
00:36:24,740 --> 00:36:30,830
So that said, we understand that, and now we move to the next convolutional layer: filter size equal to

462
00:36:30,830 --> 00:36:38,630
five, padding: no padding, stride one, number of filters 16, number of parameters 2416. So you could take

463
00:36:38,630 --> 00:36:46,490
this as an exercise to be able to show that the total number of parameters we have is 2416.

464
00:36:46,490 --> 00:36:51,470
That said, we're going to have this output right here, where we have the ReLU added, and then we have

465
00:36:51,470 --> 00:36:52,190
this output.

466
00:36:52,190 --> 00:36:56,900
So we have the output 26 by 26 by 16.

467
00:36:56,900 --> 00:37:02,900
Recall that the number of filters dictates the number of channels right here; it's not the filter size, it's the

468
00:37:02,900 --> 00:37:04,010
number of filters.

469
00:37:04,010 --> 00:37:08,600
And then this 26 by 26 is obtained by using the same formula we've seen already.

470
00:37:08,810 --> 00:37:12,800
From here we have another subsampling.

471
00:37:13,100 --> 00:37:16,190
We've understood that already and then we flatten all this out.

472
00:37:16,190 --> 00:37:20,870
So after doing the subsampling, we have this 13 by 13 by 16.

473
00:37:20,870 --> 00:37:28,340
When you multiply all this, it should give you 2704, and this is what we call the flatten layer.

474
00:37:28,340 --> 00:37:31,150
So we pass this to a flatten layer to obtain this.

475
00:37:31,160 --> 00:37:38,600
Now this takes each and every value we have here in our feature map and then just simply places it in

476
00:37:38,600 --> 00:37:42,500
this one dimensional output right here.
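
Putting the two size formulas together, the shapes of the whole feature-extraction part can be traced step by step. This is a sketch that assumes square feature maps and the layer settings quoted in the lecture; `conv_out` and `pool_out` are helper names introduced here.

```python
def conv_out(size, filter_size, padding=0, stride=1):
    """Convolution output size along one dimension: (X - F + 2P) / S + 1."""
    return (size - filter_size + 2 * padding) // stride + 1


def pool_out(size, filter_size=2, stride=2):
    """Pooling output size along one dimension: (X - F) / S + 1."""
    return (size - filter_size) // stride + 1


size, channels = 64, 3                 # input image: 64 x 64 x 3
shapes = [(size, size, channels)]

size, channels = conv_out(size, 5), 6  # conv, 5x5, 6 filters
shapes.append((size, size, channels))
size = pool_out(size)                  # 2x2 pooling, stride 2
shapes.append((size, size, channels))
size, channels = conv_out(size, 5), 16 # conv, 5x5, 16 filters
shapes.append((size, size, channels))
size = pool_out(size)                  # 2x2 pooling, stride 2
shapes.append((size, size, channels))

flattened = size * size * channels     # length of the 1D tensor after flattening
print(shapes)     # [(64, 64, 3), (60, 60, 6), (30, 30, 6), (26, 26, 16), (13, 13, 16)]
print(flattened)  # 2704
```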

477
00:37:42,620 --> 00:37:48,050
Then from here we have the dense layer, which we're used to working with already, and then we have

478
00:37:48,050 --> 00:37:51,260
a thousand neurons, the output.

479
00:37:51,260 --> 00:37:53,720
And finally we have 200 neurons.

480
00:37:53,720 --> 00:38:01,850
In our case, we should have two since we're actually predicting whether it's a parasite or it's non

481
00:38:01,850 --> 00:38:02,480
parasite.

482
00:38:02,480 --> 00:38:03,740
So we should just have two.

483
00:38:03,740 --> 00:38:11,480
Anyway, we understand all this now and we are ready to dive into the code. Before moving on to

484
00:38:11,480 --> 00:38:12,080
the code,

485
00:38:12,080 --> 00:38:18,800
let's see what the feature maps of a trained convolutional neural network look like.

486
00:38:18,920 --> 00:38:26,150
As you could see right here from this example from the Stanford website, we have this car which is

487
00:38:26,150 --> 00:38:30,290
passed into a convolutional neural network, and then at its output

488
00:38:30,290 --> 00:38:32,780
we have it predicting a car.

489
00:38:34,250 --> 00:38:45,570
As you can notice, these first layers actually serve as filters for low-level features like edges.

490
00:38:45,590 --> 00:38:57,110
So you'll notice that these feature maps from the first layers are visually interpretable

491
00:38:57,110 --> 00:38:58,190
outputs.

492
00:38:58,640 --> 00:39:07,430
Now, you will notice also that as we go farther or deeper, we have these outputs which are less visually

493
00:39:07,430 --> 00:39:16,790
interpretable and which focus more on high-level features like the car parts, which permit us to correctly

494
00:39:16,790 --> 00:39:21,140
classify this image as containing a car.

495
00:39:22,100 --> 00:39:31,010
And so the first layers are for feature extraction, and then the last layers are for classification.

496
00:39:31,610 --> 00:39:39,650
Both subsections actually need each other: if you do feature extraction and you don't have layers

497
00:39:39,650 --> 00:39:48,410
which are able to correctly classify this image, then we will not achieve our goal.

498
00:39:48,770 --> 00:40:00,020
And in the same sense, if we have a good classifier, but it gets these kinds of inputs directly, where

499
00:40:00,020 --> 00:40:07,260
we've not extracted useful features from them yet, then our classifier is not going to perform well.

500
00:40:07,280 --> 00:40:15,290
You could also check this ConvNet MNIST demo by Andrej Karpathy, where he trains a model on the MNIST

501
00:40:15,290 --> 00:40:16,070
dataset.

502
00:40:16,100 --> 00:40:20,300
Here we have our model: the input, conv, pool, conv, pool and softmax.

503
00:40:20,300 --> 00:40:26,190
So here we have this on this data set and then we're having this passed in, as you could see.

504
00:40:26,210 --> 00:40:35,780
Notice how in these first layers, the feature maps actually contain much more visual content as compared

505
00:40:35,780 --> 00:40:39,100
to these final layers right here.

506
00:40:39,110 --> 00:40:47,570
See, if you look at these final layers, you see that we have these feature maps which instead contain

507
00:40:47,570 --> 00:40:54,680
content which permits us to say whether a particular input belongs to a particular class.

508
00:40:55,280 --> 00:41:01,760
If you want to build convolutional layers with TensorFlow, you could make use of this TensorFlow Keras

509
00:41:01,760 --> 00:41:03,200
model right here.

510
00:41:03,440 --> 00:41:10,070
So that said, we're just going to go to tf.keras, and then we pick layers, and then

511
00:41:10,070 --> 00:41:13,190
we'll find the Conv2D layer.

512
00:41:13,190 --> 00:41:20,600
So that is the layer we're searching for. Click on that, and here we go, we have the arguments.

513
00:41:20,600 --> 00:41:23,090
That's what we pass in this layer.

514
00:41:23,750 --> 00:41:27,470
The filters argument right here corresponds to the number of filters.

515
00:41:27,470 --> 00:41:28,670
So let's take this off.

516
00:41:28,700 --> 00:41:36,200
That is the number of filters, which we've seen already, and then the kernel_size corresponds to the filter

517
00:41:36,200 --> 00:41:36,980
size.

518
00:41:37,370 --> 00:41:41,750
From there we move to the strides, which take a tuple.

519
00:41:42,050 --> 00:41:45,010
Let's take this off and check this out.

520
00:41:45,020 --> 00:41:48,050
So here we have this tuple, actually.

521
00:41:48,050 --> 00:41:55,730
And then what's important to note here is the fact that if you want to use strides of, say, one,

522
00:41:55,730 --> 00:41:59,000
two, that is (1, 2), then

523
00:42:00,810 --> 00:42:08,040
in the height dimension your stride is going to be equal to one, and in the width dimension the stride will

524
00:42:08,040 --> 00:42:09,000
be equal to two.

525
00:42:09,030 --> 00:42:16,380
So if we have this feature map right here and we pass a kernel, we are going to be sliding

526
00:42:18,060 --> 00:42:25,230
and moving two steps in the horizontal direction, while when we are going downwards,

527
00:42:25,230 --> 00:42:27,630
we are going to be moving only one step.

528
00:42:27,630 --> 00:42:28,960
So that's what this means.

529
00:42:28,980 --> 00:42:35,610
Now, in the case where we have exactly the same number of steps horizontally and vertically,

530
00:42:35,610 --> 00:42:40,530
then the strides, like in this case, can just be given as a single integer, here equal to one.
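
To see what a (height, width) strides tuple does to the output shape, here is a small sketch in plain Python. It assumes no padding and a square kernel; `conv_output_shape` is a helper name made up for this illustration, and the 8 by 8 feature map with a 3 by 3 kernel is a hypothetical example.

```python
def conv_output_shape(height, width, kernel_size, strides):
    """Output (height, width) of a convolution without padding.

    `strides` may be a single integer, applied to both dimensions,
    or a (stride_height, stride_width) tuple.
    """
    if isinstance(strides, int):
        strides = (strides, strides)
    stride_h, stride_w = strides
    out_h = (height - kernel_size) // stride_h + 1
    out_w = (width - kernel_size) // stride_w + 1
    return out_h, out_w


# Hypothetical 8x8 feature map with a 3x3 kernel.
print(conv_output_shape(8, 8, 3, (1, 2)))  # (6, 3): one step down, two steps across
print(conv_output_shape(8, 8, 3, 1))       # (6, 6): same stride in both directions
```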

531
00:42:41,430 --> 00:42:43,080
From there we have the padding.

532
00:42:43,080 --> 00:42:45,390
By default, the padding is valid here.

533
00:42:45,390 --> 00:42:52,830
We're told that for the padding, if the padding is valid, it means there's actually no padding.

534
00:42:52,830 --> 00:42:54,210
So we have zero padding.

535
00:42:54,210 --> 00:43:01,200
And then when the padding is same, the input is padded with zeros to ensure

536
00:43:01,200 --> 00:43:05,250
that the output dimension equals the input dimension.

537
00:43:06,480 --> 00:43:13,560
Also note that this is only possible when the stride is equal to one. So that said, if we have a feature

538
00:43:13,560 --> 00:43:23,520
map like this, let's say we have 60 by 60, and then we want to pass this to a conv layer to get an output

539
00:43:23,520 --> 00:43:24,390
feature map.

540
00:43:24,390 --> 00:43:31,470
And if we want our output feature map to have the same dimensions as the input, that is 60 by 60, then all we

541
00:43:31,470 --> 00:43:35,100
need to do is specify that the padding is same.

542
00:43:35,100 --> 00:43:39,330
So once we have this, the output will be the same as that of the input.
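
As a rough check of the difference between valid and same padding, the sketch below computes the output size for the 60 by 60 example with a stride of one. The 5 by 5 filter size is an assumption made for illustration (the lecture does not give one here), and both helper names are introduced only for this sketch.

```python
def same_padding_total(filter_size):
    """Total zeros to add along one dimension so that output == input at stride 1."""
    return filter_size - 1


def output_size(input_size, filter_size, total_padding=0, stride=1):
    """Output size along one dimension, given the total padding on that dimension."""
    return (input_size + total_padding - filter_size) // stride + 1


filter_size = 5                           # assumed filter size, for illustration only
total = same_padding_total(filter_size)   # 4 zeros in total, i.e. 2 on each side

print(output_size(60, filter_size))          # 56 with valid (no) padding
print(output_size(60, filter_size, total))   # 60 with same padding
```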

543
00:43:40,080 --> 00:43:42,390
From here we have the data format.

544
00:43:42,420 --> 00:43:46,100
Now note that there are generally two kinds of data formats.

545
00:43:46,110 --> 00:43:48,030
That is, we have height.

546
00:43:48,480 --> 00:43:53,640
So for an image, we have the height, width and then the channel.

547
00:43:54,030 --> 00:44:08,280
So if we have a 224 by 224 by three image, then we are actually making use of this data format.

548
00:44:08,460 --> 00:44:12,330
Now we could also use this format where we start with the channel.

549
00:44:12,330 --> 00:44:17,490
So we have the channel by height by width.

550
00:44:19,100 --> 00:44:22,550
And so here we have three by 224 by 224.

551
00:44:22,580 --> 00:44:25,370
Note that by default we have channels last.

552
00:44:25,680 --> 00:44:29,110
So by default, we have this first convention, channels last, right here.

553
00:44:29,120 --> 00:44:35,630
But if you want to take this convention, all you need to do is to specify that you're working with

554
00:44:35,630 --> 00:44:37,400
the channels first.

555
00:44:38,570 --> 00:44:41,170
Then from here we have the dilation rate.

556
00:44:41,180 --> 00:44:51,200
So we'll look at this GitHub repository by V2 where he uses animations to explain how convolutions work.

557
00:44:52,220 --> 00:44:59,060
You could scroll down and you'll find the dilated convolution animation which when we click on we have

558
00:44:59,060 --> 00:44:59,660
this.

559
00:44:59,660 --> 00:45:00,970
So there we go.

560
00:45:00,980 --> 00:45:06,500
We have this example in which the dilation rate is equal to two.

561
00:45:06,830 --> 00:45:16,070
So the way this dilation works is we have this kernel which initially had no spaces between its values.

562
00:45:16,070 --> 00:45:20,900
So we had something like this, we had this three by three kernel.

563
00:45:20,900 --> 00:45:24,260
And now what we do is we have these spaces,

564
00:45:24,260 --> 00:45:28,760
or these holes, between the different values.

565
00:45:28,760 --> 00:45:35,020
And so now what we get is some sort of five by five kernel right here.

566
00:45:35,030 --> 00:45:42,380
If we're working with a dilation rate r equal to three, then instead of a single hole here, we are

567
00:45:42,380 --> 00:45:45,080
going to fit in two holes.

568
00:45:45,080 --> 00:45:46,610
So we're going to have this.

569
00:45:46,610 --> 00:45:53,060
We just have one, two and then we fit this, then we have one, two and then we fit this.

570
00:45:53,360 --> 00:45:56,030
So that's how we build this out.

571
00:45:56,370 --> 00:45:59,090
We have this and then that.

572
00:45:59,240 --> 00:46:04,190
Now, to get the shape, we have one, two, three, four, five, six, seven.

573
00:46:04,190 --> 00:46:08,210
So we have a seven by seven filter.
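
The five by five and seven by seven sizes counted above follow a simple rule: a dilation rate r puts r - 1 holes between neighbouring kernel values. A minimal sketch, with `effective_kernel_size` as a helper name introduced here:

```python
def effective_kernel_size(kernel_size, dilation_rate):
    """Span covered by a dilated kernel along one dimension.

    There are kernel_size - 1 gaps between neighbouring values, and each gap
    holds dilation_rate - 1 holes, so the span is k + (k - 1) * (r - 1).
    """
    return kernel_size + (kernel_size - 1) * (dilation_rate - 1)


for rate in (1, 2, 3):
    print(rate, effective_kernel_size(3, rate))
# 1 3   (ordinary 3x3 kernel, no holes)
# 2 5   (one hole between values: covers 5x5)
# 3 7   (two holes between values: covers 7x7)
```

In every case the kernel still has only 3 by 3 = 9 weights per channel; only the area it covers grows.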

574
00:46:08,210 --> 00:46:16,220
Now it should be noted that dilated convolutions are used in problems where we want to keep increasing

575
00:46:16,220 --> 00:46:25,090
the receptive field as we keep going deeper in the network while maintaining the number of parameters.

576
00:46:25,100 --> 00:46:32,180
The next argument is the groups, and according to the documentation, this is a positive integer specifying

577
00:46:32,180 --> 00:46:36,620
the number of groups in which the input is split along the channel axis.

578
00:46:36,620 --> 00:46:42,230
This means that in this case, for example, we could break this up into three different groups.

579
00:46:42,230 --> 00:46:46,550
So we have one, two and then three groups.

580
00:46:46,550 --> 00:46:54,350
Then each group has its own group filters and then the output is a concatenation of all the group results

581
00:46:54,350 --> 00:46:56,240
across the channel axis.

582
00:46:57,080 --> 00:47:03,500
From here we have the activation, which we've seen already, and then we'll say whether we're going

583
00:47:03,500 --> 00:47:05,360
to use the bias or not.

584
00:47:05,600 --> 00:47:11,990
We have the kernel initializer, that is the weight initialization, and then we have the bias initializer.

585
00:47:11,990 --> 00:47:20,390
We have these regularizers, which we'll see subsequently, and then we have the kernel and bias constraints.
