1
00:00:00,270 --> 00:00:06,420
Hello, everyone, and welcome to this new and exciting session in which we shall be building our own

2
00:00:06,750 --> 00:00:08,190
model from scratch.

3
00:00:09,060 --> 00:00:16,620
In the previous section we saw how the transformer model, which previously was used for NLP

4
00:00:16,620 --> 00:00:22,950
tasks, could be used in computer vision with proper preparations.

5
00:00:23,880 --> 00:00:32,820
And so in this section we'll see how to convert or create patches of this image and then carry out those

6
00:00:32,830 --> 00:00:39,210
linear projections and pass this output from the linear projections into the transformer encoder to then

7
00:00:39,210 --> 00:00:46,680
create, or rather train, an end-to-end model, which learns how to say whether an input image is

8
00:00:46,680 --> 00:00:50,040
that of an angry person, a happy person, or sad.

9
00:00:50,190 --> 00:00:58,050
Given that we'll be working with 256 by 256 images, if we have to split up these images into 16 by 16

10
00:00:58,050 --> 00:01:03,660
patches, then we would have 256 different patches.

11
00:01:03,660 --> 00:01:12,590
Because if we have 256 here and then break this up into 16 by 16,

12
00:01:12,600 --> 00:01:21,210
so you have 16 pixels by 16 pixels, then you would have to do this 16 times horizontally

13
00:01:21,210 --> 00:01:26,640
and 16 times vertically to form a 256 by 256 image.

14
00:01:26,640 --> 00:01:33,330
And so from here you would have 256 different patches like this one.
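
The patch count above can be checked with a few lines of arithmetic. This is a minimal sketch; the variable names are illustrative and not taken from the session's notebook.

```python
# Patch-count arithmetic for a square image split into square patches.
image_size = 256   # image is 256 x 256 pixels
patch_size = 16    # each patch is 16 x 16 pixels

patches_per_side = image_size // patch_size   # patches along one axis
num_patches = patches_per_side ** 2           # patches in the whole image

print(patches_per_side, num_patches)
```

With these values there are 16 patches per side and 256 patches in total.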

15
00:01:34,050 --> 00:01:40,500
And so what we'll use to create these patches will be this extract_patches method right here.

16
00:01:40,530 --> 00:01:48,180
Now, extract_patches takes in the images, it takes in the sizes, the strides, and the rates,

17
00:01:48,180 --> 00:01:49,530
along with the padding.

18
00:01:49,530 --> 00:01:53,760
So let's look at how this works from here.

19
00:01:53,760 --> 00:02:00,690
We have this picture here or we have this output which will help us picture how this works.

20
00:02:00,690 --> 00:02:05,040
But before looking at that, let's look at these arguments.

21
00:02:05,040 --> 00:02:07,010
We have the images.

22
00:02:07,010 --> 00:02:08,730
It's a 4D tensor.

23
00:02:08,730 --> 00:02:15,150
We have the sizes, which specify the patch size.

24
00:02:15,150 --> 00:02:20,190
So in our case we have 16 by 16 patches, so our patch size will be 16.

25
00:02:20,190 --> 00:02:25,530
We also have the strides, which tell us how much we should shift while creating these patches.

26
00:02:25,530 --> 00:02:30,590
And then we have the rates, which we'll better understand with a figure.

27
00:02:30,600 --> 00:02:38,910
So let's now get back to this example here, and we see that we pass in the

28
00:02:38,910 --> 00:02:41,520
sizes of one, three, three, one.

29
00:02:41,520 --> 00:02:48,540
So this list, which we pass in here, comes from the documentation: we're told it must be a list where

30
00:02:48,540 --> 00:02:53,880
we have one, the row size, the column size, and one. In our case the row size and column size both equal

31
00:02:53,880 --> 00:02:57,280
16, so here we would have 16 by 16.

32
00:02:57,300 --> 00:03:04,410
Now, in the case where we have three, three, you could see here, let's have this here, you could

33
00:03:04,410 --> 00:03:06,360
see here, let's take this off.

34
00:03:06,360 --> 00:03:08,820
You could see that we have this three by three.

35
00:03:08,820 --> 00:03:15,180
So, in line with our example where we're working with 16 by 16 pixels, here they have three by three

36
00:03:15,180 --> 00:03:16,140
pixels.

37
00:03:16,140 --> 00:03:22,530
And then the next one you have the strides five by five.

38
00:03:22,560 --> 00:03:26,250
Now you will notice that let's scroll this way.

39
00:03:26,250 --> 00:03:34,710
You will notice that when this box or when we want to get the next patch, we shift or we go five steps

40
00:03:34,710 --> 00:03:41,250
to the right and then get this next position or this next patch right here.

41
00:03:41,250 --> 00:03:44,400
So you can see after shifting, we get to this.

42
00:03:44,400 --> 00:03:49,130
So originally we had this image here and then we're trying to get these different patches.

43
00:03:49,140 --> 00:03:55,500
Now the patches are what we have marked with stars, as we can see, and then again we shift five steps downward.

44
00:03:55,500 --> 00:03:58,680
We go downward again and then we get to this.

45
00:03:59,430 --> 00:04:03,990
You will see that if you can count this one, two, three, four, five steps.

46
00:04:03,990 --> 00:04:07,160
And then we get to this and then we shift again and so on and so forth.

47
00:04:07,170 --> 00:04:12,240
So that's how we obtain all four patches right here.
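
The walk-through above (a 3 by 3 patch, shifted five steps at a time over a 10 by 10 image) can be reproduced in plain Python. This is a hand-written stand-in that mimics the windowing, not the TensorFlow call itself; the function name `extract_patches_valid` is hypothetical, and it keeps only windows that fit fully inside the image.

```python
def extract_patches_valid(image, size, stride):
    """Slide a size x size window over a 2D list, moving `stride` steps each time.

    Only windows that fit entirely inside the image are kept (no padding).
    Returns a grid (list of rows) of patches.
    """
    height, width = len(image), len(image[0])
    grid = []
    for top in range(0, height - size + 1, stride):
        row_of_patches = []
        for left in range(0, width - size + 1, stride):
            patch = [row[left:left + size] for row in image[top:top + size]]
            row_of_patches.append(patch)
        grid.append(row_of_patches)
    return grid


# A 10 x 10 image whose pixels are numbered 1..100, row by row.
image = [[r * 10 + c + 1 for c in range(10)] for r in range(10)]

patches = extract_patches_valid(image, size=3, stride=5)
print(len(patches), len(patches[0]))  # 2 2, i.e. four patches
print(patches[0][0])                  # [[1, 2, 3], [11, 12, 13], [21, 22, 23]]
```

Setting `stride` equal to `size` would tile the image with no gaps and no overlap, which is the setting used later for the 16 by 16 patches.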

48
00:04:12,270 --> 00:04:18,530
Now the next thing we will look at will be the rates.

49
00:04:18,570 --> 00:04:27,060
Now, for the rates again, you have this one and this one, and then you specify these two rate values here.

50
00:04:27,060 --> 00:04:35,310
Now we'll understand these two rate values by looking at dilation, that is, by looking at a dilated

51
00:04:35,310 --> 00:04:36,960
convolution operation.

52
00:04:36,960 --> 00:04:44,550
We can see this visualization from this Medium post by Sik-Ho Tsang, where unlike with the usual convolutions,

53
00:04:44,550 --> 00:04:49,860
where when we have this filter which passes through the image, you see we have this three by three

54
00:04:49,860 --> 00:04:51,120
filter, which is compact.

55
00:04:51,120 --> 00:04:57,630
Here you see there's no spacing between each element of the filter, but with the dilated convolution,

56
00:04:57,630 --> 00:04:59,220
you see the spacing here.

57
00:04:59,430 --> 00:04:59,880
So

58
00:04:59,990 --> 00:05:06,020
this is kind of similar to what we have here. Let's get back to the documentation.

59
00:05:06,260 --> 00:05:17,600
As with this, if we decide to have patches where there is some spacing between the pixels, then we

60
00:05:17,600 --> 00:05:22,020
could specify how much space we want with this rate right here.

61
00:05:22,040 --> 00:05:28,490
Now the padding here is set to valid, and this means that no padding is added: only the patches that

62
00:05:28,490 --> 00:05:32,690
fit entirely inside the image are extracted.

63
00:05:32,990 --> 00:05:40,720
Now, that said, we could simply copy this and then paste it right here.

64
00:05:40,730 --> 00:05:45,140
So here we have this, and we'll call this patches.

65
00:05:45,380 --> 00:05:51,620
So we have these patches and then we'll pass in our test image right here.

66
00:05:52,130 --> 00:05:52,820
There we go.

67
00:05:52,820 --> 00:05:54,410
We have test image.

68
00:05:54,410 --> 00:05:55,340
That's fine.

69
00:05:55,340 --> 00:05:58,990
And then here we have 16 by 16.

70
00:05:59,000 --> 00:06:04,940
Now this is the patch size, so we would do well to put that in the configuration, so we could

71
00:06:04,940 --> 00:06:12,380
add that to the configuration, and the stride too is 16 because, let's get back here, the

72
00:06:12,380 --> 00:06:12,810
stride.

73
00:06:12,860 --> 00:06:15,100
Let's see why we need the stride to be 16.

74
00:06:15,110 --> 00:06:17,360
Now you see that we're interested.

75
00:06:17,360 --> 00:06:21,230
If we have this kind of image, we are not interested in skipping this part, so

76
00:06:21,230 --> 00:06:22,520
we do not want to skip any parts.

77
00:06:22,520 --> 00:06:23,810
We want to have this.

78
00:06:24,500 --> 00:06:26,810
Then we want to have this next.

79
00:06:27,500 --> 00:06:28,910
You see, this is what we want to do.

80
00:06:28,910 --> 00:06:30,710
Then we want to have this next.

81
00:06:31,700 --> 00:06:35,620
And this next one obviously would need padding here and here.

82
00:06:35,660 --> 00:06:41,620
So account for the space here and then we want to have this next and so on and so forth.

83
00:06:41,630 --> 00:06:47,840
So because we want this, we'll make this value the same as this, so that when we have this

84
00:06:47,840 --> 00:06:52,920
three by three patch, we skip three steps and move to this next and so on and so forth.

85
00:06:52,940 --> 00:06:59,570
So that said, if you want to have this kind of compact patch extraction, then you want to have this

86
00:06:59,570 --> 00:07:03,680
the value here for the stride the same as the size.

87
00:07:03,680 --> 00:07:11,870
So with that, let's now use the configuration patch size; we have that, the 16 by 16, that's fine.

88
00:07:11,870 --> 00:07:14,060
And then now let's run this here.

89
00:07:14,060 --> 00:07:17,420
Let's also run this here, this test image.

90
00:07:17,420 --> 00:07:18,500
Then that's fine.

91
00:07:18,500 --> 00:07:20,720
We run this, we get this error.

92
00:07:20,750 --> 00:07:23,210
Let's check it out: it must be four-dimensional.

93
00:07:23,210 --> 00:07:28,220
Okay, so what we'll do is, instead of taking this test image directly, we will use expand_dims.

94
00:07:28,220 --> 00:07:36,050
So we create this extra dimension, and we add it at axis zero.

95
00:07:36,050 --> 00:07:39,140
So we add this extra dimension, we run that again.

96
00:07:39,140 --> 00:07:40,700
You see, we have our patches.

97
00:07:40,730 --> 00:07:47,450
Now look at what we're going to have when we print out the patches' shape, patches.shape. What

98
00:07:47,450 --> 00:07:48,290
do we get?

99
00:07:49,160 --> 00:07:52,460
See, we have this 16 by 16 by 768.

100
00:07:52,490 --> 00:07:54,350
Now let's explain why we have this.

101
00:07:54,350 --> 00:08:00,530
Now recall that we have a 256 by 256 by three image, meaning that we're going to have three channels

102
00:08:00,530 --> 00:08:07,240
like this, 256 by 256, then by three.

103
00:08:07,250 --> 00:08:10,730
Now, since we are dealing with patches, I think this is preferable.

104
00:08:10,730 --> 00:08:12,590
We should take this at once.

105
00:08:12,590 --> 00:08:17,900
So this is 16 by 16 patch, 16 by 16 patch.

106
00:08:19,010 --> 00:08:21,230
Yeah, we have this 16 by 16 patch.

107
00:08:21,740 --> 00:08:26,200
And then, yeah, again we have this 16 by 16 patch right here.

108
00:08:26,210 --> 00:08:29,210
So now we have these three 16 by 16 patches.

109
00:08:30,080 --> 00:08:37,430
And then given that each and every one of those patches is 16 by 16 and 16 by 16 is 256, you'll see

110
00:08:37,430 --> 00:08:47,450
that if we pick a given patch here, if we pick a 16 by 16 patch, then this third dimension

111
00:08:47,450 --> 00:08:56,120
will be 768 simply because for each patch we have 256 pixels per channel.

112
00:08:56,120 --> 00:09:01,730
So here we have 16 by 16, that is 256, pixels which make up this patch.

113
00:09:02,530 --> 00:09:07,360
Now this other one, 256, and this one, 256.

114
00:09:07,360 --> 00:09:14,440
And now if you sum this up, it gives you 768, and that's how this value is gotten.

115
00:09:15,610 --> 00:09:19,110
Now, to plot this out, we are going to go through each and every patch.

116
00:09:19,120 --> 00:09:24,610
So we'll take this both vertically and horizontally.

117
00:09:24,610 --> 00:09:34,030
We create the subplots 16 by 16 because we will have 16 by 16 different subplots.

118
00:09:34,030 --> 00:09:41,350
And then for each subplot we show an image: here we pick i, j, that is, we pick a given patch out of these

119
00:09:41,350 --> 00:09:42,460
256 patches.

120
00:09:42,460 --> 00:09:47,710
We pick that patch and then we're going to reshape this because when you pick this patch, you're left

121
00:09:47,710 --> 00:09:50,380
with this 768 here.

122
00:09:50,380 --> 00:09:58,090
When you pick the patch, you're left with these pixels, which have all been flattened out to this 768

123
00:09:59,230 --> 00:10:02,230
dimensional vector that we calculated here.

124
00:10:02,230 --> 00:10:10,240
And so now what we need to do is reshape this back into a 16 by 16 by three patch; obviously 16 by

125
00:10:10,240 --> 00:10:10,900
16 by three

126
00:10:10,900 --> 00:10:12,640
will give you 768.

127
00:10:12,760 --> 00:10:13,670
And so that's it.

128
00:10:13,690 --> 00:10:14,830
Let's run this now.

129
00:10:14,830 --> 00:10:19,840
You could always increase this figure size or reduce it, depending on how you want to view this.

130
00:10:19,840 --> 00:10:21,220
So run this.

131
00:10:21,220 --> 00:10:22,840
This is what we get as output.

132
00:10:23,740 --> 00:10:24,520
There we go.

133
00:10:24,520 --> 00:10:27,640
We see this image which initially was compact.

134
00:10:27,670 --> 00:10:31,140
Now we've broken this up into several patches.

135
00:10:31,150 --> 00:10:39,070
Now, before we go on to create our patch encoder, which is this whole section right here, what we'll

136
00:10:39,070 --> 00:10:42,990
have to do is convert or reshape these patches here.

137
00:10:43,000 --> 00:10:51,370
So recall that in this paper, after doing this here, that is, after creating these patches, we have to

138
00:10:51,370 --> 00:10:56,650
take each and every one of them and this will be considered as one element of the sequence.

139
00:10:56,650 --> 00:11:02,650
So here we have nine patches and here we have nine values which have been passed, or nine different

140
00:11:02,650 --> 00:11:04,900
inputs or vectors which will be passed here.

141
00:11:04,900 --> 00:11:13,930
And so if we get back here, since we have 256 of these, then we'll need to reshape this such that each

142
00:11:13,930 --> 00:11:19,090
and every one of these will be considered as part of our whole sequence.

143
00:11:19,300 --> 00:11:25,510
So here, let's reshape that.

144
00:11:25,510 --> 00:11:36,010
So we have patches equals reshape, and we pass in the patches, and then the new shape, which starts with

145
00:11:36,010 --> 00:11:43,630
the batch dimension, patches.shape[0], then the next one with negative one, and then here we'll

146
00:11:43,630 --> 00:11:45,910
have 768.

147
00:11:45,940 --> 00:11:49,930
Now, we could either put negative one or explicitly write 256.

148
00:11:49,930 --> 00:11:58,100
So this means that we have decided to put this now into a tensor with a sequence length of 256.

149
00:11:58,120 --> 00:12:04,040
Now with this let's print out the patches' shape and see what that gives us.

150
00:12:04,070 --> 00:12:05,630
So here we have 256.

151
00:12:05,650 --> 00:12:12,100
Now if we put negative one here, it's going to automatically give us this 256 and that's it.

152
00:12:12,100 --> 00:12:18,190
So we have this. Oh, let's rerun this so we see the difference.

153
00:12:18,190 --> 00:12:21,940
So you see here 16 by 16, and then 256, and that's it.

154
00:12:21,940 --> 00:12:27,590
So each and every one of those 256 now will be passed into our transformer.
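
The shape bookkeeping described above, including what the negative one resolves to, can be sketched in plain Python. The helper `resolve_shape` is hypothetical and only imitates how a reshape infers a single missing dimension; it is not a TensorFlow function.

```python
from math import prod


def resolve_shape(old_shape, new_shape):
    """Fill in a single -1 in new_shape so both shapes hold the same number of elements."""
    total = prod(old_shape)
    known = prod(d for d in new_shape if d != -1)
    return tuple(total // known if d == -1 else d for d in new_shape)


# Each flattened patch holds 16 x 16 pixels for each of the 3 channels.
patch_dim = 16 * 16 * 3

# Shape after patch extraction: (batch, patch rows, patch columns, flattened patch).
extracted = (1, 16, 16, patch_dim)

# Collapse the 16 x 16 grid of patches into one sequence axis.
sequence = resolve_shape(extracted, (extracted[0], -1, patch_dim))
print(patch_dim, sequence)  # 768 (1, 256, 768)
```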

155
00:12:27,610 --> 00:12:31,600
Now, we could modify the plotting of the image.

156
00:12:31,600 --> 00:12:39,820
So let's have that, and then we'll say we have patches.shape[1].

157
00:12:39,820 --> 00:12:47,620
So we take 256, and then here we have 16 by 16, and then we have simply i plus one.

158
00:12:48,100 --> 00:12:54,610
Okay, so we have that. We could take out this k now, as we do not need it anymore. Take that off, and then

159
00:12:54,610 --> 00:13:01,570
here we have this i, which we now reshape into this, and this should be fine.

160
00:13:02,080 --> 00:13:04,330
So there we go, let's run this again.

161
00:13:07,230 --> 00:13:07,830
And that's it.

162
00:13:07,830 --> 00:13:14,150
We get the exact same output, which is normal since we just restructured the inputs.

163
00:13:14,160 --> 00:13:20,910
So with this we see that we are now set to create our patch encoder layer, and this patch encoder layer

164
00:13:20,910 --> 00:13:24,240
will be similar to the kinds of layers which we've been creating.

165
00:13:24,240 --> 00:13:29,070
So we could just copy this here, which is that and then paste it out here.

166
00:13:29,070 --> 00:13:32,730
So we'll call this our patch encoder layer.

167
00:13:33,150 --> 00:13:34,140
There we go.

168
00:13:34,140 --> 00:13:39,270
We have PatchEncoder, and here, patch encoder.

169
00:13:39,270 --> 00:13:46,050
So this patch encoder will be responsible for, first of all, converting our image into patches and

170
00:13:46,050 --> 00:13:54,480
then carrying out the projection and adding the positional encoding. Patch encoder.

171
00:13:54,570 --> 00:13:55,260
That's it.

172
00:13:56,490 --> 00:14:05,190
And now here we have this linear projection, which we'll create here. So let's call this linear

173
00:14:05,190 --> 00:14:05,970
projection.

174
00:14:06,360 --> 00:14:08,190
And we could do this with our dense layer.

175
00:14:08,190 --> 00:14:10,620
So we have this dense layer.

176
00:14:11,040 --> 00:14:12,240
Let's take that off.

177
00:14:12,900 --> 00:14:17,790
And then we could select our hidden size dimension or embedding dimension.

178
00:14:17,820 --> 00:14:26,160
Now, since our input is already, let's write this out here, we have an input of, let's take off the

179
00:14:26,310 --> 00:14:36,350
batch dimension, we have inputs of 256 by 768 already.

180
00:14:36,360 --> 00:14:40,410
So this is our hidden size dimension for now.

181
00:14:40,410 --> 00:14:44,120
But what we could do is we could convert this into 512.

182
00:14:44,130 --> 00:14:46,020
So let's do just that.

183
00:14:46,470 --> 00:14:49,880
Or no, let's take 1024.

184
00:14:49,890 --> 00:14:51,510
Let's make the model bigger.

185
00:14:51,510 --> 00:14:55,080
But this is too many parameters already, and it would take much time to train.

186
00:14:55,080 --> 00:14:57,000
So let's stay with this.

187
00:14:57,360 --> 00:15:02,010
So we stick with that and let's get back to the code.

188
00:15:02,010 --> 00:15:07,880
We're going to specify our embedding dimension or, let's say, hidden size, as in the paper.

189
00:15:07,890 --> 00:15:11,670
So our hidden size, we specify the hidden size and that's it.

190
00:15:11,670 --> 00:15:15,120
We're going to pass this in here; we have hidden size.

191
00:15:15,120 --> 00:15:16,080
There we go.

192
00:15:16,080 --> 00:15:18,840
And we do not need this batch.

193
00:15:18,840 --> 00:15:21,270
Now we have the linear projection.

194
00:15:21,270 --> 00:15:22,920
We have this hidden size.

195
00:15:23,220 --> 00:15:24,690
Let's leave it this way for now.

196
00:15:24,690 --> 00:15:26,730
So we have this call method.

197
00:15:26,730 --> 00:15:29,880
We should take this input and that will be it.

198
00:15:29,880 --> 00:15:33,600
We'll have this calculation which we saw already here.

199
00:15:33,600 --> 00:15:35,640
So we'll have this patches.

200
00:15:35,700 --> 00:15:37,830
Let's copy this from here.

201
00:15:37,830 --> 00:15:41,940
Okay, so we get the input, we get this input.

202
00:15:41,970 --> 00:15:43,370
Now, with this input we'll be getting here,

203
00:15:43,590 --> 00:15:49,530
we wouldn't need to do this expand_dims, because when training we already have the batch dimension.

204
00:15:49,530 --> 00:15:52,110
So we'll just have x.

205
00:15:52,110 --> 00:15:53,700
Let's have that as x.

206
00:15:54,000 --> 00:15:55,920
So we have that x and that's fine.

207
00:15:55,920 --> 00:15:57,690
Then we'll do the reshaping.

208
00:15:57,720 --> 00:15:58,380
That's it.

209
00:15:58,380 --> 00:16:07,410
And then from this reshaping we'll have our output output will be this patch, which has been projected

210
00:16:07,410 --> 00:16:09,510
to this new dimension.

211
00:16:09,510 --> 00:16:19,920
And so you would have output equals self.linear_projection, which takes in the patches.

212
00:16:20,820 --> 00:16:26,370
Then to make this value dynamic, we're going to have let's take this off, we're going to have the

213
00:16:26,370 --> 00:16:29,490
patches.shape[-1].

214
00:16:29,490 --> 00:16:36,090
So this value we have here is the last dimension of our patches.

215
00:16:36,090 --> 00:16:39,510
So we have that and we get that last dimension.

216
00:16:39,600 --> 00:16:47,790
Now that we have this set, we are ready to add these positional embeddings we have here.

217
00:16:47,790 --> 00:16:54,930
So we will add these different positional embeddings here onto our linearly projected patches.

218
00:16:54,930 --> 00:17:01,400
We should also note that we are not going to take into consideration the class embedding, which is mentioned

219
00:17:01,400 --> 00:17:02,250
in the paper.

220
00:17:02,970 --> 00:17:05,790
As in practice, this is not really important.

221
00:17:05,790 --> 00:17:13,260
We have here Lucas Beyer from the Google Brain team, who says the main aim of including this was to

222
00:17:13,260 --> 00:17:17,580
reproduce the exact transformer network but on image patches.

223
00:17:17,580 --> 00:17:21,870
And so in practice we are going to stick to these linear projections.

224
00:17:21,870 --> 00:17:26,790
The positional embedding will now be constructed using the TensorFlow Embedding layer.

225
00:17:26,790 --> 00:17:32,300
So here we have positional embedding as described in the paper embedding or encoding.

226
00:17:32,310 --> 00:17:34,860
And then here we have the embedding layer.

227
00:17:34,890 --> 00:17:38,610
Now let's check out how this embedding layer is constructed.

228
00:17:38,610 --> 00:17:46,350
Or, first of all, what it is about. This embedding layer, as described, tf.keras.layers.Embedding, turns

229
00:17:46,350 --> 00:17:51,390
positive integers into dense vectors of fixed size.

230
00:17:51,870 --> 00:17:58,350
Now, getting back to the paper at this point, you see we have this linear projections here we have

231
00:17:58,350 --> 00:18:05,370
these linear projections, which need to be added to the position encoding, which is this one.

232
00:18:05,790 --> 00:18:10,950
Now, for the linear projection we're going to get from here, if we include a batch dimension,

233
00:18:10,950 --> 00:18:15,570
we will have an output of batch size by number of patches.

234
00:18:15,570 --> 00:18:21,660
So let's call this NP number of patches because obviously we have these patches here and each patch

235
00:18:21,690 --> 00:18:23,190
is going to be a vector.

236
00:18:23,190 --> 00:18:32,340
So it's a number of patches by that last dimension, which in this case is 768.

237
00:18:32,400 --> 00:18:34,680
So you could always fix this hidden dimension.

238
00:18:34,680 --> 00:18:41,220
This is the hidden dimension; let's say batch by NP by hidden dimension.

239
00:18:41,640 --> 00:18:45,000
And then here, let's take this off.

240
00:18:46,040 --> 00:18:47,480
Let's take a batch of one.

241
00:18:47,930 --> 00:18:52,390
So if we take a batch of one, we would have one for the batch dimension.

242
00:18:52,400 --> 00:18:56,870
Then the number of patches we would have will be 256.

243
00:18:56,870 --> 00:19:04,370
Because if we have a 256 by 256 image and we break this up into 16 by 16 pixel patches, then we'd

244
00:19:04,370 --> 00:19:07,850
have 256 different patches.

245
00:19:08,030 --> 00:19:10,460
So here's what we have for this one.

246
00:19:10,460 --> 00:19:17,990
And now with the embedding layer, what we'll be able to get will be another tensor which is like this

247
00:19:17,990 --> 00:19:20,270
in the output of that embedding layer.

248
00:19:20,420 --> 00:19:25,370
And so as we can see here, it takes these indices and then turns them into the dense vectors, which

249
00:19:25,370 --> 00:19:26,390
we are interested in.

250
00:19:26,690 --> 00:19:33,860
Now, the arguments we have here are the input dimension, output dimension, initializer, regularizer, as we

251
00:19:33,860 --> 00:19:34,760
can see here.

252
00:19:34,940 --> 00:19:41,840
And then we can see this example where it converts, for example, this number four into this 2D vector,

253
00:19:41,840 --> 00:19:46,010
this index 20 into this other 2D vector.

254
00:19:46,160 --> 00:19:52,390
And then from this example, which we could copy out and test in our code, let's test this out.

255
00:19:52,400 --> 00:19:53,930
Let's add this here.

256
00:19:54,380 --> 00:19:56,840
So we have this, which we could test out.

257
00:19:56,840 --> 00:19:59,490
So you see better how it works.

258
00:19:59,510 --> 00:20:07,400
You see, the model takes as input this integer matrix of size batch by input length, and then gives

259
00:20:07,400 --> 00:20:08,660
this other output.

260
00:20:09,020 --> 00:20:16,250
Now, in our case, since we're working with image data, we do not really have this vocabulary we should

261
00:20:16,250 --> 00:20:17,210
talk about here.

262
00:20:18,020 --> 00:20:21,980
And so we replace this vocabulary size by the number of patches.

263
00:20:23,060 --> 00:20:29,930
Now, if this were natural language processing, then this would be the total number of words that

264
00:20:30,170 --> 00:20:31,960
our model can treat.

265
00:20:31,970 --> 00:20:36,860
And so we could have a vocabulary of, say, 300,000 words, and that's what we would pass

266
00:20:36,860 --> 00:20:38,390
in as this input dimension.

267
00:20:38,540 --> 00:20:41,960
But as we said here, we're going to take this to be our number of patches.

268
00:20:41,990 --> 00:20:47,060
Now, the output dimension is, as they say, the dimension of the dense embedding.

269
00:20:48,200 --> 00:20:53,480
So in this case, they put in the value of 64, which we'll change to 768 shortly.

270
00:20:53,960 --> 00:20:58,400
Now, that said, we will have this model which is defined.

271
00:20:58,430 --> 00:20:59,470
Take that off.

272
00:20:59,480 --> 00:21:01,130
We have this input.

273
00:21:02,390 --> 00:21:05,440
You see, the size of this input is 32 by ten.

274
00:21:05,450 --> 00:21:08,990
There's a batch dimension and here are the different indices.

275
00:21:08,990 --> 00:21:15,080
So we have this input and then we're going to print out the shape from here.

276
00:21:15,080 --> 00:21:16,370
So let's run this.

277
00:21:17,120 --> 00:21:17,730
There we go.

278
00:21:17,750 --> 00:21:22,820
You see, it takes in this and then outputs this one now as a vector.

279
00:21:22,820 --> 00:21:27,110
So basically what we have is we could have let's take this off.

280
00:21:27,440 --> 00:21:32,720
What we have is we have some indices.

281
00:21:33,170 --> 00:21:34,300
Let's take this off.

282
00:21:34,310 --> 00:21:35,680
We have some indices.

283
00:21:35,690 --> 00:21:38,510
Let's say we have like in this case, they have ten indices.

284
00:21:38,510 --> 00:21:46,610
So we have this, this, this this is seven, eight, nine, ten.

285
00:21:46,850 --> 00:21:47,380
Okay.

286
00:21:47,390 --> 00:21:53,060
So here we have these ten indices, and we ignore the batch dimension.

287
00:21:53,060 --> 00:21:55,070
So we just have these ten indices here.

288
00:21:55,070 --> 00:22:03,950
So these ten indices could now be projected into a two-dimensional version, where this one will be

289
00:22:03,950 --> 00:22:05,540
represented by a vector.

290
00:22:05,540 --> 00:22:06,770
So just like the patches.

291
00:22:06,770 --> 00:22:10,850
So you have a patch represented by this vector, this one represented by this vector.

292
00:22:10,850 --> 00:22:12,810
So this is no longer represented by an index.

293
00:22:12,810 --> 00:22:17,240
Let's say, because here these random values take values between zero and 1000,

294
00:22:17,240 --> 00:22:20,690
it's no longer a value like, say, 900.

295
00:22:20,690 --> 00:22:26,030
So what was maybe 900 will now be converted into a vector.

296
00:22:26,030 --> 00:22:34,010
And the size of this vector we see here will depend on this value we set here.

297
00:22:34,010 --> 00:22:40,550
So you see, this means that this one here now becomes a 64-dimensional vector.

298
00:22:40,550 --> 00:22:46,070
And so each and every one of these becomes a 64-dimensional vector.

299
00:22:46,070 --> 00:22:54,110
And that's why we go from this ten to now ten by 64, which you could see here if we ignore the batch

300
00:22:54,110 --> 00:22:54,950
dimension.
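
The index-to-vector behaviour described above can be imitated with a plain lookup table. This is a sketch of the idea only, not the Keras layer: the helper names and the random initial values are assumptions, and a real embedding layer would learn its table during training.

```python
import random


def make_embedding_table(input_dim, output_dim, seed=0):
    """Build a table with one row (vector) of length output_dim per possible index."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
            for _ in range(input_dim)]


def embed(table, indices):
    """Replace every integer index by its row in the table."""
    return [table[i] for i in indices]


# 1000 possible indices, each mapped to a 64-dimensional vector.
table = make_embedding_table(input_dim=1000, output_dim=64)

# Ten indices (batch dimension ignored), e.g. one of them is 900.
indices = [4, 20, 900, 7, 8, 9, 10, 1, 2, 3]
vectors = embed(table, indices)

print(len(vectors), len(vectors[0]))  # 10 64
```

For the patch case, `input_dim` would be the number of patches (256) and `output_dim` the hidden size, so that the looked-up vectors have the same shape as the projected patches.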

301
00:22:55,190 --> 00:23:00,290
So what we'll be doing in our case now, let's take this off.

302
00:23:00,290 --> 00:23:09,710
What we do in our case is we have this 768 and then here we have 256.

303
00:23:09,710 --> 00:23:18,800
And then the size of this input, which is going to be some random input, we take the size to be 32

304
00:23:18,800 --> 00:23:22,460
by 256 since we have 256 patches.

305
00:23:22,460 --> 00:23:24,380
So we have that 256.

306
00:23:24,380 --> 00:23:24,950
Let's take this.

307
00:23:24,950 --> 00:23:26,300
We could take this off.

308
00:23:26,330 --> 00:23:32,000
We could check out the reason why we need the input length in the documentation, but for now we don't

309
00:23:32,000 --> 00:23:32,480
need that.

310
00:23:32,480 --> 00:23:42,850
So we have 32 by 256, and then we expect to have an output which is 32 by 256, now by 768.

311
00:23:42,860 --> 00:23:44,360
Let's modify this.

312
00:23:44,360 --> 00:23:45,440
Let's take one.

313
00:23:46,130 --> 00:23:47,210
A batch size of one.

314
00:23:47,210 --> 00:23:54,740
So let's run this, and we have one by 256 by 768, and you see that it would match now with this

315
00:23:54,770 --> 00:23:55,490
here.

316
00:23:55,670 --> 00:23:57,410
With this here,

317
00:23:59,270 --> 00:24:03,140
from the linear projections of the inputs.

318
00:24:04,040 --> 00:24:07,200
Now, that said, let's simply copy this.

319
00:24:07,220 --> 00:24:12,290
Here we have this layer, which we're going to copy and add here.

320
00:24:12,290 --> 00:24:16,550
So instead of this embeddings, we have this embeddings, there we go.

321
00:24:16,550 --> 00:24:19,180
And then here we'll pass in the number of patches.

322
00:24:19,190 --> 00:24:25,890
So here we have a number of patches and then this one is this hidden size.

323
00:24:25,910 --> 00:24:30,460
So here again we have our hidden size and that's it.

324
00:24:30,470 --> 00:24:40,040
So now we have the hidden size we get here and then instead of having this or we have this plus the

325
00:24:40,040 --> 00:24:41,120
positional encoding.

326
00:24:41,270 --> 00:24:45,470
So we have self dot positional embedding.

327
00:24:45,470 --> 00:24:46,880
We could call that embedding.

328
00:24:47,120 --> 00:24:54,140
And then what this will take in will be an input of length, number of patches.

329
00:24:54,470 --> 00:25:04,790
So we get back here and then we would have our embedding inputs, for which we could use tf.range,

330
00:25:04,790 --> 00:25:07,280
and then we have the start, we could start from zero.

331
00:25:07,550 --> 00:25:08,570
These are the indices.

332
00:25:08,570 --> 00:25:12,050
So we have this and then we have the limit.

333
00:25:12,350 --> 00:25:20,780
Our limit is going to be the number of patches because we want to get these indices going from the

334
00:25:21,530 --> 00:25:30,320
zero value to some end value, such that the length of this tensor we're going to create is going to

335
00:25:30,320 --> 00:25:32,780
be equal to the number of patches.

336
00:25:32,780 --> 00:25:39,380
So our limit here is the number of patches. In that sense, we could have this here.

337
00:25:39,380 --> 00:25:47,210
So we already have the number of patches from here, so we could set self dot number of patches.

338
00:25:48,020 --> 00:25:49,070
There we go.

339
00:25:49,070 --> 00:25:53,690
We have the number of patches.

340
00:25:53,720 --> 00:25:56,330
Okay, so we have our number of patches.

341
00:25:56,630 --> 00:26:00,170
That looks fine, and then we could make use of that here.

342
00:26:00,830 --> 00:26:09,130
So we have that number of patches and then we are going to set our delta.

343
00:26:09,140 --> 00:26:12,670
So we have delta, which we take to be one.

344
00:26:12,680 --> 00:26:14,750
Now obviously, if we set delta to be two,

345
00:26:14,750 --> 00:26:16,820
it means you're going to go from zero to the number of patches

346
00:26:16,820 --> 00:26:16,970
while

347
00:26:16,970 --> 00:26:19,070
skipping in steps of two, and that's not what we want.

348
00:26:19,070 --> 00:26:25,940
So we want exactly number-of-patches elements in these embedding inputs.

349
00:26:25,940 --> 00:26:32,870
So with this, we could now pass in the embedding inputs, and that should be fine.
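
[Editor's sketch] The effect of delta can be checked without TensorFlow, because tf.range follows the same start / limit / delta convention as Python's built-in range (start / stop / step). The variable names below are illustrative and are not taken from the notebook.

```python
# Plain-Python stand-in for tf.range(start=0, limit=n_patches, delta=...).
n_patches = 256

# delta = 1: one index per patch, 0 .. 255.
positions = list(range(0, n_patches, 1))

# delta = 2: every second index, so only half as many entries.
skipped = list(range(0, n_patches, 2))

print(len(positions), positions[0], positions[-1])
print(len(skipped), skipped[0], skipped[-1])
```

With delta set to one, the index tensor has one entry per patch, which is what the positional embedding needs.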

350
00:26:33,350 --> 00:26:39,100
So we have this here, everything looks fine and we could take this off.

351
00:26:39,110 --> 00:26:40,610
So now we return.

352
00:26:40,610 --> 00:26:43,220
We could return our output simply.

353
00:26:43,580 --> 00:26:49,640
Now, it should be noted that this embedding layer here is similar to the dense layer, but with a dense

354
00:26:49,640 --> 00:26:50,090
layer.

355
00:26:50,090 --> 00:26:53,180
When you have an input x, you see with a dense layer.

356
00:26:53,180 --> 00:26:57,800
When you have an input, let's write it here for the dense layer when you have an input x.

357
00:26:59,010 --> 00:27:03,910
That input X is multiplied by the weights and you add the bias.

358
00:27:03,930 --> 00:27:05,630
So this is how we get the output.

359
00:27:05,640 --> 00:27:09,150
But with this embedding layer, it is a simple matrix multiplication.

360
00:27:09,150 --> 00:27:15,930
So when you take an input x, you multiply it by the weights, with no bias added, and you get the output.
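
[Editor's sketch] One way to read "a simple matrix multiplication" for an embedding layer: looking up row i of the weight table gives the same result as multiplying a one-hot vector for i by that table, with no bias term. The tiny weight table and the helper matmul below are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical embedding table: 3 positions, hidden size 2.
weights = [[0.1, 0.2],
           [0.3, 0.4],
           [0.5, 0.6]]

index = 2
one_hot = [[1.0 if i == index else 0.0 for i in range(len(weights))]]

via_matmul = matmul(one_hot, weights)[0]  # one-hot row times the table
via_lookup = weights[index]               # what an embedding lookup returns

print(via_matmul, via_lookup)
```

A dense layer would additionally add a bias vector to this product.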

361
00:27:17,140 --> 00:27:17,770
With that.

362
00:27:17,770 --> 00:27:21,430
Now let's run the cell here and then we move ahead.

363
00:27:21,430 --> 00:27:26,410
So we have this run, and we could delete these two cells now.

364
00:27:26,690 --> 00:27:30,430
Now, to test this out, we could define patch enc.

365
00:27:30,430 --> 00:27:38,170
We have our PatchEncoder, which we've just created here.

366
00:27:38,230 --> 00:27:43,030
So, PatchEncoder, and this takes in the number of patches and hidden size.

367
00:27:43,030 --> 00:27:48,370
So let's specify those: 256 and 768.

368
00:27:48,370 --> 00:27:54,610
So we have this, and now we could take the patch encoder, pass in an image, and see what we get

369
00:27:54,610 --> 00:27:56,770
as output. Here's the image.

370
00:27:56,770 --> 00:27:59,430
We have these zeros.

371
00:27:59,440 --> 00:28:06,160
There we go: 1 by 256 by 256 by 3.

372
00:28:06,160 --> 00:28:14,290
So now when we pass an input image, we expect to get some output of shape, batch size by number of

373
00:28:14,290 --> 00:28:20,770
patches by the hidden size or the number of hidden units.
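
[Editor's sketch] A minimal version of the patch encoder described so far, assuming 16 by 16 patches, a Dense layer for the linear projection, and an Embedding layer for the positional embedding. The attribute names and the reshape step are assumptions, not the notebook's exact code.

```python
import tensorflow as tf
from tensorflow.keras.layers import Dense, Embedding, Layer

PATCH_SIZE = 16  # assumed patch side, giving 16 x 16 = 256 patches per 256 x 256 image


class PatchEncoder(Layer):
    def __init__(self, n_patches, hidden_size, name="patch_encoder"):
        super().__init__(name=name)
        self.n_patches = n_patches
        self.linear_projection = Dense(hidden_size)
        self.positional_embedding = Embedding(n_patches, hidden_size)

    def call(self, x):
        # Cut the image into non-overlapping PATCH_SIZE x PATCH_SIZE patches.
        patches = tf.image.extract_patches(
            images=x,
            sizes=[1, PATCH_SIZE, PATCH_SIZE, 1],
            strides=[1, PATCH_SIZE, PATCH_SIZE, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        # (batch, rows, cols, patch_dim) -> (batch, n_patches, patch_dim)
        batch = tf.shape(patches)[0]
        patches = tf.reshape(patches, [batch, self.n_patches, patches.shape[-1]])
        # One index per patch: 0, 1, ..., n_patches - 1.
        positions = tf.range(start=0, limit=self.n_patches, delta=1)
        # The (n_patches, hidden) embedding broadcasts over the batch axis.
        return self.linear_projection(patches) + self.positional_embedding(positions)


patch_enc = PatchEncoder(256, 768)
out = patch_enc(tf.zeros([1, 256, 256, 3]))
print(out.shape)  # batch size by number of patches by hidden size
```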

374
00:28:21,040 --> 00:28:27,580
And now that we've built this up to the point where we have our embedded patches, let's go ahead

375
00:28:27,580 --> 00:28:30,880
and build this transformer encoder right here.

376
00:28:32,080 --> 00:28:38,320
As you could see, we start with the layer normalization, multi head attention, and then we add this

377
00:28:38,320 --> 00:28:40,000
input to this output.

378
00:28:40,000 --> 00:28:45,100
We have the layer normalization again, then the multi layer perceptron.

379
00:28:45,940 --> 00:28:49,570
And again we have this addition and then we get the output.

380
00:28:50,560 --> 00:28:57,760
Now, to build the code for our transformer encoder, we're going to paste this patch encoder here.

381
00:28:57,790 --> 00:29:00,220
We remove this, we have transformer.

382
00:29:00,700 --> 00:29:02,530
Uh, there we go.

383
00:29:02,530 --> 00:29:10,810
And then this transformer is made up of a first norm layer and then a second norm layer. Let's have

384
00:29:10,810 --> 00:29:21,490
this layer, layer norm one. It's our first norm layer: LayerNormalization.

385
00:29:21,490 --> 00:29:22,510
There we go.

386
00:29:23,080 --> 00:29:23,980
That's fine.

387
00:29:25,060 --> 00:29:29,020
And then we have our second layer normalization layer.

388
00:29:29,020 --> 00:29:31,240
So we'll call this layer norm two.

389
00:29:31,540 --> 00:29:34,090
Now we've defined our layer normalization layers.

390
00:29:34,090 --> 00:29:37,450
We could go ahead to define our multi head attention layer.

391
00:29:37,450 --> 00:29:45,970
So we have multi-head attention, and there we go. Again, here's MultiHeadAttention.

392
00:29:45,970 --> 00:29:48,640
You could check out the documentation for this.

393
00:29:48,640 --> 00:29:54,220
So here in the documentation you have the different arguments which you could pass in the type of values

394
00:29:54,220 --> 00:29:56,470
which the multi head attention takes in.

395
00:29:56,470 --> 00:29:59,860
So here we will specify the number of heads.

396
00:29:59,860 --> 00:30:06,220
And the key dimension. Here, we take this key dimension to be the same as the hidden size or the number of hidden

397
00:30:06,220 --> 00:30:11,080
units which in our case we fix at 768.

398
00:30:11,080 --> 00:30:13,870
We could also have the value dimension, dropout,

399
00:30:13,870 --> 00:30:15,270
and these other arguments.

400
00:30:15,280 --> 00:30:24,130
Now, what this takes in, as you would see here, is a query, value, key, and attention mask, which we would

401
00:30:24,130 --> 00:30:24,730
not use here.

402
00:30:24,730 --> 00:30:25,470
Return attention

403
00:30:25,480 --> 00:30:26,830
scores: we will not use this.

404
00:30:26,830 --> 00:30:34,360
This training argument, we will not specify either. But what we will specify will be this query and this value.

405
00:30:34,360 --> 00:30:40,210
And then since we're not going to specify the key, it's going to consider that the key is the, or rather,

406
00:30:40,210 --> 00:30:42,910
it's going to consider that this value here is the key.

407
00:30:42,910 --> 00:30:47,890
So it would be the same value. That said, here we would have the number of heads.

408
00:30:48,250 --> 00:30:53,770
Let's take this off. Multi-head attention, number of heads.

409
00:30:53,770 --> 00:30:58,630
And then we would also need the hidden size here.

410
00:30:58,630 --> 00:31:01,630
So we already have the hidden size, so we will leave it that way.

411
00:31:01,720 --> 00:31:04,600
We will replace the number of patches by the number of heads.

412
00:31:04,900 --> 00:31:05,290
Okay.

413
00:31:05,290 --> 00:31:11,710
So we have our multi-head attention, number of heads, hidden size, and then the next will be the

414
00:31:11,710 --> 00:31:12,610
dense layers.

415
00:31:12,610 --> 00:31:18,550
So as we saw in this model here, you see that after the multi-head attention, we have layer norm two

416
00:31:18,550 --> 00:31:22,720
and then we have this MLP layer, which is made up of two dense layers.

417
00:31:23,110 --> 00:31:33,520
So we have our self dot dense one, which is our dense layer, and then the number of units here we will take

418
00:31:33,520 --> 00:31:42,160
it to be the hidden size, and then we'll specify the activation to be a GELU activation, as specified in

419
00:31:42,160 --> 00:31:42,880
the paper.

420
00:31:42,880 --> 00:31:49,930
So we have GELU, and then we repeat the same process for the second dense layer.

421
00:31:49,930 --> 00:31:53,560
So here we have that, and then dense two.

422
00:31:54,130 --> 00:31:55,360
Okay, so that's it.

423
00:31:55,360 --> 00:32:00,040
We've specified these two dense layers, and we now get our output.

424
00:32:00,040 --> 00:32:03,850
So we should have that, and now we could go ahead to the call method.

425
00:32:03,850 --> 00:32:09,130
So before going to that, let's modify this here: transformer encoder.

426
00:32:09,790 --> 00:32:10,870
That's fine.

427
00:32:11,200 --> 00:32:13,780
The name: transformer encoder.

428
00:32:15,880 --> 00:32:16,930
There we go.

429
00:32:16,930 --> 00:32:17,950
That should be fine.

430
00:32:17,980 --> 00:32:26,200
Okay, so we've built our transformer encoder layer already, and now we're set to pass the inputs into

431
00:32:26,200 --> 00:32:27,550
these different layers.

432
00:32:28,000 --> 00:32:29,040
We have our input here.

433
00:32:29,050 --> 00:32:36,400
X there we go, we have X, let's take this off and then x gets into the layer norm one.

434
00:32:36,400 --> 00:32:44,530
So we have our output x: x equals layer norm

435
00:32:45,290 --> 00:32:53,270
one, which takes in this x here, takes our input x, and then outputs this.

436
00:32:53,300 --> 00:32:54,620
We could call this input.

437
00:32:55,220 --> 00:33:00,410
And then we have this input.

438
00:33:00,410 --> 00:33:05,120
We have our input which gets in here and then produces an output X.

439
00:33:05,120 --> 00:33:08,930
And then from here we get into the multi head attention layer.

440
00:33:08,930 --> 00:33:17,120
So we have X again this is multi head attention and this takes in X.

441
00:33:17,120 --> 00:33:22,550
So now we have this input which gets in here, gives us an output, and this output gets into this multi-head attention

442
00:33:22,550 --> 00:33:23,680
and gives us output.

443
00:33:23,690 --> 00:33:31,460
Now that we've had this, remember from the paper that we have this addition to do. So we have this Add

444
00:33:31,460 --> 00:33:39,440
layer which we need to specify here because after this output got into the multi head, we got this

445
00:33:39,440 --> 00:33:43,130
output which now needs to be added to this input from the patches.

446
00:33:43,130 --> 00:33:45,620
So we need to create this link here.

447
00:33:45,800 --> 00:33:54,200
So that said, getting back to the code, we have our x now, which is Add, and this Add will take in

448
00:33:54,200 --> 00:33:55,520
this x.

449
00:33:56,570 --> 00:33:58,940
And the input.

450
00:34:00,440 --> 00:34:05,090
Then from here again, we take this and pass it into our layer

451
00:34:05,090 --> 00:34:05,620
norm.

452
00:34:05,630 --> 00:34:11,870
So now, let's call this x one, because we'll need this x one again.

453
00:34:11,870 --> 00:34:18,140
So we can't just be working with x. You'll understand shortly why we need this x one.

454
00:34:18,140 --> 00:34:21,050
So we have this, and we have here x one.

455
00:34:21,050 --> 00:34:22,910
Okay, so we have that.

456
00:34:23,360 --> 00:34:24,340
And that's fine.

457
00:34:24,350 --> 00:34:25,670
Everything looks fine.

458
00:34:25,670 --> 00:34:33,890
Then from here we take this output x one and pass it into our layer norm two.

459
00:34:33,890 --> 00:34:38,180
So we call this layer norm two, That's it.

460
00:34:38,180 --> 00:34:42,290
And then this takes in our x one produces output x one.

461
00:34:42,290 --> 00:34:46,310
Then from here we take this and pass into our dense layers.

462
00:34:46,790 --> 00:34:58,490
So we have this x one, which is equal to dense one, and this takes in x one.

463
00:34:58,490 --> 00:35:07,280
And again we have x one which takes in dense two, oh sorry, we have x one which is gotten from dense

464
00:35:07,280 --> 00:35:09,530
two, which takes in x one.

465
00:35:09,530 --> 00:35:14,120
And now let's say this produces an output x two. Remember, this is like our final output.

466
00:35:14,120 --> 00:35:15,470
So let's call this output.

467
00:35:15,770 --> 00:35:18,560
Let's, let's call this output.

468
00:35:18,560 --> 00:35:21,290
Okay, So that's our output right there.

469
00:35:21,590 --> 00:35:29,750
Now let's do this then for this output again, we're going to do this addition recall from the paper.

470
00:35:29,750 --> 00:35:32,690
You see, after the addition, you get an output.

471
00:35:32,690 --> 00:35:36,170
We get this output, then we take this output from here and add it onto this.

472
00:35:36,170 --> 00:35:40,400
That's why you see, we had to create x one for this and then x two for this.

473
00:35:40,400 --> 00:35:42,230
So let's get back to our code.

474
00:35:42,230 --> 00:35:47,150
We have that and then here we have our output now, which is equal

475
00:35:49,700 --> 00:35:52,040
and that's it.

476
00:35:52,070 --> 00:36:01,790
And then it takes in our output and this x one here, or rather, this x one after this addition.

477
00:36:01,790 --> 00:36:03,920
So we need to have this as x two.

478
00:36:03,920 --> 00:36:11,540
So let's change this to x two so we could now make use of this in this addition, because if we

479
00:36:11,540 --> 00:36:18,590
have this, if we maintain this to be x one, then the value we get in here would be this final x one

480
00:36:18,590 --> 00:36:19,400
we got here.

481
00:36:19,400 --> 00:36:20,330
And that's not what we want.

482
00:36:20,330 --> 00:36:22,250
What we want is this output from here.

483
00:36:22,250 --> 00:36:24,200
So that's why we have to change this variable name.

484
00:36:24,200 --> 00:36:28,190
So we have x two, this takes in x one, and that's it.

485
00:36:28,190 --> 00:36:32,510
So from here, now we take this output and add it to this x one.

486
00:36:33,020 --> 00:36:33,710
There we go.

487
00:36:33,710 --> 00:36:36,560
We have that x one, and we now get our output.

488
00:36:36,920 --> 00:36:39,860
So we could close this up and that's fine.

489
00:36:39,860 --> 00:36:45,470
So we return our output and we have our transformer encoder right here.

490
00:36:46,220 --> 00:36:48,110
Now we could go ahead and test this.

491
00:36:48,110 --> 00:36:54,110
So we have our transformer encoder, number of heads, hidden size. That's fine.

492
00:36:54,110 --> 00:36:57,770
And then we also have this input right here.

493
00:36:57,770 --> 00:37:00,110
Recall that the input, or the patches, take this form.

494
00:37:00,110 --> 00:37:02,240
So let's run this and see what we get.

495
00:37:02,960 --> 00:37:06,110
We get an error from this.

496
00:37:06,110 --> 00:37:10,010
Yeah, this is logical because we need to pass in at least two inputs.

497
00:37:10,010 --> 00:37:17,690
So here, if you recall from this here, from multi-head attention, you see this: we have

498
00:37:17,690 --> 00:37:27,500
these two which must be passed, and then this key is optional, this one is optional, and all of these are optional.

499
00:37:27,500 --> 00:37:29,570
So we must pass at least these two.

500
00:37:29,570 --> 00:37:33,530
So that said, let's get back and modify that.

501
00:37:33,530 --> 00:37:40,460
So in this multi-head attention, we have this and then we have x one, so that's fine.

502
00:37:40,460 --> 00:37:41,840
So now we have that.

503
00:37:41,840 --> 00:37:44,090
Let's run this again and see what we get.

504
00:37:44,840 --> 00:37:45,800
You see, that's fine.

505
00:37:45,800 --> 00:37:47,810
You see, we get exactly what we expect.

506
00:37:47,810 --> 00:37:53,440
So this is the kind of input we get, and here's our output after going through the transformer encoder.
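
[Editor's sketch] The encoder block walked through above, written out under a few assumptions: the attribute names are illustrative, the residual additions use the + operator instead of an Add layer (the result is the same), and key_dim is set to the hidden size as in the video.

```python
import tensorflow as tf
from tensorflow.keras.layers import Dense, Layer, LayerNormalization, MultiHeadAttention


class TransformerEncoder(Layer):
    def __init__(self, n_heads, hidden_size, name="transformer_encoder"):
        super().__init__(name=name)
        self.layer_norm_1 = LayerNormalization()
        self.layer_norm_2 = LayerNormalization()
        self.multi_head_att = MultiHeadAttention(num_heads=n_heads, key_dim=hidden_size)
        self.dense_1 = Dense(hidden_size, activation=tf.nn.gelu)
        self.dense_2 = Dense(hidden_size, activation=tf.nn.gelu)

    def call(self, inputs):
        x_1 = self.layer_norm_1(inputs)
        # Query and value are both required; the key defaults to the value.
        x_1 = self.multi_head_att(x_1, x_1)
        # First skip connection: attention output plus the block input.
        x_1 = x_1 + inputs

        x_2 = self.layer_norm_2(x_1)
        x_2 = self.dense_1(x_2)
        output = self.dense_2(x_2)
        # Second skip connection: MLP output plus x_1, which is why x_1
        # must not be overwritten by the MLP branch.
        return output + x_1


trans_enc = TransformerEncoder(n_heads=8, hidden_size=768)
out = trans_enc(tf.zeros([1, 256, 768]))
print(out.shape)  # same shape as the input
```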

507
00:37:53,450 --> 00:38:01,370
Now, although this works, make sure that you check your work or check this code and be sure that everything

508
00:38:01,370 --> 00:38:03,620
is working as it's supposed to be.

509
00:38:03,620 --> 00:38:10,310
So be sure that you pass in the right inputs and get the right outputs and so that you get the exact

510
00:38:10,310 --> 00:38:16,880
output you are to get and not some output different from what you're supposed to have.

511
00:38:17,660 --> 00:38:18,350
So that's it.

512
00:38:18,350 --> 00:38:23,300
We have our transformer encoder and now we'll head on to building our model.

513
00:38:23,300 --> 00:38:26,290
So here we would have our model.

514
00:38:26,300 --> 00:38:34,340
Let's get back up here. Let's copy this ResNet from here, the complete network.

515
00:38:35,060 --> 00:38:40,910
We had this so we just simply copy the same structure and then paste it out.

516
00:38:40,910 --> 00:38:43,670
Here now, there's not much difference.

517
00:38:43,670 --> 00:38:50,720
The main difference here is that we have this Model instead of Layer.

518
00:38:50,720 --> 00:38:55,610
So here we have a model, and then here we call this ViT.

519
00:38:56,200 --> 00:39:04,330
So, it's our model, our vision transformer model, and then here we change this. We have the vision

520
00:39:04,330 --> 00:39:08,530
transformer and we'll call this our vision transformer.

521
00:39:08,800 --> 00:39:12,400
Vision transformer, that's it.

522
00:39:12,970 --> 00:39:13,750
So there we go.

523
00:39:13,750 --> 00:39:20,380
We could take this off, take this off and all this off.

524
00:39:22,060 --> 00:39:26,380
Then we build it in such a way that it takes in the number of heads, the hidden size.

525
00:39:27,310 --> 00:39:30,010
Then from this encoder, the number of patches.

526
00:39:30,010 --> 00:39:34,900
So we just need a number of heads, hidden size and number of patches.

527
00:39:35,050 --> 00:39:44,920
So with that would specify number of heads, there we go, hidden size and then number of patches.

528
00:39:45,640 --> 00:39:54,400
Okay, so we have that, and then here we would have our patch encoder, which will be defined.

529
00:39:54,400 --> 00:39:56,440
So we have patch encoder.

530
00:39:56,440 --> 00:40:01,030
And this patch encoder which we have right here is simply what we created already.

531
00:40:01,030 --> 00:40:05,740
So we just have Patch encoder and what does this take?

532
00:40:05,740 --> 00:40:12,640
So we get back here and we look at this format, see number of patches and hidden size.

533
00:40:12,640 --> 00:40:15,490
So that's why it's important for you to test this as you go on.

534
00:40:15,490 --> 00:40:20,680
So since we've tested this, we're sure that this works when you pass in an input image.

535
00:40:20,680 --> 00:40:29,650
So here, patch encoder: number of heads and hidden size, or rather, number of patches and hidden size.

536
00:40:29,650 --> 00:40:34,840
So we have the number of patches, then the hidden size, which we'll get from here.

537
00:40:35,380 --> 00:40:38,320
Now we have this patch encoder.

538
00:40:39,580 --> 00:40:45,220
We have our transformer encoders recall that we have several transformers here.

539
00:40:45,250 --> 00:40:47,710
See, we have L transformer encoders.

540
00:40:47,710 --> 00:40:51,820
So we get back here and then we define this transformer.

541
00:40:53,200 --> 00:40:55,990
Let's call it trans encoders.

542
00:40:56,110 --> 00:41:01,570
Okay, so this transformer encoders will be a list made of different layers.

543
00:41:01,570 --> 00:41:06,670
And the length of this list will depend on the number of layers.

544
00:41:06,670 --> 00:41:08,680
So we have here the number of layers.

545
00:41:08,680 --> 00:41:19,780
And then what we'll do is we'll say for underscore in this number of layers, or rather in range number

546
00:41:19,780 --> 00:41:20,830
of layers.

547
00:41:21,520 --> 00:41:25,960
See, we are then going to define the transformer encoder.

548
00:41:25,960 --> 00:41:33,250
So we have transformer encoder, so we have the separate transformer encoders and then we'll specify

549
00:41:33,250 --> 00:41:36,310
a number of heads and hidden size.

550
00:41:36,310 --> 00:41:41,620
So here again, we have the number of heads and then we have the hidden size.

551
00:41:41,620 --> 00:41:44,440
The input gets through this patch encoder.

552
00:41:44,440 --> 00:41:48,460
So here we have our patch encoder.

553
00:41:48,580 --> 00:41:49,180
That is it.

554
00:41:49,180 --> 00:41:52,330
This way we have X, let's call this input.

555
00:41:52,330 --> 00:42:02,200
So here we have the input, which passes to our patch encoder and then produces the output

556
00:42:02,200 --> 00:42:02,860
x.

557
00:42:02,860 --> 00:42:08,170
Then once we have this output x, what we are going to do is we are going to loop through, or we're going

558
00:42:08,170 --> 00:42:11,440
to go through each and every transformer encoder layer.

559
00:42:11,440 --> 00:42:21,460
So here we have for i in range of self dot number of layers.

560
00:42:21,790 --> 00:42:31,750
Let's define this here: number of layers equals this number of layers from here. So we've set our number

561
00:42:31,750 --> 00:42:37,810
of layers, and then we take this in here. So now we're going to go through this, and then what we'll be

562
00:42:37,810 --> 00:42:46,180
doing is we'll be having this X, which is going to be the output from this, our transformer encoder

563
00:42:46,180 --> 00:42:46,840
layers.

564
00:42:46,840 --> 00:42:55,660
So we have trans encoders i, which takes in the input x. So we'll do this for a number of

565
00:42:55,660 --> 00:43:02,350
layers times and then for each of these we have trans encoders as we've defined already here, and we

566
00:43:02,350 --> 00:43:04,790
pick the specific transformer encoder.

567
00:43:04,810 --> 00:43:12,130
Now once we get the output from here, see, once we get our output from here, we'll now obtain an output

568
00:43:12,130 --> 00:43:13,030
which we'll flatten.

569
00:43:13,030 --> 00:43:16,660
So from here, we'll say x equals:

570
00:43:16,660 --> 00:43:18,190
We flattened this out.

571
00:43:19,030 --> 00:43:24,640
It takes in x, and then from this we'll have this MLP block right here.

572
00:43:24,850 --> 00:43:26,680
Oh, let's get back here.

573
00:43:28,600 --> 00:43:30,430
We have MLPs.

574
00:43:30,430 --> 00:43:31,420
BLOCK here.

575
00:43:31,420 --> 00:43:32,050
Okay.

576
00:43:32,050 --> 00:43:34,420
So we have this MLP block right here.

577
00:43:34,450 --> 00:43:35,440
That's after this.

578
00:43:35,440 --> 00:43:42,430
Now, in the paper, what they did was, since they had this class token, they picked the last, or

579
00:43:42,430 --> 00:43:44,680
rather they picked this first output here.

580
00:43:44,680 --> 00:43:51,070
So this means that if we have, let's take from here, if we have an output with batch size one, let's take this

581
00:43:51,070 --> 00:43:55,720
off, 1 by 256 by

582
00:43:55,860 --> 00:43:56,870
768.

583
00:43:56,880 --> 00:44:05,910
Then out of these 256 here, or 257 because they do have this additional one, so out of these 257 we'll pick out only

584
00:44:05,910 --> 00:44:13,320
one from here, and so we'll be left with that one, which has 768 hidden units.

585
00:44:13,320 --> 00:44:18,870
So basically this will be like having a 1 by 768 output.

586
00:44:18,870 --> 00:44:24,030
But now since we're taking all this into consideration, we'll just flatten this out and then pass this

587
00:44:24,030 --> 00:44:26,750
to our MLP head right here.

588
00:44:26,760 --> 00:44:28,110
So that's it.

589
00:44:28,110 --> 00:44:29,730
Let's get back to the code.

590
00:44:29,730 --> 00:44:35,640
We flatten this out and then we specify the dense layers which make up the MLP head.

591
00:44:35,640 --> 00:44:39,750
So we have dense one dense layer.

592
00:44:39,750 --> 00:44:47,340
This dense layer has a set number of units, let's say 1000, or let's give this dense

593
00:44:47,340 --> 00:44:48,330
units here.

594
00:44:48,360 --> 00:44:51,090
Let's have it as this argument.

595
00:44:51,090 --> 00:44:56,410
So we have a number of dense units, okay?

596
00:44:56,460 --> 00:44:58,800
So we have a number of dense units there.

597
00:45:00,150 --> 00:45:10,700
Let's specify this number of dense units, and then we have the GELU activation, tf.nn.gelu.

598
00:45:10,740 --> 00:45:11,170
Okay.

599
00:45:11,190 --> 00:45:18,030
So we have that for dense one, and now we should write this code for dense two.

600
00:45:18,030 --> 00:45:22,470
So we now have dense two, and then here, the same number of dense units.

601
00:45:22,470 --> 00:45:26,520
You could always modify this, and that's it.

602
00:45:27,030 --> 00:45:33,120
Next, from here we have x equals dense one.

603
00:45:33,120 --> 00:45:33,870
That's it.

604
00:45:33,870 --> 00:45:34,770
Dense one.

605
00:45:34,770 --> 00:45:36,210
This is self.

606
00:45:37,020 --> 00:45:45,180
Okay, so here we have dense one, which takes in x, and then dense two, which takes in x.

607
00:45:45,210 --> 00:45:51,660
Now our final output dense layer has to consider the number of classes.

608
00:45:51,660 --> 00:45:57,510
So just as we have been doing so far in this course, let's take this one, this complete

609
00:45:57,510 --> 00:45:58,230
block here.

610
00:45:58,230 --> 00:46:06,700
So far, what we've been doing is we always ensure that our output dense layer has the

611
00:46:06,700 --> 00:46:09,300
number of classes as the number of units in this output.

612
00:46:09,300 --> 00:46:18,270
So here we would simply copy this, scroll down back to our code, and put this right here.

613
00:46:19,080 --> 00:46:20,220
So that's it.

614
00:46:20,220 --> 00:46:24,480
So let's call this dense three, and that should be fine.

615
00:46:24,480 --> 00:46:28,140
Okay, so here we have dense.

616
00:46:28,950 --> 00:46:30,030
Take this off.

617
00:46:30,750 --> 00:46:31,590
That's fine.

618
00:46:31,770 --> 00:46:32,670
We have that.

619
00:46:32,670 --> 00:46:34,560
Now let's run this.

620
00:46:36,060 --> 00:46:36,710
That should be fine.

621
00:46:36,720 --> 00:46:37,970
Let's go ahead and test this.

622
00:46:37,980 --> 00:46:41,770
So here we have our model, and we'll define ViT.

623
00:46:41,770 --> 00:46:50,190
It's of course ViT, and the parameters, those arguments here: number of heads, hidden size,

624
00:46:50,190 --> 00:46:53,160
all of this, we're going to define this just down here.

625
00:46:53,370 --> 00:46:56,940
Scroll down and then define this here.

626
00:46:57,120 --> 00:47:09,300
Okay, so we now have this number of heads, let's say eight heads, hidden size 768, number of patches 256,

627
00:47:09,300 --> 00:47:18,090
the number of layers, let's say four layers, the number of dense units is 1024. Okay, so there we go.

628
00:47:18,090 --> 00:47:29,070
We have our ViT, and then right now we pass in this input of 1 by 256 by 256

629
00:47:29,070 --> 00:47:32,640
by 3, and we should get a reasonable output.

630
00:47:32,760 --> 00:47:39,120
We could print out the summary, so ViT dot summary, and we could check out this model.

631
00:47:39,240 --> 00:47:45,600
So you see here we have 283 million different parameters for this model.
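
[Editor's sketch] A back-of-the-envelope parameter count for the configuration above (8 heads, hidden size 768, 256 patches, 4 layers, 1024 dense units). It assumes 16 by 16 by 3 patches, biases on every dense layer, 3 output classes, and the projection layout Keras uses for MultiHeadAttention with key_dim equal to the hidden size. The helper names are illustrative. Under those assumptions the total comes out at roughly 283 million, most of it in the first dense layer after the flatten.

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # kernel + bias


def mha_params(n_heads, key_dim, d_model):
    qkv = 3 * (d_model * n_heads * key_dim + n_heads * key_dim)  # query, key, value
    out = n_heads * key_dim * d_model + d_model                  # output projection
    return qkv + out


n_heads, hidden, n_patches, n_layers, dense_units, n_classes = 8, 768, 256, 4, 1024, 3
patch_dim = 16 * 16 * 3  # assumed flattened patch size

patch_encoder = dense_params(patch_dim, hidden) + n_patches * hidden
encoder_block = (
    mha_params(n_heads, hidden, hidden)
    + 2 * (2 * hidden)                   # two layer norms: scale and offset
    + 2 * dense_params(hidden, hidden)   # two dense layers in the MLP
)
first_head_dense = dense_params(n_patches * hidden, dense_units)  # after Flatten
rest_of_head = dense_params(dense_units, dense_units) + dense_params(dense_units, n_classes)

total = patch_encoder + n_layers * encoder_block + first_head_dense + rest_of_head
print(f"total: {total:,}")
print(f"first dense layer after flatten: {first_head_dense:,}")
```

Because the flattened output has 256 times 768 features, shrinking the number of dense units cuts the model size far more than removing encoder layers does.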

632
00:47:45,630 --> 00:47:47,250
Now let's reduce this.

633
00:47:47,250 --> 00:47:55,350
We could say four heads, just two layers, and then this number of dense units here, we could take

634
00:47:55,350 --> 00:47:56,640
128.

635
00:47:56,700 --> 00:47:58,380
Now let's run this again.

636
00:47:59,850 --> 00:48:00,540
And there we go.

637
00:48:00,540 --> 00:48:02,160
You see the two layers.

638
00:48:02,190 --> 00:48:03,480
This is the patch encoder.

639
00:48:03,480 --> 00:48:09,150
And then you see the dense layers which follow these transformer encoder layers.

640
00:48:09,150 --> 00:48:15,300
So now we have this model, we'll change this to the batch size.

641
00:48:15,300 --> 00:48:20,460
So we have configuration and then batch size.

642
00:48:20,500 --> 00:48:24,870
Now, the reason why we're doing this is because when we're training, our batch size is going to be known.

643
00:48:24,870 --> 00:48:26,940
So we don't want to have this

644
00:48:27,480 --> 00:48:28,440
here like that.

645
00:48:28,440 --> 00:48:33,990
So with that, now we run this again and run the summary.

646
00:48:35,320 --> 00:48:35,950
Oh.

647
00:48:35,950 --> 00:48:38,060
Yeah, we need to change this to 32.

648
00:48:38,080 --> 00:48:40,230
You see, when we change this, it doesn't work.

649
00:48:40,240 --> 00:48:43,980
So here we have 32, since that's our batch size.

650
00:48:44,050 --> 00:48:45,030
That's it.

651
00:48:45,040 --> 00:48:48,580
And now we can go ahead to compile and train our model.

652
00:48:49,090 --> 00:48:50,380
Our model is now training.

653
00:48:50,380 --> 00:48:58,210
Note that you wouldn't get the best results, because obviously ViTs need very large data sets

654
00:48:58,210 --> 00:49:04,180
or even extra large data sets to perform as well as ConvNets.

655
00:49:04,180 --> 00:49:09,490
And so when working with ViTs, generally, we want to train on a very, very large data set and

656
00:49:09,490 --> 00:49:14,370
then later on fine-tune on the smaller data set. Towards the end of the epoch,

657
00:49:14,380 --> 00:49:21,970
we still get this error, and the source of this error is the fact that we've fixed the batch size.

658
00:49:21,970 --> 00:49:29,200
That is, if you get back up here in our training, if you look at this here, we fixed this.

659
00:49:29,450 --> 00:49:37,510
And so this isn't dynamic. Now, we have a data set which has been broken down into batches of 32.

660
00:49:37,510 --> 00:49:43,000
And now towards the end, you may have a batch of, say, for example, eight.

661
00:49:45,010 --> 00:49:53,050
Obviously, because the dataset isn't necessarily divisible by 32.

662
00:49:53,050 --> 00:49:57,970
That is, the number of elements you have isn't necessarily divisible by 32, and so you have a remainder.

663
00:49:57,970 --> 00:50:04,600
And so when you have this remainder and you've fixed this here, you should get an error, because now

664
00:50:04,600 --> 00:50:09,370
you've told the model to always use at this position a value of 32.
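
[Editor's sketch] The remainder problem in numbers. The dataset size of 1000 below is hypothetical, chosen only so that the last batch comes out at eight as in the example above; batch_sizes is an illustrative helper.

```python
def batch_sizes(n_examples, batch_size):
    """Sizes of the batches produced when the last partial batch is kept."""
    full, remainder = divmod(n_examples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


sizes = batch_sizes(1000, 32)  # hypothetical dataset of 1000 images
print(len(sizes), sizes[0], sizes[-1])
```

Any reshape that hard-codes 32 in the batch position works for every batch except that final, smaller one.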

665
00:50:09,580 --> 00:50:17,080
So to avoid that, instead of doing patches dot shape, we're going to use tf dot shape. So we have

666
00:50:17,110 --> 00:50:17,680
the shape.

667
00:50:17,680 --> 00:50:22,330
Before we had patches, we got the shape and that was it.

668
00:50:22,330 --> 00:50:27,220
But now we are using tf.shape, so we call the shape method from TensorFlow.

669
00:50:27,220 --> 00:50:31,730
And then here we're going to pass in our patches.

670
00:50:31,780 --> 00:50:39,580
So once we have our patches passed in, we now select this batch dimension, and that

671
00:50:39,580 --> 00:50:40,390
should be it.
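
[Editor's sketch] Why tf.shape helps: inside a traced function the static shape can have an unknown batch dimension, while tf.shape returns the actual size at run time. The function name, the small tensor sizes, and the traced_static list are illustrative only.

```python
import tensorflow as tf

traced_static = []  # records the static batch dimension seen while tracing


@tf.function(input_signature=[tf.TensorSpec([None, 4, 8], tf.float32)])
def flatten_patches(patches):
    traced_static.append(patches.shape[0])  # static shape: unknown batch size
    batch = tf.shape(patches)[0]            # dynamic shape: resolved per call
    return tf.reshape(patches, [batch, -1])


full = flatten_patches(tf.zeros([32, 4, 8]))  # a full batch
last = flatten_patches(tf.zeros([8, 4, 8]))   # a smaller final batch
print(traced_static[0], full.shape, last.shape)
```

The same reshape handles both the full batch and the smaller last batch, because the batch dimension is read from the tensor at run time rather than fixed in the code.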

672
00:50:40,390 --> 00:50:42,400
So let's run this again.

673
00:50:42,760 --> 00:50:46,780
You'll see that even from here we could modify this.

674
00:50:46,810 --> 00:50:49,150
We could put 32 and it'll still work.

675
00:50:49,150 --> 00:50:49,810
Fine.

676
00:50:50,080 --> 00:50:51,700
Okay, let's see.

677
00:50:51,740 --> 00:50:52,240
So.

678
00:50:52,450 --> 00:50:57,190
So with that now let's go ahead and compile and then restart the training.

679
00:50:57,820 --> 00:50:59,230
The model is training.

680
00:50:59,230 --> 00:51:10,900
And as you could see, it starts to stagnate around this 44.4% accuracy and 44.16% validation accuracy.

681
00:51:11,200 --> 00:51:18,250
In the next section we will fine-tune a ViT which has been trained on a very large data set.
