1
00:00:00,180 --> 00:00:06,060
Hi there, and welcome to this new and exciting session, in which we shall be treating, or rather we shall be

2
00:00:06,060 --> 00:00:14,790
using this transformer network right here to solve problems in computer vision and more specifically

3
00:00:14,790 --> 00:00:17,760
in the task of image classification.

4
00:00:18,390 --> 00:00:25,290
Up until this point, we have seen different convolutional neural networks like LeNet, VGG,

5
00:00:25,290 --> 00:00:28,260
the ResNets, the MobileNets, and EfficientNet.

6
00:00:28,260 --> 00:00:32,070
And now we'll be looking at the vision transformers.

7
00:00:32,070 --> 00:00:39,870
These vision transformers were first introduced in the paper entitled An Image Is Worth 16x16 Words:

8
00:00:39,870 --> 00:00:43,900
Transformers for Image Recognition at Scale.

9
00:00:43,920 --> 00:00:50,970
In this section, we'll take a deep dive into how this whole architecture has been constructed and how

10
00:00:50,970 --> 00:00:59,640
it works, and also how and why Transformers perform as well as their convolutional neural network counterparts.

11
00:00:59,730 --> 00:01:06,960
The very first point you want to note here is that the use of Transformers for computer vision tasks has

12
00:01:07,530 --> 00:01:10,370
been developed in very recent times.

13
00:01:10,380 --> 00:01:13,860
You can see that here from the date this paper was published.

14
00:01:14,520 --> 00:01:20,490
The authors say here that while the transformer architecture has become the de facto standard for natural

15
00:01:20,490 --> 00:01:25,680
language processing tasks, its application to computer vision remains limited.

16
00:01:26,040 --> 00:01:33,180
In vision, attention is either applied in conjunction with the convolutional networks or used to replace

17
00:01:33,180 --> 00:01:38,910
certain components of convolutional networks while keeping the overall structure in place.

18
00:01:39,090 --> 00:01:45,660
They show that this reliance on convolutional neural networks is not necessary, and a pure transformer

19
00:01:45,660 --> 00:01:53,310
that is, one without any convolutional neural networks, applied directly to sequences of image patches can perform

20
00:01:53,310 --> 00:01:55,830
very well on image classification tasks.

21
00:01:56,010 --> 00:02:02,430
They even go ahead to tell us that when pre-trained on large amounts of data and transferred to multiple

22
00:02:02,430 --> 00:02:10,440
mid-sized or small image recognition benchmarks like ImageNet and CIFAR, the ViT, that is, the Vision

23
00:02:10,440 --> 00:02:16,110
Transformer, attains excellent results compared to state-of-the-art ConvNets like

24
00:02:16,110 --> 00:02:22,600
EfficientNet, while requiring substantially fewer computational resources to train.

25
00:02:22,620 --> 00:02:28,800
Now, it's possible that you've never heard of this term transformer, or maybe you're from an electrical

26
00:02:28,800 --> 00:02:33,750
engineering background and you've only heard of this when it comes to stepping up and stepping down

27
00:02:33,750 --> 00:02:34,620
electric power.

28
00:02:34,620 --> 00:02:41,130
Now we are going to go straight away to explain terms like the transformer or even this attention which

29
00:02:41,130 --> 00:02:49,710
was mentioned. To better understand the transformer and the role it has to play in this ViT architecture

30
00:02:49,710 --> 00:02:50,760
right here,

31
00:02:50,760 --> 00:02:55,230
we have to go back in time to when it was first

32
00:02:55,230 --> 00:02:57,270
developed, in 2017.

33
00:02:57,270 --> 00:03:05,490
The paper entitled Attention Is All You Need was published by Vaswani et al., and it has turned

34
00:03:05,490 --> 00:03:10,980
out to be one of the most influential papers in the modern deep learning era,

35
00:03:10,980 --> 00:03:17,730
with the development of this transformer architecture right here. At the heart of this transformer architecture,

36
00:03:17,730 --> 00:03:26,190
we have these self-attention modules, and more specifically, in this paper they used the dot-product attention

37
00:03:26,190 --> 00:03:27,300
that we could see here.

38
00:03:27,300 --> 00:03:35,040
But then, as we said, the whole purpose or the domain in which these kinds of architectures or those

39
00:03:35,040 --> 00:03:39,690
kinds of networks were built was natural language processing.

40
00:03:39,990 --> 00:03:47,520
But the question is, how does this work in natural language processing? To understand how and why

41
00:03:47,520 --> 00:03:54,030
attention and also transformers are used in natural language processing, we'll take the following

42
00:03:54,030 --> 00:04:01,170
example, which is that of translation, which we're already used to doing with Google Translate.

43
00:04:01,170 --> 00:04:08,190
So here we're going to put in, I love the weather, or, let's say, The weather today

44
00:04:08,190 --> 00:04:09,120
is amazing.

45
00:04:09,120 --> 00:04:11,400
And we translate this to French: Le temps

46
00:04:11,400 --> 00:04:19,170
aujourd'hui est incroyable. Now, initially, the kinds of deep learning techniques which were used in solving these

47
00:04:19,170 --> 00:04:24,090
kinds of problems, that is, taking us from one language to another,

48
00:04:25,330 --> 00:04:32,440
were the recurrent neural networks. The way recurrent neural networks work is quite simple.

49
00:04:32,590 --> 00:04:35,140
So we'll start by putting the text here.

50
00:04:36,070 --> 00:04:42,130
Here, we've put our example from Google Translate, and then we've added these extra blocks right

51
00:04:42,130 --> 00:04:42,640
here.

52
00:04:42,670 --> 00:04:50,740
Now, the blocks we see here are recurrent neural network blocks. Recurrent neural networks are generally

53
00:04:50,740 --> 00:04:51,160
written

54
00:04:51,160 --> 00:04:59,740
RNN, and they were one of the first deep-learning-based models used in natural language processing

55
00:04:59,740 --> 00:05:03,010
tasks like the case of translation we have here.

56
00:05:03,160 --> 00:05:09,760
Now, the way this works is we have our initial text, or we have our source,

57
00:05:09,760 --> 00:05:11,180
that's the English text:

58
00:05:11,200 --> 00:05:13,030
The weather today is amazing.

59
00:05:13,030 --> 00:05:16,040
And then we have the target, which we want to generate.

60
00:05:16,060 --> 00:05:23,140
So initially we have this input and this output, which we're going to train on, and later on when we pass

61
00:05:23,140 --> 00:05:27,700
in some random input, we expect to get a reasonable output.

62
00:05:28,000 --> 00:05:35,320
Now, the way we have this, or the way this is structured, is such that each and every one of these

63
00:05:35,320 --> 00:05:37,230
is called a token.

64
00:05:37,240 --> 00:05:44,140
So we have this word here, which is a token, this other token, the next token, this token, this

65
00:05:44,140 --> 00:05:52,240
token and this other token. Then these different tokens are converted into vectors and then

66
00:05:52,240 --> 00:05:56,130
passed into these RNN blocks right here.

67
00:05:56,140 --> 00:06:02,950
Now we carry out some simple computations like multiplication and addition, and then some information

68
00:06:02,950 --> 00:06:10,030
is being passed in from one block to another, hence the term recurrent neural network.

69
00:06:10,120 --> 00:06:16,900
Now, the importance of passing information from one block to another is that this token's computations

70
00:06:16,900 --> 00:06:23,410
in this block will depend on these other previous tokens that you can see here.

71
00:06:23,410 --> 00:06:27,370
So it depends on this, depends on this, and also depends on this other one.
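
The recurrence described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under assumptions: a plain tanh RNN cell, random weights, and made-up sizes (`embed_dim`, `hidden_dim`) that do not come from the lecture. It shows how each hidden state depends on the current token vector and on the state carried over from the previous block.

```python
import numpy as np

def rnn_encode(token_vectors, W_x, W_h, b):
    """Run a plain tanh RNN over a sequence and return every hidden state."""
    hidden = np.zeros(W_h.shape[0])
    states = []
    for x in token_vectors:  # tokens are processed strictly one after another
        # The new state mixes the current token with the previous hidden state.
        hidden = np.tanh(W_x @ x + W_h @ hidden + b)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
embed_dim, hidden_dim, seq_len = 3, 4, 5  # illustrative sizes (assumptions)
tokens = rng.normal(size=(seq_len, embed_dim))  # stand-ins for the 5 token vectors
W_x = rng.normal(size=(hidden_dim, embed_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

states = rnn_encode(tokens, W_x, W_h, b)
print(states.shape)  # (5, 4)
```

Because the loop body needs the previous `hidden`, the five steps cannot be computed independently of one another.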

72
00:06:27,370 --> 00:06:33,820
And then once we're done passing this information from one block to another, up to this one,

73
00:06:33,820 --> 00:06:37,150
we are then going to take this here; we're going to have some information

74
00:06:37,150 --> 00:06:42,280
which is going to be passed on to this other RNN block here.

75
00:06:42,280 --> 00:06:44,200
So this is our encoder block.

76
00:06:44,200 --> 00:06:50,560
So we encode the information and then, here, we decode this information.

77
00:06:50,560 --> 00:06:54,490
So here we have the encoder and the decoder.

78
00:06:54,490 --> 00:07:00,790
And then again, here a similar process is repeated, where we have these computations which produce an

79
00:07:00,790 --> 00:07:01,810
output here.

80
00:07:01,810 --> 00:07:08,740
And then we could take this output and feed it into this one to produce this other output, and so on and

81
00:07:08,740 --> 00:07:11,170
so forth up to this final output.

82
00:07:11,170 --> 00:07:18,580
But then the problem with this technique or with this method is that first of all, if we have a very

83
00:07:18,580 --> 00:07:25,750
long text, then it may happen that it starts becoming difficult for information to flow from these first

84
00:07:25,750 --> 00:07:29,350
blocks here to these final blocks.

85
00:07:29,350 --> 00:07:36,550
And given that even as humans, we know the importance of taking into consideration some previous context

86
00:07:36,550 --> 00:07:44,830
when trying to carry out a task like, for example, translation, this kind of problem will lead to

87
00:07:44,830 --> 00:07:46,600
very poor results.

88
00:07:47,530 --> 00:07:56,200
Now, another problem here is each time we train, we have to pass this information from one block to

89
00:07:56,200 --> 00:07:57,850
another sequentially.

90
00:07:57,850 --> 00:08:00,460
So here we pass all this information sequentially.

91
00:08:00,460 --> 00:08:07,360
And because this information is passed sequentially, it makes it difficult for us to implement parallelization

92
00:08:07,370 --> 00:08:09,790
very efficiently.

93
00:08:10,090 --> 00:08:15,520
And so this makes the training of these kinds of neural networks very difficult.

94
00:08:15,550 --> 00:08:21,280
Now, to tackle the issue with long-term dependencies, attention networks were developed.

95
00:08:21,430 --> 00:08:28,690
So right here, instead of depending on just this final vector here, or rather this final output we

96
00:08:28,690 --> 00:08:36,880
get from this hidden layer here, which is passed on here to relay this information from the source

97
00:08:36,880 --> 00:08:38,830
to the target language.

98
00:08:38,830 --> 00:08:46,060
What we'll do is for each and every unit we have here, each and every recurrent neural network block

99
00:08:46,060 --> 00:08:52,360
we have here, we are going to take into consideration inputs from each and every block here.

100
00:08:52,360 --> 00:08:56,080
So these inputs will be taken into consideration.

101
00:08:56,080 --> 00:09:04,630
So each and every block now you see all of this is passed and then we have this attention layer right

102
00:09:04,630 --> 00:09:15,550
here, which then processes these inputs from these different source RNN blocks such that the layer,

103
00:09:15,550 --> 00:09:23,710
or rather this attention layer, produces an output vector which is now passed as input into this RNN

104
00:09:23,710 --> 00:09:24,400
block.

105
00:09:24,760 --> 00:09:35,650
And so when we have this source and this target, we pass in the source, then we get or we combine

106
00:09:35,650 --> 00:09:44,620
those inputs from each and every RNN block right here, pass in this as input into this RNN block,

107
00:09:44,800 --> 00:09:45,850
get an output.

108
00:09:45,850 --> 00:09:47,110
In this case it's Le.

109
00:09:47,770 --> 00:09:53,080
Then we take this output and pass it as an input in here.

110
00:09:53,290 --> 00:10:00,100
But again, once we shift and get to this time, or this time frame, where we want to get

111
00:10:00,100 --> 00:10:11,200
this second output, what we'll do is we'll have another attention here, which again takes in all these

112
00:10:11,200 --> 00:10:12,280
different inputs.

113
00:10:12,280 --> 00:10:24,730
So we take again all these different inputs, here and here and here and here, then carry out some computations

114
00:10:24,730 --> 00:10:27,190
based on the type of attention we are implementing.

115
00:10:27,190 --> 00:10:32,280
And then from here we get an output which is passed together with this right here.

116
00:10:32,290 --> 00:10:37,600
So from here we get this output, and then we repeat the same step, that's passing in temps, that is,

117
00:10:37,600 --> 00:10:45,460
taking this output, passing it in here, and then also taking in these inputs from those different RNN

118
00:10:45,460 --> 00:10:46,180
blocks.

119
00:10:47,530 --> 00:10:55,720
And so as you can see here for each and every block we have here, it pays attention to each and every

120
00:10:55,720 --> 00:10:56,560
input.
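
One decoder step of this idea can be sketched as follows. It is a minimal sketch under assumptions: the scores are plain dot products between a decoder state and each encoder state, whereas early RNN attention models typically learned the scoring with a small network; the names `attend`, `encoder_states` and `decoder_state` are hypothetical, and the vectors are random.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Combine all encoder states into one context vector for one decoder step."""
    scores = encoder_states @ decoder_state  # one score per source token
    weights = softmax(scores)                # how much attention each source token gets
    context = weights @ encoder_states       # weighted sum of the encoder states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 4))  # one hidden state per source token
decoder_state = rng.normal(size=4)        # state of the decoder block being computed

context, weights = attend(decoder_state, encoder_states)
print(weights.shape, context.shape)  # (5,) (4,)
```

The `weights` vector computed at each decoder step is one row of the attention map; stacking the rows for all output tokens gives the full map.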

121
00:10:57,670 --> 00:11:06,580
And for this or from this, we could even come up with an attention map where we would have this text,

122
00:11:06,610 --> 00:11:07,930
this text in English.

123
00:11:07,930 --> 00:11:10,090
The weather today is amazing.

124
00:11:10,540 --> 00:11:12,830
And then on this other side we have Le temps

125
00:11:12,910 --> 00:11:14,370
aujourd'hui est incroyable.

126
00:11:14,830 --> 00:11:25,690
So now, after training this kind of model, we are able to see how much attention this Le pays

127
00:11:25,690 --> 00:11:27,940
to each and every input here.

128
00:11:28,360 --> 00:11:39,700
And it's logical that this Le will pay the most attention to The, and then temps, which actually means

129
00:11:39,700 --> 00:11:46,360
weather, would pay more attention, or most attention, to weather. Then aujourd'hui will pay most attention

130
00:11:46,360 --> 00:11:54,550
to today, which is what it means; est will pay most attention to is; and incroyable will pay most attention to amazing.

131
00:11:55,780 --> 00:12:02,230
And if we get to this paper entitled Neural Machine Translation by Jointly Learning to Align and Translate,

132
00:12:02,860 --> 00:12:04,470
that is, the famous Bahdanau

133
00:12:04,480 --> 00:12:08,890
et al. paper, you can see some of these attention maps here.

134
00:12:08,890 --> 00:12:10,870
Let's have this.

135
00:12:11,710 --> 00:12:13,630
You see some of these attention maps.

136
00:12:13,630 --> 00:12:23,530
There you see L'accord sur la zone économique européenne a été signé en août 1992,

137
00:12:23,530 --> 00:12:29,610
and then The agreement on the European Economic Area was signed in August 1992.

138
00:12:29,620 --> 00:12:38,350
So you see these attention maps here, where we see clearly which words attend most to one another.

139
00:12:39,280 --> 00:12:43,690
So here we have this image, which shows exactly what we were describing previously.

140
00:12:43,690 --> 00:12:50,920
So here you have these inputs, and then to get this output y_t, you will find that we are going to

141
00:12:50,920 --> 00:12:58,630
take in or we are going to attend to each and every input here and then pass this here to obtain our

142
00:12:58,630 --> 00:12:59,500
y_t.

143
00:12:59,650 --> 00:13:07,270
Now at this point we are going to move on from attention to self-attention, and to better explain

144
00:13:07,270 --> 00:13:13,090
self-attention we'll consider a whole different problem, which is that of sentiment analysis.

145
00:13:13,390 --> 00:13:16,390
So here we have this model.

146
00:13:16,690 --> 00:13:18,070
We could now take this off.

147
00:13:18,070 --> 00:13:25,480
We don't make use of this, although you should note that we still use self-attention in the

148
00:13:25,480 --> 00:13:31,600
translation problems, but it will be easier to grasp this concept in the context of sentiment analysis.

149
00:13:31,750 --> 00:13:35,680
So what we have is, The weather today is amazing,

150
00:13:35,680 --> 00:13:41,800
and we want to be able to say whether this is a positive or negative statement.

151
00:13:43,210 --> 00:13:49,750
So now we have this model which takes in inputs like this, and then let's draw this model here.

152
00:13:49,780 --> 00:13:50,410
Like this.

153
00:13:50,410 --> 00:14:01,450
We have this model and then outputs or tells us whether the statement we've made is a positive or a

154
00:14:01,450 --> 00:14:03,160
negative statement.

155
00:14:03,490 --> 00:14:08,620
Now here, for this self-attention layer,

156
00:14:08,620 --> 00:14:16,480
we are not going to need these recurrent neural network hidden states anymore.

157
00:14:16,750 --> 00:14:22,150
In fact, what we could do is take all this off, actually, because basically we have here this

158
00:14:22,150 --> 00:14:24,640
self-attention module, and we'll see in a minute

159
00:14:25,390 --> 00:14:26,560
how it works.

160
00:14:27,520 --> 00:14:31,480
And then what we're passing in here is some vectors.

161
00:14:31,480 --> 00:14:41,260
So we have this vector, we have this other vector, we have this vector, this one, and finally this

162
00:14:41,260 --> 00:14:41,770
one.

163
00:14:41,830 --> 00:14:45,910
Now, if we combine all this, we'll find out we have a sequence length.

164
00:14:45,910 --> 00:14:47,900
So we have one, two, three, four, five.

165
00:14:48,010 --> 00:14:49,620
Suppose our sequence length is five.

166
00:14:49,630 --> 00:15:01,270
So we have a sequence length by, let's say, embedding dimension matrix, which we get from

167
00:15:01,270 --> 00:15:01,690
here.

168
00:15:02,650 --> 00:15:04,030
Now let me explain.

169
00:15:04,030 --> 00:15:11,080
Let's suppose that the sequence length is five, as we've already seen, and then the embedding dimension

170
00:15:11,080 --> 00:15:12,730
is let's say three.

171
00:15:13,180 --> 00:15:20,950
So we have this five by three matrix, which we are going to pass into this self attention layer right

172
00:15:20,950 --> 00:15:21,490
here.

173
00:15:21,910 --> 00:15:30,040
Now these embeddings, or these vectors which we pass into the self-attention unit, are going to be designed

174
00:15:30,040 --> 00:15:39,070
in a way that words which look alike are going to be close to each other, while words which are opposites

175
00:15:39,070 --> 00:15:41,530
are going to be far away from each other.

176
00:15:41,710 --> 00:15:49,390
Now, since we're working in three dimensions, it means we have one, two, three values.

177
00:15:49,390 --> 00:15:51,460
Here, one, two, three.

178
00:15:52,180 --> 00:15:55,090
And then finally, here's one, two, three.

179
00:15:55,390 --> 00:15:56,170
So let's.

180
00:15:56,170 --> 00:15:57,670
Let's do something like this.

181
00:15:57,670 --> 00:15:58,990
Three dimensions.

182
00:15:59,380 --> 00:16:03,160
What we'll have is the word happy,

183
00:16:04,790 --> 00:16:12,470
which in this case can be represented by this vector, or this embedding, and can be plotted out

184
00:16:12,470 --> 00:16:13,220
like this.

185
00:16:13,220 --> 00:16:23,570
And this will be close to a word like smile, while a word like sad will be far away from

186
00:16:23,570 --> 00:16:27,620
these two words because they are actually opposites of each other.

187
00:16:27,620 --> 00:16:32,720
So we have sad and we could have angry right here.
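
To make the idea of close and far-away embeddings concrete, here is a small sketch. The three-dimensional vectors below are hand-picked for illustration only (real embeddings are learned during training), and cosine similarity is just one common way of measuring closeness, chosen here as an assumption.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-picked toy embeddings (assumptions, not learned values).
embeddings = {
    "happy": np.array([0.9, 0.8, 0.1]),
    "smile": np.array([0.8, 0.9, 0.2]),
    "sad": np.array([-0.9, -0.7, 0.1]),
}

close = cosine_similarity(embeddings["happy"], embeddings["smile"])
far = cosine_similarity(embeddings["happy"], embeddings["sad"])
print(close > far)  # True
```

With these toy values, happy and smile point in nearly the same direction, while happy and sad point in nearly opposite directions.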

188
00:16:33,740 --> 00:16:39,980
Now for this one here, or for this text here, we could pick out these two words which are most likely

189
00:16:39,980 --> 00:16:41,210
to be very close to each other.

190
00:16:41,210 --> 00:16:46,580
We could have the right here, and we have is somewhere around here.

191
00:16:48,110 --> 00:16:56,300
And so now getting back to this model, we have this five by three input which is passed into our self

192
00:16:56,300 --> 00:16:57,350
attention layer.

193
00:16:57,350 --> 00:17:00,740
So let's have this matrix here.

194
00:17:00,740 --> 00:17:06,770
Five by three. We would have the; the word the here would have its own embedding.

195
00:17:06,770 --> 00:17:10,820
So we will have some value, some value, some value.

196
00:17:10,850 --> 00:17:15,980
Let's suppose that we're working with a three-dimensional embedding. And then weather will have its own

197
00:17:15,980 --> 00:17:16,640
value,

198
00:17:16,640 --> 00:17:17,420
its own value,

199
00:17:17,420 --> 00:17:20,660
its own value. Today, its own values.

200
00:17:20,690 --> 00:17:28,970
This value, these values, could be, say, 2.3, 10.5, negative five, one, whatever value, here.

201
00:17:28,970 --> 00:17:32,930
And then you have this and you have this and you have this.

202
00:17:32,930 --> 00:17:36,560
This is four already and then amazing would have its own.

203
00:17:36,560 --> 00:17:41,630
So you see that each and every one of these here has its own embedding.

204
00:17:41,630 --> 00:17:43,940
So these are the different words we have here.

205
00:17:43,940 --> 00:17:50,390
Then at this point, we'll implement a special type of attention known as dot-product attention,

206
00:17:50,390 --> 00:17:58,700
where we'll take this here and multiply it by the transpose of a matrix which has the same shape as

207
00:17:58,700 --> 00:18:00,080
this matrix here.

208
00:18:00,080 --> 00:18:06,800
So we'll take this, we'll call this the query, and then we'll multiply this by the transpose of the

209
00:18:06,800 --> 00:18:07,490
key.

210
00:18:07,700 --> 00:18:13,580
Now this key transposed is going to be three by five, since the key is going to have the same shape as this query.

211
00:18:13,580 --> 00:18:14,780
Now this is our query.

212
00:18:14,780 --> 00:18:16,250
We'll call this a query.

213
00:18:16,250 --> 00:18:26,690
And so here now we have this three by five matrix, and then this product will give us a five by five

214
00:18:26,690 --> 00:18:27,690
matrix.

215
00:18:27,710 --> 00:18:33,540
Now, after this, or after getting this five by five matrix, we could pass this to a softmax layer.

216
00:18:33,560 --> 00:18:39,110
Now we've looked at the softmax layer in previous sessions, but one thing you should note here is

217
00:18:39,110 --> 00:18:46,040
once we have this five by five matrix, it produces this attention map, similar to what we were seeing

218
00:18:46,040 --> 00:18:48,410
before where we have this.

219
00:18:48,410 --> 00:18:53,360
The weather today is amazing on this side, and the same again on this side.

220
00:18:53,360 --> 00:19:01,220
And then words which are most similar to each other in a certain context are going to have the highest

221
00:19:01,220 --> 00:19:02,060
values.

222
00:19:02,060 --> 00:19:10,850
And so if we're in the case where you had say, let's replace this weather by happy and then we have

223
00:19:11,150 --> 00:19:14,090
amazing; let's leave amazing.

224
00:19:14,090 --> 00:19:17,630
So if we have The happy today is amazing, it still doesn't make sense.

225
00:19:17,630 --> 00:19:26,430
But let's consider this. Let's suppose that we have The happy today is amazing. Then in this second row,

226
00:19:26,430 --> 00:19:33,230
fifth column, because amazing would be around here, we would have this value, which is going to be relatively

227
00:19:33,230 --> 00:19:36,380
higher than all the other surrounding values.

228
00:19:36,380 --> 00:19:44,690
And this will be because after training the model, the attention map values would have been modified

229
00:19:44,690 --> 00:19:53,480
such that values or rather words which are similar to one another, take higher values, while words

230
00:19:53,480 --> 00:19:55,820
which are not similar to one another

231
00:19:55,820 --> 00:19:57,560
take very small values.

232
00:19:57,830 --> 00:20:06,620
Now from here we have this five by five matrix, which now, when multiplied by another five by three matrix,

233
00:20:06,620 --> 00:20:11,450
will give us a five by three matrix.

234
00:20:12,170 --> 00:20:18,080
Generally we call this matrix, which is multiplied by this attention matrix, the value.

235
00:20:18,080 --> 00:20:21,920
So we have query, we have the key and we have the value.

236
00:20:22,880 --> 00:20:28,370
With this you see that we have this input which came in here, which was five by three, and now we have

237
00:20:28,370 --> 00:20:32,150
a five by three output.
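
The shape bookkeeping walked through above can be checked with a short NumPy sketch. Assumptions: the embeddings are random, and query, key and value are all set to the same 5 by 3 input with no learned projections, which is the simplest version of what is described here.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # stabilised row-wise softmax
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 tokens, 3-dimensional embeddings (random stand-ins)

Q, K, V = X, X, X                     # simplest case: same input for all three
scores = Q @ K.T                      # (5, 3) @ (3, 5) -> (5, 5)
attention_map = softmax_rows(scores)  # each row sums to 1
output = attention_map @ V            # (5, 5) @ (5, 3) -> (5, 3)

print(scores.shape, attention_map.shape, output.shape)  # (5, 5) (5, 5) (5, 3)
```

Row i of `attention_map` says how much token i attends to every token in the sentence, including itself.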

238
00:20:32,990 --> 00:20:39,830
Then this here we now pass through some fully connected layers, and then we'll have an output, or a fully

239
00:20:39,830 --> 00:20:47,360
connected layer with a one-neuron output, which will tell us whether an input statement is a positive

240
00:20:47,360 --> 00:20:49,580
statement or a negative statement.

241
00:20:50,150 --> 00:20:56,450
And so as you've seen, we've gotten rid completely of the recurrent neural network blocks, as now we're

242
00:20:56,450 --> 00:21:03,680
just making use of these self-attention blocks to extract information from our input.

243
00:21:05,770 --> 00:21:12,790
Now one of the first papers, if not the first paper, which made use of just the attention and getting

244
00:21:12,790 --> 00:21:17,170
rid of the RNNs was this Attention Is All You Need paper.

245
00:21:17,170 --> 00:21:22,130
And it happens to be one of the most influential papers in modern day deep learning.

246
00:21:22,150 --> 00:21:29,110
So here, in this Attention Is All You Need paper, or the Transformer paper, they present this new network,

247
00:21:29,110 --> 00:21:35,700
which you can see just right here. And then a single block,

248
00:21:35,710 --> 00:21:44,200
let's take this off, a single block of those which make up the transformer model, has this multi-head attention.

249
00:21:44,200 --> 00:21:53,770
So as you could see right here, we have this single block and then here we have this multi head attention.

250
00:21:53,770 --> 00:21:55,960
So let's look at this multi-head attention.

251
00:21:55,960 --> 00:21:58,090
This is actually the multi-head attention here.

252
00:21:58,090 --> 00:22:03,370
So you have this here, which is this whole block.

253
00:22:03,370 --> 00:22:10,360
And then in this multi-head attention, you have this scaled dot-product attention, which is this self

254
00:22:10,360 --> 00:22:12,450
attention we just talked about.

255
00:22:12,460 --> 00:22:15,430
You see we have the query, the key and the value.

256
00:22:15,430 --> 00:22:22,570
So since it's self attention, you will notice here that we have those inputs and all these come from

257
00:22:22,570 --> 00:22:23,830
the same input.

258
00:22:23,830 --> 00:22:32,360
So we have this input, which is split up into Q, K and V: query, key and value.

259
00:22:32,380 --> 00:22:42,760
Now this resembles or is analogous to data management systems where data is stored in key value pairs,

260
00:22:42,760 --> 00:22:45,430
just like, say, a Python dictionary.

261
00:22:45,430 --> 00:22:51,340
So you have data in this key value pairs data stored this way.

262
00:22:51,340 --> 00:22:56,700
And then when you want a particular piece of information, you have to pass in a query.

263
00:22:56,710 --> 00:22:59,950
Now, when you pass the query, let's change this color.

264
00:22:59,950 --> 00:23:05,050
When you pass in a query, you have a particular key which is selected.

265
00:23:05,050 --> 00:23:09,760
Once the key is selected, we now obtain the value, which is the data itself.

266
00:23:09,760 --> 00:23:12,760
And it's kind of similar to what we have in here.
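
The analogy can be sketched side by side. A dictionary does a hard lookup: the query must match one key exactly. Attention does a soft lookup: the query is compared with every key, and the values are blended according to how well each key matches. The toy keys, values and query below are made up for illustration.

```python
import numpy as np

# Hard lookup: the query selects exactly one key, and we get its value.
store = {"weather": "sunny", "mood": "happy"}
print(store["mood"])  # happy

# Soft lookup: the query is scored against every key, then values are blended.
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
values = np.array([[10.0, 0.0],
                   [0.0, 20.0]])
query = np.array([0.0, 5.0])  # points in the direction of the second key

scores = keys @ query                   # similarity of the query to each key
weights = np.exp(scores - scores.max())
weights /= weights.sum()                # softmax over the keys
result = weights @ values               # weighted mix of the values
```

Here almost all of the weight lands on the second key, so `result` is very close to the second value, but the first value still contributes a little.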

267
00:23:14,230 --> 00:23:20,650
And then from here, at this level of the split, note that before the information is passed into

268
00:23:20,650 --> 00:23:29,170
this scaled dot-product attention, we actually pass these Q, K and V into some different linear layers.

269
00:23:29,170 --> 00:23:37,720
And so this means that even though we have the same inputs, they will end up being projected into three

270
00:23:37,720 --> 00:23:38,800
different inputs.

271
00:23:38,800 --> 00:23:44,950
And so now we have these Q, K, V, and we are going to carry out Q K transpose.

272
00:23:44,950 --> 00:23:54,490
Here we have the matmul, as we saw already, Q times K transpose, and then we have this scaling, which

273
00:23:54,490 --> 00:23:57,640
you can see right here in this formula, this attention formula.

274
00:23:57,640 --> 00:24:03,850
We have Q K transpose divided by the square root of d_k. Then from here we have the softmax

275
00:24:03,850 --> 00:24:06,760
of all this, and then we multiply by V.
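
Putting the linear projections and the scaling together, the formula from the paper, softmax(Q K^T / sqrt(d_k)) V, can be sketched as below. Assumptions: the projection matrices `W_q`, `W_k`, `W_v` are random stand-ins for the learned linear layers, biases are left out, and the sizes are the toy 5 by 3 ones used earlier.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V and also return the attention weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaling keeps the scores in a moderate range
    weights = softmax_rows(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # the same input feeds all three projections

# Random stand-ins for the three learned linear layers (no bias, for brevity).
W_q, W_k, W_v = (rng.normal(size=(3, 3)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (5, 3) (5, 5)
```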

276
00:24:06,760 --> 00:24:10,150
So let's get back up to this here.

277
00:24:11,170 --> 00:24:12,160
That's fine.

278
00:24:13,480 --> 00:24:18,610
Now that we have this output, you now see that we have this multi head.

279
00:24:18,610 --> 00:24:25,450
So we got this, we have the softmax, we have the matmul, where we take the softmax of this multiplied

280
00:24:25,450 --> 00:24:26,020
by V.

281
00:24:26,020 --> 00:24:31,600
So that's how we get this. Recall how we saw that with the example we had previously.

282
00:24:31,600 --> 00:24:35,260
And then we have this multi head attention.

283
00:24:35,260 --> 00:24:41,860
Now this multi-head attention here simply means you take this here, as you pass in your information like

284
00:24:41,860 --> 00:24:51,400
this, you get this Q, K and V, and then you again pass the same information into this block.

285
00:24:51,400 --> 00:24:55,360
So let's suppose that this is our scaled dot-product attention

286
00:24:55,360 --> 00:24:59,350
block. This is the scaled dot-product attention, which is right inside here.

287
00:24:59,350 --> 00:25:02,680
And so once we have this, let's let's make that smaller.

288
00:25:02,710 --> 00:25:05,050
Let's suppose that this is what we have here.

289
00:25:05,050 --> 00:25:07,300
So this here is actually this.

290
00:25:07,300 --> 00:25:15,790
Now to obtain the multi-head attention, we'll have this other one here, and then we'll have, let's suppose

291
00:25:15,790 --> 00:25:16,990
that we have three heads.

292
00:25:16,990 --> 00:25:22,390
If we have three heads, then we would have three of these stacked in this way. You have one, two; let's

293
00:25:22,390 --> 00:25:24,610
change the color so it becomes clearer.

294
00:25:24,760 --> 00:25:34,180
We have this one in red, we have this next one in blue here, and then we have this other one in green.

295
00:25:34,420 --> 00:25:35,560
So there we go.

296
00:25:35,560 --> 00:25:42,610
We have these three, and then when the information gets in, so you have your Q, you have your K and

297
00:25:42,610 --> 00:25:50,410
you have your V, we pass these to these separate linear layers. See, for each of these, we have some

298
00:25:50,410 --> 00:25:52,750
linear layer here.

299
00:25:53,230 --> 00:25:56,860
All of this came from the same inputs as you could see here.

300
00:25:56,860 --> 00:25:59,130
And then now this information is passed.

301
00:25:59,230 --> 00:26:03,550
So we have Q, K, V passed into this one, into this block here, and then

302
00:26:03,750 --> 00:26:07,560
this same Q, K, V is also passed into...

303
00:26:07,560 --> 00:26:08,880
Let's change the color.

304
00:26:09,180 --> 00:26:13,830
We will now have some others here, some linear layers.

305
00:26:13,920 --> 00:26:16,290
Let's put it beside this.

306
00:26:16,290 --> 00:26:18,600
We have some other linear layers here.

307
00:26:18,600 --> 00:26:23,400
We'll pass V, we'll pass K, we'll pass Q, right here.

308
00:26:23,400 --> 00:26:28,640
And then this now will be sent into this self attention block right here.

309
00:26:28,650 --> 00:26:30,960
Then we also finally have this for the red.

310
00:26:30,960 --> 00:26:34,080
So we'll have something like this red.

311
00:26:34,080 --> 00:26:38,970
We have the K, something like this, we have the V, something like this.

312
00:26:38,970 --> 00:26:44,550
So now this V is passed now into this red here and that's it.

313
00:26:44,550 --> 00:26:50,820
And then the outputs here, the outputs we get at the end of these three self-attention blocks, will

314
00:26:50,820 --> 00:26:55,620
now be concatenated and then passed through a linear layer.

315
00:26:55,830 --> 00:26:59,490
So this linear layer is like our dense layer in TensorFlow.
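
To make the drawing concrete, here is a rough NumPy sketch of three heads, each with its own linear layers for Q, K and V, whose outputs are concatenated and passed through a final linear layer. The sizes, the random weights and the function names are assumptions chosen for illustration, and this is plain NumPy rather than a TensorFlow layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # Attention map: one row per query, one column per key.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def multi_head_attention(x, heads, w_out):
    # Every head receives the same input x but has its own Q, K, V projections.
    outputs = []
    for w_q, w_k, w_v in heads:
        out, _ = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
        outputs.append(out)
    # Concatenate the head outputs, then apply the final linear (dense) layer.
    return np.concatenate(outputs, axis=-1) @ w_out

# Hypothetical sizes: 5 tokens, model width 6, three heads of width 2.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads, d_head = 5, 6, 3, 2
x = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(num_heads)]
w_out = rng.normal(size=(num_heads * d_head, d_model))

y = multi_head_attention(x, heads, w_out)
print(y.shape)  # (5, 6)
```

The output has the same shape as the input, which is what lets us add the input back onto it before the layer normalization.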

316
00:26:59,520 --> 00:27:04,740
Now once we have this, you see we have our multi head attention, which is this block, and then now

317
00:27:04,740 --> 00:27:11,040
we'll take this input, add it onto the output, and then go through a layer normalization.

318
00:27:11,160 --> 00:27:17,340
Then from here we pass this to a feedforward network that's like our fully connected network or dense

319
00:27:17,340 --> 00:27:17,910
layer.

320
00:27:17,910 --> 00:27:23,280
And then we will again repeat this addition and normalization. It looks similar to what we have with the

321
00:27:23,280 --> 00:27:24,200
ResNets.

322
00:27:24,210 --> 00:27:29,850
Now, once we have this, we could then repeat this N times.

323
00:27:31,170 --> 00:27:36,910
Now you'll notice that this is similar to this, except for the fact that now we have these two multi-

324
00:27:36,960 --> 00:27:39,750
head attentions and we also have this mask.

325
00:27:39,750 --> 00:27:41,880
Anyway, we're not going to get into all these details.

326
00:27:41,880 --> 00:27:48,960
What's important is for you to understand how this encoder works. And now that we understand how this works,

327
00:27:48,960 --> 00:27:51,000
we will now get back to our paper,

328
00:27:51,000 --> 00:27:56,040
that is, this paper entitled Transformers for Image Recognition at Scale.

329
00:27:56,550 --> 00:28:02,490
And now you should be able to understand this transformer block, which we presented earlier in this

330
00:28:02,490 --> 00:28:06,990
paper. With this understanding of how this transformer encoder works,

331
00:28:06,990 --> 00:28:14,490
let's now get into this unit here where we break this image into these different patches, as we could

332
00:28:14,490 --> 00:28:22,170
see right here. To better understand how and why we make use of patches right here,

333
00:28:22,170 --> 00:28:28,830
let's not forget that what this transformer encoder takes in is some input sequence.

334
00:28:28,830 --> 00:28:30,960
So we have this input here.

335
00:28:31,560 --> 00:28:39,000
Initially, we had words where each word like this could be represented by this vector or this embedding

336
00:28:39,000 --> 00:28:39,780
vector.

337
00:28:39,780 --> 00:28:46,260
And then this now combined is passed into the transformer here.

338
00:28:46,260 --> 00:28:54,330
Since our input is this image, in order for us to represent it this way, we'll have to break this up.

339
00:28:54,330 --> 00:29:00,510
So what we could do or what we could think of at first sight is we have this image.

340
00:29:00,510 --> 00:29:09,510
Let's suppose the image is 256 by 256 by, say, three: three channels.

341
00:29:09,510 --> 00:29:12,450
Then we could take each and every pixel here.

342
00:29:12,450 --> 00:29:16,560
So let's omit the channels for now.

343
00:29:16,560 --> 00:29:23,700
So what we could have here is for each and every pixel in this 256 by 256 image we would have a vector

344
00:29:23,700 --> 00:29:25,050
representing that pixel.

345
00:29:25,050 --> 00:29:29,610
And then this other one is a vector, this other one is a vector, and so on and so forth.

346
00:29:29,610 --> 00:29:39,600
But don't forget that unlike previously where we had only five words, now we have 256 times 256 words.

347
00:29:39,600 --> 00:29:44,400
Because if we have an image like this and we have to get each and every pixel, then we'll have 256 by

348
00:29:44,430 --> 00:29:51,060
256, which is more than 65,000 different

349
00:29:53,130 --> 00:29:55,650
vectors, which we will have to pass here.

350
00:29:55,650 --> 00:30:04,650
And so before, in our attention model, we had an attention map which was five by five. Recall, we

351
00:30:04,650 --> 00:30:10,980
saw that already with the words: for an input sentence with five words, we had a five-by-

352
00:30:10,980 --> 00:30:12,410
five attention map.

353
00:30:12,420 --> 00:30:20,160
Now we would have a 65,000 by 65,000 attention map.

354
00:30:20,250 --> 00:30:26,910
You see that working with these kinds of matrices in memory isn't very feasible.

355
00:30:26,910 --> 00:30:35,340
And so instead of going pixel by pixel, the authors decide to work patch by patch. Let's create this

356
00:30:35,340 --> 00:30:35,730
again.

357
00:30:35,730 --> 00:30:37,020
So you get to see that.

358
00:30:37,320 --> 00:30:38,610
Take this off.

359
00:30:38,640 --> 00:30:41,910
You see, here we go, patch by patch.

360
00:30:42,120 --> 00:30:48,480
So you could see how this image, instead of taking each pixel, we break this up into patches.

361
00:30:48,480 --> 00:30:50,520
So this is now like a pixel.

362
00:30:50,520 --> 00:30:57,480
And then you see this patch, you see this patch, this patch, this other patch, this patch, and

363
00:30:57,480 --> 00:30:58,290
so on and so forth.

364
00:30:58,290 --> 00:31:00,050
Up to this patch right here.

365
00:31:00,060 --> 00:31:03,540
Now, this is what is like the word

366
00:31:03,620 --> 00:31:08,060
now here. So with images, we have to break this up like this.

367
00:31:08,060 --> 00:31:14,060
And the authors choose to work with 16 by 16 pixel patches.

368
00:31:14,060 --> 00:31:16,250
So each patch is 16 by 16.

369
00:31:17,210 --> 00:31:24,890
And so given that we have 16 by 16, if we have this patch, for example, then we would have 256 different

370
00:31:24,890 --> 00:31:26,310
pixels for each patch.

371
00:31:26,330 --> 00:31:31,220
Here we have 256, here we have 256 and so on and so forth.
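
As a quick check of the arithmetic, this small sketch counts the tokens and the attention map entries for the 256 by 256 image, first pixel by pixel and then with 16 by 16 patches. The helper name attention_cost is made up for this illustration, and the channels are ignored as in the discussion.

```python
def attention_cost(image_side, patch_side):
    """Return (sequence length, number of attention map entries)."""
    tokens_per_side = image_side // patch_side
    seq_len = tokens_per_side ** 2
    # Self-attention compares every token with every other token.
    return seq_len, seq_len ** 2

pixel_tokens, pixel_map = attention_cost(256, 1)    # one token per pixel
patch_tokens, patch_map = attention_cost(256, 16)   # one token per 16x16 patch

print(pixel_tokens, pixel_map)  # 65536 4294967296
print(patch_tokens, patch_map)  # 256 65536
```

So going from pixels to 16 by 16 patches shrinks the attention map from about 4.3 billion entries to 65,536.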

372
00:31:31,250 --> 00:31:39,050
So the images are now in line with the words, where we had five by three: we had five words

373
00:31:39,050 --> 00:31:47,330
and each word was represented by a three-dimensional vector. Here, each patch is represented by a 256-dimensional

374
00:31:47,330 --> 00:31:48,020
vector.

375
00:31:48,050 --> 00:31:51,440
Now, this doesn't mean that in NLP we generally work with this.

376
00:31:51,590 --> 00:31:55,460
This was just done to make it easier for you to understand.

377
00:31:55,640 --> 00:32:03,080
So getting back to computer vision, you see that we have this 256, 256, 256 and so on and so forth.

378
00:32:03,110 --> 00:32:13,280
Now, when working with the transformer, we may not want to work with these 256-dimensional vectors.

379
00:32:13,280 --> 00:32:19,090
Maybe we want to work with, say, 512 dimensional vectors.

380
00:32:19,100 --> 00:32:25,940
In that case, we would have to do this linear projection of the flattened patches such that we go

381
00:32:25,940 --> 00:32:27,230
from this...

382
00:32:28,010 --> 00:32:31,820
Say, let's suppose that we have one, two, three, we have nine patches, so the sequence length is

383
00:32:31,820 --> 00:32:32,150
nine.

384
00:32:32,150 --> 00:32:35,570
So we have this input which is nine by 256.

385
00:32:35,570 --> 00:32:43,130
And then after going through this linear projection, we now get to nine by 512 and this will be the

386
00:32:43,130 --> 00:32:46,250
embedding dimension for our transformer.

387
00:32:46,250 --> 00:32:49,370
In the previous example, our embedding dimension was three.

388
00:32:49,580 --> 00:32:56,780
So this permits us to work flexibly, as now we could decide on what size we want for

389
00:32:56,780 --> 00:32:58,220
our embedding dimension.
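
Here is a minimal NumPy sketch of this step: cutting an image into 16 by 16 patches, flattening each one into a 256-dimensional vector, and linearly projecting to 512 dimensions. The 48 by 48 single-channel image is an assumption picked so that we get exactly nine patches, and the projection weights are random stand-ins for weights that would be learned.

```python
import numpy as np

def image_to_patches(image, patch):
    """Split an (H, W) image into flattened, non-overlapping patch vectors."""
    h, w = image.shape
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch) -> (rows, cols, patch, patch)
    blocks = image.reshape(rows, patch, cols, patch).transpose(0, 2, 1, 3)
    # One row per patch, in left-to-right, top-to-bottom order.
    return blocks.reshape(rows * cols, patch * patch)

rng = np.random.default_rng(0)
image = rng.random((48, 48))               # channels omitted, as in the lecture
patches = image_to_patches(image, 16)      # nine patches of 256 pixels each

# Linear projection of the flattened patches (random here, learned in practice).
projection = rng.normal(size=(256, 512)) * 0.02
embeddings = patches @ projection

print(patches.shape, embeddings.shape)  # (9, 256) (9, 512)
```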

390
00:32:58,400 --> 00:33:05,150
Now that said, we have this output, say nine by 512, and then we're ready to pass this into the transformer

391
00:33:05,150 --> 00:33:05,780
encoder.

392
00:33:05,780 --> 00:33:11,870
But just before passing this, we'll add these position embeddings.

393
00:33:11,870 --> 00:33:17,360
You see there we have these inputs, you see, in this different color; you have them getting in, and

394
00:33:17,360 --> 00:33:22,650
then we have these position embeddings here. Notice zero, one, two, three, and up to nine.

395
00:33:22,670 --> 00:33:25,760
Now the way this works... let's start first with

396
00:33:25,760 --> 00:33:33,470
the reason why we even have to do this. It's because, with the convnets, the way the convolutional

397
00:33:33,470 --> 00:33:40,220
neural networks work is that, for computing the feature maps, they take

398
00:33:40,220 --> 00:33:41,960
into consideration locality.

399
00:33:41,960 --> 00:33:51,860
So this means that, you see, these two portions here, when passed through a conv filter, will produce a

400
00:33:51,860 --> 00:33:52,970
certain output.

401
00:33:53,120 --> 00:34:02,630
And so this means that pixels which belong to a certain or to a small locality like this one will be

402
00:34:02,630 --> 00:34:04,610
used to produce the output.

403
00:34:05,390 --> 00:34:15,410
And this clearly gives the CNNs an upper hand over the transformers, as when trying to understand

404
00:34:15,410 --> 00:34:20,900
an image, the positions of particular pixels actually matter.

405
00:34:21,230 --> 00:34:29,420
So this means that CNNs already have an inductive bias due to the way they actually work.

406
00:34:29,720 --> 00:34:37,490
And so to give a helping hand to the transformer network, we'll need this position embedding,

407
00:34:38,540 --> 00:34:46,460
which gives this transformer encoder an idea of the location of each and every patch which is passed

408
00:34:46,460 --> 00:34:47,060
in.

409
00:34:48,080 --> 00:34:55,100
But again, it should be noted that this will have to be learned automatically by the model.

410
00:34:55,670 --> 00:35:02,180
Now, notice we have this extra input right here, and the reason why we have this extra input

411
00:35:02,180 --> 00:35:08,630
is simply because we do not want a situation where, after going through this encoder, or this transformer

412
00:35:08,630 --> 00:35:16,070
encoder right here, we pick one of these outputs, because we will have outputs here.

413
00:35:16,070 --> 00:35:21,980
We don't want to pick one of these outputs to be used for the MLP head, or to be used

414
00:35:21,980 --> 00:35:29,090
for this fully connected network in this classification unit right here.

415
00:35:29,240 --> 00:35:39,650
So to avoid this sort of bias where we would be picking one of these, the authors add this extra learnable

416
00:35:39,650 --> 00:35:48,350
class embedding right here, whose output will be passed into this MLP head, and

417
00:35:48,350 --> 00:35:50,660
then will be used for classification.
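
A small NumPy sketch of how the encoder input can be assembled: the class embedding goes in front of the nine patch embeddings, and position embeddings 0 to 9 are added on top. All the values here are random or zero placeholders for parameters that would be learned, and the encoder itself is replaced by a stand-in, so this only shows the shapes and the indexing.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 9, 512

patch_embeddings = rng.normal(size=(num_patches, dim))

# Extra learnable class embedding (zeros here only as a placeholder value).
class_token = np.zeros((1, dim))

# One position embedding per token: index 0 for the class token, 1..9 for patches.
position_embeddings = rng.normal(size=(num_patches + 1, dim)) * 0.02

tokens = np.concatenate([class_token, patch_embeddings], axis=0)
encoder_input = tokens + position_embeddings

# Stand-in for the transformer encoder, which keeps the (10, 512) shape.
encoder_output = encoder_input

# Only the output at the class token position is handed to the MLP head.
class_representation = encoder_output[0]

print(encoder_input.shape, class_representation.shape)  # (10, 512) (512,)
```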

418
00:35:51,440 --> 00:35:58,040
Another important point to note here is that the transformer encoder, or these vision

419
00:35:58,040 --> 00:36:03,020
transformers, can be some sort of hybrid architecture.

420
00:36:03,020 --> 00:36:03,130
That's

421
00:36:03,190 --> 00:36:10,180
because we may decide not to pass in these image patches directly, but instead pass these image patches

422
00:36:10,180 --> 00:36:17,710
to a convolutional neural network, then get the output embeddings and pass those in here directly instead

423
00:36:17,710 --> 00:36:18,970
of these image patches.

424
00:36:19,000 --> 00:36:26,710
It should be noted that the multilayer perceptron contains two fully connected layers with a GELU

425
00:36:26,710 --> 00:36:31,300
nonlinearity. Here's the general GELU nonlinearity,

426
00:36:31,870 --> 00:36:33,910
compared to the ReLU and the ELU.

427
00:36:33,940 --> 00:36:41,410
So you see we have this ReLU, where all values less than zero give an output of zero

428
00:36:41,410 --> 00:36:45,490
and all values greater than zero give the exact same value.

429
00:36:45,520 --> 00:36:50,920
But with the GELU, we have this curved function right here.
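
To see the difference in numbers, here is a short sketch of the two functions using only the Python standard library. It uses the exact form of GELU, x times the standard normal CDF of x; some libraries use a tanh approximation instead, so treat this as one common definition rather than the one any particular framework applies.

```python
import math

def relu(x):
    # Zero for negative inputs, identity for positive inputs.
    return max(0.0, x)

def gelu(x):
    # x * Phi(x), where Phi is the standard normal cumulative distribution.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

Unlike the ReLU, the GELU lets small negative values through slightly below zero, and it approaches the identity for large positive inputs.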

430
00:36:51,820 --> 00:36:53,110
So that's it.

431
00:36:53,860 --> 00:37:01,000
The type of normalization is the layer normalization, as we mentioned already. And the layer normalization

432
00:37:01,030 --> 00:37:08,440
here, we could visualize this in this paper by Shen et al. entitled PowerNorm: Rethinking Batch Normalization

433
00:37:08,440 --> 00:37:12,700
in Transformers, where you see... let's zoom this.

434
00:37:13,270 --> 00:37:18,900
You see, we have layer normalization here and we have the batch normalization put side by side.

435
00:37:18,910 --> 00:37:22,060
So with the layer normalization, that's what we're saying.

436
00:37:22,540 --> 00:37:27,970
If you consider some inputs, let's say here we have the sequence length or the sequence dimension,

437
00:37:27,970 --> 00:37:32,080
we have the features or the embeddings or like a vector actually.

438
00:37:32,080 --> 00:37:35,830
So we have the different vectors and then we have the batch dimension.

439
00:37:35,830 --> 00:37:43,180
So basically what we're saying is we have this sequence length or we have these different vectors here

440
00:37:44,110 --> 00:37:48,190
which have been passed into some layer.

441
00:37:48,460 --> 00:37:56,230
And then instead of carrying out normalization across the batch, as is

442
00:37:56,230 --> 00:38:03,770
the case with the batch norm, here we carry out this normalization for each and every vector.
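
The difference comes down to which axes the mean and variance are computed over. In this NumPy sketch the input has shape (batch, sequence, features); the sizes are arbitrary assumptions, and the learnable scale and shift that real layers carry are left out to keep it short.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics per vector: over the feature axis only.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # Statistics per feature: over the batch and sequence axes.
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 9, 8))  # batch, sequence, features

ln = layer_norm(x)
bn = batch_norm(x)

print(ln.mean(axis=-1).shape)      # (4, 9): one mean per vector
print(bn.mean(axis=(0, 1)).shape)  # (8,): one mean per feature
```

With the layer norm, each vector is normalized on its own, so the result does not depend on what else is in the batch.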

443
00:38:05,080 --> 00:38:12,520
And the reason why we do not use the batch norm with the transformers is the fact that the batch statistics

444
00:38:12,520 --> 00:38:22,720
for NLP data have a very large variance throughout training, and this variance exists in the corresponding

445
00:38:22,720 --> 00:38:24,010
gradients as well.

446
00:38:24,250 --> 00:38:31,840
And so to avoid this kind of situation, it's preferable for us to carry out this normalization on the

447
00:38:31,840 --> 00:38:32,500
features

448
00:38:32,500 --> 00:38:33,070
instead.

449
00:38:34,120 --> 00:38:40,750
Before we move on to the experiments, let's look at how the ViTs are being used in the real world.

450
00:38:40,930 --> 00:38:48,940
So actually the ViTs are pre-trained on very large data sets and fine-tuned to smaller downstream

451
00:38:48,940 --> 00:38:49,720
tasks.

452
00:38:50,560 --> 00:38:58,840
Obviously, when fine-tuning we remove this head and replace it with a head which now corresponds to our

453
00:38:58,840 --> 00:38:59,950
number of classes.

454
00:38:59,950 --> 00:39:07,990
So this means that initially we may have a 1000-class head, and then we move from that to K classes, or let's

455
00:39:07,990 --> 00:39:09,370
say a three-class head.

456
00:39:12,770 --> 00:39:18,140
To better understand why we have a D by K output,

457
00:39:18,140 --> 00:39:20,090
let's get back here.

458
00:39:20,090 --> 00:39:27,210
So after those inputs have been passed in here, we have an output of sequence length plus one, this plus

459
00:39:27,210 --> 00:39:28,130
this one here.

460
00:39:29,300 --> 00:39:30,410
Let's just say we have.

461
00:39:30,410 --> 00:39:38,900
Yeah, we have, say, from here, one by D for the output; if we're considering the whole sequence length, it will

462
00:39:38,900 --> 00:39:41,420
be sequence length by D for the output.

463
00:39:41,420 --> 00:39:50,980
This D here is our embedding dimension, which we had fixed from this linear projection right here.

464
00:39:50,990 --> 00:39:56,780
So we have this one by D, and then we pass this through.

465
00:39:56,780 --> 00:40:01,820
Obviously it becomes, it becomes like simply the neurons.

466
00:40:01,820 --> 00:40:10,820
So we have one, two... we now have D neurons, since it's just one by D, and then we have this output.

467
00:40:10,820 --> 00:40:18,530
Let's say we have 1000 classes; then we'll have this fully connected layer which brings all these here,

468
00:40:18,560 --> 00:40:26,470
these D inputs, to these K outputs, or in this case to these 1000 outputs.

469
00:40:26,480 --> 00:40:31,760
Now when we want to fine-tune, we are going to take this off, take this

470
00:40:31,760 --> 00:40:35,360
off and replace this now with K outputs.

471
00:40:35,540 --> 00:40:43,820
So we now have K outputs right here, and then we initialize the weights of this fully connected layer.

472
00:40:44,570 --> 00:40:51,680
The authors also make mention of the fact that during fine-tuning it is better to work at higher resolutions.

473
00:40:51,680 --> 00:41:02,750
So this means that the model could be trained at 256 by 256 and then later on fine tuned with 512 by

474
00:41:02,750 --> 00:41:05,360
512 images.

475
00:41:07,330 --> 00:41:13,660
And then since they keep the patch size the same, that results in a larger effective sequence length.

476
00:41:13,690 --> 00:41:18,260
Now let's explain, or let's visualize, this statement.

477
00:41:18,280 --> 00:41:24,800
So here we have this input, which is say 48 by 48.

478
00:41:24,820 --> 00:41:28,750
Let's say we have here 48 by 48.

479
00:41:28,750 --> 00:41:34,610
And when we divide this or we break this up into three parts, we have 16, 16, 16, and 16, 16 and 16.

480
00:41:34,630 --> 00:41:36,940
So we have 16 by 16 patches.

481
00:41:36,970 --> 00:41:46,750
Now, if we want to fine-tune on a higher-resolution image, then let's say the higher-resolution image is,

482
00:41:46,750 --> 00:41:50,350
say 96 by 96, so we could have something like this.

483
00:41:50,350 --> 00:41:58,990
So if now we're fine-tuning on the 96 by 96 image, and we still maintain the fact that these here,

484
00:41:58,990 --> 00:42:04,600
or the patches will be 16 by 16, then this means that instead of three here we're going to have six.

485
00:42:04,600 --> 00:42:16,840
So we now have one, two, three, four, five and six patches; six patches this way: two, three,

486
00:42:16,840 --> 00:42:20,020
four, five, six and so on and so forth.

487
00:42:20,050 --> 00:42:26,800
So now we're going to have 36 different patches instead of nine patches as we have here.

488
00:42:26,800 --> 00:42:31,600
And that's why they make mention of the fact that the sequence length is going to be increased.
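
The same count can be written down in a couple of lines. The helper name num_patches is made up for this sketch, and it assumes square images whose side is a multiple of the patch size.

```python
def num_patches(image_side, patch_side=16):
    """Sequence length (without the class token) for a square image."""
    per_side = image_side // patch_side
    return per_side * per_side

print(num_patches(48))  # 9  (3 x 3 patches)
print(num_patches(96))  # 36 (6 x 6 patches)
```

Doubling the resolution while keeping the patch size fixed multiplies the sequence length by four.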

489
00:42:32,380 --> 00:42:36,430
And that's so long as they can fit in the memory.

490
00:42:38,480 --> 00:42:44,840
Now due to these modifications, the pre-trained (that's what we had before) position embeddings

491
00:42:44,840 --> 00:42:46,670
may no longer be meaningful.

492
00:42:47,300 --> 00:42:52,850
So they therefore perform 2D interpolation of the pre-trained position embeddings according to

493
00:42:52,850 --> 00:42:55,640
the location in the original image.
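
One way to picture this: arrange the nine pre-trained patch position embeddings on a 3 by 3 grid and resample that grid to 6 by 6. The sketch below uses separable linear interpolation with np.interp and keeps the corner embeddings fixed; that particular scheme, the tiny embedding size and the function name are assumptions for illustration, not necessarily what the authors' code does. The class token's position embedding would be kept aside and left unchanged.

```python
import numpy as np

def resize_position_grid(pos, new_side):
    """Bilinearly resample an (old_side, old_side, dim) grid of embeddings."""
    old_side, _, dim = pos.shape
    old_coords = np.linspace(0.0, 1.0, old_side)
    new_coords = np.linspace(0.0, 1.0, new_side)

    # First interpolate along the row axis...
    tmp = np.empty((new_side, old_side, dim))
    for c in range(old_side):
        for d in range(dim):
            tmp[:, c, d] = np.interp(new_coords, old_coords, pos[:, c, d])

    # ...then along the column axis.
    out = np.empty((new_side, new_side, dim))
    for r in range(new_side):
        for d in range(dim):
            out[r, :, d] = np.interp(new_coords, old_coords, tmp[r, :, d])
    return out

rng = np.random.default_rng(0)
pretrained = rng.normal(size=(3, 3, 4))   # 9 patch positions, toy dimension 4
resized = resize_position_grid(pretrained, 6)

print(resized.shape)                            # (6, 6, 4)
print(resized.reshape(36, 4).shape)             # (36, 4): one per new patch
```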

494
00:42:56,390 --> 00:42:59,930
For the experiments, here we could see the different models.

495
00:42:59,930 --> 00:43:06,680
They have the ViT-Base, the ViT-Large and the ViT-Huge, with the number of parameters from 86 million to 632 million.

496
00:43:06,680 --> 00:43:09,470
Then yeah, we have 12 layers.

497
00:43:09,470 --> 00:43:11,900
Recall, let's get back here.

498
00:43:12,080 --> 00:43:15,320
Recall we had this number of layers here.

499
00:43:15,320 --> 00:43:20,840
So basically you're repeating this, you're repeating this here 12 times.

500
00:43:21,170 --> 00:43:22,850
So we get back.

501
00:43:22,850 --> 00:43:31,900
So we have 12 layers for the Base as stated here, and then we have 24 for the Large and 32 for the

502
00:43:31,910 --> 00:43:32,510
Huge.

503
00:43:32,530 --> 00:43:44,510
Then this hidden size, this D, this embedding dimension, is 768 for Base, 1024 for Large and 1280 for Huge.

504
00:43:45,320 --> 00:43:55,970
The MLP size, that's for the fully connected layers: 3072, 4096, 5120.

505
00:43:55,970 --> 00:44:02,360
Then the number of heads, remember the attention heads: 12, 16, 16. So this MLP size here, they're talking

506
00:44:02,360 --> 00:44:05,180
about this MLP. Recall,

507
00:44:05,180 --> 00:44:10,280
we have this MLP right here.

508
00:44:10,280 --> 00:44:15,800
And so this MLP is made of two fully connected layers like this.

509
00:44:15,950 --> 00:44:21,020
Then depending on the size, you have a certain number of neurons.

510
00:44:21,020 --> 00:44:30,500
Their experiments were carried out on this JFT-300M data set, and we see how this 14 by 14 patch

511
00:44:30,500 --> 00:44:35,600
version of the ViT outperforms this ResNet-152.

512
00:44:35,930 --> 00:44:42,740
Now this performance, although not largely greater than that of the ResNets,

513
00:44:42,740 --> 00:44:48,140
requires fewer computational resources to train.

514
00:44:48,140 --> 00:44:58,490
As we see here, we have 2500 TPU core days required to train this model, as compared to this one, which

515
00:44:58,490 --> 00:45:03,050
requires 9900 TPU core days.

516
00:45:03,620 --> 00:45:11,450
Also from these plots, you see that when you increase the number of pre training samples, the model

517
00:45:11,450 --> 00:45:15,320
which performs the best is this ViT right here.

518
00:45:15,350 --> 00:45:22,970
So here we have this ViT, and this outperforms the ResNets, whereas for a reduced number of samples,

519
00:45:22,970 --> 00:45:25,670
the ResNets outperform the ViTs.

520
00:45:26,510 --> 00:45:35,720
While here, the smaller the patch size (like here we have this 14 by 14 patch size), the better

521
00:45:35,720 --> 00:45:36,800
the results.

522
00:45:37,490 --> 00:45:45,740
Now, in order to understand the reason why, as you increase this data set size, the ViTs start

523
00:45:45,740 --> 00:45:48,500
to outperform the convnets.

524
00:45:49,220 --> 00:45:56,090
We have to recall that when working with convnets like the ResNet, there is some inductive bias

525
00:45:56,090 --> 00:46:06,530
in the sense that the fact that this ResNet takes as input this two-dimensional image already gives

526
00:46:06,530 --> 00:46:12,470
this convnet a helping hand when it comes to extracting features from here.

527
00:46:13,850 --> 00:46:23,270
And so even with relatively smaller data sets, these convnets can make sense out of this input image.

528
00:46:23,780 --> 00:46:33,560
Now, with the Transformers, which are some sort of generic neural network, the model doesn't get

529
00:46:33,560 --> 00:46:37,520
to see the image in its natural form.

530
00:46:37,520 --> 00:46:42,590
What it sees is some patches which have been converted to some vectors.

531
00:46:43,670 --> 00:46:54,080
And so at the very beginning, or with small data, the transformer model finds it difficult to make

532
00:46:54,080 --> 00:46:57,380
much sense out of these patches.

533
00:46:58,610 --> 00:47:07,940
But as soon as we increase this data set to considerable amounts, this transformer model, now free

534
00:47:07,940 --> 00:47:14,150
of the inductive bias, can even do better than the convnets.

535
00:47:14,960 --> 00:47:23,150
And interestingly enough, you notice that after training a transformer model, these position embeddings

536
00:47:23,150 --> 00:47:29,420
(recall the position embeddings, which are added onto the patch embeddings before passing to the transformer)

537
00:47:30,200 --> 00:47:37,220
actually learn on their own to encode the position of the patches.

538
00:47:37,530 --> 00:47:43,200
You could see from this plot here where we have the input patch row and the input patch column.

539
00:47:43,230 --> 00:47:52,860
You see that for this one, one, you see the position; this is gotten by the model, or this is learned automatically

540
00:47:52,860 --> 00:47:54,950
by the model during the training process.

541
00:47:54,960 --> 00:48:02,190
You see two, one: it goes a step in this direction and maintains this direction, or maintains a row.

542
00:48:02,190 --> 00:48:04,070
And you see three.

543
00:48:04,110 --> 00:48:06,690
One maintains a row that goes three steps.

544
00:48:06,690 --> 00:48:09,660
You see this, you see, you see that.

545
00:48:09,720 --> 00:48:11,700
And then finally, here you go.

546
00:48:11,700 --> 00:48:16,140
You go seven steps to the right and then seven steps downward.

547
00:48:17,040 --> 00:48:20,190
Then here to the left, you could look at this.

548
00:48:20,190 --> 00:48:24,990
You see these embedding filters right here.

549
00:48:24,990 --> 00:48:32,640
We have these embedding filters which we see here, which look much like the convnet filters.

550
00:48:32,820 --> 00:48:42,720
Then to the right, you have this plot which summarizes the reason why the ViTs end up being more

551
00:48:42,720 --> 00:48:44,490
powerful than the convnets.

552
00:48:44,760 --> 00:48:52,050
To understand this, let's take this here: we will consider a convnet with a given depth.

553
00:48:52,410 --> 00:48:57,840
Now, with the convnets, the initial layers... let's have this convnet and we break this

554
00:48:57,840 --> 00:48:58,560
up.

555
00:48:58,560 --> 00:48:59,970
So we break this up.

556
00:48:59,970 --> 00:49:03,360
We have our initial layers and then we have our final layers.

557
00:49:03,360 --> 00:49:09,900
These initial layers permit us to extract low-level features, while the final layers permit us to extract

558
00:49:09,900 --> 00:49:11,040
high level features.

559
00:49:11,040 --> 00:49:19,950
And so if we have an image like this, like this one, and we have this head and we have this, then

560
00:49:21,090 --> 00:49:28,080
given that we're passing these filters, or these convnet filters, here, you'll see that this pixel, for

561
00:49:28,080 --> 00:49:36,210
example, attends to these other pixels which are found around its locality.

562
00:49:36,750 --> 00:49:44,040
And then as we go deeper in the network, we would have this pixel here which now tries to attend to

563
00:49:44,040 --> 00:49:51,930
this other pixel here, which is much farther away from it. To better picture this,

564
00:49:51,930 --> 00:50:00,690
remember the example we took of two three-by-three filters compared to a single five-

565
00:50:00,720 --> 00:50:02,510
by five filter.

566
00:50:02,520 --> 00:50:04,440
Let's draw this here.

567
00:50:04,440 --> 00:50:07,830
We compare this with a single five by five filter.

568
00:50:07,830 --> 00:50:16,800
And we saw that although that five-by-five filter had a larger receptive field compared to a single

569
00:50:16,800 --> 00:50:23,730
three-by-three filter, which we have here, stacking up these three-by-three filters, that

570
00:50:23,730 --> 00:50:33,210
is, making our network deeper, permitted us to still be able to capture this part of the image.
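
The receptive field argument can be checked with a tiny calculation. This sketch assumes stride 1 and no dilation, in which case each k by k layer widens the receptive field by k minus 1 pixels; the function name is made up for this illustration.

```python
def receptive_field(kernel_sizes):
    """Receptive field side length of stacked stride-1, undilated conv layers."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

print(receptive_field([3]))        # 3
print(receptive_field([3, 3]))     # 5, same as a single 5x5 filter
print(receptive_field([5]))        # 5
print(receptive_field([3] * 10))   # 21
```

So a convnet only sees far-apart pixels together once enough layers have been stacked, which is the local-to-global growth being described.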

571
00:50:34,530 --> 00:50:42,360
And so this shows us that with the convnets, in the earlier layers, when the network is not yet deep enough,

572
00:50:42,360 --> 00:50:48,420
we are still capturing this local information.

573
00:50:48,420 --> 00:50:54,300
And then as we go deeper, we start capturing much more global information.

574
00:50:55,740 --> 00:51:02,580
And so if we were to have this kind of plot here, where here we have mean attention distance, and here

575
00:51:02,640 --> 00:51:10,440
we have the network depth, we would see that for a convnet, this will keep increasing up to a point

576
00:51:10,440 --> 00:51:18,560
where we will no longer be able to continue increasing, because as this network depth increases,

577
00:51:18,560 --> 00:51:21,120
that is, as we increase the number of layers,

578
00:51:21,660 --> 00:51:25,470
we are able to capture much more global features.

579
00:51:25,470 --> 00:51:31,800
And so this mean attention distance keeps increasing.

580
00:51:33,660 --> 00:51:42,390
But with the attention or with the transformers, since each patch attends to each and every other patch,

581
00:51:43,020 --> 00:51:49,590
as we have seen already with the self attention, each and every patch will attend to the other right

582
00:51:49,590 --> 00:51:53,010
from the very first attention layer.

583
00:51:53,160 --> 00:51:58,650
We are not going to have this, but instead this plot we have here.

584
00:52:00,000 --> 00:52:07,260
And so this means that if we train our ViT with a very large data set, right from the very first layers,

585
00:52:07,260 --> 00:52:16,590
we are able to capture both the local and the global features.

586
00:52:16,830 --> 00:52:25,590
And this is what makes the ViTs more powerful compared to the convnets when we work with big data.

587
00:52:26,640 --> 00:52:27,150
Yeah.

588
00:52:27,150 --> 00:52:32,040
We can also visualize what the model sees by looking at these attention maps.

589
00:52:32,040 --> 00:52:37,290
You'll notice that after training the model you see we have this attention here.

590
00:52:37,530 --> 00:52:43,260
See here pixels which pay much attention to one another.

591
00:52:43,350 --> 00:52:50,550
These pixels are paying attention or much more attention to one another as compared to the other pixels.

592
00:52:50,880 --> 00:52:59,400
In summary, to understand or to visualize what goes on when training a CNN and a ViT model side by

593
00:52:59,400 --> 00:53:00,120
side,

594
00:53:00,150 --> 00:53:06,750
you will see here that with the ViTs, as you increase this data size... this is increasing data size, and this

595
00:53:06,750 --> 00:53:08,540
is increasing data size here.

596
00:53:08,550 --> 00:53:15,150
So as we increase the data size, or rather when we start with small data sizes, we have this kind

597
00:53:15,150 --> 00:53:16,230
of accuracy.

598
00:53:16,230 --> 00:53:23,000
While for the CNNs we already have reasonable accuracies, even with small data size.

599
00:53:23,010 --> 00:53:29,490
And then as we keep increasing this data size, as we keep increasing this data size, you see this

600
00:53:29,490 --> 00:53:39,900
accuracy keeps increasing, while the CNNs start to plateau at some point, and this plateauing simply

601
00:53:39,900 --> 00:53:49,050
comes from the fact that these CNNs here are limited by the inductive biases, whereas these transformers,

602
00:53:49,050 --> 00:53:57,660
which are more generic neural networks, are free to learn even better from these larger data sets.
