1
00:00:00,180 --> 00:00:05,640
Hello, everyone, and welcome to this new and exciting session in which we are going to be looking

2
00:00:05,640 --> 00:00:09,840
at how to fine-tune an already trained transformer.

3
00:00:09,840 --> 00:00:16,560
More precisely, a Vision Transformer model, using the Hugging Face library and TensorFlow. Hugging

4
00:00:16,560 --> 00:00:16,830
Face

5
00:00:16,830 --> 00:00:22,920
today is at the forefront of practical AI, and it permits practitioners around the world to build,

6
00:00:22,920 --> 00:00:26,370
train and deploy state of the art models very easily.

7
00:00:26,640 --> 00:00:31,500
It's also used by thousands of organizations and teams around the world.

8
00:00:31,860 --> 00:00:39,210
So here we have a host of different tasks which we could solve with readily available Hugging Face models

9
00:00:39,210 --> 00:00:44,430
like audio classification, image classification, object detection, question answering, summarization,

10
00:00:44,430 --> 00:00:47,430
text classification, and translation.

11
00:00:47,460 --> 00:00:50,790
In our specific case, we are dealing with image classification.

12
00:00:51,360 --> 00:00:59,130
And if you click open right here, you see we have this image classification page where you could already

13
00:00:59,130 --> 00:01:02,670
test an image on this ViT model here.

14
00:01:02,670 --> 00:01:11,580
So this is the ViT base model with patch size 16 and image size 224.

15
00:01:12,240 --> 00:01:21,690
We also have this Facebook DeiT base distilled model with patch size 16 and image size 224.

16
00:01:21,720 --> 00:01:25,470
You could browse a host of other image classification models here.

17
00:01:25,470 --> 00:01:30,630
As you could see here we have the different models sorted in order of number of downloads.

18
00:01:30,660 --> 00:01:35,700
See, this model was downloaded 219,000 times.

19
00:01:35,820 --> 00:01:40,400
Now we'll be working with this ViT, or rather fine-tuning this ViT by Google, right here.

20
00:01:40,410 --> 00:01:44,160
Here we have the model description so you can have that.

21
00:01:44,160 --> 00:01:51,120
We've seen this already in the paper, the intended uses and limitations, and then how to use this

22
00:01:51,120 --> 00:01:53,130
model without any fine tuning.

23
00:01:53,130 --> 00:02:02,460
So here you could pass in your image and then already run classification on this image using this ViT

24
00:02:02,490 --> 00:02:05,860
model right here, the ViTForImageClassification model.

25
00:02:05,880 --> 00:02:12,240
Now here you will notice that this supposes we are dealing with a PyTorch model.

26
00:02:12,240 --> 00:02:19,680
So we could check in the documentation here, where we'll see this ViT model on the left side together

27
00:02:19,680 --> 00:02:23,880
with many other different models which are available for free.

28
00:02:23,880 --> 00:02:27,150
And you see here we have this ViT.

29
00:02:27,210 --> 00:02:32,040
You could also check out the DeiT, the distillation transformers.

30
00:02:32,250 --> 00:02:33,600
Let's get to D.

31
00:02:33,600 --> 00:02:36,930
We should have DeiT around here.

32
00:02:36,930 --> 00:02:37,650
There we go.

33
00:02:37,650 --> 00:02:38,850
Here, the DeiT.

34
00:02:38,880 --> 00:02:42,990
You see, you have the DeiT and its documentation right here.

35
00:02:42,990 --> 00:02:46,140
So we also have the Swin, which we have seen.

36
00:02:46,140 --> 00:02:49,890
We looked at the Swin previously, so we could check out the Swin somewhere.

37
00:02:49,890 --> 00:02:54,300
Here, Swin as well.

38
00:02:54,300 --> 00:02:55,410
We are.

39
00:02:56,510 --> 00:02:59,630
See, you have the Swin Transformer around here.

40
00:02:59,630 --> 00:03:00,710
So that's it.

41
00:03:00,740 --> 00:03:02,930
See, the Swin Transformer we had seen already.

42
00:03:02,960 --> 00:03:08,290
Now let's get back to our ViT, the Vision Transformer. Okay.

43
00:03:08,300 --> 00:03:10,940
So we have this vision transformer right here.

44
00:03:11,060 --> 00:03:16,790
And here you could go through the whole documentation.

45
00:03:16,790 --> 00:03:22,490
So we have the ViTConfig and ViTFeatureExtractor; laid out like this, it becomes clearer.

46
00:03:23,060 --> 00:03:23,990
There we go.

47
00:03:23,990 --> 00:03:30,290
We have this feature extractor, the ViTModel, ViTForMaskedImageModeling, ViTForImageClassification.

48
00:03:30,290 --> 00:03:32,690
And you'll notice here we have TFViTModel.

49
00:03:32,690 --> 00:03:39,710
So the difference is that this is a PyTorch model and this is a TensorFlow model, which we'll

50
00:03:39,710 --> 00:03:44,810
be using now: the TFViTModel and TFViTForImageClassification.

51
00:03:44,840 --> 00:03:48,710
Now recall that you have your ViT model, which starts from the patches.

52
00:03:48,710 --> 00:03:49,810
You have the patch.

53
00:03:49,820 --> 00:03:52,340
Once you have the patch, you have the transformer encoder.

54
00:03:52,340 --> 00:04:01,670
And then from here you have this MLP head, and then you have your output right here.

55
00:04:01,670 --> 00:04:07,520
Now with this TFViTForImageClassification, we have all of this here.

56
00:04:07,520 --> 00:04:16,910
So we go from the patches right up to the MLP head, but with the TFViTModel, this part

57
00:04:16,940 --> 00:04:18,080
is not included.

58
00:04:18,080 --> 00:04:21,170
So what you get will be only these outputs right here.

59
00:04:21,170 --> 00:04:28,160
So with the TFViTModel, we just have this, whereas with the TFViTForImageClassification, we

60
00:04:28,160 --> 00:04:30,500
have all of this. Now, apart from TensorFlow,

61
00:04:30,560 --> 00:04:32,000
we also have the code for Flax.

62
00:04:32,000 --> 00:04:35,650
So you could check out FlaxViTModel and FlaxViTForImageClassification.

63
00:04:35,660 --> 00:04:40,640
Now that said, we are now going to focus on fine-tuning this Vision

64
00:04:40,640 --> 00:04:41,920
Transformer model.

65
00:04:41,930 --> 00:04:48,000
Don't forget to subscribe and hit that notification button so you never miss amazing content like this.

66
00:04:48,020 --> 00:04:55,190
Before doing anything, we'll start by first installing this Transformers library right here.

67
00:04:55,190 --> 00:04:59,120
So we have pip install transformers now.

68
00:04:59,120 --> 00:04:59,870
That's fine.

69
00:04:59,870 --> 00:05:01,580
We move on.

70
00:05:01,580 --> 00:05:07,790
Let's go ahead to start with the fine tuning, but then let's get back to documentation and here we

71
00:05:07,790 --> 00:05:09,320
have the overview.

72
00:05:09,350 --> 00:05:14,930
We could check out the ViTConfig right here, and you'll notice that all the different configurations

73
00:05:14,930 --> 00:05:16,730
are basically what we've seen already.

74
00:05:16,730 --> 00:05:21,260
So we have this hidden size, 768 as default value; number of hidden layers,

75
00:05:21,260 --> 00:05:30,500
12, so you're just stacking 12 different transformer encoder blocks; number of attention heads, 12; intermediate

76
00:05:30,500 --> 00:05:32,510
size, 3072.

77
00:05:32,540 --> 00:05:39,490
Now this 3072 is actually for the dense layers we have here.

78
00:05:39,500 --> 00:05:50,060
Now the input we're getting from this norm, or right from here, is, say, 1 by 256 by 768.

79
00:05:50,060 --> 00:05:54,110
Then this goes in and gets to this point, where we still have the same shape.

80
00:05:54,110 --> 00:06:00,230
So up to here is the same and we expect that this output here should be of the same shape.

81
00:06:00,500 --> 00:06:04,540
But then in this MLP layer, we have two dense layers.

82
00:06:04,550 --> 00:06:19,580
Now the first dense layer will convert this to one by 256 by 3072, and then the next dense layer would

83
00:06:19,580 --> 00:06:26,090
convert this to one by 256 by 768.

84
00:06:26,480 --> 00:06:31,580
So that's why they call this the intermediate size right here.
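
The shape walk described above can be sketched in plain Python. This is only an illustration under stated assumptions, not Hugging Face code: `dense_output_shape` and `dense_param_count` are hypothetical helpers, and the sequence length of 256 is simply the example figure used in the lesson.

```python
# Hypothetical helpers for illustration; they are not part of any library.

def dense_output_shape(input_shape, units):
    """A dense layer acts on the last axis only, replacing it with `units`."""
    return input_shape[:-1] + (units,)


def dense_param_count(in_features, units):
    """Weights (in_features * units) plus one bias per output unit."""
    return in_features * units + units


hidden_size = 768          # hidden size quoted in the lesson
intermediate_size = 3072   # intermediate size quoted in the lesson

x_shape = (1, 256, hidden_size)  # batch, tokens (example value), features
expanded = dense_output_shape(x_shape, intermediate_size)
projected = dense_output_shape(expanded, hidden_size)

# Parameters of the two dense layers of one encoder block's MLP.
mlp_params = (dense_param_count(hidden_size, intermediate_size)
              + dense_param_count(intermediate_size, hidden_size))

print(expanded)    # shape after the first dense layer
print(projected)   # shape after the second dense layer
print(mlp_params)
```

The first dense layer widens only the feature axis and the second narrows it back, which is why the block's output shape matches its input shape.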

85
00:06:31,580 --> 00:06:37,550
So this intermediate size is, as you can see, the dimensionality of the intermediate feed-forward layer

86
00:06:37,730 --> 00:06:41,630
in the transformer encoder. We have the hidden activation, GELU.

87
00:06:41,660 --> 00:06:42,980
Hidden dropout probability.

88
00:06:42,980 --> 00:06:44,420
So you have some dropout here.

89
00:06:44,420 --> 00:06:52,580
Attention probs dropout probability, initializer range, layer

90
00:06:52,580 --> 00:06:59,870
norm epsilon value. To better understand this, we could get back to the layer norm documentation in

91
00:06:59,870 --> 00:07:00,530
TensorFlow.

92
00:07:00,530 --> 00:07:07,220
You see, epsilon here by default is one times ten to the negative three, and then we could see where

93
00:07:07,220 --> 00:07:08,390
exactly it's been used.

94
00:07:08,390 --> 00:07:15,440
So recall that with normalization we have x minus a given mean divided by a standard deviation.

95
00:07:16,460 --> 00:07:23,420
Now, we do not want a situation where this here is zero and then we have an infinite output.

96
00:07:23,420 --> 00:07:26,370
So we generally add some epsilon right here.

97
00:07:26,390 --> 00:07:33,440
Now, this epsilon by default, as you see here, is 0.001.

98
00:07:33,440 --> 00:07:38,750
And in Hugging Face here it is 1 times 10 to the -12.
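
The role of epsilon can be sketched with a small pure-Python normalization. This is a simplified illustration: `layer_norm` is a hypothetical helper that omits the learnable scale and offset of a real layer normalization, and it places epsilon inside the square root alongside the variance.

```python
import math


def layer_norm(values, epsilon):
    """Normalize one feature vector: (x - mean) / sqrt(variance + epsilon)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + epsilon) for v in values]


constant_features = [2.0, 2.0, 2.0]  # variance is exactly zero

# The two epsilon values quoted in the lesson.
keras_default = layer_norm(constant_features, epsilon=1e-3)
hf_vit_default = layer_norm(constant_features, epsilon=1e-12)

# Without epsilon the denominator is zero; plain Python raises instead of
# returning an infinite value.
try:
    layer_norm(constant_features, epsilon=0.0)
    divided_by_zero = False
except ZeroDivisionError:
    divided_by_zero = True

print(keras_default, hf_vit_default, divided_by_zero)
```

Either epsilon keeps the denominator strictly positive; the smaller value simply perturbs the normalized output less when the variance is tiny.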

99
00:07:39,530 --> 00:07:40,760
Now it's encoder only.

100
00:07:40,760 --> 00:07:41,960
So there's no encoder decoder.

101
00:07:41,960 --> 00:07:47,870
That's why this is set to False. Image size, 224; the patch size, 16; number of channels, three.

102
00:07:48,530 --> 00:07:55,520
Whether we are going to add a bias to the queries, keys and values, here True; and encoder stride, 16.

103
00:07:56,880 --> 00:08:02,730
Now you can remember the use of the stride when we're trying to get the patches we want.

104
00:08:02,760 --> 00:08:10,770
Once we have an image like this and a patch size of 16 by 16, we'll move through 16 pixels

105
00:08:10,770 --> 00:08:14,280
to obtain the next patch so that we have no space here.

106
00:08:14,280 --> 00:08:17,190
So we have something like this actually.
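
The non-overlapping patch layout described above can be checked with a few lines of arithmetic. This is a sketch under the lesson's numbers (image size 224, patch size 16, stride equal to the patch size); `patch_starts` is a hypothetical helper, not a library function.

```python
def patch_starts(image_size, patch_size, stride):
    """Top-left offsets of the patches along one image axis."""
    return list(range(0, image_size - patch_size + 1, stride))


image_size, patch_size = 224, 16

# A stride equal to the patch size leaves neither gaps nor overlaps.
starts = patch_starts(image_size, patch_size, stride=patch_size)
patches_per_side = len(starts)
num_patches = patches_per_side ** 2

print(starts[:3], starts[-1])         # first offsets and the last offset
print(patches_per_side, num_patches)  # patches per axis and in total
```

The last patch starts at offset 208 and ends exactly at pixel 224, so the grid tiles the whole image.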

107
00:08:18,240 --> 00:08:19,080
So that's it.

108
00:08:19,500 --> 00:08:21,390
We understand this ViTConfig.

109
00:08:21,390 --> 00:08:25,740
You could check out here this usage of the ViTConfig.

110
00:08:25,740 --> 00:08:28,890
So let's copy this code and get back here.

111
00:08:29,250 --> 00:08:34,980
There we go; we get back here, and then we have this code.

112
00:08:34,980 --> 00:08:36,370
Paste it out here. Okay.

113
00:08:36,390 --> 00:08:37,110
So that's it.

114
00:08:37,110 --> 00:08:45,870
You see clearly you could easily create a ViT model without necessarily going through all this process

115
00:08:45,870 --> 00:08:47,670
which we had done right here.

116
00:08:47,670 --> 00:08:53,220
So this was just for educational purposes, and it means that if you want to build your own ViT, it's simple:

117
00:08:53,220 --> 00:08:56,010
you just have to do this.

118
00:08:56,310 --> 00:09:02,850
Specify the configuration, that's ViTConfig, and then you initialize the model and that's it.

119
00:09:02,850 --> 00:09:11,220
So let's, let's run this and then let's also print out this configuration.

120
00:09:11,640 --> 00:09:12,930
Let's look at that.

121
00:09:12,930 --> 00:09:14,490
You see, there we go.

122
00:09:14,490 --> 00:09:15,990
We could change this.

123
00:09:15,990 --> 00:09:21,330
We could change, let's say, this hidden size to some value,

124
00:09:21,330 --> 00:09:24,900
144. So here, let's have our hidden size.

125
00:09:24,900 --> 00:09:28,140
So you see, this is how you could change the hidden size to suit your needs.

126
00:09:28,140 --> 00:09:29,010
So that's it.

127
00:09:29,010 --> 00:09:31,650
You change this, that's fine.

128
00:09:31,650 --> 00:09:35,880
You look at that and you see you now have this new configuration and that's it.

129
00:09:37,260 --> 00:09:41,100
Now, next we'll look at this feature extractor.

130
00:09:41,100 --> 00:09:44,790
The feature extractor is similar to what we have done already.

131
00:09:45,210 --> 00:09:50,530
That is taking in the input, resizing it and then carrying out some normalization.

132
00:09:50,550 --> 00:09:53,040
So that's it for the feature extractor.

133
00:09:53,040 --> 00:09:54,660
You could check in the documentation.

134
00:09:54,660 --> 00:09:56,910
This model here is a PyTorch model.

135
00:09:56,910 --> 00:09:58,170
So let's go to the TF one.

136
00:09:58,530 --> 00:10:01,590
So this is for TensorFlow now.

137
00:10:01,590 --> 00:10:08,790
So here you see, there we go, we have our TFViTModel.

138
00:10:08,790 --> 00:10:16,050
You could expand these parameters, and here we have all those different arguments which we could check

139
00:10:16,050 --> 00:10:16,640
out.

140
00:10:16,650 --> 00:10:25,110
Now here you have this example of how we could use the TFViTModel directly without going through any

141
00:10:25,110 --> 00:10:26,100
stressful process.

142
00:10:26,100 --> 00:10:34,530
So here you see, first of all, we have the dataset, and then we have this image which

143
00:10:34,530 --> 00:10:35,370
is extracted.

144
00:10:35,400 --> 00:10:37,680
You could get this image from your own dataset.

145
00:10:37,680 --> 00:10:39,120
So that's it.

146
00:10:39,780 --> 00:10:45,210
They're loading this from the Hugging Face Hub, or better still, Hugging Face Datasets.

147
00:10:45,210 --> 00:10:46,170
So that's it.

148
00:10:46,170 --> 00:10:48,660
And then you have this feature extractor.

149
00:10:48,660 --> 00:10:50,550
So that's a feature extractor.

150
00:10:50,550 --> 00:10:56,550
We have this model, and now note that here the TFViTModel is from_pretrained.

151
00:10:56,550 --> 00:11:03,090
So this means that we are going to use this model, which has already been trained, and you see these specifications

152
00:11:03,090 --> 00:11:09,840
here: vit-base-patch16-224-in21k. Okay.

153
00:11:09,840 --> 00:11:11,520
So that's it.

154
00:11:12,060 --> 00:11:16,740
The inputs now pass through the feature extractor before being passed to our model.

155
00:11:16,740 --> 00:11:23,580
So we'll see how to adapt this code so that we could fine tune our own model in TensorFlow.

156
00:11:24,150 --> 00:11:29,940
And as we said before, this model is different from the image classification model in that the outputs

157
00:11:29,940 --> 00:11:37,550
here are not the final output classes, but the hidden states from the transformer encoder.

158
00:11:37,560 --> 00:11:40,590
So let's scroll here; maybe there's an example.

159
00:11:40,590 --> 00:11:40,770
Okay.

160
00:11:40,770 --> 00:11:48,060
There's an example here. You see this output here that gives you directly "cat". So the model predicts one of 1000

161
00:11:48,060 --> 00:11:49,260
ImageNet classes.

162
00:11:49,260 --> 00:11:49,980
So that's it.

163
00:11:49,980 --> 00:11:56,820
Whereas here we have these hidden states.

164
00:11:56,820 --> 00:12:03,810
Okay, we paste this out here, let's take this off, take this one off, and then we could get started

165
00:12:03,810 --> 00:12:11,220
with building our own ViT model based off this Hugging Face TFViTModel.

166
00:12:11,220 --> 00:12:13,440
So we wouldn't need this data sets here.

167
00:12:13,440 --> 00:12:15,360
We already have our own data set.

168
00:12:16,050 --> 00:12:16,980
That's fine.

169
00:12:16,980 --> 00:12:19,560
We wouldn't make use of this feature extractor.

170
00:12:19,560 --> 00:12:22,620
We have this model here.

171
00:12:22,620 --> 00:12:29,580
We have our Hugging Face, let's call this the Hugging Face model, and then let's just take all this off

172
00:12:29,580 --> 00:12:30,120
actually.

173
00:12:30,120 --> 00:12:35,580
So we have this Hugging Face model, we have our TFViTModel from_pretrained, and that's it.

174
00:12:36,660 --> 00:12:38,730
Now we're going to define some inputs.

175
00:12:38,730 --> 00:12:46,920
So here our input is equal to the Input layer, and then we'll specify the shape.

176
00:12:46,920 --> 00:12:55,440
So here we work with 224 by 224 by 3, and then, using the TensorFlow functional API,

177
00:12:55,620 --> 00:13:00,240
we'll get an output x, which comes from this Hugging Face model.

178
00:13:00,240 --> 00:13:01,710
Let's call this base model.

179
00:13:01,710 --> 00:13:05,420
So the base model takes in the inputs.

180
00:13:05,430 --> 00:13:08,340
Let's call this let's change this to base model.

181
00:13:08,370 --> 00:13:08,700
Okay?

182
00:13:08,790 --> 00:13:14,570
We have the base model, which takes in the inputs, and then here we have base_model.vit.

183
00:13:14,970 --> 00:13:20,250
So we have this method, which we call before taking in the input.

184
00:13:20,250 --> 00:13:26,160
And so from here we have this base_model.vit, which takes in the input, and now we have the output here,

185
00:13:26,160 --> 00:13:26,710
X.

186
00:13:26,730 --> 00:13:32,910
Now when we run this, you see we download this 330 megabytes of pre-trained model.

187
00:13:42,640 --> 00:13:45,820
Let's have this, our Hugging Face model, here.

188
00:13:45,850 --> 00:13:50,920
So we get the inputs from here, and then we have the outputs.

189
00:13:50,920 --> 00:13:53,020
Let's call this output X.

190
00:13:53,740 --> 00:13:54,790
So we have that.

191
00:13:54,790 --> 00:13:56,110
Let's run this again.

192
00:13:56,110 --> 00:14:02,550
We have our Hugging Face model now set, and we still get this error.

193
00:14:02,560 --> 00:14:06,850
You see, it's linked to the positioning of these inputs here.

194
00:14:06,850 --> 00:14:08,710
So let's let's change this.

195
00:14:08,830 --> 00:14:12,880
Let's say we have 3 by 224 by 224.

196
00:14:12,880 --> 00:14:14,080
We run this again.

197
00:14:14,800 --> 00:14:18,140
You should see now that everything works fine.

198
00:14:18,160 --> 00:14:20,320
See, it now works fine.

199
00:14:20,320 --> 00:14:24,460
So this means that the inputs of our base model here,

200
00:14:24,520 --> 00:14:27,910
our Hugging Face model, should be of this shape.

201
00:14:27,910 --> 00:14:28,900
So it's three.

202
00:14:29,950 --> 00:14:38,590
Let's take this here: instead of being 224 by 224 by 3, as we usually do,

203
00:14:38,590 --> 00:14:44,590
now it's instead 3 by 224 by 224.

204
00:14:44,620 --> 00:14:51,550
Now this means that we need an extra layer which will convert this into this before passing into our

205
00:14:51,550 --> 00:14:53,160
base model right here.

206
00:14:53,170 --> 00:14:58,930
So let's build that extra layer and we'll take inspiration from our resize rescale layer, which we

207
00:14:58,930 --> 00:14:59,950
had built already.

208
00:15:00,250 --> 00:15:05,590
Let's get to resize rescale right here.

209
00:15:06,430 --> 00:15:07,840
So we have that.

210
00:15:08,380 --> 00:15:14,530
We'll modify the resize-rescale specifically for this Hugging Face model.

211
00:15:14,530 --> 00:15:21,640
So here we'll call this resize-rescale for Hugging Face.

212
00:15:21,670 --> 00:15:28,630
Okay, so we have this resize-rescale; we'll resize to make sure it is 224 by 224.

213
00:15:28,630 --> 00:15:37,210
So every image which passes here will be 224 by 224. We're going to rescale.

214
00:15:37,210 --> 00:15:42,190
And then after rescaling, we are going to permute the values.

215
00:15:42,190 --> 00:15:49,060
So we call on this Permute layer here, and the way we'll build this, or the way we'll call this

216
00:15:49,060 --> 00:15:54,700
Permute layer, will be such that we move this from this third position.

217
00:15:54,700 --> 00:15:59,200
This is zero, one, two, three: batch by 224 by 224 by 3.

218
00:15:59,200 --> 00:16:02,200
So we move this from this third position to this first position.

219
00:16:02,200 --> 00:16:10,480
So here we have three going into this position, and then this one and two shift to the right.

220
00:16:10,480 --> 00:16:12,400
So we have three.

221
00:16:12,400 --> 00:16:18,610
This one goes here, three, one, and then this two comes here so that the output now will be batch

222
00:16:18,970 --> 00:16:29,020
(see, the batch remains intact) by 3, which has been shifted, by 224 by 224.

223
00:16:29,560 --> 00:16:31,630
So be careful not to do, instead,

224
00:16:31,630 --> 00:16:32,200
two, one.

225
00:16:32,200 --> 00:16:38,620
This is one, two and not two, one, because here we have height by width by channel.

226
00:16:38,620 --> 00:16:43,450
So we want to change this to channel by height, by width.

227
00:16:43,630 --> 00:16:44,120
Okay.

228
00:16:44,140 --> 00:16:47,410
So that said, we do this here.

229
00:16:47,410 --> 00:16:51,040
So we just do three one, two and that's it.
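
The reordering described above can be sketched on shapes alone. `permute_shape` is a hypothetical helper that mimics the convention of a Keras-style `Permute` layer, where the dimension indices start at 1 and the batch axis is left out; a non-square image is assumed here only so that the difference between 3, 1, 2 and 3, 2, 1 shows up in the shape.

```python
def permute_shape(shape_with_batch, dims):
    """Reorder the non-batch axes of a shape.

    `dims` is 1-indexed and never mentions the batch axis, which stays first.
    """
    batch, *rest = shape_with_batch
    return (batch, *(rest[d - 1] for d in dims))


# Hypothetical non-square input: batch, height=200, width=300, channels=3.
channels_last = (None, 200, 300, 3)

channels_first = permute_shape(channels_last, (3, 1, 2))  # C, H, W
swapped = permute_shape(channels_last, (3, 2, 1))         # C, W, H

print(channels_first)
print(swapped)
```

With a square 224 by 224 input both orderings give the same shape, so using 3, 2, 1 by mistake would raise no error; it would silently transpose every image.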

230
00:16:51,040 --> 00:17:01,510
So after this input layer, before getting here, let's call this x; we'll take in our resize-rescale, and

231
00:17:01,510 --> 00:17:02,080
that's it.

232
00:17:02,080 --> 00:17:06,490
So this resize-rescale takes in the inputs, and then what we'll pass in here will be x.

233
00:17:06,490 --> 00:17:07,900
Let's run this again.

234
00:17:08,440 --> 00:17:14,410
We should get an error because when we permute it, it goes back to 224 by 224 by 3.

235
00:17:14,410 --> 00:17:21,880
So let's have this: see, we have that 224 by 224 by 3, as we are used to working with, and then we run this

236
00:17:21,880 --> 00:17:24,280
now and everything should be okay.

237
00:17:25,570 --> 00:17:29,020
Oh, we're getting an error: resize is not defined.

238
00:17:29,440 --> 00:17:30,460
Let's run this.

239
00:17:30,460 --> 00:17:33,160
Oh, let's make sure that's how we called it.

240
00:17:34,390 --> 00:17:35,920
Let's go up resize.

241
00:17:36,280 --> 00:17:37,210
We need to run this.

242
00:17:37,210 --> 00:17:39,340
Actually, this should be fine.

243
00:17:39,340 --> 00:17:41,230
Now.

244
00:17:41,230 --> 00:17:41,980
That's it.

245
00:17:42,220 --> 00:17:43,780
You see, everything is okay.

246
00:17:43,780 --> 00:17:47,470
So now what we'll do is we'll pass in some input.

247
00:17:47,990 --> 00:17:49,690
Let's pass in some input.

248
00:17:50,110 --> 00:17:57,520
We have this test image right here, and then we have our model which takes in the test image.

249
00:17:57,520 --> 00:18:02,410
Now we need to also convert this or rather add the batch dimension.

250
00:18:02,420 --> 00:18:06,400
So let's use expand_dims and take in the test image right here.

251
00:18:06,970 --> 00:18:08,230
So we have that.

252
00:18:08,230 --> 00:18:10,420
We run this and see what we get.

253
00:18:10,900 --> 00:18:16,540
Let's add this on the zeroth axis, axis zero, and run that again.

254
00:18:17,380 --> 00:18:20,790
And then we are told that it expected this, but instead found this.

255
00:18:20,800 --> 00:18:29,080
Now let's get back up here and we could change this to 256 by 256.

256
00:18:29,080 --> 00:18:35,500
And then knowing that this resize will convert it back to 224 since our model takes 224.

257
00:18:35,500 --> 00:18:37,570
So let's run this again.

258
00:18:39,730 --> 00:18:40,420
And there we go.

259
00:18:40,450 --> 00:18:41,360
Here's our output.

260
00:18:41,380 --> 00:18:46,540
You see, we have the last hidden state, 1 by 197 by 768.

261
00:18:46,870 --> 00:18:52,690
And as we scroll, we have this pooler output, 1 by 768.

262
00:18:53,050 --> 00:18:57,810
And then from here, we scroll down again.

263
00:18:57,820 --> 00:18:59,650
Let's see if we have another output.

264
00:19:00,070 --> 00:19:01,270
And that's it.

265
00:19:01,310 --> 00:19:03,970
Okay, so this is what we get as output.

266
00:19:03,970 --> 00:19:12,460
We have the last hidden state and this pooler output. From the documentation, where we had the parameters,

267
00:19:12,460 --> 00:19:14,890
TFViTModel, the parameters.

268
00:19:14,890 --> 00:19:19,150
You see here we have this last hidden state, the pooler output.

269
00:19:19,150 --> 00:19:22,510
And then we are told that these are the two outputs we will always get.

270
00:19:22,510 --> 00:19:24,130
And then we know that this one is optional.

271
00:19:24,130 --> 00:19:28,780
The hidden states are optional, but the last hidden state isn't optional.

272
00:19:28,780 --> 00:19:29,140
We get.

273
00:19:29,140 --> 00:19:30,400
We always get the last hidden state.

274
00:19:30,400 --> 00:19:34,240
We always get this pooler output, and then these attentions

275
00:19:34,240 --> 00:19:35,830
here are also optional.

276
00:19:35,830 --> 00:19:37,450
So if you want to get these attentions,

277
00:19:37,450 --> 00:19:41,170
all you need to do here is to specify config.

278
00:19:41,200 --> 00:19:45,970
Remember the configuration, ViTConfig: config.output_attentions, set that to True.

279
00:19:47,140 --> 00:19:50,110
And so this means that by default this will be set to false.

280
00:19:50,110 --> 00:19:55,960
And then for the hidden states we also repeat the same, so config.output_hidden_states, we set it

281
00:19:55,960 --> 00:19:57,040
to True.

282
00:19:57,760 --> 00:20:02,500
Now here in the documentation they explain the difference between this pooler output and this last hidden

283
00:20:02,500 --> 00:20:03,130
state.

284
00:20:03,340 --> 00:20:06,640
But getting back here, you should see the shape.

285
00:20:06,640 --> 00:20:11,620
So you see, this is just one of these here,

286
00:20:11,620 --> 00:20:20,600
while this is our full last hidden state, 1 by 197 by 768.

287
00:20:20,620 --> 00:20:29,560
But this model, this Hugging Face ViT model, was built taking into consideration this class embedding

288
00:20:29,560 --> 00:20:30,360
right here.

289
00:20:30,370 --> 00:20:37,810
And so this means that if you want to carry out some classification, you're better off

290
00:20:37,810 --> 00:20:46,120
taking this final one here, this class embedding's final hidden state.

291
00:20:46,990 --> 00:20:50,230
Now, if you want just those hidden states, we could specify this.

292
00:20:50,320 --> 00:20:52,450
Pick out the zeroth index.

293
00:20:52,450 --> 00:20:58,090
We run this and we should get only the output or the last hidden states.

294
00:20:58,090 --> 00:20:58,690
So that's it.

295
00:20:58,690 --> 00:21:00,430
We get this last hidden states.

296
00:21:01,300 --> 00:21:08,050
And then since we are interested only in that output corresponding to the class embedding, and the class

297
00:21:08,050 --> 00:21:14,350
embedding is at the zeroth position here, we are going to

298
00:21:14,350 --> 00:21:15,400
do this here.

299
00:21:15,400 --> 00:21:18,280
We're going to take this.

300
00:21:18,280 --> 00:21:22,420
We're going to take for the first dimension here, we take all.

301
00:21:22,900 --> 00:21:31,120
And then for this next one we select the zeroth index, and then for the next we take all.

302
00:21:31,120 --> 00:21:34,150
So let's run this again and see what we get.

303
00:21:34,810 --> 00:21:35,650
And that's it.

304
00:21:35,650 --> 00:21:37,600
We have this output right here.
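
The selection just described (all of the batch axis, index zero on the token axis, all of the feature axis) can be sketched on nested Python lists. `select_class_token` is a hypothetical helper, and the dummy values stand in for a real model output with the shapes quoted in the lesson.

```python
def select_class_token(last_hidden_state):
    """Nested-list equivalent of last_hidden_state[:, 0, :]."""
    return [sequence[0] for sequence in last_hidden_state]


batch_size = 1
num_tokens = (224 // 16) ** 2 + 1  # 196 patch tokens plus 1 class token
hidden_size = 768

# Fill every token with its own index so the selected token is identifiable.
last_hidden_state = [
    [[float(token)] * hidden_size for token in range(num_tokens)]
    for _ in range(batch_size)
]

class_token = select_class_token(last_hidden_state)

print(num_tokens)
print(len(class_token), len(class_token[0]))  # batch size and feature size
```

The token axis disappears after the selection, which leaves one feature vector per image for the classifier that follows.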

305
00:21:38,770 --> 00:21:45,370
And now, since we've converted this Hugging Face model into a TensorFlow model, we could call summary.

306
00:21:45,730 --> 00:21:49,650
So let's run this and check out our model summary.

307
00:21:49,660 --> 00:21:50,350
So that's it.

308
00:21:50,350 --> 00:21:51,670
We have this model summary.

309
00:21:51,670 --> 00:21:55,660
We scroll down and see 86 million parameters.

310
00:21:55,660 --> 00:21:59,170
We have the input sequential.

311
00:21:59,170 --> 00:22:03,310
The sequential corresponds to the resize-rescale layer, right?

312
00:22:03,310 --> 00:22:12,880
And then the slicing operator, which we have here, which permits us to get our specific output.

313
00:22:14,110 --> 00:22:18,370
Now getting back here, we will now add our final classifier.

314
00:22:18,520 --> 00:22:23,590
So we have this and then we have we'll call this output.

315
00:22:24,940 --> 00:22:25,950
That's fine.

316
00:22:25,960 --> 00:22:27,040
Let's just call this output.

317
00:22:27,040 --> 00:22:34,600
So we have this output, which takes the Dense layer with the number of classes specified here, and then

318
00:22:34,600 --> 00:22:36,990
we have our activation, softmax, as usual.

319
00:22:37,000 --> 00:22:38,170
So that's it.

320
00:22:38,170 --> 00:22:41,560
We have this output here.

321
00:22:41,560 --> 00:22:45,550
Okay, Now let's run this and see what we get.

322
00:22:46,000 --> 00:22:51,310
We're getting an error because we didn't pass in this x here, so let's run that again.

323
00:22:51,550 --> 00:22:58,540
Now, as the model is training, just remember that the learning rate we use here isn't appropriate,

324
00:22:58,540 --> 00:23:04,360
as we can't be using this type of high learning rate when doing fine-tuning.

325
00:23:04,360 --> 00:23:11,740
So we have to change this and use something lower, let's say 5 times 10 to the negative 5.

326
00:23:11,740 --> 00:23:15,850
Okay, So let's stop the training and then restart this process.

327
00:23:15,850 --> 00:23:22,300
You see now that when we reinitialize our model and then modify this learning rate, you see the

328
00:23:22,300 --> 00:23:28,660
loss drops now much lower than what we had before with a higher learning rate and the accuracy is already

329
00:23:28,660 --> 00:23:34,300
at 75% and we are still at the first epoch.

330
00:23:35,020 --> 00:23:39,430
So be careful when you're fine-tuning, or when you're updating all the

331
00:23:39,670 --> 00:23:42,250
parameters of an already trained model.

332
00:23:43,210 --> 00:23:46,540
You have to make sure you use a very low learning rate.