1
00:00:00,360 --> 00:00:06,300
Welcome back. In this section, we'll take a look at implementing a Vision Transformer (ViT)

2
00:00:06,300 --> 00:00:08,100
here using PyTorch.

3
00:00:08,250 --> 00:00:09,540
So let's get started.

4
00:00:09,570 --> 00:00:12,930
So open notebook 60 and we'll begin the lesson.

5
00:00:13,590 --> 00:00:18,600
So firstly, credit goes to Hiroto Honda for creating this notebook here.

6
00:00:19,580 --> 00:00:23,460
It comes from the paper "An Image is Worth 16x16 Words."

7
00:00:24,000 --> 00:00:28,980
And also, here's another implementation of it here, as well as this.

8
00:00:29,130 --> 00:00:34,950
We use timm, which is a Python library that allows us to load pre-trained PyTorch image

9
00:00:34,950 --> 00:00:35,400
models.

10
00:00:35,760 --> 00:00:41,250
It's quite useful, so I would encourage you to explore it if you want more hands-on experience with using

11
00:00:41,850 --> 00:00:46,410
the state-of-the-art models and easily loading them into your own projects.

12
00:00:46,860 --> 00:00:48,420
You can take a look at this tutorial here.

13
00:00:48,420 --> 00:00:49,020
It's quite good.

14
00:00:49,710 --> 00:00:50,610
So let's begin.

15
00:00:50,610 --> 00:00:55,620
So firstly, you've got to install timm, which takes about three seconds, and then importing

16
00:00:55,620 --> 00:00:57,420
the library takes about six seconds.

17
00:00:57,960 --> 00:00:59,910
Next, we need to load the model.

18
00:00:59,910 --> 00:01:03,870
So the model we're going to look at is vit_base_patch16_224.

19
00:01:04,290 --> 00:01:10,320
This was trained on ImageNet, and its input image resolution is 224 by 224.

20
00:01:10,890 --> 00:01:12,200
So let's begin.

21
00:01:12,300 --> 00:01:19,680
So let's load this model. We use create_model from timm to create the model right here, and you

22
00:01:19,680 --> 00:01:22,380
can see the device is CUDA, which means that we're using the GPU.

23
00:01:23,250 --> 00:01:25,790
You can always make sure you're using the GPU here.

24
00:01:25,790 --> 00:01:29,010
It should be set to the GPU by default in this notebook.

25
00:01:29,640 --> 00:01:32,940
This takes a little while because it has to download a model in the background.

26
00:01:32,940 --> 00:01:34,800
So let's just be patient.

27
00:01:36,120 --> 00:01:36,800
OK, there we go.

28
00:01:36,810 --> 00:01:39,590
It's actually not that long, 21 seconds.

29
00:01:39,590 --> 00:01:42,330
So now we just define some parameters that we're using.

30
00:01:42,820 --> 00:01:44,730
So transforms as well.

31
00:01:45,270 --> 00:01:51,270
Next, we're going to load the ImageNet labels, as well as a test image that we'll be using for this

32
00:01:51,270 --> 00:01:52,140
demonstration.

33
00:01:53,430 --> 00:01:58,590
So what we're going to do, we're going to run an inference on this image from Santorini, the Greek

34
00:01:58,590 --> 00:01:59,010
island.

35
00:01:59,640 --> 00:02:00,510
So let's take a look.

36
00:02:00,690 --> 00:02:02,550
And it says dome, OK.

37
00:02:02,760 --> 00:02:04,590
So you can see this is quite simple.

38
00:02:05,010 --> 00:02:06,870
You just got the image here.

39
00:02:06,870 --> 00:02:11,130
We open the image as a PIL image, that is, we load the image using PIL.

40
00:02:11,700 --> 00:02:18,300
And then we just convert it into a tensor here using the transforms that we defined, unsqueeze

41
00:02:18,300 --> 00:02:19,530
it and send it to the GPU.

42
00:02:20,790 --> 00:02:21,920
And that's pretty much it.

43
00:02:21,930 --> 00:02:28,720
We just get the output out of the model there, and then we get the argmax of that output using

44
00:02:28,740 --> 00:02:35,490
torch.argmax, which is analogous to NumPy's np.argmax function, and convert it to an int so we can use

45
00:02:35,490 --> 00:02:41,670
it as an index into the ImageNet labels list, or rather the dictionary, that we created there,

46
00:02:42,330 --> 00:02:44,040
and then we just show the results here.

47
00:02:44,060 --> 00:02:48,240
So the inference result is a dome, and you can use this on any image you want.

48
00:02:49,020 --> 00:02:52,350
So now let's dive deeper into the Vision Transformer.

49
00:02:52,350 --> 00:02:57,960
So you would have seen this in the previous section, where we take an image like this and split it up into

50
00:02:57,960 --> 00:02:58,440
patches.

51
00:02:58,440 --> 00:03:08,160
We're using a 14 by 14 grid of patches here, and 14 times 14 is 196, which is why we have 1 to 196 here.

52
00:03:08,820 --> 00:03:09,820
Then we we.

53
00:03:09,870 --> 00:03:17,430
So this is the patch embedding: each patch is encoded into a 1 by 768 vector here.

54
00:03:17,790 --> 00:03:21,460
And we use a convolutional layer to create this embedding.

55
00:03:21,480 --> 00:03:23,580
So there are some convolutions involved here.
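A minimal sketch of that patch embedding, assuming the standard ViT-Base settings (16 by 16 pixel patches, 768 output channels). The layer below is randomly initialised and the input is a dummy tensor; it is not the pre-trained layer from the notebook:
```python
import torch
import torch.nn as nn
# A Conv2d whose kernel size equals its stride cuts the image into non-overlapping patches
patch_embed = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)
img = torch.randn(1, 3, 224, 224)            # dummy batch holding one RGB image
feat = patch_embed(img)                      # (1, 768, 14, 14): a 14 by 14 grid of patches
patches = feat.flatten(2).transpose(1, 2)    # (1, 196, 768): one 768-vector per patch
print(patches.shape)
```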

56
00:03:24,120 --> 00:03:29,700
And then next, we combine it with the position embeddings here, and then we have a class token

57
00:03:29,700 --> 00:03:30,050
as well.

58
00:03:30,060 --> 00:03:31,830
So it's one more added to this.

59
00:03:31,830 --> 00:03:40,020
It's 197 in total, so N equals 197, and the dimension of each is 768.

60
00:03:40,020 --> 00:03:47,700
So this is a 197 by 768 matrix that we pass to the Transformer encoder, and it outputs the encoding here,

61
00:03:47,700 --> 00:03:49,290
which is the same dimensions anyway.

62
00:03:49,890 --> 00:03:53,820
And then we pass it to the MLP head, which does the final prediction.

63
00:03:54,420 --> 00:03:57,030
That's pretty much the Vision Transformer in a nutshell.

64
00:03:58,110 --> 00:04:04,140
So let's take a look at how we actually split the image into patches. You can just remember that this is

65
00:04:04,140 --> 00:04:08,400
how it's done here, and this is where we get the embeddings from now.

66
00:04:08,550 --> 00:04:09,950
You can just print this out there.

67
00:04:09,960 --> 00:04:10,980
You can see the image.

68
00:04:10,980 --> 00:04:13,490
Tensor size was 224 by 224

69
00:04:13,490 --> 00:04:13,540
by 3,

70
00:04:13,570 --> 00:04:22,230
and the patch embeddings here for the image are 196 by 768, as we saw in the diagram above.

71
00:04:22,920 --> 00:04:28,290
Now we can visualize what the patches actually look like for that image, even though we saw it above.

72
00:04:28,890 --> 00:04:34,320
Let's just generate this nice patch image here using Matplotlib, and you can see this.

73
00:04:34,530 --> 00:04:41,130
The image is split up into these little cells, and that's what we use as the inputs into

74
00:04:41,130 --> 00:04:43,080
the Vision Transformer encoder.

75
00:04:44,400 --> 00:04:50,370
Next, we need to add the positional embeddings, which are trainable or learnable, I should say.

76
00:04:50,430 --> 00:04:57,330
So to do that, let's just take a look at the position embedding vector here, and now we can see it.

77
00:04:57,330 --> 00:04:59,460
Let's visualise some of the

78
00:05:00,770 --> 00:05:05,300
embedding similarities, so you can take a look and see what that looks like.

79
00:05:09,720 --> 00:05:13,350
I'm not entirely sure how to interpret this chart here.

80
00:05:14,700 --> 00:05:20,780
They all look very similar to me, but it's probably a month early show.

81
00:05:20,790 --> 00:05:23,860
Maybe I should look into it before explaining it wrongly to you guys.

82
00:05:23,860 --> 00:05:26,190
So make a note of that and come back to this.

83
00:05:26,880 --> 00:05:30,330
So next, we need to make a transformer input here.

84
00:05:30,540 --> 00:05:33,720
So this is what creates that class token here.

85
00:05:34,140 --> 00:05:37,080
So that's what we add as the zeroth vector.

86
00:05:37,080 --> 00:05:40,560
So we get to 197, which is 1 plus 14 by 14.

87
00:05:41,700 --> 00:05:42,780
So let's do that.

88
00:05:43,470 --> 00:05:44,840
And then we just add it there.

89
00:05:44,850 --> 00:05:45,930
So that's quite simple.
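A rough sketch of building that Transformer input, assuming the class token is prepended as the zeroth vector and the position embeddings are added element-wise. The parameters here are zero-initialised stand-ins, not the trained values from the model:
```python
import torch
import torch.nn as nn
patches = torch.randn(1, 196, 768)                   # stand-in for the patch embeddings
cls_token = nn.Parameter(torch.zeros(1, 1, 768))     # learnable class token (stand-in)
pos_embed = nn.Parameter(torch.zeros(1, 197, 768))   # learnable position embeddings (stand-in)
# Prepend the class token: 1 + 14 * 14 = 197 vectors in total
x = torch.cat([cls_token.expand(patches.shape[0], -1, -1), patches], dim=1)
x = x + pos_embed                                    # add position information to every vector
print(x.shape)
```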

90
00:05:46,680 --> 00:05:51,870
So now, this is a closer look, a deeper look, inside of the Transformer

91
00:05:51,870 --> 00:05:52,620
encoder itself.

92
00:05:53,220 --> 00:05:54,690
So you can see what's happening here.

93
00:05:54,700 --> 00:06:00,480
You can see the vector is input here and then we have different processes going on.

94
00:06:00,900 --> 00:06:07,650
You can see it's split up now into the Q, K, and V vectors here, or matrices.

95
00:06:08,370 --> 00:06:11,260
And then we just go to the parallel attention heads.

96
00:06:11,280 --> 00:06:14,970
This is an important block here, and the output is here.

97
00:06:14,970 --> 00:06:20,250
And then we have the fully connected layers here, which form the MLP head, which predicts the

98
00:06:20,250 --> 00:06:21,540
final class here.

99
00:06:23,640 --> 00:06:30,900
So let's take a look at the series of Transformer encoders. We're using 12 different

100
00:06:30,900 --> 00:06:33,660
Transformers in series here, so we can print this out here.

101
00:06:33,690 --> 00:06:37,320
You can see Transformer 0 to 11, that's 12.

102
00:06:38,070 --> 00:06:44,160
Now we're going to see what the attention that this Transformer creates looks like, and visualize

103
00:06:44,160 --> 00:06:44,300
it.

104
00:06:44,310 --> 00:06:45,510
So let's take a look.

105
00:06:45,510 --> 00:06:49,530
Firstly, at the multi-head attention block, which is this here.

106
00:06:49,770 --> 00:06:55,350
And you can see the input to that encoder is what we expected: 1 by 197 by 768.

107
00:06:55,980 --> 00:06:59,910
Next, we can take a look at the fully connected layer that expands the dimension.

108
00:07:00,480 --> 00:07:02,200
And you can see what it looks like here.

109
00:07:02,220 --> 00:07:06,330
So the 768 is now replaced by 2304, which is 3 times 768.

110
00:07:07,830 --> 00:07:15,270
Next, we need to split the QKV into multiple Q, K, and V vectors for the multi-head attention.

111
00:07:16,110 --> 00:07:18,110
So let's take a look at that.

112
00:07:19,410 --> 00:07:25,140
And now we need to create the attention matrix, and it's visualized like that.

113
00:07:26,130 --> 00:07:30,870
So now we can visualize the attention matrix for all of the patches.

114
00:07:32,670 --> 00:07:34,430
And you can see what they look like there.
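The steps above (expand to QKV, split into heads, build the attention matrix) can be sketched as follows. This is an illustrative version with random weights; the 12 heads of size 64 are the usual ViT-Base values and are an assumption here, not something stated in the lesson:
```python
import torch
import torch.nn as nn
x = torch.randn(1, 197, 768)                 # stand-in for the encoder input
num_heads, head_dim = 12, 64                 # assumed ViT-Base values: 12 * 64 = 768
qkv_layer = nn.Linear(768, 768 * 3)          # fully connected layer expanding 768 to 2304
qkv = qkv_layer(x)                           # (1, 197, 2304)
# Split the last dimension into (q/k/v, head, head_dim) and move those axes to the front
qkv = qkv.reshape(1, 197, 3, num_heads, head_dim).permute(2, 0, 3, 1, 4)
q, k, v = qkv[0], qkv[1], qkv[2]             # each is (1, 12, 197, 64)
attn = (q @ k.transpose(-2, -1)) * head_dim ** -0.5   # scaled dot products: (1, 12, 197, 197)
attn = attn.softmax(dim=-1)                  # every row of the attention matrix sums to 1
out = (attn @ v).transpose(1, 2).reshape(1, 197, 768) # merge the heads back together
print(attn.shape, out.shape)
```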

115
00:07:36,240 --> 00:07:41,430
OK, so now this is the multi-layer perceptron (MLP) head.

116
00:07:42,180 --> 00:07:49,670
So this produces a vector that has a score for each class, basically not yet normalized

117
00:07:49,680 --> 00:07:50,820
probabilities, just a score.

118
00:07:50,970 --> 00:07:54,600
You can see if each class here and you can see five to the edge here.

119
00:07:54,600 --> 00:07:57,800
Let's run this and generate a total of one five three.

120
00:07:57,810 --> 00:08:00,870
It is the class with the highest score.

121
00:08:00,960 --> 00:08:01,510
You can see it.

122
00:08:02,010 --> 00:08:05,160
And then we just normalize it to get the probabilities out of it.
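As a small illustration of that last step, here is a sketch that turns raw class scores into probabilities with softmax and picks the top class. The three scores are made-up values, not output from the model:
```python
import torch
logits = torch.tensor([[1.0, 3.0, 0.5]])     # made-up scores for three classes
probs = torch.softmax(logits, dim=-1)        # normalise the scores into probabilities
pred = int(torch.argmax(probs, dim=-1))      # index of the class with the highest score
print(pred, probs.sum().item())
```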

123
00:08:05,730 --> 00:08:10,500
And we know that 538 corresponds to the dome class here.

124
00:08:10,620 --> 00:08:14,970
I'm not sure what number this is, but it's probably something that looks similar to a dome.

125
00:08:15,750 --> 00:08:18,670
So that's it for Vision Transformers.

126
00:08:18,690 --> 00:08:25,500
This was a lesson that went in depth into the attention parts of the vision transformer and into the

127
00:08:25,500 --> 00:08:26,970
individual blocks at the end.

128
00:08:27,990 --> 00:08:32,820
Next, we'll take a look at how you can train your own Vision Transformer.

129
00:08:34,050 --> 00:08:35,580
So stay tuned for that lesson.

130
00:08:35,700 --> 00:08:36,150
Thank you.