1
00:00:00,180 --> 00:00:07,020
Hello, everyone, and welcome to this new section, in which we shall be treating the Swin Transformer.

2
00:00:07,560 --> 00:00:15,060
Up until this point, the kinds of transformer models, or specifically image transformer models, we've

3
00:00:15,060 --> 00:00:23,850
been working with have been designed such that the input shape is the same as the output shape.

4
00:00:23,850 --> 00:00:29,010
So if we have layers, we would have a situation where we could have an input.

5
00:00:29,010 --> 00:00:34,830
Let's say we have this input, let's say 9 by 768, where 9 represents the

6
00:00:34,830 --> 00:00:39,810
sequence length; we suppose we're taking all nine input patches.

7
00:00:40,320 --> 00:00:42,180
If we add this, then we'll have ten.

8
00:00:42,180 --> 00:00:45,570
So let's say we have ten by 768.

9
00:00:45,570 --> 00:00:49,620
Where 768 is the embedding dimension, or the hidden size.

10
00:00:49,620 --> 00:00:56,370
So we have those kinds of inputs and when we pass this to layer one or we pass this through the encoder,

11
00:00:56,370 --> 00:01:02,970
this L1, when we pass through one of these, the first transformer encoder, we have an

12
00:01:02,970 --> 00:01:11,070
output which has exactly the same shape here, 768. And then passing through this next one, we have

13
00:01:11,070 --> 00:01:16,350
the same and all of this right up to the end.

14
00:01:16,350 --> 00:01:24,390
So here we'll also have 10 by 768, and here we would have 10 by 768.

15
00:01:24,420 --> 00:01:28,950
Now we have omitted the batch dimension, so you could add the batch dimension here.

16
00:01:29,160 --> 00:01:37,560
Now that said, you see that we have this shape which is maintained throughout the transformer encoder,

17
00:01:37,560 --> 00:01:45,270
but with the Swin Transformer, as we shall see, this shape is modified throughout the network.

18
00:01:45,270 --> 00:01:52,230
So you could see here this shape, which is modified as we move from one Swin Transformer

19
00:01:52,230 --> 00:01:58,800
block to another, coupled with the fact that the transformer is designed in a hierarchical manner.

20
00:01:59,130 --> 00:02:06,450
The authors of this paper also introduced this shifted window scheme, from which the name was gotten

21
00:02:06,600 --> 00:02:09,090
here: Swin Transformer.

22
00:02:09,330 --> 00:02:15,390
And in this section we'll go in depth to understand how this hierarchical structure has been designed

23
00:02:15,390 --> 00:02:21,960
and also how the attention is being computed with this shifted window scheme.

24
00:02:22,290 --> 00:02:31,800
Now, unlike with the usual model where we start off with 16 by 16 patches and end up with 16 by 16

25
00:02:31,800 --> 00:02:38,940
patches, as we move to the different image transformer blocks like you could see here.

26
00:02:38,940 --> 00:02:46,380
You see we start out with this and it remains the same right up to the end. With the Swin Transformer,

27
00:02:46,380 --> 00:02:49,110
We're starting off with 4x4 patches.

28
00:02:49,110 --> 00:02:50,220
You could see that here.

29
00:02:50,310 --> 00:02:56,100
If you look at this image, you see you have one, two, three, four, one, two, three, four.

30
00:02:56,100 --> 00:02:58,680
So this is one 4x4 patch.

31
00:02:58,680 --> 00:03:03,270
And so the original image is broken up into these 4x4 patches.

32
00:03:04,260 --> 00:03:11,250
And since each 4x4 patch is made of 16 pixels and this image is RGB,

33
00:03:11,250 --> 00:03:14,670
So we have three channels, you see we have 16 times 3.

34
00:03:14,670 --> 00:03:24,030
That is 48 channels for each token, because we consider this one now as a single token, and then the single

35
00:03:24,030 --> 00:03:30,030
token will now be made of 48 different features.

36
00:03:30,030 --> 00:03:32,400
And that's why you see you have this 48.

37
00:03:32,400 --> 00:03:35,360
So we suppose now that the height and width take some arbitrary value.

38
00:03:35,370 --> 00:03:42,570
Now if we have the height and width be 256, then we would have 256 divided by 4, which should be 64,

39
00:03:42,570 --> 00:03:53,550
so this will be 64 by 64: 64 times 64, which is 4096 tokens.

40
00:03:53,550 --> 00:03:54,930
Let's take this off.

41
00:03:54,930 --> 00:04:02,670
So as we calculated: 256 by 256 is our image height and our image width.

42
00:04:02,670 --> 00:04:10,260
So 256 divided by 4 is 64, 256 divided by 4 is 64, and 64 times 64 is 4096 tokens.

43
00:04:10,260 --> 00:04:19,200
So we have 4096 tokens, and then each token will be made up of 48 features, as we've just seen here.
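
The token arithmetic above can be checked with a small sketch. The function name patch_tokens and its parameters are illustrative only and do not come from the paper or any library; it assumes non-overlapping square patches.

```python
def patch_tokens(height, width, patch_size=4, channels=3):
    """Return (number of tokens, raw features per token) for non-overlapping patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    # One token per patch: (H / P) * (W / P)
    tokens = (height // patch_size) * (width // patch_size)
    # Each token flattens P * P pixels over all colour channels
    features = patch_size * patch_size * channels
    return tokens, features


print(patch_tokens(256, 256))  # (4096, 48)
```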

44
00:04:19,500 --> 00:04:24,340
Now with this we pass this into this linear embedding.

45
00:04:24,360 --> 00:04:32,280
Now recall the role of the linear embedding is to help us get a desired hidden size.

46
00:04:32,280 --> 00:04:37,680
And so here, as we could see, our desired hidden size is C, so this is fixed to be C. So C could

47
00:04:37,680 --> 00:04:38,310
be any value.

48
00:04:38,310 --> 00:04:41,340
Let's suppose that C is 768.

49
00:04:41,370 --> 00:04:51,240
Now this means that at this point we have 4096 tokens, each with 768

50
00:04:53,070 --> 00:04:53,940
features.

51
00:04:53,940 --> 00:04:58,350
So we have that. Now, once we get to this, we get to this patch merging.

52
00:04:58,380 --> 00:04:59,760
Notice that this is the first time we meet this.

53
00:04:59,860 --> 00:05:02,320
We get this patch merging layer here.

54
00:05:03,190 --> 00:05:07,450
And what this patch merging layer does is what you find here.

55
00:05:07,450 --> 00:05:15,120
So you'll see that it's going to join the different patches together, hence the term patch merging.

56
00:05:15,130 --> 00:05:23,620
So it merges, you see, these four patches into this one, and then merges these into this one.

57
00:05:24,040 --> 00:05:28,540
It merges this into this, merges this into this.

58
00:05:28,540 --> 00:05:36,100
And so now we have our tokens, or rather the number of tokens, which will be divided by four.

59
00:05:36,370 --> 00:05:42,940
And if we divide this by four, we'll have 1024 tokens.

60
00:05:43,540 --> 00:05:54,010
And in this patch merging layer, we move from C features to 2C features.

61
00:05:54,010 --> 00:05:57,280
So here we have two times 768.

62
00:05:57,280 --> 00:05:58,960
Let's just say 2C.

63
00:06:00,000 --> 00:06:03,900
Which is 1536 features.

64
00:06:04,560 --> 00:06:08,110
Now, again, to better understand how we got this from this here:

65
00:06:08,130 --> 00:06:11,640
Recall we have H and W to be 256.

66
00:06:11,940 --> 00:06:18,450
Then we'll have 256 divided by 8 times 256 divided by 8, which is 32 times 32, which is 1024.

67
00:06:19,050 --> 00:06:27,930
Then again we go through this exact same process, where now these four tokens are being merged into one.

68
00:06:28,050 --> 00:06:31,560
And the number of tokens here again is divided by four.

69
00:06:31,560 --> 00:06:40,800
So now we have 256 tokens, 256, and this number of features is multiplied by two.

70
00:06:42,120 --> 00:06:48,480
And so we end up with this 256 by 3072, and then 64 by 6144.
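
The shapes traced above can be reproduced with a short sketch. It assumes four stages, a 256 by 256 input, 4x4 patches, and C = 768 as supposed in this walkthrough; stage_shapes is a hypothetical helper, not part of any library.

```python
def stage_shapes(height, width, embed_dim, patch_size=4, num_stages=4):
    """List (tokens, features) per stage under the merging rule described above."""
    tokens = (height // patch_size) * (width // patch_size)
    features = embed_dim
    shapes = []
    for _ in range(num_stages):
        shapes.append((tokens, features))
        tokens //= 4    # each patch merging joins 2x2 neighbouring tokens into one
        features *= 2   # and the feature count goes from C to 2C
    return shapes


for shape in stage_shapes(256, 256, embed_dim=768):
    print(shape)
# (4096, 768)
# (1024, 1536)
# (256, 3072)
# (64, 6144)
```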

71
00:06:48,840 --> 00:06:59,160
Then, instead of carrying out the attention globally, we carry out this attention locally.

72
00:06:59,160 --> 00:07:04,500
And so this means that when the image is laid out like this with the different tokens, the attention

73
00:07:04,500 --> 00:07:06,210
here, let's take this off.

74
00:07:06,210 --> 00:07:11,040
The attention here is only in this local region.

75
00:07:11,040 --> 00:07:16,290
So we carry out attention within this local region, attention within this local region, attention within this

76
00:07:16,290 --> 00:07:19,170
local region and so on and so forth.

77
00:07:20,040 --> 00:07:27,480
And it turns out that carrying out the attention locally is computationally less expensive than carrying

78
00:07:27,480 --> 00:07:29,100
out the attention globally.

79
00:07:29,220 --> 00:07:37,140
Now, to understand why working with local attention will be computationally cheaper

80
00:07:37,140 --> 00:07:39,200
than working with global attention.

81
00:07:39,210 --> 00:07:41,360
Let's consider the following example.

82
00:07:41,370 --> 00:07:47,790
So here we have this eight by eight pixel image, and then we'll decide

83
00:07:47,790 --> 00:07:50,730
to create patches which are two by two.

84
00:07:50,730 --> 00:07:54,420
So this is one patch, this is another patch.

85
00:07:54,910 --> 00:07:56,130
So these are the different pixels.

86
00:07:56,130 --> 00:07:58,080
So we have all the different patches.

87
00:07:58,080 --> 00:08:03,540
And after this, at the end of this process, we should have 4x4.

88
00:08:03,540 --> 00:08:06,030
That is 16 different patches.

89
00:08:06,030 --> 00:08:12,990
And so when we have this kind of setup, where we have 16 different patches,

90
00:08:12,990 --> 00:08:24,000
it means that our attention map will be 16 by 16, because here you have 16: one, two, three, four,

91
00:08:24,000 --> 00:08:30,060
five, six, seven, eight, nine, ten, 11, 12, 13, 14, 15, 16.

92
00:08:30,060 --> 00:08:30,480
Patches.

93
00:08:30,480 --> 00:08:36,330
We're getting all these 16 patches from here, and then each and every one of them attends to

94
00:08:36,330 --> 00:08:36,990
one another.

95
00:08:36,990 --> 00:08:44,070
So this one attends to all the other patches and this one attends to all the others and so on and so

96
00:08:44,070 --> 00:08:44,280
forth.

97
00:08:44,280 --> 00:08:51,630
So you have all this attention resulting in a 16 by 16 attention map.

98
00:08:51,630 --> 00:09:05,730
Now, the 16 by 16 attention map comes from the fact that we're multiplying a 16 by C matrix times a C by 16 matrix,

99
00:09:05,880 --> 00:09:14,400
because each patch here will be a query, and then the key is transposed to have C by 16.

100
00:09:14,400 --> 00:09:15,810
The C is the hidden size.

101
00:09:15,810 --> 00:09:19,230
And then from here we have 16 by 16.

102
00:09:19,230 --> 00:09:28,080
And so this computation or better still, this matrix multiplication becomes very expensive when we're

103
00:09:28,080 --> 00:09:31,950
dealing with a large number of patches.

104
00:09:32,340 --> 00:09:39,360
Anyway, let's get back to 16, and we have 16 by C times C by 16.

105
00:09:39,360 --> 00:09:44,820
This is not 16 times C, this is a C by 16 matrix.

106
00:09:44,820 --> 00:09:47,280
We saw this already in some previous sections.

107
00:09:47,290 --> 00:09:49,500
So this is C by 16.

108
00:09:49,500 --> 00:09:57,870
Let's have this here, and this now gives 16 by 16. Okay. Now let's suppose that we carried out this local

109
00:09:57,870 --> 00:09:58,500
attention.

110
00:09:58,500 --> 00:10:06,240
So here we're carrying out the attention on just these two by two patches right here.

111
00:10:06,240 --> 00:10:09,840
So for the attention here, we will have one, two, three, four.

112
00:10:09,840 --> 00:10:11,900
So we have one, two, three, four.

113
00:10:11,910 --> 00:10:14,130
This one attends to these three.

114
00:10:15,130 --> 00:10:16,780
This one attends to this.

115
00:10:16,780 --> 00:10:17,650
Attends to this.

116
00:10:17,650 --> 00:10:18,850
Attends back to this.

117
00:10:18,850 --> 00:10:19,150
This one.

118
00:10:19,150 --> 00:10:20,080
Attends to this.

119
00:10:20,080 --> 00:10:20,830
Attends to this.

120
00:10:20,830 --> 00:10:21,850
Attends to this.

121
00:10:21,850 --> 00:10:23,140
This attends to this.

122
00:10:23,140 --> 00:10:24,190
This attends to this.

123
00:10:24,190 --> 00:10:24,640
And this.

124
00:10:24,640 --> 00:10:25,870
Attends to this.

125
00:10:25,870 --> 00:10:31,270
So we end up with 4 by C times C by 4.

126
00:10:31,270 --> 00:10:35,560
And so now after computing this, we have a 4x4 matrix.

127
00:10:35,560 --> 00:10:39,190
So we have 4x4 output here.

128
00:10:39,220 --> 00:10:45,760
Now, since we'll do this 16 different times for this, we do this locally, this locally, this locally,

129
00:10:45,760 --> 00:10:47,980
this locally and so on and so forth.

130
00:10:48,010 --> 00:10:57,640
It means we'll take this computation and carry it out 16 times.

131
00:10:57,760 --> 00:11:08,860
Now this means that if, instead of 16 patches, we have 2048 patches, we'll be computing a 2048

132
00:11:10,030 --> 00:11:12,550
This 2048 by C.

133
00:11:13,680 --> 00:11:16,080
Times a 2000.

134
00:11:16,080 --> 00:11:17,640
Let's write this better.

135
00:11:17,640 --> 00:11:18,540
Let's write it here.

136
00:11:18,570 --> 00:11:30,270
So we'll be doing a 2048 by C times a C by 2048 matrix, which would give us a 2048 by 2048 matrix,

137
00:11:31,020 --> 00:11:35,490
which has more than 4 million different positions.

138
00:11:35,730 --> 00:11:36,900
That's in this matrix.

139
00:11:37,050 --> 00:11:38,140
If we if we have two.

140
00:11:38,160 --> 00:11:43,710
If we take 2048 times 2048, it gives us more than 4 million different positions.

141
00:11:43,710 --> 00:11:49,140
And then if we're dealing with this, we would have the 16 change to 2048.

142
00:11:49,140 --> 00:11:56,040
We still have our 2048, but here we still have this 4 by C times C by 4.

143
00:11:56,040 --> 00:12:00,840
So we still have this 4x4 matrix.

144
00:12:03,200 --> 00:12:05,960
Which is computed 2048 times.

145
00:12:06,170 --> 00:12:15,290
And so this shows us clearly that the local attention is computationally much cheaper than the global

146
00:12:15,290 --> 00:12:16,100
attention.
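
One rough way to compare the two costs is to count attention-map entries. The sketch below assumes N tokens split evenly into non-overlapping windows of M tokens each, so the window count is derived as N / M; that derivation, and both function names, are assumptions of this sketch rather than anything from the paper.

```python
def global_attention_entries(num_tokens):
    """Entries in one (N x N) attention map when every token attends to every token."""
    return num_tokens * num_tokens


def local_attention_entries(num_tokens, window_tokens):
    """Total entries when attention is restricted to windows of M tokens each.

    Assumes num_tokens is divisible by window_tokens (non-overlapping windows).
    """
    num_windows = num_tokens // window_tokens
    return num_windows * window_tokens * window_tokens  # N / M maps of size M x M


for n in (16, 2048):
    print(n, global_attention_entries(n), local_attention_entries(n, 4))
# 16 256 64
# 2048 4194304 8192
```

Under these assumptions the global count grows with the square of the number of tokens, while the local count grows only linearly for a fixed window size.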

147
00:12:17,150 --> 00:12:23,930
But remember that one of the advantages of working with ViTs is the fact that we actually have this

148
00:12:23,930 --> 00:12:25,180
global attention.

149
00:12:25,190 --> 00:12:32,150
So how do we take advantage of this local attention, which we've just presented, where attention is

150
00:12:32,150 --> 00:12:40,070
carried out only within the selected patches while still doing some global or while still carrying out

151
00:12:40,070 --> 00:12:42,170
some global attention?

152
00:12:42,380 --> 00:12:50,240
This is done using this shifted window approach, which we saw already from the abstract, where in

153
00:12:50,240 --> 00:13:00,110
a given layer l, as we could see here in this layer l, we have the usual partitioning scheme, and then

154
00:13:00,110 --> 00:13:01,790
attention is carried out.

155
00:13:01,790 --> 00:13:09,080
Only in this local window. We see a local window to perform self-attention, see, with this red box surrounding it.

156
00:13:09,080 --> 00:13:10,820
So that's what we have here.

157
00:13:10,880 --> 00:13:12,980
Let's take this off so you could see that clearly.

158
00:13:12,980 --> 00:13:19,190
So you see attention is carried out locally in these local patches: this patch, this patch,

159
00:13:19,190 --> 00:13:21,050
this patch and this patch.

160
00:13:21,050 --> 00:13:25,460
And we've seen already the advantage of carrying out this attention locally.

161
00:13:25,670 --> 00:13:36,440
Then to solve the problem of global attention, they modify this partitioning scheme via this shifted

162
00:13:36,440 --> 00:13:43,310
window, where originally we could have something like this, and then we go, see, two steps downward,

163
00:13:43,700 --> 00:13:46,970
two steps to the right, and we stop here.

164
00:13:46,970 --> 00:13:55,610
We obtain this, and then for the next layer we could go again two this way, and see, we obtain this

165
00:13:55,610 --> 00:13:56,150
now.

166
00:13:57,600 --> 00:14:01,560
And so this means that unlike previously, let's take this off.

167
00:14:01,590 --> 00:14:12,000
Unlike previously, where only the pixels in this patch here could attend to one another.

168
00:14:12,030 --> 00:14:16,450
Now you see this pixel right here?

169
00:14:16,470 --> 00:14:24,600
This pixel is now found here, and it's now attending to this other pixel, which initially was on a

170
00:14:24,600 --> 00:14:25,560
different patch.

171
00:14:25,890 --> 00:14:33,270
Because now they're found at this position together, or they now make up this patch.
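
The effect of the shift can be illustrated with a toy sketch. It assumes an 8 by 8 grid of positions, windows of size 4, and a shift of 2 applied as a cyclic roll; window_id is a hypothetical helper, and the sketch ignores the masking that the paper applies to regions that wrap around the border.

```python
def window_id(row, col, window=4, shift=0, grid=8):
    """Return the (row, col) index of the window that a grid position falls into.

    With shift > 0 the grid is cyclically rolled by `shift` before partitioning,
    which is one way to realise a shifted window layout.
    """
    r = (row - shift) % grid
    c = (col - shift) % grid
    return (r // window, c // window)


a, b = (3, 3), (4, 4)  # two neighbouring positions on opposite sides of a window border
print(window_id(*a), window_id(*b))                    # (0, 0) (1, 1)
print(window_id(*a, shift=2), window_id(*b, shift=2))  # (0, 0) (0, 0)
```

With the regular partition the two positions sit in different windows; after the shift they share a window and can attend to one another.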

172
00:14:34,320 --> 00:14:46,260
And so with this now, not only is the local attention helping the Swin Transformer be more computationally

173
00:14:46,260 --> 00:14:47,250
efficient.

174
00:14:48,060 --> 00:14:56,850
On the other hand, we have this shifted window technique, which still helps us carry out global attention.

175
00:14:58,680 --> 00:15:07,590
And in this table right here, we can see the results of the Swin Transformer and its

176
00:15:07,590 --> 00:15:15,360
different versions: T, S, B with image size 224, and B with image size 384.

177
00:15:16,290 --> 00:15:26,700
And how it outperforms these DeiT and ViT models while maintaining a reasonable throughput.

178
00:15:27,360 --> 00:15:36,360
For example, here we see we have this ViT with image size 384, and then 307 million parameters.

179
00:15:36,360 --> 00:15:45,060
When we compare this with the Swin-B version, see, at the same image size, you see a much smaller number

180
00:15:45,060 --> 00:15:46,170
of parameters.

181
00:15:46,170 --> 00:15:54,810
Then the floating point operations here are smaller and then you see this, the throughput here is higher.

182
00:15:54,810 --> 00:16:00,930
And then also the top-1 accuracy on ImageNet is here higher too.

183
00:16:00,960 --> 00:16:10,620
Comparing this with the DeiT, although here we have a slightly higher number of parameters, for the number

184
00:16:10,620 --> 00:16:11,940
of floating point operations.

185
00:16:11,940 --> 00:16:19,590
This is fewer floating point operations. Then the throughput here is slightly below the DeiT-B version,

186
00:16:19,590 --> 00:16:26,540
but the accuracy here is higher than that of the DeiT. In this table

187
00:16:26,550 --> 00:16:33,660
Here we have something similar for ImageNet, trained on its 22,000-class version, and you could see

188
00:16:33,660 --> 00:16:41,520
that again, the Swin outperforms the ViT models and even the ResNets.

189
00:16:42,630 --> 00:16:52,260
Then another strong point of the Swin Transformer is that, due to its shifted-window-based self-attention,

190
00:16:52,920 --> 00:17:00,540
it also does very well on other computer vision tasks, like object detection and semantic segmentation.
