1
00:00:00,210 --> 00:00:06,840
Hello, everyone, and welcome to this new and exciting session, in which we shall look at training

2
00:00:06,840 --> 00:00:11,580
data-efficient image transformers and distillation through attention.

3
00:00:12,180 --> 00:00:23,040
In this DeiT image transformer paper by Touvron et al., they show that it is possible to train an image transformer

4
00:00:23,040 --> 00:00:34,340
on a relatively small data set like ImageNet while getting competitive performance with no external

5
00:00:34,380 --> 00:00:34,980
data.

6
00:00:36,150 --> 00:00:42,360
Now, if you're new to this topic of image transformers, it is important for you to check out our previous

7
00:00:42,360 --> 00:00:46,680
section, where we treated image transformers from scratch.

8
00:00:46,800 --> 00:00:53,370
And one thing we noted from that section was that to train these image transformers' weights and

9
00:00:53,370 --> 00:00:55,140
get competitive results,

10
00:00:55,140 --> 00:01:03,360
we needed to do this on a very large data set, say, a data set of 300 million data points.

11
00:01:03,360 --> 00:01:10,290
And then, if we want to train on, or work with, a smaller data set, then we would do

12
00:01:10,290 --> 00:01:11,160
fine-tuning.

13
00:01:12,360 --> 00:01:19,080
And so, as we see with these DeiTs, they are able to produce results without necessarily training on this

14
00:01:19,080 --> 00:01:20,850
kind of large data set.

15
00:01:21,510 --> 00:01:29,550
In this section, we'll see how they use this teacher-student strategy to help the image transformer achieve

16
00:01:29,550 --> 00:01:34,950
competitive results without necessarily being trained on an extremely large data set.

17
00:01:35,340 --> 00:01:37,740
Before getting to the subject matter,

18
00:01:37,740 --> 00:01:43,080
the authors start with an overview of the vision transformer, which we've seen previously.

19
00:01:43,080 --> 00:01:50,340
So here they talk about this multi-head self-attention layer, and then they get to the transformer

20
00:01:50,340 --> 00:01:55,350
block as a whole, which takes into consideration that multi-head attention layer.

21
00:01:55,350 --> 00:02:02,160
And then from here they discuss the class token and how it is used by the image transformer.

22
00:02:02,160 --> 00:02:08,340
And then, finally, they discuss the modifications made to the positional encodings when carrying out fine-

23
00:02:08,340 --> 00:02:13,980
tuning. In order to understand the architecture of the DeiT transformer,

24
00:02:13,980 --> 00:02:21,600
we have to start by recalling that with the usual image transformers, this transformer, which doesn't

25
00:02:21,600 --> 00:02:30,120
have any, or which has very few, inductive biases, finds it difficult to learn from very small data

26
00:02:30,120 --> 00:02:30,840
sets.

27
00:02:30,840 --> 00:02:38,280
And so, as we increase the data set size, these transformers then start becoming much more performant.

28
00:02:38,880 --> 00:02:46,650
Now, what if, in this small-data regime right here, where the model's performance isn't that great,

29
00:02:46,650 --> 00:02:56,700
we are able to transfer knowledge from a convnet, which has better performance in

30
00:02:56,700 --> 00:03:05,700
this small-data regime, such that this image transformer takes advantage of the convnet's inductive

31
00:03:05,730 --> 00:03:14,040
bias to update its weights in such a way that it learns from these smaller data sets?

32
00:03:14,760 --> 00:03:23,700
Now, the authors devised this technique known as hard distillation, where we have this convnet, which

33
00:03:23,700 --> 00:03:29,460
is the teacher, and this image transformer, which is actually our usual image transformer, is the

34
00:03:29,460 --> 00:03:30,180
student.

35
00:03:31,110 --> 00:03:36,090
And then here they add this distillation token, in a similar way to how

36
00:03:36,090 --> 00:03:39,180
we added the class token, as we saw in the previous section.

37
00:03:39,180 --> 00:03:45,660
So here we have both the class and the distillation token, together with the patch tokens as usual.

38
00:03:45,990 --> 00:03:53,610
Now, what goes on here is that you have these class, distillation, and patch tokens, which are passed in, and then at the

39
00:03:53,610 --> 00:04:01,230
level of the outputs, we are going to ignore these patch outputs, and then we'll focus on the class and the distillation

40
00:04:02,070 --> 00:04:03,510
outputs right here.

41
00:04:04,440 --> 00:04:12,150
And so, instead of just minimizing this cross-entropy loss, we are going to minimize both this cross-

42
00:04:12,150 --> 00:04:17,190
entropy loss and this teacher loss right here, which is given by these two formulas.

43
00:04:17,190 --> 00:04:19,470
You see here we have the model's output.

44
00:04:19,470 --> 00:04:22,560
This is what the model outputs, and this is the true label.

45
00:04:22,560 --> 00:04:25,350
So usually this is what we calculated.

46
00:04:25,350 --> 00:04:30,750
So before we had just this, now we have this and this together.

47
00:04:30,750 --> 00:04:37,680
So now we're trying to minimize the difference between what the model outputs here and the actual

48
00:04:37,680 --> 00:04:47,400
output, and also the difference between what the model outputs and the teacher's output, y_t.

49
00:04:47,730 --> 00:04:54,390
So this means that we have this convnet right here, which has already been pre-trained on, say,

50
00:04:54,390 --> 00:04:59,220
ImageNet, and for a given input we know its output already.

51
00:05:00,160 --> 00:05:08,410
And the output of this convnet, which has been pre-trained, is not necessarily this y label, or the true

52
00:05:08,410 --> 00:05:10,930
value of y, or the true output.

53
00:05:11,470 --> 00:05:21,460
But with this, they found that after training to convergence, these tokens here, these embeddings we pass

54
00:05:21,460 --> 00:05:24,850
in are not very similar to one another.

55
00:05:24,970 --> 00:05:32,050
But as we go up, that is, as we get deeper into the model, right up to the end, we find that we have these

56
00:05:32,050 --> 00:05:35,530
two outputs now, which are very similar.

57
00:05:35,530 --> 00:05:44,500
And with this, we see how this teacher, or this convnet right here, which has been pre-trained on

58
00:05:44,500 --> 00:05:53,230
ImageNet, helps this image transformer to learn from very small, or relatively small, data. To get

59
00:05:53,230 --> 00:05:58,750
a sense of how small this is compared to the kind of data we worked with previously,

60
00:05:58,750 --> 00:06:04,030
ImageNet has about 1.2 million data points,

61
00:06:04,030 --> 00:06:11,020
while the data set we were working with previously had about 300 million data points.

62
00:06:11,380 --> 00:06:20,740
Then the authors here provide these three models, the Ti, the S, and the B, with their different embedding

63
00:06:20,740 --> 00:06:26,230
dimensions, number of heads, number of layers, number of parameters, the training resolution and

64
00:06:26,230 --> 00:06:27,160
the throughput.

65
00:06:27,340 --> 00:06:34,390
We see here that this DeiT-B version performs as well as the convnet teachers.

66
00:06:35,050 --> 00:06:42,520
Then, in this table right here, we see that it's preferable to use the DeiT with the hard distillation

67
00:06:42,520 --> 00:06:48,940
as compared to the usual distillation technique or when we work with no distillation at all.

68
00:06:49,060 --> 00:06:57,040
So the hard distillation adopted by these authors makes the image transformer perform better.

69
00:06:57,430 --> 00:07:05,230
And then we see that working with both the class and the distillation embeddings gives us better results

70
00:07:05,230 --> 00:07:10,960
as compared to working with only the class or only the distillation embeddings.

71
00:07:11,380 --> 00:07:12,220
In this table

72
00:07:12,220 --> 00:07:18,130
right here, we have a summary of the different convnets, so we have here different state-of-the-

73
00:07:18,130 --> 00:07:25,450
art convnets, and then state-of-the-art transformers, where they carry out this comparison with respect

74
00:07:25,450 --> 00:07:32,260
to the number of parameters, the image size, and the throughput. Now, the throughput is simply the number

75
00:07:32,260 --> 00:07:35,920
of images the model can process per second.

76
00:07:35,920 --> 00:07:41,890
So obviously you want the model or a model to have a high throughput because you want to be able to

77
00:07:41,890 --> 00:07:44,650
process as many images as possible per second.

78
00:07:44,800 --> 00:07:48,340
Then here we have this top-1 accuracy.

79
00:07:48,340 --> 00:07:56,200
We have this other top-1 accuracy on this other data set, and then this other one on this other data

80
00:07:56,200 --> 00:07:57,220
set right here.

81
00:07:57,970 --> 00:08:06,670
Now, if you look at this here, you'll see that this DeiT model outperforms the usual ViT, while

82
00:08:07,330 --> 00:08:18,040
being trained on less data, and hence closing the gap between the transformer performance and the convnet

83
00:08:18,040 --> 00:08:19,060
performance.

84
00:08:19,570 --> 00:08:27,100
That said, given that this DeiT model was trained on a smaller, or relatively smaller, data set compared

85
00:08:27,100 --> 00:08:34,360
to the initial ViT, the authors use data augmentation very much.

86
00:08:34,360 --> 00:08:41,560
So here you can see the RandAugment which was used, or the AutoAugment, the Mixup, the CutMix, and other

87
00:08:41,560 --> 00:08:45,370
regularization strategies to help avoid overfitting.

88
00:08:46,000 --> 00:08:52,660
And that's it for this section, where we have seen how to train these data-hungry image transformers with

89
00:08:52,660 --> 00:08:57,880
relatively smaller data while achieving even better performance.