1
00:00:00,330 --> 00:00:01,470
Hi and welcome back.

2
00:00:01,620 --> 00:00:07,680
In this section, we'll take a look at Vision Transformers, which are, pardon the pun, transforming

3
00:00:07,680 --> 00:00:15,330
the computer vision world right now, and they're almost replacing CNNs in many of the state of

4
00:00:15,330 --> 00:00:17,820
the art computer vision deep learning tasks.

5
00:00:18,360 --> 00:00:19,500
So let's get started.

6
00:00:19,590 --> 00:00:22,650
So what exactly are vision transformers?

7
00:00:23,040 --> 00:00:24,090
Well, first, what they're not.

8
00:00:24,090 --> 00:00:29,130
They're not those transformers from the movie, and they're also not electrical transformers that do

9
00:00:29,140 --> 00:00:31,350
step up or step down voltage changes.

10
00:00:31,980 --> 00:00:34,460
What they are is actually something else.

11
00:00:34,470 --> 00:00:36,480
It's a method or algorithm.

12
00:00:36,630 --> 00:00:44,460
You can say it's a type of neural network design that came from the NLP world in a famous paper in 2017

13
00:00:44,460 --> 00:00:47,400
that was called "Attention Is All You Need."

14
00:00:47,760 --> 00:00:54,180
And this was a groundbreaking paper in the NLP world because it allowed them to encode

15
00:00:54,330 --> 00:00:59,730
way more information into their networks, beating LSTMs and RNNs

16
00:01:00,150 --> 00:01:07,050
at the time on so many NLP tasks. And pretty much now, all they use is transformers.

17
00:01:07,050 --> 00:01:12,060
And if you want to do state of the art stuff, at least not simple classifiers, but real state

18
00:01:12,060 --> 00:01:18,900
of the art NLP is almost all transformers, and that shift is also happening in the computer vision world

19
00:01:18,990 --> 00:01:19,560
right now.

20
00:01:19,920 --> 00:01:26,130
And they use something called self-attention, which allows us to take sequential inputs like words

21
00:01:26,520 --> 00:01:32,220
and find the correlations between different features and store that information as part of the network's

22
00:01:32,220 --> 00:01:32,760
knowledge.

23
00:01:32,970 --> 00:01:37,350
And from 2020, it started to gain a lot of traction in the computer vision world.

24
00:01:37,860 --> 00:01:44,250
And you can see why. You can see in this task here, or in this model, that this T2T vision transformer is

25
00:01:44,250 --> 00:01:53,100
actually beating ResNets and previous ViTs as well, in accuracy and model size and MACs

26
00:01:53,100 --> 00:01:53,520
as well.

27
00:01:54,000 --> 00:02:00,390
And you can see it's actually getting better results at a similar model size to MobileNet

28
00:02:00,450 --> 00:02:01,140
version 2.

29
00:02:01,380 --> 00:02:06,510
And this is just one transformer network that the researchers are presenting here, comparing it to

30
00:02:06,510 --> 00:02:07,910
a few of the popular networks.

31
00:02:07,920 --> 00:02:11,580
It doesn't mean this is the best right now, but it's pretty good.

32
00:02:12,720 --> 00:02:14,640
So how is it used for images?

33
00:02:14,670 --> 00:02:21,600
Well, firstly, we've got to split an image into patches, then flatten those patches, and then create,

34
00:02:21,660 --> 00:02:26,970
for basically all of those patches, a lower dimensional linear embedding formed from those flattened

35
00:02:26,970 --> 00:02:27,540
patches.

36
00:02:27,990 --> 00:02:34,380
And then next, we add a positional embedding alongside the lower dimensional vector that

37
00:02:34,380 --> 00:02:34,920
we've created.

38
00:02:35,670 --> 00:02:40,050
And usually this is a trainable position embedding; that's what's typically used.

39
00:02:40,560 --> 00:02:46,140
And then we feed that sequence as an input into a standard transformer encoder.

40
00:02:46,710 --> 00:02:51,810
And next, we pre-train the model with image labels on a very large dataset, something like

41
00:02:51,810 --> 00:02:52,110
ImageNet.

42
00:02:52,590 --> 00:02:58,350
And then when you want to train it with your own dataset, you fine-tune that network on the

43
00:02:58,350 --> 00:03:01,470
downstream dataset for your image classification tasks.

44
00:03:02,160 --> 00:03:07,560
Now, let's take a look at this very cool animation produced by Google that explains Vision Transformers.

45
00:03:08,070 --> 00:03:13,440
So you take an image here, you split it up into patches, as you can see here, then you have a linear

46
00:03:13,440 --> 00:03:14,910
projection of the flattened patches.

47
00:03:14,940 --> 00:03:23,010
You add the positional embedding too, and then you pass it through the transformer encoder and into an MLP head.

48
00:03:23,250 --> 00:03:25,050
And then we predict the classes out of it.

49
00:03:25,500 --> 00:03:27,600
So let's watch this one more time.

50
00:03:27,600 --> 00:03:32,400
You can see the positional embedding here after we have a projection of flattened patches.

51
00:03:32,910 --> 00:03:37,660
This goes into a transformer encoder, then to an MLP head, and then the classification results

52
00:03:37,890 --> 00:03:38,430
go here.

53
00:03:39,840 --> 00:03:47,760
So let's talk a bit more about Vision Transformers. The image patches are analogous to sequence

54
00:03:47,760 --> 00:03:49,920
tokens, which are words in NLP.

55
00:03:50,370 --> 00:03:57,150
That's how we treat images as inputs into a transformer network, which relies on a sequential input.

56
00:03:57,810 --> 00:04:03,090
And if you wanted to get a deeper vision transformer, you can just increase the number of blocks inside

57
00:04:03,090 --> 00:04:04,260
of the Transformer encoder.

58
00:04:04,680 --> 00:04:10,590
And this is what the blocks look like. The multi-head attention is the most important

59
00:04:10,590 --> 00:04:14,340
part, or one of the most important parts, of the transformer encoder.

60
00:04:15,060 --> 00:04:21,630
And it's important to know that ViTs do require a lot of data to be trained to even be comparable to

61
00:04:21,630 --> 00:04:22,860
state of the art CNNs.

62
00:04:23,400 --> 00:04:28,650
So that basically means training from scratch when you have a smaller dataset is not really an

63
00:04:28,650 --> 00:04:34,040
option with Vision Transformers, so you may as well stick to a CNN. Something like a small CNN,

64
00:04:34,050 --> 00:04:39,480
like a ResNet-18, would probably work well for most general image classification tasks.

65
00:04:40,110 --> 00:04:46,590
However, what we can do is take a pre-trained Vision Transformer and then fine-tune it on your smaller

66
00:04:46,590 --> 00:04:50,460
dataset and change the MLP head's outputs to your number of classes.

67
00:04:50,460 --> 00:04:54,900
So if it's like five classes or two classes, change it and that's it.

68
00:04:55,350 --> 00:04:59,520
So we'll stop this now because this is just a brief overview.

69
00:05:00,070 --> 00:05:03,790
An introduction to Vision Transformers.

70
00:05:04,270 --> 00:05:10,120
What we'll do next, we'll take a look at some implementations of Vision Transformers in both PyTorch

71
00:05:10,120 --> 00:05:10,960
and Keras.

72
00:05:11,500 --> 00:05:14,320
So stay tuned for that and I'll see you in the next lessons.

73
00:05:14,410 --> 00:05:15,220
Thank you for watching.
