1
00:00:00,450 --> 00:00:01,770
Hi and welcome back.

2
00:00:01,890 --> 00:00:07,110
In this section, we'll take a look at implementing video classification, which we've done in the previous

3
00:00:07,110 --> 00:00:11,100
lesson, but now we'll be using Transformers to do that.

4
00:00:11,250 --> 00:00:12,420
So let's get started.

5
00:00:12,570 --> 00:00:14,520
So open notebook 67.

6
00:00:15,090 --> 00:00:21,180
And again, this notebook comes from the official Keras tutorial repository, and it's produced by Sayak

7
00:00:21,180 --> 00:00:21,780
Paul.

8
00:00:23,430 --> 00:00:24,580
So let's get started.

9
00:00:24,600 --> 00:00:30,720
So, as I just mentioned, we are basically doing the same thing we did before.

10
00:00:30,720 --> 00:00:37,140
But instead of using the CNN-RNN architecture, we're going to use a Transformer-based architecture.

11
00:00:37,530 --> 00:00:44,130
And Transformers should be naturally applicable to video data because it streams in a sequential order.

12
00:00:44,670 --> 00:00:50,730
And that's what makes Transformers so powerful: they capture the correlations between each of those

13
00:00:50,730 --> 00:00:51,210
inputs.

14
00:00:51,300 --> 00:00:53,340
So now let's get started.

15
00:00:53,350 --> 00:00:56,790
So firstly, we need to install TensorFlow docs.

16
00:00:56,790 --> 00:01:02,100
So let's do that; it takes about 23 seconds. Then we need to download the dataset.

17
00:01:02,100 --> 00:01:09,250
And the dataset again is the UCF101 dataset. It's a benchmark dataset used for action or

18
00:01:09,300 --> 00:01:10,260
video recognition.

19
00:01:11,280 --> 00:01:16,710
So you can see that it has many different scenes like this of different activities: baseball pitch,

20
00:01:16,710 --> 00:01:23,820
boxing, diving, cricket shot, a lot of different things here, cutting and coaching as well, so that

21
00:01:23,820 --> 00:01:25,740
has quite a variety of stuff.

22
00:01:26,280 --> 00:01:30,330
So we'll be using a subset of that with just a few different classes as well.

23
00:01:30,870 --> 00:01:34,770
So we download a subset here. It doesn't take that long, about 15 seconds.

24
00:01:34,770 --> 00:01:40,020
Then we load our libraries and define some hyperparameters that we'll be using for this.

25
00:01:40,950 --> 00:01:48,030
Next, we prepare the dataset, so we just get the training labels, and the training data as well.

26
00:01:48,030 --> 00:01:48,390
And so on.

27
00:01:49,290 --> 00:01:51,840
Then we just have some crop center functions.

28
00:01:52,290 --> 00:01:58,380
The load video function here uses OpenCV to load the video, as you can see, frame by frame.

29
00:01:59,520 --> 00:02:02,100
Then we have the build feature extractor part of it.

30
00:02:02,610 --> 00:02:08,280
So in this case, we're using DenseNet121, which is a fairly good network, actually.

31
00:02:09,420 --> 00:02:15,840
So we just build a feature extractor here and then we create that feature extractor right there.

32
00:02:16,410 --> 00:02:20,070
Then we do some label preprocessing here with Keras.

33
00:02:20,670 --> 00:02:23,850
And then we also prepare all the videos here in this function.

34
00:02:24,090 --> 00:02:30,240
So this loops over everything in the video paths and returns the frame features, which come from the feature

35
00:02:30,240 --> 00:02:33,190
extractor that we got, as well as the labels.

36
00:02:33,210 --> 00:02:35,100
So quite a lot going on there.

37
00:02:35,730 --> 00:02:40,620
So we do that and actually it takes about 10 seconds to run and these are the classes here.

38
00:02:40,620 --> 00:02:45,030
So we have cricket shot, playing cello, punch, shaving beard and tennis swing.

39
00:02:45,510 --> 00:02:47,760
So we just grab the subset of that.

40
00:02:47,760 --> 00:02:49,560
From the UCF101 data.

41
00:02:50,700 --> 00:02:55,590
Next, we need to download the prepared data here.

42
00:02:56,070 --> 00:03:00,510
This is what we produced in the previous lesson that took us about 20 minutes to download.

43
00:03:00,540 --> 00:03:03,300
So we just download the pickled versions here.

44
00:03:03,360 --> 00:03:04,360
As NumPy arrays.

45
00:03:04,380 --> 00:03:06,670
And we just load them here.

46
00:03:06,690 --> 00:03:09,780
Well, actually, it might be a pickle file, for all I know.

47
00:03:11,010 --> 00:03:14,490
And next we build the Transformer model here.

48
00:03:14,970 --> 00:03:22,440
So we have to encode the positional embeddings because, remember, Transformers, by default,

49
00:03:22,590 --> 00:03:24,480
don't have that positional embedding.

50
00:03:24,480 --> 00:03:26,100
So we have to add it back.

51
00:03:26,100 --> 00:03:29,910
And that's easy because the video frames are in a sequence, so we can use that.

52
00:03:30,770 --> 00:03:36,180
But then we have to create our Transformer encoder here with the multi-head attention. So remember this

53
00:03:36,180 --> 00:03:39,080
important little note there.

54
00:03:39,840 --> 00:03:45,690
Then we just have our training function, so we get the compiled model here, fit the model, and then

55
00:03:45,690 --> 00:03:47,910
we just run the experiment through here.

56
00:03:48,060 --> 00:03:50,700
So it's all relatively simple so far.

57
00:03:51,530 --> 00:03:57,150
Now we just start training our model and this actually trains quite quickly, just about 15 seconds,

58
00:03:57,150 --> 00:04:01,420
and we're getting pretty good accuracy here on our small dataset.

59
00:04:01,470 --> 00:04:05,700
That's because it's a small dataset and pretty much an easy dataset for it to learn.

60
00:04:06,480 --> 00:04:11,460
And now we can just prepare a single video here and predict the action in it.

61
00:04:12,090 --> 00:04:15,900
So it creates a GIF in the end, and there we go.

62
00:04:16,470 --> 00:04:17,670
This is a GIF.

63
00:04:18,070 --> 00:04:26,820
I guess it's a guy playing cello, and our classifier predicts playing cello very well, with 100 percent probability.

64
00:04:27,480 --> 00:04:36,330
So that's it for this lesson on using Transformers for video classification. I hope you

65
00:04:36,340 --> 00:04:40,170
enjoyed the lesson. We'll stop there, and I'll see you in the next lesson.

66
00:04:40,200 --> 00:04:40,620
Thank you.