1
00:00:00,690 --> 00:00:01,890
Hi and welcome back.

2
00:00:02,040 --> 00:00:09,150
So in this section, we'll be taking a look at video classification and what video classification seeks

3
00:00:09,150 --> 00:00:16,440
to do is this: you have a scene where something is happening. Imagine you're watching sports, so you

4
00:00:16,440 --> 00:00:22,290
know, like a football game is going on, or a bowling game, or perhaps you're watching someone in an action

5
00:00:22,290 --> 00:00:23,550
movie shooting a gun.

6
00:00:24,000 --> 00:00:30,150
Those are different actions, and you can classify a video, or classify each frame, by what action

7
00:00:30,150 --> 00:00:30,960
is going on.

8
00:00:30,990 --> 00:00:38,160
So let's take a look at how we can create a CNN-RNN architecture network that can actually take

9
00:00:38,160 --> 00:00:40,710
scenes of a video and classify them correctly.

10
00:00:41,310 --> 00:00:44,100
So let's go into this notebook 66.

11
00:00:44,100 --> 00:00:48,420
And again, this is from the official Keras tutorials site.

12
00:00:48,450 --> 00:00:55,050
This one is by Sayak Paul, and what we'll be using is the UCF101 dataset, so you

13
00:00:55,050 --> 00:00:56,970
can take a look at what the dataset is.

14
00:00:57,390 --> 00:01:04,110
It's basically a bunch of different actions or scenes like archery, whatever that is.

15
00:01:04,680 --> 00:01:05,220
Yeah, it's a dual.

16
00:01:05,220 --> 00:01:09,270
It's a drum, lunges, typing, different things as you can see.

17
00:01:09,750 --> 00:01:14,820
So you have all these different scenes going on, so you have tons of clips of each scene.

18
00:01:15,450 --> 00:01:20,100
And then basically we have to create a classifier to classify what's happening in each scene.

19
00:01:20,730 --> 00:01:22,910
So let's take a look at how we do that.

20
00:01:22,920 --> 00:01:27,660
So firstly, what we need to do is install TensorFlow Docs.

21
00:01:27,660 --> 00:01:30,450
So we just install that there; it takes about 30 seconds.

22
00:01:31,080 --> 00:01:36,090
Next, we need to download a subsampled version of the original dataset because it's quite big.

23
00:01:36,510 --> 00:01:40,360
So we just downloaded the top five classes, just five different classes here.

24
00:01:40,360 --> 00:01:46,290
That's what we'll be using. Next, we import our functions, all of the libraries, and then define some

25
00:01:46,290 --> 00:01:48,540
hyperparameters: the number of epochs, batch size,

26
00:01:48,540 --> 00:01:49,320
Image size.

27
00:01:49,860 --> 00:01:51,480
Some standard stuff there.

28
00:01:52,020 --> 00:01:53,880
Then we need to prepare data.

29
00:01:53,880 --> 00:01:56,640
So this is how the dataset labels are stored, in this train

30
00:01:56,640 --> 00:01:57,700
CSV and test CSV.

31
00:01:58,620 --> 00:02:00,750
And you can take a look at what it looks like here.

32
00:02:00,870 --> 00:02:07,740
You can see each video, each AVI file, and there are many of them, has a class tag, and the tags are punch,

33
00:02:07,740 --> 00:02:10,820
punch, tennis swing, playing cello, whatever.

34
00:02:11,100 --> 00:02:17,400
So you can take a look at basically some of the challenges and the neural networks.

35
00:02:17,400 --> 00:02:19,200
You can read about this in the blog post here.

36
00:02:20,010 --> 00:02:20,640
Different things.

37
00:02:20,640 --> 00:02:21,410
What's going on?

38
00:02:21,420 --> 00:02:23,390
So anyway, back to the lesson.

39
00:02:23,400 --> 00:02:26,100
So what we have now is some functions you'll be using.

40
00:02:26,100 --> 00:02:32,850
So a crop center square function, as well as a load video function that uses some OpenCV functions to load

41
00:02:32,850 --> 00:02:34,740
that video frame by frame.

42
00:02:35,310 --> 00:02:37,290
Then we have to get the features out of it.

43
00:02:37,290 --> 00:02:43,470
So we use the InceptionV3 model, which is pretrained on the ImageNet dataset, to extract features

44
00:02:43,740 --> 00:02:44,730
from our network.

45
00:02:45,270 --> 00:02:50,220
So we get that model here, the sort of feature extractor model.

46
00:02:50,900 --> 00:02:57,120
Next, we have a Keras StringLookup layer that just basically encodes the tags right here as labels.

47
00:02:57,690 --> 00:03:02,940
Then we have a function that prepares all of the videos here, and then we can create this test data

48
00:03:02,940 --> 00:03:08,940
and training data and associated labels right there, and you can see the size.

49
00:03:08,940 --> 00:03:12,510
So we have 594 different videos or labels.

50
00:03:12,690 --> 00:03:15,000
Well, essentially the data points in this.

51
00:03:15,570 --> 00:03:20,460
So now I should warn you: the above code here that prepares all of the videos takes quite a while.

52
00:03:20,460 --> 00:03:21,960
It takes about 20 minutes to run.

53
00:03:22,410 --> 00:03:25,260
So I've run that before, but just please be patient.

54
00:03:25,890 --> 00:03:29,550
I mean, 20 minutes isn't that long, but it does take a while.

55
00:03:30,600 --> 00:03:32,640
Now we have to create a sequence model.

56
00:03:33,090 --> 00:03:36,240
So remember, this is an RNN- and CNN-based architecture.

57
00:03:36,240 --> 00:03:43,110
So we have the class vocabulary here that we load, and we have basically our sequence RNN

58
00:03:43,110 --> 00:03:44,970
model going on right there.

59
00:03:45,750 --> 00:03:51,390
And then basically we can just start running the experiments and remember we use the features and sequence

60
00:03:51,840 --> 00:03:53,490
in this model to predict.

61
00:03:53,910 --> 00:03:59,640
So we train this model for 10 epochs, which doesn't take very long at all, and we get an accuracy of 53

62
00:03:59,640 --> 00:04:00,850
percent, which is OK.

63
00:04:01,710 --> 00:04:05,540
Pure guessing would have been 20 percent because there are five different classes.

64
00:04:06,090 --> 00:04:09,540
And now, we can actually run some inferences on that.

65
00:04:09,540 --> 00:04:13,000
So let's run an inference that takes a sample

66
00:04:13,020 --> 00:04:18,900
video, a random sample, then does a sequence prediction on it and then creates a GIF after.

67
00:04:19,320 --> 00:04:24,720
And then basically, you can print the label out here so you can see this video, which is clearly boxing

68
00:04:25,380 --> 00:04:25,980
or punching.

69
00:04:27,070 --> 00:04:30,970
You can see it gives two percent probability prediction that it's a punch.

70
00:04:30,990 --> 00:04:31,860
So that's quite good.

71
00:04:32,400 --> 00:04:35,430
Cricket shot, tennis swing: again, those are all swinging motions.

72
00:04:36,150 --> 00:04:41,850
Then playing cello and shaving beard. So you can see it got it right, even though it isn't as sure as

73
00:04:41,850 --> 00:04:42,490
it should be.

74
00:04:42,720 --> 00:04:43,980
It did get it right.

75
00:04:44,100 --> 00:04:48,790
So that's pretty good for such a quick training run, and you can do a lot more.

76
00:04:48,810 --> 00:04:53,670
You can take some of the advice here for next steps to see if we can get better accuracy, as well as

77
00:04:53,670 --> 00:04:55,650
try different samples of the dataset

78
00:04:55,650 --> 00:04:57,450
if you want to get more data to train.

79
00:04:57,750 --> 00:04:59,580
Just remember it takes a while to process that.

80
00:05:00,260 --> 00:05:06,860
So that's it for this video classification lesson. In the next section, we'll take a look at using Transformers

81
00:05:07,070 --> 00:05:08,240
to do the exact same thing.

82
00:05:08,840 --> 00:05:10,530
So I'll see you in the next lesson.

83
00:05:10,550 --> 00:05:10,940
Thank you.
