1
00:00:01,770 --> 00:00:02,760
Hi, welcome back.

2
00:00:03,180 --> 00:00:08,830
In this section we'll discuss ImageNet, which is extremely important to the computer vision community.

3
00:00:09,300 --> 00:00:15,120
And I'll tell you why: it's one of the largest datasets of labeled images that exists out there.

4
00:00:15,510 --> 00:00:16,500
You can see it right now.

5
00:00:16,510 --> 00:00:23,340
If you go to the ImageNet website, the entirety of the dataset has 14 million images

6
00:00:23,340 --> 00:00:25,110
and it actually is constantly growing.

7
00:00:25,530 --> 00:00:30,540
However, there are different versions of ImageNet, so we don't always use the 14-million-image

8
00:00:30,540 --> 00:00:30,960
version.

9
00:00:31,380 --> 00:00:36,590
In fact, the one we use is the 1,000-object-class ImageNet, which is what a lot of those prior

10
00:00:36,600 --> 00:00:38,340
CNNs that we discussed earlier

11
00:00:38,710 --> 00:00:39,690
are benchmarked on.

12
00:00:40,140 --> 00:00:42,750
It's got 1,000 classes, which I mentioned here.

13
00:00:43,230 --> 00:00:46,740
So let's talk a bit about what ImageNet is used for.

14
00:00:46,860 --> 00:00:50,700
Basically, it's used to train our CNN models.

15
00:00:50,700 --> 00:00:56,040
And then, by having such strong, well-trained CNN models, we can benchmark different architectures

16
00:00:56,460 --> 00:00:59,700
and see which is performing best on such a hard dataset.

17
00:01:00,150 --> 00:01:06,570
It spans 1,000 object classes, as I said, and the ImageNet version that's used for

18
00:01:06,570 --> 00:01:07,110
benchmarking

19
00:01:07,110 --> 00:01:15,240
these things consists of 1.2 million training images, 50,000 validation images and 100,000 test images.

20
00:01:15,630 --> 00:01:20,870
This subset, because the entire dataset is 14 million, is available on Kaggle.

21
00:01:20,880 --> 00:01:25,410
So if you wanted to experiment and train and benchmark your models on these, you can.

22
00:01:25,770 --> 00:01:27,810
Although it's going to take a very, very long time.

23
00:01:28,500 --> 00:01:34,260
So as I said, it's the most common benchmark used when evaluating new CNN architectures, and

24
00:01:34,260 --> 00:01:39,720
you can see how well cited it is: the two different flavors of ImageNet are cited, when you combine

25
00:01:39,720 --> 00:01:42,660
them, just over 7,000 times, which is quite a bit.

26
00:01:42,660 --> 00:01:46,920
And it's probably grown a lot since this image was taken and shown on the ImageNet website.

27
00:01:48,690 --> 00:01:52,440
However, just so you know, it's hardly the first image dataset.

28
00:01:56,980 --> 00:02:01,060
So, just so you know, ImageNet isn't the first big image dataset.

29
00:02:01,450 --> 00:02:05,110
In fact, obviously you've seen the MNIST dataset, which we have worked with before.

30
00:02:05,620 --> 00:02:07,880
However, we haven't taken a look at many of these.

31
00:02:07,900 --> 00:02:10,180
This is one we may experiment on, a sign language one.

32
00:02:10,180 --> 00:02:12,790
One is of use a couple of years.

33
00:02:12,820 --> 00:02:17,320
The Caltech and CIFAR datasets are quite useful as well

34
00:02:17,320 --> 00:02:24,430
to benchmark networks on. Lotus Hill is a good segmentation dataset, and there's the Pascal dataset, which

35
00:02:24,430 --> 00:02:27,010
is used when benchmarking object detectors as well.

36
00:02:27,430 --> 00:02:29,650
So a lot of these are quite useful.

37
00:02:31,280 --> 00:02:32,430
But let's take a look at this.

38
00:02:32,450 --> 00:02:33,950
What makes ImageNet so good?

39
00:02:34,280 --> 00:02:39,270
Well, its size: 15 million images, basically, is what it's almost up to right now.

40
00:02:39,710 --> 00:02:46,550
And you can see how that compares if you were to use this type of representation to size

41
00:02:46,550 --> 00:02:48,530
ImageNet against the other datasets available.

42
00:02:48,920 --> 00:02:52,850
You can see how tiny they are and how vast ImageNet is in comparison.

43
00:02:54,140 --> 00:02:56,690
So there's also WordNet for ImageNet.

44
00:02:56,690 --> 00:02:57,860
WordNet.

45
00:02:57,860 --> 00:03:06,440
It is basically a semantic mapping and an ontological structure based on the relationships of the images

46
00:03:06,440 --> 00:03:07,010
and words.

47
00:03:07,340 --> 00:03:12,880
So you can see things like German Shepherd belongs to dog, which belongs to animal,

48
00:03:13,220 --> 00:03:17,150
which belongs to a different entity, which is probably something like mammals or four-legged creatures.

49
00:03:17,870 --> 00:03:19,340
So you can see all of these.

50
00:03:19,520 --> 00:03:23,960
All of these images are mapped into this logical structure.

51
00:03:24,350 --> 00:03:30,080
And this gives us the ability to bring a human level understanding to images because we can take an

52
00:03:30,080 --> 00:03:36,230
image like this one, a guy standing on a scale, and basically get all these observations

53
00:03:36,230 --> 00:03:36,830
out of it.

54
00:03:37,340 --> 00:03:39,110
So,

55
00:03:39,260 --> 00:03:45,530
this is going to be quite a good stepping stone for AI to be developed further and get better analysis

56
00:03:45,530 --> 00:03:47,780
and understanding from images.

57
00:03:48,470 --> 00:03:55,730
So you've seen what ImageNet is and you've seen that we use it commonly to train and benchmark CNNs.

58
00:03:56,300 --> 00:03:58,620
Now, who is doing the best on ImageNet?

59
00:03:58,640 --> 00:04:05,900
Well, in late 2021, when I recorded this video, you can see Meta Pseudo Labels,

60
00:04:05,900 --> 00:04:12,650
using a version of EfficientNet, EfficientNet-L2, is currently getting the best performance.

61
00:04:12,680 --> 00:04:17,670
This is a ranking on Papers with Code, which you could go to to get more up-to-date information.

62
00:04:18,380 --> 00:04:22,160
And it's ninety point two percent accuracy, which is incredible.

63
00:04:22,670 --> 00:04:24,920
However, take a look at the number of parameters.

64
00:04:24,980 --> 00:04:32,150
It is massive: 480 million parameters. The network is huge, and it's probably not going to ever be

65
00:04:32,150 --> 00:04:35,720
deployed, at least not right now in real world applications.

66
00:04:36,080 --> 00:04:38,120
It's purely research and experimental.

67
00:04:38,660 --> 00:04:41,060
All of these are huge networks, as you can see.

68
00:04:41,480 --> 00:04:44,270
And they do get extremely good results.

69
00:04:44,270 --> 00:04:48,890
You can see the top five accuracy is close to 100 percent, which is remarkable.

70
00:04:49,190 --> 00:04:51,520
That's very, very good generalization.

71
00:04:51,950 --> 00:04:57,710
So we'll stop there for now and we'll go into the code and start using some of these networks.

72
00:04:57,710 --> 00:04:59,630
We'll start with AlexNet and LeNet.

73
00:04:59,960 --> 00:05:04,460
We'll learn how to build them in PyTorch and Keras.

74
00:05:04,970 --> 00:05:09,650
Then we'll take a look at loading some of these pre-trained networks like VGG and ResNets and

75
00:05:09,650 --> 00:05:12,320
experimenting and running inference on images with them.

76
00:05:13,280 --> 00:05:19,090
So that's it for now, and I'll see you in the next section where we start building LeNet and AlexNet

77
00:05:19,100 --> 00:05:21,980
in Keras, so I'll see you shortly.

78
00:05:21,990 --> 00:05:22,280
Bye.