1
00:00:03,510 --> 00:00:09,210
YOLO-World is a zero-shot object detection model, which means we can do object detection without

2
00:00:09,210 --> 00:00:11,160
training the object detection model.

3
00:00:11,160 --> 00:00:17,550
In YOLO-World, all we have to do is prompt the model by specifying the list of classes that we are

4
00:00:17,550 --> 00:00:20,580
looking for in an image or video.

5
00:00:20,580 --> 00:00:21,720
And that's it.

6
00:00:21,720 --> 00:00:25,170
No training is required with YOLO-World.

7
00:00:25,260 --> 00:00:31,740
YOLO-World is designed to solve a limitation of existing zero-shot object detection models,

8
00:00:31,740 --> 00:00:33,090
which is speed.

9
00:00:33,090 --> 00:00:34,890
Or you can say latency.

10
00:00:34,890 --> 00:00:36,000
So what is latency?

11
00:00:36,030 --> 00:00:42,390
Latency is the time the object detection model takes to perform detection on an input

12
00:00:42,390 --> 00:00:42,900
image.

13
00:00:42,900 --> 00:00:49,320
So with YOLO-World, we address this limitation of existing zero-shot object detection models.

14
00:00:49,590 --> 00:00:57,180
Other state-of-the-art zero-shot object detection models, like Grounding DINO, use

15
00:00:57,180 --> 00:01:00,870
a transformer-based architecture, which is typically slow.

16
00:01:00,870 --> 00:01:07,290
But YOLO-World is designed using the faster CNN-based YOLO architecture.

17
00:01:07,290 --> 00:01:11,790
So YOLO-World is designed using the faster CNN-based YOLO architecture.

18
00:01:12,270 --> 00:01:16,290
So the downside of Grounding DINO is its speed.

19
00:01:16,290 --> 00:01:21,090
So the downside of Grounding DINO, or zero-shot object detection models, is speed.

20
00:01:21,090 --> 00:01:27,480
The Grounding DINO model takes around one second to process a single image, which is pretty

21
00:01:27,480 --> 00:01:27,840
slow

22
00:01:27,840 --> 00:01:32,760
if you are thinking about processing live video streams using Grounding DINO.

23
00:01:36,750 --> 00:01:40,350
The following chart is taken from the YOLO-World paper.

24
00:01:40,830 --> 00:01:47,550
From the chart we can see that YOLO-World maintains almost the same accuracy, and it is 20 times

25
00:01:47,550 --> 00:01:52,890
faster and five times smaller than the leading zero-shot object detection models.

26
00:01:53,280 --> 00:02:01,140
According to the YOLO-World paper, the small version of the YOLO-World model achieves up to 74.1

27
00:02:01,140 --> 00:02:06,210
FPS on an NVIDIA V100 GPU, which is quite impressive.

28
00:02:09,010 --> 00:02:16,420
Traditional object detection models such as Faster R-CNN, SSD, and YOLO are designed to detect

29
00:02:16,420 --> 00:02:19,810
objects within a predefined set of categories.

30
00:02:20,170 --> 00:02:26,650
For example, models trained on the COCO dataset are limited to 80 categories.

31
00:02:26,650 --> 00:02:32,620
Suppose you want a model to detect new objects that do not exist in the COCO dataset.

32
00:02:32,620 --> 00:02:37,480
For example, suppose I want to detect a pistol, or I want to detect personal protective equipment,

33
00:02:37,480 --> 00:02:40,240
that is, I want to do personal protective equipment detection.

34
00:02:40,240 --> 00:02:46,210
Then we need to create a dataset for personal protective equipment detection,

35
00:02:46,390 --> 00:02:50,230
where we have images of different personal protective equipment.

36
00:02:50,230 --> 00:02:55,540
And if I want to do pistol detection, I need to create a dataset where we have different

37
00:02:55,540 --> 00:02:56,770
images of pistols.

38
00:02:56,770 --> 00:02:59,440
Then we need to annotate those images.

39
00:02:59,440 --> 00:03:03,310
And then we need to train the object detection model on this dataset.

40
00:03:03,310 --> 00:03:07,000
This is, of course, very time-consuming.

41
00:03:07,000 --> 00:03:12,970
So, as a limitation, we have to collect the dataset, annotate it, and train

42
00:03:12,970 --> 00:03:14,200
the object detection model.

43
00:03:14,200 --> 00:03:18,370
This is what we need to do with a traditional object detection model.

44
00:03:18,370 --> 00:03:23,680
Due to these limitations, researchers began to develop open-vocabulary models.

45
00:03:23,680 --> 00:03:30,940
Traditional object detectors are basically fixed-vocabulary detection

46
00:03:30,940 --> 00:03:31,570
models.

47
00:03:31,570 --> 00:03:37,810
So researchers have tried to develop open-vocabulary detection models. YOLO-World

48
00:03:37,810 --> 00:03:40,060
is an open-vocabulary detection model.

49
00:03:40,060 --> 00:03:43,840
Grounding DINO is an open-vocabulary detection model.

50
00:03:43,840 --> 00:03:47,770
So, due to these limitations, such as the very time-consuming process,

51
00:03:47,770 --> 00:03:53,590
researchers began to develop open-vocabulary detection models.

52
00:03:53,590 --> 00:03:59,230
A few months back, Grounding DINO was introduced, which is an open-vocabulary

53
00:03:59,230 --> 00:03:59,980
detection model.

54
00:03:59,980 --> 00:04:02,590
Or you can say a zero-shot detection model.

55
00:04:02,590 --> 00:04:05,080
Why do we call it a zero-shot object detection model?

56
00:04:05,080 --> 00:04:09,520
Because we do not need to train the model to do object detection.

57
00:04:10,560 --> 00:04:17,490
So with zero-shot object detection models, or open-vocabulary object detection models, all

58
00:04:17,490 --> 00:04:22,440
we have to do is prompt the model by specifying the list of classes that we are looking for.

59
00:04:22,440 --> 00:04:25,980
We will see this with YOLO-World in the next lecture.

60
00:04:25,980 --> 00:04:31,740
With YOLO-World, we just need to specify the list of classes that we

61
00:04:31,740 --> 00:04:33,930
want to detect in an image or video.

62
00:04:33,930 --> 00:04:40,080
And we just need to pass that list of classes into the input prompt of the YOLO-World model.

63
00:04:40,080 --> 00:04:40,590
Okay.

64
00:04:41,100 --> 00:04:47,370
And we don't need to train the model on the specific classes that we want to detect in an image or video.

65
00:04:48,660 --> 00:04:53,880
So this is how it works with open-vocabulary detection models, which include

66
00:04:53,880 --> 00:04:55,230
YOLO-World and Grounding DINO.

67
00:04:55,230 --> 00:05:01,410
We just need to pass the list of classes that we want to detect in an image or video as a prompt

68
00:05:01,410 --> 00:05:03,750
to the YOLO-World or Grounding DINO model.

69
00:05:03,750 --> 00:05:06,690
But YOLO-World outperforms the Grounding DINO model.

70
00:05:06,690 --> 00:05:11,130
There are some limitations, or you can say downsides, of the Grounding DINO model.

71
00:05:11,130 --> 00:05:16,620
The downside of the Grounding DINO model is its speed: the Grounding DINO model

72
00:05:16,620 --> 00:05:19,860
takes around one second to process a single image.

73
00:05:20,160 --> 00:05:25,500
This is good enough if you don't care about latency, but pretty slow if you are

74
00:05:25,500 --> 00:05:29,760
thinking about processing live video streams using the Grounding DINO model.

75
00:05:31,310 --> 00:05:34,670
The reason is that Grounding DINO and other

76
00:05:35,060 --> 00:05:41,780
zero-shot object detection models, except for YOLO-World, use a heavy transformer-based architecture

77
00:05:41,780 --> 00:05:46,610
and require simultaneous processing of text and images during inference.

78
00:05:46,610 --> 00:05:54,110
That slows down the processing and increases the inference time, that is, the latency.

79
00:05:54,350 --> 00:05:54,920
Okay.

80
00:05:55,220 --> 00:05:59,720
And here comes YOLO-World, which is a zero-shot object detection model.

81
00:05:59,720 --> 00:06:02,690
It is about as accurate as Grounding DINO.

82
00:06:02,690 --> 00:06:08,720
And it is 20 times faster than Grounding DINO and other zero-shot models.

83
00:06:08,720 --> 00:06:13,910
And it is five times smaller than its predecessors, like Grounding DINO.

84
00:06:14,420 --> 00:06:20,540
So the YOLO-World model outperforms other zero-shot object detection models: it is 20 times faster,

85
00:06:20,960 --> 00:06:23,450
five times smaller, and it is equally accurate.

86
00:06:25,290 --> 00:06:26,370
So going ahead.

87
00:06:26,370 --> 00:06:27,750
So what is YOLO-World?

88
00:06:27,870 --> 00:06:34,320
YOLO-World was introduced in the research paper titled "YOLO-World: Real-Time Open-Vocabulary

89
00:06:34,320 --> 00:06:40,050
Object Detection," which shows a significant advancement in the field of open-vocabulary object detection

90
00:06:40,050 --> 00:06:45,810
by demonstrating that lightweight detectors, such as those from the YOLO series,

91
00:06:45,810 --> 00:06:53,760
like the YOLO-World model, which uses a faster CNN-based YOLO architecture as in YOLOv8, can achieve

92
00:06:53,760 --> 00:06:55,350
strong open-vocabulary performance.

93
00:06:55,350 --> 00:06:55,920
That is,

94
00:06:55,920 --> 00:07:01,290
it is as accurate as the other zero-shot object detection models.

95
00:07:01,320 --> 00:07:08,130
YOLO-World introduces the prompt-then-detect paradigm, which is a novel approach.

96
00:07:08,130 --> 00:07:14,340
So in YOLO-World, a novel approach called the prompt-then-detect paradigm is introduced that

97
00:07:14,340 --> 00:07:20,880
avoids the need for real-time text encoding, which is a property of other zero-shot

98
00:07:21,120 --> 00:07:23,220
object detection models like Grounding DINO.

99
00:07:23,640 --> 00:07:28,740
So in YOLO-World, we don't perform real-time text encoding.

100
00:07:28,740 --> 00:07:31,530
Real-time text encoding basically reduces the speed of the model.

101
00:07:31,530 --> 00:07:37,080
And this real-time text encoding is performed in Grounding DINO and other zero-shot object detection

102
00:07:37,080 --> 00:07:38,370
models as well.

103
00:07:39,790 --> 00:07:43,150
YOLO-World comes in three different model sizes.

104
00:07:43,150 --> 00:07:49,690
You have YOLO-World small, which has 13 million parameters, and when re-parameterized goes to 77 million,

105
00:07:49,690 --> 00:07:55,810
and the YOLO-World medium model has 29 million parameters, and when it is re-parameterized, it goes

106
00:07:55,810 --> 00:07:56,680
to 92 million.

107
00:07:56,680 --> 00:08:00,190
And the YOLO-World large model has 48 million parameters.

108
00:08:00,190 --> 00:08:04,810
And when it is re-parameterized, it is close to 110 million parameters.

109
00:08:05,140 --> 00:08:11,440
The YOLO-World team benchmarked the model on the LVIS dataset and measured the performance on an

110
00:08:11,440 --> 00:08:19,120
NVIDIA V100 GPU without any performance acceleration mechanism like quantization or TensorRT.

111
00:08:19,720 --> 00:08:26,710
According to the paper, YOLO-World reaches 35.4 mean average precision at

112
00:08:26,710 --> 00:08:37,180
52.0 FPS for the large model, and 26.2 mean average precision at 74.1 FPS for the small version.

113
00:08:37,180 --> 00:08:43,240
This performance is quite low if we compare it with other object detection models.

114
00:08:43,240 --> 00:08:48,550
But it is quite comparable if we compare it with other zero-shot object detection models.

115
00:08:48,550 --> 00:08:53,680
If you compare it with state-of-the-art object detection models like YOLOv9, this performance

116
00:08:53,680 --> 00:09:00,040
is low, but it is equally accurate when we compare it with models like Grounding DINO and GLIP.

117
00:09:01,110 --> 00:09:02,790
So here comes the conclusion.

118
00:09:03,270 --> 00:09:11,340
So, is YOLO-World a golden solution? Is it the model that ends training on a custom

119
00:09:11,340 --> 00:09:11,820
dataset?

120
00:09:11,820 --> 00:09:14,910
So do I consider YOLO-World a golden solution?

121
00:09:15,030 --> 00:09:19,950
That is, does it end training on a custom dataset, since you just need to pass the names of the classes that you

122
00:09:19,950 --> 00:09:26,040
want to detect in an image or video as a prompt to the YOLO-World model?

123
00:09:26,040 --> 00:09:26,910
So.

124
00:09:27,930 --> 00:09:33,300
But I don't think YOLO-World is a golden solution, because there are still cases where I would

125
00:09:33,300 --> 00:09:35,940
choose a model trained on a custom dataset,

126
00:09:36,270 --> 00:09:41,670
like a YOLOv8 or YOLOv9 model trained on the custom dataset, over a zero-shot object

127
00:09:41,670 --> 00:09:43,140
detector like YOLO-World.

128
00:09:44,270 --> 00:09:46,640
That is because there is one issue, which is latency.

129
00:09:46,640 --> 00:09:52,730
YOLO-World takes more time than other state-of-the-art object

130
00:09:52,730 --> 00:09:59,510
detectors to process an image; that is, its latency is much higher than that of other state-of-the-art

131
00:09:59,510 --> 00:10:00,800
object detection models.

132
00:10:01,340 --> 00:10:07,670
So although YOLO-World is faster than Grounding DINO, it is slow compared to other object detection models

133
00:10:07,670 --> 00:10:09,260
like YOLOv8 or YOLOv9.

134
00:10:11,080 --> 00:10:16,990
So if we require faster processing and have limited computational resources, for example,

135
00:10:16,990 --> 00:10:23,770
if I'm working on an NVIDIA T4 GPU, so I have limited GPU resources,

136
00:10:23,770 --> 00:10:29,860
and I require fast processing in real time, then I will use traditional object detectors

137
00:10:29,860 --> 00:10:37,000
like YOLOv9 or YOLOv8. Plus, YOLO-World is less accurate and less reliable compared to other object detectors

138
00:10:37,000 --> 00:10:42,370
like YOLOv7, YOLOv8, or YOLOv9 when trained on a custom dataset.

139
00:10:42,370 --> 00:10:49,390
So in short, YOLO-World is an important step in making open-vocabulary object detection models faster,

140
00:10:49,390 --> 00:10:55,930
cheaper, and widely available, maintaining nearly the same accuracy as its predecessors like Grounding

141
00:10:55,930 --> 00:10:56,290
DINO.

142
00:10:56,290 --> 00:11:02,800
And YOLO-World is 20 times faster and five times smaller than the leading zero-shot

143
00:11:03,340 --> 00:11:04,600
detectors.

144
00:11:04,600 --> 00:11:06,010
So that's all from my side.

145
00:11:06,040 --> 00:11:06,340
Thank you.