1
00:00:02,530 --> 00:00:08,770
YOLO-World is a zero-shot object detection model, which enables us to do object detection without

2
00:00:08,770 --> 00:00:13,390
needing to train the object detection model.

3
00:00:14,130 --> 00:00:21,360
In YOLO-World, we only need to define a prompt as a list of class names that we want to detect in an

4
00:00:21,360 --> 00:00:25,530
image, a video frame, or a complete video.

5
00:00:26,460 --> 00:00:32,190
One limitation of existing zero-shot object detection models, including Grounding DINO,

6
00:00:32,190 --> 00:00:33,120
is speed.

7
00:00:33,540 --> 00:00:35,010
Or you can say latency.

8
00:00:35,010 --> 00:00:40,980
Latency is basically the time taken by the object detection model to do object detection on an image.

9
00:00:41,310 --> 00:00:47,190
So existing zero-shot object detection models take quite some time to do object detection

10
00:00:47,190 --> 00:00:50,910
on an image or on a frame of a video.

11
00:00:50,940 --> 00:00:54,720
So in YOLO-World, this limitation has been addressed.

12
00:00:54,720 --> 00:01:00,000
So YOLO-World was designed to solve a limitation of existing zero-shot object detection models, which

13
00:01:00,000 --> 00:01:01,410
is speed or latency.
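To make the link between latency and frame rate concrete, here is a minimal sketch. It assumes nothing about YOLO-World itself: `fake_detector` is a hypothetical stand-in for a real model call, and `measure_latency` and `fps_from_latency` are illustrative helper names, not part of any library.

```python
import time


def measure_latency(fn, runs=5):
    """Return the average wall-clock seconds per call of fn over `runs` calls."""
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


def fps_from_latency(latency_seconds):
    """Convert per-frame latency in seconds to frames per second."""
    return 1.0 / latency_seconds


def fake_detector():
    # Hypothetical stand-in for a real detection call on one frame
    time.sleep(0.01)


latency = measure_latency(fake_detector)
print(f"latency: {latency * 1000:.1f} ms, about {fps_from_latency(latency):.1f} FPS")
```

A model that needs 20 ms per frame runs at roughly 50 FPS, so lower latency directly means a higher frame rate.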

14
00:01:02,100 --> 00:01:08,580
The existing state-of-the-art zero-shot object detection models, such

15
00:01:08,580 --> 00:01:16,170
as Grounding DINO, use transformers in their architecture, which is a powerful architecture but a slower

16
00:01:16,170 --> 00:01:17,190
architecture.

17
00:01:17,520 --> 00:01:19,800
In the YOLO-World architecture,

18
00:01:21,710 --> 00:01:22,160
the faster

19
00:01:22,160 --> 00:01:28,910
CNN-based YOLO architecture is used. According to the YOLO-World paper

20
00:01:29,270 --> 00:01:31,610
that the team has published,

21
00:01:31,640 --> 00:01:42,290
YOLO-World achieves an average precision (AP) of 35.4 with 52.0 FPS for the large version, and 26.2 average

22
00:01:42,290 --> 00:01:46,730
precision with 74.1 FPS for the small version,

23
00:01:47,900 --> 00:01:52,880
on the V100 GPU, an NVIDIA V100 GPU, which is a powerful GPU.

24
00:01:53,270 --> 00:02:01,880
Achieving such a high frame rate, around 52.0 FPS, with an average precision of 35.4, is

25
00:02:02,240 --> 00:02:03,500
very impressive.

26
00:02:04,340 --> 00:02:11,780
So as I told you, in YOLO-World you only need to pass the class names as a list into the input

27
00:02:11,780 --> 00:02:12,410
prompt.

28
00:02:12,410 --> 00:02:18,710
So, for example, if I want to detect the person in red,

29
00:02:18,740 --> 00:02:24,170
I just need to write "the person in red" in the prompt, and the YOLO-World model will

30
00:02:24,170 --> 00:02:28,040
detect this person, who is in a red jacket.

31
00:02:28,040 --> 00:02:35,420
And if I want to detect the brown animals in an image which has many different animals,

32
00:02:35,420 --> 00:02:41,360
I just need to pass "the brown animal" in the input prompt, and it will detect the brown animal

33
00:02:41,360 --> 00:02:42,140
in the image.

34
00:02:42,140 --> 00:02:46,310
And if you want to detect the tallest person in the image, you just need to pass the input prompt

35
00:02:46,310 --> 00:02:49,400
"the tallest person", and it will detect the tallest person in the image.

36
00:02:49,400 --> 00:02:53,780
Similarly, if you want to detect the person with a white shirt, you just need to pass the prompt as

37
00:02:53,780 --> 00:02:57,020
a person with a white shirt and it will detect a person with a white shirt.

38
00:02:58,310 --> 00:03:04,010
But before we run the script, let's make sure that we have selected the runtime

39
00:03:04,010 --> 00:03:04,820
as GPU.

40
00:03:04,820 --> 00:03:07,190
So we have selected the runtime as T4 GPU.

41
00:03:07,190 --> 00:03:10,520
So if you just go over here, you can see that we have selected T4 GPU.

42
00:03:10,520 --> 00:03:17,660
So in this tutorial we will see how we can do object detection on an image or on a video using the YOLO-World

43
00:03:18,260 --> 00:03:18,920
model.

44
00:03:20,750 --> 00:03:22,190
So let's run this cell.

45
00:03:22,190 --> 00:03:26,960
So this just confirms that we have selected the GPU runtime, and it shows details like GPU memory usage.

46
00:03:26,960 --> 00:03:27,890
And that's fine.

47
00:03:28,220 --> 00:03:36,110
So now, to manage datasets, images, and models, let us set our current working directory as

48
00:03:36,110 --> 00:03:36,530
home.

49
00:03:37,550 --> 00:03:39,500
So now we will install the required packages.

50
00:03:39,500 --> 00:03:43,700
There are two main packages which we will be using.

51
00:03:43,700 --> 00:03:48,620
There are other packages or libraries as well, but these are the two main Python packages

52
00:03:48,620 --> 00:03:50,210
that we will be using in this tutorial.

53
00:03:50,210 --> 00:03:53,240
One is inference and the other is supervision.

54
00:03:53,240 --> 00:03:56,870
So both of these are available under Roboflow.

55
00:03:56,870 --> 00:04:02,690
So inference is for zero-shot object detection using YOLO-World, and supervision is for post-processing

56
00:04:02,690 --> 00:04:06,590
after doing zero-shot object detection on an image or on a video.

57
00:04:06,980 --> 00:04:10,850
So as I told you, both of these packages are available under Roboflow.

58
00:04:10,850 --> 00:04:16,250
So Roboflow inference is an open source platform designed to simplify the deployment of computer vision

59
00:04:16,400 --> 00:04:17,450
models.

60
00:04:17,450 --> 00:04:23,780
So using inference, we can perform object detection, classification, and

61
00:04:23,780 --> 00:04:25,100
instance segmentation.

62
00:04:25,100 --> 00:04:31,070
So using inference you can use the YOLOv9 model, the YOLOv8 model, or the YOLO-World model.

63
00:04:31,070 --> 00:04:36,140
So these are different models that you can use in a single package.

64
00:04:37,060 --> 00:04:43,000
So the inference package enables developers to perform object detection, classification, and

65
00:04:43,000 --> 00:04:44,350
instance segmentation.

66
00:04:44,620 --> 00:04:49,300
You can use different segmentation models, which include the Segment Anything Model as well.

67
00:04:49,300 --> 00:04:55,870
Plus, YOLO-World is also available under this Python package, which is inference.

68
00:04:55,870 --> 00:04:56,290
Okay.

69
00:04:56,530 --> 00:05:04,570
So to do object detection on an image or video, we will be using the inference package, so that we can

70
00:05:04,570 --> 00:05:10,480
do object detection on an image or video using YOLO-World, and to do post-processing on the results,

71
00:05:10,480 --> 00:05:13,420
we will be using the supervision package, okay.

72
00:05:13,540 --> 00:05:15,220
So let's install this package.

73
00:05:20,650 --> 00:05:23,020
So the package installation will take quite some time.

74
00:05:23,020 --> 00:05:26,440
So let's wait for this package to get installed and then I will go back.

75
00:05:29,780 --> 00:05:31,580
So the inference package is being installed.

76
00:05:31,580 --> 00:05:34,400
So next we will install the supervision package.

77
00:05:35,600 --> 00:05:39,290
So this will take a few more seconds before this package gets installed.

78
00:05:41,520 --> 00:05:42,120
Okay.

79
00:05:42,120 --> 00:05:44,340
Okay, so this package is being installed.

80
00:05:44,340 --> 00:05:51,120
So we require the supervision package for post-processing after we have done detection using the inference

81
00:05:51,120 --> 00:05:53,910
package with the YOLO-World model.

82
00:05:55,010 --> 00:06:00,110
So now I'm also installing the gdown package, because I will be downloading a sample image and a

83
00:06:00,110 --> 00:06:01,760
video from my Google Drive.

84
00:06:01,790 --> 00:06:08,000
So to download a sample image or a video from Google Drive directly into the Google Colab notebook,

85
00:06:08,000 --> 00:06:10,220
we require the gdown package.

86
00:06:11,480 --> 00:06:13,550
So next we will be importing the required libraries.

87
00:06:13,550 --> 00:06:18,800
We require the OpenCV Python library so that we can draw bounding boxes,

88
00:06:18,800 --> 00:06:21,710
add text, and all this stuff.

89
00:06:21,710 --> 00:06:24,440
Then we require the supervision package for post-processing.

90
00:06:24,440 --> 00:06:26,840
Then we require the math package,

91
00:06:26,840 --> 00:06:29,600
or library, so that we can calculate the confidence score.

92
00:06:29,630 --> 00:06:31,670
Then we also require the tqdm library.

93
00:06:31,670 --> 00:06:37,670
And then we require the inference library, or inference package, so that we can do object detection

94
00:06:37,670 --> 00:06:39,440
on an image or video using YOLO-World.

95
00:06:39,440 --> 00:06:42,890
Then, to display an image in the Google Colab notebook,

96
00:06:42,980 --> 00:06:45,530
from IPython.display

97
00:06:45,530 --> 00:06:48,260
we require Image.

98
00:06:49,520 --> 00:06:52,340
So now I'm downloading a sample image.

99
00:06:52,990 --> 00:06:59,170
And two demo videos from my Google Drive directly into this Google Colab notebook.

100
00:06:59,590 --> 00:07:05,020
So I've added the image and these two demo videos to Google Drive, and I'm just adding the link over

101
00:07:05,020 --> 00:07:05,500
here.

102
00:07:05,500 --> 00:07:10,360
So I will be able to download this image and these videos from my Google Drive directly into this Google

103
00:07:10,360 --> 00:07:11,110
Colab notebook.

104
00:07:11,290 --> 00:07:14,380
So let's run these two cells over here.

105
00:07:16,900 --> 00:07:19,570
So now you can see that we have downloaded a sample image.

106
00:07:19,570 --> 00:07:24,910
And now I'm also downloading two demo videos, so that I can do object detection

107
00:07:24,910 --> 00:07:27,310
on this image and video using YOLO-World.

108
00:07:27,610 --> 00:07:31,630
So now I have just set the source image path to this image which I have downloaded.

109
00:07:31,630 --> 00:07:36,640
And I have just set the source video path to one of the videos which I have downloaded over here,

110
00:07:36,640 --> 00:07:37,510
which is test or.

111
00:07:41,000 --> 00:07:46,460
So YOLO-World comes in three different models: the YOLO-World small model, the YOLO-World medium model,

112
00:07:46,460 --> 00:07:47,930
and the YOLO-World large model.

113
00:07:47,930 --> 00:07:52,310
In this tutorial I will be using the YOLO-World large model, but you can try out the YOLO-World medium

114
00:07:52,310 --> 00:07:53,930
and small models as well.

115
00:07:53,930 --> 00:07:59,150
Okay, so we are using the inference package, which is available under Roboflow.

116
00:07:59,150 --> 00:08:03,020
So to use the YOLO-World model, I'm using the inference package,

117
00:08:03,260 --> 00:08:05,210
which is available under Roboflow.

118
00:08:05,210 --> 00:08:07,640
But we don't require a Roboflow API key

119
00:08:07,670 --> 00:08:10,730
to use the YOLO-World model.

120
00:08:11,180 --> 00:08:11,630
Okay.

121
00:08:12,080 --> 00:08:17,750
So now you can see I have just initialized, or loaded, the YOLO-World model over

122
00:08:17,750 --> 00:08:23,570
here, so you can see that we have created an instance of the YOLOWorld class over here.

123
00:08:23,570 --> 00:08:29,540
And with the model ID, I have just initialized the YOLO-World large model, which I will be

124
00:08:29,540 --> 00:08:30,950
using in this tutorial.

125
00:08:34,030 --> 00:08:34,390
Okay.

126
00:08:38,920 --> 00:08:44,050
So here you can see that we have loaded the YOLO-World model by simply creating an instance

127
00:08:44,050 --> 00:08:45,370
of the YOLOWorld class.

128
00:08:45,370 --> 00:08:49,330
So this YOLOWorld class has two core functions.

129
00:08:49,330 --> 00:08:51,700
One is set_classes and the other is infer.

130
00:08:51,700 --> 00:08:52,180
Okay.

131
00:08:52,810 --> 00:08:59,230
So in the previous zero-shot object detection models, which include Grounding DINO as well,

132
00:08:59,230 --> 00:09:04,510
real-time text encoding is being done.

133
00:09:04,510 --> 00:09:11,770
But in the case of YOLO-World, we do not do real-time text encoding, because YOLO-World utilizes the

134
00:09:11,770 --> 00:09:13,420
prompt-then-detect paradigm.

135
00:09:13,420 --> 00:09:17,590
In YOLO-World, we first pass the prompt and then we do the detection.

136
00:09:17,590 --> 00:09:21,220
So we do not do real-time text encoding, okay.

137
00:09:22,180 --> 00:09:28,120
So you can see over here, in classes, I've just created a list of the different objects that I want

138
00:09:28,120 --> 00:09:30,550
to detect in my input image okay.

139
00:09:30,850 --> 00:09:31,600
So.

140
00:09:33,750 --> 00:09:35,880
So you can see over here, this is my prompt.

141
00:09:35,880 --> 00:09:40,740
So you can say that this is my input prompt, which I'm passing over here, okay.

142
00:09:40,860 --> 00:09:46,980
So the YOLOWorld class has the set_classes method

143
00:09:46,980 --> 00:09:47,700
as well.

144
00:09:47,700 --> 00:09:52,380
So when I call the set_classes method over here,

145
00:09:52,650 --> 00:09:59,520
basically this input prompt, which is the classes over here, is encoded into an offline vocabulary.

146
00:09:59,520 --> 00:10:02,370
So when I call the set_classes method,

147
00:10:02,370 --> 00:10:06,840
these classes will be converted into an offline vocabulary, okay.

148
00:10:06,840 --> 00:10:09,450
So we don't need to perform real-time text encoding.

149
00:10:09,450 --> 00:10:12,540
And this basically speeds up the process as well.

150
00:10:12,540 --> 00:10:12,990
Okay.
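The prompt-then-detect idea can be sketched with a toy class. This is an illustration only and not the real YOLO-World implementation: `ToyPromptThenDetect`, its `_encode` helper, and the fake embeddings are all hypothetical. The point is that text encoding happens once, when the classes are set, and not once per frame.

```python
class ToyPromptThenDetect:
    """Toy illustration only; not the real YOLO-World implementation."""

    def __init__(self):
        self.encode_calls = 0
        self.vocabulary = {}

    def _encode(self, name):
        # Hypothetical stand-in for a text encoder such as CLIP
        self.encode_calls += 1
        return [float(ord(ch)) for ch in name]

    def set_classes(self, classes):
        # Encode every class name once, up front (the "offline vocabulary")
        self.vocabulary = {name: self._encode(name) for name in classes}

    def infer(self, frame):
        # Reuse the cached embeddings; no text encoding happens per frame
        return sorted(self.vocabulary)


model = ToyPromptThenDetect()
model.set_classes(["person", "backpack", "bicycle"])
for frame in range(100):
    model.infer(frame)
print(model.encode_calls)  # 3
```

Even after 100 calls to `infer`, the encoder has only run three times, once per class name.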

151
00:10:15,100 --> 00:10:19,330
So now when I run this cell, you can see that something is being downloaded.

152
00:10:19,330 --> 00:10:24,940
So basically, the CLIP model is being downloaded in the background, which converts the

153
00:10:24,940 --> 00:10:27,070
list of classes into embeddings.

154
00:10:27,070 --> 00:10:30,670
So we are just creating embeddings in the background okay.

155
00:10:33,320 --> 00:10:38,900
So now we need to perform object detection on the sample image using this YOLO-World large model, which

156
00:10:38,900 --> 00:10:40,490
we have initialized over here.

157
00:10:40,520 --> 00:10:47,330
Okay, so as I told you at the start, this YOLOWorld class has two core functions.

158
00:10:47,330 --> 00:10:48,440
One is set_classes,

159
00:10:48,440 --> 00:10:49,880
which we have used over here.

160
00:10:49,880 --> 00:10:55,460
set_classes basically converts our input prompt, which is the class names which we have

161
00:10:55,460 --> 00:10:58,100
defined over here, into embeddings.

162
00:10:58,130 --> 00:10:58,670
Okay.

163
00:10:58,670 --> 00:11:01,250
Or you can say we have created an offline vocabulary.

164
00:11:01,250 --> 00:11:05,270
Then we have a second method, or second function, which is infer, okay.

165
00:11:05,270 --> 00:11:07,160
So this is not inference.

166
00:11:07,520 --> 00:11:08,690
This is infer.

167
00:11:09,170 --> 00:11:09,980
So.

168
00:11:12,400 --> 00:11:16,210
So now we just pass the loaded image to our infer method,

169
00:11:16,210 --> 00:11:17,620
the second method, which is infer.

170
00:11:17,620 --> 00:11:22,450
And using this we will be able to do object detection on that input image okay.

171
00:11:23,180 --> 00:11:28,370
So now you can see that I'm just passing the loaded image to the second method, which is infer.

172
00:11:28,370 --> 00:11:35,030
And then I draw the bounding boxes around each of the detected objects using cv2.rectangle.

173
00:11:35,030 --> 00:11:38,450
And then I will be adding the confidence score and label name as well.

174
00:11:38,450 --> 00:11:38,990
Label.

175
00:11:38,990 --> 00:11:41,960
In the label we have the confidence score and the class name.
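As a small sketch of how such a label could be built, the snippet below combines a class name with a rounded confidence score. Rounding up to two decimals with `math.ceil` is an assumption about how the math library is used here, and `make_label` is a hypothetical helper name.

```python
import math


def make_label(class_name, confidence):
    # Round the confidence up to two decimals (an assumed convention)
    rounded = math.ceil(confidence * 100) / 100
    return f"{class_name} {rounded:.2f}"


print(make_label("bicycle", 0.4312))  # bicycle 0.44
```

The resulting string is what would be drawn next to the bounding box.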

176
00:11:42,080 --> 00:11:43,790
So let's do all this.

177
00:11:43,790 --> 00:11:49,010
And to save the output image I'm using the cv2.imwrite method, okay.

178
00:11:49,370 --> 00:11:54,110
So okay, my output image is saved with the name output.jpg.

179
00:11:54,530 --> 00:11:56,570
So you can see we have output.jpg.

180
00:11:56,570 --> 00:11:58,880
So let me display this output image over here.

181
00:12:02,240 --> 00:12:03,980
So now you can see over here.

182
00:12:04,880 --> 00:12:06,920
In the list of classes which I want to detect,

183
00:12:06,920 --> 00:12:13,940
I passed handbag, shoes, bicycle, purse, traffic light, backpack, and person.

184
00:12:13,940 --> 00:12:14,240
Okay.

185
00:12:15,230 --> 00:12:19,070
So it has detected a handbag over here as well.

186
00:12:19,670 --> 00:12:24,770
Okay, well it has shoes as well.

187
00:12:24,800 --> 00:12:26,330
Bicycle as well.

188
00:12:27,470 --> 00:12:29,870
But you can see over here there are two persons.

189
00:12:29,870 --> 00:12:31,310
These persons are not detected.

190
00:12:31,310 --> 00:12:34,640
This traffic light is not detected over here as well.

191
00:12:34,850 --> 00:12:40,580
So by default in the YOLO-World model, the confidence threshold is set to

192
00:12:40,580 --> 00:12:41,600
0.5.

193
00:12:41,600 --> 00:12:47,690
So now you can see that many classes from our prompt were not detected.

194
00:12:47,690 --> 00:12:51,860
Like you can see over here, these shoes were not detected.

195
00:12:51,860 --> 00:12:54,350
And these persons were not detected.

196
00:12:54,350 --> 00:12:58,130
So there are many different classes that were not detected.

197
00:12:58,430 --> 00:12:58,730
Okay.

198
00:12:58,730 --> 00:13:00,680
Like this traffic light is not being detected.

199
00:13:00,680 --> 00:13:02,000
The background traffic light.

200
00:13:02,000 --> 00:13:05,060
So we will just decrease the confidence threshold.

201
00:13:05,060 --> 00:13:09,530
So by default, the inference confidence threshold is set to 0.5.

202
00:13:09,530 --> 00:13:12,920
So we will decrease the confidence threshold to 0.10.

203
00:13:13,190 --> 00:13:17,480
And let's see if all the objects are detected okay.
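The effect of lowering the confidence threshold can be sketched with plain Python. The detections and scores below are made up for illustration, and `filter_by_confidence` is a hypothetical helper, not a function from inference or supervision.

```python
def filter_by_confidence(detections, threshold):
    """Keep detections whose confidence is at or above the threshold."""
    return [d for d in detections if d["confidence"] >= threshold]


# Hypothetical detections; the scores are made up for illustration
detections = [
    {"class_name": "handbag", "confidence": 0.62},
    {"class_name": "shoes", "confidence": 0.30},
    {"class_name": "traffic light", "confidence": 0.21},
    {"class_name": "person", "confidence": 0.12},
]

print(len(filter_by_confidence(detections, 0.5)))  # 1
print(len(filter_by_confidence(detections, 0.1)))  # 4
```

A lower threshold lets through more objects, at the cost of also admitting weaker and possibly duplicate detections.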

204
00:13:17,480 --> 00:13:19,910
So let me just show you the output image now.

205
00:13:22,960 --> 00:13:26,350
So now you can see over here, we have detected these shoes as well.

206
00:13:26,530 --> 00:13:28,630
We have detected this traffic light as well.

207
00:13:28,630 --> 00:13:32,440
And we have detected these persons as well.

208
00:13:32,440 --> 00:13:34,000
So the results are pretty impressive.

209
00:13:34,000 --> 00:13:39,220
But we are just facing one other issue, which is duplicate detection or double detections.

210
00:13:39,220 --> 00:13:44,530
Like, you can see that there is one bicycle, but the bicycle is detected two times, like you can see

211
00:13:44,530 --> 00:13:44,710
here.

212
00:13:44,710 --> 00:13:46,960
Bicycle with confidence score 0.15.

213
00:13:47,620 --> 00:13:51,220
Bicycle with confidence score 0.43.

214
00:13:52,510 --> 00:13:57,760
Okay, so there is a double detection: there is one bicycle, but the bicycle is being detected

215
00:13:58,630 --> 00:13:59,290
two times.

216
00:13:59,290 --> 00:14:03,100
To solve this issue we will be using non-max suppression, okay.

217
00:14:06,350 --> 00:14:09,710
So now you can see over here we have the issue of double detections.

218
00:14:09,710 --> 00:14:16,370
Like for a single object we have two bounding boxes, one with confidence score 0.43 and the other with confidence

219
00:14:16,370 --> 00:14:18,290
score 0.15.

220
00:14:18,290 --> 00:14:24,800
So to remove or eliminate double detections, we will

221
00:14:24,800 --> 00:14:27,140
be using a technique called non-max suppression.

222
00:14:27,140 --> 00:14:33,200
So non-max suppression is a technique which is used to remove redundant or overlapping bounding boxes.

223
00:14:33,200 --> 00:14:39,350
So now you can see that we have two overlapping bounding boxes for this bicycle, one with

224
00:14:39,350 --> 00:14:43,670
confidence score 0.43 and the other with confidence score 0.15.

225
00:14:44,030 --> 00:14:44,360
Okay.

226
00:14:44,540 --> 00:14:49,730
So non-maximum suppression is a technique which is used to remove overlapping bounding boxes,

227
00:14:49,730 --> 00:14:54,590
that is, double bounding boxes for a single object, okay.

228
00:14:54,590 --> 00:15:01,520
So in non-max suppression we have a threshold, whose value varies from 0 to 1.

229
00:15:01,520 --> 00:15:02,060
Okay.

230
00:15:02,060 --> 00:15:08,870
So if we set the threshold value high, this will result in fewer bounding boxes being suppressed,

231
00:15:09,080 --> 00:15:12,260
which may lead to more overlapping bounding boxes.

232
00:15:12,260 --> 00:15:13,040
So.

233
00:15:14,230 --> 00:15:19,870
So in non-max suppression we have a threshold

234
00:15:19,870 --> 00:15:20,140
value.

235
00:15:20,140 --> 00:15:22,240
So the threshold value ranges from 0 to 1.

236
00:15:22,240 --> 00:15:26,110
And if I set the threshold value high, for example 0.8,

237
00:15:26,110 --> 00:15:30,340
this will result in more overlapping bounding boxes.

238
00:15:30,340 --> 00:15:35,620
But suppose we reduce the threshold value, like I have set the threshold value to 0.1 over here.

239
00:15:35,620 --> 00:15:39,310
So you can see over here I have set the threshold value to 0.1.

240
00:15:39,370 --> 00:15:44,830
So a lower threshold value will result in more aggressive suppression, okay.

241
00:15:48,110 --> 00:15:53,090
A lower threshold value will result in fewer but more accurate detections.

242
00:15:53,090 --> 00:15:59,420
So we have just set a low threshold value, which will result in fewer but more

243
00:15:59,420 --> 00:16:00,290
accurate detections.

244
00:16:00,290 --> 00:16:03,110
And we have set the confidence value to 0.10.

245
00:16:03,140 --> 00:16:03,560
Okay.
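To show how the threshold behaves, here is a minimal pure-Python sketch of greedy, class-agnostic non-max suppression. It is not the implementation used by the supervision package; `iou` and `nms` are illustrative helpers, and the box coordinates are made up. Only the two scores, 0.43 and 0.15, come from the bicycle example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if its overlap with every kept box is within the threshold
        if all(iou(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep


# Two hypothetical overlapping "bicycle" boxes (coordinates are made up)
boxes = [(100, 100, 300, 300), (120, 110, 310, 320)]
scores = [0.43, 0.15]

print(nms(boxes, scores, threshold=0.1))  # [0]
print(nms(boxes, scores, threshold=0.8))  # [0, 1]
```

The two boxes overlap with an IoU of about 0.75, so a threshold of 0.1 suppresses the weaker box, while a threshold of 0.8 keeps both.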

246
00:16:04,070 --> 00:16:05,780
So let's run this cell now.

247
00:16:06,470 --> 00:16:08,930
And let's display the output image over here.

248
00:16:12,720 --> 00:16:15,090
So now you can see that this issue is solved.

249
00:16:15,180 --> 00:16:19,200
Like, you can see that here we were facing the issue of double detections.

250
00:16:19,200 --> 00:16:25,290
So now this issue is solved over here, and we are not facing any issue of double detections.

251
00:16:25,290 --> 00:16:25,740
Okay.

252
00:16:26,310 --> 00:16:33,510
So using non-max suppression, by setting the threshold value of non-max

253
00:16:33,510 --> 00:16:38,910
suppression to a lower value, which results in fewer but

254
00:16:38,910 --> 00:16:39,990
more accurate detections,

255
00:16:39,990 --> 00:16:43,170
I was able to eliminate the double detection.

256
00:16:43,170 --> 00:16:43,620
Okay.

257
00:16:44,250 --> 00:16:51,060
Or I was able to remove the overlapping bounding boxes by setting the threshold value to 0.1, that is, to a

258
00:16:51,630 --> 00:16:52,860
lower threshold value.

259
00:16:53,160 --> 00:16:53,940
So now.

260
00:16:55,200 --> 00:16:57,780
I will be doing the processing on the video.

261
00:16:57,780 --> 00:17:01,110
And the class which I want to detect in the video is backpack.

262
00:17:01,110 --> 00:17:03,900
So I only want to detect the backpack.

263
00:17:03,900 --> 00:17:06,120
So the persons will be wearing backpacks.

264
00:17:06,120 --> 00:17:08,130
And I only want to detect the backpack.

265
00:17:08,130 --> 00:17:08,670
Okay.

266
00:17:08,670 --> 00:17:11,640
So let's run this cell.

267
00:17:12,880 --> 00:17:17,200
So now we are just converting this input prompt, in which I have defined the class name, into

268
00:17:17,200 --> 00:17:18,520
encodings over here.

269
00:17:18,520 --> 00:17:23,140
And now I'm just doing the same process over here which I followed above,

270
00:17:23,260 --> 00:17:28,660
but I'm just running a while loop, and we are doing detection on each of the frames one by one.

271
00:17:28,660 --> 00:17:31,180
So now you can see the frame count over here as well.

272
00:17:31,180 --> 00:17:36,970
And you can see over here the detections: we are only detecting backpacks.

273
00:17:36,970 --> 00:17:42,490
So now you can see that in each frame around 3 to 5 backpacks are being detected from the results,

274
00:17:42,490 --> 00:17:43,990
which I can see over here.

275
00:17:47,910 --> 00:17:50,970
So I think in total

276
00:17:50,970 --> 00:17:56,550
this video is split into around 341 frames, and we are doing detection on each of the frames one

277
00:17:56,550 --> 00:17:57,180
by one.

278
00:17:57,180 --> 00:17:58,800
So okay, it's done.

279
00:17:59,010 --> 00:18:01,110
Now let me display the output video over here.

280
00:18:01,110 --> 00:18:05,880
So now you can see that our output video is saved with the name output.avi.

281
00:18:07,110 --> 00:18:11,490
Because I have specified that the output video name will be output.avi.

282
00:18:11,490 --> 00:18:15,180
And we are just uh, like 30 frames over here.

283
00:18:18,270 --> 00:18:25,200
So very soon we will be able to display the output video over here in our Google Colab notebook.

284
00:18:33,560 --> 00:18:35,270
So here is our output video.

285
00:18:35,270 --> 00:18:38,180
So let me download this video first and let's see.

286
00:18:38,810 --> 00:18:41,630
Let me show you the results we are getting.

287
00:18:43,820 --> 00:18:45,710
So here is our output video.

288
00:18:45,710 --> 00:18:51,680
So now you can see that we are able to detect the backpacks, okay, so the results look quite good.

289
00:18:51,680 --> 00:18:54,890
Like we are able to detect the backpacks over here.

290
00:18:54,890 --> 00:19:01,610
So in the same way, if you want to detect the shoes or the persons, you can also do this:

291
00:19:01,610 --> 00:19:08,090
you can simply pass the names of all those objects that you want to detect in this list, and you

292
00:19:08,090 --> 00:19:12,110
will be able to detect the persons, the shoes, and all these things.

293
00:19:12,110 --> 00:19:13,490
So that's all from this tutorial.

294
00:19:13,490 --> 00:19:14,450
Thank you for watching.

295
00:19:14,450 --> 00:19:14,990
Bye bye.