1
00:00:00,000 --> 00:00:01,000
Hi guys.

2
00:00:01,000 --> 00:00:04,000
This lecture presents an overview of the YOLO algorithm.

3
00:00:04,000 --> 00:00:08,000
Let's look at the topics which we will cover in this lecture.

4
00:00:09,000 --> 00:00:11,000
So following are the objectives.

5
00:00:11,000 --> 00:00:17,000
Or you can say that we have divided this complete lecture into these six parts, which are given over

6
00:00:17,000 --> 00:00:18,000
here.

7
00:00:18,000 --> 00:00:20,000
In the first part, we will see what is YOLO.

8
00:00:20,000 --> 00:00:26,000
In the second part, we will see what image classification basically is. In the third part, we will talk

9
00:00:26,000 --> 00:00:28,000
about object localization.

10
00:00:28,000 --> 00:00:32,000
In the fourth part, I will discuss the training of a neural network.

11
00:00:32,000 --> 00:00:40,000
In the fifth part, I will discuss why we need the YOLO algorithm and why we can't use Fast R-CNN, Faster R-CNN,

12
00:00:40,000 --> 00:00:41,000
or Mask R-CNN.

13
00:00:41,000 --> 00:00:47,000
And in the sixth part, we will discuss in detail the YOLO algorithm and how it works.

14
00:00:48,000 --> 00:00:55,000
So YOLO is a state-of-the-art object detection algorithm, and it is so fast that it has become a

15
00:00:55,000 --> 00:01:00,000
standard way of detecting objects in the field of computer vision, and YOLO is becoming popular day

16
00:01:00,000 --> 00:01:01,000
by day.

17
00:01:01,000 --> 00:01:03,000
It's gaining popularity.

18
00:01:03,000 --> 00:01:03,000
YOLO.

19
00:01:04,000 --> 00:01:08,000
YOLO was invented in 2015, and it is 2023 now.

20
00:01:08,000 --> 00:01:12,000
Since then, eight different versions of YOLO have emerged.

21
00:01:12,000 --> 00:01:19,000
So the computer vision community has been continuously working on different algorithms and producing

22
00:01:19,000 --> 00:01:22,000
a better version after every few months.

23
00:01:22,000 --> 00:01:28,000
So it is possible that after a few months we will see YOLOv9, and after a few more months we

24
00:01:28,000 --> 00:01:34,000
can see YOLOv10 as well, because the computer vision community has been continuously working on it and

25
00:01:34,000 --> 00:01:36,000
we are getting better and better results.

26
00:01:36,000 --> 00:01:42,000
For example, with YOLOv8 we are getting a mean average precision of 53.7, which is the highest ever in the

27
00:01:42,000 --> 00:01:43,000
YOLO history.

28
00:01:43,000 --> 00:01:44,000
So.

29
00:01:44,000 --> 00:01:50,000
So YOLO has been continuously gaining popularity, and a lot of work has been done around it.

30
00:01:50,000 --> 00:01:51,000
On YOLO.

31
00:01:51,000 --> 00:01:58,000
So before YOLO, people were using sliding-window object detection.

32
00:01:58,000 --> 00:02:03,000
Then faster approaches were invented, which include the region-based convolutional neural network (R-CNN),

33
00:02:03,000 --> 00:02:08,000
the fast region-based convolutional neural network, the faster region-based convolutional neural network, and

34
00:02:08,000 --> 00:02:10,000
Mask R-CNN.

35
00:02:10,000 --> 00:02:16,000
But each of these algorithms has some drawbacks and some key takeaways.

36
00:02:16,000 --> 00:02:22,000
We will have a detailed look at R-CNN, Fast R-CNN, and Faster R-CNN in the next lecture.

37
00:02:22,000 --> 00:02:27,000
So from there you will see what the drawbacks of each of these algorithms are.

38
00:02:27,000 --> 00:02:30,000
So YOLO was invented in 2015.

39
00:02:30,000 --> 00:02:34,000
YOLO outperforms all the other previous object detection algorithms.

40
00:02:35,000 --> 00:02:42,000
In image classification, we simply try to classify whether we have the desired object in the image

41
00:02:42,000 --> 00:02:43,000
or not.

42
00:02:43,000 --> 00:02:49,000
Like in the case over here, we are just trying to classify whether there is a deer or a person in the

43
00:02:49,000 --> 00:02:49,000
image.

44
00:02:49,000 --> 00:02:53,000
So the question is: is this a deer or a person in the image?

45
00:02:53,000 --> 00:02:58,000
So if we pass this to a neural network, or you can say a convolutional neural network,

46
00:02:58,000 --> 00:03:04,000
the convolutional neural network will output deer equal to one, because we have the deer in the image

47
00:03:04,000 --> 00:03:06,000
and the person will be zero.

48
00:03:06,000 --> 00:03:08,000
In the other case, suppose we have a person over there.

49
00:03:08,000 --> 00:03:14,000
Our neural network output will be person equal to one and deer equal to zero.

50
00:03:14,000 --> 00:03:20,000
So in image classification, our goal is to simply classify whether we have the desired object in the

51
00:03:20,000 --> 00:03:21,000
image or not.

52
00:03:22,000 --> 00:03:29,000
In the case of object localization, we are not only telling what class it is, whether it is a deer or a person.

53
00:03:29,000 --> 00:03:30,000
But we are.

54
00:03:30,000 --> 00:03:35,000
So we are also telling the bounding box, or the position of the object in the image.

55
00:03:35,000 --> 00:03:38,000
Where exactly is that deer in the image?

56
00:03:38,000 --> 00:03:44,000
So image classification will simply tell us whether there is a deer or some other object in the image.

57
00:03:44,000 --> 00:03:46,000
So if there is a deer, it will be one.

58
00:03:46,000 --> 00:03:49,000
Or if there is no deer, the output will be zero.

59
00:03:49,000 --> 00:03:56,000
But in the case of object localization, we are not only telling what class it is, whether it is a deer or some

60
00:03:56,000 --> 00:04:01,000
other object, but we are also telling the bounding box, or the position of the object in the image.

61
00:04:01,000 --> 00:04:04,000
Where exactly is the deer in the image?

62
00:04:04,000 --> 00:04:08,000
Or you can say: what exactly is the position of the deer in the image?

63
00:04:08,000 --> 00:04:12,000
So this is what object localization is.

64
00:04:13,000 --> 00:04:18,000
In the case of object localization, our output will be in a form like this.

65
00:04:18,000 --> 00:04:24,000
For example, if we have a deer in the image, our neural network will output deer equal to one.

66
00:04:24,000 --> 00:04:31,000
And if there is no person in the image, it will output person equal to zero, because we are doing classification

67
00:04:31,000 --> 00:04:33,000
between deer and person.

68
00:04:33,000 --> 00:04:37,000
Is there a deer in the image, or is there a person in the image?

69
00:04:37,000 --> 00:04:43,000
So if we have a deer in the image, our neural network output will be deer equal to one and person will

70
00:04:43,000 --> 00:04:43,000
be zero.

71
00:04:44,000 --> 00:04:49,000
Plus we have the bounding box around the detected object.

72
00:04:49,000 --> 00:04:51,000
Like you can see, we have detected deer in the image.

73
00:04:51,000 --> 00:04:55,000
So we have created a bounding box around the deer.

74
00:04:55,000 --> 00:04:59,000
So in case of image classification, we don't have the bounding box.

75
00:04:59,000 --> 00:05:06,000
But in the case of object localization, we have a bounding box around the detected object, which is the deer

76
00:05:06,000 --> 00:05:07,000
over here.

77
00:05:08,000 --> 00:05:09,000
Neural network.

78
00:05:09,000 --> 00:05:13,000
We have our output of this form for each detected object.

79
00:05:13,000 --> 00:05:17,000
We have a vector of size seven, which you can see over here.

80
00:05:19,000 --> 00:05:23,000
So here, Pc represents the probability of class.

81
00:05:23,000 --> 00:05:28,000
So if there is some object in the image which we want to detect, for example, if we want to detect

82
00:05:28,000 --> 00:05:35,000
a deer or a person and the image contains a deer or a person, it means there is an object which we want to

83
00:05:35,000 --> 00:05:36,000
detect.

84
00:05:36,000 --> 00:05:42,000
So if there exists an object which we want to detect, then the probability of class will be one.

85
00:05:42,000 --> 00:05:48,000
The probability of the class will only be zero if there is no such object in the image which we want

86
00:05:48,000 --> 00:05:48,000
to detect.

87
00:05:48,000 --> 00:05:51,000
For example, the image is blank.

88
00:05:51,000 --> 00:05:56,000
There is no object in the image, so the probability of class is zero or there is some other object

89
00:05:56,000 --> 00:05:58,000
in the image which we don't want to detect.

90
00:05:58,000 --> 00:06:04,000
Then the probability of class will be zero. If there is any such object inside the image which we want

91
00:06:04,000 --> 00:06:07,000
to detect, the probability of class will be one.

92
00:06:08,000 --> 00:06:11,000
The second and third terms are bx and by.

93
00:06:11,000 --> 00:06:12,000
bx and by.

94
00:06:12,000 --> 00:06:17,000
They represent the coordinates of the center, which is indicated in yellow.

95
00:06:17,000 --> 00:06:21,000
So this is the center point of my detected object.

96
00:06:21,000 --> 00:06:29,000
So basically bx and by over here represent the coordinates of the center, which is indicated over

97
00:06:29,000 --> 00:06:31,000
here in the yellow color, like you can see over here.

98
00:06:32,000 --> 00:06:40,000
C1 represents class one. For example, suppose we are doing classification of deer and

99
00:06:40,000 --> 00:06:41,000
person.

100
00:06:41,000 --> 00:06:43,000
So C2 would denote a person.

101
00:06:43,000 --> 00:06:46,000
So C1 represents the class one which is for deer.

102
00:06:46,000 --> 00:06:54,000
C2 represents class two, which is for the person, and bw and bh represent the width and height of

103
00:06:54,000 --> 00:06:55,000
this red box.

104
00:06:55,000 --> 00:06:57,000
Like you can see over here.

105
00:06:57,000 --> 00:07:04,000
Here, this red box's width and height are represented by bw and bh, so this is basically bw.

106
00:07:07,000 --> 00:07:11,000
And this is basically bh over here.

107
00:07:11,000 --> 00:07:15,000
So this is our bh.

108
00:07:16,000 --> 00:07:18,000
And Pc is equal to one.

109
00:07:18,000 --> 00:07:23,000
That is the case if there is any object in the image. If there is no object in the image, Pc is equal

110
00:07:23,000 --> 00:07:24,000
to zero.

111
00:07:24,000 --> 00:07:28,000
And in that case, the rest of the values don't matter.

112
00:07:28,000 --> 00:07:33,000
So if the probability of class is zero, then the other values don't matter, because we don't need these

113
00:07:33,000 --> 00:07:34,000
values.

114
00:07:34,000 --> 00:07:34,000
Okay.

115
00:07:34,000 --> 00:07:36,000
So I hope you got this point.
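As a rough sketch of the seven-number vector just described: the ordering [Pc, bx, by, bw, bh, C1, C2] and the helper name make_label are assumptions for illustration, not something the slides confirm.
```python
def make_label(class_name, box=None, classes=("deer", "person")):
    """Build [Pc, bx, by, bw, bh, C1, C2]; the ordering is an assumption."""
    if class_name is None or box is None:
        # No object of interest: Pc is 0 and the remaining values are ignored.
        return [0.0] * (5 + len(classes))
    bx, by, bw, bh = box
    # One entry per class: 1.0 for the detected class, 0.0 for the others.
    one_hot = [1.0 if c == class_name else 0.0 for c in classes]
    return [1.0, bx, by, bw, bh] + one_hot
deer_label = make_label("deer", (0.5, 0.6, 0.4, 0.7))
empty_label = make_label(None)
```
Here deer_label has Pc and C1 set to one, while empty_label is all zeros because nothing we want to detect is present.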

116
00:07:38,000 --> 00:07:44,000
Now, the training of a neural network: we want to train a neural network to classify the object as well as predict the

117
00:07:44,000 --> 00:07:45,000
bounding box.

118
00:07:45,000 --> 00:07:49,000
A neural network only understands numbers.

119
00:07:49,000 --> 00:07:56,000
So for each image, we will have a corresponding vector of size seven like this. For example, this is

120
00:07:56,000 --> 00:07:59,000
the corresponding vector of size seven for the deer.

121
00:07:59,000 --> 00:08:04,000
This is the corresponding vector of size seven for the person in the image.

122
00:08:04,000 --> 00:08:08,000
So in the same way, we can have tens of thousands of images.

123
00:08:08,000 --> 00:08:13,000
So we need to train a neural network in such a way that we can input a new image.

124
00:08:13,000 --> 00:08:18,000
Then it will give us this particular vector in the output.

125
00:08:19,000 --> 00:08:22,000
So in this way, we want to train a neural network.

126
00:08:23,000 --> 00:08:27,000
I need to train a neural network in such a way.

127
00:08:27,000 --> 00:08:35,000
If I pass a random image at the input, then at the output it will give me a corresponding vector like this.

128
00:08:35,000 --> 00:08:43,000
And plus we will also have an output image with a bounding box around the detected object like this.

129
00:08:43,000 --> 00:08:44,000
You can see.

130
00:08:44,000 --> 00:08:47,000
So this is our aim for the neural network.

131
00:08:47,000 --> 00:08:50,000
Like we need to train a neural network in such a way.

132
00:08:50,000 --> 00:08:58,000
If I pass some random image at the input, then at the output it will give me a vector of size seven

133
00:08:58,000 --> 00:08:58,000
like this.

134
00:08:58,000 --> 00:09:04,000
And using this vector of size seven, I will draw a bounding box around the object, which you

135
00:09:04,000 --> 00:09:06,000
can see over here as well.

136
00:09:09,000 --> 00:09:17,000
So the question you might all be thinking is: if a neural network works so well for object detection,

137
00:09:17,000 --> 00:09:19,000
then why do we need YOLO?

138
00:09:19,000 --> 00:09:27,000
Because this neural network approach works well only for a single object, as you can see here.

139
00:09:27,000 --> 00:09:33,000
For example, if we have a deer or a person in the image, like if we have only a single object, like

140
00:09:33,000 --> 00:09:35,000
there can be a deer or a person in the image.

141
00:09:36,000 --> 00:09:42,000
Okay, so if we have a single object in the image, the neural network will give us good results.

142
00:09:42,000 --> 00:09:49,000
But if we have multiple objects, like if we have a deer as well as a person in the image,

143
00:09:49,000 --> 00:09:52,000
then the neural network will not work well.

144
00:09:52,000 --> 00:09:56,000
It will not be as good as it was for a single object.

145
00:09:56,000 --> 00:09:57,000
So.

146
00:09:58,000 --> 00:10:00,000
In case of multiple objects in an image.

147
00:10:00,000 --> 00:10:02,000
For example, five persons, one dog.

148
00:10:03,000 --> 00:10:08,000
For example, if we have n number of objects, like ten objects in the image, then determining the

149
00:10:08,000 --> 00:10:10,000
size of neural network is very hard.

150
00:10:10,000 --> 00:10:16,000
As I told you, for each object in the image, we have a vector of size seven.

151
00:10:16,000 --> 00:10:21,000
Okay, so if we have multiple objects in the image, the vector size increases.

152
00:10:21,000 --> 00:10:25,000
Like if we have ten objects in an image, then the size of the vector will be 70.

153
00:10:25,000 --> 00:10:29,000
That is, we have 70 different values for one image.

154
00:10:29,000 --> 00:10:32,000
So this becomes computationally very expensive.

155
00:10:32,000 --> 00:10:40,000
So we can conclude that if we have a single object in the image, like a person, dog, or cat, then the neural

156
00:10:40,000 --> 00:10:42,000
network works well for object detection.

157
00:10:42,000 --> 00:10:48,000
But if we have multiple objects in the image, so if we have ten objects in the image, then we have

158
00:10:48,000 --> 00:10:49,000
70 values.

159
00:10:49,000 --> 00:10:50,000
So.

160
00:10:51,000 --> 00:10:53,000
This becomes computationally very expensive.

161
00:10:53,000 --> 00:10:57,000
So we can't use this simple neural network approach for detecting multiple objects.

162
00:10:57,000 --> 00:11:03,000
Plus, if we have hundreds of objects in the image, then determining the size of the neural network

163
00:11:04,000 --> 00:11:05,000
Becomes very hard.

164
00:11:05,000 --> 00:11:09,000
So we need to have something that we can have.

165
00:11:09,000 --> 00:11:11,000
We need to have another approach.

166
00:11:11,000 --> 00:11:13,000
So for this we have YOLO.

167
00:11:18,000 --> 00:11:22,000
The YOLO algorithm divides the image into grid cells.

168
00:11:22,000 --> 00:11:26,000
So, for example, here I am using a 4x4 grid.

169
00:11:26,000 --> 00:11:34,000
As you can see over here, we have four grid cells, one, two, three, four, in the horizontal direction.

170
00:11:34,000 --> 00:11:41,000
And four grid cells, one, two, three, four, vertically as well.

171
00:11:41,000 --> 00:11:47,000
So it is not necessary that you have a 4x4 grid.

172
00:11:47,000 --> 00:11:50,000
We can have a 19x19 grid or a 3x3 grid as well.

173
00:11:51,000 --> 00:11:58,000
So if we focus on this specific grid cell, as we have no object

174
00:11:58,000 --> 00:12:05,000
in this grid cell, the probability of the class will be zero and the rest of the values don't

175
00:12:05,000 --> 00:12:05,000
matter.

176
00:12:05,000 --> 00:12:12,000
So, as we have no object in this grid cell, the probability of the class will be zero and the

177
00:12:12,000 --> 00:12:15,000
rest of the values do not matter to us.

178
00:12:17,000 --> 00:12:22,000
So now we can see that we have a cow and a girl in the image.

179
00:12:22,000 --> 00:12:28,000
We can see that the cow spans two grid cells in width.

180
00:12:28,000 --> 00:12:31,000
The cow spans three grid cells in height.

181
00:12:31,000 --> 00:12:33,000
So as we can see that.

182
00:12:34,000 --> 00:12:37,000
The cow spans multiple grid cells.

183
00:12:37,000 --> 00:12:44,000
So we find the center point of the cow, and the cow belongs to that grid cell, as shown below.

184
00:12:44,000 --> 00:12:48,000
So this is the center point of the cow.

185
00:12:48,000 --> 00:12:51,000
And the cow basically belongs to this grid cell.

186
00:12:51,000 --> 00:12:57,000
So I have taken a snapshot of this specific grid cell where the center point of the cow

187
00:12:57,000 --> 00:12:58,000
is.

188
00:12:58,000 --> 00:13:04,000
So here we have a vector for the cow. Pc represents the probability of class, which will be one, as we have

189
00:13:04,000 --> 00:13:05,000
a cow in the image.

190
00:13:05,000 --> 00:13:07,000
This represents the.

191
00:13:08,000 --> 00:13:12,000
Basically, these represent the center point.

192
00:13:12,000 --> 00:13:18,000
And here we have two, which represents the width, as the cow covers two cells in width.

193
00:13:18,000 --> 00:13:21,000
The cow is covering two cells: one, two.

194
00:13:21,000 --> 00:13:24,000
While in terms of height, the cow is covering three cells.

195
00:13:24,000 --> 00:13:26,000
One, two, three.

196
00:13:26,000 --> 00:13:27,000
Okay.

197
00:13:27,000 --> 00:13:28,000
So we have two, three over here.

198
00:13:28,000 --> 00:13:33,000
And C1 represents class one, which is for the cow.

199
00:13:33,000 --> 00:13:34,000
So we have the cow here.

200
00:13:34,000 --> 00:13:36,000
So the class will be one.

201
00:13:36,000 --> 00:13:44,000
And this grid cell does not contain the center point of the girl, as it only contains the center point of

202
00:13:44,000 --> 00:13:46,000
the cow.

203
00:13:46,000 --> 00:13:52,000
So C2 represents class two, which is for the girl, and it will be zero.

204
00:13:55,000 --> 00:13:59,000
Well, we have divided our image into a 4x4 grid.

205
00:13:59,000 --> 00:14:05,000
So we have a total of 16 grid cells in a single image. As you can see over here, we have divided our image

206
00:14:05,000 --> 00:14:07,000
into 4x4 grid cells.

207
00:14:07,000 --> 00:14:14,000
One, two, three, four, five, six, seven, eight, nine, ten, 11, 12, 13, 14, 15, 16.

208
00:14:14,000 --> 00:14:17,000
So we have a total of 16 grid cells in an image.

209
00:14:17,000 --> 00:14:19,000
So for each grid cell we have a vector.

210
00:14:19,000 --> 00:14:23,000
So we have 16 such vectors and each vector length is seven.

211
00:14:23,000 --> 00:14:32,000
So we have 16 such vectors of size seven. So for each grid cell, that is, for

212
00:14:32,000 --> 00:14:35,000
each of the grid cells in the image, we have a vector like this.

213
00:14:35,000 --> 00:14:41,000
So for each image we'll have 16 such vectors, like we have 16 such vectors in each image.

214
00:14:41,000 --> 00:14:48,000
Similarly, for this image we will have 16 such vectors of this form.

215
00:14:48,000 --> 00:14:55,000
So for each image we have 16 such vectors, and we can have thousands of images for the

216
00:14:55,000 --> 00:14:56,000
training as well.

217
00:14:56,000 --> 00:14:58,000
So in this way we will train our model.
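As a rough sketch of how the 16 training vectors for one image could be built: the function name encode_grid_targets, the vector ordering [Pc, bx, by, bw, bh, C1, C2], and the choice to measure width and height in grid cells are assumptions for illustration.
```python
def encode_grid_targets(objects, grid_size=4, num_classes=2):
    """objects: list of (class_index, cx, cy, w, h), all as fractions of the image."""
    vec_len = 5 + num_classes
    grid = [[[0.0] * vec_len for _ in range(grid_size)] for _ in range(grid_size)]
    for class_index, cx, cy, w, h in objects:
        # The cell that contains the object's center point is responsible for it.
        col = min(int(cx * grid_size), grid_size - 1)
        row = min(int(cy * grid_size), grid_size - 1)
        cell = grid[row][col]
        cell[0] = 1.0                      # Pc: an object is present
        cell[1] = cx * grid_size - col     # bx, relative to the cell
        cell[2] = cy * grid_size - row     # by, relative to the cell
        cell[3] = w * grid_size            # bw, measured in grid cells
        cell[4] = h * grid_size            # bh, measured in grid cells
        cell[5 + class_index] = 1.0        # one-hot class entry
    return grid
# One object of class 0 that is two cells wide and three cells tall.
targets = encode_grid_targets([(0, 0.40, 0.55, 0.50, 0.75)])
```
Only the cell holding the center gets Pc equal to one; the other 15 vectors stay at zero, which matches the idea that their remaining values do not matter.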

218
00:14:58,000 --> 00:15:01,000
Now in the next slide we will look at the prediction part.

219
00:15:05,000 --> 00:15:06,000
So we have trained our model.

220
00:15:06,000 --> 00:15:10,000
Suppose I want to test my model on some sample image.

221
00:15:10,000 --> 00:15:14,000
So this is the sample image which I have passed to test my model.

222
00:15:14,000 --> 00:15:18,000
So this is the sample image which I have passed to test my model.

223
00:15:18,000 --> 00:15:19,000
This is this image.

224
00:15:19,000 --> 00:15:28,000
So in the output I will have 16 such vectors because I have trained my model for 4x4 grid cells where

225
00:15:28,000 --> 00:15:30,000
each grid cell has one vector.

226
00:15:30,000 --> 00:15:34,000
So we will have 16 such vectors in the output as well.

227
00:15:34,000 --> 00:15:40,000
As you can see here, our output image will be divided into 4x4 grid cells and

228
00:15:40,000 --> 00:15:45,000
we will have 16 such vectors of this form, each of size seven.

229
00:15:45,000 --> 00:15:48,000
So each vector size will be seven in the output.

230
00:15:51,000 --> 00:15:54,000
But wait, there can be an issue.

231
00:15:54,000 --> 00:16:01,000
The algorithm might detect multiple bounding boxes for each object. As you can see here, for

232
00:16:01,000 --> 00:16:07,000
the person, we have two bounding boxes, one with the confidence of 0.45 and other with a confidence

233
00:16:07,000 --> 00:16:10,000
of 0.85. For the baby as well,

234
00:16:10,000 --> 00:16:17,000
We have two bounding boxes, one with a confidence of 0.75 and other with a confidence of 0.65.

235
00:16:17,000 --> 00:16:24,000
And for this we have two bounding boxes as well, one with a confidence of 0.95 and other with a confidence

236
00:16:24,000 --> 00:16:26,000
of 0.35.

237
00:16:26,000 --> 00:16:34,000
So if we have multiple bounding boxes for each object, we can take the max for each class. For example, for this

238
00:16:34,000 --> 00:16:38,000
class, we can take the max of these two bounding boxes.

239
00:16:38,000 --> 00:16:43,000
We can take the 0.95 box and remove the other, which has a lower confidence score.

240
00:16:43,000 --> 00:16:46,000
And we can take the max for the baby class.

241
00:16:46,000 --> 00:16:53,000
That is, we can take the 0.75 bounding box and remove the 0.65 bounding box.

242
00:16:53,000 --> 00:16:59,000
And for the person, we can take the 0.85 bounding box and remove the 0.45 bounding box.

243
00:16:59,000 --> 00:17:05,000
So when you face this issue, when you have multiple bounding boxes for each detected object,

244
00:17:06,000 --> 00:17:12,000
you can't simply take the max, the bounding box which has the maximum confidence.

245
00:17:12,000 --> 00:17:16,000
There should be another way, which is IOU.

246
00:17:16,000 --> 00:17:19,000
IOU stands for Intersection Over Union.

247
00:17:19,000 --> 00:17:21,000
Let's go into detail in the next slide.

248
00:17:23,000 --> 00:17:26,000
Let's solve the issue of multiple bounding boxes.

249
00:17:26,000 --> 00:17:31,000
We use an approach called IOU, which means intersection over union.

250
00:17:31,000 --> 00:17:39,000
So if we only take this class, person, over here, if we only take this class's bounding boxes, you can

251
00:17:39,000 --> 00:17:45,000
see we have two bounding boxes for this class, person, for only this one person.

252
00:17:45,000 --> 00:17:53,000
So for class person, we find the overlapping area and to find the overlapping area we use intersection

253
00:17:53,000 --> 00:17:54,000
over union.

254
00:17:54,000 --> 00:18:01,000
If the boxes are overlapping, the value of intersection over union is higher. If they are completely overlapping,

255
00:18:01,000 --> 00:18:04,000
the value of intersection over union will be one.

256
00:18:04,000 --> 00:18:12,000
So, for example, we can discard all the rectangles which have an IOU greater than 0.65 and keep the rectangle

257
00:18:12,000 --> 00:18:15,000
which has the maximum class probability.

258
00:18:15,000 --> 00:18:21,000
The same process will be repeated for other classes as well, which means for baby class and for this

259
00:18:21,000 --> 00:18:22,000
object class as well.

260
00:18:22,000 --> 00:18:28,000
And this is the formula for the IOU, which is intersection area over union area.

261
00:18:28,000 --> 00:18:33,000
And this is how I have illustrated the process: this is our intersection area.

262
00:18:33,000 --> 00:18:39,000
You can say the inner area of the blue lines. And this is our union area.

263
00:18:39,000 --> 00:18:43,000
This whole area is my union area.

264
00:18:43,000 --> 00:18:46,000
Only this area is my intersection area.
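As a minimal sketch of the intersection area over union area formula: the corner format (x1, y1, x2, y2) and the function name iou are assumptions for illustration.
```python
def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2) corner coordinates; this format is an assumption."""
    # Corners of the overlapping rectangle, if there is one.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    # The union counts the shared area only once.
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```
Two identical boxes give a value of one, and two boxes that do not touch give zero.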

265
00:18:48,000 --> 00:18:53,000
We use IOU to solve the issue of multiple bounding boxes.

266
00:18:53,000 --> 00:18:57,000
This approach is also called non-max suppression.

267
00:18:57,000 --> 00:19:02,000
So you can also call this approach non-max suppression.

268
00:19:02,000 --> 00:19:05,000
So, after detecting all objects, we apply

269
00:19:05,000 --> 00:19:10,000
Non-max suppression and we get the unique bounding boxes in the output.
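As a rough sketch of per-class non-max suppression with the 0.65 IOU threshold mentioned earlier: the box format (x1, y1, x2, y2), the detection tuple layout, and the sample coordinates are assumptions for illustration.
```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
def non_max_suppression(detections, iou_threshold=0.65):
    """detections: list of (class_name, confidence, box); returns the boxes to keep."""
    kept = []
    # Visit boxes from the highest confidence to the lowest.
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        # Drop a box if it overlaps an already kept box of the same class too much.
        if all(k[0] != det[0] or iou(k[2], det[2]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept
detections = [
    ("person", 0.45, (12, 12, 52, 92)),
    ("person", 0.85, (10, 10, 50, 90)),
    ("baby", 0.75, (60, 40, 90, 80)),
    ("baby", 0.65, (61, 41, 91, 81)),
]
kept = non_max_suppression(detections)
```
With these sample boxes, only the 0.85 person box and the 0.75 baby box remain, because each lower-confidence box overlaps its partner with an IOU above 0.65.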

270
00:19:17,000 --> 00:19:20,000
We have divided our image into a 4x4 grid.

271
00:19:20,000 --> 00:19:27,000
So we might sometimes face this issue, that one grid cell like this has the centers of two

272
00:19:27,000 --> 00:19:31,000
objects. For example, this grid cell contains the centers of the dog and the girl.

273
00:19:31,000 --> 00:19:37,000
So now, as this grid cell's vector can only represent one class, how can we represent two classes?

274
00:19:37,000 --> 00:19:40,000
So one grid cell has one vector.

275
00:19:40,000 --> 00:19:47,000
So each grid cell has a vector of size seven, and each grid cell can only represent one single class.

276
00:19:47,000 --> 00:19:52,000
We cannot represent two classes in a single vector of size seven.

277
00:19:52,000 --> 00:19:55,000
So if we face this issue, how can we solve it?

278
00:19:55,000 --> 00:19:57,000
Let's look at the next slide.

279
00:20:00,000 --> 00:20:02,000
Suppose one grid cell has the centers of two objects.

280
00:20:02,000 --> 00:20:10,000
So in such a case, instead of having a single vector of size seven, we can have a vector of size 14.

281
00:20:10,000 --> 00:20:18,000
So the first seven numbers over here represent the first class

282
00:20:18,000 --> 00:20:23,000
and the remaining seven numbers over here represent the other class.

283
00:20:23,000 --> 00:20:29,000
So if one grid cell has the centers of two objects, in such a case, instead of having a single vector of

284
00:20:29,000 --> 00:20:32,000
size seven, we can have a vector of size 14.

285
00:20:33,000 --> 00:20:34,000
To solve this issue.

286
00:20:35,000 --> 00:20:37,000
Well, thank you for watching this video.

287
00:20:37,000 --> 00:20:39,000
See you in the next video tutorial.

288
00:20:39,000 --> 00:20:40,000
Till then, goodbye.