1
00:00:00,170 --> 00:00:06,830
Hello, everyone, and welcome to this new and exciting session in which we are going to build our own

2
00:00:06,830 --> 00:00:08,700
YOLO-like model.

3
00:00:08,720 --> 00:00:17,180
So from the paper we had already seen that we have these initial conv layers, which are pre-trained on

4
00:00:17,180 --> 00:00:27,140
the ImageNet dataset such that they could be used to extract very useful features from our input images.

5
00:00:27,620 --> 00:00:37,970
And then these conv layers were followed by two fully connected layers, which were designed in order to

6
00:00:37,970 --> 00:00:40,850
adapt to our problem of object detection.

7
00:00:41,120 --> 00:00:49,590
Now, given that we do not want to train this backbone here from scratch on the ImageNet dataset, we

8
00:00:49,600 --> 00:00:54,170
will use an already pre-trained backbone, which is the ResNet-50.

9
00:00:54,560 --> 00:01:03,150
Again, we have the output dimension defined as number of classes plus five times B. From the paper,

10
00:01:03,180 --> 00:01:04,530
B is given to be two.

11
00:01:04,560 --> 00:01:11,310
We've seen this already. And then five, because we have the probability of there being an object.

12
00:01:11,310 --> 00:01:16,440
And then for the remaining four we have the bounding box.

13
00:01:16,860 --> 00:01:22,340
So we have two of these bounding box predictions.

14
00:01:22,350 --> 00:01:28,430
That's why here we have two times five, that's ten, plus the number of classes, which in our case is equal

15
00:01:28,430 --> 00:01:29,160
to 20.
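
The output-dimension arithmetic above can be written out as a small sketch. The constant names are hypothetical; the values are the ones quoted in the lesson.

```python
# Hypothetical constant names; the values come from the lesson.
NUM_CLASSES = 20   # Pascal VOC classes
B = 2              # bounding boxes predicted per grid cell
SPLIT_SIZE = 7     # the image is divided into SPLIT_SIZE x SPLIT_SIZE cells

# Each box contributes 5 values: 1 object score + 4 box coordinates.
output_dim = NUM_CLASSES + 5 * B

# Total number of values the model predicts for one image.
values_per_image = SPLIT_SIZE * SPLIT_SIZE * output_dim

print(output_dim, values_per_image)
```

So each grid cell carries 30 values, and the final reshape is to seven by seven by 30.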

16
00:01:29,490 --> 00:01:32,930
Now we define this number of filters to be 512.

17
00:01:32,940 --> 00:01:34,500
So that's it.

18
00:01:34,500 --> 00:01:38,650
From here we have our full complete model.

19
00:01:38,670 --> 00:01:40,590
You could take this off.

20
00:01:40,620 --> 00:01:47,550
You see that we have our pre-trained ResNet, which is our backbone or our base model.

21
00:01:47,550 --> 00:01:54,060
And then we follow this up with several conv layers similar to what we have here in the paper.

22
00:01:54,060 --> 00:02:01,440
And then we have the global average pooling, which is what is given here in the paper.

23
00:02:01,620 --> 00:02:09,030
Now one thing we should note about the average pooling is the fact that when we have inputs, let's

24
00:02:09,030 --> 00:02:15,840
say we have this seven by seven by, let's say, five.

25
00:02:15,870 --> 00:02:24,010
So we have one channel, two, three, four and then five.

26
00:02:24,030 --> 00:02:26,940
So let's suppose we have a seven by seven by five input.

27
00:02:26,970 --> 00:02:33,630
Now after going through the global average pooling, what we will have here is the averaging of each

28
00:02:33,630 --> 00:02:39,930
and every value or let's say pixel in each and every channel.

29
00:02:39,930 --> 00:02:44,910
So for this channel, for example, we will have one representative value.

30
00:02:44,910 --> 00:02:48,480
For this channel, we will have one representative value, which is the average.

31
00:02:48,480 --> 00:02:53,730
So for this we will have the average, this will have the average and this would have the average.

32
00:02:53,730 --> 00:02:56,820
So we average all these values here.

33
00:02:57,720 --> 00:03:04,170
And the problem with this is information about object position is lost.

34
00:03:04,170 --> 00:03:10,890
And so instead of using this average pooling, it is preferable to use flattening.
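
To make the point about lost position concrete, here is a minimal sketch in plain Python (no TensorFlow, and the helper names are made up for illustration). Two seven by seven by five feature maps have the same activation at different positions: global average pooling gives identical outputs for both, while flattening keeps them distinct.

```python
def make_map(hot_row, hot_col, size=7, channels=5):
    """Build a size x size x channels map with one activation in channel 0."""
    fmap = [[[0.0] * channels for _ in range(size)] for _ in range(size)]
    fmap[hot_row][hot_col][0] = 1.0
    return fmap

def global_average_pool(fmap):
    """Average every channel over all spatial positions: one value per channel."""
    size, channels = len(fmap), len(fmap[0][0])
    return [
        sum(fmap[r][c][ch] for r in range(size) for c in range(size)) / (size * size)
        for ch in range(channels)
    ]

def flatten(fmap):
    """Keep every value, in row, column, channel order."""
    return [value for row in fmap for cell in row for value in cell]

a = make_map(1, 4)
b = make_map(3, 2)

print(global_average_pool(a) == global_average_pool(b))  # position is averaged away
print(flatten(a) == flatten(b))                          # position is preserved
print(len(global_average_pool(a)), len(flatten(a)))
```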

35
00:03:11,280 --> 00:03:17,760
And so what we'll do is we'll take this off from here and then we will have flatten.

36
00:03:18,150 --> 00:03:18,690
Okay.

37
00:03:18,690 --> 00:03:26,160
So once we have that, just as with the paper you see here, we have the fully connected layer, which

38
00:03:26,160 --> 00:03:30,570
is this dense layer right here, and then this other fully connected layer.

39
00:03:30,570 --> 00:03:33,060
So we have that and then we reshape.

40
00:03:33,810 --> 00:03:35,970
Now, this should actually be the split size.

41
00:03:36,000 --> 00:03:40,560
Let's take all this here, copy and paste.

42
00:03:40,560 --> 00:03:48,720
So split size by split size by output dimension.

43
00:03:48,720 --> 00:03:50,510
So seven by seven by 30.

44
00:03:50,520 --> 00:03:53,250
So this is now our model.

45
00:03:53,700 --> 00:03:56,940
You can see we have a total of 53 million parameters.

46
00:03:56,940 --> 00:04:00,210
30 million trainable and 23 million non-trainable.

47
00:04:00,210 --> 00:04:02,880
That's from our ResNet.

48
00:04:02,910 --> 00:04:05,190
We are now going to get into the YOLO loss.

49
00:04:05,190 --> 00:04:12,960
So here we have this YOLO loss method, which takes in our y true and y pred, where this is our target

50
00:04:12,990 --> 00:04:13,440
y.

51
00:04:13,470 --> 00:04:15,210
And then this is our predictions.

52
00:04:15,240 --> 00:04:22,800
Now the YOLO loss we have here will be an implementation of what we already discussed from the paper.

53
00:04:22,890 --> 00:04:25,480
And so it's made of four different parts.

54
00:04:25,500 --> 00:04:28,500
The first part is for the coordinates.

55
00:04:28,950 --> 00:04:36,360
The next part is for the object, the next is for no object.

56
00:04:37,410 --> 00:04:41,670
And then finally we have the classification.

57
00:04:41,790 --> 00:04:43,850
So we have these four different parts.

58
00:04:43,860 --> 00:04:45,900
Now we are going to start with this first part here.

59
00:04:45,900 --> 00:04:47,310
We're going to start with the object.

60
00:04:47,310 --> 00:04:56,670
So in this case, we shall penalize the model for not detecting an object at a particular cell.

61
00:04:57,240 --> 00:04:59,970
And that's why you see here we have this.

62
00:05:00,020 --> 00:05:08,660
1 obj, which denotes if the object appears in cell i, and this j right here.

63
00:05:08,690 --> 00:05:14,300
Notice that here we don't have a j, and here we do. This j denotes the fact that the j-th bounding box predictor

64
00:05:14,300 --> 00:05:14,990
in cell

65
00:05:15,020 --> 00:05:17,990
i is responsible for that prediction.

66
00:05:18,800 --> 00:05:24,890
Now, if we take a look at this figure right here where we have split this image, which is actually

67
00:05:24,890 --> 00:05:29,510
224 by 224, into seven parts along each side.

68
00:05:29,510 --> 00:05:34,280
So it's basically now a seven by seven grid cell image.

69
00:05:34,280 --> 00:05:39,710
And each and every grid cell, as we had seen, has its own predictions.

70
00:05:39,740 --> 00:05:46,430
Now, if we specifically break down this grid cell right here, so let's pick this one.

71
00:05:46,790 --> 00:05:48,290
As you see, we've picked this.

72
00:05:48,320 --> 00:05:51,950
Now we have two outputs or two possible outputs.

73
00:05:51,980 --> 00:05:53,720
We have the Y.

74
00:05:53,750 --> 00:05:54,680
True.

75
00:05:55,160 --> 00:05:56,510
Y true.

76
00:05:57,530 --> 00:06:01,820
And then we have the Y predicted by the model.

77
00:06:02,300 --> 00:06:06,860
Now, you'll notice that if you count this, it's going to be a total of 13 and this is going to be

78
00:06:06,860 --> 00:06:10,940
a total of 18 as we've seen already.

79
00:06:10,940 --> 00:06:14,810
This first one here represents whether we have an object or not.

80
00:06:14,840 --> 00:06:21,440
The next four, which are all in green, are the position of the object in the image.

81
00:06:21,440 --> 00:06:26,510
And then these other eight specify the class of that object.

82
00:06:27,320 --> 00:06:33,860
Now, before we move on, you should note that when we first recorded this section on the Yolo loss,

83
00:06:33,860 --> 00:06:37,460
we were working with a dataset with just eight classes.

84
00:06:38,000 --> 00:06:45,020
But since we are now dealing with the Pascal VOC dataset with 20 classes, our y true will look instead like

85
00:06:45,020 --> 00:06:45,320
this.

86
00:06:45,320 --> 00:06:52,520
So now we're going to have these additional 12 values here and these additional 12 values here.

87
00:06:52,520 --> 00:07:02,060
So instead of having 13 and 18, as with eight classes, now we are going to have 25 and 30.

88
00:07:02,070 --> 00:07:08,190
So it's like 13 plus 12 and then 18 plus 12.

89
00:07:08,910 --> 00:07:13,410
Nonetheless, this doesn't change much in our YOLO loss method.

90
00:07:13,410 --> 00:07:15,480
So wherever you have

91
00:07:16,390 --> 00:07:22,030
eight classes, you should keep in mind that in our specific case of the Pascal VOC dataset, we're dealing with

92
00:07:22,030 --> 00:07:23,170
20 classes.

93
00:07:23,320 --> 00:07:32,200
Now, we had also seen previously that although we have this y true, which has these 13 different values

94
00:07:32,200 --> 00:07:41,770
for each cell, this y pred actually has 18 values because we have two possibilities or two possible

95
00:07:41,770 --> 00:07:43,490
positions of the object.

96
00:07:43,510 --> 00:07:45,840
We have this first possible position.

97
00:07:46,150 --> 00:07:52,420
Let's call this B1 and then we have this other possible bounding box which we will call B2.

98
00:07:52,450 --> 00:07:58,660
So we had already seen previously that we set B to two because we have two possible bounding boxes.

99
00:07:59,410 --> 00:08:05,380
And we also saw that even though our model or even though our data was designed such that the outputs

100
00:08:05,380 --> 00:08:12,490
were seven by seven by 13, the model outputs seven by seven by 18 values, as you could see here.

101
00:08:13,150 --> 00:08:18,890
And so when we have a loss function like this, where we are given the Y true and the Y pred, where

102
00:08:18,890 --> 00:08:28,400
the Y true is in fact seven by seven by 13, and the Y pred is seven by seven by 18.

103
00:08:28,970 --> 00:08:37,220
Since we want to start with the object part of our loss function, which is this part right here, our

104
00:08:37,220 --> 00:08:46,550
focus should only be on these two grid cells, this grid cell and this other grid cell right here.

105
00:08:46,550 --> 00:08:52,520
So we want to focus only on this grid cell and this because these are the only two grid cells where

106
00:08:52,520 --> 00:08:54,710
we have an object located.

107
00:08:54,740 --> 00:09:04,370
Now, the way we shall select this programmatically is by gathering all those grid cells where the target

108
00:09:04,550 --> 00:09:05,940
is equal to one.

109
00:09:05,960 --> 00:09:13,540
Now, our target here is simply y true, where we've selected only this object score.

110
00:09:13,550 --> 00:09:20,420
So if you look at y true here and if we pick this object score that coincides with this one we have

111
00:09:20,420 --> 00:09:20,840
here.

112
00:09:20,840 --> 00:09:25,160
So in the case where we have an object, we will obviously have a one for the Y.

113
00:09:25,190 --> 00:09:25,850
True.

114
00:09:25,850 --> 00:09:35,720
And so when we say we are going to pick all the different cells of y pred where we have a one at

115
00:09:35,720 --> 00:09:39,170
this y true level, that is, for this first value of y

116
00:09:39,200 --> 00:09:39,620
true,

117
00:09:39,650 --> 00:09:49,670
it means that we will now have y pred, which is going to be all these different cells where we have

118
00:09:50,540 --> 00:09:56,450
actual objects. And then we're also going to do the same for y target, or y

119
00:09:56,480 --> 00:09:56,890
true.

120
00:09:56,900 --> 00:10:04,550
So here we have y pred extract and then we have y target extract, which is simply getting all

121
00:10:04,550 --> 00:10:07,460
the positions where we have the objects.

122
00:10:07,460 --> 00:10:09,440
But this time around for y

123
00:10:09,470 --> 00:10:09,980
true.

124
00:10:09,980 --> 00:10:13,880
So now we have y pred where there are objects and we have y true

125
00:10:13,910 --> 00:10:15,290
where there are objects.

126
00:10:15,290 --> 00:10:22,570
And with this we could now focus on only these two cells, this cell and this cell.
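
The selection step can be sketched in plain Python. This is only an illustration under simplifying assumptions: there is no batch dimension, the grids are nested lists rather than tensors, and the names mirror the spoken ones (y pred extract, y target extract) without being the lesson's actual code.

```python
S = 7             # grid size
TRUE_DEPTH = 13   # 1 object score + 4 box values + 8 classes (8-class example)
PRED_DEPTH = 18   # 2 boxes * 5 values + 8 classes

# y_true[row][col] holds 13 values; index 0 is the object score.
y_true = [[[0.0] * TRUE_DEPTH for _ in range(S)] for _ in range(S)]
# Stand-in predictions: every cell gets the same placeholder values.
y_pred = [[[0.5] * PRED_DEPTH for _ in range(S)] for _ in range(S)]

# Mark the two cells that contain an object.
for row, col in [(1, 4), (3, 2)]:
    y_true[row][col][0] = 1.0

# Positions where the target object score equals one.
object_cells = [
    (r, c) for r in range(S) for c in range(S) if y_true[r][c][0] == 1.0
]

# Keep only the predictions and targets at those positions.
y_pred_extract = [y_pred[r][c] for r, c in object_cells]
y_target_extract = [y_true[r][c] for r, c in object_cells]

print(object_cells)
print(len(y_pred_extract), len(y_pred_extract[0]))
print(len(y_target_extract), len(y_target_extract[0]))
```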

127
00:10:22,580 --> 00:10:27,920
Now let's take this off and run this so that you could see what this produces.

128
00:10:27,920 --> 00:10:29,330
So we run that.

129
00:10:29,930 --> 00:10:31,970
Let's print this out.

130
00:10:31,970 --> 00:10:42,650
So let's print out y pred extract and let's print out y target extract.

131
00:10:42,680 --> 00:10:45,560
Well, we could also print out target, so.

132
00:10:45,560 --> 00:10:48,170
Well, let's just print out this where right here.

133
00:10:48,170 --> 00:10:49,520
So let's print.

134
00:10:49,790 --> 00:10:53,870
tf... let's copy this.

135
00:10:54,080 --> 00:10:55,850
We'll get this.

136
00:10:56,570 --> 00:10:58,880
So let's paste this out here.

137
00:10:58,910 --> 00:10:59,810
There we go.

138
00:10:59,810 --> 00:11:01,550
And that should be fine.

139
00:11:01,550 --> 00:11:03,170
So let's run this.

140
00:11:03,170 --> 00:11:06,230
And then now we will test with some inputs.

141
00:11:06,260 --> 00:11:15,260
Now the inputs we are going to be using will simply be the exact same coordinates we got from the dataset.

142
00:11:15,260 --> 00:11:24,430
So here you have this and then you have this, where this here corresponds to this car,

143
00:11:24,470 --> 00:11:25,730
this vehicle right here.

144
00:11:25,730 --> 00:11:29,330
And then this one corresponds to this other vehicle.

145
00:11:29,330 --> 00:11:36,890
Now, we had seen already how we could use the generate output method to produce our y true, or our dataset

146
00:11:36,890 --> 00:11:39,590
value for y, or the output.

147
00:11:39,590 --> 00:11:45,290
So once we have y true, we are going to add an extra dimension.

148
00:11:45,290 --> 00:11:46,970
Now we have y true already.

149
00:11:46,970 --> 00:11:50,990
Then for y pred we will just generate these random values.

150
00:11:50,990 --> 00:11:58,850
And then what we'll do now is for these specific values, or for these specific cells, that's one, four

151
00:11:58,850 --> 00:11:59,750
and three, two,

152
00:11:59,780 --> 00:12:02,810
we are going to put in our own values.

153
00:12:03,230 --> 00:12:08,120
You'll notice that here we have 18 different values where here we have the probability of having an

154
00:12:08,120 --> 00:12:08,750
object.

155
00:12:08,780 --> 00:12:11,750
We have the position one, two, three, four.

156
00:12:11,870 --> 00:12:12,890
There we go.

157
00:12:12,920 --> 00:12:15,290
We have the position, the probability of having an object.

158
00:12:15,400 --> 00:12:16,450
We have a position.

159
00:12:16,450 --> 00:12:18,460
One, two, three, four.

160
00:12:18,490 --> 00:12:20,800
This is one, two, three, four.

161
00:12:20,830 --> 00:12:21,280
Here.

162
00:12:21,280 --> 00:12:29,680
And then for the rest here we have the class, or the different class probabilities.

163
00:12:29,710 --> 00:12:31,780
Now, we could do the same for this other one.

164
00:12:31,780 --> 00:12:36,190
We have this and then we have one, two, three, four.

165
00:12:36,220 --> 00:12:37,840
Okay, so there we go.

166
00:12:37,840 --> 00:12:40,540
So this is what we have as our y pred.

167
00:12:40,540 --> 00:12:45,100
So we're supposing that the model predicts this, and then this is our y true.

168
00:12:45,110 --> 00:12:52,330
Remember we had already seen that this y true here would produce our dataset outputs.

169
00:12:52,330 --> 00:12:57,970
So now let's run this, let's run this and then run this and then see what we get.

170
00:12:58,090 --> 00:12:59,260
Okay, so there we go.

171
00:12:59,260 --> 00:13:08,230
You can see from here we have all the different positions where the target value is equal to

172
00:13:08,230 --> 00:13:08,650
one.

173
00:13:08,650 --> 00:13:10,240
So that's what we printed out here.

174
00:13:10,240 --> 00:13:16,700
And you could see clearly that we have one, four and three, two.

175
00:13:17,180 --> 00:13:21,500
Now the zero here is simply because this is the first batch value.

176
00:13:21,500 --> 00:13:23,120
Now that's it.

177
00:13:23,150 --> 00:13:25,790
We have this one, four and three, two. For the next,

178
00:13:25,790 --> 00:13:27,710
we have y pred extract.

179
00:13:27,710 --> 00:13:34,250
So we want to extract only those values that fall in these grid cells.

180
00:13:34,280 --> 00:13:42,110
Now you will notice one thing, that you have 0.9, 0.47, 0.31, which coincides exactly with this.

181
00:13:42,320 --> 00:13:47,750
And then we have this other 1, 0.3, 0.01, which coincides exactly with this.

182
00:13:47,750 --> 00:13:58,400
And so clearly we are only focusing on the model's outputs where we have actual objects from our data

183
00:13:58,400 --> 00:13:58,820
set.

184
00:13:58,820 --> 00:14:07,550
So we've picked this here and we've picked this, and then now we're ready to compare what the model

185
00:14:07,550 --> 00:14:12,890
produces at this position and what was expected, which is this Y true.

186
00:14:12,890 --> 00:14:15,950
Now, so this is y pred and then this is y true.

187
00:14:15,980 --> 00:14:19,190
We're comparing only at these two cells.

188
00:14:19,280 --> 00:14:20,750
See here we have two.

189
00:14:20,780 --> 00:14:21,530
Two.

190
00:14:21,650 --> 00:14:24,770
Now, what if we suppose that we have only one object?

191
00:14:24,770 --> 00:14:27,860
So if we have only one object, what we'll do is we'll take this one off.

192
00:14:27,860 --> 00:14:29,810
Let's run this again.

193
00:14:30,080 --> 00:14:33,710
And you'll see now that we only have one which is picked.

194
00:14:33,710 --> 00:14:37,400
So let's get back and run that.

195
00:14:37,640 --> 00:14:38,570
And there we go.

196
00:14:38,570 --> 00:14:40,610
So you see, we've picked this.

197
00:14:41,030 --> 00:14:48,860
We've picked this cell and this other cell, and we're ready now, as is described in the paper,

198
00:14:48,860 --> 00:14:53,390
to compare the different probability scores with one another.

199
00:14:53,390 --> 00:15:01,040
So we just simply have to subtract this, and that will be good for this part of our loss.

200
00:15:01,070 --> 00:15:07,010
Now, if we only had a single bounding box prediction like here, if we didn't have this.

201
00:15:07,010 --> 00:15:10,280
So let's suppose we had only this and the Y true.

202
00:15:10,310 --> 00:15:11,060
Only this.

203
00:15:11,060 --> 00:15:17,360
Then what we'll do is we'll take this one minus whatever value we have here, and then we'll subtract.

204
00:15:17,360 --> 00:15:23,030
So if y pred is one and y true is one, we'll just have one minus one, we'll subtract that and then

205
00:15:23,030 --> 00:15:27,320
we square it exactly as we have in the paper right here.

206
00:15:27,320 --> 00:15:34,250
You see, we have the object score minus the target object score, and then that squared.

207
00:15:34,250 --> 00:15:40,250
And from here we add this to all the other positions, or all the other objects at different grid

208
00:15:40,250 --> 00:15:40,640
cells.

209
00:15:40,640 --> 00:15:42,380
That's why we have this summation.

210
00:15:42,380 --> 00:15:43,850
Anyway, we've seen this already.

211
00:15:43,850 --> 00:15:50,540
So, getting back here, now that we actually have two bounding box predictions, as you could see,

212
00:15:50,540 --> 00:15:52,280
we have B1 and B2.

213
00:15:52,310 --> 00:15:56,240
What we'll do is we'll take whatever value we have here.

214
00:15:56,240 --> 00:16:01,250
And then for this, let's wipe this off and wipe this off.

215
00:16:01,250 --> 00:16:06,800
Let's say this is a dash, this is a dash and then this is a dash or some value.

216
00:16:06,800 --> 00:16:11,510
Let's call this lambda and we'll call this lambda.

217
00:16:11,540 --> 00:16:16,250
What we'll do is we'll take this one minus

218
00:16:17,220 --> 00:16:19,980
one of these, that's either lambda one...

219
00:16:19,980 --> 00:16:23,910
Let's say this is lambda one and this is lambda two.

220
00:16:24,150 --> 00:16:31,970
So we'll have one minus lambda one or we'll have one minus lambda two.

221
00:16:31,980 --> 00:16:38,910
And the way we shall pick between lambda one and lambda two is by looking at these coordinates right

222
00:16:38,910 --> 00:16:40,320
here, these positions.

223
00:16:40,860 --> 00:16:49,500
So if this bounding box we have here is closer to this other bounding box, that's the true bounding box,

224
00:16:49,500 --> 00:16:51,780
then we'll pick lambda one.

225
00:16:51,780 --> 00:16:58,830
But if lambda two's bounding box, that's this one, is closer to this true bounding box, then we'll pick

226
00:16:58,830 --> 00:16:59,730
lambda two.

227
00:16:59,760 --> 00:17:06,780
So that's how we do the picking and the way we would compare this bounding box with this and then this

228
00:17:06,780 --> 00:17:11,160
other bounding box with this is by making use of the IOU scores.

229
00:17:11,160 --> 00:17:18,220
So let's take this off and suppose that we have this input image.

230
00:17:18,220 --> 00:17:23,950
Let's take all this off too. We have this input image and then we have this bounding box.

231
00:17:23,950 --> 00:17:26,920
So let's say this is the true bounding box here.

232
00:17:27,460 --> 00:17:29,530
This is our true bounding box.

233
00:17:30,040 --> 00:17:34,840
And then we have this bounding box B one, which is something like this.

234
00:17:34,840 --> 00:17:38,740
So let's say here we have a lambda one and we have these coordinates.

235
00:17:38,740 --> 00:17:40,390
So we produce something like this.

236
00:17:40,840 --> 00:17:42,040
Let's change this color.

237
00:17:42,520 --> 00:17:46,990
Let's say we have something like this.

238
00:17:46,990 --> 00:17:49,660
So this is what we're getting from this one here.

239
00:17:49,660 --> 00:17:53,860
And then we have some other bounding box where we have maybe say something like this.

240
00:17:55,020 --> 00:18:00,450
Now in this case, because this here... let's get back.

241
00:18:00,450 --> 00:18:03,450
So we differentiate between the two bounding boxes.

242
00:18:04,140 --> 00:18:08,340
Let's redraw this here and then we suppose we had this.

243
00:18:08,760 --> 00:18:09,300
Okay.

244
00:18:09,300 --> 00:18:16,020
So because this green bounding box is closer to the red bounding box, we are going to pick Lambda one

245
00:18:16,020 --> 00:18:19,950
and we are simply going to take this into consideration.

246
00:18:19,950 --> 00:18:27,780
We're not going to take B2 into consideration when computing the objectness part of our loss.
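
As a rough sketch of this choice, assume the IOU of each predicted box with the true box has already been computed. The function name and the IOU numbers below are hypothetical; 0.9 and 0.8 stand for the two predicted object scores (lambda one and lambda two) in one cell.

```python
def object_loss_for_cell(target_score, pred_scores, ious):
    """Return the responsible predictor index and its squared object-score error.

    The responsible predictor is the one whose box has the highest IOU
    with the true box; the other predictor is ignored for this term.
    """
    responsible = max(range(len(ious)), key=lambda k: ious[k])
    loss = (target_score - pred_scores[responsible]) ** 2
    return responsible, loss

# B1 overlaps the true box more, so lambda one (0.9) is used.
first_choice, first_loss = object_loss_for_cell(1.0, [0.9, 0.8], [0.62, 0.15])

# B2 overlaps the true box more, so lambda two (0.8) is used.
second_choice, second_loss = object_loss_for_cell(1.0, [0.9, 0.8], [0.10, 0.70])

print(first_choice, round(first_loss, 4))
print(second_choice, round(second_loss, 4))
```

Summing this term over every cell that contains an object gives the object part of the loss.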

247
00:18:28,350 --> 00:18:36,720
And so getting back to the code, at this point, we have to subtract 0.9

248
00:18:37,860 --> 00:18:44,400
from the one that we have here, or we would subtract...

249
00:18:44,580 --> 00:18:51,660
this is one, two, three, four... 0.8 and one.

250
00:18:51,660 --> 00:18:59,380
So we actually have this or this, and the choice between this and this will depend on these

251
00:18:59,380 --> 00:19:02,740
coordinates and these coordinates.

252
00:19:03,250 --> 00:19:12,520
But then our IOU method is designed such that it actually takes in pixel values.

253
00:19:12,520 --> 00:19:22,060
So what that means is we could design our IOU method, which will take the center and the width and

254
00:19:22,060 --> 00:19:24,970
the height of these two boxes.

255
00:19:25,150 --> 00:19:28,720
Let's say we have this box and this other box, or this rectangle.

256
00:19:28,720 --> 00:19:35,800
And so for this rectangle we take its center, its width, its height, and we take this other one's

257
00:19:35,800 --> 00:19:38,320
center, its width, its height.

258
00:19:38,320 --> 00:19:41,770
And based on that, we compute the IOU score.
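
A minimal sketch of such an IOU computation, assuming both boxes are given in pixels as (x center, y center, width, height). The function name is hypothetical and this is not the lesson's own implementation.

```python
def iou_center_format(box_a, box_b):
    """IOU of two boxes given as (x_center, y_center, width, height) in pixels."""
    # Convert each box from center format to corner coordinates.
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0

same = iou_center_format((46, 140, 20, 20), (46, 140, 20, 20))
apart = iou_center_format((20, 20, 10, 10), (200, 200, 10, 10))
partial = iou_center_format((10, 10, 20, 20), (20, 10, 20, 20))
print(same, apart, partial)
```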

259
00:19:42,730 --> 00:19:48,700
But the problem we have here is that what we get is not the pixel values.

260
00:19:48,730 --> 00:19:52,180
What we have here are these

261
00:19:52,180 --> 00:19:53,650
pre-processed

262
00:19:54,540 --> 00:19:56,660
values, which we had seen already.

263
00:19:56,670 --> 00:20:00,870
And so if we have the center here, you could read that from here.

264
00:20:00,900 --> 00:20:07,560
If we had the center, which is at position, let's say 43, that's 43.

265
00:20:07,560 --> 00:20:09,210
Well, let's fix this to this.

266
00:20:09,240 --> 00:20:12,000
We have 46, 140.

267
00:20:12,000 --> 00:20:16,470
So we have this center, which is at around 46, 140.

268
00:20:16,500 --> 00:20:20,040
Now we have well, 46.

269
00:20:20,970 --> 00:20:22,080
Take this off.

270
00:20:22,110 --> 00:20:24,630
We have 46

271
00:20:25,720 --> 00:20:28,290
140.

272
00:20:28,800 --> 00:20:41,070
If we focus on only this first coordinate here, then we would have 46 modulo 32, which will give us

273
00:20:41,070 --> 00:20:43,830
12 or rather 14.

274
00:20:43,860 --> 00:20:44,780
Let's get back.

275
00:20:44,790 --> 00:20:46,350
This will give us 14.

276
00:20:46,350 --> 00:20:51,570
And then now this 14 here will move to the next step.

277
00:20:51,570 --> 00:21:00,360
14 divided by 32 will give us about 0.44.

278
00:21:00,370 --> 00:21:03,160
So we would have 0.44.

279
00:21:03,760 --> 00:21:08,520
And this is the fraction occupied in this grid cell.

280
00:21:08,530 --> 00:21:17,410
Remember, to obtain this coordinate, or to obtain this value from the original center value of x,

281
00:21:17,440 --> 00:21:27,910
we would have to make sure that we get this position of the center with respect to this grid cell's top-left corner,

282
00:21:27,910 --> 00:21:29,020
which is right here.

283
00:21:29,020 --> 00:21:33,730
So this distance from here to here is about 0.44, as we've just calculated.

284
00:21:33,730 --> 00:21:36,280
And that's exactly what YOLO takes in.

285
00:21:36,790 --> 00:21:45,250
But since now we need to get back to these original values from the values we use to train, because

286
00:21:45,250 --> 00:21:52,020
we need to be able to compute the IOU scores between a given box and some other box.

287
00:21:52,030 --> 00:21:56,410
Then what we'll do is now reverse this process.

288
00:21:56,410 --> 00:22:04,720
So we are going to go from what we get here, like for example, this value back to this value.

289
00:22:04,720 --> 00:22:05,800
46.

290
00:22:06,280 --> 00:22:13,030
Now if you're wondering why we pick 32 right here, you should note that this image is 224 by 224.

291
00:22:13,030 --> 00:22:16,840
And if we divide 224 by seven, you would have 32.

292
00:22:16,870 --> 00:22:20,050
So for each cell, this distance from here to here is 32.

293
00:22:20,050 --> 00:22:22,390
From here to here is 32, and so on and so forth.

294
00:22:22,390 --> 00:22:30,370
So if from here to this point here is 44 or rather 46, then if you want to get this remaining distance,

295
00:22:30,370 --> 00:22:34,780
you just need to do 46 modulo 32 to have this remaining distance in here.

296
00:22:34,780 --> 00:22:45,160
And then you divide this distance by 32 to obtain the fraction occupied by this distance in this

297
00:22:45,160 --> 00:22:47,710
full grid cell.

298
00:22:48,310 --> 00:22:53,650
So with that, we are going to also repeat the same process for this y coordinate.

299
00:22:53,650 --> 00:22:57,460
So this is the center X and center Y.

300
00:22:57,490 --> 00:23:00,610
We will take 140 modulo 32.

301
00:23:00,610 --> 00:23:02,710
We obtain this value, divide it by 32.

302
00:23:02,740 --> 00:23:07,720
We now get this fraction occupied from here to the center.

303
00:23:08,020 --> 00:23:09,280
We've gotten this already.

304
00:23:09,280 --> 00:23:11,160
This is 0.44.

305
00:23:11,410 --> 00:23:13,120
We could also get this Y.

306
00:23:13,120 --> 00:23:17,160
And so that's how we obtain this fraction as we have here.
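
The encoding of a center coordinate can be sketched as follows. The function and constant names are hypothetical; the numbers (224, 7, 46, 140) are the ones used in the lesson.

```python
IMAGE_SIZE = 224
SPLIT_SIZE = 7
CELL_SIZE = IMAGE_SIZE // SPLIT_SIZE  # 224 / 7 = 32 pixels per grid cell

def encode_center(coordinate):
    """Split a pixel coordinate into (grid cell index, fraction inside that cell)."""
    cell_index = int(coordinate // CELL_SIZE)
    remaining = coordinate % CELL_SIZE   # distance past the start of the cell
    fraction = remaining / CELL_SIZE     # that distance as a share of the cell
    return cell_index, fraction

x_cell, x_fraction = encode_center(46)    # 46 % 32 = 14, 14 / 32 = 0.4375
y_cell, y_fraction = encode_center(140)   # 140 % 32 = 12, 12 / 32 = 0.375
print(x_cell, x_fraction)
print(y_cell, y_fraction)
```

The cell indices (1 and 4) are where the object is stored in the grid, and the fractions (about 0.44 and 0.375) are the values stored inside that cell.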

307
00:23:17,170 --> 00:23:23,950
Anyway, we've seen this already, but now what we're trying to do is get back from this to a value

308
00:23:23,950 --> 00:23:27,970
like 46 or from this to a value like 140.

309
00:23:28,210 --> 00:23:33,610
Now, it should be noted that for the width and the height it's going to be quite easy because, remember,

310
00:23:33,610 --> 00:23:40,390
when designing this or when obtaining this, what we had simply done was we took the distance or we

311
00:23:40,390 --> 00:23:45,580
took the width of this and divided it by the image width.

312
00:23:45,580 --> 00:23:52,600
So if we take this and divide by the image width, then to obtain the original width, all we need to

313
00:23:52,600 --> 00:23:55,350
do is multiply by the image width.

314
00:23:55,360 --> 00:24:01,240
So if we had originally, let's say, let's say this width was originally 20, for example, then to

315
00:24:01,240 --> 00:24:10,540
obtain what we have here, this width and this height, we take 20 divided by 224, we get some value,

316
00:24:10,540 --> 00:24:15,040
let's say approximately this is like one over 11, something like that.

317
00:24:15,040 --> 00:24:20,110
So it's like one over 11, or let's say approximately 0.09.

318
00:24:20,320 --> 00:24:20,590
Okay.

319
00:24:20,590 --> 00:24:28,540
So if we have 0.09, then now to obtain 20 from this, we just take 0.09 times 224.

320
00:24:28,540 --> 00:24:30,340
So that will give us back approximately 20.

321
00:24:30,350 --> 00:24:31,600
So that's how we reverse this.

322
00:24:31,630 --> 00:24:37,630
It's easier compared to this, where we need to go through all these two steps before getting back to these

323
00:24:37,630 --> 00:24:38,650
original values.
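
For the width and height, the round trip is a single division and a single multiplication. A minimal sketch with hypothetical function names, using a width of 20 pixels on a 224-pixel image; note that 20 / 224 is roughly 0.089.

```python
IMAGE_WIDTH = 224

def encode_size(pixels):
    """Express a width (or height) as a fraction of the image size."""
    return pixels / IMAGE_WIDTH

def decode_size(fraction):
    """Recover the pixel size from the stored fraction."""
    return fraction * IMAGE_WIDTH

encoded_width = encode_size(20)
decoded_width = decode_size(encoded_width)
print(round(encoded_width, 3), round(decoded_width, 3))
```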

324
00:24:38,650 --> 00:24:42,730
And again, the reason why we want to get these original values is simply because we want to be able to

325
00:24:42,970 --> 00:24:50,590
compute the IOU between a given bounding box and these two bounding boxes here so we can make the choice

326
00:24:50,590 --> 00:24:54,010
of whether to use one minus lambda one or one minus

327
00:24:54,140 --> 00:24:54,860
lambda two.

328
00:24:54,890 --> 00:24:59,600
Now let's dive into how we will move from 0.44 and

329
00:25:00,200 --> 00:25:02,330
0.375...

330
00:25:02,360 --> 00:25:11,510
it's actually 0.375, to 46 and 140.

331
00:25:12,110 --> 00:25:18,110
So the first thing we'll do is we'll simply multiply this one, four by 32.

332
00:25:18,140 --> 00:25:23,540
Now one will become 32 and four will become 128.

333
00:25:23,550 --> 00:25:28,520
And this will take us from this origin right up to this position.

334
00:25:28,520 --> 00:25:35,180
So this is where we will be, at this point here. At this point, from here to here is

335
00:25:35,180 --> 00:25:35,690
32.

336
00:25:35,720 --> 00:25:39,170
We've already seen that each grid cell here has a size of 32.

337
00:25:39,170 --> 00:25:41,270
So from here to here is 32.

338
00:25:41,270 --> 00:25:50,080
From here to here is 128 because we have 32, 64, 96 and then 128.

339
00:25:50,090 --> 00:25:55,920
So you see we're already getting close to the, um, center of our object.

340
00:25:55,920 --> 00:25:59,550
And so that's why we have this scalar right here.

341
00:25:59,550 --> 00:26:08,520
We have this, um, re-scalar, which is obtained by multiplying the positions where the target equals one

342
00:26:08,520 --> 00:26:10,200
by 32.

343
00:26:10,320 --> 00:26:15,960
So all the positions where the target equals one, we multiply by 32.

344
00:26:15,960 --> 00:26:24,150
And so now let's run this, and you see that we have 32, 128, and for the other one we have 96, 64.

345
00:26:24,150 --> 00:26:34,230
So for this other object here, for this object, we have, um, one, two, three, that's 96 and then

346
00:26:34,230 --> 00:26:36,420
64 is one two.

347
00:26:36,450 --> 00:26:40,500
So we have this position; what's left now is to get to the center.

348
00:26:40,500 --> 00:26:44,490
The center should be around here anyway.

349
00:26:44,490 --> 00:26:45,870
We've taken this first step.

350
00:26:45,870 --> 00:26:51,720
Now the next thing we want to do is get rid of this batch here, this batch dimension.

351
00:26:51,720 --> 00:26:52,710
We don't need that.

352
00:26:52,710 --> 00:27:01,320
So we'll just have 32, 128, 96, 64, and then we'll add two zeros for the width and the height.

353
00:27:01,320 --> 00:27:06,780
So here is width and height while this is X center and y center.

354
00:27:06,780 --> 00:27:11,400
Now to do this, we are going to simply take off the batch.

355
00:27:11,400 --> 00:27:13,110
So you see, we start from one.

356
00:27:13,110 --> 00:27:14,400
We take off the batch dimension.

357
00:27:14,400 --> 00:27:16,320
We do not consider the zeroth index.

358
00:27:16,320 --> 00:27:19,900
We take that off and then we add two zeros.

359
00:27:19,950 --> 00:27:22,680
So here we just add these two zeros.

360
00:27:22,680 --> 00:27:28,260
Now the number of rows we will have will depend on the length of our scalar.

361
00:27:28,260 --> 00:27:32,550
If our scalar has a length of two, as in this case, then we'll have a two by two,

362
00:27:32,790 --> 00:27:39,480
um, a two by two matrix which we are going to concatenate with these rescaled outputs.

363
00:27:39,480 --> 00:27:43,830
So essentially, we had this here; we had this 32.

364
00:27:43,860 --> 00:27:50,820
Let's rewrite that when we take only this first index or from the first index to the end, we have 32

365
00:27:50,820 --> 00:28:02,430
128 and then we have um, 96, 64 and then we concatenate that with a matrix which is two by two where

366
00:28:02,430 --> 00:28:04,200
we fill all those with zeros.

367
00:28:04,470 --> 00:28:14,370
And so now we have this here, which represents x, y, and the width and height.

368
00:28:14,370 --> 00:28:21,340
So the next thing we'll do is we are going to take these target coordinates, which we take from 1 to

369
00:28:21,340 --> 00:28:21,630
5.

370
00:28:21,630 --> 00:28:30,960
So it's essentially X, Y, W and H, and then we'll multiply by 32, 32, 224, 224.

371
00:28:30,990 --> 00:28:34,290
Now let's explain why we choose to multiply by those.

372
00:28:34,290 --> 00:28:41,970
So let's suppose that you have, for example, 0.44, you know that this is 0.44 right here and then

373
00:28:41,970 --> 00:28:52,980
you have 0.375 for Y for the center now taking 0.44 and multiplying by 32 will give you this distance

374
00:28:52,980 --> 00:28:55,830
from here to here in pixels.

375
00:28:55,830 --> 00:29:00,450
So if you have 0.44 times 32, you should have about 14.

376
00:29:00,480 --> 00:29:06,000
Now remember that to have 0.44, you had 14 divided by 32.

377
00:29:06,000 --> 00:29:15,690
And so now to get back 14, you simply have to do, um, 0.44 times 32.

378
00:29:16,320 --> 00:29:17,730
And that's exactly what we do here.

379
00:29:17,730 --> 00:29:27,810
You have this 32 times 0.44, and then you have 32 times 0.375.

380
00:29:28,050 --> 00:29:35,040
So once we have these first two, you now have the width and the height, where you have the

381
00:29:35,040 --> 00:29:42,930
values multiplied by 224 and 224, which makes sense because as we've seen before, to obtain the width

382
00:29:42,930 --> 00:29:48,510
and the height, we divided by 224. And so now to get back to the original width and height, we need

383
00:29:48,510 --> 00:29:50,220
to multiply by 224.

384
00:29:50,220 --> 00:29:53,970
So we essentially have this here multiplied

385
00:29:54,680 --> 00:30:00,080
by the coordinates, or by the bounding box coordinates.

386
00:30:01,010 --> 00:30:07,640
Now, the reason why we are repeating here is simply because we could have several objects.

387
00:30:07,670 --> 00:30:12,650
Now, in case we have two objects, we'll just repeat this twice and carry out the same calculation

388
00:30:12,650 --> 00:30:14,270
for the two different objects.

389
00:30:14,270 --> 00:30:20,690
And this repeats here, where we specify the length of the scalar; the length of the scalar is equal

390
00:30:20,690 --> 00:30:20,990
to two.

391
00:30:21,020 --> 00:30:23,420
So this permits us to repeat twice.

392
00:30:24,470 --> 00:30:30,980
So that's it for the target, where we've taken these values, because this

393
00:30:30,980 --> 00:30:34,340
is zero, one, two, three, four.

394
00:30:34,340 --> 00:30:37,610
And so when we say 1 to 5, we're taking one, two, three, four.

395
00:30:37,610 --> 00:30:44,600
So we're taking this, this, this and this, multiplying by 32 for this, multiplying by 32 for this,

396
00:30:44,600 --> 00:30:47,690
multiplying by 224 for this, and this by 224.

397
00:30:47,690 --> 00:30:53,690
And that's how we obtain this distance from here to here in terms of pixels.

398
00:30:53,690 --> 00:30:56,940
So from here to here we get the distance.

399
00:30:56,940 --> 00:31:04,440
And then now we could also repeat the same process for the two predictions: we have prediction

400
00:31:04,440 --> 00:31:05,880
one, and then we have prediction

401
00:31:05,880 --> 00:31:06,930
two. For prediction

402
00:31:06,930 --> 00:31:08,940
one, we simply do the same, you see here.

403
00:31:09,300 --> 00:31:13,950
But the difference is that it's y pred, and yes, y pred, unlike here, which is y target. We still have

404
00:31:13,950 --> 00:31:15,540
the same calculation.

405
00:31:15,540 --> 00:31:22,680
And note how here we go from 6 to 10 because this is 1 to 5 and then this is 6 to 10 for bounding box

406
00:31:22,680 --> 00:31:23,130
two.

407
00:31:23,130 --> 00:31:25,440
So that's the only difference we have here.

408
00:31:25,440 --> 00:31:30,000
And then we would get pred one and pred two.

409
00:31:30,780 --> 00:31:40,530
So at this point, for this object here, we know the distance from the origin right up to the nearest

410
00:31:42,060 --> 00:31:46,260
cell, which is this distance; in this case it is 32.

411
00:31:46,290 --> 00:31:47,730
We had seen that already.

412
00:31:47,730 --> 00:31:55,470
And then we also know this distance from here to the center of the object in terms of pixels.

413
00:31:55,470 --> 00:31:56,850
And that's what we just calculated.

414
00:31:56,850 --> 00:31:59,280
And that gives us about 15.21.

415
00:31:59,280 --> 00:32:02,490
So we have now 32 and we have 15.21.

416
00:32:02,490 --> 00:32:07,500
So it means we have the center or the distance from here to the center in terms of pixels.

417
00:32:07,530 --> 00:32:11,850
Now for this 128, you see we're going to go from here to here.

418
00:32:11,850 --> 00:32:17,810
We know that this is 128 and we also have this distance from here to here, which is ten.

419
00:32:17,820 --> 00:32:20,940
So now we could add this up for the width and the height.

420
00:32:20,970 --> 00:32:26,640
We have simply taken what we had and then multiplied by 224 and that's what we've had.

421
00:32:26,640 --> 00:32:29,250
So we don't need to do any modifications here.

422
00:32:29,250 --> 00:32:35,940
Now you'll notice now that because we have this, we could add this with this, this with this zero,

423
00:32:35,940 --> 00:32:45,330
with this, zero with this, and then we could get our output now, which will be this width and the

424
00:32:45,360 --> 00:32:52,260
or rather this center with respect to this origin and the width and the height with respect to the full

425
00:32:52,260 --> 00:32:52,980
image.

426
00:32:53,190 --> 00:33:00,150
Now, to make it very clear, let's consider we had only one here, so let's take off this prediction

427
00:33:00,150 --> 00:33:02,910
and then take off this other prediction.

428
00:33:02,910 --> 00:33:05,940
And suppose that we had only one object.

429
00:33:05,940 --> 00:33:08,400
So you could see that very clearly.

430
00:33:08,400 --> 00:33:13,170
You see here we have, um, 96, 64.

431
00:33:13,170 --> 00:33:17,100
Well, this should be the other one, because we took this one off.

432
00:33:17,580 --> 00:33:21,480
Um, we should instead take this one here off.

433
00:33:21,480 --> 00:33:22,200
Oops.

434
00:33:22,200 --> 00:33:23,970
This should be the one taken off.

435
00:33:24,060 --> 00:33:26,310
Okay, so let's run this again.

436
00:33:26,310 --> 00:33:27,390
And there we go.

437
00:33:27,390 --> 00:33:30,990
So you see here we have 32, 128, 0, 0.

438
00:33:31,020 --> 00:33:39,540
We have 15, 10, 28, 52, where we will now take 32 plus 15, and then we take 128 plus ten, we take

439
00:33:39,540 --> 00:33:41,190
zero plus this zero plus that.

440
00:33:41,190 --> 00:33:45,810
And that gives us the width and the height and also the center.

441
00:33:45,810 --> 00:33:52,320
And then for the predictions, we will take this, plus this, and then we take this.

442
00:33:52,320 --> 00:33:53,310
Plus this.

443
00:33:53,310 --> 00:33:55,650
Take this, plus this, this, plus this.

444
00:33:55,650 --> 00:33:56,400
And that will be it.

445
00:33:56,400 --> 00:34:05,010
So finally, we'll be able to obtain the bounding boxes in terms of pixels for the target and

446
00:34:05,010 --> 00:34:06,540
for the two predictions.
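As a minimal sketch of the decoding steps just described, here is the same arithmetic in plain Python for a single box. The function name, the argument names, and the normalized width and height (0.125 and 0.25) are hypothetical values chosen for the example; only the cell position (1, 4), the offsets 0.44 and 0.375, the cell size 32 and the image size 224 come from the walkthrough.

```python
CELL_SIZE = 32    # each grid cell covers 32 x 32 pixels
IMAGE_SIZE = 224  # full image is 224 x 224 pixels


def decode_box(cell_x, cell_y, x_off, y_off, w_norm, h_norm):
    """Turn a cell-relative box into pixel coordinates.

    The center is the pixel position of the cell's top-left corner plus
    the offset inside the cell; width and height are relative to the
    whole image, so they are scaled by the image size.
    """
    x_center = cell_x * CELL_SIZE + x_off * CELL_SIZE
    y_center = cell_y * CELL_SIZE + y_off * CELL_SIZE
    width = w_norm * IMAGE_SIZE
    height = h_norm * IMAGE_SIZE
    return (x_center, y_center, width, height)


# Cell (1, 4) with offsets 0.44 and 0.375 lands near (46, 140).
box = decode_box(1, 4, 0.44, 0.375, 0.125, 0.25)
```

The cell term plays the role of the rescaled positions (32, 128), and the offset term is what gets added to it.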

447
00:34:06,930 --> 00:34:14,540
So finally now we are going to take what we had here and then add with this other one target, this

448
00:34:14,580 --> 00:34:16,950
of scalar one and target of scalar two.

449
00:34:16,950 --> 00:34:19,590
So let's run this so you could see the outputs.

450
00:34:20,580 --> 00:34:21,450
Uh, there we go.

451
00:34:21,450 --> 00:34:22,860
You see, we have now 47.

452
00:34:22,860 --> 00:34:30,750
We've taken the 32, plus that around 15 and we've had 47, 138 this, this, we've had this, this,

453
00:34:30,750 --> 00:34:32,490
this, this and then this.

454
00:34:32,490 --> 00:34:37,860
So we now have these bounding boxes in terms of pixels.

455
00:34:38,280 --> 00:34:45,990
So the next step we want to follow is actually compare this box with this and then compare this box

456
00:34:45,990 --> 00:34:52,230
with this and then see which of these two are actually closer to this.

457
00:34:53,750 --> 00:34:59,450
Now, one thing we did while designing this pred was to ensure that the first one, this one,

458
00:34:59,570 --> 00:35:05,770
is going to be the closer one, because you could see here, 47 and 47, they are actually almost the same

459
00:35:06,770 --> 00:35:07,520
values.

460
00:35:07,520 --> 00:35:11,270
So that when testing you will see that this one is closer.

461
00:35:11,270 --> 00:35:16,100
So to compare this, we will need the IOU score.

462
00:35:16,100 --> 00:35:25,190
And to compute this IOU, let's consider this simple example where we have two boxes, this B1 and B2.

463
00:35:25,700 --> 00:35:31,640
Now the IOU, as we saw already, is simply the intersection over the union.

464
00:35:31,640 --> 00:35:39,890
So what we have to compute is going to be this intersection here divided by the union.

465
00:35:39,890 --> 00:35:48,380
So it's essentially, um, this region as we've seen already, divided by all this, including this,

466
00:35:48,380 --> 00:35:50,180
um, intersection region.

467
00:35:50,450 --> 00:36:01,730
So it is this divided by this. Now, starting with this intersection, what we will need will be this

468
00:36:01,730 --> 00:36:06,680
here, this coordinate here, and also this; we need this position and this one here.

469
00:36:06,680 --> 00:36:13,700
So the way we're going to get this point is by starting by getting this point, this point here, this

470
00:36:13,700 --> 00:36:16,580
point, this point and this point.

471
00:36:16,580 --> 00:36:21,770
But remember that we are actually dealing with the center and the width and the height.

472
00:36:21,770 --> 00:36:31,100
So we do not have the x min, y min, x max, y max values passed in here.

473
00:36:31,100 --> 00:36:34,100
What we pass in here, what we actually receive,

474
00:36:34,130 --> 00:36:38,540
is the center, the width and the height.

475
00:36:38,570 --> 00:36:43,850
Now the first thing we have to do is to convert these coordinates, given to us in this form of center,

476
00:36:43,850 --> 00:36:51,530
width and height, to one in which we have this x min, y min and this x max, y max.

477
00:36:51,530 --> 00:36:55,080
And to do that, we have this piece of code right here.

478
00:36:55,080 --> 00:37:02,340
Now, first thing you have to note is the fact that we have, um, zero one, two, three where this

479
00:37:02,340 --> 00:37:08,790
represents our X center, y center width and height.

480
00:37:08,820 --> 00:37:17,130
Now, if I'm given this here, x center and y center, then to obtain this position, x min, y min,

481
00:37:17,160 --> 00:37:24,720
it suffices to take, for example, this x c, which is this distance from here to here, and then subtract

482
00:37:24,720 --> 00:37:32,850
or take away half of the width from it, because from here to here is the width and then from here to

483
00:37:32,850 --> 00:37:34,020
here is half of the width.

484
00:37:34,020 --> 00:37:38,070
And if we take x c minus half of the width, then we get back to this position.

485
00:37:38,220 --> 00:37:44,580
And then if we take y C minus half of the height, then we get this position here.

486
00:37:44,580 --> 00:37:52,050
So at the end of the day we get back to the origin obviously of this specific, um, box.

487
00:37:52,050 --> 00:37:57,090
So if we want to get back to the origin of the specific box, which happens to be x min, y min, I need

488
00:37:57,090 --> 00:38:02,640
to take x c minus half of the width and y c minus half of the height.

489
00:38:02,640 --> 00:38:06,270
And so you see here we have x c; this is x c.

490
00:38:06,300 --> 00:38:09,270
We've seen this already, minus half of the width.

491
00:38:09,270 --> 00:38:15,870
The width is index two, you see, so half of the width is that divided by two, and then y c,

492
00:38:15,900 --> 00:38:20,700
which is this, minus half of the height, index three here.

493
00:38:20,700 --> 00:38:24,900
So that's what we do to have x min, y min.

494
00:38:25,380 --> 00:38:32,460
And now if we want to have x max, y max, then we have to take x c plus half of the width and then y

495
00:38:32,490 --> 00:38:34,650
c plus half of the height.

496
00:38:34,650 --> 00:38:35,760
And that's what we do here.

497
00:38:35,820 --> 00:38:38,220
Just simply replace and put the plus and that's fine.

498
00:38:38,220 --> 00:38:39,930
So that's how this box,

499
00:38:39,930 --> 00:38:47,490
uh, this box one, is converted to a box one where we now have x min, y min, x max, y max.

500
00:38:47,490 --> 00:38:50,100
And we do the same for this second box.
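The conversion from center format to corner format can be sketched as follows. This is a plain-Python illustration rather than the tensor code shown in the video, and the function name is an assumption; the index order (x center, y center, width, height) follows the walkthrough.

```python
def center_to_corners(box):
    """Convert (x_center, y_center, width, height) to
    (x_min, y_min, x_max, y_max).

    Indices 0 and 1 hold the center, index 2 the width and index 3 the
    height, matching the order used in the walkthrough.
    """
    x_c, y_c, w, h = box
    x_min = x_c - w / 2  # move left by half the width
    y_min = y_c - h / 2  # move up by half the height (origin is top left)
    x_max = x_c + w / 2  # same idea with a plus sign
    y_max = y_c + h / 2
    return (x_min, y_min, x_max, y_max)


corners = center_to_corners((46.0, 140.0, 28.0, 56.0))
# corners is (32.0, 112.0, 60.0, 168.0)
```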

501
00:38:50,100 --> 00:38:56,450
Now, once we have this, the next step will be to get these coordinates here.

502
00:38:56,450 --> 00:39:02,840
Now it happens that if we want to get this x min, because remember this is the intersection, and this

503
00:39:02,840 --> 00:39:11,690
intersection forms another box, then to get this here it suffices to compare this x min, y min and this

504
00:39:11,690 --> 00:39:12,860
other x min, y min.

505
00:39:12,860 --> 00:39:19,670
So we take the x min, y min of B1 and the x min, y min of B2 and look for the maximum between

506
00:39:19,670 --> 00:39:20,330
the two.

507
00:39:20,330 --> 00:39:27,080
And it is this maximum, or the one which is rightmost, because our origin is at the top left

508
00:39:27,080 --> 00:39:27,590
corner.

509
00:39:27,590 --> 00:39:31,940
So increasing is this direction and this direction; decreasing

510
00:39:31,940 --> 00:39:34,520
is this direction and this direction.

511
00:39:34,520 --> 00:39:43,280
So when we say maximum of this and this, then we're looking for the one which is in the rightmost direction.

512
00:39:43,280 --> 00:39:48,740
So when we're comparing here, you see we've taken the first two and the first two here.

513
00:39:48,740 --> 00:39:53,540
So we're comparing the x min, y min values,

514
00:39:53,650 --> 00:39:54,190
actually.

515
00:39:54,190 --> 00:39:59,410
So when we compare the x min and y min, we get the one which is maximum, and that's the one which will

516
00:39:59,410 --> 00:40:05,110
play the role of the x min, y min of this intersection rectangle, which we formed here.

517
00:40:05,110 --> 00:40:09,820
And then for the x max, y max, we need instead a minimum.

518
00:40:09,820 --> 00:40:16,180
So between these two, between this and this, we need the one which is

519
00:40:16,180 --> 00:40:16,930
leftmost.

520
00:40:16,930 --> 00:40:21,610
So that's why we instead take the minimum and we compare the last two indices.

521
00:40:21,610 --> 00:40:23,860
That's x max and y max, as you could see.

522
00:40:23,860 --> 00:40:30,920
And then that's how we obtain this position and this position for this intersection rectangle.

523
00:40:30,940 --> 00:40:36,730
Now, once we've gotten that, that is x min, y min, x max, y max for this intersection, the next thing we

524
00:40:36,730 --> 00:40:42,730
want to do is actually compute the width and the height so that we could multiply and get the area of

525
00:40:42,730 --> 00:40:43,550
this here.

526
00:40:43,570 --> 00:40:45,700
Now to get the width and the height is quite simple.

527
00:40:45,700 --> 00:40:51,790
We just need to subtract. It's simple, because here you have, let's write that

528
00:40:51,790 --> 00:40:52,030
again.

529
00:40:52,030 --> 00:40:58,720
We have x min, y min, x max, y max.

530
00:40:58,720 --> 00:41:08,680
So you just take x max minus x min, and then you multiply by y max minus y min.

531
00:41:08,710 --> 00:41:09,310
That's all.

532
00:41:09,310 --> 00:41:11,530
So this is the width and then this is the height.

533
00:41:11,560 --> 00:41:15,780
Take this minus this times this, minus this.

534
00:41:15,790 --> 00:41:16,360
That's all.

535
00:41:16,360 --> 00:41:19,240
So that's what you see which is actually done here.

536
00:41:19,270 --> 00:41:23,980
You see, we've done the subtraction from here, we've done the subtraction.

537
00:41:23,980 --> 00:41:29,850
And then we now multiply these two, and that's how we get the area, the intersection area.

538
00:41:29,860 --> 00:41:35,260
Now once we have the intersection area, the next thing we want to do is obtain the union area.

539
00:41:35,260 --> 00:41:45,250
So we take box one, where you see two is the width, and box one again for the height, and multiply width

540
00:41:45,250 --> 00:41:45,850
times height.

541
00:41:45,880 --> 00:41:50,410
We have this area. For box two, width times height.

542
00:41:50,410 --> 00:41:53,950
We have this area. Now we add these two areas up.

543
00:41:53,980 --> 00:41:55,810
You see here, we add these two areas up.

544
00:41:55,810 --> 00:42:02,380
We remove the intersection because we do not want to add this twice,

545
00:42:02,380 --> 00:42:05,230
because when you calculate this area, you already have this.

546
00:42:05,230 --> 00:42:07,060
And when you calculate this other area, you still have this.

547
00:42:07,060 --> 00:42:10,450
So you want to take off the intersection because it's going to be computed twice.

548
00:42:10,450 --> 00:42:14,500
And then we now have the full union.

549
00:42:14,500 --> 00:42:18,190
And so now we have intersection divided by union and that's it.
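Putting the intersection and union steps together, a minimal IOU sketch in plain Python could look like this. It takes boxes already in (x_min, y_min, x_max, y_max) form; the function name is an assumption, and the clamping of the intersection width and height at zero for non-overlapping boxes is an addition of this sketch that the walkthrough does not mention.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Top-left of the intersection: the larger of the two minima.
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    # Bottom-right of the intersection: the smaller of the two maxima.
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)

    # Clamp at zero so disjoint boxes give an empty intersection
    # (an extra safeguard added in this sketch).
    inter_w = max(0.0, ix2 - ix1)
    inter_h = max(0.0, iy2 - iy1)
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # The overlap is counted in both areas, so subtract it once.
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


score = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))  # 1 / 7
```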

550
00:42:18,190 --> 00:42:25,780
So once we have the intersection divided by union, now we will be able to compare this with this and

551
00:42:25,780 --> 00:42:26,350
this.

552
00:42:26,380 --> 00:42:35,560
And so now, in order to compare these different boxes, we would have the target box compared with

553
00:42:35,730 --> 00:42:42,370
the second prediction box, and we also have the target box compared with the first prediction box.

554
00:42:42,790 --> 00:42:47,110
Well, before doing this, we could actually print these two out separately.

555
00:42:47,110 --> 00:42:50,410
So let's print this one out.

556
00:42:50,440 --> 00:42:52,500
Let's do tf.print.

557
00:42:52,990 --> 00:42:59,290
Um, second, second box that's comparing the target with the second box.

558
00:42:59,860 --> 00:43:05,020
Let's have that there and then we'll compare the target with the first box.

559
00:43:05,020 --> 00:43:08,800
So let's print now first box.

560
00:43:09,040 --> 00:43:10,030
There we go.

561
00:43:10,030 --> 00:43:12,310
We have now print one.

562
00:43:12,550 --> 00:43:13,630
Okay, so that's it.

563
00:43:13,660 --> 00:43:15,190
We'll run this before having this.

564
00:43:15,190 --> 00:43:18,160
So you better understand what we're doing here.

565
00:43:18,160 --> 00:43:19,870
So let's run this.

566
00:43:20,050 --> 00:43:21,820
Um, let's take this off.

567
00:43:23,230 --> 00:43:27,520
We could take this off and then run this.

568
00:43:28,030 --> 00:43:30,910
You see that with the second box, we have 0.16.

569
00:43:30,910 --> 00:43:37,480
And with the first box, we have 0.93, which makes much sense because when you look at these predictions,

570
00:43:38,290 --> 00:43:47,260
the first box looks much more similar to the target as compared to the second box, which is this one.

571
00:43:47,260 --> 00:43:52,240
So it makes more sense that the first box has a higher IOU score.

572
00:43:52,960 --> 00:44:00,310
Now, applying the tf.math.greater method, we will be able to have this boolean output, which

573
00:44:00,310 --> 00:44:07,480
tells us whether this second prediction is greater than the first prediction, that is, whether the IOU between the

574
00:44:07,480 --> 00:44:12,580
target and the second prediction is greater than the IOU between the first and the target.

575
00:44:12,580 --> 00:44:18,400
So, given that in this case, for example, this IOU between the first and the target is greater

576
00:44:18,400 --> 00:44:24,970
than the IOU between the second and the target, then this will output false.

577
00:44:25,090 --> 00:44:32,530
And since this will output false, casting this to an integer will produce an output of zero, and the

578
00:44:32,530 --> 00:44:35,950
zero will simply mean that between the two options.

579
00:44:35,950 --> 00:44:39,250
That's um, the first output and the second output.

580
00:44:39,250 --> 00:44:40,930
This is actually the first output.

581
00:44:41,650 --> 00:44:46,060
And then this is the second output or second bounding box between the first and the second bounding

582
00:44:46,060 --> 00:44:46,570
box.

583
00:44:46,570 --> 00:44:50,710
This is the one which is closer to the target.

584
00:44:50,710 --> 00:44:53,110
And that's exactly what we want to have here.

585
00:44:53,110 --> 00:44:53,470
So.

586
00:44:53,490 --> 00:44:53,880
There we go.

587
00:44:53,880 --> 00:44:55,470
We have this output zero.
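The comparison and cast described here can be mimicked in plain Python. The video uses the TensorFlow greater op followed by a cast to integer; the list-based function below is only a stand-in, and its name is hypothetical.

```python
def best_box_mask(iou_first, iou_second):
    """Return 0 where the first box is closer to the target, 1 where
    the second box is closer.

    Each entry asks "is the second IOU greater than the first?" and
    casts the boolean answer to an integer.
    """
    return [int(second > first)
            for first, second in zip(iou_first, iou_second)]


# One object: first box has IOU 0.93, second box has IOU 0.16.
mask = best_box_mask([0.93], [0.16])  # [0], the first box wins
```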

588
00:44:55,500 --> 00:44:59,640
Now let's include this other example.

589
00:44:59,640 --> 00:45:03,330
So now we suppose that we have two objects, this object and this object.

590
00:45:03,930 --> 00:45:08,340
By adding this, we've added this other one here, which happens to be this object.

591
00:45:08,370 --> 00:45:12,780
Then we'll uncomment this part and then see what we have for the mask.

592
00:45:13,230 --> 00:45:14,220
So there we go.

593
00:45:14,220 --> 00:45:21,360
You see, we have zero one, meaning that for the first object, that is this object here,

594
00:45:22,510 --> 00:45:31,920
it is the first box, which is this one here, which is closer to the object or to the target.

595
00:45:31,930 --> 00:45:40,260
And then for the second object here, it is the second box which is closer.

596
00:45:40,270 --> 00:45:47,230
Now let's interchange this: let's take this off here and paste it here, and then take this from

597
00:45:47,230 --> 00:45:51,400
here and then paste it here and see what we get.

598
00:45:51,430 --> 00:45:54,970
Take this off and add a comma right here.

599
00:45:54,970 --> 00:46:01,780
So now, because this one here will be closer, we should have this one being picked.

600
00:46:01,780 --> 00:46:03,640
So we should have one one.

601
00:46:03,670 --> 00:46:06,520
Now let's run this and see what we get.

602
00:46:06,520 --> 00:46:08,020
You see, we have one one.

603
00:46:08,020 --> 00:46:09,760
And you could see that from here.

604
00:46:09,790 --> 00:46:14,890
The second boxes have higher IOU scores compared to the first boxes.

605
00:46:15,010 --> 00:46:22,090
Okay, so now we have the bounding boxes which are closer to the target.

606
00:46:22,090 --> 00:46:29,440
Like in this case, we know that B1 is closer to the target as compared to B2.

607
00:46:29,470 --> 00:46:36,100
So the next thing we need to do now is get Lambda one and lambda two, and then from this lambda one,

608
00:46:36,100 --> 00:46:39,370
lambda two, choose um, lambda one.

609
00:46:39,370 --> 00:46:41,530
Now let's start by getting lambda one and

610
00:46:41,530 --> 00:46:44,020
lambda two, which is going to be quite easy.

611
00:46:44,020 --> 00:46:49,510
All we need to do is pick this zeroth value and this fifth value.

612
00:46:49,510 --> 00:46:55,750
Remember, this is zero, one, two, three, four and then five.

613
00:46:55,750 --> 00:46:56,890
So this is the fifth.

614
00:46:57,040 --> 00:46:58,660
So that's why you see, we pick this.

615
00:46:58,730 --> 00:47:04,750
This is the probability of having an object for B two, and then this is that for B one.

616
00:47:04,750 --> 00:47:07,810
In fact, this is Lambda two and then this is Lambda one.

617
00:47:07,810 --> 00:47:12,190
So we concatenate this and then we rearrange this by transposing.

618
00:47:12,220 --> 00:47:14,500
So let's run this and see what we get.

619
00:47:14,530 --> 00:47:15,490
There we go.

620
00:47:15,490 --> 00:47:20,140
As you can see, we have 0.9 and 0.8.

621
00:47:20,170 --> 00:47:20,920
That's it.

622
00:47:20,920 --> 00:47:26,270
And then we have 0.3 and 0.98.

623
00:47:26,270 --> 00:47:33,020
And so what we'll do now is based on the mask, we'll say, okay, for the first value, which is zero,

624
00:47:33,050 --> 00:47:37,400
it means that we are going to take this one instead of this.

625
00:47:37,490 --> 00:47:43,160
So we're going to take this probability instead, and then move to the next position, or move to

626
00:47:43,160 --> 00:47:44,210
the next object.

627
00:47:44,240 --> 00:47:50,660
We are going to select the value at index one, that is, the second value.

628
00:47:50,660 --> 00:47:53,840
So we'll skip zero and then we will pick one.

629
00:47:53,840 --> 00:47:59,270
So for the first one, lambda one is picked, and then for this other one, lambda

630
00:47:59,270 --> 00:48:00,320
two will be picked.

631
00:48:00,350 --> 00:48:06,830
Remember that if we change these positions, then we will pick lambda two for this one and lambda

632
00:48:06,830 --> 00:48:08,900
two for this one because this will be one one.

633
00:48:08,900 --> 00:48:13,580
So since this is zero one, we'll pick lambda one for this first object.

634
00:48:13,580 --> 00:48:18,950
And then since this is one here, we'll pick lambda two for this other object.

635
00:48:19,160 --> 00:48:26,900
So now to do that programmatically, we are going to gather from all these lambda values,

636
00:48:26,900 --> 00:48:30,710
that is, the probabilities here, based on the mask.

637
00:48:30,710 --> 00:48:39,110
So doing that, you'll see that we'll be able to pick those values of lambda corresponding to the bounding

638
00:48:39,110 --> 00:48:44,900
boxes with the higher scores with respect to the targets.
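The gather step can be illustrated with plain Python lists. The video does this with a TensorFlow gather over the concatenated lambda values; the function name here is an assumption, while the numbers (0.9, 0.8) and (0.3, 0.98) and the mask [0, 1] are the ones read out in the walkthrough.

```python
def select_confidence(confidences, mask):
    """For each object, pick lambda one (index 0) or lambda two
    (index 1) according to the mask entry for that object."""
    return [pair[idx] for pair, idx in zip(confidences, mask)]


# One (lambda_1, lambda_2) pair per object.
confidences = [(0.9, 0.8), (0.3, 0.98)]

selected = select_confidence(confidences, [0, 1])
# selected is [0.9, 0.98]: lambda one for the first object,
# lambda two for the second.
```

With the mask [1, 1], the same call would pick lambda two for both objects.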

639
00:48:44,900 --> 00:48:49,340
So we've seen how this works; let's take a single example.

640
00:48:49,340 --> 00:48:51,410
So let's comment this again.

641
00:48:52,430 --> 00:48:56,870
We'll comment this and this run that.

642
00:48:57,230 --> 00:49:05,600
We'll see how we're able to take these two bounding boxes, that is, this bounding box and this other bounding

643
00:49:05,600 --> 00:49:06,350
box.

644
00:49:06,620 --> 00:49:16,400
And then from these, pick the one with the higher IOU score and hence pick the probability which we

645
00:49:16,400 --> 00:49:19,370
are going to use to subtract from one.

646
00:49:19,370 --> 00:49:23,870
So now we know that we will have one minus 0.9.

647
00:49:23,900 --> 00:49:25,640
Now let's reverse this.

648
00:49:25,640 --> 00:49:27,590
Let's reverse this again.

649
00:49:28,250 --> 00:49:36,740
Um, we'll reverse this, take this off and then we paste this out here and then we take this off.

650
00:49:37,650 --> 00:49:39,740
Then we paste this here.

651
00:49:39,770 --> 00:49:43,820
Okay, so let's run this and then see what we get.

652
00:49:44,030 --> 00:49:52,310
You see, now we have 0.8; we've picked this other one instead because this bounding box has

653
00:49:52,310 --> 00:49:57,200
a higher IOU score with respect to the target as compared to this other bounding box.

654
00:49:57,200 --> 00:50:04,130
So this is how we are going to pick the bounding box which we are going to use for computing the objectness

655
00:50:04,130 --> 00:50:05,750
part of our loss.

656
00:50:06,230 --> 00:50:06,800
Okay.

657
00:50:06,800 --> 00:50:12,590
So finally, now we just need to compute the difference between what we have here that's in this case,

658
00:50:12,590 --> 00:50:16,160
0.8, for example.

659
00:50:16,160 --> 00:50:17,120
Let's get back.

660
00:50:17,120 --> 00:50:21,140
So we have what we had originally, which is going to be 0.9.

661
00:50:21,140 --> 00:50:21,970
So the difference between this

662
00:50:22,020 --> 00:50:23,380
0.9 and one.

663
00:50:23,380 --> 00:50:28,590
So you see here we have ones, so you could see the length is based on the scalar.

664
00:50:28,600 --> 00:50:31,300
So since we have only one object, this is going to be one.

665
00:50:31,300 --> 00:50:35,800
So 1 - 0.9, and that will be it.

666
00:50:35,830 --> 00:50:42,940
Now, this difference method we have defined here is basically the square of y minus x.

667
00:50:42,940 --> 00:50:49,300
So we subtract the two, take the square as defined in the paper, and then we have this reduce_sum method

668
00:50:49,300 --> 00:50:50,770
to get a single value.
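
The objectness step described here can be sketched in plain Python for a single cell that contains one object. This is only an illustration, not the code from the session: the helper names `iou` and `object_loss`, the `(x_center, y_center, width, height)` box format, and the per-cell prediction layout `[conf1, x1, y1, w1, h1, conf2, x2, y2, w2, h2]` are assumptions, and the session itself works on TensorFlow tensors.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_center, y_center, width, height) (assumed format)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def object_loss(pred_cell, target_box):
    """Squared error between 1 and the confidence of the responsible box.

    pred_cell is assumed to be [conf1, x1, y1, w1, h1, conf2, x2, y2, w2, h2].
    The responsible box is the one with the higher IoU against the target.
    """
    iou_1 = iou(pred_cell[1:5], target_box)
    iou_2 = iou(pred_cell[6:10], target_box)
    conf = pred_cell[0] if iou_1 >= iou_2 else pred_cell[5]
    return (1.0 - conf) ** 2


target = (0.5, 0.5, 0.4, 0.4)
# First box overlaps the target, so its confidence (0.9) is used.
print(object_loss([0.9, 0.5, 0.5, 0.4, 0.4, 0.8, 0.1, 0.1, 0.2, 0.2], target))
# Swapping the boxes makes the second box responsible, so 0.8 is used.
print(object_loss([0.9, 0.1, 0.1, 0.2, 0.2, 0.8, 0.5, 0.5, 0.4, 0.4], target))
```

With several objects, the per-cell values would be summed into a single number, which is the role the reduce_sum step plays in the session.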

669
00:50:50,770 --> 00:50:51,490
So that's it.

670
00:50:51,520 --> 00:50:58,330
We run all this and we could now print out our object loss.

671
00:50:58,330 --> 00:51:01,960
So we have print object loss.

672
00:51:01,990 --> 00:51:02,850
There we go.

673
00:51:02,860 --> 00:51:04,210
Let's run that.

674
00:51:04,210 --> 00:51:06,840
And you see what we have.

675
00:51:06,850 --> 00:51:08,110
0.01.

676
00:51:08,140 --> 00:51:08,650
You see that?

677
00:51:08,650 --> 00:51:12,970
That's quite small because 0.9 is close to 1.

678
00:51:12,970 --> 00:51:16,450
Now let's just modify this and say, okay, let's say 0.09.

679
00:51:16,480 --> 00:51:18,550
So that's going to have a higher loss.

680
00:51:18,550 --> 00:51:20,530
You see that we have higher loss now.

681
00:51:20,530 --> 00:51:21,190
That's it.

682
00:51:21,190 --> 00:51:23,460
So that's what we do now.

683
00:51:23,460 --> 00:51:29,520
You see that no matter what we do here, even if we put here 1.0, we'll never have a loss of zero here.

684
00:51:29,520 --> 00:51:34,350
And that's because it's this one which is used to compute that loss.

685
00:51:34,350 --> 00:51:39,840
We are now going to move to the No objectness part of our loss, which is this here.

686
00:51:39,840 --> 00:51:43,260
And it's kind of similar to this calculation.

687
00:51:43,260 --> 00:51:49,320
But the difference is that now we are focusing on those regions where we do not have objects.

688
00:51:49,710 --> 00:52:00,690
So unlike here where we focused on this here, this cell where there's this object and this other cell,

689
00:52:01,860 --> 00:52:10,140
what we'll do now is we'll be focusing on all the other cells except for this and this one and this

690
00:52:10,140 --> 00:52:10,650
one.

691
00:52:11,700 --> 00:52:14,420
So as you could see, we get all the Y parts.

692
00:52:14,430 --> 00:52:20,730
We gather all these predictions where the target is zero.

693
00:52:20,880 --> 00:52:23,400
That is simply where there is no object.

694
00:52:23,400 --> 00:52:25,230
So let's print this out.

695
00:52:25,230 --> 00:52:32,640
Let's print out the different positions where we have no object and the corresponding predictions.

696
00:52:32,850 --> 00:52:37,770
Here we have tf.print of y pred extract.

697
00:52:37,770 --> 00:52:42,990
So this is seven by seven, meaning that we have 49 different cells right here.

698
00:52:42,990 --> 00:52:47,580
Now two cells have objects and the remaining 47 do not have objects.

699
00:52:47,580 --> 00:52:50,400
So let's run this and see what we get.

700
00:52:50,400 --> 00:52:51,330
And there we go.

701
00:52:51,330 --> 00:52:52,380
Here's what we have.

702
00:52:52,410 --> 00:53:00,330
So you could see clearly from here that we have all these different cells and let's do print so we could

703
00:53:00,330 --> 00:53:01,470
get its shape.

704
00:53:02,250 --> 00:53:04,440
Um, there we go.

705
00:53:04,800 --> 00:53:05,880
Let's run that.

706
00:53:06,000 --> 00:53:09,720
You can see this is actually 48 now.

707
00:53:09,720 --> 00:53:10,800
Um, that's because.

708
00:53:10,830 --> 00:53:13,290
Okay, so that's because we had only one object.

709
00:53:13,290 --> 00:53:17,970
So if we have two objects, two objects.

710
00:53:18,300 --> 00:53:18,630
Oops.

711
00:53:18,630 --> 00:53:19,860
Let's get back.

712
00:53:21,060 --> 00:53:23,700
Um, we have two objects.

713
00:53:23,700 --> 00:53:27,000
So let's run this again and see what we get.

714
00:53:27,000 --> 00:53:31,200
Okay, so you see now we have 47 cells where there is no object.

715
00:53:31,200 --> 00:53:38,430
So that tells you that, um, out of the 49, we have two where there is an object.

716
00:53:39,000 --> 00:53:47,850
And then this y pred extract here now contains all these scores and bounding boxes.

717
00:53:48,900 --> 00:53:52,800
You could see from here that this is 47 by ten.

718
00:53:52,800 --> 00:53:54,360
So let's get to the bottom.

719
00:53:54,360 --> 00:54:01,200
You see 47 by ten, because for each and every one of these, we could take this off from here.

720
00:54:01,260 --> 00:54:05,730
We could make this connection from, um, a cell where there is no object.

721
00:54:05,730 --> 00:54:09,120
So this cell, for example, here.

722
00:54:09,650 --> 00:54:20,210
We have these ten different predictions, and now we will make use of them to compute the loss.

723
00:54:20,210 --> 00:54:27,920
So obviously the no objectness part of our loss will be computed using this lambda and this lambda.

724
00:54:27,920 --> 00:54:32,570
So we'll take this one or rather zero because we expect it to be zero.

725
00:54:32,570 --> 00:54:37,130
So this time around our y true will be zero, because when there is no object, y true is zero.

726
00:54:37,130 --> 00:54:40,490
So this is going to be zero minus lambda,

727
00:54:41,780 --> 00:54:44,150
here, minus lambda one,

728
00:54:45,420 --> 00:54:51,570
squared, plus zero minus lambda two, squared.

729
00:54:51,690 --> 00:54:59,670
So the idea here is to ensure that those probabilities equal zero when there is no object and equal

730
00:54:59,670 --> 00:55:01,380
one when there is an object.

731
00:55:01,920 --> 00:55:05,100
So getting back to the code, you see that we break this up into two parts.

732
00:55:05,100 --> 00:55:09,210
This is like getting lambda one, or computing with lambda one, and this is like computing with

733
00:55:09,210 --> 00:55:10,140
the lambda two.

734
00:55:11,310 --> 00:55:19,410
So unlike with the object part of the loss, where we were trying to pick which of these was responsible

735
00:55:19,410 --> 00:55:24,330
for the prediction, here we simply take this minus this, plus this, minus this.

736
00:55:24,330 --> 00:55:30,800
So our target is obviously zeros, unlike before, where our target was ones.

737
00:55:30,810 --> 00:55:38,790
So now we have zeros and then we compute the difference between our Y target here and the y pred.

738
00:55:38,820 --> 00:55:45,870
Now we pick index zero for this one, and then we pick index five for this other box right here.

739
00:55:45,880 --> 00:55:46,780
So that's it.

740
00:55:46,810 --> 00:55:51,850
We simply, um, find the difference or compute this difference using this method which we've defined

741
00:55:51,850 --> 00:55:53,230
already right here.

742
00:55:53,230 --> 00:55:59,320
And then we sum those two up to obtain our no object loss.
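
As a rough illustration of the no-object term just described, here is a plain Python sketch. The function name `no_object_loss` and the per-cell layout (confidences at index 0 and index 5 of a ten-value prediction) follow the description above, but the code itself is an assumption and not the TensorFlow code used in the session.

```python
def no_object_loss(pred_cells, has_object):
    """Sum of squared confidences over cells that contain no object.

    pred_cells: list of per-cell predictions, each assumed to be
                [conf1, x1, y1, w1, h1, conf2, x2, y2, w2, h2].
    has_object: list of booleans, True where the target has an object.
    """
    loss = 0.0
    for cell, obj in zip(pred_cells, has_object):
        if obj:
            continue  # cells with an object belong to the objectness term
        # The target confidence is zero for both boxes of an empty cell.
        loss += (0.0 - cell[0]) ** 2 + (0.0 - cell[5]) ** 2
    return loss


cells = [
    [0.5, 0, 0, 0, 0, 0.5, 0, 0, 0, 0],  # empty cell, contributes 0.25 + 0.25
    [0.9, 0, 0, 0, 0, 0.8, 0, 0, 0, 0],  # cell with an object, skipped
]
print(no_object_loss(cells, [False, True]))
```

Unlike the objectness term, no responsible box is selected here: both confidences of every empty cell are pushed towards zero.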

743
00:55:59,320 --> 00:56:08,560
So that's it, let's print out or let's take off this here and then print out our no object loss.

744
00:56:09,250 --> 00:56:13,840
We have no object loss.

745
00:56:14,740 --> 00:56:15,610
There we go.

746
00:56:15,610 --> 00:56:17,290
You see, we have 110.

747
00:56:17,950 --> 00:56:19,750
Now we'll move to this next part.

748
00:56:19,750 --> 00:56:23,890
That's for the classification, where we focus on the object's class.

749
00:56:23,890 --> 00:56:32,290
You'll see that we are going to only compute this or get this loss for cells where we have an object

750
00:56:32,430 --> 00:56:39,130
here only where we have an object and we do not care about which of the bounding boxes is responsible

751
00:56:39,160 --> 00:56:42,100
because we are actually focusing only on the classes.

752
00:56:42,100 --> 00:56:48,880
So instead of ij here we have just i, because we are not focused on choosing, or we are not interested

753
00:56:48,880 --> 00:56:53,590
in choosing any specific bounding box, given that it doesn't really matter since we're focusing on

754
00:56:54,010 --> 00:56:54,850
classes.

755
00:56:55,000 --> 00:57:04,360
So getting back here, we have our object class loss, we have the predictions and the target.

756
00:57:04,390 --> 00:57:11,860
Now, if you check from here, you remember that the target will start from five because this is zero

757
00:57:11,860 --> 00:57:15,340
one, two, three, four and then five.

758
00:57:16,000 --> 00:57:18,940
But for the predictions, it will start from ten.

759
00:57:18,940 --> 00:57:28,600
So you see here we have zero, one, two, three, four, five, six, seven, eight, nine and then

760
00:57:28,600 --> 00:57:29,110
ten.

761
00:57:29,110 --> 00:57:30,940
So we start from ten right here.

762
00:57:30,940 --> 00:57:32,860
So this starts from five and this starts from ten.

763
00:57:32,860 --> 00:57:38,050
Now we go from ten to the last and then we go from five to the last and then we'll simply compute the

764
00:57:38,050 --> 00:57:40,030
difference between this and this.

765
00:57:40,030 --> 00:57:41,470
So that's it.

766
00:57:41,500 --> 00:57:44,370
We also make sure that this is only where there's an object.

767
00:57:44,380 --> 00:57:50,860
So now you have this difference and you obtain your class loss, you see that we obtain a class loss

768
00:57:50,860 --> 00:57:52,840
of 5.47.
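
The class term can be sketched in plain Python as follows. The layouts are taken from the description above (class scores start at index 5 in the target and at index 10 in the prediction); the function name `class_loss` and the three-class toy data are assumptions made only for this illustration.

```python
def class_loss(pred_cells, target_cells):
    """Squared error on class scores, only for cells that contain an object.

    target cell (assumed): [has_object, x, y, w, h, class scores...]
    pred cell (assumed):   [conf1, x1, y1, w1, h1, conf2, x2, y2, w2, h2, class scores...]
    """
    loss = 0.0
    for pred, target in zip(pred_cells, target_cells):
        if target[0] != 1:
            continue  # no object in this cell, so no class loss
        for p, t in zip(pred[10:], target[5:]):
            loss += (t - p) ** 2
    return loss


# Toy example with 3 classes instead of 20.
targets = [
    [1, 0.5, 0.5, 0.2, 0.2, 0, 1, 0],  # object of class index 1
    [0, 0, 0, 0, 0, 0, 0, 0],          # empty cell
]
preds = [
    [0] * 10 + [0.5, 0.5, 0.0],
    [0] * 10 + [0.9, 0.9, 0.9],        # ignored because the cell is empty
]
print(class_loss(preds, targets))
```

No responsible box is chosen here, since the class scores are shared by both boxes of a cell.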

769
00:57:53,170 --> 00:58:00,460
Now we get to the last part of our loss, which is that involving the coordinates, which itself is

770
00:58:00,460 --> 00:58:02,310
broken up into two sub parts.

771
00:58:02,320 --> 00:58:08,470
This first part is just for the center and then this other is for the width and height.

772
00:58:08,740 --> 00:58:16,420
Now this part is more similar to the object part of the loss, simply because here

773
00:58:16,420 --> 00:58:26,470
we are going to focus only on cells where we have an object and only on those bounding boxes which are

774
00:58:26,470 --> 00:58:28,780
responsible for the prediction.

775
00:58:29,230 --> 00:58:33,490
So again, we are going to gather all our predictions here.

776
00:58:33,520 --> 00:58:39,520
We gather all our predictions where the target is equal to one.

777
00:58:39,520 --> 00:58:41,710
So simply where we have objects.

778
00:58:42,420 --> 00:58:42,900
Then.

779
00:58:42,900 --> 00:58:45,300
Now we combine the centers.

780
00:58:45,300 --> 00:58:48,600
So we see we go from 1 to 3 and then from 6 to 8.

781
00:58:48,630 --> 00:58:54,150
The reason being that here, this is zero, this is one, this is two.

782
00:58:54,180 --> 00:58:57,030
So this here represents our centers.

783
00:58:57,330 --> 00:59:08,010
And then here, this is actually six, seven, and this represents the other center.

784
00:59:08,010 --> 00:59:11,220
So this is one center, and this is the second center.

785
00:59:11,220 --> 00:59:19,560
So that's why you see we take this and this and stack them up to form our center joined. Then, similar

786
00:59:19,560 --> 00:59:28,020
to what we had done already with the Objectness loss, we are going to only pick a given center based

787
00:59:28,020 --> 00:59:32,640
on whether it's the bounding box which is responsible or not for the prediction.

788
00:59:32,640 --> 00:59:34,410
And again, we're going to use this mask.

789
00:59:34,410 --> 00:59:39,300
Remember, we had seen this already previously with the Objectness loss.

790
00:59:39,300 --> 00:59:44,620
So we've seen this already; it's the exact same process we're following here.

791
00:59:44,650 --> 00:59:49,270
We just want to make sure we're picking the bounding box which is responsible for the prediction.

792
00:59:49,270 --> 00:59:52,540
And since we've computed this mask already, that's what we're going to do.

793
00:59:52,540 --> 01:00:03,160
Now let's print out our center joined, and then let's print out the center pred. So we

794
01:00:03,160 --> 01:00:12,610
see that we actually pick out only some bounding boxes from the two choices we have. Center pred.

795
01:00:13,980 --> 01:00:20,790
You see here that for the first object, which is this, we have this option and we have this option.

796
01:00:20,790 --> 01:00:26,390
Now, for the first object, it's the first box, or the zeroth index, that's responsible, you see

797
01:00:26,400 --> 01:00:26,990
here.

798
01:00:27,000 --> 01:00:30,600
And then for the second object, we have this option and this option.

799
01:00:30,600 --> 01:00:35,910
But because it's this one, the second bounding box, or index one, that's

800
01:00:35,910 --> 01:00:37,740
responsible, we actually pick this.

801
01:00:37,740 --> 01:00:41,970
So you see that this one is discarded and this one, too, is discarded.

802
01:00:41,970 --> 01:00:44,970
So we focus only on this.

803
01:00:45,750 --> 01:00:51,750
Now for the target, we simply pick out this one and two.

804
01:00:51,840 --> 01:00:52,680
So that's it.

805
01:00:52,680 --> 01:00:53,610
So we pick this.

806
01:00:53,760 --> 01:00:57,210
Obviously, going from 1 to 3 is simply taking one and then taking two.

807
01:00:57,210 --> 01:01:03,960
So we pick this and then now we compare with whichever one of these is responsible for the prediction

808
01:01:03,960 --> 01:01:08,010
and comparing that is simply applying our difference method.
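
A minimal plain Python sketch of the center term for one cell, assuming the same layouts as described above (prediction centers at indices 1 to 3 and 6 to 8, target center at indices 1 to 3). The name `center_loss` and the `responsible` argument (0 or 1, the index of the box with the higher IoU) are hypothetical stand-ins for the mask used in the session.

```python
def center_loss(pred_cell, target_cell, responsible):
    """Squared error on (x, y) between the target and the responsible box.

    pred_cell (assumed):   [conf1, x1, y1, w1, h1, conf2, x2, y2, w2, h2]
    target_cell (assumed): [has_object, x, y, w, h, ...]
    responsible: 0 or 1, index of the predicted box with the higher IoU.
    """
    centers = [pred_cell[1:3], pred_cell[6:8]]  # the two candidate centers
    px, py = centers[responsible]               # keep only the responsible one
    tx, ty = target_cell[1:3]
    return (tx - px) ** 2 + (ty - py) ** 2


pred = [0.9, 0.5, 0.5, 0.2, 0.2, 0.8, 0.25, 0.75, 0.2, 0.2]
target = [1, 0.5, 0.25, 0.2, 0.2]
print(center_loss(pred, target, responsible=0))  # uses the first center
print(center_loss(pred, target, responsible=1))  # uses the second center
```

The center of the non-responsible box is simply discarded, as in the printed example above.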

809
01:01:08,430 --> 01:01:14,550
So now we're done with the center, which is actually this part here.

810
01:01:14,550 --> 01:01:17,370
We're now going to move to the width and the height.

811
01:01:18,020 --> 01:01:22,520
So here it's the exact same thing, with just the difference that we're picking the width and the height.

812
01:01:22,520 --> 01:01:26,390
So now instead of one, two, we're going to pick three, four.

813
01:01:26,390 --> 01:01:30,430
So that's why you see we go from 3 to 5 and then here we pick eight, ten.

814
01:01:30,440 --> 01:01:31,850
Well, okay, this is a prediction.

815
01:01:31,850 --> 01:01:34,880
So instead of picking this, this, we pick three, four.

816
01:01:34,880 --> 01:01:36,590
So we have this.

817
01:01:37,250 --> 01:01:38,420
Let's pick this.

818
01:01:38,420 --> 01:01:41,840
We have this three, four and then eight, ten.

819
01:01:41,840 --> 01:01:43,040
So that's what we do here.

820
01:01:43,160 --> 01:01:50,390
We stack them up, we carry out the selection, and then we also get the target.

821
01:01:50,390 --> 01:01:54,830
So we take this three, four, see that, and then we compute the difference.

822
01:01:54,860 --> 01:01:59,990
Now remember that when computing this difference, we have to take the square root.

823
01:01:59,990 --> 01:02:01,790
So you see here we have square root.

824
01:02:02,240 --> 01:02:07,070
And since the square root takes in only non-negative numbers, we make sure to compute the square root of

825
01:02:07,070 --> 01:02:08,450
the absolute values.
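
The width and height term can be sketched in the same way. This plain Python version assumes sizes at indices 3 to 5 and 8 to 10 of the prediction and 3 to 5 of the target; `size_loss` and `responsible` are hypothetical names, and the absolute value is there only to keep the square root defined when a raw prediction is negative.

```python
import math


def size_loss(pred_cell, target_cell, responsible):
    """Squared error on sqrt(width) and sqrt(height) for the responsible box.

    pred_cell (assumed):   [conf1, x1, y1, w1, h1, conf2, x2, y2, w2, h2]
    target_cell (assumed): [has_object, x, y, w, h, ...]
    responsible: 0 or 1, index of the predicted box with the higher IoU.
    """
    sizes = [pred_cell[3:5], pred_cell[8:10]]
    pw, ph = sizes[responsible]
    tw, th = target_cell[3:5]
    # abs() guards against negative raw predictions before taking the root.
    dw = math.sqrt(abs(tw)) - math.sqrt(abs(pw))
    dh = math.sqrt(abs(th)) - math.sqrt(abs(ph))
    return dw ** 2 + dh ** 2


pred = [0.9, 0.5, 0.5, 0.25, 0.25, 0.8, 0.5, 0.5, -0.25, 0.25]
target = [1, 0.5, 0.5, 1.0, 0.25]
print(size_loss(pred, target, responsible=0))
print(size_loss(pred, target, responsible=1))  # negative width handled by abs()
```

Taking square roots makes the same absolute error count for more on small boxes than on large ones, which is the motivation given in the YOLO paper.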

826
01:02:08,450 --> 01:02:12,830
So that is size pred and size target.

827
01:02:13,130 --> 01:02:15,500
And from here we've gotten the center loss.

828
01:02:15,500 --> 01:02:17,060
We've also gotten the size loss.

829
01:02:17,060 --> 01:02:22,170
This now forms our box loss and that's it for all the different loss functions.

830
01:02:22,170 --> 01:02:25,640
We're now simply going to add them up. Now, before adding up,

831
01:02:25,650 --> 01:02:30,600
remember from the paper we had seen that lambda coordinate is going to be five and lambda no object

832
01:02:30,600 --> 01:02:32,130
is going to be 0.5.

833
01:02:32,160 --> 01:02:33,690
We have seen this already from here.

834
01:02:33,720 --> 01:02:36,570
We have this lambda coordinate and this lambda no object.

835
01:02:36,570 --> 01:02:37,320
So that's it.

836
01:02:37,350 --> 01:02:38,460
We have this.

837
01:02:38,490 --> 01:02:41,940
We make sure that when adding these up, we take this into consideration.
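
Putting the parts together, the weighting described here can be sketched as below. The constants 5 and 0.5 are the values quoted from the paper; the function name `total_loss` and its argument names are assumptions for illustration only.

```python
LAMBDA_COORD = 5.0   # weight on the box (center + size) loss, from the paper
LAMBDA_NOOBJ = 0.5   # weight on the no-object loss, from the paper


def total_loss(object_loss, no_object_loss, class_loss, center_loss, size_loss):
    """Weighted sum of the individual loss terms."""
    box_loss = center_loss + size_loss
    return (
        object_loss
        + LAMBDA_NOOBJ * no_object_loss
        + class_loss
        + LAMBDA_COORD * box_loss
    )


print(total_loss(object_loss=1.0, no_object_loss=2.0, class_loss=3.0,
                 center_loss=0.5, size_loss=0.5))
```

The smaller weight on the no-object term keeps the many empty cells from dominating the few cells that actually contain objects.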

838
01:02:41,940 --> 01:02:45,660
So let's run this and then, well, let's print out the loss.

839
01:02:45,660 --> 01:02:48,720
So let's print out the loss.

840
01:02:49,830 --> 01:02:50,790
There we go.

841
01:02:50,790 --> 01:02:53,010
And that should be fine.

842
01:02:53,010 --> 01:02:59,610
We are then going to define our model checkpoint where our file path is this here.

843
01:02:59,640 --> 01:03:01,680
Then we're going to save only the weights.

844
01:03:01,680 --> 01:03:03,750
We're going to monitor the validation loss.

845
01:03:04,050 --> 01:03:12,030
We're going to obviously save the model which produces the minimum, or the smallest, validation loss,

846
01:03:12,030 --> 01:03:13,110
and that's it.

847
01:03:13,110 --> 01:03:15,930
We save the best weights only.

848
01:03:15,930 --> 01:03:21,480
So we run that, and then now we move to the scheduling. Here,

849
01:03:21,480 --> 01:03:23,730
if the epoch number is less than 40,

850
01:03:23,730 --> 01:03:29,380
so for the first 40 epochs, we use a learning rate of one times ten to the negative three. Between

851
01:03:29,380 --> 01:03:30,420
40 and 80,

852
01:03:30,450 --> 01:03:32,970
we use a learning rate of five times ten to the negative four.

853
01:03:32,970 --> 01:03:36,840
And then after that we use a learning rate of one times ten to the negative four.
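
The schedule just described can be written as a small Python function. The name `scheduler` and the exact boundary handling (epoch 40 and epoch 80 moving to the next rate) are assumptions, since only "less than 40" and "between 40 and 80" are stated.

```python
def scheduler(epoch, lr=None):
    """Piecewise-constant learning rate by epoch number.

    The lr argument is accepted but unused, so the function matches the
    (epoch, lr) signature expected by Keras' LearningRateScheduler callback.
    """
    if epoch < 40:
        return 1e-3
    elif epoch < 80:   # assumed boundary: epochs 40 to 79
        return 5e-4
    else:
        return 1e-4


for epoch in (0, 39, 40, 79, 80, 120):
    print(epoch, scheduler(epoch))
```

A function of this shape could then be passed to `tf.keras.callbacks.LearningRateScheduler` and supplied to `model.fit` through its `callbacks` argument.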

854
01:03:36,840 --> 01:03:37,980
So that's it.

855
01:03:38,160 --> 01:03:43,110
We compile our model and then we start with the training.

856
01:03:43,710 --> 01:03:48,630
Now after training for a few epochs, you'll notice that the model starts to overfit.

857
01:03:48,630 --> 01:03:56,250
And so in the next section we are going to use several techniques to help resolve

858
01:03:56,250 --> 01:03:58,020
the problem of overfitting.

859
01:03:58,200 --> 01:04:05,190
Now we've been training for over 20 epochs and you could see clearly from the loss and the validation,

860
01:04:05,190 --> 01:04:11,380
or the train loss and the validation loss, that our model starts performing well and at some point it

861
01:04:11,400 --> 01:04:12,660
starts overfitting.

862
01:04:12,690 --> 01:04:17,790
As you could see here, we have the training loss which keeps dropping right here.

863
01:04:17,790 --> 01:04:24,060
And then the validation loss drops and then at some point starts increasing.

864
01:04:24,060 --> 01:04:27,360
So clearly our model is overfitting.