1
00:00:01,000 --> 00:00:01,000
Hi guys.

2
00:00:01,000 --> 00:00:09,000
So in this lecture I will present an overview of object detection algorithms, which include R-CNN, Fast

3
00:00:09,000 --> 00:00:12,000
R-CNN, Faster R-CNN, and Mask R-CNN.

4
00:00:12,000 --> 00:00:18,000
We will look at the key takeaways from each of the algorithms and the primary challenges associated with

5
00:00:18,000 --> 00:00:20,000
each of the algorithms.

6
00:00:21,000 --> 00:00:22,000
Following are our objectives.

7
00:00:22,000 --> 00:00:26,000
Or you can say following are the topics which we will cover in this lecture.

8
00:00:26,000 --> 00:00:29,000
The first one is the convolutional neural network.

9
00:00:29,000 --> 00:00:36,000
We will not go into much detail regarding the convolutional neural network, like what is the activation function

10
00:00:36,000 --> 00:00:37,000
in the convolutional neural network?

11
00:00:37,000 --> 00:00:43,000
What are the kernels, why do we use dense layers, and other things?

12
00:00:43,000 --> 00:00:50,000
We will only be focusing on why we use a convolutional neural network and why we can't use a plain convolutional neural

13
00:00:50,000 --> 00:00:52,000
network for object detection?

14
00:00:52,000 --> 00:00:55,000
Next, we will discuss R-CNN.

15
00:00:55,000 --> 00:01:03,000
We will see what R-CNN is, what the primary goal of R-CNN is, how we can implement R-CNN, and what

16
00:01:03,000 --> 00:01:07,000
are the primary challenges associated with R-CNN.

17
00:01:07,000 --> 00:01:14,000
Next, we will discuss Fast R-CNN and the primary challenges associated with Fast R-CNN.

18
00:01:14,000 --> 00:01:21,000
Next, we will look at Faster R-CNN and how Faster R-CNN is different from R-CNN and Fast R-CNN.

19
00:01:22,000 --> 00:01:25,000
What is the region proposal network, and why do we need it

20
00:01:25,000 --> 00:01:34,000
in Faster R-CNN? And last, we will see what the key takeaways from Faster R-CNN are and why we need

21
00:01:34,000 --> 00:01:39,000
Mask R-CNN. And at the end we will also look at Mask R-CNN and how it works.

22
00:01:41,000 --> 00:01:45,000
For now, we will look at the convolutional neural network.

23
00:01:45,000 --> 00:01:50,000
So a convolutional neural network works best if we want to do image classification.

24
00:01:50,000 --> 00:01:56,000
For example, is there a cat or a dog in the image? Or if we want to do multi-class classification,

25
00:01:56,000 --> 00:02:00,000
like is there a cat, dog, person, or deer in the image?

26
00:02:00,000 --> 00:02:01,000
So there are four classes.

27
00:02:01,000 --> 00:02:06,000
So even if there are ten classes, a convolutional neural network works best.

28
00:02:06,000 --> 00:02:10,000
A convolutional neural network works best for image classification.

29
00:02:10,000 --> 00:02:12,000
So what basically is image classification?

30
00:02:12,000 --> 00:02:14,000
Let me explain over here.

31
00:02:14,000 --> 00:02:19,000
For example, we have this image, and let's say we have a cat in this image.

32
00:02:20,000 --> 00:02:20,000
Okay?

33
00:02:20,000 --> 00:02:21,000
So.

34
00:02:21,000 --> 00:02:24,000
With image classification, we will have an output

35
00:02:27,000 --> 00:02:33,000
like there is a cat in the image, or if we have multiple objects in the image, we will have the

36
00:02:33,000 --> 00:02:36,000
output like cat, dog, person.

37
00:02:36,000 --> 00:02:40,000
And this is the output we get using

38
00:02:40,000 --> 00:02:43,000
a convolutional neural network for image classification.

39
00:02:43,000 --> 00:02:49,000
But in the case of object detection, we need to figure out where the specific object is in the image.

40
00:02:49,000 --> 00:02:53,000
Like we need to find out where the cat is in the image.

41
00:02:53,000 --> 00:02:59,000
So in the case of object detection, we need to draw a bounding box like this bounding box around the

42
00:02:59,000 --> 00:03:00,000
cat in the image.

43
00:03:00,000 --> 00:03:04,000
So why can't we use a CNN for object detection?

44
00:03:05,000 --> 00:03:09,000
Or why can't we do object detection using a convolutional neural network?

45
00:03:09,000 --> 00:03:11,000
You might be thinking this.

46
00:03:11,000 --> 00:03:17,000
Okay, so if we have a single object in the image, then we

47
00:03:17,000 --> 00:03:20,000
can use a convolutional neural network for object detection.

48
00:03:20,000 --> 00:03:24,000
Like here we have only a single object in the image.

49
00:03:24,000 --> 00:03:30,000
Here we can use a convolutional neural network for object detection, but if we have multiple objects

50
00:03:30,000 --> 00:03:35,000
in the image, we can't use a plain convolutional neural network for object detection.

51
00:03:35,000 --> 00:03:41,000
But is there a way out, you might be thinking. Yes, there is a way you can use a convolutional neural

52
00:03:41,000 --> 00:03:44,000
network if we have multiple objects in the image.

53
00:03:44,000 --> 00:03:46,000
But this will not work.

54
00:03:46,000 --> 00:03:47,000
This will fail.

55
00:03:47,000 --> 00:03:53,000
This approach will fail if you try, because we can't use a plain convolutional neural network for object detection

56
00:03:53,000 --> 00:03:56,000
if we have multiple objects in the image or video.

57
00:03:56,000 --> 00:04:03,000
But if you try to do it, like I have outlined over here, one way to do object detection

58
00:04:03,000 --> 00:04:08,000
using a CNN is to divide our image into smaller blocks, or grid cells, for example.

59
00:04:08,000 --> 00:04:09,000
This is my image.

60
00:04:10,000 --> 00:04:16,000
And here I have multiple objects, like these are the two objects and this is the third object.

61
00:04:16,000 --> 00:04:22,000
So I will divide my image into multiple grid cells. Let me change the color for the grid.

62
00:04:22,000 --> 00:04:28,000
So here I'm dividing my image into multiple grid cells, like this: one, two, three.

63
00:04:28,000 --> 00:04:32,000
So we have divided our image into a three cross three grid.

64
00:04:32,000 --> 00:04:32,000
Okay.

65
00:04:33,000 --> 00:04:38,000
So one way to do object detection using a CNN is to divide our image into smaller grids.

66
00:04:38,000 --> 00:04:44,000
But if we have objects of different sizes in the image, like we have multiple objects,

67
00:04:44,000 --> 00:04:47,000
like seven objects or ten objects in the image,

68
00:04:47,000 --> 00:04:50,000
and all of the objects are of different sizes,

69
00:04:50,000 --> 00:04:56,000
we can divide the image into a 1000 cross 1000 grid. Like here we have divided the image into

70
00:04:56,000 --> 00:04:58,000
a three cross three grid.

71
00:04:58,000 --> 00:05:05,000
So now we can divide the image into 1000 cross 1000 small grid cells, but it will definitely become computationally

72
00:05:05,000 --> 00:05:06,000
expensive.

73
00:05:06,000 --> 00:05:10,000
Like if you try to run this on a GPU, it may run out of memory or simply take far too long.

74
00:05:10,000 --> 00:05:12,000
So this will not work.

75
00:05:13,000 --> 00:05:19,000
We can use a CNN for object detection if we have a single object in the image, but if we have multiple

76
00:05:19,000 --> 00:05:22,000
objects in the image, a plain CNN fails for object detection.

77
00:05:22,000 --> 00:05:30,000
A CNN works best for image classification, but a plain CNN can't be used for object detection if we have multiple

78
00:05:30,000 --> 00:05:31,000
objects in the image.
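To make the cost of the grid idea concrete, here is a minimal sketch in Python. It assumes that every grid cell is cropped and classified separately by the CNN; the function names and the 600 by 900 image size are made up for illustration and are not from the lecture slides.

```python
def grid_cells(height, width, rows, cols):
    """Yield (top, left, bottom, right) pixel boxes for a rows x cols grid."""
    for r in range(rows):
        for c in range(cols):
            top = r * height // rows
            left = c * width // cols
            bottom = (r + 1) * height // rows
            right = (c + 1) * width // cols
            yield (top, left, bottom, right)


def count_classifier_calls(rows, cols):
    """Number of CNN calls if every grid cell is classified on its own."""
    return rows * cols


# A three cross three grid on a hypothetical 600 x 900 image.
cells = list(grid_cells(600, 900, 3, 3))
print(len(cells))   # 9
print(cells[0])     # (0, 0, 200, 300)

# A 1000 cross 1000 grid needs a million CNN calls for a single image.
print(count_classifier_calls(1000, 1000))  # 1000000
```

Even with this simple counting, you can see why the fine grid is not practical, and the cells still have one fixed size, so objects of different sizes are not handled well.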

79
00:05:34,000 --> 00:05:37,000
Now I will discuss the region based convolutional neural network.

80
00:05:37,000 --> 00:05:42,000
So let's first look at the goal of the region based convolutional neural network.

81
00:05:42,000 --> 00:05:49,000
The goal of the region-based convolutional neural network is to take an input image and correctly identify

82
00:05:49,000 --> 00:05:52,000
where the objects are in the image via bounding boxes.

83
00:05:52,000 --> 00:05:55,000
So let me explain to you what basically this is.

84
00:05:55,000 --> 00:05:58,000
For example, this is my input image.

85
00:05:59,000 --> 00:06:02,000
Okay, so let me change colors.

86
00:06:02,000 --> 00:06:05,000
So for example, this is my object number one.

87
00:06:05,000 --> 00:06:06,000
This is my object number two.

88
00:06:06,000 --> 00:06:08,000
This is my object number three.

89
00:06:08,000 --> 00:06:11,000
This is my object number four.

90
00:06:11,000 --> 00:06:14,000
And this is my object number five.

91
00:06:14,000 --> 00:06:16,000
So I have five different objects in the image.

92
00:06:16,000 --> 00:06:21,000
So the goal of the region-based convolutional neural network is to take an input image.

93
00:06:21,000 --> 00:06:22,000
This is my input image.

94
00:06:22,000 --> 00:06:29,000
Okay, so the region-based convolutional neural network will take this input image, and in the output we

95
00:06:29,000 --> 00:06:37,000
will get the objects with bounding boxes, using the region-based convolutional neural network.

96
00:06:38,000 --> 00:06:42,000
We can identify where the objects are in the image via bounding boxes.

97
00:06:42,000 --> 00:06:44,000
So in the output we will get.

98
00:06:44,000 --> 00:06:47,000
So, in the output,

99
00:06:50,000 --> 00:06:56,000
you can have the bounding box around each object, like this is the bounding box for object three.

100
00:06:56,000 --> 00:06:58,000
This is the bounding box for object four.

101
00:06:58,000 --> 00:07:01,000
This is the bounding box for object five.

102
00:07:01,000 --> 00:07:06,000
So a region-based convolutional neural network takes an input image, and in the output it correctly

103
00:07:06,000 --> 00:07:11,000
identifies where the objects are in the image via bounding boxes.

104
00:07:11,000 --> 00:07:12,000
So how does it work?

105
00:07:12,000 --> 00:07:18,000
The input image is passed in, while in the output we have the bounding boxes plus a label for each object in the image,

106
00:07:18,000 --> 00:07:19,000
like this can be a dog.

107
00:07:20,000 --> 00:07:22,000
This can be a cat.

108
00:07:22,000 --> 00:07:26,000
So these are the labels, like cat, and this can be a cow.

109
00:07:27,000 --> 00:07:29,000
Or this can be a camel.

110
00:07:31,000 --> 00:07:33,000
Or this can be a sheep.

111
00:07:33,000 --> 00:07:34,000
Okay.

112
00:07:35,000 --> 00:07:42,000
The region-based convolutional neural network was proposed in 2014 by Ross Girshick and his co-authors.

113
00:07:42,000 --> 00:07:45,000
So what is R-CNN?

114
00:07:45,000 --> 00:07:47,000
How does R-CNN basically work?

115
00:07:49,000 --> 00:07:54,000
The secret of what R-CNN does is that it divides the image into a number of boxes instead of grids.

116
00:07:54,000 --> 00:07:55,000
Okay.

117
00:07:55,000 --> 00:08:02,000
So instead of dividing the image into grids, like what basically a grid is, we divide

118
00:08:02,000 --> 00:08:04,000
the image into grids like this.

119
00:08:04,000 --> 00:08:06,000
And these are the grids we are creating.

120
00:08:06,000 --> 00:08:08,000
So in R-CNN, we don't do this.

121
00:08:08,000 --> 00:08:11,000
So this approach isn't followed in R-CNN.

122
00:08:11,000 --> 00:08:17,000
So R-CNN basically divides the image into a number of boxes instead of grids, like you can see over

123
00:08:17,000 --> 00:08:18,000
here.

124
00:08:19,000 --> 00:08:19,000
I just.

125
00:08:19,000 --> 00:08:21,000
Just highlight it.

126
00:08:22,000 --> 00:08:24,000
It is the highlighter.

127
00:08:25,000 --> 00:08:31,000
So you can see over here, in R-CNN we divide the image into a number of boxes instead of grids.

128
00:08:31,000 --> 00:08:37,000
So R-CNN uses the selective search algorithm to extract 2000 regions from the image.

129
00:08:37,000 --> 00:08:45,000
So in R-CNN, using the selective search algorithm, we extract 2000 regions from the image, called region

130
00:08:45,000 --> 00:08:46,000
proposals.

131
00:08:46,000 --> 00:08:47,000
Okay.

132
00:08:47,000 --> 00:08:53,000
In R-CNN, we identify different regions in the image and then pass them to the feature extractor,

133
00:08:53,000 --> 00:08:54,000
which is a convolutional neural network.

134
00:08:54,000 --> 00:09:00,000
So, for example, in R-CNN, we don't divide our image into grids.

135
00:09:01,000 --> 00:09:03,000
So let me explain this point to you.

136
00:09:04,000 --> 00:09:05,000
But look.

137
00:09:07,000 --> 00:09:14,000
So, for example, here we have the image in R-CNN, and this is the object in the image.

138
00:09:15,000 --> 00:09:20,000
So in R-CNN, we will divide our image into about 2000 boxes.

139
00:09:20,000 --> 00:09:22,000
We have one box here.

140
00:09:22,000 --> 00:09:23,000
We have another box here.

141
00:09:23,000 --> 00:09:24,000
We have another box.

142
00:09:25,000 --> 00:09:25,000
Okay.

143
00:09:25,000 --> 00:09:29,000
So in R-CNN, we are using the selective search algorithm.

144
00:09:29,000 --> 00:09:34,000
We extract 2000 regions from the image, called region proposals.

145
00:09:34,000 --> 00:09:41,000
And after extracting 2000 regions from the image, we pass them to the feature extractor, the convolutional

146
00:09:41,000 --> 00:09:42,000
neural network.

147
00:09:42,000 --> 00:09:43,000
So this is what we do.

148
00:09:43,000 --> 00:09:49,000
That is why it is called a region-based convolutional neural network, because we extract 2000 regions

149
00:09:49,000 --> 00:09:51,000
from the image, which are called region proposals.

150
00:09:51,000 --> 00:09:54,000
So this is how R-CNN works.

151
00:09:55,000 --> 00:10:01,000
The region-based convolutional neural network is based on the following steps. First, we generate a

152
00:10:01,000 --> 00:10:06,000
set of region proposals for the bounding boxes using the selective search algorithm.

153
00:10:07,000 --> 00:10:09,000
We basically have an input image.

154
00:10:09,000 --> 00:10:18,000
Then we select 2000 regions from the input image, and then we pass each of the 2000 regions we selected from

155
00:10:18,000 --> 00:10:22,000
the input image through the CNN feature extractor to extract the features.

156
00:10:22,000 --> 00:10:27,000
So basically the CNN feature extractor here is a pre-trained AlexNet.

157
00:10:27,000 --> 00:10:35,000
Okay, then we have an SVM to see what object the image has in each bounding box.

158
00:10:37,000 --> 00:10:44,000
Basically, after getting 2000 regions from the input image and passing the

159
00:10:44,000 --> 00:10:51,000
regions through the pre-trained AlexNet for feature extraction, then using an SVM, we classify what are

160
00:10:51,000 --> 00:10:54,000
the different objects we have inside the bounding boxes.

161
00:10:54,000 --> 00:11:01,000
Okay, like we can have a cat, dog, or cow inside the bounding boxes, and then we run the bounding boxes through

162
00:11:01,000 --> 00:11:07,000
a linear regression model to output tighter coordinates for the boxes once the object has been classified.

163
00:11:07,000 --> 00:11:11,000
So using linear regression, we refine the coordinates of the bounding boxes.
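As a rough illustration of this last step, here is a small Python sketch of how regressed offsets can be applied to a proposal box. It assumes the box parameterization described in the R-CNN paper (center x, center y, width, height, with offsets scaled by the box size and log-space width and height changes). The function name and the numbers are made up for the example, and the regression model that would predict the offsets is not shown.

```python
import math


def apply_box_deltas(box, deltas):
    """Refine a proposal box with regressed offsets.

    box    = (x_center, y_center, width, height) in pixels
    deltas = (dx, dy, dw, dh) as predicted by the box regressor
    """
    x, y, w, h = box
    dx, dy, dw, dh = deltas
    new_x = x + w * dx          # shift the center, scaled by box width
    new_y = y + h * dy          # shift the center, scaled by box height
    new_w = w * math.exp(dw)    # grow or shrink the width
    new_h = h * math.exp(dh)    # grow or shrink the height
    return (new_x, new_y, new_w, new_h)


# Hypothetical proposal and offsets, only for illustration.
proposal = (100.0, 80.0, 50.0, 40.0)
refined = apply_box_deltas(proposal, (0.1, -0.05, 0.0, 0.0))
print(refined)  # approximately (105.0, 78.0, 50.0, 40.0)
```

With all offsets at zero the box stays exactly where it was, which is a handy sanity check.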

164
00:11:11,000 --> 00:11:15,000
So what are the issues with R-CNN? In R-CNN,

165
00:11:15,000 --> 00:11:20,000
for each image, we extract 2000 regions and then we pass them to the CNN feature extractor.

166
00:11:20,000 --> 00:11:26,000
So when we extract 2000 regions from the image and pass them to the CNN feature extractor, this makes

167
00:11:26,000 --> 00:11:34,000
the whole process very slow, and it cannot be implemented in real time because each image takes about 47 seconds

168
00:11:34,000 --> 00:11:37,000
to be processed at test time.

169
00:11:37,000 --> 00:11:43,000
So each image takes about 47 seconds for processing, and this is too much time.

170
00:11:43,000 --> 00:11:51,000
So we can't use R-CNN in real time, because the issue with R-CNN

171
00:11:51,000 --> 00:11:56,000
is that for each input image we extract 2000 regions and then we pass each of them through the convolutional neural network

172
00:11:56,000 --> 00:11:58,000
for feature extraction.

173
00:11:58,000 --> 00:12:05,000
So what we can do is, if we have an input image, we first pass it to the feature extractor, and then we select

174
00:12:05,000 --> 00:12:13,000
2000 regions after the feature extractor, using the selective search algorithm. That is, after passing

175
00:12:13,000 --> 00:12:13,000
through the

176
00:12:14,000 --> 00:12:20,000
CNN feature extractor, we take the 2000 regions found by the selective search algorithm,

177
00:12:20,000 --> 00:12:23,000
and then we can do further processing.

178
00:12:23,000 --> 00:12:26,000
This can make the whole process much faster.

179
00:12:28,000 --> 00:12:33,000
So here is the block diagram of the region-based convolutional neural network.

180
00:12:33,000 --> 00:12:34,000
Here we have the input image.

181
00:12:34,000 --> 00:12:40,000
We extract 2000 regions from the image, and using the CNN feature extractor, we extract features, and

182
00:12:40,000 --> 00:12:43,000
then we classify, like there is an aeroplane, person, or TV monitor.

183
00:12:43,000 --> 00:12:50,000
So the process can be made very efficient if we have this region proposal step after this

184
00:12:50,000 --> 00:12:56,000
CNN feature extractor. Like first we have the input image, then we have the CNN feature extractor.

185
00:12:57,000 --> 00:13:03,000
Then we can have the region proposal step at the third point over here. Like we can move this over

186
00:13:03,000 --> 00:13:05,000
here, okay?

187
00:13:05,000 --> 00:13:09,000
And then we can classify, like this is an aeroplane, person, or TV monitor.

188
00:13:10,000 --> 00:13:12,000
Now let's look at Fast R-CNN.

189
00:13:13,000 --> 00:13:19,000
Basically, what we do in R-CNN is that we identify 2000 regions in the image, and then we pass

190
00:13:19,000 --> 00:13:23,000
them to the feature extractor, the CNN-based feature extractor.

191
00:13:23,000 --> 00:13:27,000
So what we do in Fast R-CNN is that we have the input image.

192
00:13:27,000 --> 00:13:34,000
Instead of identifying 2000 regions from the image, we send the whole image to the CNN feature extractor.

193
00:13:34,000 --> 00:13:38,000
And then in the feature space, we have the region proposal method.

194
00:13:38,000 --> 00:13:38,000
Okay.

195
00:13:38,000 --> 00:13:46,000
So in 2015, the first author of R-CNN, Ross Girshick, came up with the idea of Fast R-CNN to address the drawbacks of R-CNN.

196
00:13:46,000 --> 00:13:47,000
In Fast R-CNN,

197
00:13:48,000 --> 00:13:53,000
the input image is passed to the convolutional neural network based feature extractor to generate a convolutional

198
00:13:53,000 --> 00:13:54,000
feature map.

199
00:13:54,000 --> 00:13:59,000
And from the convolutional feature map, we identify the region proposals.

200
00:13:59,000 --> 00:14:07,000
The reason why Fast R-CNN is faster than R-CNN is that we don't need to feed 2000 region proposals

201
00:14:07,000 --> 00:14:09,000
to the convolutional neural network each time.

202
00:14:09,000 --> 00:14:16,000
Instead, the convolution operation is done once per image and the feature map is generated from it.
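To see the difference in numbers, here is a minimal Python sketch that only counts CNN forward passes and shows how a region in image coordinates can be mapped onto the shared feature map. The function names are made up, and the feature map stride of 16 is an assumption for the example, not a value taken from the lecture.

```python
def rcnn_forward_passes(num_images, regions_per_image=2000):
    """R-CNN: every region proposal goes through the CNN separately."""
    return num_images * regions_per_image


def fast_rcnn_forward_passes(num_images):
    """Fast R-CNN: the whole image goes through the CNN once."""
    return num_images


def project_box(box, stride=16):
    """Map an (x1, y1, x2, y2) box from image pixels to feature map cells.

    The stride is a hypothetical downsampling factor of the CNN.
    """
    x1, y1, x2, y2 = box
    return (x1 // stride, y1 // stride, x2 // stride, y2 // stride)


print(rcnn_forward_passes(10))        # 20000
print(fast_rcnn_forward_passes(10))   # 10
print(project_box((32, 64, 320, 256)))  # (2, 4, 20, 16)
```

So for ten images, R-CNN runs the CNN twenty thousand times, while Fast R-CNN runs it ten times and only looks up each region on the shared feature map.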

203
00:14:16,000 --> 00:14:20,000
So the reason that Fast R-CNN is faster than R-CNN

204
00:14:20,000 --> 00:14:28,000
is that we don't need to take 2000 regions from each image and then pass each of them to the

205
00:14:28,000 --> 00:14:29,000
convolutional neural network.

206
00:14:29,000 --> 00:14:33,000
What we do in Fast R-CNN is that we have the input image.

207
00:14:33,000 --> 00:14:39,000
We pass it to the CNN to get the feature map, and after that, we get the 2000 regions

208
00:14:39,000 --> 00:14:40,000
from the image.

209
00:14:40,000 --> 00:14:41,000
Okay.

210
00:14:47,000 --> 00:14:49,000
This is the block diagram of Fast R-CNN.

211
00:14:49,000 --> 00:14:55,000
So this image is passed directly to the CNN feature extractor to get the convolutional feature map.

212
00:14:55,000 --> 00:15:02,000
Then we have the RoI pooling layer, where we extract the 2000 regions from the feature map, and then we have the

213
00:15:02,000 --> 00:15:06,000
classifier (Fast R-CNN uses a softmax layer here instead of the SVM), and then we apply the box regression, like the earlier steps.

214
00:15:06,000 --> 00:15:13,000
But instead of getting 2000 regions from the image first, we pass the image directly to the CNN to get the convolutional feature

215
00:15:13,000 --> 00:15:16,000
map, and then we extract the 2000 regions from the feature map.

216
00:15:19,000 --> 00:15:26,000
What are the primary challenges associated with Fast R-CNN? Fast R-CNN performs better than R-CNN.

217
00:15:26,000 --> 00:15:33,000
But we observe that the performance of Fast R-CNN during testing time slows down while using region

218
00:15:33,000 --> 00:15:40,000
proposals. What we mean by region proposals is that, during testing time, we extract 2000 regions

219
00:15:40,000 --> 00:15:43,000
from the image using selective search algorithm.

220
00:15:44,000 --> 00:15:51,000
So the performance of Fast R-CNN during testing time goes down as compared to not using region proposals.

221
00:15:51,000 --> 00:15:57,000
Therefore, region proposal is the major reason resulting in the degradation of the performance of

222
00:15:57,000 --> 00:15:59,000
Fast R-CNN.

223
00:15:59,000 --> 00:16:02,000
Okay, so here is the chart as well.

224
00:16:02,000 --> 00:16:04,000
Like a comparison of object detection algorithms.

225
00:16:04,000 --> 00:16:09,000
You can look at it further and see that, by including region proposals,

226
00:16:09,000 --> 00:16:14,000
our performance slows down, while by excluding region proposals,

227
00:16:14,000 --> 00:16:18,000
our performance in testing time is better. Like including region proposals,

228
00:16:18,000 --> 00:16:21,000
R-CNN takes 49 seconds, while

229
00:16:23,000 --> 00:16:24,000
excluding the

230
00:16:25,000 --> 00:16:29,000
region proposals, it takes 47 seconds for R-CNN.

231
00:16:29,000 --> 00:16:29,000
Okay.

232
00:16:29,000 --> 00:16:37,000
While with Fast R-CNN, using region proposals, it takes 2.3 seconds, while without region proposals

233
00:16:37,000 --> 00:16:39,000
it takes 0.32 seconds.
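Here is a quick Python check of what these four numbers from the chart mean. It only does arithmetic on the test times quoted above; the dictionary layout and variable names are made up for the example.

```python
# Test time per image in seconds, as quoted from the comparison chart.
timings = {
    "R-CNN": {"with_proposals": 49.0, "without_proposals": 47.0},
    "Fast R-CNN": {"with_proposals": 2.3, "without_proposals": 0.32},
}

shares = {}
for name, t in timings.items():
    proposal_time = t["with_proposals"] - t["without_proposals"]
    shares[name] = proposal_time / t["with_proposals"]
    print(f"{name}: proposals take {proposal_time:.2f} s "
          f"({shares[name]:.0%} of the test time)")

# How much faster Fast R-CNN is than R-CNN, region proposals included.
speedup = timings["R-CNN"]["with_proposals"] / timings["Fast R-CNN"]["with_proposals"]
print(f"Speedup with proposals: {speedup:.1f}x")
```

For R-CNN the proposals are only about 4% of the test time, but for Fast R-CNN they are about 86% of it, which is why region proposals become the bottleneck. The overall speedup with proposals included comes out to roughly 21 times.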

234
00:16:39,000 --> 00:16:46,000
So using region proposals is a major reason that degrades the performance of Fast R-CNN.

235
00:16:48,000 --> 00:16:50,000
R-CNN and Fast

236
00:16:50,000 --> 00:16:57,000
R-CNN basically use the selective search algorithm to find the region proposals, or you can say 2000

237
00:16:57,000 --> 00:17:03,000
regions in the image, which basically slows down the processing. Like we have discussed in the

238
00:17:03,000 --> 00:17:10,000
previous slide, at testing time, if you use region proposals, the

239
00:17:11,000 --> 00:17:15,000
processing becomes very slow, and it is quite a time-consuming process as well.

240
00:17:15,000 --> 00:17:24,000
So in Faster R-CNN, instead of extracting 2000 regions from the image using the

241
00:17:24,000 --> 00:17:31,000
selective search algorithm, we don't use the selective search algorithm at all. In Faster

242
00:17:31,000 --> 00:17:35,000
R-CNN, we allow the network to learn the region proposals.

243
00:17:35,000 --> 00:17:43,000
There is a separate layer for the region proposals in Faster R-CNN. So in Faster R-CNN, the authors

244
00:17:43,000 --> 00:17:49,000
made this improvement: they eliminated the use of the selective search algorithm. In Faster R-CNN, we

245
00:17:49,000 --> 00:17:51,000
don't use the selective search algorithm.

246
00:17:51,000 --> 00:17:59,000
Instead, in Faster R-CNN we have a separate layer to find 2000 regions in the image.

247
00:17:59,000 --> 00:18:01,000
Okay, so Faster

248
00:18:01,000 --> 00:18:07,000
R-CNN is an object detection algorithm which eliminates the use of the selective search algorithm and lets the

249
00:18:07,000 --> 00:18:13,000
network learn the region proposals. So we don't use the selective search algorithm in Faster R-CNN, because the

250
00:18:13,000 --> 00:18:21,000
selective search algorithm is the major reason for the slow processing in R-CNN and Fast R-CNN. Similar

251
00:18:21,000 --> 00:18:22,000
to Fast R-CNN,

252
00:18:22,000 --> 00:18:28,000
the image is provided as an input to a convolutional network, which provides a convolutional feature map.

253
00:18:28,000 --> 00:18:38,000
So the workflow of Fast R-CNN and Faster R-CNN is the same up to this point. Like in Fast

254
00:18:38,000 --> 00:18:43,000
R-CNN, we have an input image and then we pass it through the convolutional neural network to get the feature map.

255
00:18:43,000 --> 00:18:49,000
So we do the same in Faster R-CNN as well. In Faster R-CNN,

256
00:18:49,000 --> 00:18:54,000
the image is provided as an input to a convolutional neural network, which provides a convolutional feature

257
00:18:54,000 --> 00:18:57,000
map, and we do not use the selective search algorithm.

258
00:18:57,000 --> 00:19:03,000
So in Fast R-CNN we use the selective search algorithm, while in Faster R-CNN we don't use the selective search

259
00:19:03,000 --> 00:19:09,000
algorithm to identify the region proposals. Instead,

260
00:19:10,000 --> 00:19:10,000
in Faster

261
00:19:10,000 --> 00:19:14,000
R-CNN, we have the region proposal network.

262
00:19:14,000 --> 00:19:14,000
Okay.

263
00:19:16,000 --> 00:19:23,000
So in Faster R-CNN we use the region proposal network instead of the selective search algorithm to identify

264
00:19:23,000 --> 00:19:25,000
2000 regions in the image.

265
00:19:25,000 --> 00:19:26,000
Okay.

266
00:19:30,000 --> 00:19:37,000
This is the architecture of Faster R-CNN. As you can see here, in Faster R-CNN we don't use the selective search algorithm;

267
00:19:37,000 --> 00:19:42,000
instead, we have a separate layer which is used to find 2000 regions in the image.

268
00:19:42,000 --> 00:19:43,000
Okay.

269
00:19:43,000 --> 00:19:49,000
So this is the region proposal network layer, the layer which finds 2000 regions in the image.

270
00:19:49,000 --> 00:19:53,000
So this is the difference between Faster R-CNN and the prior algorithms.

271
00:19:54,000 --> 00:19:58,000
Following are the key takeaways from Faster R-CNN.

272
00:19:58,000 --> 00:20:05,000
In R-CNN and Fast R-CNN, we use the selective search algorithm to find the region proposals, which makes

273
00:20:05,000 --> 00:20:08,000
the processing very slow, as we have seen in the previous slides.

274
00:20:08,000 --> 00:20:16,000
In Faster R-CNN we do not use the selective search algorithm; instead, Faster R-CNN directly lets the network

275
00:20:16,000 --> 00:20:23,000
learn the region proposals, which means it directly allows the network to find 2000 regions in the

276
00:20:23,000 --> 00:20:27,000
image instead of using the selective search algorithm. So in Faster R-CNN,

277
00:20:27,000 --> 00:20:33,000
a separate network is used to predict the region proposals instead of the selective search algorithm.

278
00:20:33,000 --> 00:20:37,000
So what are the primary challenges with Faster R-CNN?

279
00:20:37,000 --> 00:20:38,000
Faster

280
00:20:38,000 --> 00:20:44,000
R-CNN gives the bounding boxes only, but no semantic segmentation. For semantic segmentation,

281
00:20:44,000 --> 00:20:47,000
we will use Mask R-CNN.

282
00:20:47,000 --> 00:20:49,000
Let's move towards Mask

283
00:20:49,000 --> 00:20:50,000
R-CNN.

284
00:20:53,000 --> 00:20:55,000
Let's look at Mask R-CNN.

285
00:20:55,000 --> 00:20:58,000
Mask R-CNN is built on Faster R-CNN.

286
00:20:58,000 --> 00:21:06,000
The main idea behind Mask R-CNN is to extend Faster R-CNN to pixel-level segmentation. In addition to

287
00:21:06,000 --> 00:21:08,000
bounding boxes and class labels, in Mask

288
00:21:08,000 --> 00:21:13,000
R-CNN we have an object mask in our output.

289
00:21:13,000 --> 00:21:19,000
Like for example, if we have a person in our output, we have a bounding box drawn around the

290
00:21:19,000 --> 00:21:20,000
person.

291
00:21:20,000 --> 00:21:22,000
And we have a mask as well on it.

292
00:21:22,000 --> 00:21:26,000
So in addition to bounding boxes and class labels,

293
00:21:26,000 --> 00:21:33,000
Mask R-CNN outputs an object mask as well. In Mask R-CNN, a fully convolutional network is added

294
00:21:33,000 --> 00:21:40,000
on top of the CNN features of Faster R-CNN, which generates the mask output.

295
00:21:40,000 --> 00:21:50,000
Mask R-CNN uses a trick called RoIAlign to locate relevant areas down to pixel level.
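The core of RoIAlign is that it does not round region coordinates to whole feature map cells; it reads the feature map at fractional positions using bilinear interpolation. Here is a minimal Python sketch of just that sampling step, not the full RoIAlign layer. The function name and the tiny 2 by 2 feature map are made up for illustration, and the coordinates are assumed to lie inside the map.

```python
def bilinear_sample(feature_map, y, x):
    """Sample a 2D feature map (list of rows) at a fractional (y, x) position.

    Assumes 0 <= y <= height - 1 and 0 <= x <= width - 1.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    y0 = min(int(y), height - 1)   # row above (or at) the sample point
    x0 = min(int(x), width - 1)    # column left of (or at) the sample point
    y1 = min(y0 + 1, height - 1)   # row below, clamped to the map
    x1 = min(x0 + 1, width - 1)    # column to the right, clamped to the map
    wy = y - y0                    # vertical interpolation weight
    wx = x - x0                    # horizontal interpolation weight
    top = feature_map[y0][x0] * (1 - wx) + feature_map[y0][x1] * wx
    bottom = feature_map[y1][x0] * (1 - wx) + feature_map[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


fmap = [[0.0, 10.0],
        [20.0, 30.0]]
print(bilinear_sample(fmap, 0.5, 0.5))  # 15.0
print(bilinear_sample(fmap, 0.0, 1.0))  # 10.0
```

Sampling exactly on a cell gives back that cell's value, and sampling in the middle of four cells gives their average, so no position information is lost to rounding.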

296
00:21:50,000 --> 00:21:54,000
A commonly used backbone of Mask R-CNN is ResNet-101.

297
00:22:01,000 --> 00:22:04,000
This is an overview of the output of Mask R-CNN.

298
00:22:04,000 --> 00:22:07,000
Like here you can see that we have detected a car.

299
00:22:07,000 --> 00:22:10,000
We have a mask on the object, the car, as well.

300
00:22:10,000 --> 00:22:12,000
Now here we have detected a person.

301
00:22:12,000 --> 00:22:15,000
Now you can see that we have a mask on the person as well.

302
00:22:15,000 --> 00:22:17,000
Like there is a mask.

303
00:22:17,000 --> 00:22:20,000
And we have a label as well for the person.

304
00:22:20,000 --> 00:22:22,000
And here we have the traffic light.

305
00:22:22,000 --> 00:22:25,000
We have detected a traffic light and we have a mask on it as well.

306
00:22:25,000 --> 00:22:31,000
And you can see here we have also detected a traffic light and we have a mask on it as well.

307
00:22:31,000 --> 00:22:35,000
And here we have the bounding box, and this is the confidence score.

308
00:22:35,000 --> 00:22:39,000
So this is the label, the confidence score and the mask on the object here.

309
00:22:41,000 --> 00:22:44,000
Well, let's see where we can use mask R-cnn.

310
00:22:44,000 --> 00:22:51,000
Mask R-CNN is used in Detectron2. So Detectron2 is a framework, and a main model used in Detectron2

311
00:22:51,000 --> 00:22:53,000
is Mask R-CNN.

312
00:22:53,000 --> 00:23:02,000
To use Mask R-CNN in PyTorch, you can use Detectron2. A common Mask R-CNN backbone is ResNet-101. Mask R-CNN

313
00:23:02,000 --> 00:23:09,000
is used for instance segmentation, and U-Net is a great architecture that is used for semantic segmentation.

314
00:23:09,000 --> 00:23:15,000
In the previous slide I might have mistakenly told you that Mask R-CNN is used for semantic segmentation.

315
00:23:15,000 --> 00:23:16,000
Sorry for that. Mask

316
00:23:16,000 --> 00:23:21,000
R-CNN is used for instance segmentation, and U-Net is used for semantic segmentation.

317
00:23:24,000 --> 00:23:25,000
Thank you for watching this video tutorial.

318
00:23:25,000 --> 00:23:27,000
See you in the next video tutorial.

319
00:23:27,000 --> 00:23:28,000
Till then, bye bye.

