1
00:00:00,540 --> 00:00:06,270
And welcome to our lecture on R-CNNs, Fast R-CNNs and Faster R-CNNs.

2
00:00:07,020 --> 00:00:07,820
So let's begin.

3
00:00:07,860 --> 00:00:15,300
So firstly, what are R-CNNs? Well, not to be confused with recurrent neural networks, or RNNs,

4
00:00:15,300 --> 00:00:18,900
R-CNNs were one of the first deep learning based object detectors.

5
00:00:19,410 --> 00:00:22,250
And it gave us very nice results like this output here.

6
00:00:23,580 --> 00:00:29,430
Now, they were introduced in 2014 by researchers at the University of California, Berkeley, and R-CNNs

7
00:00:29,430 --> 00:00:33,570
obtained dramatically higher performance in the PASCAL VOC Challenge.

8
00:00:33,570 --> 00:00:39,150
That's a dataset similar to COCO that was a standard used back then to evaluate object detectors.

9
00:00:39,780 --> 00:00:42,120
So, what exactly are R-CNNs?

10
00:00:42,120 --> 00:00:45,900
Well, it stands for regions with CNNs, or R-CNNs for short.

11
00:00:46,350 --> 00:00:50,100
And now let's dig into the technology behind R-CNNs.

12
00:00:50,460 --> 00:00:57,450
So firstly, R-CNNs attempted to solve the exhaustive search problem that was previously performed by

13
00:00:57,480 --> 00:00:58,290
sliding windows.

14
00:00:58,310 --> 00:01:00,420
Remember, in the sliding window problem,

15
00:01:00,720 --> 00:01:05,880
you can have an almost infinite number of classifications, because depending on the image size

16
00:01:05,880 --> 00:01:11,880
and what box size and stride you use, it was basically an exhaustive and slow search problem.
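
To put a rough number on that exhaustive search, here is a minimal sketch that counts how many crops a sliding window would have to classify. The function name `count_windows` and the example sizes are illustrative assumptions, not something shown in the lecture.

```python
def count_windows(img_w, img_h, box_sizes, stride):
    """Count the crops a sliding-window detector would classify.

    box_sizes is a list of (width, height) window sizes; stride is the
    step in pixels between neighbouring window positions.
    """
    total = 0
    for box_w, box_h in box_sizes:
        if box_w > img_w or box_h > img_h:
            continue  # this window does not fit in the image at all
        positions_x = (img_w - box_w) // stride + 1
        positions_y = (img_h - box_h) // stride + 1
        total += positions_x * positions_y
    return total


# One 50x50 window over a 100x100 image with a stride of 10 pixels
print(count_windows(100, 100, [(50, 50)], 10))
```

Every extra window size or smaller stride multiplies the number of classifications, which is why region proposals are attractive.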

17
00:01:12,810 --> 00:01:15,270
So how does the algorithm...

18
00:01:15,270 --> 00:01:17,100
How do R-CNNs solve this?

19
00:01:18,600 --> 00:01:24,060
Well, they use something called the selective search algorithm, and we can see a quick overview here

20
00:01:24,060 --> 00:01:24,990
of R-CNNs.

21
00:01:24,990 --> 00:01:26,640
Basically, we have an input image.

22
00:01:27,000 --> 00:01:30,900
We extract some region proposals; that's done by the selective search algorithm.

23
00:01:31,350 --> 00:01:36,960
Then we pass those proposals to a CNN to get the features, and then we classify based on those extracted

24
00:01:36,960 --> 00:01:39,720
features, basically deciding what class each one is.

25
00:01:39,720 --> 00:01:44,250
So we classify each of the regions proposed here by the selective search algorithm.

26
00:01:44,460 --> 00:01:45,570
So it's simple, isn't it?

27
00:01:46,260 --> 00:01:48,990
Well, let's talk a bit more about the selective search algorithm.

28
00:01:49,500 --> 00:01:51,900
Selective search attempts to segment the image,

29
00:01:52,230 --> 00:01:57,990
so it's an unsupervised method here, into different groups by combining similar areas, such as

30
00:01:57,990 --> 00:02:02,850
colors and textures, and proposes these regions as interesting bounding boxes.

31
00:02:03,360 --> 00:02:05,390
So you can take a look at this analysis here.

32
00:02:05,400 --> 00:02:06,670
You can see this is a region.

33
00:02:06,700 --> 00:02:12,360
So this is a segmented region here, different types of segmentation as we use different scales, and

34
00:02:12,360 --> 00:02:16,470
you can see the bounding boxes that are proposed out of each segmented region.

35
00:02:16,480 --> 00:02:20,040
So it's almost like it draws a contour over each of these here.

36
00:02:20,040 --> 00:02:23,160
And those were basically our bounding box proposals.

37
00:02:23,550 --> 00:02:25,500
And you can see another example of it here.

38
00:02:26,820 --> 00:02:28,740
So let's talk a bit more about selective search.

39
00:02:29,220 --> 00:02:35,880
So once selective search has identified the regions or boxes, it passes those extracted images to our

40
00:02:35,880 --> 00:02:36,450
CNN.

41
00:02:36,480 --> 00:02:40,820
That's usually a pre-trained CNN, trained on ImageNet. We use

42
00:02:40,830 --> 00:02:41,110
ImageNet

43
00:02:41,110 --> 00:02:42,000
almost always.

44
00:02:43,110 --> 00:02:48,420
And we don't use a CNN directly for classification, although we can.

45
00:02:48,900 --> 00:02:51,300
What we do is we extract the features.

46
00:02:51,310 --> 00:02:57,090
Remember how we used pre-trained models as feature extractors and we just get a vector out of it.

47
00:02:57,330 --> 00:03:05,190
Lastly, we flatten the output of the CNN here and just pass it into something like an SVM or logistic

48
00:03:05,190 --> 00:03:08,340
regression, basically a linear classifier, to classify it.

49
00:03:08,360 --> 00:03:14,310
And now, after our region has been classified, we then use basically a simple linear regression to

50
00:03:14,310 --> 00:03:18,540
generate a tighter bounding box around the initial proposal bounding box.

51
00:03:19,260 --> 00:03:22,330
So that's it for the initial R-CNNs.

52
00:03:22,350 --> 00:03:24,180
However, there were a lot of problems with it.

53
00:03:25,170 --> 00:03:30,120
So while they were quite good, they were notoriously slow as each bounding box had to be classified

54
00:03:30,140 --> 00:03:30,840
by the CNN.

55
00:03:31,230 --> 00:03:38,160
And basically, if you extract boxes and you have, like, a hundred boxes, that's a hundred inferences

56
00:03:38,160 --> 00:03:39,660
you're going to have to make on the CNN.

57
00:03:40,050 --> 00:03:42,930
So you can see doing this on a video is going to be quite slow.

58
00:03:43,380 --> 00:03:48,090
So the reason for the heavy computation is mostly due to using three separately

59
00:03:48,090 --> 00:03:50,490
trained models: we had one for feature extraction,

60
00:03:50,910 --> 00:03:55,470
we had an SVM to predict the final class, and then we had a linear regression to tighten the bounding box.

61
00:03:55,920 --> 00:04:02,820
So in 2015, Fast R-CNNs were introduced to solve a lot of these problems.

62
00:04:03,420 --> 00:04:09,150
So combining the training of the CNN, the classifier and the bounding box regressor

63
00:04:09,180 --> 00:04:13,470
into a single model was how the researchers were able to solve a number of these problems.

64
00:04:13,470 --> 00:04:20,130
So Fast R-CNNs firstly reduced the number of proposed bounding boxes by removing the overlap generated.

65
00:04:20,520 --> 00:04:21,660
But how did it do this?

66
00:04:22,200 --> 00:04:28,320
Well, we first run the CNN across the image just once, using a technique called

67
00:04:28,320 --> 00:04:30,260
region of interest pooling, or RoI pooling.

68
00:04:30,990 --> 00:04:37,800
And then RoI pooling allows us to share the forward pass of the CNN for the image across its subregions.

69
00:04:39,000 --> 00:04:45,660
And this works because the proposed regions are simply extracted from the CNN feature map and then

70
00:04:45,660 --> 00:04:46,050
pooled.

71
00:04:46,680 --> 00:04:51,120
Now, this means that we only need to run our CNN once on the image, and we can reuse those

72
00:04:51,120 --> 00:04:56,070
extracted features, taking each region from the corresponding area of the CNN output.

73
00:04:56,310 --> 00:04:58,470
So we save a lot of processing time there.
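
As a rough illustration of the pooling step just described, here is a minimal pure-Python sketch of RoI max pooling on a single-channel feature map. The function name `roi_max_pool`, the exclusive-end box convention, and the bin-splitting rule are assumptions made for this example, not the exact formulation used in Fast R-CNN.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one region of a 2D feature map to a fixed out_h x out_w grid.

    feature_map is a list of rows; roi is (x0, y0, x1, y1) in feature-map
    cells, with x1 and y1 exclusive (an assumed convention for this sketch).
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Row range of bin i: floor for the start, ceil for the end
        r_start = y0 + (i * roi_h) // out_h
        r_end = y0 + -(-((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            c_start = x0 + (j * roi_w) // out_w
            c_end = x0 + -(-((j + 1) * roi_w) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r_start, r_end)
                           for c in range(c_start, c_end)))
        pooled.append(row)
    return pooled


# The feature map is computed once; each proposal only indexes into it.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
print(roi_max_pool(fmap, (0, 0, 4, 4), 2, 2))
print(roi_max_pool(fmap, (1, 1, 4, 4), 2, 2))
```

Both proposals reuse the same `fmap`, so the expensive CNN pass happens once per image rather than once per box.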

74
00:04:59,340 --> 00:04:59,730
So.

75
00:04:59,960 --> 00:05:06,380
Also, training time was improved by combining the training of the CNN, the classifier and the bounding box regressor

76
00:05:06,380 --> 00:05:07,580
into a single model.

77
00:05:08,240 --> 00:05:14,300
So our SVM feature classifier that we used previously, well, that became a softmax layer on top of the

78
00:05:14,300 --> 00:05:18,200
CNN, and also the linear regression bounding box tightener,

79
00:05:18,200 --> 00:05:20,870
now, that just became a bounding box output layer,

80
00:05:21,140 --> 00:05:23,690
also in parallel to the softmax coming out of the CNN.

81
00:05:24,140 --> 00:05:30,050
And you can see how drastically Fast R-CNNs improved training time, as well as the inference time

82
00:05:30,050 --> 00:05:30,280
here.

83
00:05:30,290 --> 00:05:34,520
It was substantial, so this actually made it far more practical to use on video.

84
00:05:35,150 --> 00:05:36,860
However, the researchers weren't done.

85
00:05:37,220 --> 00:05:42,440
In 2016, there were more improvements, because while Fast R-CNNs improved speed

86
00:05:42,450 --> 00:05:47,030
significantly, they still relied on the relatively slow selective search algorithm.

87
00:05:47,540 --> 00:05:52,220
But fortunately, Microsoft researchers found a way to eliminate this bottleneck.

88
00:05:53,900 --> 00:05:59,840
And you can see how much faster Faster R-CNNs are versus Fast R-CNNs.

89
00:06:00,230 --> 00:06:07,310
It's basically a 10x increase in speed, and you can see the substantial jumps between R-CNNs,

90
00:06:07,310 --> 00:06:10,070
Fast R-CNNs and Faster R-CNNs.

91
00:06:10,580 --> 00:06:12,920
So let's see how they sped up the region proposal.

92
00:06:13,340 --> 00:06:17,240
So selective search relies on features extracted from an image.

93
00:06:18,200 --> 00:06:25,010
Now, what if we just reuse the CNN features to do region proposal instead? Basically, something like

94
00:06:25,010 --> 00:06:30,020
this here, where we have the conv layers outputting feature maps to a region proposal network here.

95
00:06:30,140 --> 00:06:35,960
We just get proposals out of it and pass those proposals, using RoI pooling, to a classifier.

96
00:06:37,520 --> 00:06:41,510
That was the insight that made Faster R-CNNs extremely efficient.

97
00:06:44,980 --> 00:06:51,820
So Faster R-CNNs added a fully convolutional network on top of the features of the CNN to create the

98
00:06:51,820 --> 00:06:53,170
region proposal network.

99
00:06:54,810 --> 00:06:56,920
So what does the paper actually state?

100
00:06:56,920 --> 00:07:04,270
The region proposal network slides a window over the features of the CNN, and at each window location,

101
00:07:04,420 --> 00:07:07,630
the network outputs a score and a bounding box per anchor.

102
00:07:08,080 --> 00:07:14,920
Hence, we have 4k box coordinates, where k is the number of anchors, because a box has four dimensions.

103
00:07:15,000 --> 00:07:17,350
OK, that's why we have 4k box coordinates.

104
00:07:18,010 --> 00:07:25,300
Now, after each pass of the sliding window, it outputs k potential bounding boxes and a score or

105
00:07:25,300 --> 00:07:28,390
confidence of how good this box is expected to be.
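
The 4k bookkeeping above can be sketched in a few lines. This is only an illustration: the helper names `make_anchors` and `rpn_output_sizes`, and the three scales and three aspect ratios used here, are assumptions chosen for the example, and the one-score-per-anchor count follows the description in this lecture.

```python
def make_anchors(cx, cy, scales, ratios):
    """Build k = len(scales) * len(ratios) anchor boxes centred on (cx, cy).

    Each anchor is (x_min, y_min, x_max, y_max). A ratio is height / width,
    and every anchor keeps an area of roughly scale * scale.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale / ratio ** 0.5
            height = scale * ratio ** 0.5
            anchors.append((cx - width / 2, cy - height / 2,
                            cx + width / 2, cy + height / 2))
    return anchors


def rpn_output_sizes(feat_h, feat_w, k):
    """Count what the proposal head emits for a feat_h x feat_w feature map."""
    return {
        "box_coords_per_location": 4 * k,  # four coordinates for each anchor
        "scores_per_location": k,          # one confidence for each anchor
        "total_boxes": feat_h * feat_w * k,
    }


anchors = make_anchors(0, 0, scales=[128, 256, 512], ratios=[0.5, 1, 2])
print(len(anchors))
print(rpn_output_sizes(2, 3, len(anchors)))
```

With nine anchors per location, each window position yields 36 box coordinates and nine scores, and the number of candidate boxes grows with the size of the feature map.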

106
00:07:29,350 --> 00:07:32,830
That's basically it for Faster R-CNNs.

107
00:07:33,010 --> 00:07:34,330
So I hope you enjoyed this lesson.

108
00:07:34,900 --> 00:07:40,540
And what we'll do next, we'll take a look at an introduction to single shot detectors, also known

109
00:07:40,540 --> 00:07:48,340
as SSDs, which are a very cool one-shot (which I'll explain in that section) type of object detector model.

110
00:07:48,460 --> 00:07:49,520
So stay tuned for that.

111
00:07:49,630 --> 00:07:50,080
Thank you.