1
00:00:00,840 --> 00:00:06,750
Hi, and welcome back. In this section, we'll take a look at single shot detectors, commonly known as

2
00:00:06,750 --> 00:00:07,560
SSDs.

3
00:00:07,980 --> 00:00:09,000
So let's get started.

4
00:00:09,120 --> 00:00:15,300
So firstly, we just discussed the R-CNN family, and you've seen how they can be very successful.

5
00:00:15,540 --> 00:00:22,680
However, they have a weakness: their inference time is still not optimal, even on

6
00:00:22,680 --> 00:00:26,160
the best hardware back in 2016 and 2017.

7
00:00:26,580 --> 00:00:31,200
They typically achieved roughly seven frames per second, and that was on very powerful, expensive

8
00:00:31,200 --> 00:00:31,590
hardware.

9
00:00:32,160 --> 00:00:35,640
So it was infeasible to use them in real-time systems.

10
00:00:36,750 --> 00:00:43,470
So SSDs aimed to improve the speed by eliminating one of the two stages in typical object detectors, and

11
00:00:43,470 --> 00:00:47,650
that's the region proposal network stage, hence the name single shot.

12
00:00:47,670 --> 00:00:55,650
So SSDs are one-stage object detectors, whereas R-CNNs are two-stage, and you can see a summary

13
00:00:55,650 --> 00:01:02,370
here of how well SSDs performed in terms of mAP on the VOC 2007 image dataset.

14
00:01:02,760 --> 00:01:09,960
It's getting seventy-nine percent, beating the R-CNNs here, Faster R-CNN and YOLO, and getting roughly

15
00:01:10,020 --> 00:01:11,810
more than twice their FPS.

16
00:01:12,330 --> 00:01:16,410
So you can see it's quite good, and you can see this is the input resolution that was used.

17
00:01:16,410 --> 00:01:22,110
So it even uses a much smaller input resolution, which allows it to be much faster, and it achieved a

18
00:01:22,110 --> 00:01:23,040
much better score.

19
00:01:23,130 --> 00:01:25,780
So that means that it's a very positive sign for SSDs.

20
00:01:26,430 --> 00:01:28,750
So how do SSDs improve speed?

21
00:01:28,800 --> 00:01:37,170
Well, SSDs use multi-scale features and default boxes, as well as dropping the resolution of the

22
00:01:37,170 --> 00:01:38,910
images, all of which improve speed.

23
00:01:39,810 --> 00:01:45,750
Note that SSDs do not resample pixels or features for bounding box hypotheses.

24
00:01:46,350 --> 00:01:50,120
And it's still as accurate as approaches that do so.

25
00:01:50,150 --> 00:01:51,210
It doesn't have any loss.

26
00:01:51,220 --> 00:01:53,340
It doesn't take any loss in accuracy by doing that.

27
00:01:54,060 --> 00:02:00,690
This allows SSDs to achieve real-time speed, which is perfect for doing video analytics, with almost

28
00:02:00,690 --> 00:02:03,300
no drop in accuracy, and sometimes even improved accuracy.

29
00:02:04,080 --> 00:02:06,430
So let's take a look at the SSD structure.

30
00:02:06,430 --> 00:02:07,830
So SSDs

31
00:02:07,830 --> 00:02:14,160
are composed of two main parts. The first is the feature map extractor, for which the researchers initially used

32
00:02:14,160 --> 00:02:15,210
VGG16,

33
00:02:15,210 --> 00:02:20,820
although future iterations used ResNet, MobileNet or DenseNet, depending on the application. They got

34
00:02:20,820 --> 00:02:23,370
better results, or faster results with MobileNet.

35
00:02:24,030 --> 00:02:28,770
And the second part is the convolutional filter for object detection.

36
00:02:28,950 --> 00:02:30,330
So let's take a look at this now.

37
00:02:31,080 --> 00:02:38,850
So SSDs do something called discretizing feature maps. It uses VGG16's Conv4_3

38
00:02:39,000 --> 00:02:43,200
layer, and it makes four to six predictions per cell, a number which we can set,

39
00:02:43,230 --> 00:02:45,090
so it's a user-set parameter.

40
00:02:45,780 --> 00:02:49,160
It makes four to six object predictions for each

41
00:02:49,170 --> 00:02:49,620
cell.

42
00:02:49,750 --> 00:02:57,570
So imagine the image is split up into cells like this, and each cell makes four to six different

43
00:02:57,570 --> 00:02:58,140
predictions.

44
00:02:58,140 --> 00:03:03,630
Here it predicts class scores and adds one extra class for when no object is found.

45
00:03:04,230 --> 00:03:09,300
So if we were to use smaller cells, as you can imagine here in the middle image,

46
00:03:09,960 --> 00:03:13,020
you'd be able to detect a lot more granular objects.

47
00:03:13,350 --> 00:03:18,090
However, if you were to use larger cells, you would probably just detect bigger objects, like the dog as

48
00:03:18,090 --> 00:03:18,600
well here.

49
00:03:19,170 --> 00:03:24,420
So it's a compromise you have to make, depending on what performance requirements your application has.

50
00:03:25,770 --> 00:03:26,940
So what we did?

51
00:03:27,150 --> 00:03:33,720
So what the researchers did was use VGG16 to extract feature maps, and for each bounding box, we obtain

52
00:03:33,730 --> 00:03:36,600
the probabilities of all classes within that region.

53
00:03:37,290 --> 00:03:40,980
This allows us to produce overlapping boxes where multiple objects occur.

54
00:03:41,550 --> 00:03:45,600
So if you had a box like this, say there was a box proposal here.

55
00:03:46,050 --> 00:03:49,070
But now, for instance, say there was also a box proposal in the next box

56
00:03:49,090 --> 00:03:52,320
here, where the cat or the dog was.

57
00:03:52,770 --> 00:03:54,960
You can have boxes that overlap each other there.

58
00:03:55,320 --> 00:03:58,030
So that's a good feature, and it's called MultiBox.

59
00:03:58,030 --> 00:04:04,980
So making multiple predictions, each containing a bounding box and

60
00:04:04,980 --> 00:04:07,320
a confidence score, is called MultiBox.

61
00:04:08,110 --> 00:04:11,820
So in training SSDs, we use a few different loss functions.

62
00:04:12,270 --> 00:04:15,510
Firstly, there are the class predictions. For class predictions,

63
00:04:15,510 --> 00:04:19,920
we use categorical cross-entropy. And for localization,

64
00:04:20,370 --> 00:04:23,050
that means the bounding box loss,

65
00:04:23,070 --> 00:04:25,230
which is effectively the localization loss,

66
00:04:25,800 --> 00:04:28,050
we use something called smooth L1 loss.

67
00:04:28,500 --> 00:04:34,830
So SSD only penalizes predictions from positive matches and ignores the negative matches, as we want

68
00:04:34,830 --> 00:04:39,900
to get the positive matches as close to the ground truth as possible.

69
00:04:40,980 --> 00:04:44,550
So let's take a look at SSD's performance back in 2017.

70
00:04:45,150 --> 00:04:51,300
You can see SSD, with the red box here, is doing probably the best in terms of overall mAP score

71
00:04:51,750 --> 00:04:58,500
on this dataset. It's beating YOLOv2 a little bit, which is here, and it's beating R-CNN, Fast

72
00:04:58,510 --> 00:04:59,640
R-CNN as well,

73
00:05:00,230 --> 00:05:05,130
and Faster R-CNN. Here are the models, so it's doing quite well.

74
00:05:05,150 --> 00:05:06,770
And you can see the time here.

75
00:05:07,310 --> 00:05:08,990
This is the inference time.

76
00:05:09,110 --> 00:05:10,460
It was pretty good as well.

77
00:05:10,940 --> 00:05:15,170
So this probably used a bigger input image here to achieve this score.

78
00:05:15,590 --> 00:05:21,830
So this is more of a demonstration of SSD's potential as a very accurate model, as opposed

79
00:05:21,830 --> 00:05:22,730
to being a fast model.

80
00:05:22,730 --> 00:05:28,100
But you can see, given that its image requirements were here, its performance, its accuracy

81
00:05:28,100 --> 00:05:28,510
was here.

82
00:05:28,520 --> 00:05:29,240
It's quite good.

83
00:05:30,470 --> 00:05:37,790
So the key takeaways from this lesson are that SSDs are faster than R-CNNs, but oftentimes

84
00:05:37,790 --> 00:05:43,730
less accurate when detecting smaller objects. Accuracy increases if we increase the number of default

85
00:05:43,730 --> 00:05:49,520
boxes, so we can increase the density of the grid, as well as having more proposals or better-designed

86
00:05:49,520 --> 00:05:52,190
boxes: instead of four, six or even eight.

87
00:05:53,300 --> 00:05:56,750
Multi-scale feature maps improve detection at varying scales.

88
00:05:57,170 --> 00:06:04,490
It was faster than YOLO back in 2017 and as accurate as Faster R-CNN, perhaps even more,

89
00:06:04,490 --> 00:06:04,940
actually.

90
00:06:05,720 --> 00:06:14,720
It predicts category scores and box offsets, uses small convolutional filters applied to feature maps, makes

91
00:06:14,720 --> 00:06:20,470
predictions using feature maps of different scales, and for training requires that the ground truth data

92
00:06:20,930 --> 00:06:25,460
is assigned to specific outputs in the fixed set of detector outputs.

93
00:06:26,420 --> 00:06:32,660
It is slower but more accurate than YOLO typically, although a lot has changed since then, and

94
00:06:32,660 --> 00:06:35,120
it's faster but less accurate than Faster

95
00:06:35,120 --> 00:06:43,760
R-CNN. Note that that one is actually a bit debatable, to be fair, but it's just one of the key takeaways

96
00:06:43,760 --> 00:06:44,690
we have from this lesson.

97
00:06:45,560 --> 00:06:52,490
So now we can stop there and we can move on to my favorite object detector, which is YOLO, and I

98
00:06:52,490 --> 00:06:57,620
have a few sections on YOLO because it's quite cool and very useful, and we'll also be doing some

99
00:06:57,620 --> 00:06:59,060
projects in YOLO as well.

100
00:06:59,450 --> 00:07:03,800
So stay tuned for our introduction to YOLO object detectors.

101
00:07:03,980 --> 00:07:04,400
Thank you.