1
00:00:00,690 --> 00:00:07,440
Now, let's talk about MobileNet, which has been a very useful network for the mobile phone arena

2
00:00:07,470 --> 00:00:09,720
as well as for other embedded devices.

3
00:00:10,350 --> 00:00:11,490
So let's get started.

4
00:00:11,820 --> 00:00:12,450
So MobileNet,

5
00:00:12,450 --> 00:00:17,460
that was an architecture developed by Google researchers, and here's a link to the paper if

6
00:00:17,460 --> 00:00:18,750
you want to check it out yourself.

7
00:00:19,200 --> 00:00:25,230
And it was designed primarily to be an efficient, lightweight CNN that can be used on phones and embedded

8
00:00:25,230 --> 00:00:25,830
devices.

9
00:00:26,340 --> 00:00:27,720
Now remember, good

10
00:00:27,720 --> 00:00:31,190
CNNs are very deep and also have a lot of parameters.

11
00:00:31,530 --> 00:00:33,540
That means they're quite big, a hundred megs

12
00:00:33,540 --> 00:00:40,710
plus. It's not always practical to have a 100 meg model on your cell phone, because especially back

13
00:00:40,710 --> 00:00:46,050
then in 2017, a lot of cell phones didn't come with one hundred and twenty eight gigs of storage.

14
00:00:46,470 --> 00:00:51,840
So app size was a concern for a lot of these cell phones as well.

15
00:00:51,840 --> 00:00:57,720
More importantly, though, in my opinion, was the inference speed, because phones back then in 2017

16
00:00:57,720 --> 00:00:58,710
were quite slow. And note,

17
00:00:58,710 --> 00:01:06,030
nowadays, phones like the new iPhones and Google's new Pixel 6, they have dedicated machine learning

18
00:01:06,030 --> 00:01:10,470
or neural net processing chips. It's not only a GPU.

19
00:01:10,480 --> 00:01:15,030
Sometimes it's a tensor unit like what Google has, but effectively it's a way to speed up

20
00:01:15,030 --> 00:01:15,660
the operation.

21
00:01:15,660 --> 00:01:21,570
So you can have a lot more complicated models on cell phones, because inference speed becomes less of an issue.

22
00:01:22,320 --> 00:01:27,900
Inference speed is how long it takes to forward propagate an input, like an input image, through the

23
00:01:27,900 --> 00:01:32,160
network and get the probability results for things like object detection.

24
00:01:32,670 --> 00:01:38,700
Running it on a cell phone, like, my cell phone is actually two years old, and running very basic object

25
00:01:38,700 --> 00:01:45,100
detectors gives me like three to five frames a second, which isn't very fast. So MobileNet,

26
00:01:45,100 --> 00:01:47,880
that was a way to actually make these things a lot quicker.

27
00:01:48,300 --> 00:01:54,750
So let's take a look at some of the use cases for MobileNet. They were things like simple object detection,

28
00:01:55,350 --> 00:01:59,250
face attributes, fine grained classification, landmark recognition.

29
00:01:59,670 --> 00:02:00,270
All of these.

30
00:02:00,570 --> 00:02:05,940
All of these use cases use MobileNets in some form or the other to achieve their tasks.

31
00:02:07,350 --> 00:02:11,590
So let's talk a bit about the world of mobile and embedded devices.

32
00:02:11,670 --> 00:02:17,370
So CNNs basically have to be efficient to run properly on these devices.

33
00:02:17,880 --> 00:02:22,930
This means that they can't use a lot of computational power like VGG and ResNets.

34
00:02:22,950 --> 00:02:27,400
Those are basically run on servers and they're quite quick.

35
00:02:27,450 --> 00:02:30,030
They're quite scalable, high performance.

36
00:02:30,420 --> 00:02:31,160
We needed something

37
00:02:31,190 --> 00:02:32,040
to do it for the user

38
00:02:32,040 --> 00:02:33,470
end, on the edge devices.

39
00:02:33,480 --> 00:02:38,310
That's what mobile phones and other embedded systems are. Embedded systems can refer to

40
00:02:38,310 --> 00:02:47,250
things like cameras running Android OS or other types of IoT type devices that have little GPUs,

41
00:02:47,250 --> 00:02:49,140
little processing units in there.

42
00:02:49,620 --> 00:02:51,060
But they're still there.

43
00:02:51,150 --> 00:02:53,190
They're fast, but they're not that fast.

44
00:02:54,150 --> 00:02:58,740
They can't run these big models like VGG and those big ResNets.

45
00:02:59,100 --> 00:03:00,790
They need to run something much simpler.

46
00:03:00,810 --> 00:03:03,990
So there are a few ways we can actually make models smaller.

47
00:03:04,500 --> 00:03:08,010
We can just use fewer parameters, use less complicated models.

48
00:03:08,520 --> 00:03:09,300
That's one way.

49
00:03:09,360 --> 00:03:15,060
However, it's not the most efficient way sometimes, because MobileNet allowed us to use a lot fewer

50
00:03:15,060 --> 00:03:19,740
parameters and achieve a lot better performance in multiple ways, which we'll discuss.

51
00:03:20,430 --> 00:03:25,830
There are other methods of shrinking networks, things like pruning, distillation or low-bit networks.

52
00:03:25,830 --> 00:03:27,900
They all work fairly well, to be fair.

53
00:03:28,320 --> 00:03:31,650
But again, it's not the ideal solution, always.

54
00:03:32,160 --> 00:03:38,160
Maybe nowadays, when cell phones are much faster, these methods with big networks can be better than

55
00:03:38,160 --> 00:03:39,310
MobileNet overall.

56
00:03:39,540 --> 00:03:43,830
However, MobileNet still serves a very strong purpose in 2021.

57
00:03:44,250 --> 00:03:51,570
I remember talking to a camera manufacturer who makes an Android OS for their cameras, and

58
00:03:51,570 --> 00:03:57,480
they were recommending only to use MobileNet models, not to use the other models, because they

59
00:03:57,480 --> 00:03:58,350
were quite slow.

60
00:03:58,890 --> 00:04:03,630
So you can see that MobileNet is still important in modern computer vision.

61
00:04:04,290 --> 00:04:09,700
So MobileNet was able to achieve good performance by using two techniques.

62
00:04:09,720 --> 00:04:16,320
It used something called depthwise separable convolutions, which I'll explain to you, as well as two

63
00:04:16,320 --> 00:04:19,170
other hyperparameters that allowed you to control,

64
00:04:19,740 --> 00:04:25,380
Basically, we'll get into this, but allowed you to control the image size, as well as the width of

65
00:04:25,380 --> 00:04:27,330
the convolutions or width of the network.

66
00:04:27,750 --> 00:04:30,780
So it's easy to scale, according to these parameters.

67
00:04:31,140 --> 00:04:33,900
So how does MobileNet achieve this, actually?

68
00:04:33,930 --> 00:04:38,550
So let's take a look at what depthwise separable convolutions actually

69
00:04:38,550 --> 00:04:38,850
are.

70
00:04:39,420 --> 00:04:43,290
So firstly, let's look at a regular convolution operation in a CNN.

71
00:04:43,290 --> 00:04:47,760
We have to drill down into this a little bit because this concerns the number of operations, which is not

72
00:04:47,760 --> 00:04:51,030
something we tend to think about when creating CNNs.

73
00:04:51,510 --> 00:04:55,980
But when efficiency matters, it's a concern.

74
00:04:56,220 --> 00:04:58,070
So let's take a look at this example here.

75
00:04:58,080 --> 00:04:59,580
We have a 12 by 12,

76
00:04:59,650 --> 00:05:06,930
very small color image here, 12 by 12 by three. We apply a kernel, a five by five kernel, and there are three

77
00:05:06,930 --> 00:05:12,330
of them because it's a color image, and that produces, you know, this feature map of size eight by

78
00:05:12,330 --> 00:05:14,940
eight by one, and that's using a stride of one.

79
00:05:15,660 --> 00:05:21,570
Now, the number of operations this takes is five by five by three by 64.

80
00:05:21,960 --> 00:05:23,400
It's quite a bit of operations, isn't it?

81
00:05:24,030 --> 00:05:28,620
So if we had 128 filters, look how this explodes.

82
00:05:28,770 --> 00:05:29,960
We have now 75,

83
00:05:30,060 --> 00:05:32,790
that's five by five by three, by 64,

84
00:05:33,180 --> 00:05:34,410
by 128.

85
00:05:34,440 --> 00:05:42,200
This gives us 614,400 operations, which is quite a lot just for one conv layer.

86
00:05:43,260 --> 00:05:45,750
So let's take a look at depthwise convolutions.

87
00:05:46,290 --> 00:05:47,340
What is this doing?

88
00:05:47,550 --> 00:05:52,740
So what we do here, we use three filters, each of five by five size.

89
00:05:53,100 --> 00:05:59,400
Instead of using what we did previously, which was a five by five by three, we use them separately

90
00:05:59,580 --> 00:06:00,420
for each channel.

91
00:06:01,080 --> 00:06:04,440
And that gives us an output, a feature map of eight by eight by three.

92
00:06:04,470 --> 00:06:06,540
So we have three feature maps.

93
00:06:06,960 --> 00:06:08,540
So now we have more stuff.

94
00:06:08,550 --> 00:06:10,470
So how does this reduce the operations?

95
00:06:10,980 --> 00:06:13,620
Well, pointwise convolutions are then used.

96
00:06:13,620 --> 00:06:14,700
That's what this is here.

97
00:06:14,910 --> 00:06:21,120
Pointwise because it's a one by one by three, to get the same output shape, eight by eight by one.

98
00:06:21,780 --> 00:06:22,950
So when we apply

99
00:06:22,950 --> 00:06:23,800
this on this,

100
00:06:23,820 --> 00:06:26,670
you get an eight by eight by one matrix.

101
00:06:27,630 --> 00:06:32,220
So this is applied sixty four times and it gives us that output, as I just mentioned.

102
00:06:32,580 --> 00:06:39,060
So this is the number of operations it needs: five by five by three by 64, which is four thousand

103
00:06:39,120 --> 00:06:41,190
eight hundred, plus three

104
00:06:41,460 --> 00:06:48,360
by 64 by 128, which is what we get when we apply this

105
00:06:48,600 --> 00:06:54,420
pointwise convolution here, and when we sum that up, we only get twenty nine thousand three hundred

106
00:06:54,750 --> 00:06:55,890
and seventy six.
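If you want to check those two counts yourself, here is a minimal sketch in Python. The function names are just made up for this illustration, and it only counts multiplications the same way we did above, assuming the same 12 by 12 by three input, five by five kernels, an eight by eight output and 128 filters.

```python
def standard_conv_ops(kernel, in_channels, out_size, num_filters):
    """Multiplications for a regular convolution layer (hypothetical helper)."""
    # Each filter does kernel*kernel*in_channels multiplications per output position.
    return kernel * kernel * in_channels * out_size * out_size * num_filters


def depthwise_separable_ops(kernel, in_channels, out_size, num_filters):
    """Multiplications for a depthwise step plus a pointwise 1x1 step."""
    # Depthwise: one kernel*kernel filter per input channel.
    depthwise = kernel * kernel * in_channels * out_size * out_size
    # Pointwise: each 1x1xin_channels filter is applied at every output position.
    pointwise = in_channels * out_size * out_size * num_filters
    return depthwise + pointwise


standard = standard_conv_ops(kernel=5, in_channels=3, out_size=8, num_filters=128)
separable = depthwise_separable_ops(kernel=5, in_channels=3, out_size=8, num_filters=128)

print(standard)                        # 614400
print(separable)                       # 29376
print(round(standard / separable, 1))  # 20.9
```

So the ratio works out to roughly 21 to one for this toy layer.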

107
00:06:56,640 --> 00:06:58,490
That's remarkably less, isn't it?

108
00:06:58,500 --> 00:07:01,380
And that's with all 128 filters, mind you.

109
00:07:02,040 --> 00:07:05,130
So it's about twenty x fewer operations now.

110
00:07:05,700 --> 00:07:10,320
It doesn't achieve the same performance; that's something we sacrifice by using the pointwise convolution

111
00:07:10,320 --> 00:07:16,650
here, and together we refer to this as depthwise separable convolutions.

112
00:07:17,040 --> 00:07:22,920
Separable convolutions point to the fact that we have separate filters now to produce three separate

113
00:07:22,920 --> 00:07:23,580
feature maps.

114
00:07:24,210 --> 00:07:28,830
So you can see, though, that this operation does save a lot of operations.

115
00:07:29,430 --> 00:07:33,270
However, that does sacrifice some accuracy, but not as much as you would think.

116
00:07:34,020 --> 00:07:40,350
So the other two hyperparameters that we discussed are the width multiplier and resolution multiplier.

117
00:07:40,350 --> 00:07:45,270
The width multiplier thins the model at each layer, allowing you to use fewer filters if you want

118
00:07:45,270 --> 00:07:51,180
to, as well as the resolution multiplier, which reduces the input image size, which

119
00:07:51,180 --> 00:07:55,110
also, because operations are proportional to the image size,

120
00:07:55,500 --> 00:07:59,550
reduces the number of operations needed, so you could get much faster performance.
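To get a feel for how those two multipliers change the cost, here is a rough sketch in Python. It assumes the cost formula from the MobileNet paper, where the width multiplier alpha scales the channel counts and the resolution multiplier rho scales the feature map size. The function name and the layer sizes (three by three kernel, 64 input channels, 128 output channels, 56 by 56 feature map) are made up for illustration and not taken from the slides.

```python
def separable_ops(kernel, in_channels, out_channels, feature_size,
                  alpha=1.0, rho=1.0):
    """Multiplications for one depthwise separable layer (hypothetical helper).

    alpha is the width multiplier, rho is the resolution multiplier.
    """
    m = int(alpha * in_channels)   # thinned input channels
    n = int(alpha * out_channels)  # thinned output channels (fewer filters)
    f = int(rho * feature_size)    # reduced feature map side length
    depthwise = kernel * kernel * m * f * f
    pointwise = m * n * f * f
    return depthwise + pointwise


base = separable_ops(3, 64, 128, 56)
thinner = separable_ops(3, 64, 128, 56, alpha=0.5)
smaller = separable_ops(3, 64, 128, 56, rho=0.5)

print(base, thinner, smaller)
print(base / smaller)  # halving the resolution cuts the cost by 4x
```

Halving the resolution gives exactly a quarter of the operations, and halving the width gives a bit more than a quarter, because the pointwise part shrinks with alpha squared while the depthwise part only shrinks with alpha.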

121
00:08:00,120 --> 00:08:02,850
So given all these sacrifices, you would expect MobileNet's

122
00:08:02,850 --> 00:08:04,950
performance to drop off considerably.

123
00:08:05,460 --> 00:08:07,590
However, it doesn't. Take a look at this.

124
00:08:07,600 --> 00:08:09,910
You can see in this first table in the top right,

125
00:08:10,380 --> 00:08:18,240
MobileNet with 224 pixel image size, that's the input image size used, achieved seventy point six percent

126
00:08:18,240 --> 00:08:23,790
ImageNet accuracy, which is quite remarkable because VGG was achieving seventy one point five

127
00:08:24,180 --> 00:08:30,780
with, look how many more parameters, 138 million parameters versus 4.2 million parameters. You can see MobileNet

128
00:08:30,780 --> 00:08:34,650
has extremely good performance with very little sacrifice.

129
00:08:35,190 --> 00:08:41,310
Here's another example here, a similar MobileNet comparison where we use the ImageNet accuracy again.

130
00:08:41,730 --> 00:08:47,370
And we compare this now to other smaller networks, where actually AlexNet isn't that small. SqueezeNet

131
00:08:47,370 --> 00:08:53,250
is another lightweight model similar to MobileNet, which we'll discuss shortly.

132
00:08:54,000 --> 00:08:57,060
And you can see it's even better than that.

133
00:08:57,600 --> 00:09:02,070
Also, you can compare it on the top-1 accuracy on the Stanford Dogs dataset.

134
00:09:02,550 --> 00:09:09,830
You can see how Inception with twenty three point two million parameters compares to MobileNet 224.

135
00:09:10,380 --> 00:09:15,830
And you can see eighty three point three percent,

136
00:09:16,140 --> 00:09:19,850
why was that so hard to say, versus 84 percent.

137
00:09:19,860 --> 00:09:21,600
And so that's quite good.

138
00:09:22,230 --> 00:09:28,230
So you can see MobileNet is very effective and very accurate, despite its small size.

139
00:09:28,620 --> 00:09:34,200
And this figure also shows how the parameters vary with the input image size as well,

140
00:09:34,200 --> 00:09:36,000
as you can see from the different sizes we used.

141
00:09:36,420 --> 00:09:40,740
And that's basically it for the performance analysis. You can dig into

142
00:09:40,740 --> 00:09:47,010
the paper if you like. These tables were taken from the MobileNet paper that was published in twenty

143
00:09:47,220 --> 00:09:47,790
seventeen.

144
00:09:48,450 --> 00:09:50,040
So we'll stop there for now.

145
00:09:50,040 --> 00:09:54,990
And next, we'll move on to the Inception network, which is a very cool network.

146
00:09:55,470 --> 00:09:59,550
It was a bit trendy when it came out back then, and it's still used nowadays for some

147
00:10:00,040 --> 00:10:05,320
tasks. So we'll continue with the Inception network next.

148
00:10:05,530 --> 00:10:05,920
Thank you.