1
00:00:00,110 --> 00:00:06,470
Hi guys, and welcome to this new section in which we are going to be looking at quantization for neural

2
00:00:06,470 --> 00:00:07,190
networks.

3
00:00:07,190 --> 00:00:13,130
In the previous section, we looked at the Open Neural Network Exchange (ONNX) standard, which is an open

4
00:00:13,130 --> 00:00:15,770
standard for machine learning interoperability.

5
00:00:15,770 --> 00:00:23,780
We saw that not only does the ONNX format permit us to convert models from one framework to another,

6
00:00:23,780 --> 00:00:29,240
but it also allows us to optimize our models for different hardware.

7
00:00:29,240 --> 00:00:34,790
And so in line with these optimizations, we are going to look at quantization, which is a technique

8
00:00:34,790 --> 00:00:44,720
for performing computations and storing tensors at lower bit widths than the usual floating-point values which

9
00:00:44,720 --> 00:00:47,180
we have been working with so far in this course.

10
00:00:47,360 --> 00:00:54,620
Model quantization is a popular deep learning optimization method in which model data, that is, both the

11
00:00:54,620 --> 00:01:02,880
network parameters and the activations, are converted from a floating-point representation to a lower

12
00:01:02,880 --> 00:01:07,080
precision representation, typically using eight bit integers.

13
00:01:07,170 --> 00:01:11,820
Now defining quantization in this manner may not seem very clear.

14
00:01:11,820 --> 00:01:21,540
So let's try to understand first of all why quantization or quantizing a neural network model is important.

15
00:01:21,660 --> 00:01:28,290
So here let's consider this very simplified model where we take in some input, multiply it by a weight

16
00:01:28,290 --> 00:01:29,810
and add the bias.

17
00:01:29,820 --> 00:01:37,080
Now we have several layers, so we just simply stack this up and we could say we have our model which

18
00:01:37,080 --> 00:01:49,590
has already been trained, and this model has 100 million parameters and occupies, let's say, one gigabyte

19
00:01:49,620 --> 00:01:50,730
of space.

20
00:01:51,120 --> 00:01:55,530
So we have this model, 100 million parameters, one gigabyte of space.

21
00:01:55,740 --> 00:02:03,390
And if you're wondering what this space is for, you should note that it's for storing these weights and

22
00:02:03,390 --> 00:02:04,290
biases.

23
00:02:04,890 --> 00:02:13,020
And so obviously the more parameters we are going to have, the heavier our final model file will be.

24
00:02:13,050 --> 00:02:19,560
Now, supposing you want to use this in some setup like a mobile phone, so you want to use this in

25
00:02:19,560 --> 00:02:20,730
your mobile phone.

26
00:02:21,450 --> 00:02:31,380
It means that you will need to allocate at least one gigabyte of memory space if you want to run this

27
00:02:31,380 --> 00:02:32,100
model.

28
00:02:32,220 --> 00:02:37,770
And this is where the techniques like quantization come in.

29
00:02:38,160 --> 00:02:47,010
So now, thanks to quantization, instead of storing these weights in a 32-bit space, we are going to

30
00:02:47,010 --> 00:02:52,320
store them in eight bit memory space.

31
00:02:52,830 --> 00:03:01,260
So we are going from float32 to int8.

32
00:03:01,890 --> 00:03:07,470
Now, if you're not familiar with the floating point arithmetic, you could check out this resource

33
00:03:07,470 --> 00:03:17,280
by Fabien Sanglard, where he explains in a very intuitive manner the whole concept around floating-point

34
00:03:17,610 --> 00:03:19,860
binary representation.

35
00:03:20,430 --> 00:03:25,970
Essentially if a single weight value, that's your model weight, let's take this off.

36
00:03:25,980 --> 00:03:34,440
If you have a model weight value, which is, for example, 3.14, the way this is represented in memory

37
00:03:34,440 --> 00:03:43,650
is by, first of all, allocating these 32 spaces we have here, where each space takes a zero or a one.

38
00:03:44,760 --> 00:03:52,440
And this first position here, the zero or the one, is to specify whether we are dealing with a positive

39
00:03:52,440 --> 00:03:53,820
or a negative number.

40
00:03:53,820 --> 00:04:03,360
And then for the next eight positions, we are going to say whether this value 3.14 lies in the range,

41
00:04:03,810 --> 00:04:08,400
2 to the -1 to 2 to the 0, or 2 to the 0 to 2 to the 1, or 2

42
00:04:08,400 --> 00:04:11,430
to the 1 to 2 to the 2, and so on and so forth.

43
00:04:11,460 --> 00:04:16,500
Now, in our case, 3.14 lies in this range.

44
00:04:18,090 --> 00:04:24,330
And given that this exponent we have here is one, we are going to apply this formula where we have

45
00:04:24,330 --> 00:04:33,090
the exponent minus 127 should give us this power we have here.

46
00:04:33,090 --> 00:04:34,710
So we'll have that one.

47
00:04:34,710 --> 00:04:45,330
So E is now equal to 128, and if you convert 128 to binary notation, you would obtain this right

48
00:04:45,360 --> 00:04:47,160
here. Now,

49
00:04:47,160 --> 00:04:57,750
after encoding this integer position, the next step will be to encode this decimal value right here,

50
00:04:58,050 --> 00:04:59,310
and that will be the role

51
00:04:59,340 --> 00:05:01,560
of these 23 other positions.

52
00:05:01,560 --> 00:05:03,570
Remember, this is only one box.

53
00:05:03,570 --> 00:05:05,100
This is eight boxes.

54
00:05:05,100 --> 00:05:08,340
And here we have 23 boxes. For these eight,

55
00:05:08,370 --> 00:05:16,140
we've seen that they help us locate our number in this range, which we've seen already.

56
00:05:16,260 --> 00:05:26,460
But for these other 23 boxes, we are going to note that we have two to the 23 possibilities,

57
00:05:26,460 --> 00:05:37,170
so let's write this here, two to the 23 possibilities, which is actually 8,388,608 possibilities.

58
00:05:37,170 --> 00:05:40,800
So we have about 8 million possibilities here.

59
00:05:41,580 --> 00:05:47,070
This simply means that for every given range, which we have seen here, for this range, this range,

60
00:05:47,070 --> 00:05:56,130
this range, this up to the end, we are going to divide it into 8 million different parts.

61
00:05:56,130 --> 00:06:00,040
And so if you see here, you see you have two to the power of one.

62
00:06:00,040 --> 00:06:01,780
This is two to the power of two.

63
00:06:01,900 --> 00:06:11,410
If you break this gap here into 8 million different parts,

64
00:06:12,520 --> 00:06:23,230
or better still, if we consider that the distance to move from 2 to 4 is 8.388 million, then finding

65
00:06:23,230 --> 00:06:32,230
this 3.14, encoding this 0.14 right here, or let's just say encoding 3.14, will entail calculating the

66
00:06:32,230 --> 00:06:40,600
distance from 2 right up to 3.14, knowing that this distance from 2 to 4 is 8.388 million.

67
00:06:40,600 --> 00:06:47,320
We can now compute this distance by simply doing 3.14 minus two.

68
00:06:47,350 --> 00:06:54,250
That is 1.14 divided by all this distance.

69
00:06:54,250 --> 00:06:55,060
That's two.

70
00:06:55,060 --> 00:07:04,600
So we find this and then multiply by the 8 million, and we have 4,781,506.

71
00:07:04,600 --> 00:07:10,870
So now we shall convert this to binary and we obtain this here.

72
00:07:10,870 --> 00:07:15,970
So once we obtain this, we then fill up all these 23 spaces right here.

73
00:07:15,970 --> 00:07:19,360
And that's essentially how a number like this is stored in memory.
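To make the three groups of boxes concrete, here is a rough Python sketch that pulls them out of a 32-bit float. The helper name `float32_fields` is made up for illustration, and the rebuild formula assumes a normal (not subnormal, infinite, or NaN) number. Note that the hardware rounds the offset to the nearest integer rather than truncating it, so for 3.14 the stored offset is 4,781,507, one more than the truncated 4,781,506 worked out by hand.

```python
import struct


def float32_fields(value):
    """Split a number, stored as an IEEE 754 float32, into its three bit fields."""
    # Pack as a big-endian float32, then reinterpret the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31               # 1 box: 0 for positive, 1 for negative
    exponent = (bits >> 23) & 0xFF  # 8 boxes: stored exponent E, the power is E - 127
    mantissa = bits & 0x7FFFFF      # 23 boxes: offset inside the range
    return sign, exponent, mantissa


sign, exponent, mantissa = float32_fields(3.14)
print(sign, format(exponent, "08b"), format(mantissa, "023b"))

# Rebuild the value from the fields: (-1)^sign * 2^(E - 127) * (1 + mantissa / 2^23).
rebuilt = (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2 ** 23)
print(rebuilt)
```

The rebuilt value is very close to 3.14 but not exactly equal to it, since only 2 to the 23 positions are available inside the range from 2 to 4.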

74
00:07:19,660 --> 00:07:27,640
And so getting back here, if we have to store this, let's say 3.14, which was previously stored

75
00:07:27,640 --> 00:07:34,360
in this 32 box memory, and now we want to store it in an eight box memory.

76
00:07:34,390 --> 00:07:40,180
Now we move from 1GB to 256MB.

77
00:07:40,300 --> 00:07:52,390
Here we have 256 and our mobile phone will now need only 256MB of memory to run our model.

78
00:07:52,390 --> 00:07:59,740
Now, it doesn't suffice just to say we are going to go from float32 to int8.

79
00:07:59,920 --> 00:08:03,220
We need to describe exactly how this is done.

80
00:08:03,370 --> 00:08:12,790
And the way it's done is actually by a simple linear mapping where we shall start by defining two ranges

81
00:08:12,790 --> 00:08:13,660
of values.

82
00:08:13,660 --> 00:08:18,610
The first range is for the floating point values.

83
00:08:18,790 --> 00:08:24,040
And as you could see here, the define negative, a max to a max.

84
00:08:24,070 --> 00:08:31,450
Well, one good thing about deep learning models is most times your weight or your weight values lie

85
00:08:31,450 --> 00:08:34,060
between -1 and 1.

86
00:08:35,090 --> 00:08:38,210
And so getting back here, we could have here negative one.

87
00:08:38,210 --> 00:08:41,660
So a_max will be one, and so negative

88
00:08:41,690 --> 00:08:45,380
a_max is negative one and then a_max is one.

89
00:08:45,470 --> 00:08:47,690
So we go from -1 to 1.

90
00:08:47,690 --> 00:08:48,770
And then.

91
00:08:49,870 --> 00:08:59,320
If we want our output to be unsigned ints instead of going from -128 to 127, we shall go from zero

92
00:08:59,320 --> 00:09:00,190
to.

93
00:09:00,820 --> 00:09:04,150
255.

94
00:09:05,570 --> 00:09:10,310
Now notice that the number of values we have between 0 and 255 is the same as the number of values

95
00:09:10,310 --> 00:09:13,970
we have between -128 and 127.

96
00:09:14,510 --> 00:09:18,320
But with the unsigned ints, all our values are positive.

97
00:09:18,440 --> 00:09:23,090
So instead of int8, we have unsigned int8.

98
00:09:24,360 --> 00:09:31,860
And so at this point, our aim is to take values ranging between -1 and 1 and map them in the range

99
00:09:31,860 --> 00:09:34,200
0 to 255.

100
00:09:34,620 --> 00:09:43,680
And now we will use a simple linear function which has the form y equals ax plus b.

101
00:09:43,890 --> 00:09:47,460
Now our y will be the output value.

102
00:09:47,460 --> 00:09:50,730
So we'll have, let's call this x,

103
00:09:50,760 --> 00:09:51,780
we'll call this x

104
00:09:51,780 --> 00:09:56,430
quantized, is equal to the x float value.

105
00:09:56,430 --> 00:09:58,250
So this is the original value of the weight.

106
00:09:58,260 --> 00:10:00,120
Let's put this in blue.

107
00:10:00,150 --> 00:10:10,950
We have the original value of the weight or the float value divided by a certain scale, plus a zero

108
00:10:10,950 --> 00:10:12,510
point value.

109
00:10:12,540 --> 00:10:14,070
We'll call this Z.

110
00:10:15,000 --> 00:10:22,710
So simply, one over S is equal to a, b is equal to Z, y is x quantized, and x is x

111
00:10:22,710 --> 00:10:23,970
float.

112
00:10:23,970 --> 00:10:31,650
So now our aim is to look for the value of S and Z such that when we have any value in this range,

113
00:10:31,650 --> 00:10:36,360
we get its corresponding value in this other range.

114
00:10:37,350 --> 00:10:42,000
The way we get S, let's write this, the way we get S

115
00:10:42,880 --> 00:10:45,520
is by doing x

116
00:10:45,520 --> 00:10:47,410
float max,

117
00:10:47,650 --> 00:10:48,750
x float max,

118
00:10:48,760 --> 00:10:49,720
you see that x float

119
00:10:49,720 --> 00:11:01,090
max is simply a_max, minus x float min, which is in this case negative a_max, divided by x quantized

120
00:11:01,090 --> 00:11:02,080
max

121
00:11:03,050 --> 00:11:07,070
minus x quantized min.

122
00:11:07,940 --> 00:11:13,460
Now, if we replace all this by the corresponding values we have here, we will have one minus negative

123
00:11:13,460 --> 00:11:17,330
one divided by 255 minus 0.

124
00:11:17,840 --> 00:11:20,150
So 255 minus 0.

125
00:11:20,180 --> 00:11:24,920
This means we have two divided by 255.

126
00:11:24,950 --> 00:11:27,530
That's our s, which is our scale.

127
00:11:28,750 --> 00:11:38,350
And then Z, our zero point, is x quantized max minus x float max divided by S, the S we just had here.

128
00:11:38,350 --> 00:11:46,480
So if you replace again, we have x quantized max, which is 255, minus x float max, which is one, divided by S.

129
00:11:46,480 --> 00:11:50,830
So we have 255 minus one divided by 2 over 255.

130
00:11:50,860 --> 00:12:00,480
That will give us 255 divided by two, which is 127.5.

131
00:12:00,490 --> 00:12:01,810
So that's what you get.

132
00:12:01,840 --> 00:12:11,620
Now, the way you can look at this zero point is it's the quantized value we get when the floating value

133
00:12:11,650 --> 00:12:12,850
is zero.

134
00:12:12,850 --> 00:12:17,470
So when we convert, when we have zero here, we have zero over S, which is zero.

135
00:12:18,100 --> 00:12:22,210
So the corresponding quantized value is equal to Z.

136
00:12:22,240 --> 00:12:24,400
So that's why we call that the zero point.

137
00:12:25,120 --> 00:12:34,930
And then S, which is the scale simply scales our inputs here as we go from this range of values to

138
00:12:34,930 --> 00:12:36,370
this other range.

139
00:12:37,000 --> 00:12:43,300
Now, you could take a simple example where you go from x float to x quantized.

140
00:12:43,330 --> 00:12:50,740
If you have negative one here, let's say x is equal to negative one, that's x float,

141
00:12:50,740 --> 00:12:57,040
let's suppose we have negative one, then divided by S, and we said S was 2 over 255,

142
00:12:58,460 --> 00:12:59,850
plus Z.

143
00:12:59,890 --> 00:13:01,850
Z is 255 divided by two.

144
00:13:03,290 --> 00:13:10,550
So this gives us zero, which makes sense as we go from the range -1 to 1 to the range 0 to 255.

145
00:13:10,580 --> 00:13:15,170
This means that these boundary values should

146
00:13:16,050 --> 00:13:17,550
line up with each other.

147
00:13:18,420 --> 00:13:23,410
Now, let's take another example in the middle, let's say 0.3.

148
00:13:23,430 --> 00:13:39,210
So we could have x here, which is 0.3, divided by 2 over 255, plus 255 divided by two.

149
00:13:40,040 --> 00:13:41,690
So in this case now.

150
00:13:42,760 --> 00:13:45,700
we have a value of 165

151
00:13:46,640 --> 00:13:53,070
.75, which, if we round up, gives 166.

152
00:13:53,070 --> 00:13:56,610
So essentially we're going from 0.3 to 166.

153
00:13:56,610 --> 00:14:05,460
And apart from rounding the output as we've just done, we would also see that we could clip any outliers.

154
00:14:05,460 --> 00:14:16,500
So in case we've decided to have a_max, or our x float max, be one, and it happens that we have a weight

155
00:14:16,500 --> 00:14:27,120
value which is more than this one, then the output unsigned int will be 255.

156
00:14:27,120 --> 00:14:33,420
So any value greater than this is going to take this value, any value less than this is going to take

157
00:14:33,420 --> 00:14:35,430
this value of zero.

158
00:14:36,250 --> 00:14:47,080
And so we've seen how this simple technique permits us to reduce the memory used in storing the weights.

159
00:14:48,010 --> 00:14:54,430
Now it's logical that we're going to have a drop in the accuracy, because if you've trained a model,

160
00:14:54,430 --> 00:15:01,030
for example, to have certain weights which are actually floats, and then you convert

161
00:15:01,030 --> 00:15:06,790
these floats into integers, where you have some extra transformations like the rounding and the clipping,

162
00:15:06,820 --> 00:15:12,940
then you would expect to have a drop in the performance of the model.

163
00:15:12,970 --> 00:15:24,850
Nonetheless, these huge gains in terms of memory are enough for us to sacrifice a bit of the accuracy

164
00:15:25,510 --> 00:15:28,240
or more generally, the model's performance.

165
00:15:28,240 --> 00:15:29,890
And that's not all.

166
00:15:29,920 --> 00:15:37,870
Apart from this model weight size, which is dropped or reduced, it should be noted that arithmetic

167
00:15:37,870 --> 00:15:46,120
operations like multiplication and addition of our quantized integers can be carried out much faster.

168
00:15:46,120 --> 00:15:54,120
And so we not only have a model which occupies less space, but a model which is also much faster.

169
00:15:54,140 --> 00:15:58,730
We have generally three ways of carrying out quantization.

170
00:15:58,940 --> 00:16:01,520
There is dynamic quantization,

171
00:16:01,790 --> 00:16:08,750
static quantization, and quantization-aware training.

172
00:16:09,870 --> 00:16:17,370
Now, given that during the quantization process, the weights and the activations are stored at lower

173
00:16:17,370 --> 00:16:18,360
bit widths.

174
00:16:18,990 --> 00:16:27,000
In the case of dynamic quantization, the quantization parameters, that is, the scale and the zero point

175
00:16:27,000 --> 00:16:33,660
which we've seen already for the activations are computed dynamically or on the fly.

176
00:16:33,930 --> 00:16:41,160
And because these have to be computed dynamically, there is an increase in the cost of inference.

177
00:16:41,160 --> 00:16:48,000
So it will take a little bit more time to produce an output as compared to other methods like static

178
00:16:48,000 --> 00:16:49,080
quantization.

179
00:16:49,110 --> 00:16:56,250
Nonetheless, here we usually achieve higher accuracy compared to the static quantization methods.

180
00:16:56,250 --> 00:16:58,640
For the static quantization method,

181
00:16:58,650 --> 00:17:07,050
we first of all compute the quantization parameters using a much smaller data set, which we'll call

182
00:17:07,050 --> 00:17:08,570
the calibration data.

183
00:17:08,580 --> 00:17:15,550
So essentially we have our model and in here we have our different quantization parameters.

184
00:17:15,550 --> 00:17:23,410
But instead of dynamically computing these different quantization parameters, we are going to pass

185
00:17:23,410 --> 00:17:31,960
in some inputs and outputs, carry out several runs such that we are able to obtain the most appropriate

186
00:17:31,960 --> 00:17:37,390
quantization parameters based on this data we've passed in.

187
00:17:37,390 --> 00:17:47,530
And then now when we want to run or carry out inference, we do not need again to compute these parameters,

188
00:17:47,560 --> 00:17:54,220
unlike with the dynamic quantization, where at inference time we always have to compute this quantization

189
00:17:54,220 --> 00:17:54,910
parameters.

190
00:17:54,910 --> 00:18:01,870
Here, we compute these quantization parameters beforehand using the calibration data.

191
00:18:01,870 --> 00:18:08,440
And then now when we run an inference, we just pass in inputs and we already have the quantization

192
00:18:08,440 --> 00:18:09,460
parameters set.
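To contrast the two approaches, here is a rough pure-Python sketch. All helper names and the sample activation values are made up for illustration, and a simple min/max rule is assumed for choosing the float range, which requires the values to have distinct minimum and maximum. Real toolkits offer more refined calibration rules.

```python
def min_max_params(values, x_q_min=0, x_q_max=255):
    """Derive scale and zero point from the observed min and max (assumed rule)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (x_q_max - x_q_min)
    zero_point = x_q_max - hi / scale
    return scale, zero_point


def quantize_all(values, scale, zero_point, x_q_min=0, x_q_max=255):
    """Quantize a list of floats with rounding and clipping."""
    return [max(x_q_min, min(x_q_max, round(v / scale + zero_point))) for v in values]


# Static: the parameters are fixed once, ahead of time, from calibration data.
calibration_activations = [-0.8, -0.1, 0.4, 1.0]  # hypothetical calibration values
static_scale, static_zp = min_max_params(calibration_activations)


def static_quantize(activations):
    # No parameter computation at inference time.
    return quantize_all(activations, static_scale, static_zp)


def dynamic_quantize(activations):
    # The parameters are recomputed from the data on every call, which costs time.
    scale, zero_point = min_max_params(activations)
    return quantize_all(activations, scale, zero_point)


inference_activations = [-2.0, 0.0, 2.0]  # wider than the calibration range
print("static :", static_quantize(inference_activations))
print("dynamic:", dynamic_quantize(inference_activations))
```

Here the calibration data did not cover the range seen at inference, so the static version clips -2 and 2 to the ends of the integer range, while the dynamic version adapts its scale to the new values.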

193
00:18:10,110 --> 00:18:18,330
But the problem now with this method is that if this calibration is done poorly, then we would have,

194
00:18:18,630 --> 00:18:23,550
low-quality values for the scale and the zero point.

195
00:18:23,550 --> 00:18:31,380
And so because of this, we will then have a lower accuracy as compared to the dynamic quantization

196
00:18:31,380 --> 00:18:31,790
method.

197
00:18:31,800 --> 00:18:37,440
That said, these two methods we've just seen are post-training quantization methods.

198
00:18:37,440 --> 00:18:45,450
So dynamic and static are post-training quantization, meaning that we train the model first in floating point

199
00:18:45,480 --> 00:18:53,700
32 and then after training the model, we convert this model to one with weights and activations which

200
00:18:53,700 --> 00:18:55,740
are unsigned ints.

201
00:18:55,770 --> 00:19:04,230
Now, sometimes post-training quantization, that is PTQ, is not able to achieve acceptable task accuracy.

202
00:19:04,230 --> 00:19:09,060
This is when you might consider using quantization-aware training, that is QAT.

203
00:19:09,090 --> 00:19:12,280
The idea behind the quantization aware training is simple.

204
00:19:12,280 --> 00:19:19,810
You can improve the accuracy of the quantized models if you include the quantization error in the training

205
00:19:19,810 --> 00:19:20,470
phase.

206
00:19:20,710 --> 00:19:29,260
So unlike post-training quantization, where we train the model first before quantizing, here the network

207
00:19:29,290 --> 00:19:34,330
adapts to the quantized weights and activations during the training.

208
00:19:34,600 --> 00:19:44,740
So as we were saying, we include a quantization error in the training loss by inserting fake quantization

209
00:19:44,770 --> 00:19:50,230
operations into the training graph to simulate the quantization of data and parameters.

210
00:19:50,230 --> 00:19:52,870
These operations are called fake,

211
00:19:52,900 --> 00:20:00,730
that is, fake quantization, because they quantize the data, but then they immediately dequantize it,

212
00:20:00,730 --> 00:20:05,680
so the operation's compute remains in floating-point precision.
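As a rough illustration of the quantize-then-dequantize idea, here is a small Python sketch. The function name `fake_quantize` is made up, the scale and zero point are the ones worked out earlier for the range -1 to 1, and only the forward computation is shown. How gradients flow through the rounding step during training is not covered here.

```python
def fake_quantize(x_f, scale, zero_point, x_q_min=0, x_q_max=255):
    """Quantize a float, then immediately dequantize it back to a float."""
    # Quantize: x_q = round(x_f / S + Z), clipped to the integer range.
    x_q = max(x_q_min, min(x_q_max, round(x_f / scale + zero_point)))
    # Dequantize: invert the linear mapping, x_f is approximately (x_q - Z) * S.
    return (x_q - zero_point) * scale


scale, zero_point = 2 / 255, 127.5  # values from the -1 to 1 example

weight = 0.3
simulated = fake_quantize(weight, scale, zero_point)
error = simulated - weight  # the quantization error the network sees in training
print(simulated, error)
```

The output is still a float, but it now sits on one of the 256 representable levels, so the rest of the network is trained against the small error that rounding and clipping introduce.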

213
00:20:05,680 --> 00:20:11,950
That said, the post-training quantization is more popular than the quantization aware training method,

214
00:20:12,280 --> 00:20:16,180
thanks to its simplicity as it doesn't involve the training pipeline.

215
00:20:16,180 --> 00:20:22,450
Still, quantization-aware training almost always produces better accuracy, and sometimes this is the only

216
00:20:22,450 --> 00:20:23,860
acceptable method.

217
00:20:23,860 --> 00:20:31,810
And that's it for this section in which we've looked at quantization of neural network weights and activations

218
00:20:31,810 --> 00:20:37,450
to help reduce model size and also speed up computations.