1
00:00:00,330 --> 00:00:07,320
Hello, everyone, and welcome to this new and exciting session in which we are going to see how we'll

2
00:00:07,320 --> 00:00:16,710
move from a TensorFlow model which occupies one gigabyte of space to a quantized ONNX model occupying

3
00:00:16,710 --> 00:00:19,560
just 83 megabytes.

4
00:00:19,680 --> 00:00:26,790
At this point, we now understand the concept of quantization and we're going to see how to apply or

5
00:00:26,790 --> 00:00:35,280
implement quantization, specifically dynamic quantization, to make use of our model even more efficiently.

6
00:00:36,000 --> 00:00:46,980
Before we move on, we should also note here that the TF size is approximately

7
00:00:46,980 --> 00:00:48,150
one gigabyte.

8
00:00:48,450 --> 00:00:56,970
That's 1000 megabytes, while the ONNX size is 328 megabytes.

9
00:00:56,970 --> 00:01:02,250
So ONNX size: 328 megabytes.

10
00:01:02,280 --> 00:01:07,920
Now we're going to look at the quantized ONNX model.

11
00:01:08,190 --> 00:01:09,680
Let's just copy this.

12
00:01:09,690 --> 00:01:21,330
So we have the ONNX quantized CPU, which we shall get shortly, the ONNX quantized

13
00:01:21,330 --> 00:01:22,300
GPU.

14
00:01:23,910 --> 00:01:29,100
And then we'll also get the ONNX quantized size.

15
00:01:29,340 --> 00:01:33,690
So let's take this off for now and then get back here.

16
00:01:33,720 --> 00:01:38,250
Now, as you can see, we've imported from onnxruntime.quantization.

17
00:01:38,250 --> 00:01:40,200
We've already imported ONNX Runtime.

18
00:01:40,200 --> 00:01:45,270
So here we just imported quantize_dynamic and QuantType.

19
00:01:46,230 --> 00:01:53,690
Then from here you see we have the two models, that is, the floating-point and the quantized model.

20
00:01:53,700 --> 00:01:57,750
Now, this model here is what we had already.

21
00:01:57,750 --> 00:02:05,490
So we'll get back and then copy that, and then we'll call this one the ViT quantized.

22
00:02:06,000 --> 00:02:06,860
So that's it.

23
00:02:06,870 --> 00:02:13,920
Now all we need is quantize_dynamic, which takes in this model, then the path to the quantized

24
00:02:13,920 --> 00:02:19,380
model, which is still an ONNX model, and then the weight type.

25
00:02:19,380 --> 00:02:24,030
Now here the weight type is an unsigned int.

26
00:02:24,150 --> 00:02:27,970
So let's run this and see what we get.
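
The call just described can be sketched roughly as follows. This assumes onnxruntime is installed; the helper name quantize_to_uint8 and any paths passed to it are hypothetical, not the ones used in this session.

```python
import os


def quantize_to_uint8(fp32_path, quantized_path):
    """Dynamically quantize an ONNX model's weights to unsigned 8-bit ints.

    Returns the size of the quantized file in megabytes.
    Both paths are placeholders chosen by the caller.
    """
    if not os.path.isfile(fp32_path):
        raise FileNotFoundError(f"No ONNX model found at {fp32_path}")

    # Imported here so the check above works even without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=fp32_path,        # the floating-point ONNX model
        model_output=quantized_path,  # where the quantized ONNX model is written
        weight_type=QuantType.QUInt8, # unsigned 8-bit weights
    )
    return os.path.getsize(quantized_path) / 1e6
```

Calling it with the original model path and a new output path should leave a second, smaller .onnx file next to the first.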

27
00:02:27,990 --> 00:02:29,430
We get an error.

28
00:02:29,610 --> 00:02:31,100
This name is undefined.

29
00:02:31,110 --> 00:02:31,400
Okay.

30
00:02:31,410 --> 00:02:34,800
We should run this before running this one.

31
00:02:35,460 --> 00:02:36,480
So that's it.

32
00:02:36,960 --> 00:02:38,700
This should be fine this time around.

33
00:02:38,700 --> 00:02:46,490
And we should be able to get this file right here, this quantized ONNX file.

34
00:02:46,500 --> 00:02:46,720
Yeah.

35
00:02:46,740 --> 00:02:50,340
You can check out the documentation on quantizing ONNX models.

36
00:02:50,340 --> 00:02:52,290
You see, we have the Quantize ONNX Models page.

37
00:02:52,290 --> 00:02:55,770
We have an overview and all of that.

38
00:02:55,770 --> 00:02:58,980
So let's get back here and check out our model.

39
00:02:59,550 --> 00:03:00,660
Oh, there we go.

40
00:03:00,660 --> 00:03:06,000
We have our ViT quantized, and what we notice is we have 83 megabytes.

41
00:03:06,000 --> 00:03:17,160
So this means that we've gone from one gigabyte, or 1000 megabytes, to just 83 megabytes.
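
To put those sizes side by side, here is a small sketch of the arithmetic. The three sizes are the ones quoted in this session; the function name reduction_factors is just an illustrative choice.

```python
def reduction_factors(sizes_mb, baseline):
    """Return how many times smaller each model is than the baseline model."""
    base = sizes_mb[baseline]
    return {name: base / size for name, size in sizes_mb.items()}


# Sizes in megabytes, as noted in the session.
sizes_mb = {"TensorFlow": 1000, "ONNX": 328, "ONNX quantized": 83}
factors = reduction_factors(sizes_mb, "TensorFlow")

for name, factor in factors.items():
    print(f"{name}: {sizes_mb[name]} MB ({factor:.1f}x smaller than TensorFlow)")
```

On these numbers the quantized file is roughly 12 times smaller than the TensorFlow model and roughly 4 times smaller than the unquantized ONNX export.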

42
00:03:18,000 --> 00:03:22,670
And now we could go ahead and check out the speed or the CPU speed.

43
00:03:22,680 --> 00:03:25,200
So let's get back here.

44
00:03:25,200 --> 00:03:27,600
Or rather, let's get back here.

45
00:03:29,490 --> 00:03:30,570
There we go.

46
00:03:30,960 --> 00:03:32,250
Copy this path.

47
00:03:32,280 --> 00:03:33,750
Let's copy this path.

48
00:03:33,750 --> 00:03:41,420
And then we have this here; we paste that out here, and then this one here.

49
00:03:41,430 --> 00:03:43,130
This isn't CUDA.

50
00:03:43,140 --> 00:03:47,700
This is the CPU execution provider.

51
00:03:47,700 --> 00:03:48,930
That's fine.

52
00:03:48,930 --> 00:03:50,010
Everything looks fine.

53
00:03:50,010 --> 00:03:51,510
And let's run this.

54
00:03:52,650 --> 00:03:55,230
You see, we get 0.39.

55
00:03:55,230 --> 00:03:59,790
That's practically 0.4 seconds per prediction.

56
00:04:00,690 --> 00:04:04,570
And so here we have 0.4 seconds.
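
The timing code itself isn't read out here, so the following is only a sketch of one way to measure seconds per prediction. The helper seconds_per_prediction and its parameters are assumptions, and predict stands for any callable that runs one inference.

```python
import time
from statistics import mean


def seconds_per_prediction(predict, sample, runs=10, warmup=2):
    """Average wall-clock time of predict(sample) over several runs."""
    # Warm-up calls are not timed, since the first inference is often slower.
    for _ in range(warmup):
        predict(sample)

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append(time.perf_counter() - start)
    return mean(timings)


# With ONNX Runtime, predict could wrap a session created with the CPU provider:
#   session = onnxruntime.InferenceSession(path, providers=["CPUExecutionProvider"])
#   predict = lambda x: session.run(None, {session.get_inputs()[0].name: x})
```

Running the same helper on the original and the quantized model gives two numbers that can be compared directly.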

57
00:04:04,590 --> 00:04:07,170
Now let's check this out again.

58
00:04:08,550 --> 00:04:13,260
Let's check this out again for the original ONNX model.

59
00:04:13,260 --> 00:04:17,250
So let's run this and check this out.

60
00:04:18,380 --> 00:04:22,190
You see here we have 0.49.

61
00:04:22,190 --> 00:04:23,570
That's practically 0.5.

62
00:04:23,570 --> 00:04:26,260
So it means that this isn't exactly true.

63
00:04:26,270 --> 00:04:28,240
There should be 0.5.

64
00:04:28,250 --> 00:04:40,070
So we see that the quantized model is faster and much lighter than the original ONNX and

65
00:04:40,070 --> 00:04:42,110
the TensorFlow models.

66
00:04:43,100 --> 00:04:51,750
But we have to be careful, as quantization generally comes with a drop in accuracy.

67
00:04:51,770 --> 00:04:56,660
Now we switch to a GPU and then test out our quantized model.

68
00:04:56,660 --> 00:05:05,030
So right here we have this quantized model which we're going to run and then check out its latency.

69
00:05:05,810 --> 00:05:09,350
Here we get 0.27, let's say 0.3.

70
00:05:10,130 --> 00:05:19,820
And this shows that the quantized model doesn't benefit as much as the original ONNX model from the use of

71
00:05:19,820 --> 00:05:20,960
the GPUs.

72
00:05:21,680 --> 00:05:26,720
Now, if you check the documentation, you'll see that under Quantize ONNX Models

73
00:05:26,720 --> 00:05:28,780
there is a section on quantization on GPU.

74
00:05:28,790 --> 00:05:37,850
So quantization on the GPU isn't as straightforward as on the CPU, as here we're told that

75
00:05:37,850 --> 00:05:45,230
we'll need a device that supports Tensor Core int8 computation, like the T4 or the A100.

76
00:05:45,680 --> 00:05:50,030
And it says here that older hardware will not benefit from quantization.

77
00:05:50,720 --> 00:05:57,080
So if you want to proceed with quantization on a GPU, you can make use of this

78
00:05:57,170 --> 00:05:58,820
TensorRT execution provider.

79
00:05:58,820 --> 00:06:04,970
And here they give the overall procedure to leverage this TensorRT execution provider.
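
Choosing between those execution providers can be sketched as below. The provider strings are the names ONNX Runtime registers; the helper pick_providers and its preference order are assumptions, and whether TensorRT shows up at all depends on how ONNX Runtime was built and installed.

```python
PREFERRED = (
    "TensorrtExecutionProvider",  # needed for quantized models to benefit on GPU
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
)


def pick_providers(available, preferred=PREFERRED):
    """Keep the preferred providers that are actually available, in order."""
    chosen = [name for name in preferred if name in available]
    # Fall back to the CPU provider if nothing in the preference list matched.
    return chosen or ["CPUExecutionProvider"]


# With ONNX Runtime installed, the list of available providers would come from
#   onnxruntime.get_available_providers()
# and the result would be passed as
#   onnxruntime.InferenceSession(path, providers=pick_providers(available))
```

On a machine without TensorRT this simply falls through to CUDA or the CPU, which matches what we ran in this session.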

80
00:06:04,970 --> 00:06:11,960
So with that, we're going to get back here and the next thing we shall do is ensure that the quantization

81
00:06:11,960 --> 00:06:19,100
process hasn't led to too much of a drop in accuracy, as when we quantize a model we generally may have

82
00:06:19,100 --> 00:06:24,590
a drop in accuracy, but our aim is to be sure that this drop is minimal.

83
00:06:24,590 --> 00:06:27,580
And so to do this, we're going to evaluate our model.

84
00:06:27,590 --> 00:06:33,650
So here basically we've defined this accuracy, which takes the model.

85
00:06:33,650 --> 00:06:41,720
And then we'll take 100 elements in the validation dataset, where the validation

86
00:06:41,720 --> 00:06:43,550
dataset has a batch size of one.

87
00:06:43,550 --> 00:06:50,090
We are going to compare each time the output, or the ONNX prediction, with the label.

88
00:06:50,090 --> 00:06:54,590
So here we compare the label with the ONNX prediction, and if they are the same, we increase the

89
00:06:54,590 --> 00:06:57,380
accuracy variable value by one.

90
00:06:57,380 --> 00:07:03,560
Initially, they are zero: total zero, accuracy zero. But the total is always increased, and then the accuracy

91
00:07:03,560 --> 00:07:07,250
is increased only when we have these two the same.
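
The counting logic described above can be sketched like this. It assumes the dataset yields (image, label) pairs one at a time, as with a batch size of one, and that predict returns a class index; the names accuracy, predict and limit are illustrative, not the ones on screen.

```python
def accuracy(predict, dataset, limit=100):
    """Fraction of the first `limit` samples whose prediction matches the label."""
    correct = 0
    total = 0
    for image, label in dataset:
        if total >= limit:
            break
        # Count a hit only when the prediction and the label are the same.
        if predict(image) == label:
            correct += 1
        # The total goes up for every sample, hit or miss.
        total += 1
    return correct / total if total else 0.0


# Toy usage: the "model" echoes its input, and one of four labels disagrees.
toy_dataset = [(1, 1), (2, 2), (3, 0), (4, 4)]
print(accuracy(lambda x: x, toy_dataset))
```

The same function can then be called once per model, so both are scored on the same 100 validation samples.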

92
00:07:07,430 --> 00:07:14,330
So with this, basically, we call this accuracy method on the two models, that is

93
00:07:14,330 --> 00:07:17,600
the original ONNX and the quantized ONNX.

94
00:07:17,600 --> 00:07:21,530
So here we have these providers to make this run faster.

95
00:07:21,530 --> 00:07:22,490
So we run this.

96
00:07:22,490 --> 00:07:29,540
Now you see here we have 90% for the original and then 89% for the quantized model.

97
00:07:30,620 --> 00:07:39,560
The next thing we'll look at will be how to visualize ONNX models using Lutz Roeder's Netron app, so

98
00:07:39,560 --> 00:07:43,800
you can get the Netron app, and you have this interface right here.

99
00:07:43,820 --> 00:07:45,590
Now we're going to open the model.

100
00:07:45,980 --> 00:07:46,820
There we go.

101
00:07:46,820 --> 00:07:49,790
It's loading, and this is what we get.

102
00:07:49,790 --> 00:07:51,710
You see we start with this transpose.

103
00:07:51,710 --> 00:08:00,110
You could recall we had this transpose and then we moved on to this resizing, then matrix multiplication

104
00:08:00,110 --> 00:08:04,610
and then the rest of the model right here.

105
00:08:04,730 --> 00:08:14,240
So here is the ViT model in this ONNX format, the quantized ONNX format, scrolled right to the end right

106
00:08:14,240 --> 00:08:15,080
here.

107
00:08:15,290 --> 00:08:22,880
And then towards this end, you see we have these matrix multiplications for a linear layer and then

108
00:08:22,880 --> 00:08:25,010
we have this softmax.

109
00:08:25,010 --> 00:08:28,900
You could also export as PNG.

110
00:08:29,330 --> 00:08:33,320
So you could open this up in this PNG format right here.

111
00:08:33,650 --> 00:08:35,420
So that's our model.

112
00:08:38,050 --> 00:08:46,300
And that's it for the section where we've gone from a one-gigabyte model to an 83-megabyte model with

113
00:08:46,300 --> 00:08:49,990
just a 0.01 drop in accuracy.
