WEBVTT

1
00:00:00.570 --> 00:00:01.860
<v Maximilian>Now thankfully,</v>

2
00:00:01.860 --> 00:00:05.190
this memory problem can be solved,

3
00:00:05.190 --> 00:00:09.480
because the goal is not to run these raw trained models

4
00:00:09.480 --> 00:00:13.740
with their float32 or float16 values,

5
00:00:13.740 --> 00:00:16.080
because of course, if you were to ask yourself

6
00:00:16.080 --> 00:00:17.940
whether that runs on your machine or not,

7
00:00:17.940 --> 00:00:20.430
the answer would pretty much always be no,

8
00:00:20.430 --> 00:00:23.490
except for super small models.

9
00:00:23.490 --> 00:00:26.910
That's why we have a solution called quantization

10
00:00:26.910 --> 00:00:30.420
that helps us with running bigger models

11
00:00:30.420 --> 00:00:33.240
on your or my machine.

12
00:00:33.240 --> 00:00:35.850
So it's a solution that helps us reduce

13
00:00:35.850 --> 00:00:37.950
these memory requirements.

14
00:00:37.950 --> 00:00:39.090
How does this work?

15
00:00:39.090 --> 00:00:42.600
Well, in the end, it's a compression mechanism.

16
00:00:42.600 --> 00:00:44.940
Quantized Large Language Models

17
00:00:44.940 --> 00:00:48.540
are compressed Large Language Models,

18
00:00:48.540 --> 00:00:51.630
and that's achieved by taking the original trained model,

19
00:00:51.630 --> 00:00:53.130
which is not quantized,

20
00:00:53.130 --> 00:00:56.250
and which would therefore take up lots of memory.

21
00:00:56.250 --> 00:01:00.030
And then through a mathematical process,

22
00:01:00.030 --> 00:01:03.360
these parameter values are transformed

23
00:01:03.360 --> 00:01:06.120
to less precise numbers.

24
00:01:06.120 --> 00:01:09.720
So instead of float16 or float32,

25
00:01:09.720 --> 00:01:14.160
they are typically converted to int4 or int8 numbers,

26
00:01:14.160 --> 00:01:16.950
and int numbers are numbers

27
00:01:16.950 --> 00:01:18.330
without a decimal place.

28
00:01:18.330 --> 00:01:22.230
So something like four, five, 10, minus three,

29
00:01:22.230 --> 00:01:24.540
these would be integer numbers.

30
00:01:24.540 --> 00:01:27.480
And these parameter values are transformed

31
00:01:27.480 --> 00:01:29.820
to such integer numbers so that

32
00:01:29.820 --> 00:01:33.840
they therefore take up way less space per parameter.
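
NOTE
A minimal sketch of the idea described above, not the algorithm any real quantization tool uses. It assumes a simple symmetric "absmax" scheme with a single scale for all values, and the weights are made-up numbers. It shows floats being mapped to integers in the int8 range, plus the small rounding error that results.
```python
# Illustrative only: symmetric "absmax" quantization of a few float weights to int8.
# Real LLM quantizers typically work block-wise and are more elaborate (assumption-free claim avoided here).
def quantize_int8(weights):
    # One scale for the whole list maps the largest magnitude onto 127.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale
def dequantize(quantized, scale):
    # Recover approximate floats; the rounding error is the precision that was lost.
    return [q * scale for q in quantized]
weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, round(max_error, 4))
```
Each stored value is now a small integer, and the only extra data needed to get approximate floats back is the scale.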

33
00:01:33.840 --> 00:01:38.520
And the great news is that the quantization algorithms

34
00:01:38.520 --> 00:01:42.540
that are used for that transformation are so good

35
00:01:42.540 --> 00:01:44.280
that the model performance,

36
00:01:44.280 --> 00:01:48.450
the quality of the model is essentially unchanged,

37
00:01:48.450 --> 00:01:50.070
at least for the commonly used

38
00:01:50.070 --> 00:01:53.340
quantization mechanisms and solutions.

39
00:01:53.340 --> 00:01:55.590
So you get the best of both worlds,

40
00:01:55.590 --> 00:01:57.900
a model that does perform great,

41
00:01:57.900 --> 00:02:00.297
but that does not take up as much space

42
00:02:00.297 --> 00:02:03.960
and memory as the originally trained model would have.

43
00:02:03.960 --> 00:02:08.460
In addition, inference speed may also be better.

44
00:02:08.460 --> 00:02:10.410
Now the good news is that you don't need

45
00:02:10.410 --> 00:02:13.260
to perform this quantization yourself.

46
00:02:13.260 --> 00:02:16.560
So in case you were afraid, you don't need to do that,

47
00:02:16.560 --> 00:02:20.100
you don't need to understand the mathematical process

48
00:02:20.100 --> 00:02:23.070
that's being used for transforming these values.

49
00:02:23.070 --> 00:02:24.870
You should just have heard the term,

50
00:02:24.870 --> 00:02:26.040
and you should understand

51
00:02:26.040 --> 00:02:29.430
that it's about compressing these parameter values

52
00:02:29.430 --> 00:02:30.870
to make them way smaller.

53
00:02:30.870 --> 00:02:33.420
Because if we're talking about half a byte

54
00:02:33.420 --> 00:02:36.210
or one byte instead of two

55
00:02:36.210 --> 00:02:39.240
or four bytes per parameter,

56
00:02:39.240 --> 00:02:43.500
we of course essentially only need around one fourth

57
00:02:43.500 --> 00:02:47.250
of the originally predicted space in memory.

58
00:02:47.250 --> 00:02:50.880
So therefore, this 27-billion-parameter model

59
00:02:50.880 --> 00:02:51.840
all of a sudden,

60
00:02:51.840 --> 00:02:55.470
instead of requiring 100 gigabytes of RAM

61
00:02:55.470 --> 00:02:59.160
may only require 25 gigabytes of RAM,

62
00:02:59.160 --> 00:03:04.020
or even less than that, maybe only 12 gigabytes of RAM,

63
00:03:04.020 --> 00:03:08.850
roughly, depending on the quantization mechanism used.
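
NOTE
A rough back-of-the-envelope sketch of the arithmetic above: parameter count multiplied by bytes per parameter. It counts only the weights, so it ignores the context memory and any file-format overhead, and the byte sizes are the nominal ones for each number type. Real quantized files will differ somewhat.
```python
# Rough weight-memory estimate: parameter count times bytes per parameter.
# Ignores context memory and per-format overhead, so real numbers differ somewhat.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}
def weight_memory_gb(num_params, dtype):
    # Decimal gigabytes (1 GB = 1e9 bytes).
    return num_params * BYTES_PER_PARAM[dtype] / 1e9
params = 27e9  # the 27-billion-parameter model from the example
estimates = {dtype: weight_memory_gb(params, dtype) for dtype in BYTES_PER_PARAM}
for dtype, gb in estimates.items():
    print(f"{dtype}: ~{gb:.1f} GB")
```
Going from float32 to int8, or from float16 to int4, is the "around one fourth" reduction mentioned above.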

64
00:03:08.850 --> 00:03:11.400
I'll also say right away that the amount of memory

65
00:03:11.400 --> 00:03:15.750
that's required is not just dependent on the parameters,

66
00:03:15.750 --> 00:03:19.260
though they are the most important factor,

67
00:03:19.260 --> 00:03:22.140
but there's also some space that's reserved

68
00:03:22.140 --> 00:03:25.470
for the input and output text of that model

69
00:03:25.470 --> 00:03:28.830
because all that text, the so-called context,

70
00:03:28.830 --> 00:03:30.900
also needs to be loaded into memory

71
00:03:30.900 --> 00:03:33.330
in order to be processed by the model.

72
00:03:33.330 --> 00:03:37.200
But unless you're operating with huge amounts of text,

73
00:03:37.200 --> 00:03:39.870
that will not take up as much space

74
00:03:39.870 --> 00:03:43.170
as the model parameters do, for example.

75
00:03:43.170 --> 00:03:47.970
And that's why typically, when we aim to run open models,

76
00:03:47.970 --> 00:03:50.040
no matter if it's locally on your system

77
00:03:50.040 --> 00:03:52.020
or on some rented server,

78
00:03:52.020 --> 00:03:55.740
we don't use the raw models we find on Hugging Face.

79
00:03:55.740 --> 00:03:59.490
That's also why if I were to click Use This Model here,

80
00:03:59.490 --> 00:04:02.010
I don't see any local options

81
00:04:02.010 --> 00:04:06.810
because this is simply not prepared to be executed locally.

82
00:04:06.810 --> 00:04:09.180
Instead on Hugging Face, for all these models,

83
00:04:09.180 --> 00:04:10.590
if you scroll down a bit,

84
00:04:10.590 --> 00:04:14.190
you'll find derivatives that are based on this model.

85
00:04:14.190 --> 00:04:16.020
For example, fine-tuned versions

86
00:04:16.020 --> 00:04:18.120
that may be better at certain tasks,

87
00:04:18.120 --> 00:04:21.720
and very important, quantizations,

88
00:04:21.720 --> 00:04:26.610
and it's these quantizations that you wanna run locally.

89
00:04:26.610 --> 00:04:29.490
So if I click here, I see quantized versions

90
00:04:29.490 --> 00:04:31.200
of this Gemma 3 model,

91
00:04:31.200 --> 00:04:33.450
which I'm using as an example here.

92
00:04:33.450 --> 00:04:34.740
For example, this one,

93
00:04:34.740 --> 00:04:37.410
which actually technically is a special version

94
00:04:37.410 --> 00:04:41.460
because it was trained by Google with quantization in mind,

95
00:04:41.460 --> 00:04:44.580
which is why it performs particularly well

96
00:04:44.580 --> 00:04:46.470
when running it locally.

97
00:04:46.470 --> 00:04:48.000
But even if you have models

98
00:04:48.000 --> 00:04:50.280
that were not trained with that in mind,

99
00:04:50.280 --> 00:04:53.280
you can use these quantized versions basically

100
00:04:53.280 --> 00:04:56.610
without loss in quality to run them locally

101
00:04:56.610 --> 00:05:00.090
with way reduced memory requirements.

102
00:05:00.090 --> 00:05:02.070
So if you click on such a quantized version,

103
00:05:02.070 --> 00:05:05.160
you get a page that looks similar to what we saw before,

104
00:05:05.160 --> 00:05:07.380
but here, we now actually got

105
00:05:07.380 --> 00:05:10.890
that quantized, optimized, compressed version.

106
00:05:10.890 --> 00:05:13.350
It has the same number of parameters,

107
00:05:13.350 --> 00:05:17.880
but every parameter is way smaller in size.

108
00:05:17.880 --> 00:05:21.150
Originally it was this float16 value type

109
00:05:21.150 --> 00:05:25.170
that was used for the parameters of this model here.

110
00:05:25.170 --> 00:05:27.720
Now it's in the end a four-bit integer

111
00:05:27.720 --> 00:05:32.720
after quantization, so it's way, way smaller in size.

112
00:05:32.730 --> 00:05:37.590
Now the weird name you see here essentially shows you

113
00:05:37.590 --> 00:05:39.180
that this is a quantized model,

114
00:05:39.180 --> 00:05:41.580
especially whenever you see something like this,

115
00:05:41.580 --> 00:05:44.460
Q4 underscore something,

116
00:05:44.460 --> 00:05:47.520
that is clearly a sign of quantization

117
00:05:47.520 --> 00:05:51.090
because this describes the concrete quantization technique

118
00:05:51.090 --> 00:05:53.460
that was used, for example here,

119
00:05:53.460 --> 00:05:55.353
that we got four-bit integers.
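
NOTE
A small sketch of reading the bit width out of such a name. The file names are hypothetical, and the pattern only assumes a label shaped like "Q4_something" as described above. It is not an official parser for any naming scheme.
```python
import re
# Hypothetical file names; the pattern assumes labels shaped like "Q4_something".
QUANT_LABEL = re.compile(r"Q(\d+)_[A-Za-z0-9_]+")
def quant_bits(file_name):
    # Return the bit width named in the label, or None if no label is found.
    match = QUANT_LABEL.search(file_name)
    return int(match.group(1)) if match else None
print(quant_bits("example-model-Q4_0.gguf"))
print(quant_bits("example-model.safetensors"))
```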

120
00:05:56.190 --> 00:05:58.020
It's also worth noting

121
00:05:58.020 --> 00:06:02.730
that this here is a so-called GGUF file,

122
00:06:02.730 --> 00:06:05.580
which is the file type that's typically used

123
00:06:05.580 --> 00:06:07.590
for these quantized models

124
00:06:07.590 --> 00:06:08.730
because it's a file type

125
00:06:08.730 --> 00:06:11.520
that includes all the quantized parameters.

126
00:06:11.520 --> 00:06:15.480
But in addition, it contains a lot of metadata.

127
00:06:15.480 --> 00:06:16.980
That's just a side note.

128
00:06:16.980 --> 00:06:19.590
So GGUF is typically the file extension

129
00:06:19.590 --> 00:06:23.340
you'll work with when working with such quantized models.

130
00:06:23.340 --> 00:06:26.400
And you will typically work with quantized models

131
00:06:26.400 --> 00:06:28.980
because there is really no reason to use

132
00:06:28.980 --> 00:06:30.720
the unquantized ones.

133
00:06:30.720 --> 00:06:33.540
It's harder, it takes up way more memory,

134
00:06:33.540 --> 00:06:36.870
and in general has way higher hardware requirements.

135
00:06:36.870 --> 00:06:38.490
And as explained before,

136
00:06:38.490 --> 00:06:43.230
their quality is, if at all, only marginally better.

137
00:06:43.230 --> 00:06:45.690
So it's the quantized versions of these models

138
00:06:45.690 --> 00:06:48.423
that we run locally or on our servers.

