WEBVTT

1
00:00:00.930 --> 00:00:03.427
<v Maximilian>So now to come back to the question,</v>

2
00:00:03.427 --> 00:00:06.030
"Does it run on my machine?"

3
00:00:06.030 --> 00:00:09.420
There are two important parts to that answer.

4
00:00:09.420 --> 00:00:11.850
For one, as I explained before,

5
00:00:11.850 --> 00:00:16.050
preferably you have a GPU that can run the model,

6
00:00:16.050 --> 00:00:18.210
and VRAM, into which the parameters,

7
00:00:18.210 --> 00:00:20.700
the quantized parameters can be loaded.

8
00:00:20.700 --> 00:00:23.070
But if you don't, that's okay too.

9
00:00:23.070 --> 00:00:25.980
Your CPU and your regular system RAM

10
00:00:25.980 --> 00:00:28.230
can also handle these models.

11
00:00:28.230 --> 00:00:31.650
You will be able to run them locally without a GPU too,

12
00:00:31.650 --> 00:00:33.510
it will just be slower.

13
00:00:33.510 --> 00:00:35.700
Inference will be slower.

14
00:00:35.700 --> 00:00:38.550
But of course, the amount of memory you have,

15
00:00:38.550 --> 00:00:41.070
no matter if it's VRAM, which is preferable,

16
00:00:41.070 --> 00:00:44.940
or regular system memory, also matters,

17
00:00:44.940 --> 00:00:49.830
because not all open models can be run on every laptop,

18
00:00:49.830 --> 00:00:53.610
on every machine, even with quantization.

19
00:00:53.610 --> 00:00:56.010
And the great thing about Hugging Face,

20
00:00:56.010 --> 00:00:58.260
is that if you have an account,

21
00:00:58.260 --> 00:01:00.450
which you don't need for this course,

22
00:01:00.450 --> 00:01:02.670
but which you can create for free,

23
00:01:02.670 --> 00:01:06.120
if you have one, then under Settings,

24
00:01:06.120 --> 00:01:08.820
and there, Local Apps and Hardware,

25
00:01:08.820 --> 00:01:13.200
you can basically share your hardware profile.

26
00:01:13.200 --> 00:01:17.460
So in my case, I have an M1 Max MacBook Pro,

27
00:01:17.460 --> 00:01:20.313
so I added this here as my hardware.

28
00:01:21.210 --> 00:01:23.880
And it then tells me whether that's good or bad.

29
00:01:23.880 --> 00:01:27.270
And you see I'm on the lower end here.

30
00:01:27.270 --> 00:01:30.690
But the great thing is that I got 64 gigabytes

31
00:01:30.690 --> 00:01:32.730
of so-called unified memory,

32
00:01:32.730 --> 00:01:35.940
which is a combination of system memory and VRAM.

33
00:01:35.940 --> 00:01:37.110
So that's pretty good,

34
00:01:37.110 --> 00:01:40.920
and allows me to load quite big models into memory.

35
00:01:40.920 --> 00:01:43.440
But the memory is definitely not as fast

36
00:01:43.440 --> 00:01:46.770
as on high-end Nvidia GPUs, for example.

37
00:01:46.770 --> 00:01:49.230
It's always a trade-off in the end.

38
00:01:49.230 --> 00:01:52.320
But since I added this hardware profile here

39
00:01:52.320 --> 00:01:56.520
on the Model pages of these quantized models,

40
00:01:56.520 --> 00:02:01.520
I got this area here where it tells me whether I can run

41
00:02:01.650 --> 00:02:05.310
this specific model version, this quantized version,

42
00:02:05.310 --> 00:02:07.440
on my system.

43
00:02:07.440 --> 00:02:10.200
So here I get a green check mark, which means yes,

44
00:02:10.200 --> 00:02:13.500
this will fit into memory.

45
00:02:13.500 --> 00:02:16.560
But besides this Hugging Face feature,

46
00:02:16.560 --> 00:02:19.140
LM Studio, which is one of the main tools

47
00:02:19.140 --> 00:02:20.640
we'll use throughout this course,

48
00:02:20.640 --> 00:02:22.500
and which we'll also use to download

49
00:02:22.500 --> 00:02:25.590
and run these open models on our machine,

50
00:02:25.590 --> 00:02:28.110
that tool also has a Model Search,

51
00:02:28.110 --> 00:02:30.120
which we will use in the next section,

52
00:02:30.120 --> 00:02:33.990
and there if you browse models, you'll get a warning

53
00:02:33.990 --> 00:02:38.370
if it's likely too large for your system.

54
00:02:38.370 --> 00:02:40.750
So for example, here, this Qwen3,

55
00:02:40.750 --> 00:02:43.590
235-billion-parameter model

56
00:02:43.590 --> 00:02:47.283
is likely too large for my system here.

57
00:02:48.570 --> 00:02:52.260
This one here, on the other hand, would work.

58
00:02:52.260 --> 00:02:55.920
So, since we'll use that tool anyways in the next section,

59
00:02:55.920 --> 00:02:57.780
you can of course use that indication

60
00:02:57.780 --> 00:03:00.465
to find out which model you can use on your system.

61
00:03:00.465 --> 00:03:02.160
And if you don't have a lot of memory,

62
00:03:02.160 --> 00:03:04.230
you simply have to go for the versions

63
00:03:04.230 --> 00:03:06.660
that have fewer parameters.

64
00:03:06.660 --> 00:03:08.490
Of course, the model quality,

65
00:03:08.490 --> 00:03:11.820
the quality of the results produced by the model,

66
00:03:11.820 --> 00:03:16.410
in general goes up as the number of parameters increases,

67
00:03:16.410 --> 00:03:18.360
but it's not a linear relationship.

68
00:03:18.360 --> 00:03:21.210
And even these super-small models can be great

69
00:03:21.210 --> 00:03:22.710
for certain use cases,

70
00:03:22.710 --> 00:03:26.280
like, for example, text summarization.

71
00:03:26.280 --> 00:03:30.120
You can also do the math manually, at least roughly.

72
00:03:30.120 --> 00:03:33.660
You can take the number of parameters you have,

73
00:03:33.660 --> 00:03:37.830
27 billion for example, and you can forget the billion part.

74
00:03:37.830 --> 00:03:40.413
It's the 27 part that's important,

75
00:03:41.460 --> 00:03:44.280
and then if it's a four-bit model,

76
00:03:44.280 --> 00:03:46.170
you can essentially take half

77
00:03:46.170 --> 00:03:48.180
of that parameter count here.

78
00:03:48.180 --> 00:03:52.950
So, 13.5 in the case of the 27-billion-parameter model.

79
00:03:52.950 --> 00:03:56.940
Why half? Because a four-bit integer takes up half a byte.

80
00:03:56.940 --> 00:04:00.000
So if you were to multiply 27 billion

81
00:04:00.000 --> 00:04:02.250
times half a byte,

82
00:04:02.250 --> 00:04:07.250
you would end up with 13.5 gigabytes of memory required.

83
00:04:09.090 --> 00:04:10.950
Now, you need to add something on top of that,

84
00:04:10.950 --> 00:04:14.070
because it's not just the parameters that must be loaded,

85
00:04:14.070 --> 00:04:16.710
it's also, for example, the context window.

86
00:04:16.710 --> 00:04:19.860
So the input and output tokens, the chat history, and so on.

87
00:04:19.860 --> 00:04:22.800
And that's why you should add something on top of that.

88
00:04:22.800 --> 00:04:26.250
That's why here we see 17 gigabytes, for example,

89
00:04:26.250 --> 00:04:28.440
as a rough estimate.

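NOTE
A minimal sketch of the rough estimate described above, in Python.
The function names and the 3.5 GB overhead are assumptions chosen for illustration,
so that the result lines up with the roughly 17 GB mentioned in the video.
Real overhead depends on the context length and on the runtime used.
```python
# Rough memory estimate for a quantized model (illustrative, not a precise formula).
def estimate_weights_gb(params_billions, bits_per_param):
    # One billion parameters at 8 bits (1 byte) each is roughly 1 GB.
    return params_billions * bits_per_param / 8
#
def estimate_total_gb(params_billions, bits_per_param, overhead_gb):
    # overhead_gb is an assumed allowance for the context window and runtime buffers.
    return estimate_weights_gb(params_billions, bits_per_param) + overhead_gb
#
weights = estimate_weights_gb(27, 4)   # 27 billion parameters at 4 bits each
total = estimate_total_gb(27, 4, 3.5)  # 3.5 GB overhead is an assumption
print(f"weights: {weights} GB, total: {total} GB")
```
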
90
00:04:28.440 --> 00:04:32.400
And if you then have more system memory and/or VRAM

91
00:04:32.400 --> 00:04:34.560
than those 17 gigabytes,

92
00:04:34.560 --> 00:04:38.190
you can run that model locally on your system.

93
00:04:38.190 --> 00:04:41.130
Of course, assuming there's nothing else on your system

94
00:04:41.130 --> 00:04:44.340
that eats up significant amounts of memory.

95
00:04:44.340 --> 00:04:47.040
It's also worth noting that the model can also be split

96
00:04:47.040 --> 00:04:51.060
between GPU and CPU, so between VRAM and regular memory.

97
00:04:51.060 --> 00:04:54.540
That's of course not as good as doing it on the GPU only,

98
00:04:54.540 --> 00:04:59.070
but it's better than just having a CPU and no VRAM.

99
00:04:59.070 --> 00:05:01.710
So if you have, like, eight gigabytes of VRAM,

100
00:05:01.710 --> 00:05:04.620
but 32 gigabytes of system memory,

101
00:05:04.620 --> 00:05:06.750
you could still run this model here

102
00:05:06.750 --> 00:05:09.180
because eight gigabytes of the 17 gigabytes

103
00:05:09.180 --> 00:05:10.740
would be loaded into VRAM,

104
00:05:10.740 --> 00:05:14.040
and the rest would be loaded into regular system memory.

105
00:05:14.040 --> 00:05:15.843
That's always important to note.

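NOTE
A simplified sketch of the VRAM and system memory split from the example above, in Python.
The function name split_model is hypothetical, and the logic is an assumption for illustration:
it fills VRAM first and ignores memory used by the operating system and other programs.
Real runtimes typically offload whole layers rather than an exact number of gigabytes.
```python
def split_model(required_gb, vram_gb, ram_gb):
    # Fill VRAM first, then place the remainder in regular system memory.
    in_vram = min(required_gb, vram_gb)
    in_ram = required_gb - in_vram
    if in_ram > ram_gb:
        return None  # does not fit, even when split across both
    return in_vram, in_ram
#
# 17 GB model, 8 GB of VRAM, 32 GB of system memory (the numbers from the video).
print(split_model(17, 8, 32))
```
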
106
00:05:16.710 --> 00:05:18.600
And of course, if you don't have enough memory,

107
00:05:18.600 --> 00:05:21.540
you'll have to look for some other model.

108
00:05:21.540 --> 00:05:23.400
For example, for Google's Gemma models,

109
00:05:23.400 --> 00:05:26.460
which are really good, there also are smaller versions

110
00:05:26.460 --> 00:05:30.390
with four billion parameters instead of 27 billion,

111
00:05:30.390 --> 00:05:32.430
and 1 billion and 12 billion.

112
00:05:32.430 --> 00:05:35.670
And of course, these models require way less memory

113
00:05:35.670 --> 00:05:38.820
in order to be loaded and executed.

114
00:05:38.820 --> 00:05:41.850
And that's, therefore, how you can answer the question

115
00:05:41.850 --> 00:05:44.250
whether it will run on your machine or not.

116
00:05:44.250 --> 00:05:47.280
And that's what quantization is all about,

117
00:05:47.280 --> 00:05:48.723
and why it's important.

