WEBVTT

1
00:00:00.360 --> 00:00:02.730
<v Maximillian>Now, that's still not all we can tweak</v>

2
00:00:02.730 --> 00:00:05.910
about our models and settings here,

3
00:00:05.910 --> 00:00:08.880
because when you load a model, let's say here,

4
00:00:08.880 --> 00:00:13.260
I wanna switch to that 12 billion model again actually,

5
00:00:13.260 --> 00:00:14.550
which I already have loaded,

6
00:00:14.550 --> 00:00:16.620
but I want to configure it differently.

7
00:00:16.620 --> 00:00:19.170
So, if I load a model here, by the way,

8
00:00:19.170 --> 00:00:22.200
you can always unload it by clicking eject, of course.

9
00:00:22.200 --> 00:00:23.610
But if I load a model,

10
00:00:23.610 --> 00:00:27.720
no matter if I had one or not, then if I'm a power user,

11
00:00:27.720 --> 00:00:30.030
I can tweak some settings.

12
00:00:30.030 --> 00:00:32.100
As a regular user, this is not possible.

13
00:00:32.100 --> 00:00:33.930
As a regular user,

14
00:00:33.930 --> 00:00:38.040
it just gets loaded like this, but as a power user,

15
00:00:38.040 --> 00:00:41.400
there are some options here, as you can see.

16
00:00:41.400 --> 00:00:44.670
Most importantly, you can control the context length,

17
00:00:44.670 --> 00:00:47.550
and that will depend on the underlying model you're using.

18
00:00:47.550 --> 00:00:49.950
For example, these Gemma 3 models

19
00:00:49.950 --> 00:00:52.830
have quite a big context window.

20
00:00:52.830 --> 00:00:55.020
Now, in case you're not sure what context length

21
00:00:55.020 --> 00:00:58.200
or context window means, as mentioned,

22
00:00:58.200 --> 00:01:00.780
all these models operate on tokens,

23
00:01:00.780 --> 00:01:04.650
but they have a maximum number of tokens they can process.

24
00:01:04.650 --> 00:01:06.930
And that is exactly that context length,

25
00:01:06.930 --> 00:01:10.050
or context window as it's also often called.

26
00:01:10.050 --> 00:01:11.040
You see that here,

27
00:01:11.040 --> 00:01:16.040
this model supports up to 130,000-ish tokens.

28
00:01:16.980 --> 00:01:19.500
And you also see the tooltip here,

29
00:01:19.500 --> 00:01:22.110
the maximum number of tokens the model can attend to

30
00:01:22.110 --> 00:01:23.760
in one prompt.

31
00:01:23.760 --> 00:01:24.810
Now, theoretically,

32
00:01:24.810 --> 00:01:27.660
your chat history could go beyond that limit,

33
00:01:27.660 --> 00:01:30.960
but then not the entire chat history would be considered,

34
00:01:30.960 --> 00:01:33.930
because normally the entire chat history

35
00:01:33.930 --> 00:01:36.960
is sent to the model with every prompt

36
00:01:36.960 --> 00:01:40.500
so that you can also refer back to older messages.

37
00:01:40.500 --> 00:01:43.410
But of course, if you go beyond the context window

38
00:01:43.410 --> 00:01:45.270
with your entire chat history,

39
00:01:45.270 --> 00:01:48.360
then not the entire chat history can be considered.

40
00:01:48.360 --> 00:01:50.610
And if a single prompt goes beyond

41
00:01:50.610 --> 00:01:52.200
the available context length,

42
00:01:52.200 --> 00:01:55.413
then not even that entire prompt can be considered.

43
00:01:56.550 --> 00:01:58.950
Now, what's worth noting about LM Studio,

44
00:01:58.950 --> 00:02:02.070
and later also Ollama, is that by default,

45
00:02:02.070 --> 00:02:05.070
they don't make that entire maximum

46
00:02:05.070 --> 00:02:07.440
context length available to you.

47
00:02:07.440 --> 00:02:10.440
Instead, here, in my case for this model,

48
00:02:10.440 --> 00:02:13.500
the default is around 4,000 tokens,

49
00:02:13.500 --> 00:02:16.200
so way less than the maximum.

50
00:02:16.200 --> 00:02:18.660
By the way, that's around 3,000 words,

51
00:02:18.660 --> 00:02:20.130
because you can roughly say

52
00:02:20.130 --> 00:02:23.163
that one token is around 0.75 words.

53
00:02:24.360 --> 00:02:26.220
Now, we can ramp this up,

54
00:02:26.220 --> 00:02:29.070
but why don't we always use the maximum?

55
00:02:29.070 --> 00:02:32.550
Well, because the context window also takes up space

56
00:02:32.550 --> 00:02:33.990
in your memory.

57
00:02:33.990 --> 00:02:37.290
Let's say I change this to 20,000 here

58
00:02:37.290 --> 00:02:39.630
and I leave everything the way it is.

59
00:02:39.630 --> 00:02:41.613
Now, if I load this model,

60
00:02:43.380 --> 00:02:45.120
you will see that for me here,

61
00:02:45.120 --> 00:02:48.000
it takes up around 15 gigabytes of RAM.

62
00:02:48.000 --> 00:02:49.110
So, no problem.

63
00:02:49.110 --> 00:02:53.343
I got 48 gigabytes of VRAM, but it takes up around 15.

64
00:02:54.780 --> 00:02:58.440
Now, let me eject it and load it again,

65
00:02:58.440 --> 00:03:01.983
but now with the default setting of 4,000 tokens.

66
00:03:03.240 --> 00:03:05.610
With that, you'll see that once it's done loading,

67
00:03:05.610 --> 00:03:07.803
it's only around nine gigabytes.

68
00:03:08.640 --> 00:03:11.640
So it has to reserve way more space

69
00:03:11.640 --> 00:03:15.510
if I set up a bigger context window,

70
00:03:15.510 --> 00:03:18.360
because it essentially has to reserve all that space

71
00:03:18.360 --> 00:03:20.970
in memory that could potentially be occupied

72
00:03:20.970 --> 00:03:23.070
by your tokens.

73
00:03:23.070 --> 00:03:23.903
So, therefore, of course,

74
00:03:23.903 --> 00:03:27.120
if I ramp this up all the way to the limit to take advantage

75
00:03:27.120 --> 00:03:30.000
of the full available context window size,

76
00:03:30.000 --> 00:03:32.280
you see here for me, for this model,

77
00:03:32.280 --> 00:03:35.310
this goes up to essentially the maximum amount

78
00:03:35.310 --> 00:03:37.620
of RAM I have on my system,

79
00:03:37.620 --> 00:03:39.210
which is typically not what I want,

80
00:03:39.210 --> 00:03:43.113
because now it's really struggling to do anything.

81
00:03:43.950 --> 00:03:46.200
So, you should choose a context length

82
00:03:46.200 --> 00:03:48.150
that makes sense for your use case.

83
00:03:48.150 --> 00:03:50.880
If you know that you are about to process

84
00:03:50.880 --> 00:03:53.820
some big PDF documents,

85
00:03:53.820 --> 00:03:58.260
maybe you need 20,000 tokens, maybe you need more,

86
00:03:58.260 --> 00:04:02.253
but you should not reserve as much as possible all the time.

87
00:04:03.900 --> 00:04:07.050
And if you know that you need to process more tokens

88
00:04:07.050 --> 00:04:09.780
in general than your system can handle,

89
00:04:09.780 --> 00:04:12.930
then of course you might wanna consider splitting your task

90
00:04:12.930 --> 00:04:15.480
into multiple smaller subtasks

91
00:04:15.480 --> 00:04:18.150
so that every subtask can be completed

92
00:04:18.150 --> 00:04:21.840
and you then merge the results afterward.

93
00:04:21.840 --> 00:04:23.550
That's what's important to understand

94
00:04:23.550 --> 00:04:25.083
about context length here.

95
00:04:26.040 --> 00:04:27.780
Now, besides the context length,

96
00:04:27.780 --> 00:04:30.210
we also got the GPU offload.

97
00:04:30.210 --> 00:04:32.100
And as you see, if you hover over that,

98
00:04:32.100 --> 00:04:35.310
that's the number of discrete model layers to compute

99
00:04:35.310 --> 00:04:38.880
on the GPU for GPU acceleration.

100
00:04:38.880 --> 00:04:42.330
What this means in the end is how much of the model

101
00:04:42.330 --> 00:04:44.190
should be loaded onto the GPU,

102
00:04:44.190 --> 00:04:47.460
or how much work that needs to be done by the model

103
00:04:47.460 --> 00:04:48.960
should be done by the GPU.

104
00:04:48.960 --> 00:04:52.200
And unless you have a strong reason to change that,

105
00:04:52.200 --> 00:04:54.750
you typically want to keep that at the maximum.

106
00:04:54.750 --> 00:04:57.150
You wanna let your GPU do all the work

107
00:04:57.150 --> 00:05:00.150
because it's much better at that than your CPU,

108
00:05:00.150 --> 00:05:02.310
because the GPU, unlike the CPU,

109
00:05:02.310 --> 00:05:05.910
is really able to handle multiple things in parallel

110
00:05:05.910 --> 00:05:09.330
without constantly switching back and forth between them.

111
00:05:09.330 --> 00:05:14.160
So, typically, GPU offload should be at 100% here.

112
00:05:14.160 --> 00:05:16.800
The context length should only be set to a value

113
00:05:16.800 --> 00:05:19.923
that makes sense for your specific task.