WEBVTT

1
00:00:00.660 --> 00:00:02.880
<v Maximilian>Now when it comes to tweaking those settings,</v>

2
00:00:02.880 --> 00:00:04.860
you can of course also remember them

3
00:00:04.860 --> 00:00:07.260
so that they will be used in the future again.

4
00:00:07.260 --> 00:00:10.560
So if you by default want, let's say, 10,000 tokens,

5
00:00:10.560 --> 00:00:12.870
you could save these settings as a default

6
00:00:12.870 --> 00:00:15.000
so that you don't always have to change it again

7
00:00:15.000 --> 00:00:16.743
when you reload the model.

8
00:00:17.850 --> 00:00:20.580
But you also have some advanced settings here.

9
00:00:20.580 --> 00:00:22.590
Now, to be very honest, most of these settings

10
00:00:22.590 --> 00:00:24.090
can stay the way they are.

11
00:00:24.090 --> 00:00:26.640
The evaluation batch size is an interesting one

12
00:00:26.640 --> 00:00:29.460
because here you can control the number of input tokens

13
00:00:29.460 --> 00:00:31.500
that are processed at a time,

14
00:00:31.500 --> 00:00:34.440
so simultaneously by the model,

15
00:00:34.440 --> 00:00:35.640
and you could increase this

16
00:00:35.640 --> 00:00:37.710
to improve the performance of the model

17
00:00:37.710 --> 00:00:40.680
at the cost of using more memory.

18
00:00:40.680 --> 00:00:42.870
But I typically keep the default here as well.

19
00:00:42.870 --> 00:00:45.900
But you could experiment with different values here.

20
00:00:45.900 --> 00:00:47.940
The RoPE settings, to be very honest,

21
00:00:47.940 --> 00:00:50.040
are settings I never touch

22
00:00:50.040 --> 00:00:52.260
because the defaults work quite well.

23
00:00:52.260 --> 00:00:54.660
These RoPE settings are about

24
00:00:54.660 --> 00:00:57.240
pushing the boundaries of your model

25
00:00:57.240 --> 00:01:00.570
when it comes to operating with lots of context,

26
00:01:00.570 --> 00:01:04.380
so long context documents or chat histories.

27
00:01:04.380 --> 00:01:06.000
And I typically prefer

28
00:01:06.000 --> 00:01:09.210
to split complex long context tasks

29
00:01:09.210 --> 00:01:11.850
into simpler, shorter subtasks,

30
00:01:11.850 --> 00:01:14.670
because in my experience, that yields better results

31
00:01:14.670 --> 00:01:18.450
than operating on super long context anyways.

32
00:01:18.450 --> 00:01:21.360
Now what is interesting are the flash attention

33
00:01:21.360 --> 00:01:25.440
and the K and V cache settings down here, though.

34
00:01:25.440 --> 00:01:28.410
And the idea behind all these settings,

35
00:01:28.410 --> 00:01:32.610
especially when used together, is that they can,

36
00:01:32.610 --> 00:01:36.900
as the tooltip tells us, decrease the memory usage

37
00:01:36.900 --> 00:01:40.140
and also the generation time of some models

38
00:01:40.140 --> 00:01:42.810
depending on how well they support it.

39
00:01:42.810 --> 00:01:45.763
Now, this quantization, this KV cache quantization

40
00:01:45.763 --> 00:01:48.270
is not the quantization

41
00:01:48.270 --> 00:01:50.580
we were referring to earlier in the course

42
00:01:50.580 --> 00:01:53.310
when we explored hardware requirements.

43
00:01:53.310 --> 00:01:57.360
Instead, flash attention and KV cache quantization,

44
00:01:57.360 --> 00:02:00.360
these three settings in the end work together.

45
00:02:00.360 --> 00:02:03.510
As you also see here, flash attention must be enabled

46
00:02:03.510 --> 00:02:06.267
for V cache quantization at least.

47
00:02:06.267 --> 00:02:08.130
And the idea behind these settings

48
00:02:08.130 --> 00:02:13.130
is that the model has to process fewer tokens

49
00:02:13.380 --> 00:02:15.960
to generate a new output token,

50
00:02:15.960 --> 00:02:18.690
because normally it always has to process

51
00:02:18.690 --> 00:02:20.250
all existing tokens,

52
00:02:20.250 --> 00:02:23.640
including the output tokens it has already generated,

53
00:02:23.640 --> 00:02:26.130
to produce the next token in line,

54
00:02:26.130 --> 00:02:28.470
which makes sense because that next token

55
00:02:28.470 --> 00:02:30.960
depends on all these other tokens.

56
00:02:30.960 --> 00:02:32.790
Now, with flash attention

57
00:02:32.790 --> 00:02:36.780
and KV cache quantization enabled,

58
00:02:36.780 --> 00:02:41.220
roughly speaking, tokens generated in earlier steps

59
00:02:41.220 --> 00:02:44.850
can be cached and therefore be remembered,

60
00:02:44.850 --> 00:02:48.420
and can therefore be reused in subsequent steps

61
00:02:48.420 --> 00:02:50.910
to generate future tokens

62
00:02:50.910 --> 00:02:54.240
without having to recalculate them.

63
00:02:54.240 --> 00:02:57.780
Because without this turned on,

64
00:02:57.780 --> 00:03:01.830
the model essentially recalculates all previous tokens

65
00:03:01.830 --> 00:03:04.290
just to derive the next token.

66
00:03:04.290 --> 00:03:07.860
With flash attention and this caching enabled,

67
00:03:07.860 --> 00:03:12.860
it can simply load those old calculations from memory

68
00:03:12.930 --> 00:03:16.380
and therefore derive the next token quicker.

69
00:03:16.380 --> 00:03:18.933
Now, at the point in time when I'm recording this,

70
00:03:18.933 --> 00:03:21.030
this is an experimental feature,

71
00:03:21.030 --> 00:03:23.160
and as this message here tells us,

72
00:03:23.160 --> 00:03:25.590
it may cause issues with some models,

73
00:03:25.590 --> 00:03:29.700
because not all models are great at using this mechanism

74
00:03:29.700 --> 00:03:32.700
or are aware of this technique.

75
00:03:32.700 --> 00:03:36.450
You can nonetheless consider or try turning this on

76
00:03:36.450 --> 00:03:38.520
and simply experiment with the model

77
00:03:38.520 --> 00:03:40.230
with that being turned on

78
00:03:40.230 --> 00:03:42.180
to see whether it makes a difference

79
00:03:42.180 --> 00:03:44.133
and whether it works well for you.

80
00:03:45.180 --> 00:03:47.640
For example, here with that turned on,

81
00:03:47.640 --> 00:03:52.640
if I now ask it to write an essay about a rabbit,

82
00:03:54.779 --> 00:03:57.570
it's quite quick at doing so,

83
00:03:57.570 --> 00:04:00.873
and the text also doesn't look too bad to me.

84
00:04:03.090 --> 00:04:06.390
And it generated those 1,000 tokens

85
00:04:06.390 --> 00:04:10.773
at a rate of around 31.3 tokens per second.

86
00:04:12.060 --> 00:04:13.650
Now, if I reload this model,

87
00:04:13.650 --> 00:04:16.923
but I turn off flash attention and KV cache,

88
00:04:18.210 --> 00:04:22.443
and I use that same prompt in a new chat thereafter,

89
00:04:24.540 --> 00:04:27.333
it, of course, again goes ahead and generates a text.

90
00:04:29.970 --> 00:04:33.270
And for me here, it generated 1,000 tokens

91
00:04:33.270 --> 00:04:35.970
at a rate of only 23 tokens per second.

92
00:04:35.970 --> 00:04:39.903
So quite a bit slower than with flash attention turned on.

93
00:04:40.770 --> 00:04:43.680
And that's precisely the idea behind flash attention

94
00:04:43.680 --> 00:04:45.870
and KV caching.

95
00:04:45.870 --> 00:04:48.930
The idea is to give you faster generations,

96
00:04:48.930 --> 00:04:52.830
but you should carefully check whether the results are good

97
00:04:52.830 --> 00:04:56.310
or if your model is having any problems with these settings.

98
00:04:56.310 --> 00:04:59.970
In that case, you should of course turn off these settings.

99
00:04:59.970 --> 00:05:02.010
But of all these more advanced settings,

100
00:05:02.010 --> 00:05:04.650
these are the ones I would consider turning on

101
00:05:04.650 --> 00:05:06.603
to get faster inference.

