WEBVTT

1
00:00:00.360 --> 00:00:03.510
<v Maximilian>Now when it comes to setting up model files,</v>

2
00:00:03.510 --> 00:00:06.720
there's one interesting instruction you can add,

3
00:00:06.720 --> 00:00:09.330
and that's the template instruction.

4
00:00:09.330 --> 00:00:11.550
The template instruction allows you

5
00:00:11.550 --> 00:00:14.580
to specify a, well, template

6
00:00:14.580 --> 00:00:16.830
that will be used by the model.

7
00:00:16.830 --> 00:00:19.800
By the way, you can add multi-line instructions

8
00:00:19.800 --> 00:00:22.020
also for the system prompt, for example,

9
00:00:22.020 --> 00:00:24.870
by wrapping it with triple quotes,

10
00:00:24.870 --> 00:00:27.476
just as you can do it when chatting with Ollama

11
00:00:27.476 --> 00:00:28.710
through Ollama Run.

12
00:00:28.710 --> 00:00:30.153
But that's just a side note.

13
00:00:30.990 --> 00:00:33.480
But what is such a template then?

14
00:00:33.480 --> 00:00:36.300
What is this weird thing here?

15
00:00:36.300 --> 00:00:40.770
Well, it is also something you find on the model detail page

16
00:00:40.770 --> 00:00:42.840
on the Ollama website.

17
00:00:42.840 --> 00:00:45.690
There you have this template part.

18
00:00:45.690 --> 00:00:48.750
And for example, there you can inspect the default template

19
00:00:48.750 --> 00:00:50.913
set for the gemma3 models.

20
00:00:51.960 --> 00:00:54.510
This also looks rather weird.

21
00:00:54.510 --> 00:00:58.740
You'll also find templates on Hugging Face.

22
00:00:58.740 --> 00:01:03.030
So for example, here for this quantized Qwen3 model,

23
00:01:03.030 --> 00:01:06.870
there is a chat template available.

24
00:01:06.870 --> 00:01:10.143
And if I click this, I get an even weirder template.

25
00:01:11.010 --> 00:01:13.350
So what is all that about?

26
00:01:13.350 --> 00:01:15.570
Well, you must not forget

27
00:01:15.570 --> 00:01:20.070
that all these Large Language Models are token generators.

28
00:01:20.070 --> 00:01:23.640
They generate tokens based on the input tokens.

29
00:01:23.640 --> 00:01:25.380
That's all they do.

30
00:01:25.380 --> 00:01:26.790
And therefore,

31
00:01:26.790 --> 00:01:30.930
they also don't know about things like a system message,

32
00:01:30.930 --> 00:01:32.580
or user messages,

33
00:01:32.580 --> 00:01:36.810
or assistant messages that have been generated by the AI.

34
00:01:36.810 --> 00:01:41.810
So in order to really understand, kind of, the chat history,

35
00:01:42.780 --> 00:01:45.360
or things like a system message,

36
00:01:45.360 --> 00:01:49.500
these models must be taught, so to say,

37
00:01:49.500 --> 00:01:51.540
during the training process,

38
00:01:51.540 --> 00:01:55.560
that certain identifiers in a chat history

39
00:01:55.560 --> 00:01:57.990
mean certain things.

40
00:01:57.990 --> 00:02:02.990
For example, the gemma3 model was trained to understand

41
00:02:03.480 --> 00:02:08.480
that if it detects this start of turn identifier here,

42
00:02:09.270 --> 00:02:11.160
which is written exactly like this,

43
00:02:11.160 --> 00:02:15.510
with &lt;start_of_turn&gt;,

44
00:02:15.510 --> 00:02:18.540
that means that a new message starts.

45
00:02:18.540 --> 00:02:21.570
A user message or an assistant message,

46
00:02:21.570 --> 00:02:24.063
which in gemma's world is called model.

47
00:02:24.900 --> 00:02:28.200
And it was also taught to understand that

48
00:02:28.200 --> 00:02:31.320
end of turn means that a message is over,

49
00:02:31.320 --> 00:02:35.550
and that thereafter a new message might start.

50
00:02:35.550 --> 00:02:37.530
So these are identifiers

51
00:02:37.530 --> 00:02:41.460
that were used during the training process of that model

52
00:02:41.460 --> 00:02:43.650
to teach it to differentiate

53
00:02:43.650 --> 00:02:45.690
between user and assistant models,

54
00:02:45.690 --> 00:02:48.990
and potentially also understand other things.

55
00:02:48.990 --> 00:02:50.880
Though here in the gemma model case,

56
00:02:50.880 --> 00:02:52.560
it's a really simple template.

57
00:02:52.560 --> 00:02:54.363
So that's mostly it.

58
00:02:55.350 --> 00:02:58.620
You also might recall that earlier,

59
00:02:58.620 --> 00:03:01.650
when we used show info,

60
00:03:01.650 --> 00:03:04.380
we saw this weird stop parameter,

61
00:03:04.380 --> 00:03:07.350
which is set to this thing here.

62
00:03:07.350 --> 00:03:09.390
That's just another identifier

63
00:03:09.390 --> 00:03:13.020
that was used, during the training of the gemma model,

64
00:03:13.020 --> 00:03:16.920
to signal the end of a message.

65
00:03:16.920 --> 00:03:18.420
And why is this important?

66
00:03:18.420 --> 00:03:19.470
Because gemma,

67
00:03:19.470 --> 00:03:22.770
because it was trained with those identifiers,

68
00:03:22.770 --> 00:03:26.730
will, for example, insert this identifier

69
00:03:26.730 --> 00:03:31.650
whenever it generally is done with its output,

70
00:03:31.650 --> 00:03:34.350
with the response it generated.

71
00:03:34.350 --> 00:03:36.180
But because it's a token generator,

72
00:03:36.180 --> 00:03:38.040
it might keep on going thereafter

73
00:03:38.040 --> 00:03:40.980
because it doesn't have any logical understanding

74
00:03:40.980 --> 00:03:42.690
of what it means to stop.

75
00:03:42.690 --> 00:03:46.440
It just was trained to have a high likelihood

76
00:03:46.440 --> 00:03:49.530
of inserting, or of outputting, this token here,

77
00:03:49.530 --> 00:03:52.650
after outputting enough other tokens that make sense

78
00:03:52.650 --> 00:03:55.380
for the prompt it received.

79
00:03:55.380 --> 00:03:59.040
So the likelihood of this token being output by the model

80
00:03:59.040 --> 00:04:00.750
simply increases

81
00:04:00.750 --> 00:04:04.950
the more complete the response it gave got.

82
00:04:04.950 --> 00:04:08.760
That all happens due to its training process.

83
00:04:08.760 --> 00:04:12.480
The surrounding application, Ollama in this case,

84
00:04:12.480 --> 00:04:16.890
then is able to capture this token to see it in the output,

85
00:04:16.890 --> 00:04:20.970
and stop the model from producing more tokens.

86
00:04:20.970 --> 00:04:23.670
That's how the model works under the hood.

87
00:04:23.670 --> 00:04:26.490
And that's also the idea behind the template.

88
00:04:26.490 --> 00:04:29.040
This template needs to pick up the structure

89
00:04:29.040 --> 00:04:32.310
that was used during the training process of the model

90
00:04:32.310 --> 00:04:35.910
to steer the model into the right direction,

91
00:04:35.910 --> 00:04:38.520
to wrap the chat history message

92
00:04:38.520 --> 00:04:42.660
into a structure the model knows, in quotes,

93
00:04:42.660 --> 00:04:45.210
from the training process.

94
00:04:45.210 --> 00:04:46.950
That's normally not something you,

95
00:04:46.950 --> 00:04:49.710
as a user of these models, need to worry about.

96
00:04:49.710 --> 00:04:50.543
You can ignore that.

97
00:04:50.543 --> 00:04:52.740
For you it just works.

98
00:04:52.740 --> 00:04:53.580
But under the hood,

99
00:04:53.580 --> 00:04:56.850
we are dealing with dumb token generators,

100
00:04:56.850 --> 00:04:59.040
and that's why something like a template,

101
00:04:59.040 --> 00:05:02.190
or as such a stop token, is important

102
00:05:02.190 --> 00:05:04.860
to steer the model into the right direction,

103
00:05:04.860 --> 00:05:06.030
and to make sure

104
00:05:06.030 --> 00:05:09.270
that it truly understands the chat history,

105
00:05:09.270 --> 00:05:12.540
and that the chat history has the same format, so to say,

106
00:05:12.540 --> 00:05:16.380
as the training data did during the training process.

107
00:05:16.380 --> 00:05:19.740
Because without a properly set up template,

108
00:05:19.740 --> 00:05:23.160
the model will essentially just produce garbage,

109
00:05:23.160 --> 00:05:26.250
because it would receive input that's not in line

110
00:05:26.250 --> 00:05:29.010
with the structure it saw in its training data,

111
00:05:29.010 --> 00:05:32.160
and therefore the likelihood of producing tokens,

112
00:05:32.160 --> 00:05:35.100
and words that make sense, would decrease,

113
00:05:35.100 --> 00:05:38.970
and the likelihood of getting bad results would increase.

114
00:05:38.970 --> 00:05:41.730
That's why setting a good template,

115
00:05:41.730 --> 00:05:45.690
that's in line with the training process template,

116
00:05:45.690 --> 00:05:47.670
is important.

117
00:05:47.670 --> 00:05:49.830
Now, the great thing about LM Studio

118
00:05:49.830 --> 00:05:52.560
is that you don't need to worry about that at all.

119
00:05:52.560 --> 00:05:54.990
And for all Ollama, you also don't need to worry,

120
00:05:54.990 --> 00:05:57.150
if you use models from the catalog,

121
00:05:57.150 --> 00:05:59.670
or if you build your own customized versions

122
00:05:59.670 --> 00:06:02.220
based on models from that catalog.

123
00:06:02.220 --> 00:06:03.780
Because for all these models,

124
00:06:03.780 --> 00:06:08.040
the template is already predefined and baked into the model.

125
00:06:08.040 --> 00:06:13.040
But if you were to build your own model from scratch,

126
00:06:13.140 --> 00:06:14.910
because you trained your own model,

127
00:06:14.910 --> 00:06:17.040
and you got the raw parameters,

128
00:06:17.040 --> 00:06:19.980
then you would need to set up such a template

129
00:06:19.980 --> 00:06:23.430
based on the template you used in your training process

130
00:06:23.430 --> 00:06:26.430
to make sure that interacting with that model

131
00:06:26.430 --> 00:06:28.560
works as intended.

132
00:06:28.560 --> 00:06:31.290
So, long story short, you typically don't need to set that,

133
00:06:31.290 --> 00:06:33.990
but I want to explain what this is all about,

134
00:06:33.990 --> 00:06:37.893
and why it might matter in some more advanced use cases.

