WEBVTT

1
00:00:00.720 --> 00:00:02.623
<v Instructor>The great thing about Ollama</v>

2
00:00:02.623 --> 00:00:06.960
is that it has a quite powerful API

3
00:00:06.960 --> 00:00:09.270
that essentially allows you

4
00:00:09.270 --> 00:00:12.930
to do pretty much everything you can do with Ollama

5
00:00:12.930 --> 00:00:14.790
through the command line.

6
00:00:14.790 --> 00:00:16.770
So for example, create models,

7
00:00:16.770 --> 00:00:19.800
list available models, remove models.

8
00:00:19.800 --> 00:00:24.800
You can do all these things through their API as well.

9
00:00:25.500 --> 00:00:28.350
And attached, you'll find a link to the API documentation,

10
00:00:28.350 --> 00:00:30.030
which again, is unfortunately

11
00:00:30.030 --> 00:00:32.910
just a markdown file on GitHub.

12
00:00:32.910 --> 00:00:35.580
But there you learn that you could also create a model

13
00:00:35.580 --> 00:00:39.090
by sending a post request to the right endpoint.

14
00:00:39.090 --> 00:00:42.003
You would have to specify a name for the model,

15
00:00:43.110 --> 00:00:45.750
and then basically specify all the details

16
00:00:45.750 --> 00:00:48.960
you would otherwise put into your model file,

17
00:00:48.960 --> 00:00:52.830
like the base model, maybe a template,

18
00:00:52.830 --> 00:00:57.720
parameters or messages, or a system message.

19
00:00:57.720 --> 00:01:01.770
So you could build a model programmatically.

20
00:01:01.770 --> 00:01:04.590
And of course, for your own use on your system,

21
00:01:04.590 --> 00:01:07.590
you might not wanna do that using a model file,

22
00:01:07.590 --> 00:01:10.770
and ollama create might be more convenient.

23
00:01:10.770 --> 00:01:12.000
But of course if you're building

24
00:01:12.000 --> 00:01:13.890
some kind of application which you want

25
00:01:13.890 --> 00:01:16.890
to expose to other users from all over the world,

26
00:01:16.890 --> 00:01:18.810
and you wanna automate the process

27
00:01:18.810 --> 00:01:20.970
of building customized models,

28
00:01:20.970 --> 00:01:23.640
then this could be a useful feature.

29
00:01:23.640 --> 00:01:24.960
And it's not just that,

30
00:01:24.960 --> 00:01:27.780
you can really do everything else as well,

31
00:01:27.780 --> 00:01:30.720
including deleting or pulling models.

32
00:01:30.720 --> 00:01:33.060
You can also generate vector embeddings,

33
00:01:33.060 --> 00:01:36.060
which can be useful if you're building RAG systems,

34
00:01:36.060 --> 00:01:38.970
but you can of course also generate a completion

35
00:01:38.970 --> 00:01:41.040
or a chat completion that takes

36
00:01:41.040 --> 00:01:43.830
entire chat history into account.

37
00:01:43.830 --> 00:01:46.650
And I got an example for that here.

38
00:01:46.650 --> 00:01:48.630
It's a very simple Python script.

39
00:01:48.630 --> 00:01:51.630
And of course, you find all that code attached.

40
00:01:51.630 --> 00:01:54.090
And in that script, I send a request to

41
00:01:54.090 --> 00:01:59.090
that /api/generate endpoint, which is this endpoint here,

42
00:01:59.820 --> 00:02:01.770
to generate a simple completion.

43
00:02:01.770 --> 00:02:05.640
So a simple output for a received prompt.

44
00:02:05.640 --> 00:02:09.420
Now this is sent to localhost:11434,

45
00:02:09.420 --> 00:02:13.590
so that's the port that Ollama uses for exposing the API.

46
00:02:13.590 --> 00:02:15.420
That's the default port it uses.

47
00:02:15.420 --> 00:02:17.580
You could change that with environment variables,

48
00:02:17.580 --> 00:02:20.133
but by default, that is the port it will use.

49
00:02:21.000 --> 00:02:24.030
And then here, I'm configuring the request data

50
00:02:24.030 --> 00:02:25.695
I'm sending along with the request,

51
00:02:25.695 --> 00:02:29.100
specifically the model that should be used,

52
00:02:29.100 --> 00:02:33.150
the prompt, and whether I want to stream the result

53
00:02:33.150 --> 00:02:35.283
or just get it in one block.

54
00:02:36.360 --> 00:02:39.300
And for the model name and the prompt text

55
00:02:39.300 --> 00:02:41.190
here in this Python code,

56
00:02:41.190 --> 00:02:43.650
I simply have two parameters

57
00:02:43.650 --> 00:02:46.590
in this generate response function.

58
00:02:46.590 --> 00:02:47.760
And the concrete values

59
00:02:47.760 --> 00:02:50.850
for these parameters are set down there.

60
00:02:50.850 --> 00:02:54.270
So the model is essentially my gemma3 model,

61
00:02:54.270 --> 00:02:57.600
but I could also use my service-agent model here,

62
00:02:57.600 --> 00:03:00.600
for example, that I have on my machine.

63
00:03:00.600 --> 00:03:03.660
So this model which I created before,

64
00:03:03.660 --> 00:03:06.780
and then here the prompt is hard coded into the file.

65
00:03:06.780 --> 00:03:08.749
But of course, since this is Python,

66
00:03:08.749 --> 00:03:11.100
I could also use the input function

67
00:03:11.100 --> 00:03:13.620
to ask the user for input.

68
00:03:13.620 --> 00:03:15.954
And now, whenever I execute that file,

69
00:03:15.954 --> 00:03:20.130
I would have to enter an input manually in the command line,

70
00:03:20.130 --> 00:03:22.473
and then that would be sent to the model.

71
00:03:23.430 --> 00:03:24.960
Well, and then all that's happening

72
00:03:24.960 --> 00:03:28.860
is that a post request is sent to that Ollama server,

73
00:03:28.860 --> 00:03:32.010
because a post request is expected

74
00:03:32.010 --> 00:03:34.980
and then my prompt is handled by that AI model,

75
00:03:34.980 --> 00:03:39.720
and the response is then parsed, returned by this function,

76
00:03:39.720 --> 00:03:44.720
and in the end then, here, this response is being output.

77
00:03:45.030 --> 00:03:48.393
Or to be precise, here I output the response text.

78
00:03:49.590 --> 00:03:52.148
And you can learn all about the possible settings

79
00:03:52.148 --> 00:03:57.148
you can set when sending such a request to the Ollama API,

80
00:03:57.510 --> 00:04:00.270
and more about the response format

81
00:04:00.270 --> 00:04:03.000
in the official documentation, of course.

82
00:04:03.000 --> 00:04:05.250
So therefore here, if I execute

83
00:04:05.250 --> 00:04:08.130
this Ollama API file with Python,

84
00:04:08.130 --> 00:04:11.040
I can ask what's LM Studio?

85
00:04:11.040 --> 00:04:13.350
And you might recall that in my system prompt

86
00:04:13.350 --> 00:04:15.900
for this model I'm using here,

87
00:04:15.900 --> 00:04:19.050
I actually told it that it should not help

88
00:04:19.050 --> 00:04:22.533
with questions related to LM Studio.

89
00:04:23.490 --> 00:04:25.710
And that's why indeed here, it tells me

90
00:04:25.710 --> 00:04:27.510
that it's a friendly and helpful assistant,

91
00:04:27.510 --> 00:04:30.000
but that it's not able to provide information

92
00:04:30.000 --> 00:04:33.060
about the specific software I asked for.

93
00:04:33.060 --> 00:04:37.187
Of course, if I would switch back to gemma3:12b-it-qat,

94
00:04:39.720 --> 00:04:41.370
which is that base model without

95
00:04:41.370 --> 00:04:44.820
any special system instructions or anything like that,

96
00:04:44.820 --> 00:04:47.163
if I would ask that same question here,

97
00:04:48.330 --> 00:04:51.540
it would take a while to generate that response,

98
00:04:51.540 --> 00:04:54.120
but then I would get a response that describes

99
00:04:54.120 --> 00:04:57.240
in detail what LM Studio is,

100
00:04:57.240 --> 00:04:59.850
that it's a popular desktop application

101
00:04:59.850 --> 00:05:01.920
that allows you to run large language models

102
00:05:01.920 --> 00:05:03.393
locally on your computer.

103
00:05:04.440 --> 00:05:09.440
And we also get a lot of metadata related to that response.

104
00:05:09.480 --> 00:05:13.410
So that's all that part here, including all those token IDs.

105
00:05:13.410 --> 00:05:16.653
And I get that simply because I'm outputting that here.