WEBVTT

00:00.000 --> 00:02.550
-: Hey, I am gonna show you how to test

00:02.550 --> 00:04.980
open source models on your local computer.

00:04.980 --> 00:07.800
This is a tool called LM Studio.

00:07.800 --> 00:10.170
I found this is working really well for me,

00:10.170 --> 00:12.960
specifically on my M2 Mac,

00:12.960 --> 00:17.220
although it should work on your PC as well.

00:17.220 --> 00:19.350
I'm not gonna go through the download instructions

00:19.350 --> 00:22.110
because I think if you want to install it,

00:22.110 --> 00:24.720
you should just go to lmstudio.ai

00:24.720 --> 00:26.700
and then you can download here.

00:26.700 --> 00:27.533
It should just work.

00:27.533 --> 00:29.160
It worked straight away from me,

00:29.160 --> 00:31.350
which is why I was really attracted to it.

00:31.350 --> 00:34.380
But I would say the discord is really good.

00:34.380 --> 00:37.080
People got back to me straight away when I wanted to know

00:37.080 --> 00:40.290
how to make sure he's using my GPU on my computer.

00:40.290 --> 00:41.700
It's a really active community

00:41.700 --> 00:44.040
and it just makes things super easy.

00:44.040 --> 00:47.220
Cool, now the way this works is you need to search

00:47.220 --> 00:48.053
for a model.

00:48.053 --> 00:52.020
Say for example, if you wanted the new Mistral Model,

00:52.020 --> 00:54.360
which is a pretty small model,

00:54.360 --> 00:56.430
you can run it on your local computer.

00:56.430 --> 00:57.540
Then you can search

00:57.540 --> 00:59.910
and it comes up with the Hugging Face Hub

00:59.910 --> 01:02.910
and you can see like which ones have the most downloads

01:02.910 --> 01:05.220
and this is the Mistral model

01:05.220 --> 01:07.680
with the most downloads here, most comments.

01:07.680 --> 01:09.120
What you're looking for,

01:09.120 --> 01:11.371
all of this will not make sense to you

01:11.371 --> 01:12.660
if you don't know anything about these models,

01:12.660 --> 01:14.877
but let me explain it real quick.

01:14.877 --> 01:18.420
So the Bloke is basically like a guy or an organization

01:18.420 --> 01:21.510
that they gets some funding from, I think A16Z.

01:21.510 --> 01:24.600
But he basically just posts these fine tuned models

01:24.600 --> 01:26.130
on Hugging Face all the time,

01:26.130 --> 01:27.210
all the new models that come out.

01:27.210 --> 01:28.830
He'll put it on Hugging Face

01:28.830 --> 01:32.040
or also train kind of new fine tuned models as well.

01:32.040 --> 01:35.640
So when you see like this one is Mitral OpenOrca

01:35.640 --> 01:38.700
that's trained on the OpenOrca datasets fine tuned.

01:38.700 --> 01:42.690
And then there's like also OpenHermes, yarn,

01:42.690 --> 01:44.400
there's a German one.

01:44.400 --> 01:46.830
So yeah, these are all fine tuned models,

01:46.830 --> 01:48.780
but the main one that we want

01:48.780 --> 01:51.300
to use here is Mistral-70B-Instruct.

01:51.300 --> 01:54.600
That is the chat version of the Mistral model

01:54.600 --> 01:56.100
that was released to open source.

01:56.100 --> 01:58.260
So if you want the normal Mistral model, that's that,

01:58.260 --> 01:59.790
but that's similar to GPT-3

01:59.790 --> 02:01.650
where you have to do a lot of prompting

02:01.650 --> 02:04.380
to get it to talk back to you and full instructions.

02:04.380 --> 02:08.700
Whereas the chat model is much better out of the box.

02:08.700 --> 02:13.080
And typically I just download the GGUF file as well

02:13.080 --> 02:15.030
because it's a different file formats,

02:15.030 --> 02:16.706
it doesn't really matter that much,

02:16.706 --> 02:18.960
but that's the one that I've been using and it works fine.

02:18.960 --> 02:21.540
The quantization stuff over here,

02:21.540 --> 02:22.920
this is basically a technique

02:22.920 --> 02:27.090
to remove some zeros from the numbers in the model.

02:27.090 --> 02:29.850
Like the parameters, it makes it slightly less accurate

02:29.850 --> 02:31.830
because if it was,

02:31.830 --> 02:34.170
if they have say 7 billion parameters,

02:34.170 --> 02:36.480
if you remove a few zeros from each parameter,

02:36.480 --> 02:38.670
it makes that parameter a little bit less accurate

02:38.670 --> 02:40.530
to the fewer decimal places,

02:40.530 --> 02:43.470
but it really improves the size and the speed of the model.

02:43.470 --> 02:45.090
Typically what I do here,

02:45.090 --> 02:48.090
and these are different quantization techniques,

02:48.090 --> 02:49.170
what I typically go for,

02:49.170 --> 02:50.040
I ignore everything,

02:50.040 --> 02:53.280
I just go for the K_M, Q4_K_M

02:53.280 --> 02:55.500
and you can see for every single model

02:55.500 --> 02:57.360
there's like a KM Q4_K_M

02:57.360 --> 02:59.580
and you can see I've downloaded that as well.

02:59.580 --> 03:02.280
And the reason why I choose that one is it usually says

03:02.280 --> 03:05.160
it's recommended if you go for a bigger model here,

03:05.160 --> 03:09.210
so you can see like the Q8 here is seven gigabytes

03:09.210 --> 03:10.320
a little bit bigger

03:10.320 --> 03:13.200
and then there's also like K_S versus K_M,

03:13.200 --> 03:14.891
but I don't think you need

03:14.891 --> 03:15.724
to really think about it too much.

03:15.724 --> 03:16.980
The other thing is that you can click

03:16.980 --> 03:18.918
through to the model card here

03:18.918 --> 03:19.751
if you want to read a bit more

03:19.751 --> 03:22.020
about the model and why it was made.

03:22.020 --> 03:24.900
Cool, the the next thing you need to do,

03:24.900 --> 03:27.660
so actually let me just stop this over here

03:27.660 --> 03:29.850
and just show you what you can actually do.

03:29.850 --> 03:32.220
You can actually run this just like ChatGPT,

03:32.220 --> 03:34.530
but you're running it locally in the browser

03:34.530 --> 03:38.790
or actually in the application and it does a pretty good job

03:38.790 --> 03:40.980
and you can export the screenshots.

03:40.980 --> 03:43.050
The other cool thing is it tells you token speed,

03:43.050 --> 03:43.920
stuff like that.

03:43.920 --> 03:46.410
And basically you just need to load the right model

03:46.410 --> 03:48.060
and then choose the right preset,

03:48.060 --> 03:50.730
here I've chosen and Mistral Instruct

03:50.730 --> 03:52.920
because each of these models,

03:52.920 --> 03:55.350
they require a different way of sending the prompt

03:55.350 --> 03:57.270
and this will basically just send a prompt

03:57.270 --> 03:58.440
in the right format.

03:58.440 --> 04:00.900
So this one it needs to have the system message

04:00.900 --> 04:04.290
and then this INST token and then user message

04:04.290 --> 04:06.420
and then close the INST token

04:06.420 --> 04:07.620
and then the Assistant Message,

04:07.620 --> 04:09.000
it formats that for you.

04:09.000 --> 04:11.190
Otherwise the model doesn't tend to work.

04:11.190 --> 04:13.320
You see ads or this, this is the system message.

04:13.320 --> 04:16.800
You could change this if you wanted but I've kept it there.

04:16.800 --> 04:19.050
But now if you're gonna chat to this

04:19.050 --> 04:21.960
by the way in here, you can do that.

04:21.960 --> 04:25.890
But one thing I would say you want to do is

04:25.890 --> 04:30.890
if you go into Tools,

04:37.350 --> 04:40.290
and then instead of, here we go,

04:40.290 --> 04:43.350
yeah, under Tools or Model Initialization,

04:43.350 --> 04:44.430
there's a couple of options here.

04:44.430 --> 04:46.680
So one is keep the entire model in RAM

04:46.680 --> 04:48.030
and what that would do is basically,

04:48.030 --> 04:49.500
it could make your computer crash

04:49.500 --> 04:52.800
but it will make sure that it goes as fast as possible.

04:52.800 --> 04:54.030
That's one thing you can do.

04:54.030 --> 04:56.190
But the main thing you need to do is click here.

04:56.190 --> 04:59.010
So like I have Apple Metal GPU clicked

04:59.010 --> 05:01.140
and that's gonna basically mean

05:01.140 --> 05:04.500
that it will use my GPU while the model is generating.

05:04.500 --> 05:05.730
So it'll be a lot faster.

05:05.730 --> 05:07.590
Anyway, that's the main thing you need to do.

05:07.590 --> 05:09.510
Anyway if you, but what I'm gonna show you is how

05:09.510 --> 05:11.700
to use it locally in your code.

05:11.700 --> 05:14.370
The cool thing that they did is they made it,

05:14.370 --> 05:16.560
I give you an option to set up a server

05:16.560 --> 05:19.980
and this is in port 1234 with logging and everything

05:19.980 --> 05:21.720
and automatic prompt formatting.

05:21.720 --> 05:24.957
And basically they have mirrored the OpenAI API.

05:24.957 --> 05:27.180
And here I have Mistral Instruct

05:27.180 --> 05:29.213
and I have Mistral Instruct Preset here,

05:29.213 --> 05:31.860
but it's given me an endpoint

05:31.860 --> 05:34.770
like this chat completions that I can call.

05:34.770 --> 05:37.410
They actually also mirrored the OpenAI API.

05:37.410 --> 05:40.980
So all you have to do is change the API base to this URL

05:40.980 --> 05:43.890
and then all of your OpenAI code should work.

05:43.890 --> 05:45.900
However, this is already outdated

05:45.900 --> 05:50.430
because they updated to version one of the API for OpenAI.

05:50.430 --> 05:53.040
So if you've already upgraded then this won't work.

05:53.040 --> 05:56.220
So that's why I'm using this curl request here

05:56.220 --> 05:57.660
'cause that's just more generic,

05:57.660 --> 05:59.790
it's not using the OpenAI library.

05:59.790 --> 06:01.607
So anyway, so that is,

06:01.607 --> 06:04.590
what we're gonna be making use of today.

06:04.590 --> 06:07.260
And now I'll just jump over to the code.

06:07.260 --> 06:08.760
So a couple of things I've done

06:08.760 --> 06:10.350
here is I've imported requests,

06:10.350 --> 06:12.930
I've imported JSON library

06:12.930 --> 06:14.430
and then I've set the URL.

06:14.430 --> 06:16.890
So this is localhost:1234.

06:16.890 --> 06:19.410
So this is running on my own local computer.

06:19.410 --> 06:23.460
And then the format of how you send to OpenAI is the same

06:23.460 --> 06:25.350
as how you send to this model

06:25.350 --> 06:28.020
and the LM Studio's gonna do all the hard part

06:28.020 --> 06:29.610
to split it out for you.

06:29.610 --> 06:31.590
So I'm just gonna run that

06:31.590 --> 06:33.300
and it's gonna format the prompt for me

06:33.300 --> 06:36.060
and send it back and here we go.

06:36.060 --> 06:37.860
Yeah, and this is the response.

06:37.860 --> 06:40.800
So just the same sort of response you get from OpenAI.

06:40.800 --> 06:42.180
So that's pretty cool.

06:42.180 --> 06:44.370
Now the code that I've run here,

06:44.370 --> 06:47.520
this is just how you would test these types of models.

06:47.520 --> 06:49.740
I've just brought in here a time

06:49.740 --> 06:51.510
so I can measure the latency

06:51.510 --> 06:53.940
and then I've also brought in Matplotlib

06:53.940 --> 06:55.860
so I can graph it afterwards.

06:55.860 --> 06:58.050
But the, the key here is that

06:58.050 --> 07:02.400
I test mistral-7B against GPT-3.5-turbo

07:02.400 --> 07:04.170
and against GPT-4.

07:04.170 --> 07:06.927
And for GPT-4 and GT 3.5-turbo,

07:06.927 --> 07:09.090
I'm gonna be calling the Completions API.

07:09.090 --> 07:10.650
But for Mistral it's gonna be called,

07:10.650 --> 07:11.760
it's gonna be exactly the same,

07:11.760 --> 07:13.920
but it's just gonna be calling my localhost.

07:13.920 --> 07:15.510
So I've set up this function here,

07:15.510 --> 07:18.030
the default is 3.5-turbo,

07:18.030 --> 07:22.650
but basically what it does is gets my API key for OpenAI,

07:22.650 --> 07:26.070
puts the prompt in here and the system prompt.

07:26.070 --> 07:30.510
And then we get back the response

07:30.510 --> 07:32.610
and here's the system prompts.

07:32.610 --> 07:33.990
It could actually be the system prompt

07:33.990 --> 07:35.700
or it could just be the user.

07:35.700 --> 07:37.560
So if you set user here

07:37.560 --> 07:40.410
and then you could set the system prompt separately

07:40.410 --> 07:41.610
if you want.

07:41.610 --> 07:43.990
I'm just gonna set the system prompt is

07:45.817 --> 07:49.227
"You are a helpful system."

07:50.310 --> 07:52.470
And you can make that whatever you want.

07:52.470 --> 07:55.440
But this is gonna set in the prompt as a user

07:55.440 --> 07:57.420
and then it is gonna fill in the temperature

07:57.420 --> 07:58.860
and everything like that as well.

07:58.860 --> 07:59.880
But this is my prompt,

07:59.880 --> 08:02.190
this is what I was using before.

08:02.190 --> 08:05.580
And what it's gonna do is it's gonna run through each model

08:05.580 --> 08:07.680
and then it's going to create an object for,

08:07.680 --> 08:11.340
to hold the latency and the response or the responses

08:11.340 --> 08:13.680
and then it's gonna go through 10 times

08:13.680 --> 08:16.200
and call the model 10 times.

08:16.200 --> 08:18.600
It's gonna time it, so it's gonna start the timer

08:18.600 --> 08:20.910
and then end the timer after it gets the response back

08:20.910 --> 08:23.280
and it's gonna add those to the object.

08:23.280 --> 08:26.910
And then basically we're gonna have

08:26.910 --> 08:29.250
then this result object filled in

08:29.250 --> 08:33.390
with 10 calls to each of the models alongside the latency.

08:33.390 --> 08:35.400
So then we're just gonna plot it afterwards as well.

08:35.400 --> 08:37.050
So I'm just gonna run that

08:37.050 --> 08:41.130
and that's gonna create the chart at the end.

08:41.130 --> 08:43.380
Now the reason why you'd want to do this

08:43.380 --> 08:44.843
while we're running this is just that,

08:44.843 --> 08:46.680
there's a few different reasons actually.

08:46.680 --> 08:49.350
One is, if you can get better results

08:49.350 --> 08:51.030
from an open source model

08:51.030 --> 08:52.620
or if you can get at least good enough results

08:52.620 --> 08:54.360
from an open source model with your prompt,

08:54.360 --> 08:57.330
then you might want to be using that locally instead.

08:57.330 --> 08:59.250
Like you don't need the internet to use this,

08:59.250 --> 09:02.350
which is very cool because if you're on flight

09:02.350 --> 09:04.170
or you don't have, you have spotty Wi-Fi,

09:04.170 --> 09:06.510
you can still be using an LLM locally,

09:06.510 --> 09:07.470
which is pretty good.

09:07.470 --> 09:10.200
The other reason why it's useful is that

09:10.200 --> 09:11.460
it could be a lot cheaper, right?

09:11.460 --> 09:13.200
If you are hosting this yourself,

09:13.200 --> 09:14.940
you're just paying for the inference costs,

09:14.940 --> 09:17.932
you're not paying any markup to OpenAI.

09:17.932 --> 09:19.290
So that can be really helpful.

09:19.290 --> 09:21.630
A third is flexibility or freedom.

09:21.630 --> 09:23.100
So if you're building a business

09:23.100 --> 09:25.890
and you're worried about things going wrong in OpenAI,

09:25.890 --> 09:28.860
what happens if they fire (indistinct) again

09:28.860 --> 09:30.450
or some sort of disruption happens,

09:30.450 --> 09:32.280
like their API goes down quite a bit

09:32.280 --> 09:35.010
and so having a backup, I think that's open source

09:35.010 --> 09:36.360
that it can't be taken down.

09:36.360 --> 09:38.010
It's literally just code on your computer

09:38.010 --> 09:40.560
so nobody can stop you from using it,

09:40.560 --> 09:42.060
which I think is really powerful

09:42.060 --> 09:44.130
and you can build a business around that as well.

09:44.130 --> 09:45.930
You can host it on your own servers

09:45.930 --> 09:48.450
and then you never have any issues

09:48.450 --> 09:51.660
with reliability from other hosts.

09:51.660 --> 09:55.200
Especially if OpenAI starts having to curtail some

09:55.200 --> 09:56.700
of the uses of of its API,

09:56.700 --> 10:00.330
'cause it gets pressure from Microsoft or whatever.

10:00.330 --> 10:02.250
So anyways, that's what we're run now

10:02.250 --> 10:04.110
and we can see the difference

10:04.110 --> 10:06.390
and we can see that Mistral,

10:06.390 --> 10:09.300
it was faster than GPT-4 on average.

10:09.300 --> 10:12.360
Let's, so this is latency is how many seconds it took

10:12.360 --> 10:14.460
to get back the response.

10:14.460 --> 10:17.640
And then you can see 3.5-turbo is still a lot faster

10:17.640 --> 10:19.440
and pretty cheap as well.

10:19.440 --> 10:22.110
The 3.5-turbo, this probably cost less than a penny.

10:22.110 --> 10:24.210
So anyway that's pretty interesting.

10:24.210 --> 10:27.930
We could also, because now we have this results object,

10:27.930 --> 10:30.450
we can just see what it came back

10:30.450 --> 10:32.460
with I hear are the responses.

10:32.460 --> 10:34.770
You can see that with Mistral

10:34.770 --> 10:37.020
it was pretty good at following instructions,

10:37.020 --> 10:38.717
although sometimes it does put here,

10:38.717 --> 10:40.530
it put the numbers in here,

10:40.530 --> 10:42.060
number one, number two, number three,

10:42.060 --> 10:43.410
and that's not what we wanted.

10:43.410 --> 10:46.350
Whereas if you look at say,

10:46.350 --> 10:49.890
here are the responses for GT 3.5-turbo, it never did that.

10:49.890 --> 10:52.080
So that is a little bit more reliable.

10:52.080 --> 10:54.780
And then GPT-4 obviously is really good.

10:54.780 --> 10:56.910
The other big benefit of doing

10:56.910 --> 10:58.800
this multiple times is you just see

10:58.800 --> 11:01.890
how reliable it is and how often the same names come.

11:01.890 --> 11:04.950
So pretty cool that OmniFit shoes,

11:04.950 --> 11:07.140
OmniFit comes up for all of them.

11:07.140 --> 11:11.160
OmniFit walkers comes up pretty commonly for GPT-4.

11:11.160 --> 11:13.290
It's actually more deterministic it feels

11:13.290 --> 11:14.220
than the other responses.

11:14.220 --> 11:16.500
So seeing this just gives you a good sense

11:16.500 --> 11:19.260
of what are the differences between the models.

11:19.260 --> 11:21.840
I would say that Mistrals pretty good here though,

11:21.840 --> 11:23.040
especially for a tiny model.

11:23.040 --> 11:26.053
If you think about GPT-4 is like a trillion parameters.

11:26.053 --> 11:29.910
A GPT-3.5 I think is like

11:29.910 --> 11:32.280
something like 50 billion parameters

11:32.280 --> 11:34.380
and Mistral is only 7 billion parameters

11:34.380 --> 11:36.360
and it's still get hangs tight.

11:36.360 --> 11:37.680
Some of these names are even better

11:37.680 --> 11:39.390
actually I think AdaptiFit.

11:39.390 --> 11:43.050
So that's AdaptaFit, which interesting OmniFit Shoes,

11:43.050 --> 11:46.800
but there's also like FootMorph I think is pretty creative

11:46.800 --> 11:49.473
and then FootMate, but that's cool as well.

11:50.732 --> 11:51.895
I don't see that anywhere else.

11:51.895 --> 11:52.890
But yeah, I think it is definitely a contender

11:52.890 --> 11:54.240
and it's worth checking out.

11:54.240 --> 11:55.650
So yeah, test your prompts

11:55.650 --> 11:57.483
and hopefully you get good results.