WEBVTT

00:00.000 --> 00:02.520
-: All right, now we're gonna learn about fine tuning.

00:02.520 --> 00:04.230
And as of recording this video,

00:04.230 --> 00:05.910
it literally just came out yesterday.

00:05.910 --> 00:07.620
So you're getting this fresh.

00:07.620 --> 00:09.270
The API

00:09.270 --> 00:10.260
for

00:10.260 --> 00:15.260
OpenAI's 3.5 model is pretty straightforward actually.

00:15.960 --> 00:19.530
And they have some details on the documentation

00:19.530 --> 00:22.080
and also the blog post where they came out

00:22.080 --> 00:24.030
with fine tuning as well.

00:24.030 --> 00:26.910
I've just followed along with the example here,

00:26.910 --> 00:30.300
but I'm gonna explain to you what's happening at each stage.

00:30.300 --> 00:31.740
The first thing you need to do is get your

00:31.740 --> 00:33.210
data into the right format.

00:33.210 --> 00:34.920
If you are following along here,

00:34.920 --> 00:36.630
you can make a copy of this.

00:36.630 --> 00:40.710
My data from my blog is here and you can open that up

00:40.710 --> 00:42.150
and I'll show you what that looks like.

00:42.150 --> 00:45.750
This is the format, so it's called JSON L,

00:45.750 --> 00:50.310
which is like JSON, except every object is on its own line.

00:50.310 --> 00:52.770
So you can see there's no parent object

00:52.770 --> 00:55.170
and it just is a list of messages.

00:55.170 --> 00:57.720
So the messages are the system message,

00:57.720 --> 00:59.310
I've just kept that default.

00:59.310 --> 01:02.280
But if you are testing fine tuning for yourself,

01:02.280 --> 01:04.260
you already have a prompt that's working in the system

01:04.260 --> 01:08.010
message, you might want to use that prompt here.

01:08.010 --> 01:09.750
I've just gone with the default.

01:09.750 --> 01:12.180
Then I think it takes a lot of experimentation,

01:12.180 --> 01:14.940
but you might get better results if you update the system

01:14.940 --> 01:19.140
message equally, I've just used this really simple prompt,

01:19.140 --> 01:21.210
which is I write the section

01:21.210 --> 01:23.760
and then the section header for the blog posts

01:23.760 --> 01:25.440
and then the blog title.

01:25.440 --> 01:27.600
What I've done is I split, I, you know,

01:27.600 --> 01:29.820
downloaded all 48 of my blog posts.

01:29.820 --> 01:32.580
I split them into sections based on the section headers,

01:32.580 --> 01:36.420
the H one tags, and then I've put the name of that section

01:36.420 --> 01:38.550
into, into here.

01:38.550 --> 01:41.430
And then the, the blog

01:41.430 --> 01:44.790
title here, like the AI, is then responding.

01:44.790 --> 01:46.650
So obviously the AI didn't really respond, this is

01:46.650 --> 01:48.390
what I'm telling you, it should have responded.

01:48.390 --> 01:50.280
So that's how you're training it, saying like,

01:50.280 --> 01:54.450
when the user prompts like this, then gimme this type

01:54.450 --> 01:58.440
of content and my hope is gonna train it to talk like me.

01:58.440 --> 02:02.490
So that's the actual section text there.

02:02.490 --> 02:05.100
So getting into this format is, the first

02:05.100 --> 02:06.900
challenge is actually like the,

02:06.900 --> 02:08.220
you know, probably the hardest thing.

02:08.220 --> 02:11.010
What, what I did actually is I cheated a little bit

02:11.010 --> 02:13.080
and I used the code interpreter.

02:13.080 --> 02:15.450
I'll just show you kind of how I did that.

02:15.450 --> 02:17.880
I uploaded the blog posts

02:17.880 --> 02:20.040
and then I just asked it to prep it into this format

02:20.040 --> 02:21.840
and I gave it one example.

02:21.840 --> 02:23.880
I just said, prep into the format, write this section header

02:23.880 --> 02:25.890
for the blog post article title

02:25.890 --> 02:28.510
and make sure you cover all the blog posts in the CSV.

02:28.510 --> 02:29.880
And the format is like this.

02:29.880 --> 02:32.340
I gave it one example, I just took a section header,

02:32.340 --> 02:35.610
Why must agencies charge so much for one of my posts?

02:35.610 --> 02:39.450
And then I pasted in that text manually. Cool.

02:39.450 --> 02:40.740
It actually did all the work.

02:40.740 --> 02:45.740
And I can include, I'll include the code that it used.

02:45.960 --> 02:47.880
So I asked it to, I actually,

02:47.880 --> 02:51.540
so one thing is initially it had a bunch of HTML in there,

02:51.540 --> 02:53.040
so I asked it to get rid of that,

02:53.040 --> 02:55.650
but other than that it was, it worked really well

02:55.650 --> 02:59.100
and I asked it to package up all the code.

02:59.100 --> 03:02.580
So I'll include this as a .py file.

03:02.580 --> 03:04.620
So hopefully you should be able to just run this

03:04.620 --> 03:06.750
or you could give this to code interpreter

03:06.750 --> 03:10.083
and say, can you do that for, but for, for my data.

03:11.100 --> 03:13.770
Cool. And then once you ask it to generate the output,

03:13.770 --> 03:16.263
you can click and save the the file.

03:17.340 --> 03:20.010
Cool. So that's how you get the data in the right format.

03:20.010 --> 03:21.390
This is how you load my data.

03:21.390 --> 03:24.630
So if you just wanna test, you can load this from this link.

03:24.630 --> 03:27.450
The, it's just extracting the file ID

03:27.450 --> 03:30.210
and then pulling the data and then saving it locally.

03:30.210 --> 03:31.500
And then this loads the data.

03:31.500 --> 03:34.890
So you can see you have 304 observations.

03:34.890 --> 03:36.600
Typically with fine tuning,

03:36.600 --> 03:39.930
what OpenAI says is you probably need at least 50

03:39.930 --> 03:42.450
observations in order to start making,

03:42.450 --> 03:44.340
noticing a difference in the model.

03:44.340 --> 03:47.880
And I've seen in a few papers that it's typically

03:47.880 --> 03:49.233
around like 200.

03:51.090 --> 03:53.580
200 observations that you need in

03:53.580 --> 03:55.830
order to get to the point where

03:55.830 --> 03:58.440
fine tuning actually shows better results than

03:58.440 --> 04:00.840
just general prompt engineering.

04:00.840 --> 04:03.570
So I, I would say if you have 200 samples,

04:03.570 --> 04:04.950
then you're probably in a good

04:04.950 --> 04:06.762
position to start fine tuning.

04:06.762 --> 04:08.580
If you're not in that position, then try

04:08.580 --> 04:11.370
and generate 200 samples manually by getting people

04:11.370 --> 04:13.140
to label them or you labeling them.

04:13.140 --> 04:15.450
Or you could just use prompt engineering for now.

04:15.450 --> 04:17.940
And then once you get to 200 samples,

04:17.940 --> 04:21.000
then you can do this type of work.

04:21.000 --> 04:24.158
The, the code here, basically I just need to get your,

04:24.158 --> 04:26.457
get the OpenAI key, you can paste that in there

04:26.457 --> 04:28.270
and you can get that in

04:29.819 --> 04:32.703
here, view API keys, and you can create a new one.

04:33.750 --> 04:36.300
And then this is the call

04:36.300 --> 04:39.840
to the OpenAI fine tune.

04:39.840 --> 04:42.870
I guess this specifically is like the files section.

04:42.870 --> 04:45.750
So before it can fine tune on your data,

04:45.750 --> 04:46.920
it needs to have your data, right?

04:46.920 --> 04:50.970
So this is just a way of uploading that file to OpenAI

04:50.970 --> 04:52.740
and this basically just splits it

04:52.740 --> 04:55.620
and opens it in a file path and then,

04:55.620 --> 04:58.800
and then you get a response that says, here's the file ID,

04:58.800 --> 05:01.290
you know how, here's how big it is and it was uploaded.

05:01.290 --> 05:04.260
Now one thing I found is that sometimes

05:04.260 --> 05:06.450
with larger files it takes a while to upload.

05:06.450 --> 05:09.420
So if you try and run the rest of the code, it doesn't work.

05:09.420 --> 05:12.420
So I've added this section here, which is just checking on

05:12.420 --> 05:16.530
that file and it just, if you run that then, then it just

05:16.530 --> 05:18.930
finds the latest file that you uploaded

05:18.930 --> 05:22.710
and then once it says processed, then you're pretty happy.

05:22.710 --> 05:24.300
Then you can keep running.

05:24.300 --> 05:26.370
Then, and then you start a fine tuning job.

05:26.370 --> 05:27.750
So again, pretty straightforward,

05:27.750 --> 05:30.990
just call the fine tuning jobs API,

05:30.990 --> 05:33.300
again with your OpenAI key.

05:33.300 --> 05:37.080
And it is recommended that you train on this

05:37.080 --> 05:40.470
GPT-3.5 turbo 0 6 1 3.

05:40.470 --> 05:42.360
The reason is that it doesn't,

05:42.360 --> 05:44.760
they don't support GPT-4 fine tuning yet,

05:44.760 --> 05:46.440
although they said it's coming in the fall.

05:46.440 --> 05:48.660
You can also tune on to other models,

05:48.660 --> 05:51.480
but they're the older models like GPT-2.

05:51.480 --> 05:54.450
So it is not really worth messing with,

05:54.450 --> 05:56.910
or at least, you know, not for most of the tasks

05:56.910 --> 05:59.730
that I've found helpful for.

05:59.730 --> 06:03.000
Cool. So when you call that and then you get a job id,

06:03.000 --> 06:06.360
and you can see that like it's not finished yet

06:06.360 --> 06:08.790
and it says status created.

06:08.790 --> 06:10.620
And again, you can just check the status.

06:10.620 --> 06:12.780
So if you run that a few times,

06:12.780 --> 06:16.740
like it will say like running, if you scroll across

06:16.740 --> 06:19.020
and then eventually it'll say status succeeded.

06:19.020 --> 06:20.370
And once it has succeeded,

06:20.370 --> 06:22.380
then you can see how long it took.

06:22.380 --> 06:25.200
In this case it took 24 minutes.

06:25.200 --> 06:27.660
And then also you can calculate the price as,

06:27.660 --> 06:29.340
so you can calculate this ahead of time,

06:29.340 --> 06:34.080
but the, the cost for fine tuning is as is a 10th

06:34.080 --> 06:36.810
of 8 cents per 1000 tokens.

06:36.810 --> 06:41.100
So it costs me $2.22 to train on these 300 observations.

06:41.100 --> 06:43.710
And then you get the actual fine tuned model itself.

06:43.710 --> 06:47.430
So you can see that here, this is my specific code,

06:47.430 --> 06:50.760
it shows up in my account and then you can just query it

06:50.760 --> 06:52.230
just like you otherwise would.

06:52.230 --> 06:56.460
And so this is just a normal query to, to OpenAI,

06:56.460 --> 06:58.920
except I'm passing this new model name in.

06:58.920 --> 07:00.420
And if it's, I think it's really important

07:00.420 --> 07:02.610
that you keep the system message and the con-

07:02.610 --> 07:04.710
and the prompt style the same,

07:04.710 --> 07:06.750
but I'm getting, I think, pretty good results.

07:06.750 --> 07:08.730
So this actually really does sound like me.

07:08.730 --> 07:11.313
I think if I scroll back to the beginning.

07:12.810 --> 07:16.410
It uses a lot of the same kind of informal language,

07:16.410 --> 07:18.180
but quite direct, not much fluff.

07:18.180 --> 07:19.680
Yeah, like words like under the

07:19.680 --> 07:21.480
hood, I use that quite a lot.

07:21.480 --> 07:23.130
Yeah, here we go. I think I'm using

07:23.130 --> 07:24.300
figures and stuff like that.

07:24.300 --> 07:25.290
Yeah, I think it's,

07:25.290 --> 07:27.840
I'm actually pretty impressed with the results.

07:27.840 --> 07:30.900
And then the other thing I think you should think about is

07:30.900 --> 07:32.940
like, this was asking it to a section

07:32.940 --> 07:34.800
that like it'd already trained on,

07:34.800 --> 07:36.300
but really you want to try

07:36.300 --> 07:38.310
and write a section that it hasn't trained on.

07:38.310 --> 07:41.970
So here's a topic that I have written about memetics.

07:41.970 --> 07:43.740
I actually wrote a whole book about memetics,

07:43.740 --> 07:47.730
but not, there's only like one blog post on memetics on my

07:47.730 --> 07:51.390
personal blog and it doesn't mention criticisms of memetics.

07:51.390 --> 07:55.140
So this is like a tangential topic

07:55.140 --> 07:57.270
and I haven't written specifically this section like,

07:57.270 --> 07:58.740
why isn't memetics a science?

07:58.740 --> 08:00.360
So this is a better test

08:00.360 --> 08:04.440
and what I find is it still sounds like me, it starts

08:04.440 --> 08:06.360
to quote different people I've never read,

08:06.360 --> 08:07.860
but it's, it's like doing,

08:07.860 --> 08:09.360
it's still doing a pretty good job.

08:09.360 --> 08:11.760
So I think that more testing needs to be done.

08:11.760 --> 08:15.780
Ideally what you would do is you would hold back some test

08:15.780 --> 08:18.240
data when you have the 300 or something samples.

08:18.240 --> 08:21.030
If you hold back like 10, 15% of that,

08:21.030 --> 08:23.790
and then you could feed in those prompts, generate

08:23.790 --> 08:26.730
and see how well they did

08:26.730 --> 08:29.040
relative to the real task.

08:29.040 --> 08:32.070
So I, I would say that's good, good, you know,

08:32.070 --> 08:34.470
that's a good idea really with any machine learning.

08:34.470 --> 08:36.330
But, but yeah, that's how you're gonna really tell

08:36.330 --> 08:37.650
whether it's done a good job.

08:37.650 --> 08:39.360
You can also tell programmatically,

08:39.360 --> 08:42.270
so there's like the evals functionality and lang chain

08:42.270 --> 08:46.680
or in OpenAI you can do manual kind of blind,

08:46.680 --> 08:48.630
thumbs up, thumbs down, does this sound like me?

08:48.630 --> 08:50.820
Or not like sentence by sentence.

08:50.820 --> 08:53.340
But you could also do embedding distance

08:53.340 --> 08:57.420
or you could calculate the vector for, for the text,

08:57.420 --> 09:00.240
the reference text, the actual writing that I did,

09:00.240 --> 09:04.173
and then calculate the, the embedding for the, the text,

09:04.173 --> 09:07.200
the, the, the fine tune model generated

09:07.200 --> 09:10.140
and then see what the difference is, see how close it gets.

09:10.140 --> 09:12.150
And you can try different fine tuning techniques.

09:12.150 --> 09:14.190
So you try with a different system message

09:14.190 --> 09:16.020
and see if that improves things and so on.

09:16.020 --> 09:19.410
So it is an iterative experimental process.

09:19.410 --> 09:22.080
The other thing to show you is that these models,

09:22.080 --> 09:25.050
there's not much interface unfortunately with these things,

09:25.050 --> 09:26.820
but these models do show up.

09:26.820 --> 09:28.590
So if I refresh

09:28.590 --> 09:32.700
and then you can see I have my fine tunes that show up here.

09:32.700 --> 09:35.853
If I wanna say the section,

09:38.160 --> 09:41.603
what is my favorite color?

09:44.400 --> 09:46.383
Just a random one, just submit.

09:48.630 --> 09:50.030
See what it comes back with.

09:54.930 --> 09:56.760
Yeah, so it doesn't always work.

09:56.760 --> 09:59.460
I think if it goes off, off piece in terms of the examples,

09:59.460 --> 10:01.140
it doesn't do as good a job.

10:01.140 --> 10:04.560
Cool. The, the other thing to talk about is pricing.

10:04.560 --> 10:08.790
Here we go for fine tuning costs, like 8 cents per thousand

10:08.790 --> 10:10.390
or less, sorry, one 10th of 8 cents

10:10.390 --> 10:12.450
for a thousand tokens to train.

10:12.450 --> 10:16.230
And then it costs like basically a 1 cent for input

10:16.230 --> 10:19.020
and you know, one and a half cent for output

10:19.020 --> 10:20.640
when you're talking to the model.

10:20.640 --> 10:23.530
So if you compare, this is what GPT

10:25.320 --> 10:27.360
3.5 normally costs.

10:27.360 --> 10:30.510
So you can see that we're going from

10:30.510 --> 10:32.550
0.0015

10:32.550 --> 10:34.230
to 0.012.

10:34.230 --> 10:38.010
So it's an order of magnitude different in terms of cost,

10:38.010 --> 10:39.840
but it's still pretty cheap, right?

10:39.840 --> 10:43.200
If you think about a GPT-4, that's 3 cents

10:43.200 --> 10:45.090
and 6 cents per token.

10:45.090 --> 10:48.840
And so you can, you're basically getting a much cheaper,

10:48.840 --> 10:51.540
you know, much cheaper model than GPT-4.

10:51.540 --> 10:54.300
And what a lot of people have been telling me is

10:54.300 --> 10:56.550
that you can get better results

10:56.550 --> 10:58.830
with a fine tuned GPT-3.5 model

10:58.830 --> 11:02.880
for specific tasks than you can with a general GPT-4.

11:02.880 --> 11:04.980
So, you know, depends how you position it.

11:04.980 --> 11:07.140
It's either a really expensive GPT-3.5

11:07.140 --> 11:11.850
or like a really cheap GPT-4 replacement for some tasks.

11:11.850 --> 11:14.370
So yeah, hopefully this is useful for you guys.

11:14.370 --> 11:17.730
Again, this is like really new, so this might change,

11:17.730 --> 11:19.920
but hopefully you understand the, the kind of thinking

11:19.920 --> 11:21.930
behind it and how it works.

11:21.930 --> 11:24.003
And what's really great about this is

11:24.003 --> 11:25.230
it's pretty accessible.

11:25.230 --> 11:27.930
You don't have to be a machine learning engineer to do this.

11:27.930 --> 11:29.010
Like, I'm certainly not.

11:29.010 --> 11:31.497
So hopefully you guys can experiment and,

11:31.497 --> 11:33.273
and find some interesting stuff.