WEBVTT

00:00.380 --> 00:00.650
All right.

00:00.650 --> 00:02.990
Let's talk about advanced optimization.

00:02.990 --> 00:07.220
So once you have a prompt working what do you do in production to improve it.

00:07.370 --> 00:12.980
And just as a reminder we've been going through the five principles of prompting in the previous section.

00:12.980 --> 00:18.620
So if you haven't done that so far, it might be worth going back through the previous tutorial.

00:18.620 --> 00:24.170
Now we have this social media task where it takes the insight, a social network, and then provides

00:24.170 --> 00:25.400
a social post.

00:25.400 --> 00:31.230
And just to show you how much we improved, we went from this full of emojis and hashtags and not that

00:31.230 --> 00:34.710
great all the way through applying the different five principles.

00:34.710 --> 00:41.820
And the latest result that we got was this we have five to pick from and they're ranked in terms of

00:41.820 --> 00:42.690
rating.

00:42.690 --> 00:45.540
The rating is done by an LLM prompt as well.

00:45.540 --> 00:50.190
And we have, I think, a pretty good social media post from this.

00:50.190 --> 00:52.140
Now we have something working.

00:52.140 --> 00:59.470
You want to move away from basic evaluation and vibe checking, and you want to start to be more consistent

00:59.470 --> 01:01.840
in terms of the things that you test.

01:01.840 --> 01:08.290
The very first thing that I do, and I would say this is probably 80% of my work is run a B tests.

01:08.290 --> 01:11.500
So I have a function here and it could be very simple.

01:11.500 --> 01:17.890
You just generate for task A, you generate for task B, and each one is a different prompt.

01:17.910 --> 01:24.270
And once you get the results out at the end, then you know which prompt is working better, and you

01:24.270 --> 01:28.080
try to test 1 or 2 things at a time just to see what the results are.

01:28.560 --> 01:35.220
Now, the reason why I have number of runs equals ten is that you need to run each task multiple times,

01:35.220 --> 01:40.290
because otherwise you don't know how often it actually succeeds or fails.

01:40.290 --> 01:46.090
We have this a b test prompts and then we have the generate and evaluate post function.

01:46.090 --> 01:47.980
Here I'm just going to show you that.

01:48.190 --> 01:49.510
So generate and evaluate.

01:49.510 --> 01:54.250
Post first gets the response and then it gets the post content.

01:54.250 --> 01:56.560
And then it evaluates the post content.

01:56.560 --> 02:01.330
So it will come back with a rating here and the post content as well.

02:01.480 --> 02:04.390
And that's what we're doing every time we do this.

02:04.390 --> 02:08.150
So we're going to do that ten times for task A and ten times for task B.

02:08.150 --> 02:10.610
Now let's compare the different prompts.

02:10.610 --> 02:17.210
So the only thing I changed about the prompt from the previous example is that instead of in the style

02:17.210 --> 02:22.790
of Malcolm Gladwell, we now say in the style of Malcolm Tucker, and he's a bit funnier.

02:22.790 --> 02:24.980
He's a character from The Thick of It.

02:25.400 --> 02:28.040
He's a bit sweary, actually, so we'll see if we get it to swear.

02:28.040 --> 02:31.110
But that's literally the only thing we changed is the word.

02:31.410 --> 02:34.590
Instead of Malcolm Gladwell, we changed it to Malcolm Tucker.

02:34.590 --> 02:41.460
And when we run this, we're going to get a sense of over, in this case, 30 runs.

02:41.460 --> 02:46.320
What is the average rating for Malcolm Tucker versus Malcolm Gladwell?

02:46.560 --> 02:49.890
And this is doing all the heavy lifting for us, which is really nice.

02:49.890 --> 02:53.220
We're going to get back the results B and we can compare them.

02:53.220 --> 02:57.940
And this is a very simple pattern, but it's something that I do all the time.

02:57.940 --> 03:01.030
Quite often I'll dump the results into a data frame.

03:01.030 --> 03:04.780
So we have all the past examples and how often it did.

03:04.810 --> 03:07.210
I'll also tend to test different examples.

03:07.210 --> 03:07.990
Context.

03:07.990 --> 03:10.120
In this case I'm just keeping it the same.

03:10.120 --> 03:13.390
I have the same insight, but I never run it 30 times.

03:13.390 --> 03:17.920
But you might want to run it say ten times, but then run it across five different insights just to

03:17.920 --> 03:21.000
see what sort of edge cases does it fail on.

03:21.000 --> 03:25.380
But in general, this is most of the work I'm doing now.

03:25.380 --> 03:28.890
The interesting thing is I have a pretty simple change here.

03:28.890 --> 03:33.420
Malcolm Tucker is Malcolm Gladwell, but you can make a much bigger change.

03:33.420 --> 03:36.090
You could apply a whole technique, right?

03:36.090 --> 03:40.620
If you're using a specific strategy, you have a hypothesis for what technique could work, then you

03:40.620 --> 03:42.000
could make a change there.

03:42.090 --> 03:43.260
Now here we go.

03:43.260 --> 03:46.540
So we had a rating of 4.2 for Malcolm Gladwell.

03:46.540 --> 03:49.150
And we have 4.3 on average for Malcolm Tucker.

03:49.150 --> 03:50.380
And that's over 30.

03:50.380 --> 03:52.360
So 60 responses there.

03:52.360 --> 03:53.890
So it's pretty robust.

03:53.890 --> 03:56.200
And you can see it just making a small change.

03:56.320 --> 04:01.510
Now I have some proof that I could give to my client or just proof for myself if this is something I'm

04:01.510 --> 04:06.910
automating for myself that the Malcolm Tucker style works and you could even go through and you could

04:06.910 --> 04:12.500
test prompt C, prompt D, you could try different, different styles.

04:12.590 --> 04:14.750
You could try different providing different examples.

04:14.750 --> 04:16.220
And the examples partial.

04:16.220 --> 04:19.640
You could apply different instructions in here.

04:19.640 --> 04:21.680
There's so many different things you can do.

04:21.710 --> 04:22.940
You know a different framework.

04:22.940 --> 04:28.160
Instead of bait hook reward you could ask it to use a different social media framework.

04:28.160 --> 04:32.580
And you change this section here so you can really do whatever you need.

04:32.580 --> 04:38.970
And there are infinite different combinations of prompts that you could change in order to get to the

04:39.090 --> 04:41.310
different evaluation score at the end.

04:41.340 --> 04:44.040
You can also optimize your evaluation prompt as well.

04:44.040 --> 04:46.320
That's not immune from a B testing as well.

04:46.740 --> 04:47.040
All right.

04:47.040 --> 04:49.080
So that's obviously a lot of work.

04:49.080 --> 04:55.920
And one thing I have found which is particularly good for certain types of classifier is DSP.

04:55.950 --> 04:59.650
So this is doing the above, but automatically for you.

04:59.650 --> 05:01.450
And you have to do some setup.

05:01.450 --> 05:07.060
So you have to set up the social media post generator here that inherits from DSP signature.

05:07.090 --> 05:09.790
The signature is just where you define the inputs and outputs.

05:09.790 --> 05:12.220
So we have the insight input field.

05:12.220 --> 05:13.750
We have the social network input field.

05:13.750 --> 05:15.910
And then we have the post that comes out at the end.

05:16.000 --> 05:19.120
And this is actually the prompt that gets fed into DSP.

05:19.150 --> 05:21.500
So I haven't written my big old prompt here.

05:21.500 --> 05:24.170
I'm just going to see how DSP does with this.

05:24.710 --> 05:26.870
Then we're generating the post.

05:26.870 --> 05:30.890
We pass in the social media generator to a chain of thought tactic.

05:30.890 --> 05:33.440
So this is implementing chain of thought for you.

05:33.440 --> 05:36.440
And then we have to set up an evaluation metric.

05:36.500 --> 05:39.680
The way you set up evaluation metrics in DSP is you.

05:39.680 --> 05:44.120
The convention is you take in a gold answer, which is the correct answer and the context.

05:44.120 --> 05:49.900
And then you take in a prediction which is the LLM answer, and then you have this trace equals none,

05:49.900 --> 05:52.030
which is just required a break.

05:52.030 --> 05:58.060
And what I'm doing here is I'm just finding the rating and then getting the score and passing it back.

05:58.390 --> 06:03.100
I have to define which which model you're using and configure that.

06:03.100 --> 06:09.580
And then you can just hit enter and it will actually it would actually generate this using this prompt.

06:09.590 --> 06:12.530
It would generate a social media post.

06:12.530 --> 06:14.450
And you can see here that this is actually pretty good.

06:14.450 --> 06:15.800
This is without optimization.

06:15.800 --> 06:18.650
So we haven't used DSP to optimize anything.

06:18.650 --> 06:22.430
It was just taken this it's taken the data structure.

06:22.430 --> 06:23.210
You passed it.

06:23.210 --> 06:25.400
It's taken the chain of thought technique.

06:25.400 --> 06:28.760
It's just brought all that together to make a pretty decent prompt for you.

06:28.760 --> 06:36.360
And we're getting fairly good results then where the actual optimization comes through is in this bootstrap

06:36.360 --> 06:38.280
few shot with random search.

06:38.280 --> 06:44.220
And that sounds complicated, but literally all it means is it will add examples to the prompt in order

06:44.220 --> 06:45.510
to improve the performance.

06:45.510 --> 06:47.700
So you need to pass it some examples.

06:47.700 --> 06:50.130
This is a training data set and the development set.

06:50.130 --> 06:55.320
Then here I've just split the previous context and the previous examples.

06:55.320 --> 06:58.200
The ones I ran 30 times for my a B test.

06:58.200 --> 07:02.080
I've just taken them and I've split them into these two data sets.

07:02.590 --> 07:06.550
You just need to make sure it's a DSP example format.

07:06.550 --> 07:08.830
But and then you have to say what the inputs are.

07:08.830 --> 07:14.440
So with inputs insight and social network, once you've done that you can just, you know, compile

07:14.440 --> 07:14.830
it.

07:14.830 --> 07:16.720
And literally this is the way you do it.

07:16.720 --> 07:20.260
Optimizer dot compile and you pass in the train set and the value set.

07:20.260 --> 07:24.410
And then at the end it will have optimized the the prompt.

07:24.410 --> 07:28.370
So not the specific instructions but the examples that it includes.

07:28.370 --> 07:34.310
And the reason it's called bootstrap few-shot is that it will include your examples, the ones that

07:34.310 --> 07:39.890
you have, but it will also create synthetic examples as well and add them, meaning get the LLM to

07:39.890 --> 07:42.170
come up with good examples to add to the prompt.

07:42.560 --> 07:46.190
When you inspect a history here you can see this is the prompt that it came up with.

07:46.190 --> 07:48.490
It's following it saying follow this format.

07:48.490 --> 07:51.700
And then it's giving a bunch of examples here.

07:51.700 --> 07:55.270
And you can see you can see the ones that it's added to the prompt.

07:55.270 --> 07:56.530
It's added quite a few.

07:56.530 --> 07:59.170
And some of them are ones that I gave.

07:59.170 --> 08:04.120
And then some of them are ones that it's come up with, if that makes sense.

08:05.890 --> 08:07.150
Listen up, you Muppets.

08:07.420 --> 08:08.710
That's pretty interesting.

08:08.920 --> 08:15.620
And then the final one, which is usually what I try last because it can be end up being expensive and

08:15.620 --> 08:18.410
a little bit inflexible is fine tuning.

08:18.410 --> 08:25.100
So fine tuning is really what you would do quite often to get a smaller open source model to perform

08:25.100 --> 08:28.010
better or perform as good as GPT four.

08:28.010 --> 08:34.280
But in this case, we're fine tuning GPT 3.5 and it's a pretty straightforward API.

08:34.280 --> 08:40.860
I would I would say that you should try fine tuning the smaller GPT models through OpenAI first, because

08:40.860 --> 08:45.570
if you can't get it working with them, then it probably won't work with the open source models either.

08:45.570 --> 08:51.540
So I'd like to test I like to test out the viability of fine tuning using this type of example first.

08:51.570 --> 08:55.680
Now all you really need to do for fine tuning, which is nice, is that you need to get the data into

08:55.680 --> 08:56.490
the right format.

08:56.880 --> 09:02.590
For OpenAI, it's just the messages, the messages list, and I'm just appending that to the fine tuning

09:02.590 --> 09:03.850
data, and I'm saving it.

09:03.850 --> 09:07.330
So you can see here that here's the list of messages.

09:07.330 --> 09:09.790
And you have raw user content.

09:09.790 --> 09:12.280
And then you have like raw assistant content.

09:12.280 --> 09:15.340
So you get the user and the assistant responses.

09:15.340 --> 09:18.760
And then you just create a you just upload the file.

09:18.760 --> 09:20.530
So it needs to have the file in there.

09:20.530 --> 09:24.130
Then once the file is uploaded then you can create a job with that file ID.

09:24.140 --> 09:29.000
And right now you can't train, but you can train GPT 3.5 turbo.

09:29.390 --> 09:34.730
Once that's created, then you have the job ID you'll be able to see that in the interface as well.

09:34.730 --> 09:40.040
And once it's done and you can actually check this programmatically, then it would say when it's finished

09:40.040 --> 09:42.650
at and it would give you the name of the model.

09:42.650 --> 09:45.590
And it also says status equals succeeded.

09:45.740 --> 09:49.450
Or if there's an error then you'll see what the error is and hopefully try and fix it.

09:49.690 --> 09:51.730
That is how you set up and create a job.

09:51.730 --> 09:55.030
And then how you actually use the model is like this.

09:55.030 --> 10:00.910
I'm just going to hit run here just to see, okay, I need to create a job first.

10:00.910 --> 10:02.620
So that's how you create a job.

10:02.620 --> 10:02.890
Yeah.

10:02.890 --> 10:04.240
So it's created a job there.

10:04.240 --> 10:06.190
And then it's going to cost some money to fine tune.

10:06.190 --> 10:08.380
And you can actually change some of the parameters as well.

10:08.380 --> 10:09.880
But I just use the defaults.

10:09.880 --> 10:18.680
But then once you've run the here we go, copy this fine tuned model here.

10:18.860 --> 10:23.360
And I'm just going to change this.

10:23.360 --> 10:26.570
So we're using the fine tuned model.

10:26.570 --> 10:31.070
So in here this is looking for the status and seeing if it's working.

10:31.070 --> 10:34.250
But I'm just going to swap in this fine tune model here.

10:34.250 --> 10:38.310
So if you know the specific name of the model you're going to use and they're going to comment this

10:38.310 --> 10:41.910
out here just because that's this one's still going to be running, right.

10:42.360 --> 10:47.040
Actually you can see what the what the status is here that's still running.

10:47.040 --> 10:50.340
But we're going to comment out this stuff which would check if it's running.

10:50.340 --> 10:53.550
And if it's succeeded then it would set the fine tuned model.

10:53.550 --> 10:56.220
And then we're just hardcoding the model name here.

10:56.340 --> 10:59.070
And we're just going to see what this looks like.

10:59.370 --> 11:06.610
So here I'm comparing GPT 3.5 turbo, the base model with GPT four, and then with the fine tuned model

11:06.610 --> 11:07.120
as well.

11:07.120 --> 11:10.870
So we're going to see which models perform best.

11:11.680 --> 11:17.410
And we're going to see the difference here between what we're getting from the base model and what we're

11:17.410 --> 11:22.210
getting from GPT four, and how close the fine tuned model is to GPT four, because I think that's really

11:22.210 --> 11:29.660
the key, is get a lot of examples from GPT four and then distill those examples into a base model.

11:30.260 --> 11:35.210
So this is just evaluating GPT 3.5 turbo ten times.

11:35.210 --> 11:40.340
And then it should evaluate GPT 410 times and then fine tune model ten times as well.

11:41.930 --> 11:50.910
I'm just going to uncomment this just in case you want to use this in the future, you should be able

11:50.910 --> 11:55.920
to just run that and when your job is done then it will update the fine tuned model here.

11:56.730 --> 11:59.220
So you say last job dot fine tune model.

11:59.220 --> 12:02.790
But I already knew what the model was called which is why I hardcoded it there.

12:02.790 --> 12:06.060
Okay, so let's finished evaluating GPT 3.5 turbo.

12:06.060 --> 12:10.710
So evaluating it probably would have made sense to do this asynchronously because it would take less

12:10.710 --> 12:11.010
time.

12:11.010 --> 12:17.000
But being lazy for this case, the other benefit by the way of doing fine tuning is that you can sometimes

12:17.000 --> 12:18.380
get better latency.

12:18.380 --> 12:21.110
Because GPT four is a much bigger model.

12:21.110 --> 12:23.030
You can see how long it's taking.

12:23.030 --> 12:28.280
It's taking eight seconds per iteration versus two seconds per iteration for GPT 3.5.

12:28.280 --> 12:34.940
So if you can get GPT four level responses from a smaller model, then that's much better for for your

12:34.940 --> 12:35.780
application.

12:35.780 --> 12:41.130
Okay, so it looks like we finished evaluating GPT four, which is now waiting for the fine tuned model

12:41.130 --> 12:42.060
to see how that does.

12:42.060 --> 12:45.390
Okay, now we're evaluating the fine tuned model that we created.

12:45.390 --> 12:48.060
And this is only available in my organization.

12:48.060 --> 12:52.470
So if you called this and would say you don't have access to this model, but the downside of it being

12:52.470 --> 12:55.740
an OpenAI is that you also don't have access to this model.

12:55.740 --> 12:57.840
You can't export it anywhere or run it locally.

12:57.840 --> 13:03.020
You can see the latency is much better than GPT four, right, eight seconds versus 2.8 seconds, But

13:03.020 --> 13:05.990
it is higher, unfortunately, than, you know, the base model.

13:05.990 --> 13:07.280
So that's something to think about.

13:07.280 --> 13:08.930
Now it's just running those evaluations.

13:08.930 --> 13:12.590
So it's it's generated the ten but it's just evaluating them.

13:12.590 --> 13:14.660
And we're going to get a score in a second okay.

13:14.660 --> 13:15.260
Here we go.

13:15.680 --> 13:19.670
So GPT 3.5 turbo got a score of 3.8 on average.

13:19.910 --> 13:22.340
Got a couple of fours in there which is good.

13:22.340 --> 13:25.250
And then if we just keep going here.

13:25.250 --> 13:27.440
So these are the top three posts.

13:27.440 --> 13:30.080
Then we have GPT four got a 4.5 on average.

13:30.080 --> 13:32.240
So it's a killer you know it's really good.

13:32.240 --> 13:34.010
And then here we go.

13:34.010 --> 13:35.030
This is our fine tune model.

13:35.030 --> 13:36.680
We got 4.2 on average.

13:36.680 --> 13:39.860
This is much better right 4.2 is really good.

13:39.860 --> 13:42.440
It's a big performance uplift from 3.8.

13:42.440 --> 13:46.310
It's not quite as good as GPT four but I'm willing to live with that.

13:46.310 --> 13:49.910
And we can actually see what are the types of responses we got.

13:50.330 --> 13:50.990
Yeah here we go.

13:50.990 --> 13:51.380
Sure.

13:51.380 --> 13:52.170
Let's cut to the chase.

13:52.170 --> 13:54.600
Even Einstein couldn't remember where he put his socks without a nudge.

13:54.600 --> 13:58.380
Smarter AI models still need the human touch or prompt engineer.

13:58.380 --> 14:01.740
Just like how you get a gentle reminder from legal HR and management.

14:01.740 --> 14:04.050
So this is actually really good and I'm pretty happy with this.

14:04.080 --> 14:06.330
We've got a 4.0 here.

14:06.480 --> 14:09.240
So this is this is something that's really helpful.

14:09.240 --> 14:11.550
This is a successful case of fine tuning.

14:11.550 --> 14:17.090
And now that this has been successful now I would want to potentially go through the effort of trying

14:17.090 --> 14:23.030
it with an open source model and maybe hosting it myself, or hosting it on replicate or some other

14:23.030 --> 14:24.260
model provider service.

14:24.290 --> 14:31.070
Once you've done the proof of concept and the fine tuning works, then it's worth optimizing the process

14:31.070 --> 14:33.560
and maybe trying out different parameters and things.

14:33.710 --> 14:39.230
But yeah, I'd say this is a last resort because it's a lot more intensive in terms of cost and and

14:39.230 --> 14:39.890
know how.

14:39.900 --> 14:43.410
Because I'm not a fine tuning engineer.

14:43.410 --> 14:44.580
I'm not a machine learning engineer.

14:44.670 --> 14:46.740
My expertise is in prompting.

14:46.800 --> 14:51.780
So I only play with fine tuning, not doing it a lot right in production.

14:52.290 --> 14:57.300
And typically what I'm doing is proving the case for prompting, for fine tuning, building the data

14:57.300 --> 15:02.790
set for fine tuning, and then handing that over to an AI engineer who would know what are the different

15:02.790 --> 15:04.170
parameters to optimize.