WEBVTT

00:00.860 --> 00:01.400
Right.

00:01.490 --> 00:04.130
How do we optimize a prompt for production?

00:04.190 --> 00:08.870
We're going to take this specific example, which is a news article summary prompt.

00:08.930 --> 00:10.730
So here's an example.

00:10.730 --> 00:16.490
I'm using Gene II the gene reader by the way in order to download this article.

00:16.670 --> 00:17.390
Scrape it.

00:17.420 --> 00:20.390
That's quite a nice easy hack.

00:20.720 --> 00:27.620
You just put a r I in front of any URL and it should give you back the text in markdown, which is a

00:27.620 --> 00:29.930
very useful format for putting it into an LLM.

00:29.930 --> 00:32.120
But the actual prompt itself is literally just.

00:32.120 --> 00:33.590
Please summarize the following user.

00:34.550 --> 00:41.450
Now with any of these tasks, typically I'll break open a Jupyter notebook and then get OpenAI, which

00:41.450 --> 00:49.010
is the main way that I use Llms is just directly through the OpenAI API, and I have just a basic function

00:49.130 --> 00:50.390
for completion.

00:50.480 --> 00:53.450
Here we're just putting in the prompt and then we're getting the context.

00:53.510 --> 00:56.300
You're passing the context and then we're generating that.

00:56.300 --> 01:06.740
By the way one thing I tend to do is I'll pass the prompt in here with Any variables in curly brackets.

01:06.740 --> 01:09.980
And then I'll use prompt dot format and then I'll pass in.

01:09.980 --> 01:11.390
I'll unbundle the context.

01:11.390 --> 01:15.740
So the context is just a dictionary of different the variable.

01:15.740 --> 01:18.140
And then what the value of that variable is.

01:18.530 --> 01:21.950
And I find that that's quite useful for, for doing this.

01:22.790 --> 01:23.030
All right.

01:23.030 --> 01:24.890
So this is the naive response we get.

01:24.920 --> 01:26.660
And I'm not going to read the whole thing.

01:26.690 --> 01:32.960
It's an article about Donald Trump and just a relatively like politically biased article.

01:33.140 --> 01:35.450
And don't mean to pass judgment either way.

01:35.450 --> 01:40.670
I just mean that specifically the adversarial one to Donald Trump, which maybe he deserves.

01:41.240 --> 01:45.560
I'm not going to weigh in, but but essentially, I wanted to take an article that was quite emotionally

01:45.560 --> 01:49.220
charged and just see how well that shines through in the response.

01:49.220 --> 01:52.250
And unfortunately, it doesn't really shine through enough.

01:52.280 --> 01:56.570
We might lose some of the meaning of the original article, I think, if you're summarizing it with

01:56.570 --> 01:58.040
a naive response.

01:58.040 --> 02:00.440
So here's a more optimized response.

02:00.440 --> 02:04.010
And this is after doing a little optimization.

02:04.010 --> 02:09.880
And one of the things you'll notice is that this is a lot more making its reviews and its responses

02:09.910 --> 02:14.740
like its much more biased and in the way that the original article is, which is what we wanted.

02:14.740 --> 02:18.610
But it does have citations as well, which is also very useful.

02:18.640 --> 02:22.270
It's looked up different things and I'll show you how we did that.

02:22.270 --> 02:25.090
So we're applying the five principles of prompting here.

02:25.090 --> 02:26.830
The first one is just give direction.

02:26.830 --> 02:32.170
So the first thing I tried was I wanted it to preserve a nuance of the original author's intended tone.

02:32.170 --> 02:37.270
So that's just the first thing that I would try and add in order to improve this prompt.

02:37.270 --> 02:43.150
And it does do a much better job when you do that, which is quite helpful.

02:43.390 --> 02:47.740
And then the second thing that I've started to do is specify the format, what do I actually want out

02:47.740 --> 02:48.130
of this?

02:48.130 --> 02:54.040
And one thing I've found quite useful is if I ask it to give me the key points or extract some information

02:54.040 --> 02:57.490
first, and then it does a better summary afterwards.

02:57.490 --> 03:03.520
And I've also asked it to bring me back this in JSON format, so I could use it in a program later on.

03:04.090 --> 03:06.190
And you can see here it gives me the key points.

03:06.190 --> 03:08.260
These are the things that kind of check.

03:08.290 --> 03:12.130
And I found this is much better in debugging because it gets the wrong key points to summary.

03:12.160 --> 03:13.210
It would be very good.

03:13.240 --> 03:15.520
I find that's useful as well.

03:15.550 --> 03:21.220
It's almost like a chain of thought type prompt as well, because if the key points are like the thinking

03:21.220 --> 03:25.540
step and we have it before the actual summary, so it does improve the summary.

03:25.540 --> 03:28.990
But the main thing you do to improve performance is just add examples.

03:29.140 --> 03:33.760
So I went through I added some examples manually from other articles.

03:33.760 --> 03:38.920
And yeah, one thing I should note here is that you don't always have to input the full thing.

03:38.950 --> 03:41.920
So in this case I'm just showing the example outputs.

03:41.920 --> 03:47.770
I didn't paste in the whole article text because then you're prompting are getting huge, and it doesn't

03:47.770 --> 03:52.930
necessarily need that full article text in order to to learn from your examples.

03:54.220 --> 03:59.050
This is what we get after we do that and it ends up being a lot more useful.

03:59.080 --> 04:04.120
Again, I'm not going to bore you too much, but if you're interested in this, but I've just it is

04:04.120 --> 04:07.120
a lot more close to the original article intent.

04:07.330 --> 04:12.010
The fourth thing they tend to do is to try and evaluate the quality of the performance.

04:12.010 --> 04:15.520
up until now, you just vibe checking and just what?

04:16.060 --> 04:22.540
What you think it does well, but you really want to get a more formal definition as as quickly as possible.

04:22.540 --> 04:30.520
So in this case, I used a function that you get the rouge scores, which is the the actual scores of

04:30.520 --> 04:37.030
the the similarity between the original text and the summary which is commonly used.

04:37.150 --> 04:38.620
It's just calculating it.

04:38.620 --> 04:45.910
And then I have, uh, those rouge scores into a prompt and then asked ChatGPT to interpret them for

04:45.910 --> 04:46.480
me.

04:46.630 --> 04:48.520
So this is a typical response.

04:48.520 --> 04:51.580
Compare the two rouge scores and then it will tell you what the winner is.

04:51.610 --> 04:57.610
Overall, this is quite helpful because it means you don't have to interpret it yourself, and especially

04:57.640 --> 05:01.600
useful when some of the scores are better and some are worse.

05:01.630 --> 05:05.500
And so in this case, it's the overall chosen 1 or 2 is the winner.

05:05.500 --> 05:09.820
But the first response was actually better on rouge one.

05:09.820 --> 05:10.720
For example.

05:10.720 --> 05:13.270
And then you want to break it into multiple steps.

05:13.270 --> 05:15.370
So divide up the labor labour, essentially.

05:15.490 --> 05:18.220
It's very hard to do all of these things in one prompt.

05:18.370 --> 05:24.160
One of the steps I added here, which is something I found quite useful, is in some respects, I wanted

05:24.160 --> 05:28.210
the summary to be matching the original intent of the article.

05:28.240 --> 05:31.600
I don't want it to change the the style of the article.

05:31.630 --> 05:33.370
So we've solved for that problem.

05:33.400 --> 05:35.500
But I do want some fact checking here, right?

05:35.530 --> 05:38.740
Like I want to see how biased in one direction it is.

05:38.830 --> 05:44.950
So I added another prompt at the end where we would rewrite the summary, adding citations from the

05:44.950 --> 05:48.850
fact check key points where necessary, and absolutely add 1 or 2.

05:48.880 --> 05:56.470
But I got the fact checking from using Calveley, which is a search engine similar to perplexity.

05:56.500 --> 06:02.080
So what it does is it will search out the course for these specific things, and then it will fill in

06:02.350 --> 06:05.560
each of the key points that will fill in some solution.

06:05.560 --> 06:08.110
It will say, yes, this is true or no, this isn't true.

06:08.140 --> 06:09.970
And then some context.

06:09.970 --> 06:14.710
And then I'm passing that into the prompt here with the fact checks section.

06:14.710 --> 06:17.790
So this is the type of response we get.

06:17.850 --> 06:21.210
You can see that it has the fact checks at the end, which is nice.

06:21.330 --> 06:22.530
From there.

06:22.560 --> 06:25.740
And that code is in the Jupyter notebook if you want to take a look at it.

06:25.770 --> 06:30.930
There's a call a whole nother module on how to use tavli in the course.

06:30.930 --> 06:31.830
So check that out.

06:31.860 --> 06:37.170
Now once you get that prompt into production then this is usually good enough right.

06:37.200 --> 06:38.730
But you want to keep improving it.

06:38.730 --> 06:40.440
So now you have an evaluation metric.

06:40.470 --> 06:41.700
You can run a B tests.

06:41.730 --> 06:45.630
So you can test different versions of the prompt and see which ones get better scores.

06:45.660 --> 06:50.520
You can also use the spy so you can do optimization instead of a B testing yourself.

06:50.550 --> 06:53.760
You get the spy to write prompts for you, which is quite nice.

06:53.790 --> 06:59.400
And then once you have a big enough data set of successful responses, then you might want to consider

06:59.400 --> 06:59.970
fine tuning.

07:00.000 --> 07:04.710
So usually I wouldn't start the spy until I've got 50 plus examples and then fine tuning.

07:04.710 --> 07:07.500
Or maybe wait until I have 200 plus examples.

07:07.500 --> 07:14.370
Or if the cost of the production is getting too expensive and I want to train like a small or an open

07:14.370 --> 07:16.110
source model to do this task.

07:17.610 --> 07:21.060
Hopefully that gives you a bit of a sense of what prompt engineers do day to day.