WEBVTT

00:00.230 --> 00:02.780
We're going to talk about prompt optimization.

00:02.780 --> 00:06.230
And this is one long Jupyter notebook.

00:06.230 --> 00:08.690
But we're actually going to break this into two parts.

00:08.690 --> 00:14.480
So the first one is the one that we're talking about right now, which is optimizing a prompt for production

00:14.480 --> 00:19.460
by applying the five principles of prompting from our book.

00:19.520 --> 00:25.310
So I'm going to run through those, and then I'm going to do a second one, which is going to be a follow

00:25.310 --> 00:29.210
up to this, where we're going to talk about more advanced optimization techniques.

00:29.210 --> 00:29.360
All right.

00:29.360 --> 00:30.340
Let's get started.

00:30.340 --> 00:33.700
So we have a task and you always have a task with AI, right?

00:33.700 --> 00:38.560
There's always something you need to automate or something that your tool needs to do like a feature.

00:38.560 --> 00:41.380
And usually I write it in this format.

00:41.530 --> 00:43.540
What are the inputs and what are the outputs?

00:43.540 --> 00:45.460
In this case, it's a simple task.

00:45.460 --> 00:50.230
I need to input an insight that I have about the way the world works, and I need to input a social

00:50.230 --> 00:52.090
network like LinkedIn or Twitter.

00:52.090 --> 00:54.870
And then I'll get a social post at the end of it.

00:55.350 --> 00:57.930
That's ultimately what this does.

00:57.960 --> 01:03.300
And by being really clear on what the inputs are and really clear on what the outputs are expected

01:03.300 --> 01:08.010
to be, you can already do a pretty good job with prompt optimization.

01:08.010 --> 01:10.380
We're just going to write a really simple template.

01:10.380 --> 01:16.800
It's going to take both of the inputs: "write a social media post about {insight} for {social_network}."
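
NOTE
A minimal sketch of the simple two-input template described here. The exact
wording and the names BASIC_TEMPLATE and build_prompt are assumptions for
illustration; the notebook's own template is not shown in the talk.
```python
# Hypothetical template: wording and variable names are assumptions.
BASIC_TEMPLATE = "Write a social media post about {insight} for {social_network}."
def build_prompt(insight: str, social_network: str) -> str:
    """Fill both inputs into the template and return the final prompt text."""
    return BASIC_TEMPLATE.format(insight=insight, social_network=social_network)
if __name__ == "__main__":
    print(build_prompt(
        insight="Prompt engineering will still be needed with smarter models.",
        social_network="LinkedIn",
    ))
```
The returned string is what would be sent to whichever model client you use.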

01:17.130 --> 01:20.550
And I'm just going to run that, just so we see. While we're running that,

01:20.550 --> 01:23.130
just look at the insight that I'm providing here.

01:23.130 --> 01:28.380
Prompt engineering will still be needed with smarter models, even as genius humans need prompting.

01:28.500 --> 01:34.170
Even genius humans need prompting from legal, HR, management, etc. to align with business interests.

01:34.170 --> 01:39.660
So this is something that I believe about the world, and it is not a belief that everyone shares.

01:39.660 --> 01:43.230
And the reason why you need this, by the way, is, yeah, I could just say, "Write

01:43.230 --> 01:47.330
a social media post about a topic," but it's not going to be very good.

01:47.330 --> 01:53.030
You tend to need to provide the secret sauce to the AI, and this is something that I've learned, and

01:53.030 --> 01:58.160
therefore it's going to write a much better social post when I provide it that insight.

01:58.160 --> 02:00.080
And let's see what we got.

02:00.440 --> 02:00.950
Now.

02:00.950 --> 02:06.620
The output's not great because it's got these silly emojis and hashtags, but the underlying structure

02:06.620 --> 02:09.850
is good: "As AI models get smarter, you'd think prompt engineering would fade away?

02:09.850 --> 02:10.480
Think again.

02:10.480 --> 02:14.980
Even the most brilliant minds need guidance from legal, HR and management, etc." It's already a good

02:14.980 --> 02:17.020
start, but I don't really like the style.

02:17.290 --> 02:21.820
So let's apply the five principles of prompting and get it ready for production.

02:21.820 --> 02:24.430
The first principle is to give direction.

02:24.520 --> 02:28.870
So describe a desired style or provide a relevant persona.

02:28.870 --> 02:30.280
So I'm just going to run this.

02:30.280 --> 02:33.720
What I've done to change this: it's exactly the same prompt,

02:33.720 --> 02:39.600
but now I've said "in the style of Malcolm Gladwell," and specifically I've given it some instructions

02:39.600 --> 02:44.910
that are really important for making the post sound authentically human and colloquial.
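
NOTE
A hedged sketch of the "give direction" step: the same task with a style
persona and tone instructions appended. The DIRECTION text is hypothetical;
the talk names Malcolm Gladwell and a human, colloquial tone, but the
notebook's exact instructions are not quoted.
```python
# Hypothetical wording: only the persona and the tone goal come from the talk.
DIRECTION = (
    "Write in the style of Malcolm Gladwell. "
    "Sound authentically human and colloquial."
)
def build_directed_prompt(insight: str, social_network: str) -> str:
    """Same task as the basic prompt, with style direction added on a new line."""
    task = f"Write a social media post about {insight} for {social_network}."
    return f"{task}\n{DIRECTION}"
```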

02:45.990 --> 02:49.680
The response I'm getting here is much better.

02:49.680 --> 02:54.330
"Imagine a world where even the brightest minds (think Einstein, Maya Angelou, Sherlock Holmes) needed

02:54.330 --> 02:55.650
occasional nudges."

02:55.890 --> 03:01.880
So it's actually giving some interesting examples, but it still has these weird emojis and hashtags.

03:01.880 --> 03:08.450
So let's keep going and keep improving performance and apply the second principle, which is to specify

03:08.450 --> 03:09.140
the format.

03:09.140 --> 03:14.390
And this is one where you could just mean the format of how it responds.

03:14.390 --> 03:20.420
Do I want it in JSON or YAML or an ordered list or whatever it is, but also the frameworks that you're

03:20.420 --> 03:20.900
using.

03:20.900 --> 03:25.240
And I've specifically implemented a framework here called Bait, Hook, Reward.

03:25.270 --> 03:28.540
This is something that I use for my social posts when I'm writing them manually.

03:28.540 --> 03:34.450
So I'm asking the AI to follow these, like what will grab the attention, what will keep their attention,

03:34.450 --> 03:37.240
and then how do we reward them for paying attention?

03:37.690 --> 03:42.670
And I'm asking it to write this out first in a kind of chain-of-thought style.

03:42.670 --> 03:47.350
So it's going to write in YAML, which is an easy to understand format.

03:47.350 --> 03:50.950
It's going to give me the bait, the hook, the reward, and then it's going to write the post content.

03:50.950 --> 03:52.870
So it's going to end up with a much better result.
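
NOTE
A sketch of the "specify format" step, assuming the model is asked for YAML
with the framework fields first and the post last, so the plan is written
before the content. The key names and instruction wording are assumptions.
```python
# Hypothetical format instructions: key names and phrasing are assumptions.
FORMAT_INSTRUCTIONS = """Respond in YAML with these keys, in this order:
bait: what will grab the reader's attention
hook: what will keep their attention
reward: what they get for paying attention
post: the final post content"""
def build_formatted_prompt(insight: str, social_network: str) -> str:
    """Append the framework and output format to the basic task."""
    task = f"Write a social media post about {insight} for {social_network}."
    return f"{task}\n{FORMAT_INSTRUCTIONS}"
```
Putting the post key last means the bait, hook, and reward are generated
before the post itself, which is the chain-of-thought effect described above.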

03:52.870 --> 03:55.150
And I'm still providing it with that insight.

03:55.150 --> 04:00.910
And you can see here in the output it writes the bait, which is "even the smartest models need a

04:00.910 --> 04:01.630
little nudge."

04:01.630 --> 04:07.030
And then it's writing that hook and then the reward, which is what it can take away.

04:07.030 --> 04:10.660
And then it finally brings everything together into the post.

04:10.860 --> 04:13.710
So this is a much better result already.

04:13.710 --> 04:17.700
We're giving a specific framework, and if you gave it a different framework, you might get a different

04:17.700 --> 04:19.080
response as well.

04:20.010 --> 04:20.310
All right.

04:20.340 --> 04:26.040
Now let's go into the most impactful principle, which is providing examples.

04:26.040 --> 04:27.840
This is the best principle.

04:27.840 --> 04:31.110
It's also costly because you need to give it some examples.

04:31.110 --> 04:34.020
And I've taken the exact same prompt by the way.

04:34.020 --> 04:36.830
And then I've just added this examples partial.

04:36.830 --> 04:40.430
I've kept this separate just so it's easier to change the examples in the future.

04:40.430 --> 04:44.030
But essentially what I've done is manually written these.

04:44.030 --> 04:49.130
So I went through the framework, I wrote a bait, I wrote a hook, I wrote a reward, and then I wrote

04:49.130 --> 04:51.590
the post in my kind of copywriting style.

04:51.890 --> 04:56.600
So this is really helpful, and I've given examples of how it would look for Instagram or Facebook

04:56.600 --> 04:58.460
or LinkedIn or Twitter as well.

04:58.640 --> 05:02.020
So it should be able to learn from these examples and do a better job.

05:02.020 --> 05:06.220
And you can see here that this is just adding the examples to the prompt.

05:06.220 --> 05:11.470
This is the examples partial that we're adding in. Here we have the example, and then we have the YAML.

05:11.620 --> 05:14.110
And then we have the example two, example three and so on.
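
NOTE
A sketch of an examples partial kept separate from the main prompt, so the
examples can be swapped later. The Example fields and render_examples are
hypothetical names; real entries would be your own hand-written posts.
```python
from dataclasses import dataclass
@dataclass
class Example:
    """One hand-written example following the bait, hook, reward framework."""
    social_network: str
    insight: str
    bait: str
    hook: str
    reward: str
    post: str
def render_examples(examples: list[Example]) -> str:
    """Render numbered examples, each with its YAML-style fields."""
    blocks = []
    for i, ex in enumerate(examples, start=1):
        blocks.append(
            f"Example {i} ({ex.social_network}):\n"
            f"insight: {ex.insight}\n"
            f"bait: {ex.bait}\n"
            f"hook: {ex.hook}\n"
            f"reward: {ex.reward}\n"
            f"post: {ex.post}"
        )
    return "\n\n".join(blocks)
```
The rendered string can then be passed into the prompt as one more parameter
alongside the insight and the social network.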

05:14.500 --> 05:19.960
Now let's see how much better it does, feeding in the examples partial as a parameter as well as

05:19.960 --> 05:21.850
just the social network and insight.

05:21.850 --> 05:23.890
And I think it does a lot better.

05:23.890 --> 05:28.860
We got rid of most of the emojis and hashtags because it's learned, you know, I didn't include them

05:28.860 --> 05:34.920
in my examples, and it's, I think, a much higher caliber in terms of the quality.

05:34.920 --> 05:36.990
It feels more authentically human.

05:36.990 --> 05:37.860
"Here's the kicker:

05:37.860 --> 05:39.570
prompt engineering will still be vital." Right?

05:39.570 --> 05:40.740
That's pretty cool.

05:40.830 --> 05:42.120
I really like that.

05:42.120 --> 05:49.760
Now, what I want to get into now is the evaluation metrics, because you can see I'm still getting

05:49.760 --> 05:50.960
these emojis.

05:51.170 --> 05:58.400
So I could create a function to check for emojis here and then start to see like how many examples do

05:58.400 --> 06:03.650
I need to bring in in order to make sure that there are no emojis found?

06:03.650 --> 06:04.880
What's the emoji count?
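
NOTE
A sketch of a programmatic emoji check like the one mentioned here. The
Unicode ranges are an approximation that covers common emoji blocks, not
every emoji, and both function names are assumptions.
```python
import re
# Approximate ranges: pictographs and symbols, misc symbols and dingbats, flags.
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)
def count_emojis(text: str) -> int:
    """Count characters that fall inside the approximate emoji ranges."""
    return len(EMOJI_PATTERN.findall(text))
def emoji_free_rate(posts: list[str]) -> float:
    """Share of posts with no emoji, useful for comparing prompt variants."""
    if not posts:
        return 0.0
    return sum(count_emojis(p) == 0 for p in posts) / len(posts)
```
Running emoji_free_rate over a batch of outputs for each number of examples
gives a simple metric for how many examples are needed.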

06:05.180 --> 06:08.990
And that's where you start to get into evaluating quality.

06:08.990 --> 06:10.310
That's the fourth principle.

06:10.310 --> 06:15.650
But emojis, that's one thing to check for, but it's not necessarily going to be the most important

06:15.650 --> 06:16.250
thing.

06:16.250 --> 06:18.320
I want to evaluate whether this is going to be engaging.

06:18.320 --> 06:23.330
So I've given it an evaluation prompt, and this is usually what I do.

06:23.330 --> 06:26.180
Obviously you can have manual human responses.

06:26.180 --> 06:31.760
You can programmatically check performance, like this function that checks for emojis.

06:31.760 --> 06:37.430
But what I tend to find is that for most tasks, you need an LLM as a judge.

06:37.430 --> 06:42.580
And in this case, I'm just asking it: does it grab attention, does it have a hook, what's the

06:42.580 --> 06:43.630
reward, etc.

06:43.630 --> 06:48.310
So all the things I expect from a social post and then I'm passing in that context.

06:48.790 --> 06:53.650
What I'm asking it to do is to give the analysis and then the rating afterwards.

06:53.650 --> 06:58.840
Because by writing the analysis first, you're going to get a more consistent rating afterwards.

06:58.840 --> 07:01.720
And this is again using the chain of thought style response.
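
NOTE
A sketch of an LLM-as-judge prompt that asks for analysis first and the
rating last, plus a parser for the rating. The template wording, the
"Rating: N" line, and the 0 to 5 scale are assumptions; the talk only shows
ratings of zero, four, and five.
```python
import re
from typing import Optional
# Hypothetical evaluation template: wording and scale are assumptions.
EVAL_TEMPLATE = """Evaluate the social media post below.
Does it grab attention? Does it have a hook? What is the reward?
Write your analysis first, then finish with a line 'Rating: N' (0 to 5).
Post:
{post}"""
def build_eval_prompt(post: str) -> str:
    """Insert the post to be judged into the evaluation template."""
    return EVAL_TEMPLATE.format(post=post)
def parse_rating(judge_response: str) -> Optional[int]:
    """Return the last 'Rating: N' value in the response, or None if absent."""
    matches = re.findall(r"Rating:\s*(\d+)", judge_response)
    return int(matches[-1]) if matches else None
```
Taking the last match keeps the parser robust if the analysis itself happens
to mention a rating before the final line.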

07:01.720 --> 07:06.210
So let me just evaluate this on a basic output.

07:06.210 --> 07:10.260
I'm just going to see what it gives me in terms of an engagement score.

07:10.260 --> 07:14.850
So it's written all this information and then it said the rating is four.

07:15.030 --> 07:15.360
All right.

07:15.360 --> 07:20.970
Now that we have an evaluation prompt, that's usually our cue to start splitting things up.

07:20.970 --> 07:24.600
Because most production systems are not just one prompt, right?

07:24.600 --> 07:29.570
It's actually multiple prompts put together, and this is where we get into the principle of dividing

07:29.570 --> 07:30.230
labor.

07:30.500 --> 07:34.730
So I'm using async here to run these things.

07:34.730 --> 07:41.450
I'm just running asynchronously because I want to be able to generate multiple versions of this social

07:41.450 --> 07:44.000
post and then get a rating at the end.

07:44.090 --> 07:47.630
So I have this function, generate and rank desc.

07:47.630 --> 07:52.760
So, generate and rank descending, and all it does is just run everything we had above,

07:52.760 --> 07:55.060
but it's just doing it multiple times.

07:55.060 --> 08:00.850
So here I'm generating five posts asynchronously all at the same time, and then I'm gathering the results

08:00.850 --> 08:02.680
and then sorting them by rating.
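
NOTE
A sketch of the generate, evaluate, and rank flow using asyncio. The two stub
coroutines stand in for real model calls and return fake values; the name
generate_and_rank_desc is an assumed spelling of the function named in the talk.
```python
import asyncio
async def generate_post(insight: str, social_network: str, seed: int) -> str:
    """Stub for the generation prompt call; returns a placeholder draft."""
    await asyncio.sleep(0)
    return f"[{social_network} draft {seed}] {insight}"
async def evaluate_post(post: str) -> int:
    """Stub for the LLM judge; returns a fake rating from 0 to 5."""
    await asyncio.sleep(0)
    return sum(map(ord, post)) % 6
async def generate_and_rate(insight: str, social_network: str, seed: int) -> dict:
    """Run generation, then evaluation, for one candidate post."""
    post = await generate_post(insight, social_network, seed)
    rating = await evaluate_post(post)
    return {"post": post, "rating": rating}
async def generate_and_rank_desc(insight: str, social_network: str, n: int = 5) -> list:
    """Generate n candidates concurrently, then sort by rating, highest first."""
    tasks = [generate_and_rate(insight, social_network, i) for i in range(n)]
    results = await asyncio.gather(*tasks)
    return sorted(results, key=lambda r: r["rating"], reverse=True)
if __name__ == "__main__":
    ranked = asyncio.run(generate_and_rank_desc("Even geniuses need prompting.", "LinkedIn"))
    for item in ranked:
        print(item["rating"], item["post"])
```
Inside a Jupyter notebook an event loop is already running, so you would
write `await generate_and_rank_desc(...)` in a cell instead of asyncio.run.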

08:02.680 --> 08:10.270
So when we run this what we get is not just one prompt, but we have this evaluation step as well.

08:10.270 --> 08:16.570
It's running the prompt and then it's running the second step which is evaluation which is here.

08:16.570 --> 08:21.630
And then it's getting the ratings and ranking them, so maybe even three steps, you could call this.

08:21.630 --> 08:27.210
And what that does for us, which you can see, is it allows us to filter out bad examples.

08:27.210 --> 08:35.640
This one got a rating of zero because I think specifically I said anything that has hallucinations

08:35.640 --> 08:39.840
or made-up statistics should be rated zero, so we've dodged that bullet.

08:39.840 --> 08:40.170
Right.

08:40.170 --> 08:44.040
But then we have a few here that are rated five and then one that's rated four.

08:44.040 --> 08:48.060
So now we have the very best one at the top, which is really helpful.

08:48.150 --> 08:50.370
And we can make sure the quality is high

08:50.370 --> 08:53.610
when we have that split up into multiple sections.

08:53.760 --> 08:57.900
So that's the thing that I tend to do to get things into production.

08:57.900 --> 09:02.040
I would say now is the point where we have one or two evaluation metrics.

09:02.040 --> 09:07.220
We have the task split up into a couple of prompts, and we're using best practices in terms

09:07.220 --> 09:12.440
of giving direction, specifying the format or the framework to use, and then providing examples as

09:12.440 --> 09:12.800
well.

09:12.800 --> 09:15.950
This is probably all you need to do to get it working right.

09:15.950 --> 09:18.710
This prompt is, I think, working pretty well.

09:18.710 --> 09:25.310
And I would post this to my social networks, so that's already in a really good

09:25.310 --> 09:25.940
position.

09:25.940 --> 09:30.530
And then once you're in production, that's when you move on to more advanced optimization, which

09:30.530 --> 09:31.610
we'll talk about next.