WEBVTT

00:04.790 --> 00:10.040
I'm just going to walk you through a use case for AI and how I optimize the prompt.

00:10.040 --> 00:15.620
And the specific task was to write a PRD, or product requirements document.

00:15.620 --> 00:20.300
So if you're a product manager or you work with one, then this is a useful feature.

00:20.300 --> 00:25.370
But either way, even if you don't, you'll be able to see how we optimize a prompt, a real prompt that

00:25.370 --> 00:26.750
you would use in the world.

00:26.870 --> 00:32.750
So specifically, the PRD we want to write is for a feature that allows passengers to contact

00:32.780 --> 00:35.180
Lyft drivers directly to recover lost items.

00:35.180 --> 00:38.900
The reason I chose that is because there is actually one of these out there in the world, so you can

00:38.900 --> 00:39.620
compare it.

00:40.100 --> 00:43.310
So this is the ChatGPT response and it's pretty good.

00:43.340 --> 00:48.080
There are some issues with it though, but I won't dive into the details.

00:48.080 --> 00:53.510
You can have a look at this via a link in the presentation, but it's pretty comprehensive.

00:53.540 --> 00:54.830
It's well formatted.

00:54.830 --> 00:57.590
It's professionally written like you'd expect from AI.

00:57.620 --> 00:58.280
It's pretty good.

00:58.280 --> 01:01.700
It's also user-centric, which it's specifically good at.

01:01.700 --> 01:07.540
I think the AI in particular is good at making sure that we follow best practices in terms of gearing

01:07.540 --> 01:13.750
the feature towards the user experience, but there's a few things that are lacking compared to an actual

01:13.750 --> 01:14.770
product manager.

01:14.860 --> 01:16.330
One is there's not much creativity.

01:16.360 --> 01:21.490
It doesn't really draw from experience or make any niche references which you would expect from a real

01:21.490 --> 01:21.820
human.

01:22.180 --> 01:30.250
There's also no real attempt to morph it towards what works within that organization.

01:30.250 --> 01:37.270
So it's not in the right style that you would need to get it approved, typically because

01:37.270 --> 01:38.710
everyone is different.

01:38.770 --> 01:40.120
You can't expect ChatGPT to know that.

01:40.330 --> 01:42.520
But then the other thing is hallucinations.

01:42.700 --> 01:45.250
We don't want it to make up statistics and things like that.

01:45.280 --> 01:48.640
Here's how we're going to address all of these issues with our optimization.

01:49.120 --> 01:51.100
So these are the five principles of prompting.

01:51.130 --> 01:55.000
If you've been through the beginning of the course, you would have seen these: give direction, specify

01:55.000 --> 01:58.150
format, provide examples, evaluate quality, and divide labor.

01:58.150 --> 02:00.250
So I'm just going to walk you through these.

02:00.280 --> 02:02.050
I'll apply each one to the prompt.

02:02.050 --> 02:03.880
So the first thing is give direction.

02:03.880 --> 02:08.760
It's really important to understand what type of style you want.

02:08.790 --> 02:15.030
Do you want this in the style of a Fortune 500 executive, which, you know, is full of corporate speak and

02:15.060 --> 02:17.070
it doesn't say as much, but it's longer.

02:17.100 --> 02:20.550
Or do you want to say, like, in the style of Steve Jobs?

02:20.940 --> 02:26.580
And it doesn't have to be a specific famous person, but you want to give it some direction in terms

02:26.580 --> 02:31.500
of what type of style you want, and you can see how big of a difference that makes.

02:31.770 --> 02:33.780
The other thing is specify the format.

02:33.840 --> 02:37.950
So one thing I found really helpful is to ask it to use placeholders.

02:37.980 --> 02:42.450
In this case I asked it to use "insert" wherever there's a number.

02:42.450 --> 02:46.320
So I went through and edited that in the prompt, which you'll see.

02:46.800 --> 02:53.640
And that means that you just need to Ctrl+F and find the different placeholders.

02:53.640 --> 02:57.210
And then you can go and fill in those statistics yourself.

02:57.330 --> 02:58.860
So that's quite helpful.
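
NOTE A minimal sketch of the find-the-placeholders step, not the speaker's actual tooling.
It assumes the placeholder is the literal token INSERT (the talk only says "insert", so the
spelling is a guess) and that the generated PRD is available as plain text. The function name
and the sample draft are hypothetical.
```python
# Hypothetical token: the talk only says "insert", so the exact spelling is an assumption.
PLACEHOLDER = "INSERT"
def find_placeholders(draft: str, token: str = PLACEHOLDER) -> list[tuple[int, str]]:
    """Return (line_number, line_text) for each line that still contains the token."""
    return [
        (number, line.strip())
        for number, line in enumerate(draft.splitlines(), start=1)
        if token in line
    ]
# Made-up draft text, used only to exercise the function.
sample_draft = "Goal: cut lost-item tickets by INSERT%.\nScope: riders and drivers.\nBaseline: INSERT tickets per week."
for number, line in find_placeholders(sample_draft):
    print(f"line {number}: {line}")
```
Each reported line is a spot where a real statistic still has to be filled in by hand.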

02:59.010 --> 03:00.630
The other thing is just provide examples.

03:00.660 --> 03:02.460
That's actually one of the most important things.

03:02.460 --> 03:11.520
I chose this task specifically because we do have an example of someone actually writing a PRD for this

03:11.520 --> 03:12.030
task.

03:12.030 --> 03:16.200
And then I took that example and I asked ChatGPT to create another one like it.

03:16.230 --> 03:18.840
So instead it's this one.

03:18.840 --> 03:23.790
It's the same sort of format, but it's guided virtual tours with drone footage.

03:23.820 --> 03:30.780
So, for example, I took the real PRD and I asked ChatGPT to make a different version of the same

03:30.780 --> 03:32.430
one, with the same structure.

03:32.700 --> 03:37.350
And then I insert that into the prompt in order to get back to the initial task.

03:37.380 --> 03:41.340
The important thing, by the way, is that this needs to match the formatting that you've asked it to

03:41.370 --> 03:41.580
do.

03:41.610 --> 03:46.800
You can see here I swapped out one of the metrics with the insert placeholder.

03:48.810 --> 03:51.450
Evaluating quality is really important at this point as well.

03:51.480 --> 03:54.630
You want to check whether it's performing better or worse.

03:54.660 --> 04:03.900
What I did is I rated this example versus the traditional ChatGPT example, the default example that

04:03.900 --> 04:07.560
you get if you just ask it to do this task without any of the prompting.

04:07.710 --> 04:12.350
And I kept doing that until my prompt performed better than the original.

04:12.380 --> 04:18.680
You could also keep going and get it to where it performs better than the real human version as well.

04:18.680 --> 04:22.100
And you can A/B test this with another LLM.

04:22.130 --> 04:27.080
For example, I would just dump both in without any context in a new chat window and then ask it to

04:27.110 --> 04:28.100
rate these two.

04:28.340 --> 04:32.930
The other option is you could get a real human to rate them, which is more accurate.
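
NOTE A rough sketch of the context-free comparison described above, under some assumptions.
It only builds the text you would paste into a fresh chat window; it does not call any model.
Shuffling which version appears first is an addition of this sketch, not something the talk
mentions, and the function name, labels, and wording of the rating request are hypothetical.
```python
import random
def build_rating_prompt(default_output: str, optimized_output: str, rng: random.Random) -> tuple[str, dict[str, str]]:
    """Return a context-free rating prompt plus a key saying which label hides which version."""
    versions = [("default", default_output), ("optimized", optimized_output)]
    rng.shuffle(versions)  # assumption of this sketch: hide which prompt produced which document
    key = {"A": versions[0][0], "B": versions[1][0]}
    prompt = (
        "Rate these two product requirements documents from 1 to 10 and say which is better.\n"
        f"Document A:\n{versions[0][1]}\n"
        f"Document B:\n{versions[1][1]}\n"
    )
    return prompt, key
# Stand-in strings; in practice these would be the two full PRDs.
prompt, key = build_rating_prompt("PRD from the bare task", "PRD from the optimized prompt", random.Random(0))
print(prompt)
print(key)
```
Keep the key to yourself until the rating comes back, then look up which version won.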

04:33.890 --> 04:39.470
And then the main thing that you'll find is that you get to a point where you can't optimize much more

04:39.470 --> 04:42.350
without dividing the prompt up into different things.

04:42.380 --> 04:47.870
So one of the common ways of doing this is to add like a thinking section, a chain of thought.

04:47.900 --> 04:53.660
This is the technique that we're talking about here, where we're adding an additional step before

04:53.660 --> 05:01.100
we start the prompt, where we ask it to think through the solution first and then answer the prompt.

05:01.100 --> 05:06.740
And that tends to improve results, especially for reasoning tasks like this, where it needs to make some assumptions

05:06.770 --> 05:09.410
and figure out what would be the right thing to do.
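
NOTE One possible way to sketch the thinking step, assuming the model is asked to label its
output with the headings "Thinking:" and "PRD:". Those headings, the instruction wording, and
both function names are hypothetical; the talk does not say how the thinking section was marked.
```python
# Hypothetical instruction text; the talk does not give the exact wording.
THINKING_STEP = (
    "Before writing the PRD, think through the solution step by step under the heading 'Thinking:'. "
    "Then write the final document under the heading 'PRD:'."
)
def add_thinking_step(task_prompt: str) -> str:
    """Append the chain-of-thought instruction to an existing task prompt."""
    return f"{task_prompt}\n\n{THINKING_STEP}"
def split_response(response: str) -> tuple[str, str]:
    """Split a response into (thinking, prd); thinking is empty if the heading is missing."""
    thinking, found, prd = response.partition("PRD:")
    if not found:
        return "", response.strip()
    return thinking.replace("Thinking:", "", 1).strip(), prd.strip()
# Made-up response text, used only to show the split.
sample = "Thinking: riders need a fast channel to the driver.\nPRD:\nLost item contact feature"
print(split_response(sample))
```
Splitting on the first "PRD:" is a simplification; a response that uses that heading inside
its thinking would need a more distinctive marker.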

05:10.070 --> 05:12.740
So in the end, we ended up with a good prompt.

05:12.770 --> 05:18.040
I just wanted to share it with you. This was the actual example here from Lyft.

05:18.310 --> 05:19.420
This is a famous example.

05:19.420 --> 05:20.650
You can find it online.

05:20.680 --> 05:24.310
I've got a link in there, but you can see this is what a human example looks like.

05:24.310 --> 05:30.310
And it has some kind of specific things that are good about it in terms of it's very brief and to the

05:30.310 --> 05:34.780
point, which I like, but there are a few things you might want to improve.

05:34.810 --> 05:40.480
And in particular, one thing I find in general with these is I can't really describe what's good about

05:40.480 --> 05:40.960
this.

05:40.960 --> 05:46.540
So quite often what I'll do is I'll dump that human example into an LLM.

05:46.570 --> 05:52.420
Like in this case, I've dumped it into Claude, and I've asked it to describe the writing style of this

05:52.450 --> 05:54.730
PRD in a brief paragraph.

05:54.730 --> 05:58.450
And then I've used that as the instructions for my prompt.

05:58.600 --> 06:01.360
So that was like the give direction step.

06:01.390 --> 06:04.690
Then I've also generated some test cases as well.

06:04.690 --> 06:10.750
So this is where I came up with new ideas, like in this case a PRD for a feature that allows Airbnb hosts

06:10.750 --> 06:12.700
to offer guided virtual tours.

06:12.880 --> 06:18.330
So this is how you can come up with good test examples or good examples to put into your prompt

06:18.330 --> 06:19.080
itself.

06:19.920 --> 06:24.630
And then you can run the prompt again on the new examples.

06:24.660 --> 06:28.320
So first we run the code for the Airbnb hosts virtual tours.

06:28.440 --> 06:30.450
Then we get a pretty good response.

06:30.480 --> 06:35.790
You can see this is much closer to what we get from the human, except we've got the thinking section

06:35.790 --> 06:38.280
at the beginning here, and we've got the inserts.

06:38.280 --> 06:42.600
And then it's just brief and to the point, which is exactly what we wanted.

06:42.630 --> 06:45.300
And it's a lot more creative as well.

06:45.360 --> 06:51.360
I think it's much closer to what I would expect from a real person.

06:52.470 --> 06:52.920
Cool.

06:53.160 --> 06:54.540
So we solved the creativity issue.

06:54.570 --> 07:00.510
We kept the good formatting and the professional, user-centric approach, but now it's much closer

07:00.510 --> 07:02.610
to our culture, which is what we want.

07:02.820 --> 07:08.550
It's this brief, to-the-point, factual type of response, and we also solved the hallucination problem

07:08.580 --> 07:09.330
as well.

07:09.690 --> 07:14.130
So here's the optimized prompt and I'll share that with you guys so you can take a look.

07:14.160 --> 07:20.850
So here's the synthetic example, where we, you know, generated a fake version for the Airbnb

07:20.880 --> 07:21.660
hosts.

07:21.690 --> 07:29.460
And then this is how we rated the prompts as well, and used the rating to inform how we should A/B test

07:29.580 --> 07:33.360
the previous prompts and the different techniques we used here.

07:33.540 --> 07:36.840
These are the specific ones that we found useful.

07:36.870 --> 07:38.190
One was role prompting.

07:38.220 --> 07:42.870
We asked it to behave as a product manager for a major tech company.

07:42.900 --> 07:44.010
Then style unbundling.

07:44.010 --> 07:48.540
We talked about this where we took the human example and asked it what was the type of style this was

07:48.540 --> 07:51.660
written in, and then used those instructions in our prompt.

07:51.690 --> 07:53.430
I also used emotion prompting.

07:53.460 --> 07:56.670
So I put in there, "My manager will fire me if you make up any statistics."

07:56.670 --> 08:00.780
So that got it to more reliably use the insert method.

08:00.810 --> 08:05.550
And then we added the example based on the real human example we talked about.

08:05.550 --> 08:08.850
And then we also did chain of thought, which was the planning step.
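
NOTE A hedged sketch of how the tactics listed above might be combined into one prompt.
This is not the speaker's actual optimized prompt, which is shared separately. The role line
and the "fire me" sentence come from the talk; every other sentence, the INSERT spelling, and
the function and parameter names are assumptions made for illustration.
```python
def build_prd_prompt(feature: str, style_description: str, example_prd: str) -> str:
    """Assemble the tactics from the talk into one prompt string (illustrative wording only)."""
    sections = [
        "You are a product manager for a major tech company.",  # role prompting
        f"Write in this style: {style_description}",  # style unbundling
        "Use INSERT wherever a number or statistic would go. "
        "My manager will fire me if you make up any statistics.",  # placeholders plus emotion prompting
        f"Here is an example PRD to follow:\n{example_prd}",  # provide examples
        "First think through the solution, then write the PRD.",  # chain of thought
        f"Task: write a PRD for {feature}.",
    ]
    return "\n\n".join(sections)
# Stand-in arguments; the style text would come from asking an LLM to describe the human PRD.
print(build_prd_prompt("a lost-item contact feature", "brief, factual, to the point", "Example PRD text"))
```
Keeping each tactic as its own section makes it easy to drop one and re-rate the output,
which matches the point that every tactic has to be tested.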

08:08.970 --> 08:10.170
These are the tactics that worked.

08:10.200 --> 08:11.790
Sometimes these don't work.

08:11.820 --> 08:13.260
You've got to test each one of them.

08:13.290 --> 08:18.660
There's hundreds of prompt engineering tactics, but it really just takes a lot of trial and error.

08:19.440 --> 08:23.670
So hopefully that gives you a bit of an insight into how the work of prompt engineering is done.