WEBVTT

00:00.000 --> 00:03.090
-: All right, let's talk Prompt Optimization.

00:03.090 --> 00:05.730
So we're gonna define a couple of prompts,

00:05.730 --> 00:07.110
and we're going to A/B test them

00:07.110 --> 00:08.997
and see how well they do.

00:08.997 --> 00:11.220
And this is really key, I think,

00:11.220 --> 00:14.880
when you start talking about prompt engineering.

00:14.880 --> 00:19.080
You're really talking about actually optimizing your prompt

00:19.080 --> 00:21.300
and proving that it works in production,

00:21.300 --> 00:22.710
finding all those edge cases.

00:22.710 --> 00:24.270
Because you don't always get that

00:24.270 --> 00:26.400
when you're just playing around in ChatGPT, right?

00:26.400 --> 00:29.940
So, I think this is really the next level

00:29.940 --> 00:31.380
that you can start to get to.

00:31.380 --> 00:32.610
And once you start to learn

00:32.610 --> 00:34.710
about the benefits of prompt optimization,

00:34.710 --> 00:36.810
I think then, you start to really unlock

00:36.810 --> 00:40.410
a whole new kind of tier of performance.

00:40.410 --> 00:43.072
So let me just run this.

00:43.072 --> 00:43.905
(mouse clicks)

00:43.905 --> 00:46.920
What that's gonna do is it's gonna use getpass

00:46.920 --> 00:50.920
to help me just paste in my API key

00:51.810 --> 00:55.410
without, you know, actually just sharing it with the world.

00:55.410 --> 00:57.060
So I recommend using that.

00:57.060 --> 00:58.740
And then that's stored locally.

00:58.740 --> 01:00.960
So I'm just gonna run this,

01:00.960 --> 01:03.390
which is to install OpenAI,

01:03.390 --> 01:06.540
and that's the only library we really need.

01:06.540 --> 01:10.863
And we're gonna be using the openai.ChatGPTCompletion.

01:12.990 --> 01:17.310
So we're gonna actually be calling GPT-3.5 Turbo

01:17.310 --> 01:18.660
to test this.

01:18.660 --> 01:20.730
Typically, I would say prompt optimization

01:20.730 --> 01:22.710
is really important for smaller models,

01:22.710 --> 01:24.780
like, if you're using GPT-4,

01:24.780 --> 01:27.630
it tends to work a lot better, but that's very expensive.

01:27.630 --> 01:29.880
And if you can get the prompt to work on GPT-3.5,

01:29.880 --> 01:32.733
then you're gonna save like a hundred, you know,

01:32.733 --> 01:34.530
like 99% of your costs.

01:34.530 --> 01:38.487
So now, these are the two prompts that we wanted to test.

01:38.487 --> 01:40.380
And the reason why we want to test them,

01:40.380 --> 01:44.010
so this is, you know, product name generator,

01:44.010 --> 01:46.050
and you know, if we're gonna be doing this

01:46.050 --> 01:49.230
hundreds of times, you know, maybe we're building a product

01:49.230 --> 01:52.950
that helps you generate product names, kind of matter,

01:52.950 --> 01:56.550
then it makes a big difference in terms of token costs

01:56.550 --> 01:58.950
and also, whether you can get it working with 3.5.

01:58.950 --> 02:02.100
So prompt B and prompt A are the same.

02:02.100 --> 02:05.820
The only difference is that prompt B has two examples.

02:05.820 --> 02:09.480
So it's a few shot versus a zero shot prompt.

02:09.480 --> 02:11.070
And we're gonna actually test and see

02:11.070 --> 02:13.620
if adding those examples really helps.

02:13.620 --> 02:16.590
And the reason why that's important is,

02:16.590 --> 02:17.880
if we don't need to have them in there,

02:17.880 --> 02:18.840
let's get rid of them.

02:18.840 --> 02:21.750
I see a lot of this, like, prompt witchcraft, right?

02:21.750 --> 02:25.290
Spell casting, some people call it blind prompting,

02:25.290 --> 02:27.630
is how I've heard it described,

02:27.630 --> 02:29.640
where you don't actually really test the prompt,

02:29.640 --> 02:31.320
you just keep adding more and more stuff

02:31.320 --> 02:33.090
until it becomes a big essay.

02:33.090 --> 02:36.930
And I would bet you that most of that stuff isn't necessary,

02:36.930 --> 02:40.650
it isn't followed and maybe even leads to worse results.

02:40.650 --> 02:42.780
So it's really important to test.

02:42.780 --> 02:45.360
All right, so we're gonna test these two prompts.

02:45.360 --> 02:50.130
We're gonna see if adding the examples really helps,

02:50.130 --> 02:51.900
and we could test other parts of this, right?

02:51.900 --> 02:56.160
We could test whether it works well for multiple seed words.

02:56.160 --> 02:57.390
It doesn't have to be performance.

02:57.390 --> 03:00.270
So it could be, you know, do we get bad language?

03:00.270 --> 03:02.580
You know, if we put like a swear word and the C words,

03:02.580 --> 03:05.250
like, you can test this for all sorts of things.

03:05.250 --> 03:06.780
So lemme show the interface.

03:06.780 --> 03:09.257
So we just get the prompts together in this test prompts,

03:09.257 --> 03:12.270
we can actually add multiple prompts if we wanted to.

03:12.270 --> 03:14.130
And we're just gonna import Pandas

03:14.130 --> 03:17.490
and OpenAI Pandas is the data frame.

03:17.490 --> 03:19.620
We're going to store all the tests in

03:19.620 --> 03:21.450
and then do some reporting.

03:21.450 --> 03:25.972
So this is just the standard way that you talk to GPT-3.5.

03:25.972 --> 03:29.880
I'm just using the standard, you know, system message.

03:29.880 --> 03:31.800
You could obviously use something different.

03:31.800 --> 03:34.863
And then, we're just returning the actual text here.

03:35.910 --> 03:39.170
So, we have the test prompts, we have the responses,

03:39.170 --> 03:41.100
so we're gonna save them.

03:41.100 --> 03:44.190
We're gonna run five tests per test prompt,

03:44.190 --> 03:47.760
and then, we're gonna evaluate them afterwards.

03:47.760 --> 03:49.950
So this is what runs the test.

03:49.950 --> 03:51.660
And then, we're gonna store them in the data frame,

03:51.660 --> 03:53.160
and we're gonna print the df,

03:53.160 --> 03:56.610
we're gonna save it to responses.csv.

03:56.610 --> 03:58.174
So let me just run this.

03:58.174 --> 04:00.757
(mouse clicks)

04:01.740 --> 04:05.883
And what this is doing is calling OpenAI multiple times.

04:07.743 --> 04:09.480
(clears throat)

04:09.480 --> 04:11.380
So it's gonna just take a few minutes.

04:12.240 --> 04:14.433
It's calling it 10 times for each prompt.

04:21.240 --> 04:22.893
Should wait for that to finish.

04:24.360 --> 04:26.610
Should only take a few more seconds, I think.

04:28.800 --> 04:30.450
One thing I should point out is,

04:30.450 --> 04:35.450
every time we're calling this, saving the variant name,

04:35.580 --> 04:37.533
the prompt, and the response we got.

04:38.400 --> 04:41.550
And the variant name, we're just calling it, the first one A

04:41.550 --> 04:44.880
and then, this will find B for the next one,

04:44.880 --> 04:46.023
C for the next one.

04:46.890 --> 04:47.723
Here we go.

04:47.723 --> 04:50.343
So, we have all the different responses, which is cool.

04:51.330 --> 04:53.310
Oh, one thing I would say as well,

04:53.310 --> 04:55.140
if you're doing this for lots of prompts,

04:55.140 --> 04:59.040
so many, many different, you know, calls,

04:59.040 --> 05:02.190
then, you might want to add some retry logic

05:02.190 --> 05:03.357
in here as well.

05:03.357 --> 05:06.690
All right. So now, we're gonna read the CSV,

05:06.690 --> 05:07.710
responses.csv.

05:07.710 --> 05:11.190
Actually, we can see it, it's in here.

05:11.190 --> 05:15.060
Yeah, responses.csv. So we can grab that.

05:15.060 --> 05:18.210
And then, we're just gonna shuffle the index.

05:18.210 --> 05:23.070
So it's really important when we evaluate the responses

05:23.070 --> 05:25.410
that we're, you know, doing it blind,

05:25.410 --> 05:27.300
like, we don't know which prompt it came from,

05:27.300 --> 05:28.770
so we didn't skew the results,

05:28.770 --> 05:30.180
but also that they're shuffled.

05:30.180 --> 05:32.040
So again, we can't, you know,

05:32.040 --> 05:33.570
we didn't know which one it came from.

05:33.570 --> 05:36.180
Otherwise, you know, you can see that like,

05:36.180 --> 05:38.880
these all came from the first prompt

05:38.880 --> 05:40.260
and like, these all came from the second.

05:40.260 --> 05:43.230
So, you know, it'd be too easy, otherwise.

05:43.230 --> 05:45.900
All right. So what does this code do?

05:45.900 --> 05:48.750
This just sets up an interface where we can quickly

05:48.750 --> 05:50.880
click thumbs up, thumbs down,

05:50.880 --> 05:53.250
and it just uses the IPython widgets.

05:53.250 --> 05:58.230
So this only works in a Jupyter Notebook like Google Colab.

05:58.230 --> 06:03.230
So what it's doing is, it gives user feedback of one,

06:03.690 --> 06:06.480
if there's a thumbs up and zero if not.

06:06.480 --> 06:11.480
And then, we are updating the response to load the next,

06:11.640 --> 06:13.830
you know, response from the CSV.

06:13.830 --> 06:18.830
And then, if we've run out of responses,

06:20.310 --> 06:23.280
then, we save results.csv,

06:23.280 --> 06:26.700
and then, we do some printing of the feedback

06:26.700 --> 06:28.620
and kind of give us the the mean.

06:28.620 --> 06:31.440
All right, so this is the update response.

06:31.440 --> 06:35.010
This basically just gets the next thing in the index,

06:35.010 --> 06:37.200
the next response until we're all done.

06:37.200 --> 06:41.040
And then, it just does some formatting to kind of show you,

06:41.040 --> 06:42.240
you know, this in the interface.

06:42.240 --> 06:44.250
So, this is all setting up,

06:44.250 --> 06:46.140
like, the thumbs up button and all that stuff.

06:46.140 --> 06:47.972
So, don't worry too much about that.

06:47.972 --> 06:50.310
So we're gonna run this. Here we go.

06:50.310 --> 06:54.210
So we have it. So Adapt-a-fit, OmniShoe, FitAll, FitFlex.

06:54.210 --> 06:55.260
So I think that's good.

06:55.260 --> 06:56.400
What I'm gonna test for here

06:56.400 --> 06:58.500
is whether it's in the right format.

06:58.500 --> 07:01.800
So, forget about the quality of the names right now.

07:01.800 --> 07:05.070
I just wanna make sure, are we getting the right format?

07:05.070 --> 07:07.860
And this is the format I want, right?

07:07.860 --> 07:12.860
I want a inline comma separated list. So that one's good.

07:13.020 --> 07:15.030
This one's the same. That's good.

07:15.030 --> 07:17.220
Okay, that's different names,

07:17.220 --> 07:19.410
but it's still in the right format.

07:19.410 --> 07:21.090
Okay, now this is wrong.

07:21.090 --> 07:23.310
Here we go. We have a failure. (chuckles)

07:23.310 --> 07:24.240
Perfect.

07:24.240 --> 07:26.190
So, it's a numbered list in order,

07:26.190 --> 07:27.360
and that doesn't make any sense.

07:27.360 --> 07:29.190
And if we were trying to pass this,

07:29.190 --> 07:30.300
that would really suck, right?

07:30.300 --> 07:34.620
Like, when we're trying to use this programmatically,

07:34.620 --> 07:37.320
then sometimes, it comes through like this, you know,

07:37.320 --> 07:39.810
it's not gonna pass properly.

07:39.810 --> 07:42.324
So you say no and then no for this one,

07:42.324 --> 07:44.970
and then no for this one, no for this one.

07:44.970 --> 07:49.110
Okay, this one's good. This one's good. No for this one.

07:49.110 --> 07:51.630
So it looks like sometimes it's coming through.

07:51.630 --> 07:55.680
Here we go. Yeah, look, prompt B, score of a 100%.

07:55.680 --> 07:57.780
Prompt A, score of 0%, right?

07:57.780 --> 08:00.240
So that's a real problem, right?

08:00.240 --> 08:05.240
Like by, you know, not including the examples,

08:05.970 --> 08:08.880
it's not following the, you know,

08:08.880 --> 08:10.680
the format that we want, right?

08:10.680 --> 08:13.490
So it does look like we need those examples in there

08:13.490 --> 08:15.330
to get the right format.

08:15.330 --> 08:17.910
And yeah, we wouldn't have known that if we didn't test it.

08:17.910 --> 08:20.790
I didn't think it was gonna be a 100% right.

08:20.790 --> 08:23.010
It's a real problem.

08:23.010 --> 08:23.970
Cool. All right.

08:23.970 --> 08:25.830
But now that you know how this works,

08:25.830 --> 08:28.290
like, you could really test it for everything.

08:28.290 --> 08:30.210
You know, you could test it for quality,

08:30.210 --> 08:32.280
you could actually add ratings if you want in here

08:32.280 --> 08:34.710
instead of just thumbs up and thumbs down.

08:34.710 --> 08:36.600
You know, the world is your oyster.

08:36.600 --> 08:38.490
And hopefully, you'll be testing

08:38.490 --> 08:41.520
all of your different elements or your prompts, you know,

08:41.520 --> 08:44.940
instead of just taking on blind faith that it's necessary

08:44.940 --> 08:45.903
or even helping.