WEBVTT

00:00.450 --> 00:04.710
-: Okay, using DSPy is quite good for optimizing prompts,

00:04.710 --> 00:06.120
but typically, you need

00:06.120 --> 00:09.120
to set up your own evaluation framework first,

00:09.120 --> 00:11.490
and that's actually the hardest thing,

00:11.490 --> 00:14.580
not writing the prompts, but designing the evaluator,

00:14.580 --> 00:17.130
particularly for tasks where it's not as straightforward,

00:17.130 --> 00:18.720
where you can't just calculate a score.

00:18.720 --> 00:21.540
You need to actually create some sort

00:21.540 --> 00:22.980
of synthetic evaluation

00:22.980 --> 00:25.650
where the AI will tell you whether the joke is funny or not

00:25.650 --> 00:29.199
because you can't run DSPy with a human manually judging.

00:29.199 --> 00:32.640
(chuckles) You have to have a labeled test set.

00:32.640 --> 00:36.180
So what I did, this is an approach I use quite often,

00:36.180 --> 00:39.330
this is just a list of funny jokes scraped

00:39.330 --> 00:42.000
from the internet, and then I just asked ChatGPT

00:42.000 --> 00:44.370
to make a list of not funny jokes.

00:44.370 --> 00:45.543
Here's an example.

00:47.100 --> 00:49.140
And so we have funny jokes and not funny jokes,

00:49.140 --> 00:52.380
and then I just create a test set.

00:52.380 --> 00:56.160
So if you see here, don't worry too much about this code,

00:56.160 --> 00:58.920
but basically, it just creates a DataFrame

00:58.920 --> 01:01.230
and this here is the to_csv call for each one.

01:01.230 --> 01:03.780
We grab all of the jokes,

01:03.780 --> 01:07.020
we zip them together, and then we shuffle them

01:07.020 --> 01:09.030
and then split them into test,

01:09.030 --> 01:12.600
into training data, 177 results,

01:12.600 --> 01:14.460
testing data, so that's what we're using

01:14.460 --> 01:16.320
to optimize with DSPy,

01:16.320 --> 01:18.360
and then development data, that's how we're checking

01:18.360 --> 01:22.080
that DSPy hasn't over-optimized on the test set,

01:22.080 --> 01:25.830
and that the actual solution is more generalizable

01:25.830 --> 01:27.960
to things it hasn't seen before.
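
NOTE
A minimal sketch of the labeling, shuffling, and three-way split described
above, using only the Python standard library. The placeholder jokes, the
split fractions, and the helper name split_examples are assumptions for
illustration; the notebook in the video builds a pandas DataFrame and uses
its own split sizes.
```python
import random
def split_examples(funny, not_funny, seed=0, train_frac=0.6, test_frac=0.2):
    """Label jokes (1 = funny, 0 = not funny), shuffle, and split three ways."""
    # Each joke is a (topic, joke) pair; attach the label as a third field.
    rows = [(topic, joke, 1) for topic, joke in funny]
    rows += [(topic, joke, 0) for topic, joke in not_funny]
    # A seeded shuffle keeps the split reproducible between runs.
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_test = int(len(rows) * test_frac)
    train = rows[:n_train]
    test = rows[n_train:n_train + n_test]
    dev = rows[n_train + n_test:]  # held out to check for overfitting
    return train, test, dev
# Hypothetical placeholder data, not the jokes from the video.
funny = [(f"topic {i}", f"funny joke {i}") for i in range(10)]
not_funny = [(f"topic {i}", f"flat joke {i}") for i in range(10)]
train, test, dev = split_examples(funny, not_funny)
print(len(train), len(test), len(dev))
```
The dev slice is never shown to the optimizer, which is what makes it a
fair check of whether the result generalizes.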

01:27.960 --> 01:30.000
Cool, and you can see that each one has got a topic

01:30.000 --> 01:30.990
and a joke.

01:30.990 --> 01:32.430
Now, one thing that's interesting,

01:32.430 --> 01:35.970
by the way, is these initial ones here.

01:35.970 --> 01:37.590
When I downloaded them off the internet,

01:37.590 --> 01:39.150
they didn't have a topic, obviously.

01:39.150 --> 01:40.980
What I did is I just asked ChatGPT

01:40.980 --> 01:43.140
to create the topic for each joke.

01:43.140 --> 01:46.590
It's a way to, you know, if you have specific inputs

01:46.590 --> 01:48.720
that you need, you can just reverse engineer them

01:48.720 --> 01:50.130
from the outputs.

01:50.130 --> 01:53.130
It's another little trick, (chuckles) which is quite fun.

01:53.130 --> 01:56.400
All right, now, if we go through,

01:56.400 --> 01:58.920
and now we have the funny jokes labeled as 1

01:58.920 --> 02:01.200
and the not funny jokes labeled as 0.

02:01.200 --> 02:04.470
We can create, we can set up our DSPy,

02:04.470 --> 02:06.510
and there's some environment setup here,

02:06.510 --> 02:10.470
so I have GPT-3.5 Turbo and GPT-4 in case I need it,

02:10.470 --> 02:13.500
and what I'm gonna use is GPT-4

02:13.500 --> 02:16.410
to teach GPT-3.5 how to do this task.

02:16.410 --> 02:18.870
Because if we can get GPT-3.5 checking

02:18.870 --> 02:20.010
whether the joke is funny or not,

02:20.010 --> 02:23.400
then that's much cheaper when we're actually optimizing,

02:23.400 --> 02:26.220
because every time we optimize or test a joke,

02:26.220 --> 02:27.480
we're gonna have to run this,

02:27.480 --> 02:29.820
and we don't want to have to rely on GPT-4.

02:29.820 --> 02:33.570
So in DSPy, you create a class, a signature,

02:33.570 --> 02:36.570
and we just show that we have the joke and we have the topic

02:36.570 --> 02:39.210
and we have a question to assess the joke against,

02:39.210 --> 02:41.190
and then the answer, and in this case,

02:41.190 --> 02:43.837
the question is just, "Would this joke actually be funny

02:43.837 --> 02:45.690
"to an adult attending a comedy show?"

02:45.690 --> 02:48.630
I found that prompt is what works best

02:48.630 --> 02:50.340
to get a reliable result.

02:50.340 --> 02:52.260
So that's my bit of prompting.

02:52.260 --> 02:53.700
And then we can just test it,

02:53.700 --> 02:55.650
so we feed in a topic and a joke

02:55.650 --> 02:58.350
and then we have an assessing the joke chain,

02:58.350 --> 03:01.327
and it gives you rationale, "need to consider the joke

03:01.327 --> 03:02.497
"about heavy drinking and lighting candles

03:02.497 --> 03:04.320
"on a birthday cake would be appropriate and humorous,"

03:04.320 --> 03:06.210
blah, blah, and then here, it said no,

03:06.210 --> 03:07.440
so it didn't do a good job.

03:07.440 --> 03:08.730
I actually think this joke is funny

03:08.730 --> 03:10.530
but it doesn't think it's funny.

03:10.530 --> 03:12.750
And so we create a metric,

03:12.750 --> 03:17.750
and then we can evaluate based on the existing dev set,

03:18.720 --> 03:20.460
so that, remember, if we looked at it before,

03:20.460 --> 03:22.830
we had training data, we had test data,

03:22.830 --> 03:24.060
and then we had development data,

03:24.060 --> 03:27.000
so development is basically our fair test

03:27.000 --> 03:29.130
of whether it's doing well or not.

03:29.130 --> 03:32.430
And if we look, it gets 46% of them correct.

03:32.430 --> 03:34.890
And you can see here that it answered yes,

03:34.890 --> 03:38.340
that joke is funny, and then it was actually true.

03:38.340 --> 03:40.830
Here it said yes, but actually, it was false.

03:40.830 --> 03:42.000
It's making a lot of errors.

03:42.000 --> 03:44.130
It's only got it 46% correct.
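
NOTE
A simplified sketch of the kind of metric and accuracy score being
discussed: compare the model's yes/no answer with the 1/0 label and
report the percentage that agree. The function names, the toy dev set,
and the always-yes baseline are assumptions for illustration. A DSPy
metric receives an example and a prediction object instead of plain
values, so treat this as the idea rather than the notebook's code.
```python
def joke_metric(gold_label, predicted_answer):
    """Return True when a yes/no answer agrees with the 1/0 label."""
    says_funny = predicted_answer.strip().lower().startswith("yes")
    return says_funny == (gold_label == 1)
def accuracy(examples, predict):
    """Percentage of (joke, label) examples that predict() gets right."""
    correct = sum(joke_metric(label, predict(joke)) for joke, label in examples)
    return 100.0 * correct / len(examples)
# Hypothetical dev set and a naive baseline that answers "Yes" to everything.
dev_set = [("joke a", 1), ("joke b", 0), ("joke c", 1), ("joke d", 0)]
def always_yes(joke):
    return "Yes"
print(f"{accuracy(dev_set, always_yes):.1f}% correct")
```
On a balanced set, a baseline that always answers yes lands at 50%, so an
untuned score below that suggests the evaluator is doing worse than guessing.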

03:44.130 --> 03:45.330
So how do we train it?

03:45.330 --> 03:49.470
Here, we're doing bootstrap with random search,

03:49.470 --> 03:51.540
and I'll just explain what these mean,

03:51.540 --> 03:54.000
so we're loading in the training data set

03:54.000 --> 03:57.360
and the testing data set, it's gonna use them to optimize,

03:57.360 --> 04:00.030
and then we're just giving it a few parameters here.

04:00.030 --> 04:02.730
We're giving it the metric, the evaluation metric,

04:02.730 --> 04:06.120
and then, that was the prompt that we came up with earlier,

04:06.120 --> 04:08.340
and then we're telling it how many bootstrap demos,

04:08.340 --> 04:10.890
this is how many examples the teacher model will generate,

04:10.890 --> 04:13.290
so GPT-4 will work through a joke from the training data

04:13.290 --> 04:16.827
and then give that to GPT-3.5 to put into its prompt,

04:16.827 --> 04:19.260
and so we give it up to eight examples

04:19.260 --> 04:22.860
that GPT-4 has worked through that pass the evaluation metric.

04:22.860 --> 04:23.990
And then we have labeled demos,

04:23.990 --> 04:25.590
so this is how many examples

04:25.590 --> 04:28.260
from our training data can we put into the,

04:28.260 --> 04:30.150
how many jokes from our training data we can put

04:30.150 --> 04:32.040
into the prompt, so there's 16 here.

04:32.040 --> 04:33.630
These are just values that I found

04:33.630 --> 04:35.070
that have worked pretty well.

04:35.070 --> 04:36.150
You'll get different results

04:36.150 --> 04:38.910
for different tasks if you set different amounts.

04:38.910 --> 04:40.950
Then, we have teacher settings here,

04:40.950 --> 04:43.890
this is where we tell it we wanna use GPT-4 Turbo

04:43.890 --> 04:47.190
as the teacher, rather than GPT-3.5.

04:47.190 --> 04:50.337
And then we have number of candidate programs, 16,

04:50.337 --> 04:53.370
and this is basically how many of the combinations

04:53.370 --> 04:54.930
to track during the random search.
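
NOTE
A toy sketch of the random-search idea just described: build several
candidate sets of demos, score each one, and keep the best. This is not
DSPy's implementation. The names random_search and toy_score and the
data are hypothetical, and a real scorer would call the language model
with the demos in its prompt.
```python
import random
def random_search(train, val, score, num_candidates=16, max_demos=4, seed=0):
    """Try random demo subsets and keep the one that scores best on val."""
    rng = random.Random(seed)
    best_demos, best_score = [], score([], val)  # zero-shot baseline
    for _ in range(num_candidates):
        demos = rng.sample(train, k=rng.randint(1, max_demos))
        candidate_score = score(demos, val)
        if candidate_score > best_score:
            best_demos, best_score = demos, candidate_score
    return best_demos, best_score
# Toy scorer (an assumption for illustration): a "program" is correct on a
# validation item when one of its demos shares that item's label.
def toy_score(demos, val):
    seen = {label for _, label in demos}
    return sum(label in seen for _, label in val) / len(val)
train = [(f"joke {i}", i % 2) for i in range(20)]
val = [(f"val joke {i}", i % 2) for i in range(10)]
best_demos, best_score = random_search(train, val, toy_score)
print(len(best_demos), best_score)
```
More candidates means more scoring passes over the data, which is why the
real run makes so many model calls.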

04:54.930 --> 04:57.630
So if we run this, normally it takes really long,

04:57.630 --> 05:01.050
maybe 15 minutes, 10 minutes, something like that,

05:01.050 --> 05:03.780
but this should have been cached locally,

05:03.780 --> 05:06.573
so hopefully it doesn't take very long.

05:07.410 --> 05:09.330
It looks like it's actually running it again

05:09.330 --> 05:10.163
for some reason.

05:10.163 --> 05:12.330
Maybe I accidentally deleted my cache. (chuckles)

05:12.330 --> 05:13.230
Every time you run this,

05:13.230 --> 05:14.970
it's gonna take a few hundred examples.

05:14.970 --> 05:19.970
You can see there's 177 calls right now that it's doing,

05:20.310 --> 05:22.860
and it's gonna give you the average metric.

05:22.860 --> 05:24.240
So I'm just gonna pause this recording

05:24.240 --> 05:26.240
and wait for it and then I'll come back.

05:27.870 --> 05:29.400
Mm-kay, this is still running.

05:29.400 --> 05:30.930
That's taking some time. (chuckles)

05:30.930 --> 05:33.480
But you can see, in some cases,

05:33.480 --> 05:34.950
it's getting a good response.

05:34.950 --> 05:36.780
I'm just gonna keep going through this

05:36.780 --> 05:37.890
and we'll come back to that,

05:37.890 --> 05:41.160
but essentially, once you have this compiled

05:41.160 --> 05:43.500
and it's found the best combination,

05:43.500 --> 05:45.300
then you can run it like this,

05:45.300 --> 05:46.470
so you give it the joke

05:46.470 --> 05:47.730
and you can see actually, in this case,

05:47.730 --> 05:50.970
from a previous run, it did actually check,

05:50.970 --> 05:54.210
it confirmed that the joke was good,

05:54.210 --> 05:55.530
so that's really helpful.

05:55.530 --> 05:58.920
And you can run the evaluate based on the full program.

05:58.920 --> 06:00.780
You can see how, on average, it got it,

06:00.780 --> 06:05.100
and you can see here that the success rate was 79.5%

06:05.100 --> 06:09.420
so it was up considerably from the 46% we had before.

06:09.420 --> 06:12.510
It was, you know, basically almost double the accuracy

06:12.510 --> 06:15.990
in terms of getting it correct, which is really helpful.

06:15.990 --> 06:18.780
But I wanted to show you another option here,

06:18.780 --> 06:20.070
which is called MIPRO,

06:20.070 --> 06:24.600
and what this allows you to do is not just add good examples

06:24.600 --> 06:26.520
to the dataset, but also come up

06:26.520 --> 06:28.590
with new prompt instructions.

06:28.590 --> 06:31.890
So here, the prompt model is gonna be GPT-4,

06:31.890 --> 06:34.230
and then the task model's gonna be GPT-3.5,

06:34.230 --> 06:36.210
it's gonna come up with new prompts for you,

06:36.210 --> 06:39.450
as well as add different prompts to the,

06:39.450 --> 06:41.790
sorry, different examples of jokes to the prompt,

06:41.790 --> 06:43.800
and here we've got the same settings.

06:43.800 --> 06:44.670
Should get the same results,

06:44.670 --> 06:47.340
but it's also gonna optimize the prompt for us.

06:47.340 --> 06:49.620
And this actually takes a lot of calls so you can,

06:49.620 --> 06:51.030
I'm not gonna run this again,

06:51.030 --> 06:54.480
it's 5,000 calls for this one, and so there's a lot there,

06:54.480 --> 06:58.350
but once I ran it, then you can get an idea.

06:58.350 --> 07:01.260
It's just printing out the different prompts,

07:01.260 --> 07:02.940
so I'm gonna zoom through those.

07:02.940 --> 07:06.420
You can see like how it makes its decisions in there.

07:06.420 --> 07:10.800
But the final score, which is really interesting, is 84%,

07:10.800 --> 07:13.830
so we got another few percentage points difference.

07:13.830 --> 07:17.070
Tuning the prompt did make a bit of a difference,

07:17.070 --> 07:19.410
whereas just adding the examples, I think,

07:19.410 --> 07:20.850
made the most difference.

07:20.850 --> 07:21.930
The takeaway I would say

07:21.930 --> 07:24.570
in general is that adding good examples

07:24.570 --> 07:27.060
to the prompt is a much bigger gain.

07:27.060 --> 07:30.420
But to have a GPT-3.5-level prompt

07:30.420 --> 07:33.870
that can generate 84% accuracy

07:33.870 --> 07:36.420
in telling whether a joke is funny or not is really good

07:36.420 --> 07:38.880
because now we can use that, you know, going forward.

07:38.880 --> 07:41.670
And you can see, when you run this on the dev set,

07:41.670 --> 07:43.860
you can still see, it's 84%.

07:43.860 --> 07:46.110
This is based on the training set,

07:46.110 --> 07:47.880
this is based on the dev set,

07:47.880 --> 07:49.320
and we're still doing a really good job.

07:49.320 --> 07:50.490
It didn't overfit.

07:50.490 --> 07:52.590
If you wanted to see the prompt that it came up with,

07:52.590 --> 07:54.540
this is what it came up with,

07:54.540 --> 07:58.470
saying, "Follow the following format, the joke," et cetera,

07:58.470 --> 08:00.277
and then, "Evaluate the humor of each joke

08:00.277 --> 08:01.777
"by considering its sophistication,

08:01.777 --> 08:03.690
"cultural references, and wordplay."

08:03.690 --> 08:05.490
Yeah, so you could take this prompt, basically,

08:05.490 --> 08:06.870
you don't have to use DSPy anymore,

08:06.870 --> 08:08.130
you could take this prompt

08:08.130 --> 08:09.990
and then just, it's just text, right?

08:09.990 --> 08:12.120
So you could save it somewhere or put it somewhere else.

08:12.120 --> 08:15.390
And here, I've saved this program as JSON,

08:15.390 --> 08:16.650
now you have this prompt,

08:16.650 --> 08:19.770
you could go and use this in a custom GPT or whatever,

08:19.770 --> 08:22.440
or you could use this as an evaluation metric.

08:22.440 --> 08:23.273
Enjoy.

08:23.273 --> 08:27.030
This is a really useful mental model

08:27.030 --> 08:28.440
to get your head around,

08:28.440 --> 08:31.020
this idea of training the evaluator,

08:31.020 --> 08:33.420
not just training the original prompt,

08:33.420 --> 08:35.070
because now that we have an evaluator,

08:35.070 --> 08:37.170
we can use this evaluator

08:37.170 --> 08:41.220
as our evaluation metric for DSPy,

08:41.220 --> 08:43.230
so we've gone from like a labeled data set

08:43.230 --> 08:47.250
to a synthetic prompt evaluator,

08:47.250 --> 08:48.600
which is really helpful.

08:48.600 --> 08:51.840
It just expands the amount of tasks that you can do,

08:51.840 --> 08:54.990
'cause it's usually easier to provide labeled examples

08:54.990 --> 08:56.670
for an evaluator than it is

08:56.670 --> 08:59.070
to provide labeled examples of a good joke.

08:59.070 --> 09:00.030
That's much harder.

09:00.030 --> 09:02.490
Yeah, feel free to use this.

09:02.490 --> 09:04.650
Make a joke GPT if you want to.

09:04.650 --> 09:06.000
That's what I used it for.

09:06.000 --> 09:07.860
But yeah, this is really helpful.

09:07.860 --> 09:10.710
You can also use this to generate more data, right?

09:10.710 --> 09:12.900
So you could scrape a load of jokes on the internet

09:12.900 --> 09:15.840
and then now use your evaluation metric,

09:15.840 --> 09:19.320
and with GPT-3.5, you can actually tell, with like 85% accuracy,

09:19.320 --> 09:20.820
which jokes on the internet are funny,

09:20.820 --> 09:22.710
so which ones should you include in your dataset?

09:22.710 --> 09:25.110
It's gonna really improve things.

09:25.110 --> 09:26.640
All right, hopefully that's helpful for you

09:26.640 --> 09:30.693
and opens a few doors or sets off a few light bulb moments.
