WEBVTT

00:00.510 --> 00:03.390
Hey, I'm gonna walk you through DSPy,

00:03.390 --> 00:08.250
which is a prompt optimization framework or library.

00:08.250 --> 00:09.498
And it's pretty cool, actually.

00:09.498 --> 00:13.110
Very complicated and a little bit over-engineered,

00:13.110 --> 00:16.350
but I did get pretty good results with it.

00:16.350 --> 00:19.803
Now, the reason why I wanted to cover DSPy is

00:19.803 --> 00:23.460
that some engineers in the AI community have been saying

00:23.460 --> 00:26.580
that it is the end of prompt engineering.

00:26.580 --> 00:30.540
And my personal opinion is that

00:30.540 --> 00:33.570
even if you use AI to write the prompts, which is

00:33.570 --> 00:36.667
what DSPy is doing, that is still prompt engineering

00:36.667 --> 00:39.339
because you're actually engineering the entire

00:39.339 --> 00:41.147
system around the prompt.

00:41.147 --> 00:43.350
And that's the important thing.

00:43.350 --> 00:44.790
Like how does the, what are the inputs,

00:44.790 --> 00:47.310
what are the outputs, what are the evaluation metrics?

00:47.310 --> 00:49.623
And DSPy doesn't automate that part.

00:49.623 --> 00:52.320
That's what it needs actually in order

00:52.320 --> 00:55.127
to do the automatic prompt rewriting.

00:55.127 --> 00:56.965
I would say it is prompt engineering.

00:56.965 --> 00:59.790
Prompt engineering isn't just writing the right combination

00:59.790 --> 01:03.420
of words in the right order to get AI to do what you want.

01:03.420 --> 01:06.039
It doesn't really matter to a prompt engineer, I think,

01:06.039 --> 01:07.770
whether they wrote those words

01:07.770 --> 01:09.417
or whether the AI wrote the prompt.

01:09.417 --> 01:11.001
I'm just gonna walk through how to do this

01:11.001 --> 01:13.650
and specifically piece this together from their

01:13.650 --> 01:15.330
tutorials and documentation.

01:15.330 --> 01:18.120
I'm gonna try and explain it a little bit better than

01:18.120 --> 01:20.117
I could understand from the docs

01:20.117 --> 01:21.960
when I looked at them initially.

01:21.960 --> 01:24.240
I'm gonna go through how to use it in 8 steps.

01:24.240 --> 01:26.040
So first of all, you wanna define your task.

01:26.040 --> 01:29.100
I think generally, just before you take on the complexity

01:29.100 --> 01:32.730
of a framework, just try and do it in the OpenAI API first.

01:32.730 --> 01:36.194
So just gonna show you what I'm trying to do here.

01:36.194 --> 01:40.320
I'm trying to get it to tell me a funny joke about a topic.

01:40.320 --> 01:44.573
So this is just gpt-4-turbo using the OpenAI API.

01:44.573 --> 01:48.005
And you can see here, "Why don't fish do well on school tests?

01:48.005 --> 01:49.903
Because they work below C level."

01:49.903 --> 01:53.370
Ha, ha, ha, not particularly funny, but we're gonna try

01:53.370 --> 01:55.143
and make it tell funnier jokes.

01:55.143 --> 01:57.710
Once you've got the task defined, you want

01:57.710 --> 02:00.510
to define the pipeline in DSPy

02:00.510 --> 02:02.670
and there's a few different tactics you can use.

02:02.670 --> 02:05.400
So typically you wanna start with the most powerful models.

02:05.400 --> 02:07.468
Again, I start with GPT-4 Turbo here.

02:07.468 --> 02:12.037
And there are different types of modules you can use

02:12.037 --> 02:15.266
to do this type of analysis

02:15.266 --> 02:18.150
or to get it to optimize your prompt for you.

02:18.150 --> 02:20.618
One is like chain of thought, that's a pretty common one.

02:20.618 --> 02:22.740
Tends to work with AI.

02:22.740 --> 02:25.950
You ask the AI to come up with some rationale first

02:25.950 --> 02:29.040
and then give you the result.

02:29.040 --> 02:31.350
And then you could also use retrieval like RAG,

02:31.350 --> 02:32.850
a lot of their examples use RAG,

02:32.850 --> 02:35.010
although I didn't do that here,

02:35.010 --> 02:36.240
where you could make a search

02:36.240 --> 02:38.972
and give that context to the prompt when it's searching

02:38.972 --> 02:40.380
or tool use as well.

02:40.380 --> 02:43.620
You can also give the prompt, give the AI models

02:43.620 --> 02:45.030
tools to use and then

02:45.030 --> 02:46.680
incorporate that into your pipeline.

02:46.680 --> 02:48.090
That's all kind of complicated.

02:48.090 --> 02:50.732
And their most basic example includes RAG,

02:50.732 --> 02:54.588
which I think makes it too hard for a lot of people to grok.

02:54.588 --> 02:57.780
But I'm just gonna show you an example with a simple prompt

02:57.780 --> 02:59.640
and some optimization.

02:59.640 --> 03:01.090
First we're importing DSPy.

03:01.090 --> 03:03.752
The convention is you just keep it, you know, as dspy

03:03.752 --> 03:05.902
and then you can see wherever you're using it

03:05.902 --> 03:08.880
in the script.

03:08.880 --> 03:11.829
So first we're setting up GPT-4 Turbo

03:11.829 --> 03:14.354
and this is just an OpenAI model.

03:14.354 --> 03:16.140
So that's pretty easy.

03:16.140 --> 03:20.235
And then just configuring DSPy to use that by default.

03:20.235 --> 03:22.320
And then when we hit run,

03:22.320 --> 03:24.180
you can see it comes back straight away

03:24.180 --> 03:25.890
and it gives the same response.

03:25.890 --> 03:27.810
It doesn't, it's not guaranteed to be the same response

03:27.810 --> 03:29.370
as using the OpenAI API

03:29.370 --> 03:31.950
because our temperature is set to one.

03:31.950 --> 03:33.840
There is some randomness to the results,

03:33.840 --> 03:35.505
but you'll generally find that

03:35.505 --> 03:39.283
for some reason OpenAI always gives the same basic jokes

03:39.283 --> 03:41.473
again and again, which really suck.

03:41.473 --> 03:43.230
That's what we're trying to solve,

03:43.230 --> 03:45.198
but at least we've replicated it here.

03:45.198 --> 03:46.740
Alright, that's the first thing

03:46.740 --> 03:48.780
to understand about DSPy is it needs

03:48.780 --> 03:50.662
a signature in order to optimize.

03:50.662 --> 03:53.160
And the signature is literally just what are the inputs

03:53.160 --> 03:55.440
and outputs and you can define them in line,

03:55.440 --> 03:57.510
which is pretty nice for small tasks.

03:57.510 --> 04:00.090
And this is just prediction signature,

04:00.090 --> 04:03.300
which is the most basic one where it literally just asks,

04:03.300 --> 04:04.954
it gives it the inputs and outputs.

04:04.954 --> 04:06.900
And you can define it this way.

04:06.900 --> 04:08.655
So I'm saying I have a topic

04:08.655 --> 04:12.310
and I want to get back a joke and you can run that

04:12.310 --> 04:14.610
and it works the same, which is pretty cool.

04:14.610 --> 04:16.709
So now you have this joker thing, which is

04:16.709 --> 04:19.140
basically like a little module.

04:19.140 --> 04:21.810
A little function you can run in order

04:21.810 --> 04:26.059
to replace all this code here that we had for OpenAI.

04:26.059 --> 04:27.485
Alright, so that's interesting.

04:27.485 --> 04:29.430
Now you notice we didn't

04:29.430 --> 04:30.660
actually have to write a prompt here.

04:30.660 --> 04:32.923
This is the whole prompt and it actually writes the prompt

04:32.923 --> 04:35.259
for you, which you'll see in a second.

04:35.259 --> 04:38.684
The main convention that I recommend

04:38.684 --> 04:42.029
and they recommend is to create a class-based signature

04:42.029 --> 04:44.100
because it's just a little bit clearer

04:44.100 --> 04:45.327
and you can do more advanced things.

04:45.327 --> 04:47.676
But this is the same example.

04:47.676 --> 04:51.164
We create a class and we inherit from dspy.Signature

04:51.164 --> 04:53.215
and then we create a docstring.

04:53.215 --> 04:55.030
This is the prompt, this is actually

04:55.030 --> 04:56.970
all the prompt that we need.

04:56.970 --> 04:58.258
Make a funny joke given a topic

04:58.258 --> 05:01.166
and that's all it needs to optimize, which is pretty cool.

05:01.166 --> 05:05.244
Then you have the topic of the joke is an input field

05:05.244 --> 05:07.830
and we've specified that it's gonna expect

05:07.830 --> 05:09.780
that it's gonna throw an error if we don't,

05:09.780 --> 05:11.160
if it doesn't get a topic.

05:11.160 --> 05:12.828
And then it has an output field,

05:12.828 --> 05:14.702
which is the funny joke itself,

05:14.702 --> 05:16.890
and it takes into account what's written in

05:16.890 --> 05:18.243
these descriptions as well.

05:19.230 --> 05:20.836
All right, so now we have the signature.

05:20.836 --> 05:23.953
We wanna also create a module in order to be able

05:23.953 --> 05:25.770
to run this signature.

05:25.770 --> 05:29.430
The module is basically like, you can think of this

05:29.430 --> 05:31.739
as a prompting strategy, whereas the signature is

05:31.739 --> 05:33.620
what are the inputs and outputs?

05:33.620 --> 05:35.640
And in this case, you know,

05:35.640 --> 05:38.430
we're just using the basic predict prompting strategy,

05:38.430 --> 05:40.052
which is, it doesn't have anything fancy,

05:40.052 --> 05:41.643
it's just a normal prompt.

05:41.643 --> 05:45.060
But, you know, once you've structured it like this,

05:45.060 --> 05:46.980
you could also bring in any other kind

05:46.980 --> 05:48.090
of prompting strategy you want.

05:48.090 --> 05:49.200
So you could actually pull,

05:49.200 --> 05:51.107
build together a bunch of different modules

05:51.107 --> 05:53.040
and string 'em together into a pipeline.

05:53.040 --> 05:56.250
But ignoring that complexity, this is all you need,

05:56.250 --> 05:57.531
you pass it, this is it here,

05:57.531 --> 05:59.962
I'm creating one called CoT, chain

05:59.962 --> 06:02.621
of thought, and we're just passing it

06:02.621 --> 06:05.940
the built-in chain of thought module here,

06:05.940 --> 06:09.015
which is something that DSPy created just

06:09.015 --> 06:12.750
to give you an out-of-the-box chain of thought strategy.

06:12.750 --> 06:13.740
But you can create your own.

06:13.740 --> 06:16.950
And then, so it needs a self.prog,

06:16.950 --> 06:17.783
that's how it makes progress.

06:17.783 --> 06:19.380
That's what the prompt is.

06:19.380 --> 06:23.370
And then you need a forward here if you think about it,

06:23.370 --> 06:26.379
like in order to move forward I guess is the way you can

06:26.379 --> 06:28.470
think about this, that you need a topic

06:28.470 --> 06:30.341
and then you call self.prog

06:30.341 --> 06:32.730
and then everything else is just inherited.

06:32.730 --> 06:34.906
So don't worry about this if you don't understand it,

06:34.906 --> 06:37.560
if you're not into object-oriented programming,

06:37.560 --> 06:38.850
you don't really have to worry about it.

06:38.850 --> 06:41.550
Just copy this example exactly, essentially

06:41.550 --> 06:43.830
and then just change the signature

06:43.830 --> 06:45.420
for the one that you're doing.

06:45.420 --> 06:48.962
But this is what a chain of thought prompt looks like.

06:48.962 --> 06:53.962
First it gets the AI to create the rationale.

06:54.300 --> 06:57.420
It's like how should it approach the result?

06:57.420 --> 07:01.406
And then it actually asks it to do the task, right?

07:01.406 --> 07:03.240
But here you can see it's getting

07:03.240 --> 07:04.230
like a slightly better joke.

07:04.230 --> 07:05.760
Why don't fish make good musicians?

07:05.760 --> 07:07.680
'Cause you can tune a guitar but you can't tuna fish.

07:07.680 --> 07:11.460
And so it's gotten better because it's got this rationale

07:11.460 --> 07:13.508
first. You're giving it more tokens to think.

07:13.508 --> 07:15.833
And so this is a common prompting strategy, right?

07:15.833 --> 07:17.460
Why did that run so fast?

07:17.460 --> 07:18.870
By the way, one really nice thing

07:18.870 --> 07:21.690
with DSPy is it locally caches everything.

07:21.690 --> 07:24.720
If you are running this a bunch of times

07:24.720 --> 07:26.250
and it does get expensive

07:26.250 --> 07:27.690
because it does make hundreds of calls

07:27.690 --> 07:30.120
to the API when you optimize, I'll show you that in a sec,

07:30.120 --> 07:31.290
but if you run it again,

07:31.290 --> 07:33.060
it literally just looks at the cache first.

07:33.060 --> 07:35.760
So you're never gonna cost yourself too much money

07:35.760 --> 07:38.131
by running the same thing over and over again.

07:38.131 --> 07:40.224
And it'll come back immediately if it's already

07:40.224 --> 07:41.584
run, which is very nice.
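
NOTE A rough illustration of the caching idea, using a plain dictionary. This is not DSPy's actual cache; fake_api_call and cached_call are hypothetical names, and the "API" is a stub so the sketch runs offline.
```python
calls = {"count": 0}  # how many times the stubbed paid API was hit
cache = {}  # maps (model, prompt) to a stored response
def fake_api_call(model: str, prompt: str) -> str:
    """Stand-in for a paid model request."""
    calls["count"] += 1
    return f"[{model}] response to: {prompt}"
def cached_call(model: str, prompt: str) -> str:
    """Return the stored response if this exact request was made before."""
    key = (model, prompt)
    if key not in cache:
        cache[key] = fake_api_call(model, prompt)
    return cache[key]
first = cached_call("gpt-4-turbo", "Tell me a joke about fishing")
second = cached_call("gpt-4-turbo", "Tell me a joke about fishing")
print(first == second, calls["count"])
```
The second call never reaches the stub, which is why a repeated run comes back immediately and costs nothing extra.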

07:41.584 --> 07:43.830
All right, and the other cool thing is you can

07:43.830 --> 07:45.366
inspect the history of the model.

07:45.366 --> 07:47.569
So here I've just got the last one result.

07:47.569 --> 07:50.400
And you can see here that this is the actual

07:50.400 --> 07:52.179
prompt that we sent in.

07:52.179 --> 07:54.384
This is the prompt that it created from our signature,

07:54.384 --> 07:56.340
make a funny joke given a topic.

07:56.340 --> 07:58.650
And then it says follow the following format.

07:58.650 --> 07:59.760
It gives the topic of the joke.

07:59.760 --> 08:02.446
"Let's think step by step in order to",

08:02.446 --> 08:06.856
and then it inserts the specific thing it's trying to do.

08:06.856 --> 08:09.447
And then it tells it what the variables, you know,

08:09.447 --> 08:11.160
input and output variables are.

08:11.160 --> 08:12.870
If I had one criticism of DSPy

08:12.870 --> 08:15.960
it's that their actual prompts aren't very good,

08:15.960 --> 08:17.310
but don't worry about that too much.

08:17.310 --> 08:18.143
It does a pretty good job.

08:18.143 --> 08:20.143
And you can adjust all this if you want to.

08:20.143 --> 08:22.671
And you can see here reasoning: let's think step

08:22.671 --> 08:25.413
by step in order to, and then it,

08:25.413 --> 08:28.203
then this green is like what it came back with.

08:29.511 --> 08:30.869
Okay, hopefully that makes sense

08:30.869 --> 08:32.547
to you if you're familiar with chain of thought.

08:32.547 --> 08:35.071
But now we have a basic pipeline.

08:35.071 --> 08:37.183
We wanna explore a few different examples

08:37.183 --> 08:39.780
to understand the task a little bit,

08:39.780 --> 08:41.670
and we have a topic of fishing

08:41.670 --> 08:45.660
and here's a joke from Ricky Gervais, which you know,

08:45.660 --> 08:46.920
this is an example of the type

08:46.920 --> 08:48.840
of caliber we want to be able to train into it.

08:48.840 --> 08:49.673
Give a man a fish,

08:49.673 --> 08:51.920
he'll probably follow you home expecting more fish.

08:51.920 --> 08:55.860
So that's a good example of a good joke for that topic.

08:55.860 --> 08:56.999
But, and we can compare that

08:56.999 --> 08:59.772
to our actual joke that it came back with.

08:59.772 --> 09:02.340
And you can see that it doesn't do as good as

09:02.340 --> 09:03.948
Ricky Gervais, that's to be expected.

09:03.948 --> 09:06.090
All right, you wanna try other examples

09:06.090 --> 09:07.260
you might run into errors with.

09:07.260 --> 09:09.381
So I was a bit worried that, you know,

09:09.381 --> 09:10.930
if we did a topic on drinking

09:10.930 --> 09:13.710
or whatever, then it wouldn't, it would refuse.

09:13.710 --> 09:15.240
But in this case it seems fine.

09:15.240 --> 09:16.560
You could try other, you know,

09:16.560 --> 09:18.990
more adversarial examples like trying to get it

09:18.990 --> 09:21.459
to make a really taboo joke and see if it fails.

09:21.459 --> 09:23.880
But this one seemed to be fine.

09:23.880 --> 09:25.550
I didn't wanna make very taboo jokes, I just wanted

09:25.550 --> 09:28.590
to make them not entirely politically correct,

09:28.590 --> 09:29.910
but this is a joke

09:29.910 --> 09:31.680
it came back with, and it was actually wrong.

09:31.680 --> 09:33.450
It says, "I told my friend I was going

09:33.450 --> 09:34.680
to the bar for some fruit juice.

09:34.680 --> 09:36.180
He looked confused when I came back with a beer.

09:36.180 --> 09:38.250
I said, technically it's a bunch of grapes".

09:38.250 --> 09:39.390
That's actually wine.

09:39.390 --> 09:41.647
So pretty weird bizarre joke there.

09:41.647 --> 09:42.794
And so we're gonna train it.

09:42.794 --> 09:44.670
So far we haven't done any training

09:44.670 --> 09:46.170
and this is where, you know, it starts

09:46.170 --> 09:47.322
to get very interesting.

09:47.322 --> 09:50.096
There are a lot of different strategies for getting data

09:50.096 --> 09:53.340
for training and you can just do this manually, right?

09:53.340 --> 09:54.600
If you just go and write a bunch

09:54.600 --> 09:55.653
of jokes, you could do that.

09:55.653 --> 09:58.675
But one really quick and dirty way to do this is I go

09:58.675 --> 10:01.590
and find a bunch of resources.

10:01.590 --> 10:03.442
So here I've just found a bunch of joke websites

10:03.442 --> 10:07.500
and then I just copy and paste the websites into ChatGPT

10:07.500 --> 10:09.210
and I ask it to give me an array

10:09.210 --> 10:11.700
of those jokes in this structure.

10:11.700 --> 10:13.440
And then specifically, I usually ask it

10:13.440 --> 10:16.650
to just reverse engineer the input variables.

10:16.650 --> 10:18.180
If you look on some of these websites,

10:18.180 --> 10:20.460
you'll see it does have the joke, right?

10:20.460 --> 10:22.726
And it does have the comedian Ricky Gervais,

10:22.726 --> 10:25.560
but it doesn't have the topic on that website.

10:25.560 --> 10:28.680
So I just ask it to write the topic, to guess the topic.

10:28.680 --> 10:31.140
And then that way we have some training data now

10:31.140 --> 10:34.470
'cause we have our inputs and then we have expected outputs.

10:34.470 --> 10:37.620
But obviously we've just reverse engineered those inputs,

10:37.620 --> 10:38.790
which is pretty cool.

10:38.790 --> 10:40.830
And it does a really good job of doing this.

10:40.830 --> 10:44.250
It's way faster than you doing it all manually yourself.

10:44.250 --> 10:45.300
We can see I've got a bunch

10:45.300 --> 10:46.740
of different jokes in this method.

10:46.740 --> 10:49.359
It took about 10, 20 minutes to get these together

10:49.359 --> 10:51.840
and that's all you need really.

10:51.840 --> 10:54.510
You could also have it in a CSV and pull from a CSV,

10:54.510 --> 10:57.690
but I've just written some code here that goes

10:57.690 --> 10:58.980
through all the different jokes

10:58.980 --> 11:02.670
and then turns them into DSPy examples.

11:02.670 --> 11:05.371
And it's important to have those examples in here

11:05.371 --> 11:08.720
and in this format,

11:08.720 --> 11:10.140
to be able to actually train on the data.

11:10.140 --> 11:11.310
A few different things are important.

11:11.310 --> 11:13.501
You could put the name of the comedian in here as well.

11:13.501 --> 11:16.388
For now I've just said, okay, get me the topic and the joke

11:16.388 --> 11:19.650
and then I pass them in to the example.

11:19.650 --> 11:21.187
And then I say with inputs topic.

11:21.187 --> 11:24.570
So this tells it that the input variable is topic

11:24.570 --> 11:27.960
and you could also have, you know, multiple input variables

11:27.960 --> 11:30.420
and then everything else becomes like an

11:30.420 --> 11:32.547
output variable or something else, right?

11:32.547 --> 11:34.933
But this code basically splits it into three sets

11:34.933 --> 11:37.590
and it took me a while to understand what they meant

11:37.590 --> 11:39.090
by these different things,

11:39.090 --> 11:40.620
because they use slightly different

11:40.620 --> 11:42.090
terminology than I'm used to.

11:42.090 --> 11:44.094
But specifically the train set is

11:44.094 --> 11:46.344
what you're gonna train the model on.

11:46.344 --> 11:47.940
So that's your training data

11:47.940 --> 11:49.278
and that should be the largest bucket.

11:49.278 --> 11:51.570
The validation set, the val set,

11:51.570 --> 11:54.540
that is when it's optimizing your prompt,

11:54.540 --> 11:56.970
that is what it wants to test against

11:56.970 --> 11:58.805
to see whether it's doing a good job.

11:58.805 --> 12:03.805
And then the devset is left aside as like a final check

12:04.020 --> 12:05.370
to see how well it really did.

12:05.370 --> 12:07.500
So the reason why you split it into these three different

12:07.500 --> 12:09.864
things is it needs something to optimize against.

12:09.864 --> 12:11.880
So it needs to train on the data

12:11.880 --> 12:13.530
and then see if it can predict something

12:13.530 --> 12:16.424
and then change the prompt if the prediction fails.

12:16.424 --> 12:18.328
So that's why you need these two.

12:18.328 --> 12:22.170
But if that's all you do, then it's seen all the data

12:22.170 --> 12:23.267
that it's testing against.

12:23.267 --> 12:25.260
I mean that's not good practice.

12:25.260 --> 12:27.570
What's better practice is to see if it can generalize

12:27.570 --> 12:29.490
to new problems it hasn't seen yet.

12:29.490 --> 12:32.133
And then, and that's why you wanna train

12:32.133 --> 12:35.104
and validate the algorithm with this, these sets.

12:35.104 --> 12:39.180
But then do a kind of blind test on this new set,

12:39.180 --> 12:40.380
this devset here,

12:40.380 --> 12:42.450
and that's gonna tell you whether you've actually done

12:42.450 --> 12:43.770
a good job or not, right?

12:43.770 --> 12:47.209
You can see here we've got 76 jokes in the training set,

12:47.209 --> 12:51.057
25 jokes to test them against and then 26 for the devset

12:51.057 --> 12:55.530
and this just splits into 60, 20, 20.

12:55.530 --> 12:58.710
Cool, you can use that, you can adjust those if you want.
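
NOTE One way a 60/20/20 split can produce sizes like 76, 25 and 26. The code in the video is not shown here, so the function name, the seed and the rounding are assumptions, and the 127 items are placeholders.
```python
import random
def split_examples(items, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of items and cut it into train, validation and dev lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    trainset = shuffled[:n_train]
    valset = shuffled[n_train:n_train + n_val]
    devset = shuffled[n_train + n_val:]  # whatever is left over
    return trainset, valset, devset
jokes = [f"joke {i}" for i in range(127)]  # 127 placeholder items
trainset, valset, devset = split_examples(jokes)
print(len(trainset), len(valset), len(devset))
```
Changing train_frac and val_frac adjusts the buckets; the dev set always takes the remainder.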

12:58.710 --> 12:59.880
But here's an example.

12:59.880 --> 13:04.274
You can see there's the kind of formatted response here.

13:04.274 --> 13:08.040
All right, so now we're gonna get into the fun stuff,

13:08.040 --> 13:09.666
evaluation metrics, and

13:09.666 --> 13:12.936
you can actually do this programmatically if you have,

13:12.936 --> 13:15.840
maybe you're optimizing a prompt for a blog post

13:15.840 --> 13:18.870
and you can count the word length with a python function,

13:18.870 --> 13:22.053
that's ideal because DSPy does a lot of testing

13:22.053 --> 13:24.930
and therefore if it's expensive to run that test,

13:24.930 --> 13:26.220
then it's a real problem.

13:26.220 --> 13:29.186
The other ideal thing is if you have a human annotated

13:29.186 --> 13:31.510
set of questions, if you're building a question

13:31.510 --> 13:34.710
and answer bot and you get a human to go through

13:34.710 --> 13:37.199
and label a hundred questions with the correct answer,

13:37.199 --> 13:39.180
that can be really good as well

13:39.180 --> 13:40.866
because you can use that as a test set

13:40.866 --> 13:44.213
and that is something, again, it can check very quickly

13:44.213 --> 13:47.370
and that can be really helpful in terms of the speed

13:47.370 --> 13:49.530
and cost of running DSPy.

13:49.530 --> 13:52.968
But in my case, with a creative task,

13:52.968 --> 13:54.690
it's pretty hard to measure.

13:54.690 --> 13:56.010
Like I can't just,

13:56.010 --> 13:59.086
'cause I said this joke is about fishing doesn't mean

13:59.086 --> 14:00.750
that there is like a right answer

14:00.750 --> 14:02.550
to telling a joke about fishing

14:02.550 --> 14:04.740
and you might get a very different joke,

14:04.740 --> 14:08.100
but it is actually funnier than the one that I annotated.

14:08.100 --> 14:09.990
So in this case it makes sense.

14:09.990 --> 14:11.230
Instead of using programmatic evals

14:11.230 --> 14:15.417
or human evals, instead you should use synthetic evals

14:15.417 --> 14:17.310
and that means using AI

14:17.310 --> 14:19.050
to assess whether the joke is funny.

14:19.050 --> 14:21.179
That is its own kind of broad topic,

14:21.179 --> 14:22.651
which I won't go too much into,

14:22.651 --> 14:25.590
but the nice thing about DSPy is it makes it simple.

14:25.590 --> 14:28.561
So I create this again, just another DSPy signature,

14:28.561 --> 14:31.400
elegant and just an assessment prompt

14:31.400 --> 14:33.630
and you could optimize this prompt itself,

14:33.630 --> 14:34.860
which is pretty fun.

14:34.860 --> 14:36.480
But yeah, you wanna assess the quality

14:36.480 --> 14:38.011
of a joke along the specified dimension.

14:38.011 --> 14:41.387
And then I have, you know, a few different questions here.

14:41.387 --> 14:43.169
I have the joke, the topic,

14:43.169 --> 14:46.290
and then I have the question to assess the joke against.

14:46.290 --> 14:48.833
And then the output field, which is yes or no.

14:48.833 --> 14:50.942
Now I have that signature

14:50.942 --> 14:53.361
and then I just set up this function metric

14:53.361 --> 14:57.550
where I just get in a topic and then I get the joke

14:58.441 --> 14:59.889
and then the prediction, sorry,

14:59.889 --> 15:04.889
the topic and the joke from, and the topic comes from gold,

15:05.070 --> 15:07.554
which is the golden answer or the reference answer.

15:07.554 --> 15:10.814
And then the prediction comes from the model itself.

15:10.814 --> 15:12.510
So you need to set it up.

15:12.510 --> 15:13.835
And then I have three questions

15:13.835 --> 15:15.750
which checks the performance.

15:15.750 --> 15:18.673
So the three questions are, is it funny,

15:18.673 --> 15:20.340
would this joke actually be funny

15:20.340 --> 15:22.560
to an adult attending a comedy show?

15:22.560 --> 15:24.655
Is it relevant, is the joke relevant to the topic?

15:24.655 --> 15:27.330
And then format: is only the joke returned,

15:27.330 --> 15:29.409
with no disclaimer or text prepending the joke?

15:29.409 --> 15:32.430
Lemme just run this, and here I'm using

15:32.430 --> 15:35.247
because I have, I set up this TJ pair example before.

15:35.247 --> 15:39.208
And the inputs are basically just the inputs

15:39.208 --> 15:42.376
that we decided before, the topic, and then the labels

15:42.376 --> 15:45.510
and the label is the output, which is the joke.

15:45.510 --> 15:49.020
You can use this structure here if it's formatted correctly.

15:49.020 --> 15:51.390
But you can see that the TJ pair

15:51.390 --> 15:52.950
that the first one we came up with

15:52.950 --> 15:55.410
before, if we look up here,

15:55.410 --> 15:57.839
it was the one about fishing.

15:57.839 --> 16:02.250
So it's saying that passes the test, which is great.

16:02.250 --> 16:06.780
It scored one, yeah, there we go.

16:06.780 --> 16:09.090
So that scored one, it's a hundred percent.

16:09.090 --> 16:12.270
And that means, the way I calculated this is I took

16:12.270 --> 16:14.723
the score and then I divided it by the length of questions.

16:14.723 --> 16:17.310
So it sums up like if it's scored,

16:17.310 --> 16:19.434
it gets one point for each thing essentially.

16:19.434 --> 16:22.890
And then I just round it to two decimal places.

16:22.890 --> 16:24.873
And you can see how it does this again, you know,

16:24.873 --> 16:27.600
it's just inspecting history, it's assessing the quality

16:27.600 --> 16:29.700
of the joke, it's giving the format.

16:29.700 --> 16:31.230
And then you can see it came back

16:31.230 --> 16:32.492
with the assessment answer.

16:32.492 --> 16:34.860
Now one thing I noticed, I dunno if this is some issue

16:34.860 --> 16:36.900
with DSPY in general,

16:36.900 --> 16:39.729
but I found that the prompt format comes back wrong.

16:39.729 --> 16:41.665
It's supposed to just say yes,

16:41.665 --> 16:44.633
but instead it repeated the words "assessment answer".

16:44.633 --> 16:46.993
If you just follow the tutorials blindly,

16:46.993 --> 16:50.373
it's gonna tell you just to check for the word yes.

16:50.373 --> 16:53.550
But the problem is if you have this assessment answer

16:53.550 --> 16:56.257
in the answer here, then you'll get a zero.

16:56.257 --> 16:58.611
Even though it did answer yes.

16:58.611 --> 17:02.340
So I had to change this code a little bit where you can see,

17:02.340 --> 17:05.153
that's why I'm saying here if yes is in the answer,

17:05.153 --> 17:07.894
rather than if the answer equals yes,

17:07.894 --> 17:10.200
which is what they recommended in the tutorials.

17:10.200 --> 17:12.270
So just a little gotcha to check.
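
NOTE A sketch of the scoring logic and the gotcha described here, in plain Python with no model calls. The function names and the sample answers are assumptions; only the idea (substring check, points divided by number of questions, two decimals) comes from the walkthrough.
```python
def is_yes(answer: str) -> bool:
    """Lenient check: True if the word yes appears anywhere in the answer."""
    return "yes" in answer.lower()
def score(answers) -> float:
    """One point per yes, divided by the number of questions, rounded to 2 places."""
    points = sum(is_yes(a) for a in answers)
    return round(points / len(answers), 2)
raw = "Assessment Answer: Yes"  # the kind of malformed reply described above
strict_match = raw.lower() == "yes"  # an exact comparison misses this reply
lenient_match = is_yes(raw)  # the substring check catches it
all_pass = score(["Yes", "Assessment Answer: Yes", "yes"])
one_of_three = score(["No", "Yes", "No"])
print(strict_match, lenient_match, all_pass, one_of_three)
```
The substring check is deliberately loose, so it would also match a word like "yesterday"; a stricter parser may be worth it if the replies get longer.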

17:12.270 --> 17:14.040
'Cause it was saying my jokes weren't funny

17:14.040 --> 17:15.870
and I'm like, I'm sure they are funny

17:15.870 --> 17:17.520
and I found out it was saying yes,

17:17.520 --> 17:19.440
it was just not coming back in the right format.

17:19.440 --> 17:22.260
So that's maybe something to look into, all right.

17:22.260 --> 17:23.367
And then just to test it

17:23.367 --> 17:25.736
and give a bad response just to check it.

17:25.736 --> 17:27.359
Here's a topic, fishing,

17:27.359 --> 17:29.670
and then I've, this is like one

17:29.670 --> 17:31.920
of the typical things you get from ChatGPT.

17:31.920 --> 17:33.310
It'll say, okay, here's a funny joke for you,

17:33.310 --> 17:35.760
or it'll say there's a disclaimer, okay,

17:35.760 --> 17:38.010
this is not very family friendly, but here's a joke.

17:38.010 --> 17:40.223
It doesn't like to tell funny jokes

17:40.223 --> 17:42.030
because it gets upset, I guess

17:42.030 --> 17:44.862
because it's been trained to be very politically correct.

17:44.862 --> 17:47.730
It, you know, it's afraid of offending people.

17:47.730 --> 17:49.860
So it usually gives some disclaimer

17:49.860 --> 17:51.480
or it tells you what it's gonna do

17:51.480 --> 17:52.857
before it does it all right?

17:52.857 --> 17:56.010
And that's a problem, so you can see here that we scored 33%

17:56.010 --> 17:59.422
because we did have the disclaimer at the beginning.

17:59.422 --> 18:03.120
And also the joke isn't very funny, but it was relevant.

18:03.120 --> 18:07.290
So we scored one point, you know, out of three, cool.

18:07.290 --> 18:10.410
So now we've, this is a lot of setup as you can imagine,

18:10.410 --> 18:12.570
but I promise you the setup is gonna be worth it.

18:12.570 --> 18:13.764
So when you run this,

18:13.764 --> 18:17.250
and this normally takes a bit of time, right?

18:17.250 --> 18:19.256
Normally it takes a few minutes,

18:19.256 --> 18:22.140
but what this is doing is evaluating

18:22.140 --> 18:26.296
the actual devset, this is the,

18:26.296 --> 18:27.690
what we're gonna test against.

18:27.690 --> 18:30.630
It's establishing a baseline of basically like

18:30.630 --> 18:31.980
how good are the jokes.

18:31.980 --> 18:35.617
So it's using our evaluation metrics to see the jokes that,

18:35.617 --> 18:38.558
how well does the normal prompt work basically.

18:38.558 --> 18:40.189
And this is our make joke chain,

18:40.189 --> 18:41.520
this is the chain of thought.

18:41.520 --> 18:43.410
So this is without any optimization,

18:43.410 --> 18:45.261
we haven't changed anything about the prompt,

18:45.261 --> 18:46.559
how well does it work?

18:46.559 --> 18:50.580
And you can see that it does 18 out of 26 possible points

18:50.580 --> 18:52.047
or about 70%, right?

18:52.047 --> 18:54.930
And you can see which jokes it failed on as well.

18:54.930 --> 18:57.479
If you want, you can export this.

18:57.479 --> 18:59.160
All right, that's established the baseline.
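
NOTE A simplified picture of what the baseline run does: call the program on every dev example, score each output, and average. This is not DSPy's evaluator; the stub program and stub metric are hypothetical and set up so that 18 of 26 outputs pass.
```python
def evaluate(program, devset, metric) -> float:
    """Run program on every example, score each output, return the average as a percent."""
    total = sum(metric(example, program(example["topic"])) for example in devset)
    return round(100 * total / len(devset), 1)
# Stub program and metric so the sketch runs without a model.
devset = [{"topic": f"topic {i}"} for i in range(26)]
def stub_program(topic: str) -> str:
    index = int(topic.split()[1])
    return "good joke" if index < 18 else "bad joke"
def stub_metric(example, prediction: str) -> float:
    return 1.0 if prediction == "good joke" else 0.0
baseline = evaluate(stub_program, devset, stub_metric)
print(baseline)
```
18 points out of 26 works out to roughly 69%, which is the "about 70%" baseline to beat.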

18:59.160 --> 19:01.860
If we get better than 70%, we're happy, right?

19:01.860 --> 19:05.400
But now after all this setup, we've now gotten to the point

19:05.400 --> 19:07.589
where DSPy is gonna be useful for us

19:07.589 --> 19:09.680
and we're gonna use an optimizer.

19:09.680 --> 19:11.820
It might be a little bit confusing for you

19:11.820 --> 19:13.800
to think about this,

19:13.800 --> 19:16.320
because they have obscure names,

19:16.320 --> 19:19.650
but essentially there's only really four optimizers

19:19.650 --> 19:21.391
that are worth using,

19:21.391 --> 19:23.700
and it really depends on how many

19:23.700 --> 19:25.290
examples of the task you have.

19:25.290 --> 19:28.455
So the first one is just called BootstrapFewShot.

19:28.455 --> 19:31.470
And what that does is it, you can think about it

19:31.470 --> 19:33.840
as just adding the right examples to the task.

19:33.840 --> 19:35.100
So if they pass the metric,

19:35.100 --> 19:37.589
then it adds it into the prompt as an example

19:37.589 --> 19:39.907
of the task being done well.
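
NOTE
A rough sketch of that idea in plain Python, assuming a toy generator and a
toy metric in place of the real LLM call and evaluation. This is not DSPy's
implementation, and every function name here is hypothetical. It runs the
program on each training topic, keeps the outputs that pass the metric, and
adds them to the prompt as examples while leaving the instructions alone.
```python
# Sketch of the idea only, not DSPy's implementation; every name here is hypothetical.
from typing import Callable
#
def bootstrap_demos(
    topics: list[str],
    generate: Callable[[str], str],
    metric: Callable[[str, str], bool],
    max_demos: int = 4,
) -> list[dict]:
    """Run the program on each training topic and keep outputs that pass the metric."""
    demos: list[dict] = []
    for topic in topics:
        if len(demos) >= max_demos:
            break
        joke = generate(topic)
        if metric(topic, joke):
            demos.append({"topic": topic, "joke": joke})
    return demos
#
def build_prompt(instructions: str, demos: list[dict], topic: str) -> str:
    """Leave the instructions untouched and prepend the passing examples."""
    shots = "".join(f"Topic: {d['topic']}\nJoke: {d['joke']}\n\n" for d in demos)
    return f"{instructions}\n\n{shots}Topic: {topic}\nJoke:"
#
# Toy stand-ins for the LLM call and the evaluation metric.
def toy_generate(topic: str) -> str:
    if topic == "taxes":
        return "As an AI, I can't joke about taxes."
    return f"Here's a joke about {topic}."
#
def toy_metric(topic: str, joke: str) -> bool:
    return topic in joke and not joke.startswith("As an AI")
#
demos = bootstrap_demos(["fishing", "taxes", "holidays"], toy_generate, toy_metric)
prompt = build_prompt("Tell a funny joke about the topic.", demos, "cats")
print(prompt)
```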

19:39.907 --> 19:41.746
There's also a BootstrapFewShot

19:41.746 --> 19:43.980
with a random search and this, you know,

19:43.980 --> 19:47.091
if you have more than 10 examples, if you have 50 examples,

19:47.091 --> 19:49.936
then it's worth using because it actually does

19:49.936 --> 19:51.450
do some optimization.

19:51.450 --> 19:53.220
It will find the examples

19:53.220 --> 19:55.269
that improve the performance the most against

19:55.269 --> 19:56.496
your evaluation metric.

19:56.496 --> 20:00.064
So it is more expensive 'cause it does a lot of testing,

20:00.064 --> 20:03.213
but it does, you know, a better job than BootstrapFewShot.
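
NOTE
Here's a minimal sketch of what "random search" means here, assuming a
made-up scoring function instead of a real devset evaluation. It is not
DSPy's code. The names random_search_demos, toy_score and USEFULNESS are
hypothetical. Each trial picks a random set of examples, scores it, and the
best-scoring set wins, which is why it costs more: every trial is a full
evaluation.
```python
# Sketch of random search over example sets; not DSPy's code, names are hypothetical.
import random
from typing import Callable, Sequence
#
def random_search_demos(
    pool: Sequence[str],
    score: Callable[[tuple], float],
    k: int = 2,
    trials: int = 8,
    seed: int = 0,
) -> tuple[tuple, float]:
    """Try random k-sized example sets and keep the best-scoring one."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    best_set: tuple = ()
    best_score = score(best_set)  # score with no examples as the starting point
    for _ in range(trials):
        candidate = tuple(rng.sample(list(pool), k))
        candidate_score = score(candidate)  # in real life: a full pass over the devset
        if candidate_score > best_score:
            best_set, best_score = candidate, candidate_score
    return best_set, best_score
#
# Toy scorer: pretend each example adds a made-up number of points on the devset.
USEFULNESS = {"fishing": 3.0, "holidays": 1.0, "cats": 2.0, "taxes": 0.5}
def toy_score(demo_set: tuple) -> float:
    return sum(USEFULNESS[d] for d in demo_set)
#
best_set, best_score = random_search_demos(list(USEFULNESS), toy_score)
print(best_set, best_score)
```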

20:04.500 --> 20:08.160
And then if you have a lot of examples, you can use MIPRO.

20:08.160 --> 20:11.010
And what this does is it does BootstrapFewShot

20:11.010 --> 20:13.860
with random search, like it does optimize the examples,

20:13.860 --> 20:15.689
but it also optimizes the instructions.

20:15.689 --> 20:18.080
And it's important to note that BootstrapFewShot

20:18.080 --> 20:20.430
and BootstrapFewShot with random search,

20:20.430 --> 20:22.530
they do not change your instructions at all.

20:22.530 --> 20:24.280
It just uses what was in the docstring

20:24.280 --> 20:27.844
and then adds a lot of examples of the task being done.

20:27.844 --> 20:30.510
So that's the major difference here.

20:30.510 --> 20:32.241
There's also one more we're gonna talk about

20:32.241 --> 20:35.670
which just optimizes the instructions without

20:35.670 --> 20:36.750
messing with the examples.

20:36.750 --> 20:39.330
So we'll do that at the end, and then BootstrapFinetune

20:39.330 --> 20:41.700
is really if you want to take a smart model

20:41.700 --> 20:44.254
and distill it down to a smaller one if you wanna do some

20:44.254 --> 20:46.650
kind of fine tuning essentially.

20:46.650 --> 20:48.690
Alright, so let's see how this works.

20:48.690 --> 20:51.101
And again, normally it takes a few minutes to do this,

20:51.101 --> 20:54.630
but we end up with a compiled model at the end

20:54.630 --> 20:55.579
and that compiled model

20:55.579 --> 20:59.016
it supposedly will do a better job essentially.

20:59.016 --> 21:02.730
And you can see, because it's added some

21:02.730 --> 21:04.740
examples to the prompt, which we know is like one

21:04.740 --> 21:06.750
of the prompt engineering principles of like how

21:06.750 --> 21:08.130
to improve performance.

21:08.130 --> 21:10.336
So you can see it already does a better job.

21:10.336 --> 21:12.300
So it has the rationale here

21:12.300 --> 21:14.490
and it says, let's play on the idea

21:14.490 --> 21:16.652
of a novice fisherman misunderstanding common fishing terms.

21:16.652 --> 21:19.025
And the joke is better, I think.

21:19.025 --> 21:21.360
"Why don't fish make good musicians?

21:21.360 --> 21:24.510
I took my buddy fishing and he threw his guitar in the lake

21:24.510 --> 21:27.570
because I told him we're going to catch a few bass or bass".

21:27.570 --> 21:28.770
So it's better, but like,

21:28.770 --> 21:30.883
actually a little bit weird to set up.

21:30.883 --> 21:33.570
Anyway, so I would say it's at least more

21:33.570 --> 21:34.915
interesting than the first joke.

21:34.915 --> 21:38.760
All right, and let's see what DSPY actually did for us.

21:38.760 --> 21:41.670
All it did, again, it didn't change any of the instructions,

21:41.670 --> 21:43.140
that's the instruction that we gave it.

21:43.140 --> 21:45.311
All it did is it added these examples.

21:45.311 --> 21:50.311
So it added this example and it added this example.

21:51.910 --> 21:55.453
Okay, we can see the full thing, it added one on holidays.

21:55.453 --> 21:58.170
And you know, it's important to note

21:58.170 --> 22:02.661
that these are synthetic examples, so

22:02.661 --> 22:06.480
these are things that it has added.

22:06.480 --> 22:08.061
So some of them are synthetic examples

22:08.061 --> 22:09.508
and then some of them are

22:09.508 --> 22:11.490
actual examples

22:11.490 --> 22:13.830
from your training data, that's what it does.

22:13.830 --> 22:17.026
So that's interesting and it does improve the performance.

22:17.026 --> 22:20.640
Which we could test, but I'm gonna show you

22:20.640 --> 22:23.040
the more interesting one, which is the BootstrapFewShot

22:23.040 --> 22:24.120
with random search.

22:24.120 --> 22:26.040
And what this does is it makes sure it's including

22:26.040 --> 22:27.717
only the best examples.

22:27.717 --> 22:30.365
And so if we evaluate this, we can see

22:30.365 --> 22:34.620
that the performance is a lot better.

22:34.620 --> 22:37.890
We've got 94% instead of 70%.

22:37.890 --> 22:40.080
So that's the one you really wanna use if you have more

22:40.080 --> 22:41.130
than a hundred examples.

22:41.130 --> 22:42.960
It actually really improved the performance

22:42.960 --> 22:45.389
and we didn't have to do anything to optimize the prompt.

22:45.389 --> 22:50.389
And if we run this, we can see,

22:50.850 --> 22:52.771
this is the full prompt here, which examples performed

22:52.771 --> 22:56.047
the best when it added them in, right?

22:56.047 --> 22:58.980
And some of them again are from the training data,

22:58.980 --> 23:01.244
some of them are synthetic examples

23:01.244 --> 23:05.280
and you can see as well adding some more offensive jokes

23:05.280 --> 23:07.303
in here seems to have improved the performance.

23:07.303 --> 23:10.140
But we don't have to know why one prompt

23:10.140 --> 23:13.380
and one example performed better than another, DSPY

23:13.380 --> 23:16.680
has figured that out for us, which is great.

23:16.680 --> 23:18.700
Cool, and yeah, this is really good

23:18.700 --> 23:20.454
and it's a big leap forward

23:20.454 --> 23:22.193
and we didn't have to do any prompting.

23:22.193 --> 23:23.970
Yes there's a lot of setup,

23:23.970 --> 23:25.680
but once you understand how this works

23:25.680 --> 23:27.428
you can just change a few variables

23:27.428 --> 23:29.310
and then you've got it and you can save

23:29.310 --> 23:30.540
this locally as well.

23:30.540 --> 23:32.726
So you can use this prompt and you can load it.

23:32.726 --> 23:35.070
So if you're using this in production,

23:35.070 --> 23:37.110
you could just load the turbo.joke.json

23:37.110 --> 23:40.290
and then you have the optimized prompt to use.
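
NOTE
To make the save-and-load idea concrete, here's a sketch with only the
Python standard library. The JSON layout and the helper names save_prompt
and load_prompt are made up for illustration, they are not DSPy's file
format or API. The file name is the one mentioned above, written to a
temporary folder so nothing is left behind.
```python
# Sketch using only the standard library; the JSON layout is made up, not DSPy's format.
import json
import os
import tempfile
#
def save_prompt(path: str, instructions: str, demos: list[dict]) -> None:
    """Write the instructions and chosen examples to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"instructions": instructions, "demos": demos}, f, indent=2)
#
def load_prompt(path: str) -> dict:
    """Read the saved prompt back, for example at startup in production."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
#
optimized = {
    "instructions": "Tell a funny joke about the topic.",
    "demos": [{"topic": "fishing", "joke": "(example joke text)"}],
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "turbo.joke.json")
    save_prompt(path, optimized["instructions"], optimized["demos"])
    loaded = load_prompt(path)
print(loaded == optimized)
```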

23:40.290 --> 23:41.247
Cool, so hopefully that makes sense.

23:41.247 --> 23:44.040
But we're gonna show you COPRO, which is just

23:44.040 --> 23:46.623
to show you like we're not gonna do a full MIPRO one

23:46.623 --> 23:48.000
where MIPRO is the one

23:48.000 --> 23:49.833
that optimizes both the examples and the instructions.

23:49.833 --> 23:52.901
It's very intensive and I wouldn't necessarily do it

23:52.901 --> 23:57.901
if you are using a GPT Turbo model or another expensive AI like I am.

23:58.186 --> 24:00.298
'Cause it might cost a lot.

24:00.298 --> 24:03.960
But we'll try COPRO and you can set limits on this as well,

24:03.960 --> 24:05.951
but I just wanted to test it out and see how it works.

24:05.951 --> 24:08.631
COPRO only optimizes the instructions.

24:08.631 --> 24:11.390
So the way you set it up, it's very similar

24:11.390 --> 24:14.620
except you pass the eval directly into the prompt

24:14.620 --> 24:16.320
optimizer to compile it.

24:16.320 --> 24:18.351
If you look here, oh here we go.

24:18.351 --> 24:22.267
So it is actually changing it and making new responses

24:22.267 --> 24:26.367
and you can see that wherever it has tried

24:26.367 --> 24:30.120
that prompt in the past, it just jumps ahead,

24:30.120 --> 24:33.000
and for ones where it hasn't tried that prompt in the past,

24:33.000 --> 24:34.860
it goes and runs it.

24:34.860 --> 24:36.879
So this is again, one of the major benefits

24:36.879 --> 24:41.879
of using DSPY is that it saves you time

24:42.156 --> 24:45.150
by not retesting something that's already been tested.
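
NOTE
The jump-ahead behavior is basically a cache. Here's a small sketch of
that, assuming a plain dict keyed by the prompt text. How DSPy stores its
results internally isn't covered here, so treat expensive_eval and
cached_eval as hypothetical stand-ins.
```python
# Sketch of result caching; a plain dict stands in for whatever DSPy does internally.
calls = {"count": 0}
cache: dict[str, float] = {}
#
def expensive_eval(prompt: str) -> float:
    """Stand-in for a paid LLM evaluation; counts how often it really runs."""
    calls["count"] += 1
    return float(len(prompt))  # made-up score, just so there is something to store
#
def cached_eval(prompt: str) -> float:
    """Only run the expensive evaluation for prompts not seen before."""
    if prompt not in cache:
        cache[prompt] = expensive_eval(prompt)
    return cache[prompt]
#
scores = [cached_eval(p) for p in ["joke v1", "joke v2", "joke v1"]]
print(scores, calls["count"])
```
Three lookups, but the expensive function only runs for the two distinct prompts.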

24:45.150 --> 24:47.070
But you can see here that it's, you know,

24:47.070 --> 24:49.080
live updating the score.

24:49.080 --> 24:53.328
This one, it's got a 74% score, which is pretty good,

24:53.328 --> 24:55.919
so, you know, it's doing okay.

24:55.919 --> 24:59.100
This is just changing the instructions, right?

24:59.100 --> 25:01.047
It's not adding any examples to the prompt,

25:01.047 --> 25:03.750
but you would get a better result if you changed both

25:03.750 --> 25:05.490
the instructions and the examples.

25:05.490 --> 25:06.736
But it takes a long time.

25:06.736 --> 25:10.154
And you can see this is the previous winner that we had.

25:10.154 --> 25:12.303
We just got that through the candidate programs.

25:12.303 --> 25:13.830
You can see that the prompts

25:13.830 --> 25:15.600
that it comes up with are pretty weird.

25:15.600 --> 25:18.240
So it says like engineer humorous remark related

25:18.240 --> 25:19.860
to your specified topic with strategic use

25:19.860 --> 25:21.390
of satire understanding hyperbole

25:21.390 --> 25:22.826
for a heightened comedy effect.

25:22.826 --> 25:24.730
It just comes up with really odd prompts,

25:24.730 --> 25:26.940
but it does, it tends to work.

25:26.940 --> 25:30.420
And this one got 86% with just improving the instructions,

25:30.420 --> 25:31.530
no examples added.
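
NOTE
A toy sketch of instruction-only optimization, to show the shape of the
loop. It is not COPRO itself: in the real thing an LLM proposes the
rewrites and the score comes from your metric on real data. Here the
proposer, the scorer, and the breadth and depth parameters are all
hypothetical stand-ins.
```python
# Sketch of instruction search; the proposer and scorer are toys, not COPRO itself.
from typing import Callable
#
def optimize_instructions(
    seed: str,
    propose: Callable[[str, int], list[str]],
    score: Callable[[str], float],
    breadth: int = 3,
    depth: int = 2,
) -> tuple[str, float]:
    """Each round, propose rewrites of the current best and keep any improvement."""
    best, best_score = seed, score(seed)
    for _ in range(depth):
        for candidate in propose(best, breadth):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best, best_score
#
# Toy proposer: tack a made-up extra sentence onto the current instruction.
EXTRAS = ["Use satire.", "Use hyperbole.", "Be brief."]
def toy_propose(current: str, n: int) -> list[str]:
    return [f"{current} {extra}" for extra in EXTRAS[:n]]
#
# Toy scorer: one point per comedy technique the instruction mentions.
def toy_score(instruction: str) -> float:
    return float(sum(word in instruction for word in ("satire", "hyperbole")))
#
best, best_score = optimize_instructions("Tell a joke about the topic.", toy_propose, toy_score)
print(best, best_score)
```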

25:31.530 --> 25:34.645
So you can see why it would really improve performance.

25:34.645 --> 25:37.910
Now a couple of things to be wary of with DSPY.

25:37.910 --> 25:40.413
It does make hundreds of tests,

25:40.413 --> 25:43.470
so it can cost you a lot if you're using AI

25:43.470 --> 25:46.180
to test whether it did well or not.

25:46.180 --> 25:48.450
So be careful when you run it,

25:48.450 --> 25:50.850
especially if you're using GPT-4 to evaluate

25:50.850 --> 25:53.337
and it might be cheaper just to optimize it yourself.
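
NOTE
A back-of-the-envelope sketch for estimating that cost before you run
anything. Every number below is hypothetical, including the price per
call, and the formula assumes one generation plus a fixed number of
AI-graded checks per example. Plug in your own figures.
```python
# Back-of-the-envelope cost sketch; every number below is hypothetical.
def estimate_run(
    candidates: int,
    devset_size: int,
    judge_calls: int,
    price_per_call: float,
) -> tuple[int, float]:
    """Count LLM calls for an optimization run and multiply by a per-call price."""
    calls_per_example = 1 + judge_calls  # one generation plus the AI-graded checks
    total_calls = candidates * devset_size * calls_per_example
    return total_calls, total_calls * price_per_call
#
total_calls, total_cost = estimate_run(
    candidates=10, devset_size=20, judge_calls=3, price_per_call=0.01
)
print(f"{total_calls} calls, about ${total_cost:.2f}")
```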

25:53.337 --> 25:56.304
But yeah, hopefully you understand how this works

25:56.304 --> 26:00.003
and you can use it for optimizing some of your own prompts.