WEBVTT

00:00.780 --> 00:01.800
-: All right, let's figure out

00:01.800 --> 00:05.580
how to optimize our prompts using SAMMO.

00:05.580 --> 00:10.050
SAMMO is quite useful in terms of optimization.

00:10.050 --> 00:12.330
In particular, prompt engineering

00:12.330 --> 00:15.060
is what I find it's most useful for.

00:15.060 --> 00:18.540
I still prefer DSPy, to be honest, for optimization.

00:18.540 --> 00:21.060
I think it works a little bit better out of the box,

00:21.060 --> 00:23.790
but I suspect that both of these libraries

00:23.790 --> 00:25.620
will learn from each other

00:25.620 --> 00:28.410
and steal each other's best features, I imagine, over time,

00:28.410 --> 00:30.480
because there's a lot of similarities between the two.

00:30.480 --> 00:35.480
Import SAMMO, and then we need to import, from the runners,

00:36.480 --> 00:38.430
we need to get the OpenAIChat.

00:38.430 --> 00:42.389
We also wanna get our components, so what do we need?

00:42.389 --> 00:46.447
We just need output, I think, at this point.

00:46.447 --> 00:50.410
We also wanna get the data table

00:55.380 --> 00:59.523
and the evaluations,

01:01.710 --> 01:03.564
and then import os

01:03.564 --> 01:07.593
and import pandas as pd.

01:08.730 --> 01:11.880
One thing I wanna do here, when we're doing optimization,

01:11.880 --> 01:16.880
is to set the logger level to "WARNING".

01:17.240 --> 01:21.360
We don't wanna get a bunch of outputted text,

01:21.360 --> 01:22.857
which can get a little bit annoying

01:22.857 --> 01:24.900
but that's optional, you don't have to do that.
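
NOTE
A minimal sketch of quieting log output with Python's standard logging module. The logger name "sammo" is an assumption for illustration; the video only says the logger is set to "WARNING".
```python
import logging
# Assumption: the library logs under a logger named "sammo".
logger = logging.getLogger("sammo")
# From here on, only WARNING and above pass this logger's level check.
logger.setLevel(logging.WARNING)
print(logger.isEnabledFor(logging.INFO))     # False
print(logger.isEnabledFor(logging.WARNING))  # True
```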

01:24.900 --> 01:27.480
Next we're gonna set up our OpenAIChat runner,

01:27.480 --> 01:29.627
and it's a pretty standard setup.

01:29.627 --> 01:34.627
Here I'm just gonna set the timeout to 30 seconds,

01:35.340 --> 01:39.990
and then we're going to define a function for loading data

01:39.990 --> 01:44.470
so the way we do that is we load the transaction data,

01:48.300 --> 01:51.518
which is that file name in pandas

01:51.518 --> 01:56.518
and then get that into the data table format,

01:57.810 --> 02:00.490
to define the input fields

02:02.850 --> 02:04.443
Here it's just "description"

02:07.890 --> 02:08.920
Output fields

02:13.230 --> 02:15.120
which is just classification

02:15.120 --> 02:17.430
If we had more than one, we would put this as a list,

02:17.430 --> 02:19.680
but we don't need to

02:19.680 --> 02:22.410
and then we just define the constants,

02:22.410 --> 02:25.590
which is the instructions we're gonna use for all the data

02:25.590 --> 02:28.200
and pull that into our prompt

02:28.200 --> 02:31.450
and we just need to return mydata

02:31.450 --> 02:34.637
Okay, we also need here,

02:34.637 --> 02:39.637
I need to define a function for accuracy as well,

02:39.990 --> 02:44.280
with y_true and y_pred.

02:44.280 --> 02:46.030
These are both gonna be data tables

02:46.890 --> 02:49.800
and then we're gonna return an evaluation score

02:49.800 --> 02:54.000
so what that means is we're gonna get the y_true outputs

02:54.000 --> 02:59.000
and then we're gonna define n_correct,

03:02.250 --> 03:04.997
which is just gonna get the sum

03:06.700 --> 03:08.783
of whether y_p equals y_t

03:11.542 --> 03:13.342
so y_pred equals y_true,

03:13.342 --> 03:15.270
and we're just gonna iterate through the zip

03:15.270 --> 03:17.250
of y_pred and y_true.

03:17.250 --> 03:19.873
And so what this is basically gonna do is

03:19.873 --> 03:24.030
it's gonna zip up these two lists of the results,

03:24.030 --> 03:25.920
so they're both next to each other,

03:25.920 --> 03:27.510
and then it's gonna iterate through them

03:27.510 --> 03:29.910
and then just check whether they're the same.

03:29.910 --> 03:31.800
And then, once you have that,

03:31.800 --> 03:35.910
we can just get the final result,

03:35.910 --> 03:39.040
which will be n_correct

03:39.877 --> 03:41.637
divided by len(y_true),
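
NOTE
The accuracy function described above can be sketched as follows. This is a simplified version under one assumption: plain Python lists of label strings stand in for the data tables used in the video, and a plain float is returned instead of an evaluation score object.
```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    # Assumption: plain lists of label strings stand in for the data tables.
    if not y_true:
        return 0.0
    # zip pairs each prediction with its true label; True counts as 1 in sum().
    n_correct = sum(y_p == y_t for y_p, y_t in zip(y_pred, y_true))
    return n_correct / len(y_true)
y_true = ["Food", "Utilities", "Other", "Food"]
y_pred = ["Food", "Utilities", "Food", "Food"]
print(accuracy(y_true, y_pred))  # 0.75
```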

03:45.210 --> 03:48.240
and that should, okay, the other thing we need

03:48.240 --> 03:51.690
to pull in here, which I didn't pull in before,

03:51.690 --> 03:55.920
we need to go to sammo.instructions,

03:55.920 --> 03:59.523
we wanna get MetaPrompt, Section,

04:02.640 --> 04:06.097
Paragraph, InputData, and FewshotExamples

04:10.470 --> 04:12.570
so that's gonna help us build our template

04:13.989 --> 04:15.910
and we also need

04:22.050 --> 04:24.850
from the dataformatters, we import

04:25.993 --> 04:28.326
the QuestionAnswerFormatter.

04:40.500 --> 04:42.993
Okay, I think that's everything we need there.

04:43.987 --> 04:46.650
Next, we're gonna set up our labels,

04:46.650 --> 04:49.890
which is just other, food, entertainment, utilities,

04:49.890 --> 04:52.110
we're gonna explain what this means in a second

04:52.110 --> 04:55.920
but we're basically, this is like from a data set

04:55.920 --> 04:58.580
that has transactions

04:58.580 --> 05:01.470
so we need to classify into one of these buckets

05:01.470 --> 05:03.720
so that's what our prompt is gonna do

05:03.720 --> 05:05.570
so let's set up a MetaPrompt,

05:06.960 --> 05:11.960
and this is using basically just a list

05:11.970 --> 05:13.500
of the different sections.

05:13.500 --> 05:16.150
so the first section we're gonna have is instructions

05:19.860 --> 05:22.770
and you can get that from the constant,

05:22.770 --> 05:26.460
actually, yeah, it's, I've used this type of thing before.

05:26.460 --> 05:28.110
We also got examples

05:28.110 --> 05:30.459
The fewshot examples come from mydata.

05:30.459 --> 05:33.898
I haven't loaded my data yet so let's do that actually

05:33.898 --> 05:38.103
mydata equals load_data, the function we created here,

05:39.990 --> 05:41.540
I'm gonna get a sample as well.

05:48.730 --> 05:51.480
In fact, we can make this sample,

05:52.680 --> 05:54.473
actually this is gonna be a new sample.

05:57.572 --> 06:00.670
Yeah, this is a specific paragraph

06:02.430 --> 06:03.900
and now that's loading properly

06:03.900 --> 06:05.400
so we have our output labels

06:05.400 --> 06:09.030
What we actually wanna do is we wanna

06:09.030 --> 06:14.030
define this dynamically to be,

06:20.370 --> 06:23.820
so this is just taking the labels that we defined here

06:23.820 --> 06:26.080
and then it's just joining them together with a comma

06:26.080 --> 06:29.280
so that way we know we're working from the same labels

06:29.280 --> 06:30.540
when we're using this prompt

06:30.540 --> 06:32.478
so if we ever wanna change these labels up here,

06:32.478 --> 06:35.280
it'll be reflected later in the prompt.
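
NOTE
A small sketch of building the output-labels text from the label list, so the prompt and the formatter always work from the same labels. The label values and the "Output labels:" prefix are hypothetical; the video only names the categories in passing.
```python
# Hypothetical label list, based on the categories mentioned in the video.
labels = ["Other", "Food", "Entertainment", "Utilities"]
# Join the labels with a comma so the prompt text always matches the list.
output_labels_text = "Output labels: " + ", ".join(labels)
print(output_labels_text)
```
Changing the list in one place then changes the prompt text everywhere it is used.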

06:35.280 --> 06:38.790
We're rendering it as markdown,

06:38.790 --> 06:41.060
and then we're also passing those labels into this

06:41.060 --> 06:41.900
QuestionAnswerFormatter

06:41.900 --> 06:44.130
which is gonna do the job for us,

06:44.130 --> 06:46.809
of extracting them from the final prompt

06:46.809 --> 06:49.030
and then there's imprompt

06:50.894 --> 06:53.040
and then this basically just defines

06:53.040 --> 06:55.510
what happens if there's an empty result

06:55.510 --> 06:58.513
and it automatically sets this stuff up for us

06:58.513 --> 07:01.768
and then we can just run it, which is

07:01.768 --> 07:06.768
we set the prompt, which is this, the mini batch size,

07:07.524 --> 07:10.273
which is how many labels it does at one time

07:10.273 --> 07:13.929
and then on_error, and then we run with the runner

07:13.929 --> 07:15.780
and the sample

07:15.780 --> 07:19.710
and we're just gonna say, just see,

07:19.710 --> 07:21.573
look at the last five.

07:23.743 --> 07:25.893
Okay, we have an issue here.

07:41.081 --> 07:43.413
The issue just slide.

07:48.090 --> 07:52.607
Yeah so this is the main issue that we had is we wanted

07:52.607 --> 07:57.188
to get the values of the outputs out before.

07:57.188 --> 08:01.170
Yeah, here we go, so this ran,

08:01.170 --> 08:03.825
and we can see the inputs, see the transactions,

08:03.825 --> 08:05.820
and then the outputs

08:05.820 --> 08:07.681
so that prompt is working,

08:07.681 --> 08:10.351
we can check the accuracy of this prompt

08:10.351 --> 08:14.155
so this is doing 85% accuracy

08:14.155 --> 08:17.760
and so it's got 85% of these labels correct,

08:17.760 --> 08:19.860
which is really helpful to see,

08:19.860 --> 08:22.590
and now we have a baseline

08:22.590 --> 08:24.748
that we can start to optimize from.

08:24.748 --> 08:26.280
If you're interested in seeing

08:26.280 --> 08:30.180
what the actual prompts look like when they're compiled,

08:30.180 --> 08:32.680
we can actually get that and see what the post

08:34.904 --> 08:36.854
and we're just gonna get the first one.

08:45.030 --> 08:46.830
Here we go so this is what the prompt looks like

08:46.830 --> 08:50.550
We put in the fewshot examples, got the instructions,

08:50.550 --> 08:52.830
and then you can see the output labels.

08:52.830 --> 08:54.484
It's what we've told it to use

08:54.484 --> 08:57.719
and then we've given it these inputs.

08:57.719 --> 09:00.249
Cool so that's how that works

09:00.249 --> 09:03.915
We just want to try and improve this prompt in some way

09:03.915 --> 09:06.510
so there's a few different things we could test.

09:06.510 --> 09:10.140
We could test whether, you know, having a different number

09:10.140 --> 09:12.060
of fewshot examples helps

09:12.060 --> 09:15.120
We could test whether changing the instructions helps

09:15.120 --> 09:17.070
or the output labels, whatever we want

09:17.070 --> 09:19.140
We could add extra sections to the prompt,

09:19.140 --> 09:22.473
but we can, we can do a pretty easy test here,

09:22.473 --> 09:26.010
which is by adding in the operators

09:26.010 --> 09:29.250
from SAMMO's search module to AB test prompts,

09:29.250 --> 09:34.250
which is the EnumerativeSearch operator.

09:34.410 --> 09:38.763
We also need search_op,

09:41.310 --> 09:44.250
which has one_of, the search operator that's gonna tell it

09:44.250 --> 09:48.810
to pick one of the prompts that we're testing

09:48.810 --> 09:51.625
and then put,

09:51.625 --> 09:56.625
and the convention is, we create a prompt space,

09:57.840 --> 09:59.449
which is all the different instructions

09:59.449 --> 10:02.580
that we need in order to create the prompt

10:02.580 --> 10:04.740
so ours is gonna be pretty simple

10:04.740 --> 10:07.263
We're just gonna do one_of,

10:08.284 --> 10:11.460
and then we're gonna have two different prompts here

10:11.460 --> 10:12.570
so yeah, here we go

10:12.570 --> 10:15.330
We have the one that, this is the instructions that we had

10:15.330 --> 10:18.612
before and then this is the same thing,

10:18.612 --> 10:21.780
but with just "this is very important to my career"

10:21.780 --> 10:24.720
appended to it. The instructions

10:24.720 --> 10:28.770
will get enumerated each time when we do one_of,

10:28.770 --> 10:30.008
so it's gonna test both

10:30.008 --> 10:34.980
and then we've got everything else set up, same as we had

10:34.980 --> 10:39.699
before, except, so I'm just gonna copy this, except

10:39.699 --> 10:43.196
you know, now the instructions are coming from this one_of,

10:43.196 --> 10:44.700
makes sense.

10:44.700 --> 10:49.700
Okay and then just gonna get returns,

10:52.080 --> 10:54.780
so now we're just returning from this function

10:54.780 --> 10:58.380
We're telling the output, got the mini batch size stuff.

10:58.380 --> 11:01.435
Okay, so we have this ready

11:01.435 --> 11:03.870
and now we can run our AB test

11:03.870 --> 11:05.796
so the way that we run our AB test

11:05.796 --> 11:08.027
now that we've got this setup done

11:08.027 --> 11:09.806
is we just create a sample

11:09.806 --> 11:14.806
and get like 25 and then we create a searcher

11:18.150 --> 11:20.790
and we set up the enumerative search into that

11:20.790 --> 11:22.650
We need the runner first,

11:22.650 --> 11:26.737
and then the labeling_prompt_space,

11:29.700 --> 11:34.290
and then the accuracy true, the evaluation at true,

11:34.290 --> 11:36.930
so those, what this is gonna do is gonna run this

11:36.930 --> 11:41.152
enumerative search, which will basically just go through

11:41.152 --> 11:44.717
with the runner, iterate through the labeling prompt space,

11:44.717 --> 11:47.250
in this case it's just gonna AB test these two

11:47.250 --> 11:49.127
instructions and then,

11:49.127 --> 11:52.203
it's gonna measure them based on the accuracy.
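
NOTE
To make the idea concrete, here is a plain-Python sketch of an enumerative AB test: try every candidate instruction on the same sample and score each one by accuracy. It does not use SAMMO; fake_model is a hypothetical, deliberately contrived stand-in for the LLM call, so the scores say nothing about whether the extra phrase really helps.
```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
def fake_model(instructions, description):
    """Hypothetical stand-in for an LLM call so the sketch runs offline."""
    rules = {"pizza": "Food", "electric": "Utilities"}
    # Contrived on purpose: the longer instructions unlock one extra rule.
    if "important" in instructions:
        rules["cinema"] = "Entertainment"
    for keyword, label in rules.items():
        if keyword in description:
            return label
    return "Other"
base = "Classify the transaction into one of the output labels."
candidates = [base, base + " This is very important to my career."]
sample = [("pizza place", "Food"), ("electric bill", "Utilities"),
          ("cinema tickets", "Entertainment"), ("misc purchase", "Other")]
y_true = [label for _, label in sample]
# Enumerate every candidate and score it on the same sample.
results = {}
for instructions in candidates:
    y_pred = [fake_model(instructions, text) for text, _ in sample]
    results[instructions] = accuracy(y_true, y_pred)
best = max(results, key=results.get)
print(results[base], results[best])
```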

11:54.267 --> 11:57.934
Now, let's do y_pred, searcher.fit_transform

12:06.090 --> 12:11.090
sample and then show_report, let's run this.

12:21.510 --> 12:23.190
Okay, yeah so that was cached already

12:23.190 --> 12:24.870
so that ran pretty quickly here,

12:24.870 --> 12:27.199
but you can see, there we go

12:27.199 --> 12:31.206
This one here, this prompt with very important to my career

12:31.206 --> 12:33.420
had zero errors like the other one,

12:33.420 --> 12:35.123
except it got slightly better accuracy, 84%

12:35.123 --> 12:40.123
so that's a super simple way to AB test.

12:40.230 --> 12:43.560
The optimizer I found is a little bit buggy,

12:43.560 --> 12:45.660
but I'm gonna copy and paste

12:45.660 --> 12:47.553
an example here that I did get working

12:47.553 --> 12:51.288
The example in the documentation

12:51.288 --> 12:53.520
is a little unclear to be honest,

12:53.520 --> 12:55.140
I ran into a bunch of issues

12:55.140 --> 12:59.310
but essentially the way this works is similar to this,

12:59.310 --> 13:00.630
I actually couldn't get it to work

13:00.630 --> 13:03.727
with the MetaPrompt structure I had before,

13:03.727 --> 13:06.798
but if I set up basically the same prompt here,

13:06.798 --> 13:11.093
but they seem to use paragraphs instead of sections,

13:11.093 --> 13:14.993
and then it's still got the one_of as you can see here

13:14.993 --> 13:17.780
and then you've still got these things that you

13:17.780 --> 13:21.240
should like fit output labels, et cetera

13:21.240 --> 13:23.880
and also it doesn't seem to work

13:23.880 --> 13:26.430
unless you have render_as set to raw

13:26.430 --> 13:29.490
and then you have this other formatter,

13:29.490 --> 13:30.902
which is the PlainFormatter,

13:30.902 --> 13:33.683
and then we're getting the labels from

13:33.683 --> 13:37.170
the training data in this case

13:37.170 --> 13:40.860
so we're setting up a class initializing the training data

13:40.860 --> 13:42.463
and then we're getting the labels

13:42.463 --> 13:45.267
and then we're setting up slightly differently here

13:45.267 --> 13:46.800
I would just copy and paste,

13:46.800 --> 13:49.230
if you're gonna do an optimization, I'd copy and paste this

13:49.230 --> 13:51.032
and then just change some of the instructions

13:51.032 --> 13:53.340
and kind of use the same form

13:53.340 --> 13:56.190
I also found that in the documentation it recommends

13:56.190 --> 13:57.510
so this is raise,

13:57.510 --> 14:00.047
I would just follow this exactly if I was you

14:00.047 --> 14:04.162
but then the way it works, I'm just gonna demo this for you,

14:04.162 --> 14:07.940
pull in the data and then you create the training data

14:07.940 --> 14:11.266
In this case we're sampling 10 from mydata,

14:11.266 --> 14:15.023
and then you create these mutation operators

14:15.023 --> 14:18.990
and you just tell it what it is you're trying to mutate

14:18.990 --> 14:20.360
It's just gonna rewrite the prompt

14:20.360 --> 14:22.500
That's basically what they mean by mutations

14:22.500 --> 14:25.417
Gonna rewrite the instructions and paraphrase

14:25.417 --> 14:28.303
and then we'll see how that works

14:28.303 --> 14:33.083
so if we run the beam search,

14:33.083 --> 14:36.030
basically just passing in all these different things,

14:36.030 --> 14:38.211
the runner, the mutation operators, accuracy,

14:38.211 --> 14:40.036
and these parameters,

14:40.036 --> 14:41.825
don't worry too much about what these do
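
NOTE
For intuition only, here is a toy beam search over prompt rewrites in plain Python. It is not SAMMO's BeamSearch: the mutators are hypothetical string edits standing in for LLM paraphrases, the score function is a made-up objective, and beam_width and depth are illustrative parameter names.
```python
def beam_search(initial, mutators, score, beam_width=2, depth=2):
    """Keep the best `beam_width` candidates after each round of mutations."""
    beam = [initial]
    for _ in range(depth):
        # Apply every mutator to every candidate currently in the beam.
        children = [m(c) for c in beam for m in mutators]
        # Keep the parents in the pool (deduplicated) so a round never makes things worse.
        pool = list(dict.fromkeys(beam + children))
        beam = sorted(pool, key=score, reverse=True)[:beam_width]
    return beam[0]
# Hypothetical mutators: simple string rewrites standing in for LLM paraphrases.
mutators = [
    lambda p: p + " Answer with one label only.",
    lambda p: p.replace("Classify", "Categorize"),
]
# Toy objective: rewards two phrases; a real run would score accuracy on data.
def score(prompt):
    return ("one label only" in prompt) + ("Categorize" in prompt)
best = beam_search("Classify the transaction.", mutators, score)
print(best)
```
Each round widens the pool with rewritten prompts and then prunes it back to the top few, which is the general shape of a mutation-based prompt search.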

14:41.825 --> 14:45.330
This is the actual report, right?

14:45.330 --> 14:47.224
It shows the different things it's tried, like one,

14:47.224 --> 14:49.528
try rephrasing it in different ways

14:49.528 --> 14:51.810
and then you have the different objectives

14:51.810 --> 14:54.657
so it seems to get 90% on some of them, which is great

14:54.657 --> 14:57.705
and then you can just here print out the best prompt

14:57.705 --> 15:01.320
so it's a little bit more accessible than DSPy I find

15:01.320 --> 15:04.440
but the optimizer I found it's really buggy

15:04.440 --> 15:06.960
Sometimes you run it and you get the objective of zero

15:06.960 --> 15:08.970
It just really depends on what you do and what you get

15:08.970 --> 15:11.820
Yeah, I would say it's still pretty early for optimization

15:11.820 --> 15:13.413
I probably wouldn't use it for that,

15:13.413 --> 15:15.719
but I imagine by the time you watch this video

15:15.719 --> 15:17.483
they will have improved it,

15:17.483 --> 15:20.280
and I found it is very elegant

15:20.280 --> 15:22.710
and easy to use for AB testing

15:22.710 --> 15:24.813
so I think there's some promise here.
