WEBVTT

00:00.040 --> 00:02.600
Dspi is an unusual framework, right?

00:03.280 --> 00:05.640
Uh, actually met up with the founder last week.

00:05.680 --> 00:08.120
Like, really smart guy that came out of Stanford.

00:08.160 --> 00:12.240
And they've used it internally at Stanford for a few of the other projects like storm, which was like

00:12.240 --> 00:14.680
precursor to deep research, was built on DSP.

00:14.720 --> 00:15.400
Oh, interesting.

00:15.440 --> 00:21.080
So I'd say DSP is like one of those kind of open secrets that a lot of people are using.

00:21.120 --> 00:27.040
Like someone went sleuth and found that all of the major vibe coding apps replit windsurf, like all

00:27.080 --> 00:30.080
of the CEOs, follow DSP, but they never talk about DSP.

00:30.120 --> 00:34.040
The reason I think is that the documentation is very academic.

00:34.080 --> 00:39.080
It doesn't really talk that much about real world business use cases, and it assumes that things like

00:39.520 --> 00:45.120
what, like Multihop is, or that you're familiar with this specific academic data set that nobody uses

00:45.120 --> 00:45.840
in the real world.

00:45.960 --> 00:50.600
So once you trudge through all that, you'll find it's actually pretty simple.

00:50.640 --> 00:50.920
Okay.

00:50.960 --> 00:57.000
They just looks very complicated because the people who made it are really smart and therefore the they

00:57.000 --> 01:00.230
have a bit of a blind spot, I think, in terms of making it more accessible.

01:00.230 --> 01:00.670
Yeah.

01:01.190 --> 01:05.350
They also settled on this weird domain specific language that is not very pythonic.

01:05.390 --> 01:11.070
Here is a Dspi signature, which is just if you extract it out, it's literally just defining what the

01:11.070 --> 01:12.910
inputs and outputs are of your prompt.

01:12.950 --> 01:13.310
Okay.

01:13.350 --> 01:15.390
And and these are the instructions right.

01:15.430 --> 01:17.710
Extract structured information from text.

01:18.150 --> 01:18.310
Right.

01:18.350 --> 01:20.070
So you can give really simple instructions.

01:20.070 --> 01:22.670
And that ends up going into the and that ends up going into the prompt.

01:22.710 --> 01:23.230
That's weird.

01:23.270 --> 01:24.150
Yeah really weird.

01:24.150 --> 01:26.630
And then you have to specify the types.

01:26.870 --> 01:30.110
And this is like despite input fields Pi output field.

01:30.390 --> 01:30.510
Right.

01:30.510 --> 01:35.510
Once you've created a signature that is your prompt and the it builds the prompt from that.

01:35.910 --> 01:40.950
But once you've got it in that format, you can swap out models really easily because you just it's

01:40.950 --> 01:47.110
literally the same signature and it has this adapter underneath that kind of makes it run on any of

01:47.110 --> 01:48.470
the different llms.

01:48.470 --> 01:52.710
So you don't have to worry about how does Google do formatting of inputs and outputs?

01:52.710 --> 01:56.830
How does OpenAI, which is nice, but the really big thing is the optimizer.

01:56.830 --> 02:02.550
So once you've told it, here are the inputs, here the outputs, and you've given it an evaluation

02:02.550 --> 02:08.590
metric and a data set to evaluate on it can automatically improve the prompt.

02:09.030 --> 02:10.710
So that's the real payoff.

02:10.750 --> 02:11.110
Yeah.

02:11.350 --> 02:12.910
Once you see that magic in action.

02:13.110 --> 02:13.390
Okay.

02:13.430 --> 02:16.270
Now it's worth it to me to go through all this extra effort.

02:16.310 --> 02:16.710
Yeah.

02:16.750 --> 02:18.870
And you can regularly see it pretty easy.

02:18.910 --> 02:20.950
Like a ten point percentage increase.

02:20.990 --> 02:24.670
Like you can go from, say, 40% accuracy to 50% accuracy.

02:24.710 --> 02:28.550
Or in some cases, I've gone from 50% accuracy to 90% accuracy.

02:28.590 --> 02:35.070
The really difficult thing about DSP as well is just that it forces you to really think about what you

02:35.070 --> 02:36.670
actually want out of your program.

02:36.830 --> 02:41.670
And quite often when people struggle with DSP, it's actually because they're struggling to specify

02:41.710 --> 02:43.190
what they want their program to do.

02:43.510 --> 02:49.830
What is the evaluation metric I want to use to define whether this program is working or not, and formalizing

02:49.830 --> 02:51.390
an evaluation metric.

02:51.430 --> 02:56.620
Most people, when they're building AI tools, are doing vibe based right?

02:56.660 --> 02:58.220
This is what I do and this is what you should do.

02:58.260 --> 03:03.060
Like in the beginning, you should say, hey, I think the output looks good or I think the output looks

03:03.060 --> 03:08.900
bad, or my CEO thinks output looks good or he's noticed in these scenarios it does badly.

03:08.900 --> 03:17.380
So you start with vibe based and it's very hard to go from like my CEO likes this to we have a formal

03:17.580 --> 03:24.100
data set of inputs and expected outputs, and we have an evaluation metric that can check whether the

03:24.100 --> 03:25.460
outputs are good given the input.

03:25.500 --> 03:27.060
So like that is actually the hard part.

03:27.100 --> 03:29.340
But that's the hard part of building AI applications.

03:29.540 --> 03:31.220
It's not his fault.

03:31.260 --> 03:32.140
Yeah that makes sense.

03:32.180 --> 03:32.780
That makes sense.

03:32.820 --> 03:33.180
Yeah.

03:34.220 --> 03:36.380
That is like where my mind trips up.

03:36.380 --> 03:39.300
I'm like, okay, so this is describe something.

03:39.300 --> 03:42.380
But I'm used to like being in the details.

03:42.380 --> 03:45.940
And this is actually saying, no, don't be in the details.

03:45.940 --> 03:46.900
Let us do the details.

03:46.900 --> 03:50.180
But I'm like but like, how do I even steer it?

03:50.220 --> 03:55.410
I think a lot of the tension comes from the fact that they've almost accidentally built the world's

03:55.410 --> 03:56.970
greatest prompt optimizer.

03:57.210 --> 04:00.530
Yeah, but what they really want to be is like a full framework.

04:00.570 --> 04:05.730
They want to use this domain specific language and not care about the details, because you shouldn't

04:05.730 --> 04:06.050
worry.

04:06.050 --> 04:10.850
You can just compile the program with a different model if you want, and don't worry about what the

04:10.850 --> 04:11.570
program did.

04:11.610 --> 04:16.130
Like just worry about the whether the outputs have good evals or not.

04:16.170 --> 04:16.650
Yeah.

04:16.690 --> 04:20.490
So one question because length chain is also oh don't worry about the details.

04:20.490 --> 04:25.890
But like at some point no one was even understanding what was going on.

04:25.890 --> 04:29.170
And no one was like, yeah, it does something, but no one knows how it works.

04:29.290 --> 04:35.930
I'm a little bit like worried that it's like a length chain thing, whereas yeah, sure, like we promise

04:35.930 --> 04:36.770
a great library.

04:36.770 --> 04:40.170
But if no one knows what the hell is going on, what's the value?

04:40.210 --> 04:43.450
So I'm curious what your take on that is as well.

04:43.490 --> 04:43.970
Exactly.

04:43.970 --> 04:44.090
Yeah.

04:44.090 --> 04:48.930
So I use language chain a lot early on and then I like started digging into it and I was like, wait,

04:48.930 --> 04:49.570
hold on.

04:49.610 --> 04:55.170
They've just wrapped the length, the Python length statement like in the 100 lines of code for no reason.

04:55.210 --> 04:57.210
Yeah, I always have this love hate relationship with.

04:57.610 --> 05:03.130
But I would say where DSP is fundamentally different from long chain is that long chain is like just

05:03.130 --> 05:05.010
use our framework and it will make everything easy.

05:05.050 --> 05:08.610
Don't worry about the details, but sometimes you really do have to worry about the details.

05:08.650 --> 05:10.810
Yeah, but they're not really giving you anything in return.

05:10.850 --> 05:12.570
DSP is giving you the optimizers.

05:12.610 --> 05:18.250
You genuinely can ignore the details if you have a good evaluation metric.

05:18.650 --> 05:19.370
And that's the key.

05:19.370 --> 05:24.450
And I would say where DSP is really useful is if you have a formal evaluation metric that you trust,

05:24.490 --> 05:31.770
and if we have one which is dated, guess the category correctly, which is a categorization which is

05:31.770 --> 05:32.810
super easy.

05:32.810 --> 05:36.650
So I would say probably for a use case like that, that's probably ideal, right?

05:37.170 --> 05:37.730
Exactly.

05:37.730 --> 05:37.970
Yeah.

05:38.010 --> 05:40.250
There are a million different ways to build AI applications.

05:40.250 --> 05:44.090
And what I found is because I was working as a prompt engineer for the past few years, is basically

05:44.130 --> 05:47.690
like a single pattern that you can just use and ignore everything else.

05:48.250 --> 05:54.640
This is the evaluator Optimizer pattern, and what this solves is quite often the task you want to do

05:54.640 --> 05:55.200
is fuzzy.

05:55.240 --> 05:58.200
Like, there is no like, real formal evaluation.

05:58.200 --> 06:02.720
If you're doing a blog post generator you could generate, you could check the length of the blog post.

06:02.720 --> 06:06.840
But like, how do you check whether it's a good what is the what are you optimizing for exactly?

06:06.840 --> 06:11.800
So the way I would approach that typically with clients is I would say, okay, who's your domain expert?

06:11.840 --> 06:13.840
Or it might be the CEO, it might be someone else.

06:13.880 --> 06:18.800
So like I worked with a team of psychologists, for example, to do like a personality quiz type thing

06:18.840 --> 06:20.280
and at that time is really valuable.

06:20.280 --> 06:23.560
So you don't want them like manually evaluating every single response.

06:23.720 --> 06:27.600
What I started to do was this LLM as a judge type pattern.

06:27.640 --> 06:34.240
Don't worry about eval ING the actual generator or the optimizer in this case, like the thing doing

06:34.240 --> 06:34.960
the task.

06:35.080 --> 06:38.160
Instead, build a judge to replace the domain expert.

06:38.200 --> 06:38.560
Yeah.

06:38.640 --> 06:42.320
And so you also normally or is it like prompt.

06:42.480 --> 06:47.680
So you prompt that or you fine tune that or you actually use the spy to.

06:47.720 --> 06:48.200
Exactly.

06:48.240 --> 06:50.030
So that's also a DSP program.

06:50.230 --> 06:50.870
So you just.

06:50.910 --> 06:51.150
Yeah.

06:51.190 --> 06:51.990
You do that as well.

06:52.030 --> 06:52.270
Yeah.

06:52.270 --> 06:52.990
Yeah, exactly.

06:52.990 --> 06:56.830
So like the judge will check whether it passes the test.

06:56.870 --> 06:57.070
Yeah.

06:57.110 --> 06:58.670
And it'll give you some sort of score.

06:58.790 --> 07:04.870
So with my DSP metric like the LM judge is my DSP evaluator.

07:04.870 --> 07:09.350
So it allows me to do the fuzzy as long as the judge agrees with me or agrees with the domain expert.

07:09.390 --> 07:14.910
Most of the time, 80%, 90%, then you can trust it to do the optimization and it doesn't cost too

07:14.910 --> 07:18.550
much, especially if you can get the judge working with GPT mini or one of the cheaper models.

07:18.590 --> 07:24.110
And it's usually possible it's actually a much easier job to evaluate whether something is good than

07:24.110 --> 07:27.670
to create something that is good, like P and p equals NP.

07:27.710 --> 07:33.190
And this is in a sense of where you need LLM as a judge, because obviously categorization you don't

07:33.190 --> 07:33.790
need that.

07:34.350 --> 07:35.110
Yeah, exactly.

07:35.110 --> 07:38.150
So categorization is like a judge type task.

07:38.310 --> 07:43.510
The like the fuzzier type tasks, like the example I'm using today, I've got like a notebook that I

07:43.510 --> 07:46.110
could share afterwards is a telling a joke.

07:46.150 --> 07:48.150
Getting an AI to tell a funny joke.

07:48.150 --> 07:52.270
If you ask it, it just uses dad jokes and funny, but like, not real.

07:52.310 --> 07:53.950
Like stand up comedian type jokes.

07:54.270 --> 08:00.270
The first thing you want to do is train a judge to check whether the joke was good or not.

08:00.310 --> 08:01.990
Like just 1 or 0 is really simple.

08:01.990 --> 08:03.150
It could be really simple.

08:03.190 --> 08:06.230
Are you always binary or is it just truthful?

08:06.230 --> 08:10.590
I actually found binary works much better, and it's actually better to just stitch together a bunch

08:10.590 --> 08:17.630
of binary evals than it is to use a Likert scale, or rate this one out of five, because llms they

08:17.630 --> 08:20.630
tend to be too positive and it's like everything's a four.

08:21.110 --> 08:21.350
Yeah.

08:21.390 --> 08:26.350
So yeah, I just do one and zero quite often and you can build up more weights and you can say, I don't

08:26.350 --> 08:32.070
know if you're generating like an article for every, you could say, what are all the different things

08:32.070 --> 08:33.870
that I care about in terms of a good article?

08:33.910 --> 08:35.910
Like it needs to have a catchy hook at the beginning.

08:35.910 --> 08:36.950
It needs to be this length.

08:36.950 --> 08:37.430
It needs to be.

08:37.470 --> 08:41.990
You can actually string all these things together into one master eval and weight them and say, okay,

08:42.030 --> 08:46.700
like catchy hook is 80% of the value, whereas length is like 10% or whatever.

08:46.740 --> 08:47.940
So that's the way I think about it.

08:48.780 --> 08:49.020
Cool.

08:49.060 --> 08:52.620
So let's jump into the code, install DSP.

08:52.620 --> 08:57.300
And then what about sentiment like positive negative neutral or is that too much.

08:57.580 --> 08:58.140
It should be.

08:58.580 --> 09:02.260
The way I see that is that is still like a in a way like a classification.

09:02.300 --> 09:02.540
Okay.

09:02.740 --> 09:03.180
So okay.

09:03.180 --> 09:03.380
Okay.

09:03.420 --> 09:08.220
Yeah I would say it's important that it's mutually exclusive and collectively exhaustive.

09:08.660 --> 09:10.220
As in there's no ambiguity between.

09:10.260 --> 09:10.380
Yeah.

09:10.380 --> 09:10.740
Exactly.

09:10.740 --> 09:10.900
Yeah.

09:10.940 --> 09:11.380
Okay okay.

09:11.420 --> 09:11.820
Makes sense.

09:11.820 --> 09:12.020
Yeah.

09:12.060 --> 09:12.220
Yeah.

09:12.300 --> 09:15.020
The way you set up DSP is very easy.

09:15.260 --> 09:20.340
It's just provider forward slash, GPT or mini or whatever one you're using.

09:20.820 --> 09:24.980
You can set up multiple and you can use one as the teacher for the optimizer.

09:24.980 --> 09:27.260
And one is the student to do the task.

09:27.300 --> 09:28.700
I'll show you that in a second.

09:29.220 --> 09:31.820
But you just generate it and you can put your prompt in.

09:31.820 --> 09:35.260
So you actually can just use DSP as like a scripting language, right?

09:35.300 --> 09:37.340
You don't have to worry about any.

09:37.420 --> 09:40.380
You don't actually don't really need to use it like for optimizers.

09:40.420 --> 09:40.580
Right.

09:40.620 --> 09:45.610
And so I use it quite often now for like throwaway programs and like Python experiments and stuff,

09:45.810 --> 09:46.450
which is nice.

09:46.490 --> 09:49.330
There's a couple of nice features that DSP has straight out of the box.

09:49.410 --> 09:53.610
One is that it's probably the easiest way to get something working on Azure or AWS or something.

09:53.610 --> 09:56.450
It's literally just add a couple of more environment variables.

09:56.690 --> 10:02.330
And and also, if you wanted to see how this run on another LM, you could just add in anthropic slash

10:03.010 --> 10:03.610
chords on it.

10:03.610 --> 10:04.050
Right.

10:04.730 --> 10:06.250
It also has caching built in.

10:06.290 --> 10:07.930
You can see how fast that ran.

10:08.290 --> 10:08.610
Right.

10:08.650 --> 10:11.930
But if I maybe change the temperature.

10:13.170 --> 10:15.490
So it has like default zero temperature.

10:17.250 --> 10:19.690
If I want one it'll run again.

10:20.050 --> 10:20.450
Right.

10:20.490 --> 10:22.570
Because I've changed one of the parameters.

10:22.570 --> 10:25.170
So it will skip the cache.

10:25.610 --> 10:27.890
And you can also do cache equals false if you want.

10:27.930 --> 10:30.810
But but you see how fast that just loaded that from the cache.

10:30.850 --> 10:30.970
Right.

10:31.010 --> 10:34.130
It can save you a lot of money when you're doing experiments.

10:34.170 --> 10:38.010
Even if you didn't remember that you had used that combination before.

10:38.130 --> 10:38.330
Yeah.

10:38.330 --> 10:39.530
Or optimization.

10:39.570 --> 10:41.690
It's it happens to have a cache hit.

10:41.730 --> 10:44.600
It will it will just use the cached version like it won't cost you anything.

10:45.320 --> 10:45.680
Cool.

10:46.400 --> 10:47.440
So that's interesting.

10:47.440 --> 10:51.200
Then to create a program, you can actually just do it in one line if you want to.

10:51.240 --> 10:53.240
So this is a basic joke program.

10:53.240 --> 10:56.240
It just takes a topic and then gives you a joke, right?

10:56.280 --> 11:01.280
Everything on the left side of the arrow, it just turns into an input variable and everything on the

11:01.280 --> 11:02.600
outside it turns into an arrow.

11:02.640 --> 11:04.960
Is this a DSL like input?

11:05.560 --> 11:06.200
Exactly.

11:06.200 --> 11:07.400
They've just added that as well.

11:07.440 --> 11:10.320
Again, extra confusing, but for convenience is quite fun.

11:10.320 --> 11:14.040
And if you wanted to say, I don't know, a comedian, you could do that.

11:14.040 --> 11:16.560
And then literally like the program would just work.

11:16.760 --> 11:17.960
So I really don't like that.

11:18.000 --> 11:18.400
Yeah.

11:18.400 --> 11:19.240
It's terrible.

11:19.520 --> 11:27.000
So I will use I use Ruby, I don't like Python, so I will use dsp y RB which is oh yeah, it's in Ruby

11:27.040 --> 11:30.000
which has a way nicer way to do this.

11:30.680 --> 11:33.440
They don't have a DSL in strings, they just have it.

11:33.720 --> 11:34.080
Yeah.

11:34.960 --> 11:36.160
Ruby I think so.

11:36.480 --> 11:38.440
Yeah a couple of hacks though.

11:38.440 --> 11:41.560
Workarounds for doing it the more pythonic way.

11:41.680 --> 11:43.600
You can see this is the type of dad joke it has.

11:43.600 --> 11:45.560
Why do Python programmers prefer dark mode?

11:45.560 --> 11:46.920
Because light attracts bugs.

11:49.000 --> 11:50.000
It's a classic now.

11:50.040 --> 11:56.040
One of the fun hacks to get use out of DSP is you can just see what prompt was run the last.

11:56.080 --> 11:57.280
So this is n equals one.

11:57.320 --> 12:00.080
This is the last thing that was run on the LM.

12:00.480 --> 12:01.640
And that's a global thing.

12:02.120 --> 12:04.480
Um and so you can see the actual prompt here.

12:04.480 --> 12:05.880
This is what goes in.

12:05.920 --> 12:08.560
Actually it's probably better if you do the scroll down.

12:08.760 --> 12:11.960
It's turned that whole thing into literally just the system message.

12:11.960 --> 12:12.320
Here.

12:12.480 --> 12:13.880
Your input fields are topic.

12:13.880 --> 12:15.160
Your output fields are joke.

12:15.160 --> 12:18.760
And it's given the types all interactions to be structured in the following way.

12:19.360 --> 12:26.960
By default it uses like this kind of markdown type thing, um, rather than like JSON, but you can

12:26.960 --> 12:28.680
change it to JSON if you want.

12:28.720 --> 12:31.000
This does tend to work better actually.

12:31.040 --> 12:32.920
It's like JSON mode.

12:32.920 --> 12:37.280
I think it ends up making it a little bit less intelligent as what people have found with testing.

12:37.760 --> 12:42.030
But yeah, then put in the user message for you and then you get the response.

12:42.150 --> 12:47.150
The really nice thing is it's all automatically formatted and then all automatically passed afterwards

12:47.150 --> 12:47.990
as well for you.

12:48.350 --> 12:54.550
Do you look at these like normally before running or like to see if you can optimize it?

12:54.550 --> 12:56.310
And how do you see if it's good or not?

12:56.790 --> 13:01.750
I used to really hate this because I don't like the style that they use, but it does.

13:01.990 --> 13:06.990
Now I just go, I just use whatever they have like straight away and then it tends to work okay.

13:07.030 --> 13:14.070
So yeah, I manage my like really important prompts like the core prompt rally is not using DSP, but

13:14.110 --> 13:17.670
like I run all of my experiments with DSP and then I find something that works.

13:17.670 --> 13:19.710
And then I'll incorporate that into my main prompt.

13:19.990 --> 13:20.510
Yeah.

13:20.990 --> 13:21.310
Yeah.

13:21.350 --> 13:24.470
So because it's in this format it can go and optimize this.

13:24.470 --> 13:26.990
So one of the optimizers adds few shot examples.

13:27.030 --> 13:29.230
It will test a bunch of your examples that you give it.

13:29.230 --> 13:30.430
And you'll see which ones are the best.

13:30.430 --> 13:35.590
And it'll add them and but it automatically puts it in as user message response user message response.

13:35.590 --> 13:39.180
So you don't have to do all these convenience methods that you might have to do normally, which is

13:39.180 --> 13:39.740
quite nice.

13:41.180 --> 13:41.580
Cool.

13:41.620 --> 13:48.820
Um, so the optimizer you run first and then you store that version or something like that.

13:48.860 --> 13:49.260
You can.

13:49.300 --> 13:49.540
Yeah.

13:49.580 --> 13:50.020
Exactly.

13:50.020 --> 13:50.260
Yeah.

13:50.300 --> 13:54.460
But this is my way of, like, more pythonic way of creating DSP programs.

13:54.700 --> 13:58.700
I just create the fields and the instructions.

13:58.700 --> 14:05.020
And then there's this, like, very deep in their library, there's like this DSP signature, which,

14:05.060 --> 14:08.460
like, doesn't appear in the documentation anywhere, but it works.

14:08.460 --> 14:13.860
And so you can just give the signature name of whatever you want, the instructions and then the fields.

14:13.860 --> 14:15.100
And it does use all of this.

14:15.100 --> 14:21.020
So like actually the like some of the optimizers are program aware as in like they it shows itself what

14:21.020 --> 14:22.180
the program looks like.

14:22.420 --> 14:23.860
So even the name of the program and stuff.

14:23.860 --> 14:26.900
So like what you name things is actually super important to the DSP.

14:28.180 --> 14:30.500
It's funny because that's how the Ruby version works.

14:30.500 --> 14:35.740
You have a class that is the signature with an input and an output and a description, and you run that,

14:35.740 --> 14:37.100
which makes more sense to me.

14:37.100 --> 14:37.940
It's like this.

14:38.340 --> 14:38.540
Yeah.

14:38.580 --> 14:40.260
So you see here I put Dot predict.

14:40.260 --> 14:43.820
This is just like the base model train of thought or whatever there as well.

14:44.140 --> 14:46.020
But you can change that to change the thought.

14:46.180 --> 14:48.140
And then now you have.

14:48.180 --> 14:48.380
Yeah.

14:48.380 --> 14:50.340
So I've just printed output joke.

14:50.380 --> 14:52.620
But if I just print what is output.

14:53.500 --> 14:57.180
It's a prediction object with reasoning and joke.

14:57.180 --> 15:02.220
So it's the reasoning that's the really cool thing is like they have a bunch of they have the react

15:02.220 --> 15:04.260
pattern for agents to use.

15:04.300 --> 15:07.900
They have sampling where it just generates five versions and chooses the best.

15:08.140 --> 15:09.180
You know, they have things like that.

15:09.180 --> 15:13.460
So they have a lot of these prompt engineering techniques that are built in, which is quite nice.

15:14.380 --> 15:17.380
And it's just a one liner to add them, which is useful.

15:18.260 --> 15:18.620
Cool.

15:18.660 --> 15:24.380
Ever use multiple predictors and then go from there like in a multi or.

15:24.820 --> 15:25.700
Yeah exactly.

15:25.740 --> 15:25.900
Yeah.

15:25.900 --> 15:32.740
So you can basically once you've created the signature like then any valid Python object can be a module.

15:32.740 --> 15:35.810
Essentially you can string loads of different predictors together.

15:35.850 --> 15:37.450
Create like complex workflows.

15:37.450 --> 15:43.210
That's where you get more into the long chain type stuff where yeah, like multi-step and like synthesis

15:43.250 --> 15:44.090
and all that stuff.

15:44.130 --> 15:44.690
Yeah, exactly.

15:44.730 --> 15:44.930
Yeah.

15:44.970 --> 15:49.490
Just to show you, like I just brought in Gemini Flash and all I needed was my Gemini API key.

15:49.490 --> 15:53.450
I don't need to worry about how Google's currently doing stuff, which is batshit.

15:54.490 --> 15:55.490
So all good.

15:55.610 --> 15:56.170
Really nice.

15:56.610 --> 16:02.210
And you can see here, by the way, you can have a globally configured like I configured the LLM up

16:02.210 --> 16:02.810
here somewhere.

16:02.810 --> 16:06.850
Yeah DSP configure and it would just use that one automatically.

16:06.850 --> 16:09.530
But you can also just run it in context.

16:09.530 --> 16:13.250
So you say with DSP context LLM equals Gemini LLM.

16:13.570 --> 16:16.850
Then it will just use Gemini for anything in there.

16:18.610 --> 16:20.410
But like the OpenAI still config.

16:20.850 --> 16:21.010
Yeah.

16:21.810 --> 16:22.250
Yeah.

16:22.290 --> 16:25.370
And some of the modules and stuff that you can pass in the LLM.

16:25.410 --> 16:27.410
But I found it's inconsistent actually.

16:27.770 --> 16:28.090
Yeah.

16:28.690 --> 16:33.760
So yeah you could do different steps in pipelines with different models and stuff.

16:34.040 --> 16:34.480
Exactly.

16:34.480 --> 16:34.600
Yeah.

16:34.640 --> 16:34.920
Yeah.

16:34.960 --> 16:36.560
This is like a chain of thought.

16:36.600 --> 16:38.240
Example, which we've already put in there.

16:38.680 --> 16:39.880
And you can print out the reasoning.

16:39.880 --> 16:44.160
And because it's all automatically passed, you can just do like output dot reasoning output dot joke.

16:44.200 --> 16:45.200
Yeah it's quite nice.

16:45.760 --> 16:47.480
So you really don't have to worry about that.

16:48.600 --> 16:54.600
And is there does it is there a cycle feedback loop where you give feedback and it looks at the reasoning

16:54.600 --> 16:58.040
and sees if it can optimise from the feedback?

16:58.320 --> 16:59.040
And the reason.

16:59.040 --> 16:59.280
Yeah.

16:59.280 --> 17:01.360
So some of the optimizers work that way.

17:01.480 --> 17:02.000
Yeah yeah.

17:02.040 --> 17:02.840
And I'll cover that.

17:02.840 --> 17:04.600
But it is it's more of a static thing.

17:04.640 --> 17:07.960
Like you run an optimisation job like there's no active learning.

17:08.000 --> 17:08.160
Yeah.

17:08.160 --> 17:15.000
But for example if I have like hundreds of people saying things, this summary was like needed something

17:15.000 --> 17:16.760
more like that or something like that.

17:16.760 --> 17:19.240
And I have the chain of thought and everything also saved.

17:19.240 --> 17:27.440
I could run it once a while to update the prompt with an optimizer to like, optimize for that output

17:27.440 --> 17:27.920
and more.

17:27.960 --> 17:28.760
Yeah, exactly.

17:28.760 --> 17:29.240
Yeah, yeah.

17:29.280 --> 17:34.360
And, and that you start to get into this really healthy, kind of virtuous loop where because you have

17:34.360 --> 17:40.120
the eval metric, you can either make the eval metric perform better or to cover that, that new use

17:40.120 --> 17:42.120
case that you've seen coming up.

17:42.120 --> 17:44.000
And then you can run the optimization job again.

17:44.000 --> 17:46.760
But you could also go back and change the way the program works.

17:46.760 --> 17:50.480
You can start from yeah, you can use a different optimizer altogether.

17:50.520 --> 17:51.440
See if that works or.

17:51.480 --> 17:52.480
Yeah, or try it with Gemini.

17:52.720 --> 17:54.240
Gemini does better or whatever it is.

17:54.280 --> 17:54.720
Yeah.

17:54.760 --> 17:55.080
Yeah.

17:55.120 --> 17:55.480
Cool.

17:55.520 --> 17:57.920
The again, this is like stupid.

17:57.920 --> 18:03.240
But the way it works is this is actually how you build a real kind of DSP program.

18:03.280 --> 18:08.240
You you create the class, you know, with the this is the instructions.

18:08.240 --> 18:10.360
And then this is the input field.

18:10.360 --> 18:11.600
This is the output field.

18:11.600 --> 18:13.800
And then you create another class for the module.

18:13.800 --> 18:17.120
And it took me a long time actually to kind of get my head around why this is necessary.

18:17.120 --> 18:20.800
But essentially this allows you to build programs of arbitrary complexity.

18:20.840 --> 18:25.680
The DSP module is the base class that chain of thought and predict inherit from.

18:25.680 --> 18:28.680
So it's a way of you to add your own prompt engineering techniques to it.

18:28.680 --> 18:33.670
So you just define an init and it inherits from from the module.

18:33.670 --> 18:36.590
And then like in this case I just create the joke generator.

18:36.630 --> 18:37.030
Right.

18:37.070 --> 18:40.030
And then you just have to define a forward method.

18:40.030 --> 18:45.030
And the forward method basically just takes the inputs in this case just topic and then gives you the

18:45.030 --> 18:46.550
outputs you can pass.

18:46.710 --> 18:48.310
So this is a simple one right.

18:48.350 --> 18:52.430
It just does the prediction and then ignores the chain of thought and just returns the joke.

18:52.790 --> 18:53.230
Yeah.

18:53.270 --> 18:57.710
But you could call another predictor or do whatever you want there as well right.

18:57.750 --> 18:58.110
Exactly.

18:58.110 --> 19:00.110
You can see that the, the yeah.

19:00.150 --> 19:01.990
Like the I think the result is basically.

19:01.990 --> 19:03.230
Yeah the same.

19:03.270 --> 19:05.110
Why did Python script need therapy.

19:05.150 --> 19:08.750
Because I had too many deeply nested if statements can handle the indentation.

19:09.630 --> 19:09.910
Yeah.

19:09.950 --> 19:14.910
But yeah the so that that's like the just getting a script working.

19:15.070 --> 19:16.590
This is where it gets really powerful.

19:16.590 --> 19:22.750
So here what I'm doing is I'm training my optimizer because I'm going I think, I think this in terms

19:22.750 --> 19:24.390
of we used to have test driven development.

19:24.430 --> 19:26.550
Now it's eval driven development.

19:26.590 --> 19:28.420
Like you have to formalize your eval.

19:28.420 --> 19:33.100
And if you formalize your eval, then you can train an arbitrary program, whether it's fine tuning.

19:33.100 --> 19:36.260
And actually GSP does fine tuning for you as well.

19:36.300 --> 19:40.900
They have this like Better Together optimizer, which does both fine tuning and optimization.

19:40.940 --> 19:41.780
That's crazy.

19:41.940 --> 19:46.940
But like you actually, funnily enough, like I find that prompt optimization almost always beats fine

19:46.940 --> 19:47.420
tuning.

19:47.660 --> 19:51.620
Yeah, I've never had fine tuned work in my work.

19:51.980 --> 19:57.140
Yeah, unless you've got like thousands like I think the there was a paper that said that until you

19:57.140 --> 20:02.460
have 2000 data like inputs and outputs, lines prompting beats fine tuning.

20:02.500 --> 20:03.140
Okay, great.

20:03.180 --> 20:03.300
Yeah.

20:03.340 --> 20:09.380
What I did and this is another little hack that I do to get to build a data set, is I just went online

20:09.380 --> 20:15.420
and found a lot of funny jokes and just I used a deep research here and then told it to give it into

20:15.420 --> 20:15.900
Python.

20:16.220 --> 20:20.740
So I got like the topic, the joke, and then the comedian in here just for attribution.

20:20.740 --> 20:22.580
But so I got, you know, I got it to create.

20:22.620 --> 20:25.540
I think it's like you look like Ricky Gervais if you.

20:25.580 --> 20:27.860
Yeah, I'll tell that once, actually.

20:27.900 --> 20:30.060
Yeah, I'll take that as a compliment.

20:30.100 --> 20:30.860
Not that you're saying.

20:31.340 --> 20:32.900
Oh, he's great.

20:36.340 --> 20:36.740
And then.

20:36.740 --> 20:38.980
And then I took the unfunny jokes quite often.

20:39.020 --> 20:39.580
The task.

20:39.620 --> 20:44.340
It sounds like a complicated task to build a data set, but quite often it's just that you need to go

20:44.340 --> 20:47.260
and pick out, like, select things that you think are good.

20:47.340 --> 20:49.780
And you didn't even have to really explain why you think they're good.

20:49.980 --> 20:51.500
And then you just have to.

20:51.820 --> 20:57.420
Then as the things that are not good, you just have to generate a bunch of GPT answers because that's

20:57.420 --> 20:58.780
what you're not good from.

20:58.780 --> 21:00.820
You're trying to get it to not sound like ChatGPT.

21:01.060 --> 21:01.500
Yeah.

21:01.540 --> 21:01.820
Yeah.

21:01.820 --> 21:08.460
So here I just I literally just got, I said to ChatGPT mini just to make sure it was especially dumb,

21:08.500 --> 21:10.700
like, just give me a bunch of jokes on these topics.

21:10.740 --> 21:12.700
And, and then so that's how I got these.

21:12.900 --> 21:15.860
So that's a really good hack for building a arrested.

21:15.900 --> 21:17.220
It held up a pair of pants.

21:17.340 --> 21:17.820
So.

21:20.220 --> 21:22.420
There's a couple of useful things to think about here.

21:22.460 --> 21:24.180
And the name is inconvenient.

21:24.220 --> 21:25.490
Like in the documentation.

21:25.490 --> 21:30.330
They don't really explain like what it means or why you would have a training set versus validation

21:30.330 --> 21:30.730
set.

21:30.770 --> 21:39.650
But as as best as I understand, training set is the data that you give to the optimizer to train on.

21:39.690 --> 21:41.130
Like it can see that data.

21:41.130 --> 21:43.130
It will evaluate that data.

21:43.130 --> 21:48.210
And it will use some of those examples when it's optimizing like it'll add them as few shot examples.

21:48.250 --> 21:48.690
Right.

21:48.890 --> 21:54.850
Validation set is you give that to optimizer as well or you don't have to, but you can give it to optimizer

21:54.850 --> 21:55.170
as well.

21:55.170 --> 21:56.730
And it uses that as the test set.

21:56.730 --> 22:01.890
So it will try a bunch of it'll look at the data, try a bunch of things and then it'll run a test on

22:01.930 --> 22:02.970
the validation set.

22:03.330 --> 22:07.450
If you don't give it the validation set, it will just run it on the full training data so it can end

22:07.450 --> 22:08.610
up costing quite a lot.

22:08.690 --> 22:10.210
That's one of the primary reasons.

22:10.210 --> 22:14.370
Also, I think the results don't generalize as much if you don't have a separate test set.

22:14.690 --> 22:19.930
And then the reason why you need a development set is that because it also sees like the test set,

22:19.930 --> 22:25.520
when it's running the results, sometimes it will bleed across and it won't give you like really good,

22:25.560 --> 22:28.640
like generalization, like you'll get good optimization scores.

22:28.640 --> 22:31.440
But when you try and run a new joke or whatever, it won't be very good.

22:31.840 --> 22:34.400
Um, and then so the development set is what you use.

22:34.440 --> 22:35.800
The optimizer never sees.

22:35.840 --> 22:36.000
Right.

22:36.040 --> 22:41.640
Like you just create a new like you just run an evaluation yourself after the optimizer is done.

22:41.680 --> 22:46.080
And that's, that's really important, I think, for proving to the client or proving to stakeholders

22:46.120 --> 22:47.920
that like, it actually did do a good job.

22:47.960 --> 22:49.640
It's not part of the loop you need.

22:49.680 --> 22:52.960
Yeah, you need to have the evaluation step in in the end.

22:52.960 --> 22:54.840
But it's just running it.

22:54.880 --> 22:57.240
It's not training or optimizing or doing anything.

22:57.280 --> 22:57.760
Exactly.

22:57.760 --> 23:03.480
So the way I split it usually is, I'll try and get like more than 100 examples or 200 examples of good

23:03.480 --> 23:07.920
versus bad, and then I'll split it 60, 20, 20.

23:09.280 --> 23:09.440
Yeah.

23:09.480 --> 23:11.880
You do new examples I assume, right.

23:11.920 --> 23:12.200
Yeah.

23:12.240 --> 23:12.760
Exactly.

23:12.760 --> 23:13.160
Yeah.

23:13.200 --> 23:17.440
So you can see here we split this out a couple of weird things.

23:18.120 --> 23:22.240
The way that GSP wants the examples is also like a bit odd.

23:22.520 --> 23:24.920
So you see here it's despite example.

23:25.120 --> 23:31.080
And then you just give the different parameters in this case because it's the judge we want.

23:31.080 --> 23:33.400
This funny equals true to be the output.

23:33.760 --> 23:39.760
And then you just do dot with inputs and then list the ones that you want as inputs.

23:39.760 --> 23:42.720
And then it will assume the funny is the output.

23:43.320 --> 23:48.480
And examples if you always use examples for optimization right.

23:48.520 --> 23:51.280
Or yeah that is what like a data set is for.

23:51.640 --> 23:52.800
That is the whole point right.

23:52.840 --> 23:53.800
That's how they format it.

23:53.800 --> 23:54.760
Yeah exactly.

23:55.120 --> 23:55.320
Yeah.

23:55.360 --> 24:00.000
And they have like internal stuff that will turn that into a pandas data frame essentially.

24:00.640 --> 24:01.120
Yeah.

24:01.680 --> 24:02.000
Cool.

24:02.040 --> 24:03.560
So we've got our data set.

24:03.600 --> 24:05.080
Just here's an example.

24:05.080 --> 24:07.240
We've got the topic whether it's funny or not.

24:07.520 --> 24:09.320
So we create our judge.

24:09.320 --> 24:14.320
And this is a case where like actually it's convenient to do the inline thing we're just taking because

24:14.320 --> 24:16.080
we don't care that much about this judge.

24:16.120 --> 24:17.760
We just care about whether it's accurate.

24:17.800 --> 24:20.070
We don't care how it gets to accuracy necessarily.

24:20.190 --> 24:21.790
So we're just giving a topic a joke.

24:21.790 --> 24:23.510
And then we're saying whether it's funny or not.

24:23.550 --> 24:25.270
You can also set types in here as well.

24:25.270 --> 24:30.390
So I set this as a boolean because I found that otherwise it gives you back like a big spiel of like

24:30.430 --> 24:31.750
why it's funny or not funny.

24:32.030 --> 24:36.310
You can see here like in this case judge says it's says it's funny.

24:36.350 --> 24:36.870
True.

24:36.910 --> 24:38.190
Actual ground truth is false.

24:38.190 --> 24:41.390
So the judge was wrong here and I made it a chain of thought.

24:41.430 --> 24:44.710
Because chain of thought tends to really improve the results of judges.

24:45.030 --> 24:46.430
It thinks it's funny, but it's not.

24:46.430 --> 24:50.590
So we need to train our judge so we can actually do an evaluation of our judge.

24:51.070 --> 24:51.870
Evaluate.

24:51.910 --> 24:57.870
Convenience method is like really helpful because it runs in threads so it runs in parallel.

24:57.870 --> 24:59.550
It makes it run a lot faster.

24:59.550 --> 25:02.550
So it'll run eight of them at the time in this case.

25:02.550 --> 25:05.030
But you can set that to whatever you want, which is nice.

25:05.830 --> 25:10.510
And then you just pass it in a metric development set as in what the.

25:10.550 --> 25:14.950
This is like the one that we kept aside that the optimizer won't see if that makes sense.

25:14.950 --> 25:16.230
The final judgement.

25:16.230 --> 25:17.870
And we can see that here.

25:17.980 --> 25:19.900
It's 51% accurate.

25:19.900 --> 25:21.660
And you can see the actual examples here.

25:21.660 --> 25:24.540
This one wasn't funny but the judge predicted.

25:24.580 --> 25:25.820
True that it was funny.

25:25.820 --> 25:27.140
So it got that wrong right.

25:27.180 --> 25:28.260
This one was true.

25:28.300 --> 25:29.220
It was funny.

25:29.420 --> 25:31.020
And it predicted that it was funny.

25:31.020 --> 25:32.660
So it got that one correct, right?

25:32.700 --> 25:35.900
The nice thing about judges is that they can have an exact match score.

25:35.940 --> 25:41.020
And the way I've set the prediction metric up is again specific to DSP.

25:41.100 --> 25:43.420
You have to have this like trace equals none.

25:43.660 --> 25:44.980
So you can set that.

25:45.300 --> 25:50.180
So that allows you to basically set like a different metric response.

25:50.220 --> 25:56.100
If the optimizer is evaluating versus if it is like learning from the failures.

25:56.140 --> 26:00.620
So if the if trace equals is not none then then it's like learning.

26:00.620 --> 26:04.260
So you could give it like a more strict evaluation criteria.

26:04.300 --> 26:07.900
Because the reason why you would do that is like in this case it's just exact match.

26:07.900 --> 26:08.380
So it's easy.

26:08.380 --> 26:12.580
But the reason why you might want to do that is maybe if you did have an evaluation metric, which is

26:12.620 --> 26:16.700
like rate this one out of five, you have a score then.

26:16.700 --> 26:21.220
But when the judge is learning, it's deciding what few shot examples to put in.

26:21.220 --> 26:27.300
So you might want to say I only want five star ratings as my examples, right?

26:27.380 --> 26:28.820
So you don't have to do that.

26:28.820 --> 26:30.140
But that's why that's there.

26:30.140 --> 26:32.220
And it automatically will pass in the prediction.

26:32.220 --> 26:36.780
And then the gold, which is like the essentially the the ground truth like from the data set.

26:37.340 --> 26:42.020
So in this case we're just checking whether the prediction funny equals gold funny.

26:42.260 --> 26:44.940
So that's like the simplest possible metric.

26:45.940 --> 26:46.180
Cool.

26:46.180 --> 26:48.660
So we have a judge that's wrong half the time.

26:48.660 --> 26:49.820
How do we make it better.

26:50.180 --> 26:51.540
This is where we're bringing in optimizer.

26:51.540 --> 26:52.940
And this is the simplest optimizer.

26:52.940 --> 26:54.140
It's just called bootstrap.

26:54.180 --> 26:54.860
Few shot.

26:55.020 --> 27:01.980
All it does is it just adds in a few or actually creates a few shot a couple of few shot examples and

27:01.980 --> 27:04.460
then checks them against the evaluation metric.

27:04.460 --> 27:08.660
If the evaluation metric is correct, then it will, then it will pass.

27:08.700 --> 27:10.460
It will add them in to the prompt.

27:10.460 --> 27:15.210
So when you run that you think it synthesizes jokes to She pauses.

27:15.250 --> 27:16.290
Yeah, exactly.

27:16.290 --> 27:16.690
Yeah.

27:16.690 --> 27:18.010
So yeah, here we go.

27:18.010 --> 27:22.090
So we have Max bootstrapped demos and Max labeled demos.

27:22.370 --> 27:27.930
Max label demos is like demonstrations, as in examples from your data set.

27:28.210 --> 27:33.930
So in it's allowing you're allowing it to add up to 16 few shot examples from your data set, from your

27:33.970 --> 27:34.890
training data set.

27:35.130 --> 27:39.890
And then Max bootstrap demos is you're allowing it to generate new examples.

27:40.370 --> 27:41.090
Yeah okay.

27:41.290 --> 27:43.570
Data set but still pass your metrics.

27:43.610 --> 27:47.450
So would you ever run this without examples.

27:47.450 --> 27:51.410
Or you will always add examples because you always add examples.

27:51.410 --> 27:51.810
Yeah.

27:52.170 --> 27:52.530
Yeah.

27:52.570 --> 27:57.170
Because it needs I guess you could just make the max label demos equals zero.

27:57.210 --> 27:58.610
And then it should be able to do.

27:58.730 --> 28:00.810
But yeah I mean it's but you would never do that.

28:00.810 --> 28:01.770
That's just not a thing.

28:01.810 --> 28:02.050
Yeah.

28:02.090 --> 28:07.370
Like I would even go to the length of I would go and run the program like 100 times to generate the

28:07.370 --> 28:08.410
data set and then run.

28:08.450 --> 28:08.610
Yeah.

28:08.650 --> 28:12.570
And then you manually pick or whatever examples.

28:12.690 --> 28:16.000
You can also set a few things like so you put the metric in there.

28:16.000 --> 28:19.640
But you can also set a threshold so you can say I just care about 80% accuracy.

28:19.920 --> 28:22.320
So if it gets 80% accuracy, stop optimizing.

28:22.480 --> 28:25.640
You could also set the teacher so you can say I want GPT.

28:26.200 --> 28:32.160
I want O3 to be the teacher and I want but I still want mini to be the task.

28:32.200 --> 28:33.440
And so it's distillation.

28:33.440 --> 28:38.800
Whenever it generates a bootstrapped demo, it will use GPT three to generate that.

28:38.840 --> 28:39.160
Oh sorry.

28:39.200 --> 28:40.760
We use O3 to generate that.

28:40.960 --> 28:41.360
Yeah.

28:41.400 --> 28:44.040
So that is like quite an effective approach.

28:44.080 --> 28:48.440
And how do you choose the number of examples and what examples.

28:48.440 --> 28:50.560
And when do you change the examples?

28:50.680 --> 28:54.800
I find it's a it's like a very much an art rather than the science.

28:54.840 --> 28:55.440
What do you do.

28:55.480 --> 28:58.200
I found it like really weird results sometimes.

28:58.200 --> 29:03.520
But yeah, broadly speaking I'll try it with the default and then I'll diagnose.

29:03.560 --> 29:05.600
So there's a couple of different optimizers, by the way.

29:05.600 --> 29:07.480
This doesn't change the instructions.

29:07.480 --> 29:10.720
This still keeps your instructions, but it only changes the few shot.

29:10.720 --> 29:13.670
And usually that's enough actually for classification tasks.

29:13.710 --> 29:19.910
I'd say typically I would say that you want like at least five, maybe more like ten examples in the

29:19.910 --> 29:24.470
final prompt, maybe more than that in order to get good accuracy on a classification task.

29:24.750 --> 29:25.990
But you don't want too much.

29:26.310 --> 29:27.710
Obviously there's a cost to running this.

29:27.710 --> 29:33.430
I just this just happened instantly because because I have this is a deterministic optimizer.

29:33.590 --> 29:37.230
And and because I have already run it before it pulled from the cache.

29:37.230 --> 29:40.230
But like this can actually take ten minutes to run or something.

29:40.230 --> 29:44.390
And it can cost like it can make a few hundred API calls or a few thousand API calls.

29:44.390 --> 29:48.870
So it can actually cost you into the hundreds or maybe even thousands of dollars, depending on how

29:48.870 --> 29:49.990
big your data set is.

29:50.030 --> 29:50.270
Yeah.

29:50.310 --> 29:55.230
So with Dspi, it's forced me to just basically be as dumb as possible, don't care about the prompt,

29:55.270 --> 29:58.350
like just do the inputs and outputs and see if it works straight off the bat.

29:58.390 --> 30:01.030
If it doesn't, then run an optimizer, see if that works.

30:01.030 --> 30:05.590
If it doesn't, run a more powerful optimizer and then go, okay, I'm going to try and run a few experiments.

30:05.590 --> 30:09.390
I'll do it with 16 demos, or I'll do it with five demos and just see what works.

30:09.390 --> 30:13.190
So I only apply the complexity like on demand.

30:13.230 --> 30:16.230
It's almost like just in time edition of complexity.

30:16.270 --> 30:16.550
Yeah.

30:16.590 --> 30:21.230
And like, for example, you made one last year or like half a year ago.

30:21.270 --> 30:27.670
Do you ever go back to the example because people do give feedback and things do change and models change,

30:27.670 --> 30:32.110
like when do you decide to redo this or add other examples?

30:32.150 --> 30:39.230
Yeah, I would say typically once you like, I would rerun a prompt when when we discover a new dangerous

30:39.230 --> 30:39.950
edge case.

30:40.190 --> 30:42.150
Oh everyone's complaining about this now.

30:42.190 --> 30:42.590
Yeah.

30:42.870 --> 30:45.910
So then I would do a job just specifically to solve it.

30:46.070 --> 30:52.270
Or if I've collected more than 50 new user responses, because you could have a whole system here where

30:52.270 --> 30:55.070
you have a thumbs up, thumbs down button on your on your.

30:55.110 --> 30:55.870
That's what I have.

30:55.910 --> 30:58.790
So I'm thinking and then you that generates the data set for you.

30:58.790 --> 31:00.830
And then you just rerun and see how well it does.

31:01.270 --> 31:01.710
Yeah.

31:01.950 --> 31:05.070
But yeah like it is like a continuous thing.

31:05.470 --> 31:05.910
Yeah.

31:05.950 --> 31:06.310
You could.

31:06.350 --> 31:10.580
The dream is what I want to get to with our system is just running overnight every night.

31:10.620 --> 31:11.540
It just keeps improving.

31:11.580 --> 31:12.940
Yeah, and I don't have to worry about it.

31:13.220 --> 31:13.500
Yeah.

31:13.540 --> 31:15.980
And really, it's like the example.

31:15.980 --> 31:22.780
So how but how do you then choose the examples or like you just give all the examples and say you figure

31:22.820 --> 31:25.660
out this optimizer chooses the examples for you.

31:25.700 --> 31:28.380
So you don't really need to deal with any of it.

31:28.380 --> 31:32.260
You just say whatever the default is 10 or 16, whatever that is.

31:32.300 --> 31:32.780
Exactly.

31:32.780 --> 31:37.700
And I find for simple tasks bootstrap, few shot works pretty well and it's the cheapest one to run,

31:37.700 --> 31:38.940
so it's like very easy.

31:39.220 --> 31:44.740
It's pennies you bootstrap few shot random search works much better for more important tasks.

31:44.940 --> 31:49.220
And again it only chooses examples, but it adds like a random search component.

31:49.220 --> 31:54.740
So it will go and you'll find a new like like this one just goes through the examples and chooses one.

31:54.740 --> 32:00.780
But it might it be at a local maximum, like random search actually just literally hops around your

32:00.780 --> 32:02.140
data set and like picks out.

32:02.180 --> 32:03.940
So it's more compute intensive.

32:03.940 --> 32:05.380
But it does a really good job.

32:05.900 --> 32:06.500
Yeah.

32:06.770 --> 32:09.690
And then so that's optimizing the judge.

32:09.690 --> 32:11.410
And here we go.

32:11.810 --> 32:14.610
Obviously the eval would run a lot slower if it wasn't cached.

32:14.610 --> 32:19.770
But we have even just adding a few short examples like we have 92% accuracy now.

32:20.090 --> 32:20.890
Yeah that's cool.

32:21.130 --> 32:21.410
Wow.

32:21.450 --> 32:22.610
So that's amazing right.

32:22.650 --> 32:23.410
Like you just go.

32:23.450 --> 32:24.650
We didn't even have to think about it.

32:24.690 --> 32:26.050
We didn't even look at the prompt.

32:26.210 --> 32:29.730
And now we've had an 80% improvement with a couple of lines of code.

32:29.730 --> 32:30.690
That's the magic.

32:30.730 --> 32:31.090
Yeah.

32:31.650 --> 32:35.810
I'm going to for sure use this for categorization like next week.

32:36.130 --> 32:36.370
Yeah.

32:36.410 --> 32:37.690
So categorization is perfect.

32:37.690 --> 32:41.330
But then every fuzzy problem also is a categorization problem, right.

32:41.370 --> 32:46.730
So now like the judge is the categorization problem, we can now use that to make our and you'll see

32:46.770 --> 32:47.650
it won't work as well.

32:47.690 --> 32:48.570
And it's more intense.

32:48.570 --> 32:52.770
But it does lead you to a path to get the fuzzy tasks under control.

32:53.210 --> 32:57.010
So now I've just created a new metric, which is the judge score.

32:57.250 --> 33:02.490
And again, literally just saying the judge like I'm using my bootstrap to optimize judge.

33:02.490 --> 33:02.690
Right.

33:02.730 --> 33:04.690
Like this program that we created.

33:04.770 --> 33:06.290
And that's the optimized.

33:06.930 --> 33:08.570
That's the optimized.

33:08.610 --> 33:08.930
Yeah.

33:09.010 --> 33:09.410
Exactly.

33:09.410 --> 33:09.810
Right.

33:09.850 --> 33:11.570
That's the final optimized judge.

33:11.610 --> 33:11.970
Right.

33:11.970 --> 33:12.210
Yeah.

33:12.250 --> 33:13.250
That we created here.

33:13.530 --> 33:15.090
And then I'm just using that as a program.

33:15.130 --> 33:15.290
Right.

33:15.290 --> 33:16.730
So I just get my judge result.

33:16.810 --> 33:19.290
So I pass in the topic and the joke.

33:19.290 --> 33:22.010
And then I give a 1 or 0 right.

33:22.930 --> 33:24.530
So that's really great.

33:24.530 --> 33:26.970
And then you can create a data set.

33:27.010 --> 33:32.970
In this case I basically filtered out the examples and I changed the inputs a little bit because before

33:32.970 --> 33:35.210
if you remember we had the topic and the joke.

33:35.210 --> 33:37.290
And then we check whether it's funny or not.

33:37.290 --> 33:41.690
And in this case I'm just passing in the topic and then getting a joke.

33:41.730 --> 33:41.930
Right.

33:41.970 --> 33:46.810
So I had to create a new data set, the topic data set, and the only input is topic.

33:46.810 --> 33:48.250
And then it would generate a joke.

33:48.570 --> 33:50.450
The other thing I did is I filtered it.

33:50.450 --> 33:56.770
So I'm only giving few shot examples where the joke was funny, because I find that it's actually really

33:56.770 --> 33:59.690
detrimental sometimes to give negative examples.

33:59.730 --> 34:04.560
You want it to only have positive examples, because when you give it negative examples, it tends to

34:04.600 --> 34:06.480
follow that negative example weirdly.

34:06.520 --> 34:10.360
It's like to say, don't put your hand in the toaster and it goes and puts its hand in the toaster.

34:10.960 --> 34:11.360
Yeah.

34:13.480 --> 34:15.200
But you can experiment on that.

34:16.080 --> 34:16.280
Okay.

34:16.320 --> 34:17.600
It's like edge cases.

34:17.600 --> 34:23.960
How would you choose to include edge cases in the example or like you just include them and it will

34:23.960 --> 34:24.760
be handled.

34:25.640 --> 34:26.000
Yeah.

34:26.000 --> 34:27.520
What I would do is I'll take the edge case.

34:27.520 --> 34:32.360
I'll get your domain expert to rewrite what should have been the correct answer, and then use that

34:32.360 --> 34:33.280
as the flu shot.

34:33.320 --> 34:36.520
Yeah, because otherwise it's confusing or like it will throw it.

34:36.560 --> 34:37.080
Exactly.

34:37.120 --> 34:37.560
Yeah.

34:37.600 --> 34:43.360
Or you could do like a rewriting type prompt, which would be again, another DSP program where it takes

34:43.360 --> 34:45.680
the wrong answer and rewrites it to be the correct answer.

34:45.680 --> 34:49.440
And then and then you can see in terms of optimization that.

34:49.640 --> 34:56.320
So what I've done here is this is taking the topic dev set, which is those those examples that we had

34:56.360 --> 34:56.640
before.

34:56.680 --> 34:59.920
So literally just the dev set and it's just given the topic.

34:59.960 --> 35:05.790
So it's just taking those topics and has found that out of 51 topics, only 2020 of them resulted in

35:05.790 --> 35:06.510
funny jokes.

35:06.510 --> 35:13.030
So we got 39% funny score from our joke generator, a baseline, right?

35:13.230 --> 35:19.750
So now we're using the Big Guns Me Pro, which is like a Bayesian optimization algorithm that very smart

35:19.750 --> 35:20.990
people worked on.

35:20.990 --> 35:27.910
And this not just does the optimization of few shot, but it also changes the prompt instructions.

35:27.910 --> 35:34.310
You can actually just run this as a prompt instruction optimizer if you want, by just setting max bootstrap

35:34.310 --> 35:36.590
demos to zero and max label demos to zero.

35:36.630 --> 35:41.230
In this case, I set it like you can also do light versus medium versus heavy.

35:41.230 --> 35:44.310
Like I said it heavy because it's non-deterministic.

35:44.350 --> 35:48.870
You want to set a seed as well so that like you get the same results and otherwise every time you run

35:48.870 --> 35:51.670
it, it's not going to be any worse or better.

35:51.990 --> 35:52.350
Yeah.

35:52.390 --> 35:55.350
You can also set the initial temperature so it starts at that temperature.

35:55.350 --> 35:58.390
And then it will try different temperatures at other sides and things like that.

35:58.390 --> 36:04.510
So it also optimizes those parameters for you, which is quite nice, but pass that in and then I must

36:04.550 --> 36:06.470
have messed something up because it's running again.

36:07.070 --> 36:07.550
There we go.

36:07.590 --> 36:07.910
Great.

36:09.030 --> 36:09.190
Yeah.

36:09.190 --> 36:09.830
Yeah, exactly.

36:09.950 --> 36:10.310
No, sorry.

36:10.310 --> 36:11.470
It was just running through the cache.

36:11.510 --> 36:11.870
Cool.

36:12.390 --> 36:12.510
Yeah.

36:12.550 --> 36:16.870
No, I mean, I'm using mini, so it's fine, but that could be like $100 mistake if you're using O3

36:16.870 --> 36:18.150
or Pro or something.

36:18.310 --> 36:18.710
Yeah.

36:19.350 --> 36:20.310
Um, yeah.

36:20.350 --> 36:25.550
Just to show you, like, again, a lot of this won't make sense, but you can see it's super optimized

36:25.550 --> 36:30.110
instruction texts like the few shot examples and then balance between different optimization strategies.

36:30.150 --> 36:36.670
First it starts like bootstrapping traces meaning it's like it's generating new few shot examples,

36:36.710 --> 36:41.150
and it'll have a certain number of attempts to see to get like a funny joke.

36:41.510 --> 36:46.470
And then it's like doing that a bunch of different rounds and then it's proposing instruction candidates.

36:46.470 --> 36:48.790
So we use the few shot examples from the previous step.

36:48.910 --> 36:51.630
Generated data set summary a summary of the program code.

36:51.630 --> 36:52.790
So it's like program aware.

36:52.790 --> 36:55.870
It can see what the code looks like of the full module.

36:56.150 --> 36:59.300
And then randomly selecting prompting tip to propose instruction.

36:59.460 --> 37:03.340
So like you can actually override those as well if you want to subclass Dnipro.

37:03.340 --> 37:06.260
But I'm too afraid to try and then.

37:06.300 --> 37:06.500
Yeah.

37:06.500 --> 37:12.340
So so it's proposed instructions like it did have a funny joke about the topic, but then it came up

37:12.340 --> 37:12.820
with this one.

37:12.820 --> 37:17.780
Generate a humorous joke related to the specified topic, suitable for general adult audience.

37:17.820 --> 37:19.660
Be mindful of potentially sensitive content.

37:19.660 --> 37:19.980
Right.

37:20.020 --> 37:21.900
So that's its prompt its new prompt.

37:21.900 --> 37:25.020
And then which model is it using to generate this prompt.

37:25.060 --> 37:28.020
So by default it's just using the same model that you passed.

37:28.020 --> 37:28.980
But you can set that.

37:28.980 --> 37:30.700
So you can choose the prompt model.

37:30.700 --> 37:32.220
You can choose the teacher model.

37:32.260 --> 37:32.500
Right.

37:32.660 --> 37:37.100
So the difference is the prompt model writes rewrites the prompt and creates candidates.

37:37.140 --> 37:40.300
The teacher model generates new few shot examples.

37:40.340 --> 37:40.700
Okay.

37:40.740 --> 37:41.100
Yeah.

37:42.140 --> 37:43.340
So it goes through all this.

37:43.380 --> 37:45.340
And you can see it runs for a long time right.

37:45.380 --> 37:46.300
Lots of trials here.

37:46.300 --> 37:48.020
You can set the number of trials and stuff.

37:48.340 --> 37:53.140
But once you have that you can run the optimization and then.

37:55.740 --> 37:55.940
Yeah.

37:55.980 --> 37:56.140
Yeah.

37:56.140 --> 37:57.730
So that's actually better for some reason.

37:58.130 --> 38:05.530
Maybe, maybe I didn't run it with that seed before, but yeah, it actually went up from 39% to 49%

38:05.570 --> 38:05.970
here.

38:06.010 --> 38:08.010
It got it did get a lot funnier.

38:08.050 --> 38:11.130
It still does a few dad jokes and stuff like that is what I found.

38:11.130 --> 38:16.330
But the way you would go to improve this is you would improve the optimizer, right?

38:16.370 --> 38:18.250
Because we only had 90% accuracy.

38:18.250 --> 38:22.570
But and we also had pretty easy examples where the dad jokes were very obviously bad.

38:22.690 --> 38:25.170
And like the comedian jokes are very obviously good.

38:25.210 --> 38:27.410
So we don't really have that many in the middle.

38:27.410 --> 38:29.130
So I would go a bit more fine grained.

38:29.170 --> 38:34.250
Now if I wanted to improve it further, and I would give it some harder examples to train on the jokes

38:34.250 --> 38:36.650
that you think, yeah, yeah, yeah, I'll show you that in a second.

38:36.690 --> 38:38.650
But this looks great.

38:38.690 --> 38:39.370
Show me.

38:39.890 --> 38:41.170
Yeah, show me the money.

38:41.730 --> 38:41.970
Yeah.

38:42.010 --> 38:43.050
This is like the basic one.

38:43.050 --> 38:44.130
So the topic was Python.

38:44.170 --> 38:45.930
Why do Python programmers prefer dark mode?

38:45.930 --> 38:47.250
Because light attracts bugs.

38:47.290 --> 38:49.010
Me is like, why did the programmer quit his job?

38:49.010 --> 38:50.410
Because he didn't get a raise.

38:50.770 --> 38:51.450
It's not funny.

38:51.450 --> 38:52.210
But that's better.

38:52.210 --> 38:52.770
That's better.

38:52.810 --> 38:53.170
Yeah.

38:53.570 --> 38:53.890
Yeah.

38:53.890 --> 38:54.770
So this one's a little bit better.

38:54.810 --> 38:56.850
Why did the coffee go to the police?

38:56.850 --> 38:57.570
They got mugged.

38:57.850 --> 38:58.650
I like my coffee.

38:58.690 --> 38:58.810
How?

38:58.810 --> 39:01.330
I like myself dark, bitter and too hot for you.

39:01.890 --> 39:02.850
Okay, this is a bit better.

39:02.890 --> 39:03.930
Why did the bicycle fall over?

39:03.930 --> 39:04.890
Because he was too tired.

39:04.890 --> 39:05.930
And then he said.

39:05.930 --> 39:07.850
I hate when I lose my motivation to exercise.

39:07.850 --> 39:10.330
It's like, where do these extra £10 keep coming from?

39:10.370 --> 39:11.370
I think that's much better.

39:11.370 --> 39:16.770
But yeah, anyway, so you can see that it is a little bit harder to get like for these fuzzier tasks.

39:16.770 --> 39:17.370
But we did.

39:17.410 --> 39:18.650
It's not to be sniffed at.

39:18.650 --> 39:18.850
Right.

39:18.850 --> 39:24.690
We went from 39% funny to like over 50% funny, almost 50% funny.

39:24.690 --> 39:28.890
So that obviously you keep working on this, but like the fuzzier tasks are just harder.

39:29.050 --> 39:29.250
Yeah.

39:29.290 --> 39:32.090
No, I mean that it was definitely interesting.

39:32.090 --> 39:35.970
So the way what I'm taking from this is the judge is like everything.

39:36.010 --> 39:36.650
Yeah.

39:37.050 --> 39:41.850
Actually my I'm like radicalized on this to the point where I actually don't think data matters.

39:41.850 --> 39:44.410
Everyone says it's all about your data, your unique data.

39:44.410 --> 39:48.090
I'm like, no, you can generate infinite synthetic data if you have a good judge.

39:48.410 --> 39:51.370
But that's also that's what you need data for, judge.

39:51.370 --> 39:51.610
Right.

39:51.610 --> 39:56.760
You need to get good data to create a good judge, which is, yeah, you only need the data to create

39:56.800 --> 39:57.160
good jobs.

39:57.160 --> 39:58.720
Once you've got a good judge, you don't really need to.

39:58.840 --> 40:00.080
Yeah, yeah, yeah.

40:00.120 --> 40:03.360
So that's what like you hear from the labs as well.

40:03.360 --> 40:05.160
Like a lot of them are doing synthetic data.

40:05.200 --> 40:07.200
Now I have two questions.

40:07.200 --> 40:10.440
One, how do you productionize this.

40:10.480 --> 40:13.440
Do you export like the prompt and you run that.

40:13.680 --> 40:14.120
Yeah.

40:14.160 --> 40:16.160
So I don't productionize this.

40:16.200 --> 40:19.240
Like I've never put a DSP program into production.

40:19.720 --> 40:20.120
Exactly.

40:20.120 --> 40:20.480
Yeah.

40:20.520 --> 40:21.960
I could never convince them to do that.

40:21.960 --> 40:23.120
And I wouldn't want to.

40:23.360 --> 40:26.120
I don't want that because it's like a kind of complicated.

40:26.160 --> 40:27.480
And yeah, maybe I'm warming to it.

40:27.480 --> 40:31.000
Like I actually have an internal tool that uses DSP end to end.

40:31.280 --> 40:31.760
Yeah, okay.

40:31.760 --> 40:35.080
But there is a way to get it running with the fast API.

40:35.400 --> 40:39.960
And so, so one of the things that is important is that you can save the program locally.

40:39.960 --> 40:44.600
And then that creates a folder with all the metadata and the pickle of the program.

40:44.600 --> 40:48.360
If you share this folder with someone, they can just literally just do this.

40:48.400 --> 40:52.390
They can just load the program and then it runs perfectly on that machine.

40:52.710 --> 40:54.670
But then I export a prompt.

40:54.710 --> 40:55.790
Is that something you do?

40:55.830 --> 40:56.110
Yeah.

40:56.110 --> 40:58.310
So this is where my special hack and this is like the.

40:58.350 --> 40:59.390
That's what I would want.

40:59.470 --> 41:06.550
I wrote I pasted Omar like insistently because he like really hates that, that people just want to

41:06.550 --> 41:11.230
get the prompt out and it's not that easy because even in inspect element it shows the outputs and stuff

41:11.270 --> 41:13.030
and it's hard to construct it programmatically.

41:13.190 --> 41:14.910
So anyway, he was like, you could just do this.

41:14.950 --> 41:15.910
And I'm like, okay, fine.

41:16.150 --> 41:22.070
So you basically pass in to the chat adapter, which is what it uses underneath the hood, the format

41:22.070 --> 41:27.950
function, and then you get the signature, the demos for all of the named predictors essentially.

41:27.950 --> 41:30.590
So it's yeah, I don't don't worry about it too much.

41:30.590 --> 41:31.990
Your eyes will bleed if you look at it.

41:31.990 --> 41:36.270
But but basically if you just run this then it will like you'll have the full.

41:36.310 --> 41:38.790
And this is the same like this is exactly the same.

41:38.830 --> 41:39.750
It's exactly the same.

41:39.750 --> 41:44.990
It's got the system message and it turns it into the user and assistant messages as well.

41:44.990 --> 41:51.830
So and and for variables like how like I normally use liquid but like it uses something else, I assume,

41:51.870 --> 41:54.990
or it doesn't use variables importantly, which is interesting.

41:55.030 --> 41:57.350
So okay, so how do I custom.

41:57.350 --> 41:58.430
So yeah.

41:58.590 --> 42:03.190
So how do I use make use of variables or like how does it even work.

42:03.190 --> 42:04.830
It just has instructions.

42:04.870 --> 42:06.470
And then you have a user message.

42:06.470 --> 42:11.150
And then it will follow up with whatever the reasoning and things is from the user message.

42:11.150 --> 42:13.430
So how do I trigger this thing then.

42:13.470 --> 42:14.310
Yeah, exactly.

42:14.310 --> 42:19.510
So you can see here it doesn't like use variable like even this actually isn't a variable.

42:19.510 --> 42:21.150
It's just a formatting.

42:21.470 --> 42:21.590
Yeah.

42:21.630 --> 42:25.790
And then all it all the only place it uses the variables are here.

42:25.790 --> 42:28.750
It just puts in the inputs and the outputs.

42:29.310 --> 42:29.510
Yeah.

42:29.550 --> 42:36.750
And then and then when you generate so you can see that the user message is like just topic like this.

42:36.790 --> 42:38.310
And then religious satire.

42:38.390 --> 42:40.030
So what I have to do is the exact same format.

42:40.030 --> 42:40.230
Yeah.

42:40.270 --> 42:41.470
You'd have to use the same format.

42:41.470 --> 42:44.430
Or you could use the display chat adapter if you wanted to.

42:44.470 --> 42:48.980
Or I replace those things with liquid text, like real stuff.

42:49.020 --> 42:50.020
Like how you'd normally pump.

42:50.060 --> 42:50.300
Yeah.

42:50.340 --> 42:52.420
You can also use a different chat adapter.

42:52.420 --> 42:54.580
So you can also create your own chat adapters.

42:54.620 --> 42:57.500
Like you can make it work automatically.

42:57.500 --> 43:00.380
Convert it into your own format the way you like to do it.

43:00.420 --> 43:02.620
If it's markdown or or whatever it is.

43:03.060 --> 43:04.900
So you can see we've got the prompt out here.

43:04.940 --> 43:07.460
We've got the this is just the messages stream essentially.

43:07.460 --> 43:12.860
So you could actually just pass this into ChatGPT as well or into OpenAI.

43:15.340 --> 43:21.180
So I have one one other question that says yeah, which is it's interesting to see that it's doing it's

43:21.180 --> 43:25.620
few shot examples as assistant user assistant a user assistant.

43:25.660 --> 43:25.980
Yeah.

43:26.020 --> 43:30.060
Is there a way to be like I want the examples to be inside just the system prompt.

43:30.060 --> 43:32.020
You just have to change the chat adapter.

43:32.060 --> 43:32.340
So yeah.

43:32.380 --> 43:33.260
And it's all open source.

43:33.420 --> 43:35.700
So you can see what they're doing and you can just subclass it.

43:35.740 --> 43:36.780
And which is quite nice.

43:36.780 --> 43:41.900
But but yeah I've found actually that it works better to just put the few shot examples in the system

43:41.900 --> 43:42.260
message.

43:42.300 --> 43:43.500
It tends to follow it better.

43:43.500 --> 43:43.740
Yeah.

43:43.740 --> 43:46.260
So I don't fully agree with the way they've done this, but it is.

43:46.300 --> 43:48.450
This is like the way that you're supposed to do it.

43:48.450 --> 43:52.730
But like, again, it's one of those hacky things where like, I found that doing it the wrong way actually

43:52.730 --> 43:54.130
sometimes works, works as well.

43:54.170 --> 43:54.370
Yeah.

43:54.410 --> 43:58.650
And once again, like, the nice thing is that you can still benefit from everything by just changing

43:58.650 --> 43:59.450
the adapter.

43:59.490 --> 44:04.450
And yeah, if you change the adapter and optimize with that, you can recompile the program and then

44:04.450 --> 44:06.690
it will just change a few short examples.

44:06.690 --> 44:09.050
It would change all of the output input.

44:09.090 --> 44:15.210
It would change everything to just match your format, which is because the adapter is used when running

44:15.250 --> 44:16.210
the optimization.

44:16.250 --> 44:16.770
Exactly.

44:16.770 --> 44:24.450
So every single time any prompt runs from any of the optimizer processes or processes, it will always

44:24.450 --> 44:27.450
go through the adapter to the LM and back.

44:28.130 --> 44:30.810
And one last question from me.

44:30.850 --> 44:35.450
So I have memories specifically tied to users.

44:35.450 --> 44:41.210
And I would inject memory into the context as well to personalize things more.

44:41.210 --> 44:43.130
But how would memory work?

44:43.130 --> 44:47.250
Or how would that work with an optimizer or you just leave it out completely.

44:47.290 --> 44:47.730
Yeah.

44:47.730 --> 44:53.010
So no, I yeah, I've thought about this a lot and and the way you do it is you give it as like a specific

44:53.010 --> 44:54.170
field in DSP.

44:54.210 --> 44:58.250
So they actually have a history field somewhere.

44:58.970 --> 45:01.890
They have memory like there is a concept of memory.

45:01.930 --> 45:02.450
Yeah.

45:02.690 --> 45:04.410
And they've got a tutorial about memory.

45:04.410 --> 45:09.450
But what I've used for those because we also have the same thing with with rally like we have the past

45:09.450 --> 45:10.250
chat history.

45:10.330 --> 45:10.690
Um, yeah.

45:10.730 --> 45:11.130
Exactly.

45:11.170 --> 45:16.810
So there's this specific DSP history, uh, which will then format it in the correct way as a message

45:16.810 --> 45:17.330
stream.

45:17.690 --> 45:17.890
Yeah.

45:17.930 --> 45:18.210
Okay.

45:18.250 --> 45:23.010
But then but how do you like, like we have 5000 users.

45:23.250 --> 45:28.210
Like how is the way you've got to think about it is it's like a, it's like a Excel sheet.

45:28.210 --> 45:28.730
Right.

45:28.770 --> 45:33.810
For every interaction with the user, you have the same as long as you have the same columns like the

45:33.810 --> 45:35.490
input from the user is this.

45:35.490 --> 45:37.330
And then the memory of the user is this.

45:37.330 --> 45:39.810
And then maybe you did some rag right.

45:39.850 --> 45:44.800
So like the rag result is this column, you know, and you can actually build the rag into the program

45:44.800 --> 45:45.160
as well.

45:45.160 --> 45:48.200
So you can do yeah, you can do all that stuff you can have.

45:48.240 --> 45:50.200
Actually I think they've got an example here.

45:50.720 --> 45:51.360
Yeah, yeah.

45:51.400 --> 45:52.720
Here's an example.

45:53.040 --> 45:53.360
Oh no.

45:53.360 --> 45:53.720
Sorry.

45:53.760 --> 45:53.960
Yeah.

45:54.000 --> 45:54.440
Rag.

45:54.560 --> 45:59.880
So like here they've used Colbert as the rag thing and they've just passed that in to search Wikipedia.

45:59.920 --> 46:00.080
Right.

46:00.120 --> 46:04.520
Or they have agents where it's like they have evaluate math and search Wikipedia and then they pass

46:04.520 --> 46:05.600
them as the tools.

46:05.640 --> 46:05.800
Right.

46:05.800 --> 46:10.360
You can actually build whatever program it is you want, but you just need to make sure your examples

46:10.360 --> 46:13.160
contain the user ID or something like that.

46:13.400 --> 46:19.400
And then it can do whatever memory lookup for that user, and then it's included in the optimization.

46:19.440 --> 46:20.080
Exactly.

46:20.080 --> 46:20.360
Yeah.

46:20.360 --> 46:26.840
And so that that's nice because it will make it extra generalizable in terms of it will because the

46:26.840 --> 46:28.640
memories are different every time.

46:28.640 --> 46:32.080
Like it will, I think, really, truly learn the task, if that makes sense.

46:32.120 --> 46:32.720
Yeah, yeah.

46:32.760 --> 46:37.840
But like every row is just a user interaction and what context interaction had at the time.

46:37.880 --> 46:38.240
Yeah, yeah.

46:38.280 --> 46:41.360
And you just include it or do it however you want.

46:41.360 --> 46:41.520
Yeah.

46:41.560 --> 46:41.720
Yeah.

46:41.790 --> 46:42.150
exactly.

46:42.190 --> 46:42.310
Yeah.

46:42.350 --> 46:42.990
So I have a question.

46:42.990 --> 46:43.590
If you have the time.

46:43.630 --> 46:43.990
Yeah, sure.

46:44.630 --> 46:47.270
So you say you don't use DSP in production.

46:47.630 --> 46:51.950
How do you go from the prompt that you have in production to DSP.

46:52.390 --> 46:52.870
Yeah.

46:52.910 --> 46:53.590
Good question.

46:53.630 --> 46:53.750
Yeah.

46:53.750 --> 46:58.470
So typically what I'll do is I'll run the optimization.

46:58.470 --> 47:00.150
So there's a few things that I tend to use it for.

47:00.190 --> 47:03.230
One is to just script to try different things in different models.

47:03.230 --> 47:05.710
So I'll be like okay does this work right.

47:05.750 --> 47:10.070
Does like I don't know, I was trying to make like a focus group where the agents talk to each other

47:10.070 --> 47:14.870
and it was much, much easier to just give them a talk tool and then make a react agent in DSP and then

47:14.870 --> 47:16.670
let it run and just see if it worked.

47:16.710 --> 47:20.430
So then it just gives me a broad strokes of does this interaction format work or not?

47:20.430 --> 47:21.350
And it's crappy.

47:21.350 --> 47:22.870
So I was like, okay, I'll abandon that.

47:22.910 --> 47:24.070
It's really good for that.

47:24.070 --> 47:29.390
I use it to test out theories of what might work, and then if it works in DSP, then I'll go into my

47:29.390 --> 47:31.990
code and I'll test it.

47:31.990 --> 47:35.070
The other way is I'll actually run the optimization.

47:35.590 --> 47:37.110
I'll create the eval metric.

47:37.310 --> 47:43.100
But then but then I'll export the prompt like you saw here, like I'll get the system message, etc.

47:43.100 --> 47:45.140
and then I'll just translate that to my format.

47:45.140 --> 47:47.180
So I'll just say, here's how I structure my prompt.

47:47.220 --> 47:52.540
It's not perfect because it might change actually the evaluation score when you change the format.

47:52.820 --> 47:58.300
But it's actually pretty easy to just give this to ChatGPT like paste it in and say, hey, I got this

47:58.300 --> 48:00.940
program from DSP, can you just change the few shot examples?

48:00.980 --> 48:01.820
So how do you go the other way?

48:01.860 --> 48:04.860
Like for example, I have my prompt for spiral.

48:05.180 --> 48:08.420
Let's say I have a very fuzzy class that I want to evaluate against.

48:08.460 --> 48:12.500
How would I take that and then put it into DSP to then evaluate against.

48:12.540 --> 48:12.900
Yeah.

48:12.900 --> 48:18.020
So the canonical way for DSP would they would say, don't worry about all the crazy stuff that you put

48:18.020 --> 48:18.540
in your prompt.

48:18.540 --> 48:22.820
Just describe the task as the instructions, like just a simple task.

48:22.860 --> 48:28.860
Describe what the inputs and outputs are and then dump your your eval data set and then create the focus

48:28.860 --> 48:29.980
on the eval metric.

48:30.020 --> 48:32.300
I found pretty bad results doing that.

48:32.340 --> 48:34.420
Like it takes a bit of time to get back up to.

48:34.460 --> 48:37.980
If you've got a really lovingly handcrafted prompt, it actually takes quite a while.

48:38.020 --> 48:44.140
Like I actually have a joke prompt which is not safe for work because it's like very good doing Dave

48:44.140 --> 48:44.660
Chappelle.

48:44.780 --> 48:46.340
Can I even train his voice?

48:46.380 --> 48:51.460
Um, and I didn't really want to get sued, but it is like you, it literally passes the Turing test.

48:51.460 --> 48:53.180
And this was back with before.

48:53.220 --> 48:55.060
Like before, uh, and it's unbelievable.

48:55.260 --> 48:58.580
So I still haven't created a DSP joke generator that beats that.

48:58.740 --> 49:04.340
But that took me like days, like a lot of I just love comedy, so I put a lot of effort into it, basically.

49:04.380 --> 49:10.340
So I would say quite often I use DSP for the things I don't care that much about, which is 8,090% of

49:10.340 --> 49:10.900
my problems.

49:10.940 --> 49:16.980
For rally, we have a really lovingly handcrafted prompt for querying the personas and for generating

49:16.980 --> 49:17.460
the personas.

49:17.460 --> 49:19.620
But we also have prompts for like generating the title.

49:19.660 --> 49:20.820
And I'm like, I don't care about that.

49:20.860 --> 49:22.220
Like, I just use DSP for that.

49:22.220 --> 49:26.740
Or I have a prompt for classifying the diversity of thought in the.

49:26.780 --> 49:30.180
And I don't even know what diversity of thought means really in a lot of cases.

49:30.180 --> 49:32.180
So like I just use DSP for that.

49:32.180 --> 49:37.690
And, and I put in into my data set cases where the users have complained and send them to me.

49:37.730 --> 49:40.410
So that's how I think about it, is create your classifiers.

49:40.410 --> 49:41.770
It's amazing at that.

49:41.810 --> 49:45.930
Create your like utility tasks and despite it's like amazing for that.

49:45.930 --> 49:47.770
And you don't need if you don't care that much about it.

49:47.770 --> 49:54.850
But just like I writing like I still handwrite stuff like my every article is actually the only thing

49:54.850 --> 49:58.970
really that I like write myself anymore because they're going to 100,000 people.

49:58.970 --> 50:01.050
It's actually, like worthwhile to spend a day.

50:01.090 --> 50:06.050
And it only takes me a day because, like, I don't when my life really be better if I like managed

50:06.050 --> 50:09.530
to get that down to a couple of hours, but like, maybe the quality won't be as good.

50:09.570 --> 50:09.810
Yeah.

50:09.850 --> 50:15.610
So like the things you really care about, I think you can still beat DSP, but like I would say increasing

50:15.610 --> 50:17.130
amount of things I just don't care about.

50:17.170 --> 50:17.570
Yeah.

50:18.010 --> 50:25.410
So if I have a categorization thing would be would it be better than hence doing it categorization or

50:25.410 --> 50:27.570
you would say still think it's better?

50:27.570 --> 50:28.290
It will be better.

50:28.290 --> 50:32.330
If you hadn't, I would say I haven't met a classification task.

50:32.530 --> 50:34.770
That DSP didn't completely blow out of the water.

50:34.810 --> 50:39.040
Okay, so I should use it for Quora email classification.

50:39.080 --> 50:39.760
Yeah, I think so.

50:39.800 --> 50:41.960
Yeah, I think that would be like a really good place to start.

50:42.000 --> 50:43.320
Yeah yeah yeah.

50:43.360 --> 50:48.040
The only thing there is like, how do I train it for all thousands of users?

50:48.040 --> 50:53.280
Because all users have different rules and different nuance like I.

50:53.400 --> 51:00.080
The problem is dynamic, like my problem changes per user and I'm just worried that will mess like it

51:00.080 --> 51:02.080
will generalize something that's really good.

51:02.080 --> 51:04.680
Yeah, to some degree the people.

51:05.040 --> 51:05.280
Yeah.

51:05.280 --> 51:09.000
To some degree I guess like it.

51:09.040 --> 51:15.640
If there is a coherent meta task, as in if most of your users have similar preferences, then it will

51:15.640 --> 51:20.480
do a pretty good job of teasing out if there is some pattern.

51:20.520 --> 51:22.320
Maybe that's what it will do a good job of.

51:22.520 --> 51:29.320
Break it up into steps where it's like one is the like the generalized version, and then it will do

51:29.320 --> 51:32.360
a secondary for things, but that's more costly.

51:32.360 --> 51:33.120
So think about it.

51:33.120 --> 51:33.240
Yeah.

51:33.280 --> 51:34.080
Yeah exactly.

51:34.080 --> 51:40.560
But what you could do if you have enough data per user is you could just have a DSP program per user,

51:41.320 --> 51:41.560
right?

51:41.920 --> 51:43.920
Like I've seen that with fine tuning.

51:44.080 --> 51:46.920
That would be like say writer did a really good job of this.

51:46.920 --> 51:48.360
They made their own AI models.

51:48.400 --> 51:53.600
They had the base model which was really good at writing, but then they trained a custom model for

51:53.600 --> 51:55.240
each because their enterprise clients, right?

51:55.280 --> 52:00.600
They trained a custom model for each enterprise client on their brand values and their tone of voice.

52:01.200 --> 52:04.040
Or you could have I see AI image companies do this right.

52:04.080 --> 52:06.280
Like they have a custom law of per brand.

52:06.520 --> 52:06.800
Yeah.

52:06.840 --> 52:08.560
Yeah, I think that might be the key.

52:08.560 --> 52:10.000
And I am going to try that.

52:10.560 --> 52:11.200
Yeah, yeah.

52:11.240 --> 52:14.720
Do you like how costly is it normally to run this.

52:14.920 --> 52:15.640
Do you know.

52:15.680 --> 52:22.840
I would say you basically don't really have to worry too much if you're using GPT Mini or Haiku or one

52:22.840 --> 52:28.280
of the Gemini Flash, you know, like that's you're never going to cost yourself more than a couple

52:28.280 --> 52:33.830
of bucks, even with the heavy optimizers, where it gets really dangerous and scary is when you're

52:33.870 --> 52:38.030
actually when you're using anything anthropic like sonnet and and particularly like opus.

52:38.070 --> 52:38.230
Right.

52:38.270 --> 52:38.830
Like that.

52:38.830 --> 52:41.270
That could be like a $200 mistake.

52:41.510 --> 52:43.670
Or if you're using the big open AI models as well.

52:43.710 --> 52:50.070
I actually did a big project for a client, which was 5000 AI personas, and we were just using for

52:50.070 --> 52:56.310
oh, but I was I did this big optimization and and my credit card got declined because we'd spent like

52:56.310 --> 53:00.310
$900 in an hour and I was like, oh, so I had to ring up my client.

53:00.350 --> 53:04.710
I was like, do you have a credit card that I could use to continue?

53:04.870 --> 53:05.830
Yeah, it's pretty funny.

53:06.070 --> 53:06.350
Yeah.

53:06.390 --> 53:10.230
This is like a really funny, really funny account, you know?

53:11.190 --> 53:12.350
Uh, so that was that was.

53:12.390 --> 53:12.510
Yeah.

53:12.550 --> 53:12.750
Okay.

53:12.790 --> 53:16.110
But yeah, because we couldn't afford any more tokens.

53:16.150 --> 53:17.030
Again, very fun.

53:17.510 --> 53:22.910
But yeah, my margins are not that big because my product is $15 per month, and I already spent around

53:22.910 --> 53:25.550
two and a half dollars per month in inference.

53:25.550 --> 53:28.870
Just so like a dollar is, is important.

53:28.870 --> 53:30.780
Yeah, yeah, so is important.

53:30.820 --> 53:33.900
So I would say that you might want to do like a more of a meta task.

53:33.940 --> 53:39.540
Then if you can optimize the prompt once and it gets 50% better for all users is better.

53:39.580 --> 53:39.700
Yeah.

53:39.740 --> 53:39.980
Yeah.

53:40.020 --> 53:43.980
Then then like, you can up those costs over, like, the whole use space.

53:44.300 --> 53:44.460
Yeah.

53:44.500 --> 53:44.740
Yeah.

53:44.780 --> 53:46.900
I at least I need to experiment and see.

53:46.940 --> 53:51.900
Is it worth a dollar or $2 once to run or.

53:52.060 --> 53:53.340
Yeah, to some degree as well.

53:53.340 --> 53:57.100
There is a real you can actually really bring down your actual costs with DSP.

53:57.140 --> 53:58.780
Optimization is an investment.

53:58.780 --> 54:04.780
But you I've seen it happen where like you can take like I would take GPT 4.5 because it's good at creative

54:04.780 --> 54:05.340
writing.

54:05.540 --> 54:09.580
And then I would use that as the teacher model, or I'd use that to generate a synthetic data set.

54:09.620 --> 54:09.980
Yeah.

54:10.020 --> 54:14.940
You just need to get almost as good, like 8,090% good is 4.5 for that task.

54:14.980 --> 54:15.180
Yeah.

54:15.220 --> 54:20.220
So like distillation I think is like the primary the classification and distillation are like the two

54:20.260 --> 54:21.500
major brain.

54:21.540 --> 54:22.620
I already own the cheapest.

54:22.980 --> 54:27.300
Essentially I already use flash 2.0 for classification.

54:27.340 --> 54:27.620
Yeah.

54:27.660 --> 54:28.300
Flash is great.

54:28.420 --> 54:29.020
Unbelievable.

54:29.020 --> 54:30.940
I hope Google keeps subsidizing it.

54:30.980 --> 54:31.300
Yeah.

54:33.020 --> 54:33.380
Cool.

54:33.660 --> 54:34.420
This is amazing.

54:34.420 --> 54:35.420
Thank you so much.

54:35.500 --> 54:35.780
Cool.

54:35.820 --> 54:37.780
Yeah yeah yeah yeah yeah.

54:37.820 --> 54:39.620
It'll be good to get the recording afterwards.

54:40.180 --> 54:41.460
We could turn this into.

54:41.620 --> 54:44.300
Maybe I don't have to write my next every column now.

54:44.340 --> 54:46.540
So yeah, we should like.

54:46.780 --> 54:47.180
Yeah.

54:47.220 --> 54:54.380
It's like I'm going to I'm definitely going to try this out next week or weeks because this is exactly

54:54.380 --> 54:55.860
the problem I'm trying to solve.

54:55.860 --> 54:58.220
And I'm just lazy to work on the problem.

54:58.260 --> 55:01.740
Yeah, it is like one of those things where actually similar to like coding.

55:01.780 --> 55:08.180
Like when I first started using copilot, I was like checking every line and then now I'm like, oh,

55:08.220 --> 55:09.620
just see if code can do it.

55:09.660 --> 55:10.900
And yeah, we'll see.

55:10.940 --> 55:16.060
I think Jess and I used to care a lot about my prompts, and now I'm like, ah, well, just see if

55:16.300 --> 55:16.780
I can do it.

55:16.820 --> 55:18.060
Yeah, yeah.

55:20.180 --> 55:20.660
Yeah.

55:20.700 --> 55:21.260
Cool, cool.

55:21.300 --> 55:21.580
All right.

55:21.580 --> 55:22.260
Thanks, guys.

55:22.540 --> 55:22.860
Yeah.

55:22.860 --> 55:23.540
Thank you.

55:24.220 --> 55:24.540
All right.

55:24.580 --> 55:25.140
Take care.

55:25.300 --> 55:25.860
See you.
