WEBVTT

00:01.230 --> 00:03.600
-: All right guys, I wanna show you the evaluator

00:03.600 --> 00:05.040
optimizer pattern.

00:05.040 --> 00:09.300
So we're gonna be using something based off this code right

00:09.300 --> 00:12.390
here, but this is like an agent pattern

00:12.390 --> 00:14.280
and it's quite useful.

00:14.280 --> 00:18.840
I use it fairly regularly for situations where the cost

00:18.840 --> 00:20.880
of the task is not that important

00:20.880 --> 00:22.320
and also the time it takes

00:22.320 --> 00:24.000
to do the task is not that important.

00:24.000 --> 00:26.760
But it is very important that you get the task done well.

00:26.760 --> 00:30.330
What it's using is like an LLM judge

00:30.330 --> 00:34.620
where it will generate the task,

00:34.620 --> 00:36.000
send the solution,

00:36.000 --> 00:39.150
and the LLM judge will tell it where it failed

00:39.150 --> 00:40.770
or what's wrong with it.

00:40.770 --> 00:42.270
And it'll keep going in the loop

00:42.270 --> 00:44.730
until it reaches the maximum number of iterations

00:44.730 --> 00:47.520
or until it says, "Yeah, you've passed."

00:47.520 --> 00:50.100
So there's many different ways to implement this.

00:50.100 --> 00:52.050
They have some good code here

00:52.050 --> 00:54.090
and I've used that as a starting point,

00:54.090 --> 00:57.720
but just adapted it to a specific use case that I had.

00:57.720 --> 01:02.280
So just going to go into our notebook,

01:02.280 --> 01:03.780
and we'll just code along.

01:03.780 --> 01:07.597
One thing, I'm just gonna grab the standard code

01:09.210 --> 01:11.820
that you use to get the OpenAI API key

01:11.820 --> 01:14.970
in Google Notebooks.

01:14.970 --> 01:16.950
You just need to like turn on the API key

01:16.950 --> 01:18.420
that you already have in the secret.

01:18.420 --> 01:20.610
So you can just add the secret here,

01:20.610 --> 01:22.350
and this is the easiest way to do it.

01:22.350 --> 01:23.793
And then if you just run that,

01:23.793 --> 01:26.640
then you hopefully should have your OpenAI key.

01:26.640 --> 01:28.740
Go into the OpenAI account

01:28.740 --> 01:31.110
and get one there if you don't have it already.

01:31.110 --> 01:34.350
All right, so I just need to upgrade OpenAI

01:34.350 --> 01:35.643
to the latest version.

01:36.570 --> 01:41.490
Just gonna run that. Okay?

01:41.490 --> 01:45.010
And then we are going to

01:45.870 --> 01:47.640
create a function called the LLM.

01:47.640 --> 01:51.967
So say run LLM.

01:54.120 --> 01:55.650
This is just like standard practice.

01:55.650 --> 01:57.870
Typically you're doing a bunch of this stuff.

01:57.870 --> 02:01.263
So just gonna call this a user prompt.

02:03.660 --> 02:08.133
It's gonna be a string. We need to pass the model.

02:10.950 --> 02:12.450
It's gonna be a string.

02:12.450 --> 02:16.200
We're gonna do gt4,

02:16.200 --> 02:18.390
on average it can be the model

02:18.390 --> 02:19.800
we're gonna be using the most.

02:19.800 --> 02:22.233
And then we have a system prompt,

02:25.230 --> 02:28.503
also string, and a default is none.

02:30.150 --> 02:33.420
Alright, underneath here we're gonna have our messages.

02:33.420 --> 02:36.183
We're gonna have if system prompt,

02:38.490 --> 02:40.560
going to, yeah, add that.

02:40.560 --> 02:42.483
It's now called developer by the way.

02:44.787 --> 02:45.900
So we're gonna change that.

02:45.900 --> 02:49.170
And yeah, we want to add the user prompt in here.

02:49.170 --> 02:52.173
That makes sense. And then we do completion.

03:00.210 --> 03:03.030
Okay, so it's got the old version here.

03:03.030 --> 03:06.813
We need to get a client.

03:09.750 --> 03:14.750
So let me just paste that in here.

03:15.030 --> 03:19.830
And then we're gonna say client.chatcompletions.

03:19.830 --> 03:22.830
Here we go. So now it's found the right thing here.

03:22.830 --> 03:27.830
All right, we're gonna pass in model and messages.

03:32.850 --> 03:34.100
Okay, I think we're good.

03:36.660 --> 03:38.763
Now let's just check that this is working.

03:41.340 --> 03:43.083
Yeah, what is the meaning of life?

03:50.187 --> 03:52.603
Here we go.

03:52.603 --> 03:54.150
Hopefully, it would say 42,

03:54.150 --> 03:55.950
but yeah, here we are.

03:55.950 --> 04:00.483
Okay and then we need to set up our generator

04:01.560 --> 04:04.710
and discriminator or evaluator and optimizer

04:04.710 --> 04:06.660
or infrastructure.

04:06.660 --> 04:08.460
So let's have the task.

04:08.460 --> 04:11.244
This is like the first thing we need.

04:11.244 --> 04:14.010
So the task is one that I was using the other day.

04:14.010 --> 04:17.970
Write a one sentence bedtime story.

04:21.002 --> 04:24.024
(keyboard clacking)

04:24.024 --> 04:27.063
about a unicorn for a five-year old girl.

04:29.130 --> 04:30.780
I was doing that for my daughter.

04:34.050 --> 04:36.300
Always good to automate your life.

04:36.300 --> 04:39.060
And we're gonna do a generator prompt.

04:39.060 --> 04:42.330
I'm just gonna paste this one in because it's a bit longer.

04:42.330 --> 04:43.650
I'll walk you through it.

04:43.650 --> 04:46.650
So your goal is to complete the task based on user input.

04:46.650 --> 04:48.750
If there are feedback from your previous generations,

04:48.750 --> 04:51.090
you should reflect on them to improve your solution.

04:51.090 --> 04:53.400
I'll put your answer concisely in this format.

04:53.400 --> 04:55.500
Thoughts and then response.

04:55.500 --> 04:59.340
And we're taking that from the code in here.

04:59.340 --> 05:03.810
So they have this generator prompt here.

05:03.810 --> 05:06.630
We're also gonna get the evaluator prompt.

05:06.630 --> 05:08.280
I'm gonna paste that one in here as well.

05:08.280 --> 05:09.720
I've changed this one a little bit though,

05:09.720 --> 05:11.280
so I'll walk you through it.

05:11.280 --> 05:13.770
Evaluate this final response four.

05:13.770 --> 05:15.510
And then you have your criteria.

05:15.510 --> 05:18.870
These are the things that you want to get out of this task.

05:18.870 --> 05:22.770
And it's gonna keep going until it gets to this criteria.

05:22.770 --> 05:25.200
So one, I want it to be age appropriate.

05:25.200 --> 05:26.970
I want it to follow style

05:26.970 --> 05:29.460
and best practices for like children's story,

05:29.460 --> 05:31.980
but I also want it to be only 10 words or fewer.

05:31.980 --> 05:33.900
And this is gonna be a hard thing for it to do

05:33.900 --> 05:36.933
because it tends to be a bit verbose otherwise.

05:37.770 --> 05:41.313
All right, so now let's set up our generate task.

05:43.350 --> 05:45.690
Generate for the jerry task.

05:45.690 --> 05:50.280
We need a string as the task.

05:50.280 --> 05:52.623
Then we need our generator prompt.

05:56.100 --> 05:59.553
So string we'll need some context.

06:00.540 --> 06:04.173
Also, string default is just empty string.

06:06.420 --> 06:08.560
And what this is gonna return

06:10.260 --> 06:15.003
is a tuple and two strings.

06:19.960 --> 06:22.473
Okay, so we're gonna say full prompt.

06:26.433 --> 06:29.040
And the full prompt is gonna be this,

06:29.040 --> 06:31.920
just gonna copy this across from the code.

06:31.920 --> 06:34.890
We're gonna take the generator prompt, put it in first,

06:34.890 --> 06:38.733
then we're gonna add the context, then we'll add the task.

06:40.710 --> 06:42.450
Then that's if there is context.

06:42.450 --> 06:43.890
By the way, if there isn't context,

06:43.890 --> 06:46.977
then it would just be generate a prompt and task.

06:46.977 --> 06:49.893
All right, so hopefully that makes sense for you.

06:51.030 --> 06:53.070
Just gonna give this a comment as well,

06:53.070 --> 06:55.350
just so of the doc string.

06:55.350 --> 06:59.193
And then we just need to run that response.

07:00.750 --> 07:03.660
So run the response, print out, generation start

07:03.660 --> 07:06.660
and the output and then return the response.

07:06.660 --> 07:09.000
Okay now we just need to do the same thing for evaluate,

07:09.000 --> 07:11.190
which is a bit more complicated.

07:11.190 --> 07:13.383
And so we have our evaluator prompt.

07:14.610 --> 07:17.223
Let's do our evaluate function.

07:18.720 --> 07:21.510
So the evaluate needs task,

07:21.510 --> 07:24.400
it also needs the evaluator prompt

07:28.020 --> 07:30.837
and we need the generators content.

07:38.190 --> 07:40.893
Okay and so far so good.

07:42.060 --> 07:44.073
Also gonna return the tuple strings.

07:46.050 --> 07:48.570
And here we just need to say,

07:48.570 --> 07:53.287
evaluate if a solution meets requirements,

07:55.410 --> 07:57.240
because that's what we're gonna try and do.

07:57.240 --> 07:59.640
All right, first we need to get the full prompt.

07:59.640 --> 08:02.590
I'm just gonna paste this across rather than typing it all.

08:08.370 --> 08:10.260
So we're gonna take the evaluator prompt,

08:10.260 --> 08:11.880
and then we're gonna put the original task

08:11.880 --> 08:13.350
and then the content to evaluate.

08:13.350 --> 08:16.890
So it has all the full context of what it's looking at.

08:16.890 --> 08:19.920
And then we're just gonna run it right, run the LLM,

08:19.920 --> 08:22.730
get the full prompt and we need to do some regex.

08:22.730 --> 08:25.050
So in the original one, they used pedantic

08:25.050 --> 08:26.910
and they did structured outputs.

08:26.910 --> 08:29.160
I'm not trying to teach that right now.

08:29.160 --> 08:30.330
So we're gonna use regex.

08:30.330 --> 08:33.930
That means we need to do import re.

08:33.930 --> 08:35.820
This is like the dumb version.

08:35.820 --> 08:38.070
So all I'm doing here is I'm getting the status

08:38.070 --> 08:39.090
and I'm getting the feedback

08:39.090 --> 08:41.640
because it's gonna send it back in this format.

08:41.640 --> 08:42.720
We're gonna get the status

08:42.720 --> 08:44.340
and then we're gonna get the feedback.

08:44.340 --> 08:47.610
And I just asked ChatGPT to write this for me.

08:47.610 --> 08:50.610
You can ask Gemini in here if you want as well.

08:50.610 --> 08:52.980
LMS are pretty good at regex.

08:52.980 --> 08:56.970
If there's an error, we need to raise that error

08:56.970 --> 08:59.070
and just say it didn't come back with a response.

08:59.070 --> 09:00.450
So we'll be able to see that.

09:00.450 --> 09:02.640
But then if we do get the match,

09:02.640 --> 09:06.060
then we're gonna get the evaluation

09:06.060 --> 09:08.620
and then we're gonna get the feedback

09:09.780 --> 09:12.663
and we'll just print them out and then return them.

09:15.390 --> 09:17.580
So that should be good. Alright, cool.

09:17.580 --> 09:19.710
So now we have everything we need

09:19.710 --> 09:21.460
and we just need to write the loop.

09:23.100 --> 09:25.173
Okay, so loop workflow,

09:26.280 --> 09:27.960
this is where we tie everything together.

09:27.960 --> 09:31.530
So this is like our master orchestration.

09:31.530 --> 09:33.630
Starting to do this for us now, right?

09:33.630 --> 09:36.390
So we have, 'cause we've been adding type hints

09:36.390 --> 09:38.880
and stuff along the way, so now it knows what we're doing.

09:38.880 --> 09:42.000
Tasks, string, generator prompt is a string.

09:42.000 --> 09:43.980
Evaluator prompt is a string

09:43.980 --> 09:46.020
and generator prompt is a string,

09:46.020 --> 09:48.303
and then it's actually returning.

09:49.950 --> 09:51.990
We're gonna return a tuple, which is a string,

09:51.990 --> 09:53.740
and there's a list of dictionaries.

09:54.870 --> 09:57.120
All right, what do we need to do in this function?

09:57.120 --> 09:59.280
We need to keep generating and evaluating

09:59.280 --> 10:02.820
until it passes the last generated response.

10:02.820 --> 10:04.833
So first we need a marry.

10:09.150 --> 10:10.800
It's just gonna be an array

10:10.800 --> 10:12.510
and then we're gonna generate a response

10:12.510 --> 10:14.730
and add it to memory, right?

10:14.730 --> 10:18.300
So far so good. Then we're gonna go into a loop.

10:18.300 --> 10:21.240
So we're gonna do a maximum number of iterations.

10:21.240 --> 10:22.830
You could put this in parameters

10:22.830 --> 10:25.140
or I'd just like to hard code it.

10:25.140 --> 10:29.620
And we're gonna say while max iterations

10:33.360 --> 10:36.210
is bigger than zero, then start doing some stuff.

10:36.210 --> 10:39.240
So this will mean it will keep looping through.

10:39.240 --> 10:42.580
And then at the bottom we're gonna say max iterations

10:43.770 --> 10:45.360
minus equals one.

10:45.360 --> 10:47.280
So that way it's gonna start at five

10:47.280 --> 10:50.910
and then go to 4, 3, 2, 1 and then stop.

10:50.910 --> 10:52.470
It's really important with agents

10:52.470 --> 10:54.390
that you don't just let it keep going forever

10:54.390 --> 10:55.950
because it might cost you a lot.

10:55.950 --> 10:58.050
That's the reason why we do that. Okay?

10:58.050 --> 11:02.010
And then we need to get the evaluation,

11:02.010 --> 11:04.110
the feedback back from evaluate

11:04.110 --> 11:06.060
because that's what it sends back here.

11:06.930 --> 11:09.990
And we just need to send in the task, the evaluator prompt

11:09.990 --> 11:11.640
and the response.

11:11.640 --> 11:13.560
So that's pretty straightforward.

11:13.560 --> 11:16.830
And then to terminate the condition, we're gonna say,

11:16.830 --> 11:21.690
let's say upper if, if we get pass back instead of fail,

11:21.690 --> 11:24.780
then we're just gonna return the response okay?

11:26.850 --> 11:30.210
So that means everything is gonna be done, right?

11:30.210 --> 11:32.820
So we're gonna get exit out of this function

11:32.820 --> 11:34.650
and that doesn't matter if,

11:34.650 --> 11:37.830
it's not gonna do all five iterations if we don't need to,

11:37.830 --> 11:42.003
but if we, presumably we didn't pass;

11:43.590 --> 11:47.370
then we need to join all the previous attempts together.

11:47.370 --> 11:50.370
So this is just iterating through the memory,

11:50.370 --> 11:53.100
getting the feedback, and just joining all that together.

11:53.100 --> 11:55.980
So you're gonna see that once we run it

11:55.980 --> 12:00.270
and we're gonna run the response again.

12:00.270 --> 12:03.660
So this is, we could do chat session management,

12:03.660 --> 12:05.460
we haven't done it in this way,

12:05.460 --> 12:06.820
but this is like an easy way to do it,

12:06.820 --> 12:10.020
where the loop itself understands the memory.

12:10.020 --> 12:11.730
The reason why you might wanna do that, by the way,

12:11.730 --> 12:13.980
rather than keeping the memory

12:13.980 --> 12:17.280
within the individual messages streams is

12:17.280 --> 12:20.370
that you can format it in whatever way you want,

12:20.370 --> 12:21.370
which is quite nice.

12:23.610 --> 12:28.170
Cool alright, so now let's try our loop workflow

12:28.170 --> 12:30.783
and hopefully if we've done all this correct,

12:31.620 --> 12:34.470
then it'll run.

12:34.470 --> 12:37.713
So we need our evaluator prompt, generator prompt.

12:43.800 --> 12:45.840
Okay, I think that's everything.

12:45.840 --> 12:47.290
Hopefully we did again there.

12:49.980 --> 12:51.873
All right. Okay, we did get an error.

12:53.160 --> 12:54.900
Could not pass the response.

12:54.900 --> 12:56.750
Did we get these the wrong way round?

13:01.290 --> 13:02.123
Yeah.

13:11.430 --> 13:14.193
Here we go. See if this works.

13:19.260 --> 13:21.090
Okay, here we go. It's working.

13:21.090 --> 13:25.080
So first we output the thoughts and then the response.

13:25.080 --> 13:29.460
And you can see it's too long. So we failed.

13:29.460 --> 13:31.830
The evaluation started, it said it failed.

13:31.830 --> 13:34.140
It is appropriate for a 5-year-old,

13:34.140 --> 13:36.390
but it's not 10 words are fewer

13:36.390 --> 13:38.040
and it does follow best practices.

13:38.040 --> 13:39.150
You can see it's going through,

13:39.150 --> 13:41.100
it's in a couple of loops now.

13:41.100 --> 13:44.070
Then look at this, it was much shorter on the second try,

13:44.070 --> 13:45.420
which is cool.

13:45.420 --> 13:47.400
But in this case, it's saying

13:47.400 --> 13:49.500
that it's not following best practices.

13:49.500 --> 13:52.200
Also it's saying there's still not 10 words or fewer.

13:52.200 --> 13:54.060
Yeah, this one's a bit too long.

13:54.060 --> 13:56.060
Let's see what we're at, where we're at.

13:56.970 --> 13:58.173
This is much shorter.

13:59.460 --> 14:04.320
Just taking in that information from the previous one.

14:04.320 --> 14:06.300
Here we go. We've got a pass yay.

14:06.300 --> 14:08.010
Luna, the unicorn soared,

14:08.010 --> 14:10.800
leaving sparkling rainbows in her wake.

14:10.800 --> 14:11.633
Perfect, right?

14:11.633 --> 14:12.900
It is age appropriate.

14:12.900 --> 14:14.310
It successfully captures magical

14:14.310 --> 14:17.730
and engaging image using vivid imagery.

14:17.730 --> 14:20.850
Yeah, it's doing well on all of our counts.

14:20.850 --> 14:23.160
This is a really good example where

14:23.160 --> 14:27.840
the evaluation criteria might not be obvious

14:27.840 --> 14:29.550
in the prompt itself.

14:29.550 --> 14:32.880
And in this case, for example, we didn't have in the prompt

14:32.880 --> 14:34.830
that it should be 10 words or fewer.

14:34.830 --> 14:38.780
But also generally LLMs, if we put, make this 10 words

14:38.780 --> 14:42.303
or fewer or make this a thousand word blog post,

14:43.170 --> 14:44.520
they tend to do a bad job of that.

14:44.520 --> 14:47.910
So something like this can really help you.

14:47.910 --> 14:50.970
It's actually much better at evaluating whether it

14:50.970 --> 14:52.620
did a good job the first time,

14:52.620 --> 14:55.980
then it would be at doing the job the second time.

14:55.980 --> 14:58.023
Cool, yeah, this is really helpful.

14:59.042 --> 15:00.090
Where you want to use this type of thing

15:00.090 --> 15:01.410
because it is slower, right?

15:01.410 --> 15:03.840
Like you have to wait for it to keep coming back and forth.

15:03.840 --> 15:05.340
And it's also much more expensive

15:05.340 --> 15:08.910
because now we've done, we're doing two calls per round

15:08.910 --> 15:11.460
and we're doing five rounds, there's 10 calls,

15:11.460 --> 15:13.050
whereas normally we just do one.

15:13.050 --> 15:15.750
The benefit is that we're trading off time

15:15.750 --> 15:17.490
and money for quality

15:17.490 --> 15:20.400
because we can put whatever criteria we want

15:20.400 --> 15:22.350
in this evaluation prompt,

15:22.350 --> 15:25.740
and it's gonna pass them eventually,

15:25.740 --> 15:28.620
so we can just throw money at the problem until it passes.

15:28.620 --> 15:30.120
There's lots of different ways to implement it,

15:30.120 --> 15:32.430
but it just depends on what makes sense

15:32.430 --> 15:33.543
for your application.