WEBVTT

00:00.090 --> 00:00.990
-: Hey, welcome back.

00:00.990 --> 00:03.270
And so basically in this video we're gonna have a really

00:03.270 --> 00:06.129
long Jupyter Notebook talking about how you can build

00:06.129 --> 00:09.120
custom evaluation metrics for LLMs.

00:09.120 --> 00:11.070
We're gonna explore a couple of different options

00:11.070 --> 00:12.570
that you've got and we'll look

00:12.570 --> 00:13.740
at some of the trade-offs as well.

00:13.740 --> 00:15.510
And there's a lot more text that you can read

00:15.510 --> 00:16.920
inside this Jupyter Notebook.

00:16.920 --> 00:18.780
So the main focus is gonna be just

00:18.780 --> 00:20.070
what are those different techniques

00:20.070 --> 00:21.523
and also maybe some of the trade-offs

00:21.523 --> 00:23.775
and showing you different implementations of those.

00:23.775 --> 00:27.240
So let's start by talking about what are the different ways

00:27.240 --> 00:30.060
that we can evaluate an LLM's response.

00:30.060 --> 00:32.130
The first one we can call something like

00:32.130 --> 00:33.582
programmatic rule-based evaluation.

00:33.582 --> 00:36.194
And basically what you're doing there is creating custom

00:36.194 --> 00:39.748
rules that determine whether or not that result is good.

00:39.748 --> 00:41.785
So if you're thinking about the context

00:41.785 --> 00:43.372
or a task of a blog post,

00:43.372 --> 00:46.413
that could be creating an eval metric like, does it contain

00:46.413 --> 00:49.960
more than a hundred words or is it a certain length?

00:49.960 --> 00:52.860
And obviously, some of the advantages of

00:52.860 --> 00:55.446
that are that it's a quick feedback loop and it's cost-effective.

00:55.446 --> 00:58.236
The disadvantages are that if the task is very nuanced,

00:58.236 --> 01:01.856
then you're maybe not able to actually evaluate the quality

01:01.856 --> 01:05.096
of the LLM output using such a rudimentary metric.

01:05.096 --> 01:08.009
But the second one that's quite interesting is

01:08.009 --> 01:09.996
LLM-based evaluation.

01:09.996 --> 01:11.370
And so what you're really trying

01:11.370 --> 01:13.950
to do here is say, "This task is too nuanced

01:13.950 --> 01:16.290
to create a Python function that would evaluate,

01:16.290 --> 01:17.520
or a series of Python functions

01:17.520 --> 01:18.870
that would evaluate performance."

01:18.870 --> 01:20.572
And so we can use another LLM

01:20.572 --> 01:24.120
after we've produced a result to then evaluate whether

01:24.120 --> 01:25.857
that thing was a good result or not.

01:25.857 --> 01:27.930
And so this is really helpful when you are trying

01:27.930 --> 01:31.093
to do things such as evaluate helpfulness or readability.

01:31.093 --> 01:34.560
Those metrics are too vague and depend too much on the nuances of language.

01:34.560 --> 01:37.817
So we can ask an LLM to generate 10 examples

01:37.817 --> 01:41.970
or 10 outputs, and then we can ask another LLM on top of

01:41.970 --> 01:43.746
that to then rate each of those outputs.

01:43.746 --> 01:48.030
And the third way is basically doing human-based evaluation.

01:48.030 --> 01:50.490
Now, human-based evaluation is basically

01:50.490 --> 01:53.100
where you produce a load of outputs from your LLM

01:53.100 --> 01:55.214
and then you'll ask a human to rate those.

01:55.214 --> 01:57.529
And there's two different approaches to that.

01:57.529 --> 01:59.790
So you can either do like a simple approval

01:59.790 --> 02:01.320
and disapproval system,

02:01.320 --> 02:03.652
or you could say to the human, "Which of these three

02:03.652 --> 02:05.470
do you like the most?"

02:05.470 --> 02:07.795
And obviously with the human-based evals

02:07.795 --> 02:09.690
you get the highest accuracy.

02:09.690 --> 02:12.900
So it's really good for tasks that have a high customer risk

02:12.900 --> 02:14.310
where getting false positives

02:14.310 --> 02:16.828
or false negatives is gonna cause consumer damage.

02:16.828 --> 02:19.837
But the disadvantage is obviously there's a labeling step.

02:19.837 --> 02:21.096
So feedback is a bit slower

02:21.096 --> 02:23.580
and also it's time consuming to do that

02:23.580 --> 02:24.858
and there's a cost associated with that.

02:24.858 --> 02:27.883
And so it's really best suited to high-stakes tasks.

02:27.883 --> 02:31.380
And so that's like a brief overview of those three

02:31.380 --> 02:33.360
different types of methodologies that you might decide

02:33.360 --> 02:35.646
to use when you are creating these custom eval metrics

02:35.646 --> 02:37.800
for your LLMs.

02:37.800 --> 02:39.300
Now I wanna just go through these,

02:39.300 --> 02:42.240
and there's also a little bit

02:42.240 --> 02:44.280
of text here, if you want to read it, about the fact

02:44.280 --> 02:46.290
that you can start off with one eval strategy

02:46.290 --> 02:47.820
and then move on to another one.

02:47.820 --> 02:49.376
And you can also minimize human error,

02:49.376 --> 02:52.680
specifically on the last one when we're talking about doing

02:52.680 --> 02:55.260
human labeling, by maybe getting three evaluators

02:55.260 --> 02:57.750
to rate the examples rather than just

02:57.750 --> 02:59.400
relying on one person's labeling.

02:59.400 --> 03:01.050
So there's a couple of different tricks

03:01.050 --> 03:02.413
and tips you can do there.
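
NOTE
Editor's sketch, not code from the notebook: one simple way to combine three evaluators' ratings is a majority vote. The function name majority_label and the sample ratings are hypothetical, and the ratings assume 1 = approve, 0 = disapprove.
```python
from collections import Counter
def majority_label(labels):
    """Return the most common label among several evaluators' ratings."""
    counts = Counter(labels)
    label, _ = counts.most_common(1)[0]
    return label
# Hypothetical ratings: each inner list holds three evaluators' votes
# (1 = approve, 0 = disapprove) for one LLM output.
ratings = [[1, 1, 0], [0, 0, 1], [1, 1, 1]]
final_labels = [majority_label(votes) for votes in ratings]
print(final_labels)  # [1, 0, 1]
```
With an odd number of evaluators and binary labels there is never a tie, which is one reason three raters is a convenient choice.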

03:02.413 --> 03:05.222
So let's look at programmatic rule-based evaluation.

03:05.222 --> 03:08.159
So in this, we're setting up pandas,

03:08.159 --> 03:11.805
we're using NumPy and this TQDM package.

03:11.805 --> 03:13.650
Now we've got some LLM results.

03:13.650 --> 03:15.390
I just thought I'd hard code some of these.

03:15.390 --> 03:17.010
And what we're trying to do here is produce

03:17.010 --> 03:18.810
a Twitter or an X post.

03:18.810 --> 03:20.340
And so what we wanna do is we wanna

03:20.340 --> 03:24.360
evaluate these results from the LLM and we wanna see whether

03:24.360 --> 03:26.193
or not these are useful.

03:26.193 --> 03:28.694
Now one thing that is obvious is,

03:28.694 --> 03:30.873
maybe we wanna see whether the post has a hashtag

03:30.873 --> 03:33.539
and we also might want a suitable length.

03:33.539 --> 03:37.020
Like if the X post is too short,

03:37.020 --> 03:39.082
it's probably not

03:39.082 --> 03:40.447
a good post.

03:40.447 --> 03:43.380
So we'll first just load those into a pandas DataFrame

03:43.380 --> 03:45.840
and then we've got the social post DF data frame.

03:45.840 --> 03:48.060
So you can have a look there and you've got the text in this

03:48.060 --> 03:50.340
generated social media post column.

03:50.340 --> 03:53.010
Now the first thing we're gonna do is we're gonna set up

03:53.010 --> 03:54.180
our two eval metrics.

03:54.180 --> 03:56.220
And just as I was saying earlier, let's say

03:56.220 --> 03:58.445
that we'll define a good social media post as one

03:58.445 --> 04:00.990
that must contain at least one hashtag

04:00.990 --> 04:03.630
and must be greater than or equal to 30 characters,

04:03.630 --> 04:05.370
but not more than 200 characters.

04:05.370 --> 04:07.436
So we can set up these two individual eval functions

04:07.436 --> 04:10.350
that take in the LLM text that is produced.

04:10.350 --> 04:12.944
And then we're gonna run that through a series of steps.

04:12.944 --> 04:15.126
Now I've also got a nice helper function here

04:15.126 --> 04:17.670
that's just gonna take a Boolean result in.

04:17.670 --> 04:19.138
And if the Boolean result is true,

04:19.138 --> 04:22.050
then we're gonna return one, else we're gonna return zero

04:22.050 --> 04:23.856
and that'll be quite nice for doing some counting

04:23.856 --> 04:26.880
when we come to calculating the accuracy scores.

04:26.880 --> 04:28.200
But we'll run those functions

04:28.200 --> 04:31.020
and then we're gonna set up these eval functions inside

04:31.020 --> 04:32.430
of a Python dictionary.

04:32.430 --> 04:34.828
And then we're just gonna loop through all of the rows.

04:34.828 --> 04:37.020
And for each individual eval function,

04:37.020 --> 04:39.990
we're gonna figure out what eval function we need to call.

04:39.990 --> 04:41.740
And then we're gonna just pass in that text,

04:41.740 --> 04:43.161
get the eval result.

04:43.161 --> 04:46.620
So this is basically saying like for each individual row

04:46.620 --> 04:50.109
of the social media text, let's run a function on the text,

04:50.109 --> 04:52.554
get the result of that which will be a Boolean value,

04:52.554 --> 04:55.501
and then we're gonna pass that into our helper function,

04:55.501 --> 04:57.204
which is gonna take that Boolean result

04:57.204 --> 05:00.510
and basically update that specific column in

05:00.510 --> 05:03.060
that specific cell with either zero or one.
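
NOTE
Editor's sketch of the steps just described, under some assumptions: it uses plain Python dicts rather than the notebook's pandas DataFrame to stay dependency-free, the names eval_has_hashtag, eval_length and bool_to_int are guesses at the notebook's naming, and the two sample posts are made up.
```python
def eval_has_hashtag(text: str) -> bool:
    # Rule 1: the post must contain at least one hashtag.
    return "#" in text
def eval_length(text: str) -> bool:
    # Rule 2: at least 30 characters, but not more than 200.
    return 30 <= len(text) <= 200
def bool_to_int(result: bool) -> int:
    # Helper: 1 for True, 0 for False, which makes counting easy later.
    return 1 if result else 0
eval_functions = {"eval_has_hashtag": eval_has_hashtag, "eval_length": eval_length}
# Hypothetical posts, stored as plain dicts instead of DataFrame rows.
posts = [
    {"text": "I love my dog"},
    {"text": "Just shipped a new Python release, feeling great about it! #coding"},
]
for post in posts:
    for name, func in eval_functions.items():
        # Run each rule on the text and store the 0/1 result under its name.
        post[name] = bool_to_int(func(post["text"]))
print(posts)
```
Keeping the rules in a dictionary means a new metric is just one more entry, and the loop itself never has to change.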

05:03.060 --> 05:04.260
So let's go and have a look at this.

05:04.260 --> 05:06.237
So you can now see we've run that

05:06.237 --> 05:09.030
and we've got the first five rows here

05:09.030 --> 05:10.530
using the dot head method.

05:10.530 --> 05:13.254
And you can see like we've got this eval has hashtag column,

05:13.254 --> 05:15.600
so you can see like the first result doesn't have a hashtag,

05:15.600 --> 05:18.480
but we know that these ones actually do have hashtags.

05:18.480 --> 05:20.333
And similarly we've also evaluated on the length

05:20.333 --> 05:21.874
of a social media post.

05:21.874 --> 05:25.005
And so all of them in the first five results

05:25.005 --> 05:26.580
are an adequate length,

05:26.580 --> 05:28.440
but we've failed on that first one.

05:28.440 --> 05:31.470
So the great thing about these individual eval functions

05:31.470 --> 05:33.930
is we don't have a cost associated with these.

05:33.930 --> 05:35.760
They're very similar to unit tests in the sense

05:35.760 --> 05:37.375
of we can run them very quickly,

05:37.375 --> 05:38.814
we've got a fast feedback loop,

05:38.814 --> 05:40.320
which is really, really good.

05:40.320 --> 05:43.101
So let's move on and see how can you calculate the accuracy

05:43.101 --> 05:46.500
for each evaluation column within the Pandas data frame.

05:46.500 --> 05:49.350
So we're just gonna do a list comprehension

05:49.350 --> 05:51.150
and just check for this eval,

05:51.150 --> 05:53.070
which will just give us all the eval columns

05:53.070 --> 05:54.210
and we can just have a look at that.

05:54.210 --> 05:56.130
So that's just getting all the eval columns.

05:56.130 --> 05:58.110
And then we're gonna set up a Python dictionary

05:58.110 --> 06:00.060
called eval accuracy results.

06:00.060 --> 06:01.740
We'll loop through every eval column

06:01.740 --> 06:02.940
and then we'll just get that.

06:02.940 --> 06:05.310
And then what we're gonna do is we're gonna sum up

06:05.310 --> 06:07.890
the number of ones in there against the length

06:07.890 --> 06:09.150
of that individual column.

06:09.150 --> 06:11.160
And then we're gonna just save the result of that

06:11.160 --> 06:13.860
to a Python dictionary and then we can print out that.
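
NOTE
Editor's sketch of the accuracy calculation, assuming plain dicts in place of the DataFrame. The sample rows are invented, and matching columns on the "eval" prefix is an assumption about how the notebook selects them.
```python
# Hypothetical rows with 0/1 results already filled in by the eval functions.
rows = [
    {"text": "post a", "eval_has_hashtag": 1, "eval_length": 1},
    {"text": "post b", "eval_has_hashtag": 1, "eval_length": 0},
    {"text": "post c", "eval_has_hashtag": 0, "eval_length": 0},
    {"text": "post d", "eval_has_hashtag": 1, "eval_length": 1},
]
# Pick out every column whose name starts with "eval".
eval_columns = [col for col in rows[0] if col.startswith("eval")]
eval_accuracy_results = {}
for col in eval_columns:
    values = [row[col] for row in rows]
    # Accuracy = number of ones divided by the number of rows.
    eval_accuracy_results[col] = sum(values) / len(values)
print(eval_accuracy_results)  # {'eval_has_hashtag': 0.75, 'eval_length': 0.5}
```
Because the results are stored as 0 and 1, summing a column counts the passes directly.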

06:13.860 --> 06:16.653
So in our case, we've got around about 77%

06:16.653 --> 06:21.540
eval accuracy on the social media posts having a hashtag.

06:21.540 --> 06:24.209
And we've got around 44% on the eval length for,

06:24.209 --> 06:26.640
like, the actual social post itself.

06:26.640 --> 06:30.608
Different accuracy ratings based on different eval metrics.

06:30.608 --> 06:33.210
Now just before we move on to the other strategies,

06:33.210 --> 06:34.830
I think it's worthwhile talking about the fact

06:34.830 --> 06:36.584
that we might wanna take a blend

06:36.584 --> 06:38.850
of these two different types of eval metrics.

06:38.850 --> 06:40.290
So we might wanna look at the mean,

06:40.290 --> 06:41.563
or we might want to say,

06:41.563 --> 06:43.830
"Let's actually weight these differently.

06:43.830 --> 06:45.960
I care more about one metric than the other one."

06:45.960 --> 06:48.120
So let's just go and have a look at how we could do that.

06:48.120 --> 06:50.520
So in this scenario, we've just got our Python dictionary,

06:50.520 --> 06:52.620
we're just gonna take the NumPy dot mean

06:52.620 --> 06:55.560
of these two values and that gives us a mean accuracy of

06:55.560 --> 06:57.101
around 61%.

06:57.101 --> 07:00.570
But we could also actually decide,

07:00.570 --> 07:03.872
I really care more about my LLM outputs having a hashtag.

07:03.872 --> 07:06.037
And so what we could then do is we could say,

07:06.037 --> 07:07.800
"Well I'm gonna say that the weight of that one

07:07.800 --> 07:09.260
should be 70% and the weight

07:09.260 --> 07:11.280
of the other one should be 30%,"

07:11.280 --> 07:13.252
and then I can just get a weighted average accuracy.

07:13.252 --> 07:15.846
So my accuracy's gone up to around 68%

07:15.846 --> 07:18.930
and you could play around with these weights depending upon

07:18.930 --> 07:20.203
what's more important for you.
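
NOTE
Editor's sketch of the two blends, using the standard library instead of NumPy. The values 7/9 and 4/9 are assumptions, picked only because they are consistent with the rough 77% and 44% figures quoted above. The metric names are also assumed.
```python
from statistics import mean
# Assumed values: 7 of 9 and 4 of 9 posts passing (see the note above).
eval_accuracy_results = {"eval_has_hashtag": 7 / 9, "eval_length": 4 / 9}
# Unweighted blend: the plain mean of the two metrics.
mean_accuracy = mean(list(eval_accuracy_results.values()))
# Weighted blend: 70% for the hashtag metric, 30% for the length metric.
weights = {"eval_has_hashtag": 0.7, "eval_length": 0.3}
weighted_accuracy = sum(eval_accuracy_results[name] * weights[name] for name in weights)
print(round(mean_accuracy, 2), round(weighted_accuracy, 2))  # 0.61 0.68
```
The weights here sum to 1, so no extra normalization is needed. If they did not, dividing by the sum of the weights would give the weighted average.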

07:20.203 --> 07:22.380
So there's a couple of different approaches there

07:22.380 --> 07:24.895
and obviously you could have multiple eval metrics.

07:24.895 --> 07:28.230
So let's move on to the second strategy,

07:28.230 --> 07:30.690
which is using another LLM.

07:30.690 --> 07:33.240
After you've got some results from your LLM,

07:33.240 --> 07:35.893
you're gonna use another LLM to figure out whether

07:35.893 --> 07:37.805
or not the thing is valid

07:37.805 --> 07:40.440
and then you're gonna compare that against your true labels.

07:40.440 --> 07:42.230
The first thing that I just wanted to say is,

07:42.230 --> 07:43.410
in this example

07:43.410 --> 07:46.060
we're gonna be comparing against whether

07:46.060 --> 07:50.563
those previous social media posts are on the topic of coding

07:50.563 --> 07:52.200
or software engineering.

07:52.200 --> 07:54.030
And so the first thing that we need to do is we need

07:54.030 --> 07:57.390
to add on the social media labels, like the true labels for,

07:57.390 --> 07:59.293
is that specific social media post

07:59.293 --> 08:02.040
actually talking about software engineering or coding?

08:02.040 --> 08:03.270
That specific topic.

08:03.270 --> 08:05.700
And you'll see, so what we do here is if I then print out

08:05.700 --> 08:06.964
this social post df,

08:06.964 --> 08:10.050
what you'll see is, when I scroll back up,

08:10.050 --> 08:12.002
so like for example, when I say "I love my dogs,

08:12.002 --> 08:14.700
I love my dog," that's definitely not

08:14.700 --> 08:15.990
talking about software, right?

08:15.990 --> 08:18.451
But like all the other ones we've sort of said "Yes, give

08:18.451 --> 08:20.490
that a 1 because it's definitely

08:20.490 --> 08:21.660
talking about software," right?

08:21.660 --> 08:23.493
So we've got a couple of posts here

08:23.493 --> 08:25.648
that like really aren't talking about software

08:25.648 --> 08:28.230
and we've just added the true labels to those

08:28.230 --> 08:30.857
because we still need to reference whether

08:30.857 --> 08:32.887
or not the evaluation metric was right.

08:32.887 --> 08:35.475
And so we're still generating some true labels

08:35.475 --> 08:37.440
and at the moment we're doing that manually.

08:37.440 --> 08:39.930
And now that we've got the true labels,

08:39.930 --> 08:41.400
we can then ask ChatGPT.

08:41.400 --> 08:42.892
So we're gonna ask GPT-4.

08:42.892 --> 08:44.910
So we set up a chat model,

08:44.910 --> 08:46.220
we'll set up the prompt template

08:46.220 --> 08:48.090
and notice we're passing in things

08:48.090 --> 08:50.697
like the format instructions and we're talking about

08:50.697 --> 08:53.280
what the generated social media post is gonna be.

08:53.280 --> 08:55.221
So that's the post to analyze.

08:55.221 --> 08:57.030
And we've also got like the topic

08:57.030 --> 08:58.246
that we want to classify against.

08:58.246 --> 09:01.000
And so after we've set up that prompt template,

09:01.000 --> 09:02.635
the next bit of the puzzle

09:02.635 --> 09:05.250
is making a social media post classifier.

09:05.250 --> 09:07.602
And so we've got this "is topic" field, which is an integer

09:07.602 --> 09:09.024
and it defaults to zero.

09:09.024 --> 09:11.410
And we've described this field as a classification result:

09:11.410 --> 09:14.075
whether the result is identified against a known topic,

09:14.075 --> 09:16.650
one for yes and zero for no.

09:16.650 --> 09:18.390
So we're telling the LLM,

09:18.390 --> 09:21.810
give us a 1 if it is gonna be the coding topic

09:21.810 --> 09:24.210
and give us a 0 if it's not the coding topic.

09:24.210 --> 09:26.810
And then we basically create that output parser

09:26.810 --> 09:30.030
and we create a LangChain LCEL chain.

09:30.030 --> 09:32.470
And then we're gonna set up a results Python list

09:32.470 --> 09:34.110
and we're gonna say this is the topic, right?

09:34.110 --> 09:36.510
So coding, software or data science,

09:36.510 --> 09:38.580
we loop through each individual row

09:38.580 --> 09:41.190
and basically what we have is we're passing in the topic,

09:41.190 --> 09:43.860
the output parser's format instructions

09:43.860 --> 09:45.870
and then also the social media post.

09:45.870 --> 09:47.730
And we're just gonna wait for that to run.

09:47.730 --> 09:49.680
And obviously what you can see here is

09:49.680 --> 09:52.320
that when we're using LLM based evaluation,

09:52.320 --> 09:54.660
there's obviously a latency effect here

09:54.660 --> 09:57.441
where you can't get those quick evaluation metrics

09:57.441 --> 09:59.463
and also there's a cost as well.

09:59.463 --> 10:00.847
So we've then added that on.

10:00.847 --> 10:03.998
So we can see, we call this eval is coding topic.

10:03.998 --> 10:07.148
And you can see here for example, that it's very accurate

10:07.148 --> 10:10.711
but it has actually made a mistake on this last one

10:10.711 --> 10:13.088
where we know that it's not a coding topic

10:13.088 --> 10:15.749
but GPT-4 has decided it is.

10:15.749 --> 10:19.560
So what we do is we get the true labels in one variable

10:19.560 --> 10:21.300
and the eval results in another

10:21.300 --> 10:23.924
and we just create a count variable starting at zero

10:23.924 --> 10:27.060
and then we loop through social post DF

10:27.060 --> 10:31.070
and we just check, for this row, does one match the other?

10:31.070 --> 10:32.837
So we're just checking row by row,

10:32.837 --> 10:35.340
are these things the same?

10:35.340 --> 10:37.241
And if they are the same, then we increment the count

10:37.241 --> 10:39.390
and then what we do is we take the count

10:39.390 --> 10:41.670
and we divide that by the length of the true labels

10:41.670 --> 10:43.761
and then we can see, so GPT-4

10:43.761 --> 10:47.361
is getting a rough accuracy of about 89%

10:47.361 --> 10:49.170
against the true labels.
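
NOTE
Editor's sketch of the row-by-row comparison against the true labels. The two label lists are invented for illustration (1 = about coding, 0 = not about coding) and do not reproduce the notebook's data or its 89% result.
```python
# Hypothetical labels: 1 = about coding, 0 = not about coding.
true_labels = [0, 1, 1, 1, 0]
eval_results = [0, 1, 1, 1, 1]  # the LLM judge gets the last one wrong
count = 0
for true_label, predicted in zip(true_labels, eval_results):
    # Count the rows where the judge agrees with the human label.
    if true_label == predicted:
        count += 1
accuracy = count / len(true_labels)
print(accuracy)  # 0.8
```
The last row is a false positive: the judge says "coding" when the human label says it is not, which is the kind of mistake described above.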

10:49.170 --> 10:50.400
But the disadvantage

10:50.400 --> 10:52.350
of this strategy is obviously we still had

10:52.350 --> 10:54.000
to add the labels manually.

10:54.000 --> 10:55.770
Now that can be quite tedious.

10:55.770 --> 10:57.090
So we can take it a step further

10:57.090 --> 10:59.638
and we can say maybe let's use GPT-4 to

10:59.638 --> 11:02.459
synthetically generate the right answer.

11:02.459 --> 11:06.210
And we could go in and change those synthetic answers later

11:06.210 --> 11:08.797
to catch various false positives or false negatives.

11:08.797 --> 11:11.770
But we could use GPT-4 to create synthetic data.

11:11.770 --> 11:14.790
And then after that we could use something like

11:14.790 --> 11:17.114
GPT 3.5 to evaluate against

11:17.114 --> 11:20.301
that ground truth labeled data.

11:20.301 --> 11:22.410
So I've got some transactions here

11:22.410 --> 11:23.970
and we're just gonna load those in.

11:23.970 --> 11:26.460
And then what we're gonna do is look at these.

11:26.460 --> 11:29.010
So you've got like a cash deposit at a local branch,

11:29.010 --> 11:30.794
you've got like "withdrew money for rent payment"

11:30.794 --> 11:32.400
and we've got like a "withdrew cash"

11:32.400 --> 11:34.680
and we've got various different pieces of text

11:34.680 --> 11:36.753
that represent what happened in a bank transaction.

11:36.753 --> 11:39.120
And so what we now do is we've got pretty much

11:39.120 --> 11:40.470
the same kind of strategy.

11:40.470 --> 11:42.787
We've got GPT-4 in JSON mode,

11:42.787 --> 11:45.510
telling it that you can analyze bank transactions

11:45.510 --> 11:47.087
and here's the transaction to analyze.

11:47.087 --> 11:50.520
We set up our prompt template and, as well as

11:50.520 --> 11:52.290
that, we've got a different Pydantic model,

11:52.290 --> 11:54.150
which is getting the transaction type

11:54.150 --> 11:56.310
and also getting the transaction category.

11:56.310 --> 11:58.410
Now we pass that to the output parser

11:58.410 --> 11:59.883
and we set up our LCEL chain

11:59.883 --> 12:02.250
and then again, we're creating the results

12:02.250 --> 12:03.660
and then looping through the data frame,

12:03.660 --> 12:04.740
we're then appending the results.

12:04.740 --> 12:06.420
So I'm just gonna leave this to run now

12:06.420 --> 12:08.916
and that's gonna take a bit longer 'cause it's using GPT-4.

12:08.916 --> 12:10.980
But the important point here is that you want

12:10.980 --> 12:12.504
to use a very powerful model

12:12.504 --> 12:14.520
to create these ground truth labels

12:14.520 --> 12:15.775
'cause you don't want to create labels

12:15.775 --> 12:17.238
that aren't actually correct.

12:17.238 --> 12:20.242
And the other trick here is that when we come to evaluating

12:20.242 --> 12:23.810
against these ground truth labels, we're going

12:23.810 --> 12:26.882
to use a cheaper model, a less performant model,

12:26.882 --> 12:29.526
because that's not gonna get 100% of it right.

12:29.526 --> 12:32.610
And that allows us to then work on optimizing the prompt

12:32.610 --> 12:35.430
and the point is that,

12:35.430 --> 12:37.436
if we can get a smaller model working well

12:37.436 --> 12:39.210
through optimizing the prompt

12:39.210 --> 12:41.160
or optimizing the retrieval, then

12:41.160 --> 12:43.110
that means we can take all those learnings

12:43.110 --> 12:45.120
and put it back into a more powerful model.

12:45.120 --> 12:47.400
So in this scenario we've got two things.

12:47.400 --> 12:49.020
We wanna store the transaction type

12:49.020 --> 12:50.303
and the transaction categories.

12:50.303 --> 12:52.963
So we're gonna just loop through the results and store that.

12:52.963 --> 12:55.181
And so now we've got our ground truth labels

12:55.181 --> 12:56.390
on our transactions.

12:56.390 --> 12:59.300
What we now need to do is go

12:59.300 --> 13:03.870
and test the accuracy of GPT 3.5 turbo against that.

13:03.870 --> 13:05.700
So it's pretty much exactly the same

13:05.700 --> 13:07.560
but we're just switching out the model here.

13:07.560 --> 13:10.397
And again, we're using this GPT 3.5 turbo model

13:10.397 --> 13:11.687
and creating our chain.

13:11.687 --> 13:13.335
And then we're just basically, yeah,

13:13.335 --> 13:15.247
just getting the transaction description,

13:15.247 --> 13:17.733
and then if there is an error from the parsing,

13:17.733 --> 13:19.410
we're just basically saying

13:19.410 --> 13:20.940
the transaction type should be null

13:20.940 --> 13:21.773
so we didn't get that one.
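
NOTE
Editor's sketch of the "null on parse error" idea. The notebook uses a LangChain output parser, while this stand-in uses the standard json module so it runs without any API key. The function name parse_transaction, the field names and the sample strings are hypothetical, and setting both fields to None on failure is an assumption.
```python
import json
def parse_transaction(raw_output: str) -> dict:
    """Parse a model's JSON reply, falling back to None values on failure."""
    try:
        data = json.loads(raw_output)
        return {
            "transaction_type": data["transaction_type"],
            "transaction_category": data["transaction_category"],
        }
    except (json.JSONDecodeError, KeyError, TypeError):
        # Parsing failed, so record nulls for this row instead of crashing.
        return {"transaction_type": None, "transaction_category": None}
# Hypothetical raw model outputs: one valid, one malformed.
good = parse_transaction('{"transaction_type": "Deposit", "transaction_category": "Cash"}')
bad = parse_transaction("Sorry, I cannot classify this.")
print(good, bad)
```
A None value never equals a ground truth label, so a failed parse simply counts as a miss when the accuracy is worked out.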

13:21.773 --> 13:23.370
And then we do the exact same thing here.

13:23.370 --> 13:24.968
So getting those results

13:24.968 --> 13:27.360
and then basically printing those out.

13:27.360 --> 13:28.860
So let's leave that to run.

13:28.860 --> 13:30.748
And again, you can see there's a slight latency

13:30.748 --> 13:31.666
involved in this,

13:31.666 --> 13:35.040
so that will take about maybe 10, 15 seconds.

13:35.040 --> 13:36.075
So we'll just wait for that.

13:36.075 --> 13:39.210
So whilst that's running, we're saving the data

13:39.210 --> 13:42.125
to this GPT 3.5 transaction type column

13:42.125 --> 13:45.406
and also the GPT 3.5 transaction category column.

13:45.406 --> 13:48.930
And we are gonna compare those two columns against the

13:48.930 --> 13:51.270
ground truth columns, the ground truth labeled columns

13:51.270 --> 13:52.200
that GPT-4 created.

13:52.200 --> 13:53.790
Now let's go

13:53.790 --> 13:55.770
and have a look at the code whilst this is finishing up.

13:55.770 --> 13:58.561
So we've got a transaction type counter

13:58.561 --> 14:00.719
and a transaction category counter.

14:00.719 --> 14:03.030
And so we're gonna loop through each row

14:03.030 --> 14:05.310
of the data frame and we're gonna see, okay,

14:05.310 --> 14:07.950
does the GPT 3.5 transaction type

14:07.950 --> 14:09.660
equal the transaction type in that row,

14:09.660 --> 14:11.730
and the same for the transaction category,

14:11.730 --> 14:13.002
and we're gonna increment those counters

14:13.002 --> 14:15.570
and then we can work out what the accuracy is.

14:15.570 --> 14:19.172
So GPT 3.5 is giving us a transaction type accuracy

14:19.172 --> 14:24.172
of about 0.9, so roughly 90%, against synthetically created data from GPT-4.

14:24.417 --> 14:27.850
And we're getting around 75% accuracy on

14:27.850 --> 14:29.924
the transaction category.

14:29.924 --> 14:32.730
Now let's go and have a look at the third type,

14:32.730 --> 14:34.438
which is human-based evaluation.

14:34.438 --> 14:36.420
So obviously that's where you would get

14:36.420 --> 14:38.510
some immediate feedback from your team,

14:38.510 --> 14:40.590
and they would do the labeling manually

14:40.590 --> 14:42.518
or you would use a cloud vendor solution

14:42.518 --> 14:45.900
like Amazon, who have got Amazon Mechanical Turk

14:45.900 --> 14:48.250
and they've got a data labeling service, and GCP

14:48.250 --> 14:51.090
and Azure have also got data labeling services.

14:51.090 --> 14:53.370
So if you're not

14:53.370 --> 14:55.881
specifically interested in waiting a long time,

14:55.881 --> 14:58.380
then you could pay for an API to get a lot

14:58.380 --> 14:59.940
of the data labeling done for you.

14:59.940 --> 15:01.890
And that would help speed up the feedback loop.

15:01.890 --> 15:03.403
But let's dive in and have a look.

15:03.403 --> 15:06.047
A good example is maybe we're generating some images

15:06.047 --> 15:07.502
and it's for a mortgage offer

15:07.502 --> 15:09.266
and we want to personalize this

15:09.266 --> 15:11.909
and we wanna make sure that each of the images is good.

15:11.909 --> 15:15.330
So what we're gonna do is just use the OpenAI package

15:15.330 --> 15:18.210
and we're gonna generate images using DALL-E 3

15:18.210 --> 15:20.460
and then we're gonna set up our standard chat model

15:20.460 --> 15:22.369
and then we're gonna tell it, "I want you to generate

15:22.369 --> 15:26.298
a really good image prompt using some customer information

15:26.298 --> 15:27.930
and customer profile."

15:27.930 --> 15:29.240
And we've got a couple of different requirements: we want

15:29.240 --> 15:31.335
to make sure that the visual prompt is concise,

15:31.335 --> 15:33.630
to avoid including any sensitive information

15:33.630 --> 15:34.890
in the prompt, et cetera.

15:34.890 --> 15:37.650
So we're really trying to get some good housing images

15:37.650 --> 15:39.390
for our mortgage deal for the bank.

15:39.390 --> 15:40.680
So what we now have

15:40.680 --> 15:43.050
as well is we've got a Python function here

15:43.050 --> 15:45.210
that's gonna take in that visual prompt

15:45.210 --> 15:48.390
that it generates and it's gonna generate an image

15:48.390 --> 15:49.410
using DALL-E.

15:49.410 --> 15:52.440
And it's also gonna save it to this out directory as well.

15:52.440 --> 15:55.018
So that's gonna generate the image using DALL-E 3.

15:55.018 --> 15:57.690
And then we're gonna get the JSON version of that,

15:57.690 --> 15:58.977
the base64 JSON,

15:58.977 --> 16:00.549
and we're gonna download that image

16:00.549 --> 16:04.230
and save that and decode the base64 image

16:04.230 --> 16:05.445
and return the image path.
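
NOTE
Editor's sketch of just the decode-and-save step, with no call to the image API. The function name save_b64_image is hypothetical, the payload is a stand-in for the base64 string an image API would return, and a temporary directory stands in for the notebook's out folder.
```python
import base64
import tempfile
from pathlib import Path
def save_b64_image(b64_json: str, out_dir: str, filename: str) -> str:
    """Decode a base64 string and write the bytes to out_dir/filename."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    image_path = out_path / filename
    image_path.write_bytes(base64.b64decode(b64_json))
    return str(image_path)
# Stand-in payload: real image bytes would come from the image API response.
fake_b64 = base64.b64encode(b"not a real image").decode("ascii")
demo_dir = tempfile.mkdtemp()
image_path = save_b64_image(fake_b64, demo_dir, "example.png")
print(image_path)
```
Returning the path rather than the bytes keeps the chain's output small, and the path can later be stored alongside each customer's record.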

16:05.445 --> 16:07.749
And so we set up our image chain.

16:07.749 --> 16:10.440
This one is taking a customer information

16:10.440 --> 16:11.580
and a customer profile.

16:11.580 --> 16:12.960
So we need to pass both of those

16:12.960 --> 16:16.140
when we invoke this LangChain LCEL chain.

16:16.140 --> 16:17.398
And that gets passed to the image prompt,

16:17.398 --> 16:19.310
which then creates the image,

16:19.310 --> 16:22.550
and then we have a prompt which gets piped into a string,

16:22.550 --> 16:25.380
and then that string, so that visual prompt

16:25.380 --> 16:27.643
then gets injected into that Python function.

16:27.643 --> 16:29.070
And then from there then

16:29.070 --> 16:31.020
what you can do is generate an image.

16:31.020 --> 16:31.853
Cool.

16:31.853 --> 16:32.730
So we've got that set up

16:32.730 --> 16:36.139
and just to save some time, I've created some good profiles.

16:36.139 --> 16:37.680
So these are some examples

16:37.680 --> 16:39.960
of different things that we'd expect to see

16:39.960 --> 16:40.982
in the outputted images.

16:40.982 --> 16:42.660
And you've got like the profile

16:42.660 --> 16:43.839
and the information about that.

16:43.839 --> 16:45.930
And so we've got a Python list of some customers

16:45.930 --> 16:47.779
and various things that we want to see.

16:47.779 --> 16:50.635
So we're just gonna loop through and generate some of these

16:50.635 --> 16:52.200
and then we're gonna store those.

16:52.200 --> 16:54.126
So you'll see a couple of different things going on here

16:54.126 --> 16:56.624
where we've got that image chain that we're invoking

16:56.624 --> 16:59.490
and we're passing in the customer information.

16:59.490 --> 17:02.160
So that's that profile information and the profile.

17:02.160 --> 17:03.360
So that's just coming from each

17:03.360 --> 17:04.770
individual Python dictionary.

17:04.770 --> 17:07.512
You've got the profile key here, the information, notice

17:07.512 --> 17:09.820
how we're lining it up with what the

17:09.820 --> 17:12.960
LangChain prompt template expects, right?

17:12.960 --> 17:14.667
So if I scroll back up, whilst that's running, you'll see

17:14.667 --> 17:17.312
that it expects a customer information

17:17.312 --> 17:20.880
and a customer profile input into that prompt.

17:20.880 --> 17:21.713
Cool.

17:21.713 --> 17:23.670
So that's now running, it's gonna take about a minute

17:23.670 --> 17:26.273
to run 'cause we're doing just a couple of customers.

17:26.273 --> 17:28.680
We're also saving the image path,

17:28.680 --> 17:31.530
which is the output from this final chain.

17:31.530 --> 17:33.890
We can see that that's the output by, if I scroll up here,

17:33.890 --> 17:35.363
this generate image,

17:35.363 --> 17:37.935
the final thing it returns is an image path

17:37.935 --> 17:41.070
and that'll be inside the out folder, right?

17:41.070 --> 17:43.260
So then what I'm gonna do is I'm gonna scroll back down

17:43.260 --> 17:44.449
and this will finish up in a second,

17:44.449 --> 17:46.606
but now that we've got those images,

17:46.606 --> 17:49.435
what we wanna do is we want to manually evaluate that

17:49.435 --> 17:51.976
and that's where the human will start coming into the loop.

17:51.976 --> 17:54.300
So you'll actually generate a load of results

17:54.300 --> 17:55.830
and it might not necessarily be images,

17:55.830 --> 17:57.754
it could be text, it could be video content.

17:57.754 --> 18:00.720
Your evaluation is not necessarily tied

18:00.720 --> 18:02.400
to a specific content format,

18:02.400 --> 18:04.170
but in this example we've done images.

18:04.170 --> 18:05.534
But this same process would work

18:05.534 --> 18:09.000
the exact same way with text: you generate

18:09.000 --> 18:13.126
your LLM results or your AI results using generative AI.

18:13.126 --> 18:15.930
And then what we do is we've got like a nice interactive

18:15.930 --> 18:18.390
approval system, a couple of different functions here,

18:18.390 --> 18:21.480
like on the button clicked, it's gonna run some code,

18:21.480 --> 18:23.512
and it's also gonna update the response as well.

18:23.512 --> 18:26.070
And that will be updating the data frame.

18:26.070 --> 18:27.750
Feel free to take a look at that code if you want,

18:27.750 --> 18:29.880
but you don't really have to fully understand it.

18:29.880 --> 18:32.265
But just know that it's gonna generate some widgets

18:32.265 --> 18:33.827
inside of a Jupyter Notebook.

18:33.827 --> 18:35.550
But let's actually run it and see.

18:35.550 --> 18:38.010
So you can see the image each time,

18:38.010 --> 18:40.003
and I like that image, so I'm gonna give that a thumbs up.

18:40.003 --> 18:42.420
I think this one's maybe a bit too corporate

18:42.420 --> 18:44.610
and I think the text here is hallucinating.

18:44.610 --> 18:47.370
So I'm gonna give that a thumbs down, what it could be,

18:47.370 --> 18:49.380
what it could, new mortgage deal.

18:49.380 --> 18:50.940
And it's spelt mortgage wrong here,

18:50.940 --> 18:52.088
so I'm gonna give it a thumbs down.

18:52.088 --> 18:53.989
And I also don't like this one.

18:53.989 --> 18:55.980
And then I quite like this one.

18:55.980 --> 18:57.463
I think it's good, it's very playful

18:57.463 --> 18:58.980
and there's no text in this.

18:58.980 --> 19:00.230
So I'm gonna give this a thumbs up.

19:00.230 --> 19:02.100
And now you can see here we've got

19:02.100 --> 19:03.660
the number of approved images.

19:03.660 --> 19:06.450
So we've got about two divided by five accuracy.

19:06.450 --> 19:08.280
So you've got around, let's have a look at that.

19:08.280 --> 19:11.454
So it'll be around 40% accuracy out of our images

19:11.454 --> 19:13.126
and we could keep changing the prompts

19:13.126 --> 19:16.187
and changing the different image models to try

19:16.187 --> 19:17.739
and improve the accuracy.
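
NOTE
A minimal sketch of the approval metric described here, under stated assumptions. The function names record_feedback and approval_rate and the image paths are hypothetical; the notebook itself uses Jupyter widgets and a data frame, which this plain-Python version leaves out. The decisions mirror the video's outcome of two approvals out of five images.
```python
def record_feedback(labels: dict, image_path: str, approved: bool) -> None:
    """Store a thumbs up (True) or thumbs down (False), as a button handler might."""
    labels[image_path] = approved
def approval_rate(labels: dict) -> float:
    """Fraction of reviewed items that were approved (0.0 if nothing was reviewed)."""
    if not labels:
        return 0.0
    return sum(labels.values()) / len(labels)
labels = {}
# Hypothetical paths and review order; two approved out of five, as in the video.
decisions = [
    ("out/image_1.png", True),
    ("out/image_2.png", False),
    ("out/image_3.png", False),
    ("out/image_4.png", False),
    ("out/image_5.png", True),
]
for path, approved in decisions:
    record_feedback(labels, path, approved)
rate = approval_rate(labels)
print(f"{sum(labels.values())}/{len(labels)} approved = {rate:.0%}")  # prints "2/5 approved = 40%"
```
Re-running the same review after changing the prompt or the image model gives a second rate to compare against this baseline.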

19:17.739 --> 19:21.540
Now I just want to talk as well about the different ways

19:21.540 --> 19:23.190
that you want to think about,

19:23.190 --> 19:24.929
now that you've got these evaluation metrics,

19:24.929 --> 19:28.410
how you could specifically optimize using any

19:28.410 --> 19:29.445
of those three strategies.

19:29.445 --> 19:32.580
So the first one is obviously optimizing the prompts, right?

19:32.580 --> 19:34.440
Can you add extra information

19:34.440 --> 19:36.145
or use the prompt engineering principles

19:36.145 --> 19:37.587
or add extra context

19:37.587 --> 19:40.020
to the prompt using things like retrieval from vector

19:40.020 --> 19:42.132
databases or Postgres or MongoDB.

19:42.132 --> 19:45.326
So thinking about what can you include into the prompt

19:45.326 --> 19:47.273
to make the prompts more effective.

19:47.273 --> 19:49.440
Now the other thing you could play around with

19:49.440 --> 19:51.270
is the second thing, which is model selection.

19:51.270 --> 19:54.040
So you could use a more powerful model if you're not getting

19:54.040 --> 19:56.871
the level of accuracy or the high level of accuracy

19:56.871 --> 19:57.973
that you're looking for.

19:57.973 --> 20:00.090
The third thing we've already covered is using more

20:00.090 --> 20:01.983
advanced or enhanced retrieval techniques,

20:01.983 --> 20:03.630
which will then feed into the prompt.

20:03.630 --> 20:05.460
So that one's kind of related to number one.

20:05.460 --> 20:08.370
And then number four is using task decomposition.

20:08.370 --> 20:10.428
So rather than actually going

20:10.428 --> 20:14.550
and deciding to put everything inside of one prompt,

20:14.550 --> 20:16.405
maybe a better way is to break down

20:16.405 --> 20:19.735
a very large problem into a series of sub-problems so

20:19.735 --> 20:22.200
that you can then solve that more elegantly

20:22.200 --> 20:24.150
with very fine-tuned prompts.

20:24.150 --> 20:26.516
Very optimized prompts for that specific sub-task.

20:26.516 --> 20:28.620
And the final one is using fine tuning.

20:28.620 --> 20:30.644
So all of these different ways,

20:30.644 --> 20:33.750
these three strategies will provide you with the ability

20:33.750 --> 20:34.902
to run fine tuning.

20:34.902 --> 20:37.590
And you could then use an OpenAI fine tuning job,

20:37.590 --> 20:40.950
or you could fine tune an open source model

20:40.950 --> 20:42.567
by running that inside of AWS

20:42.567 --> 20:44.340
or Azure or Google Cloud Platform.

20:44.340 --> 20:45.570
So there's various different ways

20:45.570 --> 20:46.980
that you can apply fine tuning,

20:46.980 --> 20:49.020
but you could then use your own data

20:49.020 --> 20:51.390
and your own labeling process or your own evaluation metrics

20:51.390 --> 20:54.540
to then optimize and fine tune an LLM.

20:54.540 --> 20:55.805
So hopefully this helps, we tried

20:55.805 --> 20:58.710
to make more of a comprehensive guide this time.

20:58.710 --> 21:00.263
I'll see you in the next lesson.
