WEBVTT

00:00.840 --> 00:02.820
-: Let's walk you through LangWatch.

00:02.820 --> 00:07.170
So LangWatch is an observability tool for LLMs.

00:07.170 --> 00:11.550
What that means is, you can log your API calls to LangWatch

00:11.550 --> 00:14.100
and then you can see some statistics,

00:14.100 --> 00:16.200
and you can do evaluations in here,

00:16.200 --> 00:18.090
annotation, stuff like that.

00:18.090 --> 00:20.400
The thing that I've been finding really useful

00:20.400 --> 00:23.460
with it recently is the workflow stuff.

00:23.460 --> 00:26.340
And this is brand new, so I think it's still in beta

00:26.340 --> 00:30.060
where if you sign up on their website,

00:30.060 --> 00:32.550
then I think you just need to ask them for access

00:32.550 --> 00:34.710
and they should hopefully oblige.

00:34.710 --> 00:37.170
Tell them I sent you, but I have no affiliation.

00:37.170 --> 00:39.360
I just think that the tool is really cool

00:39.360 --> 00:42.090
and it matches the way that I tend

00:42.090 --> 00:43.860
to do this in Jupyter Notebooks,

00:43.860 --> 00:47.340
except you don't need to use actual code, right?

00:47.340 --> 00:49.650
Like, you can do it in this no-code environment

00:49.650 --> 00:50.677
and it's gonna be a lot quicker.

00:50.677 --> 00:52.500
So, I'm just gonna show you how that works.

00:52.500 --> 00:55.068
And so, here we're gonna do something interesting,

00:55.068 --> 00:57.985
we're rating jokes as funny or not.

00:58.920 --> 01:01.980
This is a common one that I use,

01:01.980 --> 01:04.170
and it's creating the workflow here.

01:04.170 --> 01:06.540
There's a few different components you have to think about.

01:06.540 --> 01:09.180
One is the dataset,

01:09.180 --> 01:12.180
the other is an LLM call or LLM signature.

01:12.180 --> 01:14.760
And then there's also retrievers,

01:14.760 --> 01:17.160
like if you're doing RAG, and evaluators,

01:17.160 --> 01:19.590
so drag the evaluator in here,

01:19.590 --> 01:22.164
and we're gonna start getting this set up.

01:22.164 --> 01:25.110
It is just basically like a spreadsheet.

01:25.110 --> 01:26.850
And what we'll do is I'm gonna click in

01:26.850 --> 01:30.003
and edit the dataset and delete that column.

01:30.960 --> 01:32.190
I'm gonna set this up.

01:32.190 --> 01:33.390
The column headers that I have,

01:33.390 --> 01:36.510
so I have a joke, the topic,

01:36.510 --> 01:38.943
and then I have whether it's funny,

01:39.810 --> 01:42.180
and that's a 1 or a 0.

01:42.180 --> 01:44.460
And then I'm just gonna hit Save.

01:44.460 --> 01:47.403
Actually, we just need to give it a name, Funny jokes.

01:48.690 --> 01:50.610
And then you can just add these in manually

01:50.610 --> 01:52.620
or you could upload from CSV.

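NOTE
For reference, the dataset shape described here can be sketched in plain Python. This is only an illustration: the column names joke, topic, and funny come from the walkthrough, but the two example rows and the helper names to_csv and from_csv are made up, and none of this is LangWatch's own API.
```python
import csv
import io
# Column headers from the walkthrough; "funny" is a human rating (1 = funny, 0 = not).
COLUMNS = ["joke", "topic", "funny"]
# Hypothetical rows, invented purely for illustration.
ROWS = [
    {"joke": "I told my computer a joke about UDP, but I'm not sure it got it.", "topic": "networking", "funny": 1},
    {"joke": "Why did the chicken cross the road? To get to the other side.", "topic": "animals", "funny": 0},
]
def to_csv(rows):
    """Serialize rows to CSV text that could be uploaded as a dataset."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
def from_csv(text):
    """Parse CSV text back into rows, converting the rating to an int."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"joke": r["joke"], "topic": r["topic"], "funny": int(r["funny"])} for r in reader]
```
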
01:52.620 --> 01:57.557
So, I'm just gonna go grab this dataset from DSPyUI.

01:59.050 --> 02:00.450
It's the thing I made.

02:00.450 --> 02:02.070
Let's see here.

02:02.070 --> 02:02.970
Checked.

02:02.970 --> 02:04.200
And by the way, I think this is like

02:04.200 --> 02:07.410
a drop-in replacement for using DSPyUI.

02:07.410 --> 02:09.840
I'm not sure I'm gonna use the tool I built now.

02:09.840 --> 02:13.063
It's working the way I expected it to.

02:13.063 --> 02:15.510
I can see here I have a bunch of jokes, the topics,

02:15.510 --> 02:16.800
and then whether they're funny or not.

02:16.800 --> 02:20.190
I actually have an extra row here,

02:20.190 --> 02:22.590
so I'm just gonna delete that row.

02:22.590 --> 02:23.820
Okay, select dataset.

02:23.820 --> 02:27.570
So now we have the data, we have all the inputs and stuff,

02:27.570 --> 02:30.150
and then we can just drag them across to the LLM call.

02:30.150 --> 02:33.630
So, in this case we're doing gpt-4o-mini.

02:33.630 --> 02:35.880
The prompt is, you know,

02:35.880 --> 02:40.880
tell whether a joke is funny or not.

02:41.550 --> 02:44.370
And then we're gonna input the joke.

02:44.370 --> 02:47.510
We're also gonna input the...

02:51.990 --> 02:54.273
And then we're gonna output the funny one.

02:55.950 --> 02:59.670
And actually, we can test this.

02:59.670 --> 03:02.250
You say, knock.

03:02.250 --> 03:03.083
Who's there?

03:07.080 --> 03:08.640
Terrible joke, great.

03:08.640 --> 03:09.717
Okay, thinking.

03:12.830 --> 03:14.478
It actually thinks it's funny.

03:14.478 --> 03:16.320
(laughs) It's not a great prompt, right?

03:16.320 --> 03:18.960
But you can see the execution cost,

03:18.960 --> 03:22.230
and so you can see here some of the observability features

03:22.230 --> 03:24.120
like what the temperature was

03:24.120 --> 03:27.570
and then there's also, like, annotation.

03:27.570 --> 03:30.313
So you could annotate this and say it was good

03:30.313 --> 03:32.685
or it was not good, et cetera.

03:32.685 --> 03:33.753
We'll dig into that.

03:34.650 --> 03:39.090
That's the general setup for how you run an LLM.

03:39.090 --> 03:41.700
But where the magic happens is you can connect your dataset.

03:41.700 --> 03:43.590
What we're gonna do is we're gonna input the joke

03:43.590 --> 03:46.140
and input the topic.

03:46.140 --> 03:50.580
And then what we'll do is we're gonna pull the funny column

03:50.580 --> 03:52.140
across to expected output,

03:52.140 --> 03:54.780
'cause you know that's been human-rated,

03:54.780 --> 03:57.300
so we think we know that, like,

03:57.300 --> 03:59.490
whether we think that joke is funny or not.

03:59.490 --> 04:01.350
And then this is the prediction, right?

04:01.350 --> 04:06.350
Gonna check whether the output matches the expected output.

04:06.930 --> 04:09.471
And that's gonna give us a pass score.

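NOTE
A rough sketch of what an exact-match check and pass score amount to, in plain Python. The function names are hypothetical, and trimming whitespace before comparing is an assumption; LangWatch's evaluator may normalize differently.
```python
def exact_match(output: str, expected: str) -> bool:
    # Assumption: compare after trimming whitespace, so "1 " still matches "1".
    return output.strip() == expected.strip()
def pass_rate(outputs, expected):
    """Fraction of predictions that exactly match the human-rated label."""
    results = [exact_match(o, e) for o, e in zip(outputs, expected)]
    return sum(results) / len(results) if results else 0.0
```
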
04:09.471 --> 04:14.471
We could also pass this on to the end, if we wanted.

04:14.520 --> 04:16.950
Now, what we can do now that we've done this

04:16.950 --> 04:21.597
is we can click Evaluate and then just say initial prompt,

04:21.597 --> 04:23.910
and it creates a new version for us,

04:23.910 --> 04:26.820
and we can test it on just the 20% of the test dataset

04:26.820 --> 04:29.310
or the full dataset if we want.

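NOTE
A minimal sketch of holding out 20% of a dataset as a test set, assuming a seeded random shuffle. How LangWatch actually picks its split is not shown in the video, so the function name and the shuffling strategy are assumptions.
```python
import random
def split_dataset(rows, test_fraction=0.2, seed=0):
    """Return (train, test), holding out roughly test_fraction of the rows."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between evaluation runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction)) if shuffled else 0
    return shuffled[n_test:], shuffled[:n_test]
```
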
04:29.310 --> 04:31.246
So, gonna save and run,

04:31.246 --> 04:33.963
and it's now gonna go and run this with an LLM,

04:34.950 --> 04:37.410
and you can see that we're failing on everything, right?

04:37.410 --> 04:39.060
0% pass rate.

04:39.060 --> 04:41.790
But this makes it really easy to see what went wrong.

04:41.790 --> 04:44.400
You can see, just like the one we saw before,

04:44.400 --> 04:46.560
it's saying that's a classic pun and it's funny,

04:46.560 --> 04:47.393
blah, blah, blah.

04:47.393 --> 04:48.907
It's not giving us the 1s or the 0s.

04:48.907 --> 04:52.200
So, it's pretty obvious what we need to tell it:

04:52.200 --> 04:57.200
we say 0, sorry, 1 if funny, 0 if not.

05:00.420 --> 05:05.420
And now, if we go up to evaluate,

05:09.060 --> 05:11.190
I'm just gonna call that 1 or 0,

05:11.190 --> 05:14.070
and we should hopefully get some passes now.

05:14.070 --> 05:14.903
Here we go.

05:15.840 --> 05:18.360
So, you can see that it's outputting the right answer

05:18.360 --> 05:20.220
pretty much all the time.

05:20.220 --> 05:22.380
About 87% pass rate.

05:22.380 --> 05:24.840
And we can look into the individual ones that didn't pass

05:24.840 --> 05:27.060
and we can make some value judgements here,

05:27.060 --> 05:28.860
which is really helpful.

05:28.860 --> 05:31.770
Cool, so, that's the evaluation.

05:31.770 --> 05:33.330
I think that's quite helpful.

05:33.330 --> 05:35.160
And then you could, in here, you could say

05:35.160 --> 05:37.383
it's really important here.

05:40.110 --> 05:41.790
All right, could do some prompt engineering.

05:41.790 --> 05:43.860
Great, this is emotion prompting.

05:43.860 --> 05:46.563
And then you could go back and evaluate.

05:47.970 --> 05:52.113
Call it, like, emotion prompting, and save and run.

05:53.250 --> 05:55.720
And now, we're gonna get a different output

05:56.850 --> 05:58.410
and see if that actually makes a difference.

05:58.410 --> 06:01.660
Not exactly the same, 85.7%.

06:02.930 --> 06:05.640
So, you know, that didn't make a difference.

06:05.640 --> 06:07.620
Cool. And now we can do prompt engineering.

06:07.620 --> 06:10.170
We can understand what makes a big difference

06:10.170 --> 06:11.003
to the performance.

06:11.003 --> 06:13.050
Okay, so, I think that's pretty cool in itself.

06:13.050 --> 06:14.550
But then the other thing they just added

06:14.550 --> 06:16.620
was that they connected it with DSPy,

06:16.620 --> 06:18.450
which is why I wanted to check it out,

06:18.450 --> 06:21.720
because I made, like, a no-code DSPy tool.

06:21.720 --> 06:24.000
And DSPy lets you do optimization.

06:24.000 --> 06:27.213
So, for example, BootstrapFewShotWithRandomSearch,

06:28.296 --> 06:32.430
and what that does is it adds in a bunch of demonstrations

06:32.430 --> 06:35.730
or few-shot examples to the prompt.

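NOTE
To make the idea concrete, here is a simplified plain-Python sketch of few-shot demonstrations plus random search. It is not DSPy's implementation: the real optimizer also bootstraps demonstrations by running the program and keeping the ones that pass the metric. The function names, the prompt layout, and score_fn (a stand-in for evaluating a candidate on the dataset) are all assumptions.
```python
import random
def build_prompt(instruction, demos, joke, topic):
    """Put the instruction first, then each demonstration, then the new input."""
    parts = [instruction]
    for d in demos:
        parts.append(f"Joke: {d['joke']}\nTopic: {d['topic']}\nFunny: {d['funny']}")
    parts.append(f"Joke: {joke}\nTopic: {topic}\nFunny:")
    return "\n\n".join(parts)
def random_search(trainset, score_fn, num_candidates=5, max_demos=2, seed=0):
    """Try random demo subsets and keep the one with the best score."""
    rng = random.Random(seed)
    # Baseline candidate: the prompt with no demonstrations at all.
    best_demos, best_score = [], score_fn([])
    for _ in range(num_candidates):
        demos = rng.sample(trainset, k=min(max_demos, len(trainset)))
        score = score_fn(demos)
        if score > best_score:
            best_demos, best_score = demos, score
    return best_demos, best_score
```
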
06:35.730 --> 06:38.283
So, I'm just gonna hit Run Optimization here,

06:39.420 --> 06:43.140
and it's gonna take a little bit of time to get,

06:43.140 --> 06:46.743
but it's gonna run DSPy for us in the background.

06:48.330 --> 06:50.850
And then once the optimization is done,

06:50.850 --> 06:53.730
then we're gonna be able to actually go through it

06:53.730 --> 06:56.763
and choose the best version of the prompt at the end.

07:00.420 --> 07:02.340
Okay, so it's running now.

07:02.340 --> 07:05.940
We can see this is a, you know, version where we have

07:05.940 --> 07:08.010
actually increased the score already,

07:08.010 --> 07:11.250
to 93% on this one, which is pretty cool.

07:11.250 --> 07:13.680
You can see which demonstrations are included.

07:13.680 --> 07:15.990
So, this is taken from the dataset

07:15.990 --> 07:18.690
and then it tests whether that works or not.

07:18.690 --> 07:19.860
And then it gives you the score.

07:19.860 --> 07:22.440
So you can see it's trying different candidates

07:22.440 --> 07:24.930
and you can optimize not just the few-shot examples,

07:24.930 --> 07:27.690
but also the prompt itself.

07:27.690 --> 07:29.823
So, I'll just leave that running for a bit.

07:31.747 --> 07:34.290
And we'll see how well that works.

07:34.290 --> 07:35.730
The other thing is you can actually dig

07:35.730 --> 07:38.250
into the LLM calls being made.

07:38.250 --> 07:41.130
So it's making 82 LLM calls each time,

07:41.130 --> 07:44.100
and we can see how much the cost of the step is,

07:44.100 --> 07:46.980
less than 1 cent in order to optimize this prompt.

07:46.980 --> 07:48.573
Here we go, a new best score.

07:50.730 --> 07:53.970
This is pretty good, 100%.

07:53.970 --> 07:54.803
Pretty happy with that.

07:54.803 --> 07:56.340
So that's how that works.

07:56.340 --> 07:59.940
And then you can take this, like,

07:59.940 --> 08:02.790
once it's finished running, it takes a little bit of time,

08:02.790 --> 08:05.460
and you can take this, and then you can apply it.

08:05.460 --> 08:06.877
So you can see here, it says

08:06.877 --> 08:07.830
"Please wait."

08:07.830 --> 08:09.930
But once that's finished, then you can click apply

08:09.930 --> 08:13.710
and then it will swap out the prompt in here,

08:13.710 --> 08:18.690
and it will add the demonstrations into this section here.

08:18.690 --> 08:20.850
And then you can do some prompt engineering

08:20.850 --> 08:22.108
and stuff from that.

08:22.108 --> 08:24.470
So, I thought this was a really cool tool.

08:24.470 --> 08:27.330
It actually really matches heavily the type of things

08:27.330 --> 08:30.300
that I'm doing on a day-to-day basis when I'm prompt engineering.

08:30.300 --> 08:33.390
So, yeah, I wanted to try it out.

08:33.390 --> 08:38.390
You could also have another LLM as a judge to evaluate this,

08:38.490 --> 08:39.450
which is quite helpful.

08:39.450 --> 08:41.070
I just use the exact match evaluator,

08:41.070 --> 08:44.370
but you could also drag in an LLM as a judge.

08:44.370 --> 08:48.330
You could say, we need to make sure that the prompt,

08:48.330 --> 08:50.493
the joke, is this joke funny?

08:53.970 --> 08:56.910
And then it could estimate whether the joke is funny or not.

08:56.910 --> 08:59.880
So, anyway, it's cool.

08:59.880 --> 09:02.040
Hopefully you guys find that useful.

09:02.040 --> 09:04.080
I've seen a lot of success from this

09:04.080 --> 09:07.083
and yeah, hopefully you guys find that useful.