WEBVTT

00:01.260 --> 00:02.850
-: Hi, my name is Jared Zoneraich,

00:02.850 --> 00:04.740
I am one of the founders of PromptLayer,

00:04.740 --> 00:07.800
and we are a platform for prompt engineering.

00:07.800 --> 00:09.630
This is gonna be a pretty quick tutorial

00:09.630 --> 00:10.980
on a very important subject,

00:10.980 --> 00:14.040
which is, how do you evaluate your RAG system?

00:14.040 --> 00:16.350
So I've already built a RAG system,

00:16.350 --> 00:17.580
how do I know if it's good?

00:17.580 --> 00:19.680
How do I make sure it stays good

00:19.680 --> 00:21.720
and how do I get my prompts better?

00:21.720 --> 00:24.210
I've already built a RAG system for this tutorial

00:24.210 --> 00:26.880
that is trained on New York Stock Exchange data

00:26.880 --> 00:29.490
and lets the user ask questions about the data.

00:29.490 --> 00:31.110
So let's get started,

00:31.110 --> 00:31.943
to keep this short,

00:31.943 --> 00:35.310
I've done a little bit of the prompt writing already.

00:35.310 --> 00:38.430
I'll show you, I have something called stock-buddy here,

00:38.430 --> 00:41.100
so this is a prompt in PromptLayer.

00:41.100 --> 00:43.290
It says, "You are an AI stock trader,

00:43.290 --> 00:45.210
please answer the user's question."

00:45.210 --> 00:46.867
User gives a question and then it says,

00:46.867 --> 00:48.270
"I've received some data,"

00:48.270 --> 00:50.760
here's the data, "I will now answer your question."

00:50.760 --> 00:53.310
And we'll get the AI to answer the question that way.

00:53.310 --> 00:57.090
I have also built an AI evaluation system

00:57.090 --> 00:59.880
meant to evaluate how good my answer is.

00:59.880 --> 01:01.590
So basically says, "You're an AI system

01:01.590 --> 01:05.190
evaluating a new data scientist that's hired."

01:05.190 --> 01:07.530
Give a true or false if their answer was correct or not.

01:07.530 --> 01:10.080
So we have a question, we have an answer,

01:10.080 --> 01:12.960
and we have a real answer, and it gives a true or false.

01:12.960 --> 01:15.990
So let's go ahead and create the pipeline to evaluate it,

01:15.990 --> 01:18.750
so we're using PromptLayer to do this.

01:18.750 --> 01:23.750
So we'll call it rag-end-to-end pipeline or call it eval.

01:26.250 --> 01:28.740
So I've actually already created a data set

01:28.740 --> 01:31.800
that has a bunch of questions and a bunch of answers.

01:31.800 --> 01:34.650
I manually figured out these answers,

01:34.650 --> 01:38.430
through my like New York Stock Exchange data.

01:38.430 --> 01:39.960
But you can figure out these answers

01:39.960 --> 01:40.920
a lot of different ways,

01:40.920 --> 01:42.840
but this gives me the ground truth answer

01:42.840 --> 01:44.460
that I could evaluate on.

01:44.460 --> 01:46.260
So let's create the pipeline.

01:46.260 --> 01:48.690
You can, also, if you don't have the ground truth answer,

01:48.690 --> 01:51.090
you can maybe use an AI to figure out

01:51.090 --> 01:53.040
if it's close or if it sounds right,

01:53.040 --> 01:54.660
there's a lot of different methods you could use,

01:54.660 --> 01:56.250
this is just one of them.

01:56.250 --> 01:58.080
So let's build the first step of our pipeline.

01:58.080 --> 02:01.200
The first step, as you know, is retrieving the data.

02:01.200 --> 02:05.703
So I've actually, well, let's call it retrieval-step.

02:06.660 --> 02:10.020
I've actually created like a API endpoint

02:10.020 --> 02:11.760
that serves my RAG data.

02:11.760 --> 02:13.113
So that's the first step.

02:14.400 --> 02:15.780
Let's create our next step,

02:15.780 --> 02:19.770
which is actually creating the AI answer.

02:19.770 --> 02:21.240
So we'll select our prompt template

02:21.240 --> 02:24.033
that I showed you earlier, which is called stock-buddy.

02:25.260 --> 02:28.080
We'll choose the data, which is retrieval-step,

02:28.080 --> 02:28.950
and we'll choose the question,

02:28.950 --> 02:32.460
these are the variables from our prompt as I showed you.

02:32.460 --> 02:34.470
So this is kind of a preview of our eval,

02:34.470 --> 02:37.230
this is an easy way to create our eval,

02:37.230 --> 02:39.000
we get to see the live data.

02:39.000 --> 02:40.170
So here's our AI answer

02:40.170 --> 02:42.210
and then we'll finally add another one,

02:42.210 --> 02:46.863
which is the score, and we'll call it ai-eval.

02:47.730 --> 02:50.880
We'll choose our ai-evaluator,

02:50.880 --> 02:52.380
we'll give it the ai-answer,

02:52.380 --> 02:53.460
we'll give it the question

02:53.460 --> 02:55.860
and we'll give it the real answer.

02:55.860 --> 02:59.190
And it'll give us a true or false, if it was correct or not.

02:59.190 --> 03:01.140
So let's see, there we go, true, true, true.

03:01.140 --> 03:02.726
All of these are correct.

03:02.726 --> 03:05.130
So let's go back to our prompt.

03:05.130 --> 03:06.210
As you can see here,

03:06.210 --> 03:09.300
I've already run it a few times on these versions.

03:09.300 --> 03:11.820
I got 72%, then I got 80%.

03:11.820 --> 03:13.590
So let's jump to it.

03:13.590 --> 03:16.170
We can see it ran it on the full data set.

03:16.170 --> 03:17.790
Show all 100.

03:17.790 --> 03:20.040
And our score is how many of these are true

03:20.040 --> 03:21.300
and how many of these are false.

03:21.300 --> 03:24.630
We can also see some stats,

03:24.630 --> 03:27.030
but this is basically how you test a RAG system,

03:27.030 --> 03:29.760
you need to test it end to end,

03:29.760 --> 03:32.010
you can test each individual part,

03:32.010 --> 03:34.830
but you really need to know, does the whole thing work?

03:34.830 --> 03:36.423
Hope this tutorial was helpful.