WEBVTT

00:00.170 --> 00:03.530
Chain of thought is one of the most common prompting technique.

00:03.770 --> 00:07.610
It's really useful for getting learners to read, and they're still not great at it.

00:07.610 --> 00:12.680
But this makes a really big difference in terms of performance, and it's very easy to implement.

00:12.680 --> 00:13.940
But we're going to walk through this.

00:14.030 --> 00:16.460
There is a paper that you can read on Chain of Thought.

00:16.460 --> 00:19.640
And actually it's used in many scientific papers.

00:19.640 --> 00:24.350
But the one I recommend reading is this one chain of thought prompting with its reasoning in large language

00:24.350 --> 00:24.800
model.

00:24.950 --> 00:30.770
I think this is the one that introduced it, but the general idea of what you have to understand about

00:30.770 --> 00:34.550
chain of thought is that it mimics the way that humans think.

00:34.580 --> 00:40.940
If you think about system one and system two, it's how humans divide up the different tasks they have

00:40.940 --> 00:41.480
to do.

00:41.480 --> 00:48.110
And the vast majority of the operation of your brain is system one.

00:48.110 --> 00:54.500
It's this automatic emotional process where you give an immediate response subconsciously, and that's

00:54.500 --> 00:57.040
how we make most of the decisions that we need to make.

00:57.130 --> 01:00.940
And then when you actually stop and think about something, you're using system two.

01:00.970 --> 01:02.950
So that's a lot more deliberate.

01:03.280 --> 01:06.190
You're actually consciously working through the steps.

01:06.190 --> 01:08.350
And that's really what chain of thought is.

01:08.350 --> 01:13.660
You know, if large language models are really all system one right now are using chain of thought is

01:13.660 --> 01:18.100
how you make it stop and think through the steps and be deliberate against perched.

01:18.610 --> 01:25.540
And this is something that pretty much every mobile vendor in their prompt engineering guide that talks

01:25.540 --> 01:28.870
about life, they say you should give the AI time to think.

01:28.900 --> 01:30.430
That's what OpenAI says.

01:30.430 --> 01:34.330
And you're actually anthropic uses this technique in its tool.

01:34.330 --> 01:39.550
They output like a thinking tag first, and you can actually see that if you ask it to see what the

01:39.550 --> 01:40.420
results are.

01:40.660 --> 01:43.960
So let's dive into the actual example.

01:44.310 --> 01:50.220
First of all, we just have a basic function here which called the OpenAI API passes in a system prompt

01:50.220 --> 01:52.080
and a normal prompt.

01:52.200 --> 01:53.430
So the user prompt.

01:53.430 --> 01:59.400
So the system prompt here solve the following problem and return the answer in the format a answer.

02:00.360 --> 02:03.270
The question is how many hours are there in raspberry.

02:03.300 --> 02:08.850
Now this is a famous failure mode for llms because they can't really count letters that these are all

02:08.850 --> 02:11.520
tokenized, so it does a bad job with them.

02:11.520 --> 02:16.950
And then we also have here a standard prompt which is a three shot prompt.

02:17.070 --> 02:19.050
So we say here's three examples.

02:19.050 --> 02:20.820
We say how many years are there in elephant.

02:20.820 --> 02:24.840
And they convert to how many keys are with three.

02:24.840 --> 02:25.830
And then how many O's are there.

02:25.830 --> 02:27.060
And chocolate becomes two.

02:27.090 --> 02:30.180
So you get a bunch of examples of the types of answers we want.

02:30.180 --> 02:34.370
And then we're given it the question which is how many are there around Around three.

02:34.970 --> 02:41.330
And when we run that, if we just run that now we see hopefully that it came back with two.

02:41.420 --> 02:46.880
But there are actually three R's in raspberry because they have the R beginning and then the two at

02:46.880 --> 02:47.330
the end.

02:47.360 --> 02:51.530
Now the chain of thought prompts got the right result and it did do that.

02:51.530 --> 02:57.410
But this answer three we gave it this thinking in the examples we told it the think step by step.

02:57.620 --> 02:59.150
And then we spell it out.

02:59.150 --> 03:01.370
and then we have Jeff in between the letters.

03:01.370 --> 03:04.040
And it actually, you know, figured out.

03:04.040 --> 03:06.140
So we we actually wrote these right.

03:06.650 --> 03:09.080
And showed it how it should think through the problem.

03:09.080 --> 03:11.300
And then we say, let's look step by step.

03:11.450 --> 03:13.490
And it's doing a much better job.

03:13.610 --> 03:16.220
So there are different ways to implement this of course.

03:16.220 --> 03:18.770
And you know some ways work better than others.

03:18.800 --> 03:24.520
But I think the main thing to think about here is how do you demonstrate to it that it should reason

03:24.520 --> 03:28.510
through the steps within a previous few short examples?

03:29.560 --> 03:29.860
All right.

03:29.860 --> 03:31.120
So that's chain of thought.

03:31.120 --> 03:35.980
And when you're implementing a technique like that you typically want to evaluate the response.

03:35.980 --> 03:41.410
And in this case it might be pretty straightforward because you know the correct answer is.

03:41.620 --> 03:44.530
So I created a evaluate response function.

03:44.530 --> 03:46.480
And it just searches for that final answer.

03:46.480 --> 03:53.020
So we have it's looking for a and then the space and then the number and kind of pull back that number

03:53.020 --> 03:53.800
as the final answer.

03:53.800 --> 03:55.030
It can refer to an integer.

03:55.690 --> 04:00.760
And that's why it was so important by the way for us to provide examples in the first place, because

04:01.060 --> 04:07.750
you'll find that it might say the final answers to and doesn't start with this a colon space.

04:08.920 --> 04:14.830
And that's why few shot prompting to be really helpful because specifying a format to follow.

04:14.830 --> 04:17.410
And then you could check what the right answer is afterwards.

04:17.530 --> 04:20.530
So this just pulls out the right answer and then splits it.

04:20.530 --> 04:25.270
And then the other thing gets checking as well as if there are any steps, any reading steps.

04:25.390 --> 04:28.360
So we run that on this example test response.

04:28.660 --> 04:31.450
And we can see that your answer is correct.

04:31.450 --> 04:36.160
And it does provide the steps that we fasten the correct answers to.

04:36.160 --> 04:38.260
And it is multiple lines here.

04:38.290 --> 04:43.890
If we deleted this then it would say doesn't provide stats.

04:43.890 --> 04:46.980
Or if we said the correct answer three they would say that was all.

04:46.980 --> 04:49.980
Now we know that our evaluation metric is working.

04:49.980 --> 04:53.160
We can run that for lots of different test questions.

04:53.160 --> 04:54.870
Then here we have a bunch of new ones.

04:54.960 --> 05:01.470
How many are in hamburger mdm2, how many ls other umbrella and so on and so forth.

05:01.620 --> 05:03.600
So we run the evaluation.

05:03.990 --> 05:10.650
What we're going to do is give it the prompt type and say if the front type is standard, then we're

05:10.650 --> 05:14.880
going to just give it the normal few shot prompt without any recent steps.

05:14.880 --> 05:21.330
And if the prompt is the kind of thought prompt, and we're going to give it to the other prompt, to

05:21.330 --> 05:26.640
the one with the steps and which have been timely response, we're going to return it and we're going

05:26.640 --> 05:28.470
to check the evaluation.

05:28.490 --> 05:29.420
Is it correct.

05:29.420 --> 05:30.980
And that steps.

05:30.980 --> 05:36.470
And then we're going to figure out the overall accuracy of that prompt the average time.

05:36.470 --> 05:39.500
And then the step percentage how many of these has steps.

05:39.500 --> 05:44.000
So this is all in one function in controlling for the different type.

05:44.000 --> 05:46.220
And then returning the evaluation metrics.

05:46.610 --> 05:48.380
So we can run that on the standard.

05:48.590 --> 05:52.160
And then we can run that on the chain of thought.

05:52.550 --> 05:55.280
And we're going to be able to compare the two results.

05:55.280 --> 05:57.110
So let's see what we get.

05:57.110 --> 05:58.130
Standard prompting.

05:58.130 --> 06:02.300
We get 90% accuracy and it takes less than a second to run.

06:02.510 --> 06:04.610
But none of them have steps and they'll have reasoning.

06:04.610 --> 06:07.580
So you wouldn't be able to debug with it where it went wrong.

06:07.580 --> 06:10.460
Whereas chain of thought prompting we get 100% accuracy.

06:10.460 --> 06:11.540
It does take longer.

06:11.540 --> 06:13.940
So take 1.62 seconds.

06:14.120 --> 06:16.190
And this is something to think about in production.

06:16.190 --> 06:22.390
If you're asking it for more steps than the may cost you more tokens, and it's also going to take longer

06:22.390 --> 06:22.960
to return.

06:22.960 --> 06:27.850
And this is the type of trade off you need to make with Promptitude and chained.

06:27.850 --> 06:32.320
Thought isn't for every system, because sometimes you need to get a result back quickly.

06:32.320 --> 06:39.880
And as speed can be traded off for an accuracy, when you're deciding whether to deploy to your thought

06:39.880 --> 06:43.540
and how you deploy chain, your thought makes a big difference to that trade off to you.
