WEBVTT

00:00.560 --> 00:01.130
Okay.

00:01.130 --> 00:02.930
Self-consistency sampling.

00:02.930 --> 00:09.260
This is something that I use all the time, which I don't actually see a lot of other people using.

00:09.260 --> 00:12.500
I think it's a little bit of a secret trick.

00:12.500 --> 00:18.650
It's actually well known in the academic world, I think, but in production it has huge benefits.

00:18.650 --> 00:21.560
I don't think people really use it enough now.

00:21.560 --> 00:22.160
What is it?

00:22.160 --> 00:24.920
Self-consistency sampling sounds complicated.

00:24.920 --> 00:25.220
Really.

00:25.220 --> 00:32.000
What it is, is just generating multiple responses and then choosing either the summary or the aggregate

00:32.000 --> 00:38.000
of those responses, or the most common response, or using some evaluation metric to choose which ones

00:38.000 --> 00:39.500
to select.

00:39.650 --> 00:41.960
Okay, this is a canonical example of it.

00:41.960 --> 00:47.210
Instead of generating one response to this question,

00:47.210 --> 00:49.550
this one generates three.

00:49.550 --> 00:51.170
And that's fine.

00:51.170 --> 00:56.900
One of these answers is wrong, but it got it right two of the three times.

00:56.900 --> 01:02.810
And this is one of the fun things about LLMs: they're non-deterministic, so they don't always

01:02.810 --> 01:04.460
give you the right answer.

01:04.460 --> 01:09.860
If you sample multiple times and then you select the ones that are most consistent, then you can get

01:09.860 --> 01:12.440
the right answer in aggregate, which is really helpful.

01:12.650 --> 01:14.990
Let me show you an example of how to run this.

01:15.440 --> 01:22.030
This is using the async OpenAI client, because I find it's much quicker if you're running it three or five times.

01:22.120 --> 01:25.840
You don't want to add too much latency to your prompt.

01:25.870 --> 01:27.940
You don't want to run them five times in a row.

01:27.970 --> 01:33.370
It's going to take five times as long. If you run them all at the same time using an async API,

01:33.520 --> 01:35.590
then you can get much better results.

01:35.890 --> 01:38.050
All right, this is the system prompt.

01:38.080 --> 01:44.440
Here we're telling it to use step-by-step reasoning for the questions and then provide the answer as JSON,

01:44.440 --> 01:47.020
with the step-by-step reasoning and the final answer as the keys.

01:47.020 --> 01:48.880
And then we're just giving it a question.

01:49.360 --> 01:52.720
And this is, by the way, an async function.

01:52.720 --> 01:56.200
This is just what is needed when you're making asynchronous calls.

01:56.200 --> 02:01.180
And then we have a function just to extract the answer from the reasoning path.

02:01.330 --> 02:03.820
So what we do is, from the JSON,

02:03.820 --> 02:06.610
we just get this final answer response.

02:06.610 --> 02:10.900
And we're also removing any dollar signs just to make this simpler.
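
NOTE
A minimal sketch of the answer-extraction step described above, not the code shown in the video.
The key name "final_answer" and the function name extract_answer are assumptions for illustration.
Returning None for anything unparseable lets the later voting step skip bad samples.
```python
import json
from typing import Optional
def extract_answer(raw: str) -> Optional[str]:
    """Return the final answer from one JSON response, or None if it cannot be parsed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # the model did not return valid JSON
    if not isinstance(data, dict) or "final_answer" not in data:
        return None  # valid JSON, but not in the expected shape (assumed key name)
    # Drop dollar signs and surrounding whitespace so "$85" and "85" compare equal.
    return str(data["final_answer"]).replace("$", "").strip()
```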

02:11.050 --> 02:11.380
All right.

02:11.410 --> 02:14.860
Now we can run this self-consistency function.

02:14.860 --> 02:21.400
And all that's going to do is take the number of samples and generate that many tasks, asynchronous

02:21.400 --> 02:21.790
tasks.

02:21.790 --> 02:24.070
So this is generating five by default.

02:24.070 --> 02:25.540
And we're going to gather those tasks.

02:25.540 --> 02:30.250
So when all five of them are completed then we're going to get all the answers and then find the most

02:30.250 --> 02:31.120
common answer.
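
NOTE
A rough sketch of the gather-then-vote pattern described above, under stated assumptions.
The names self_consistency and make_fake_sampler are hypothetical, and the sampler replays
canned answers instead of calling a real model, so the example runs without an API key.
```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable, List, Optional
async def self_consistency(sample: Callable[[], Awaitable[Optional[str]]], num_samples: int = 5) -> Optional[str]:
    """Run `sample` num_samples times concurrently and return the most common non-empty answer."""
    tasks = [sample() for _ in range(num_samples)]
    # return_exceptions=True keeps one failed call from discarding the other samples.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    answers: List[str] = [r for r in results if isinstance(r, str) and r]
    if not answers:
        return None  # every sample failed or came back empty
    return Counter(answers).most_common(1)[0][0]
# Hypothetical stand-in for the model call: replays canned answers instead of hitting an API.
def make_fake_sampler(canned):
    remaining = iter(canned)
    async def sample():
        await asyncio.sleep(0)  # yield control, as a real network call would
        return next(remaining)
    return sample
result = asyncio.run(self_consistency(make_fake_sampler(["85", None, "85", None, "58"])))
print(result)
```
In a real pipeline the sampler would wrap the model call plus the answer-extraction step.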

02:31.510 --> 02:32.080
Yeah here we go.

02:32.080 --> 02:34.540
So now we have one answer of 85.

02:34.570 --> 02:35.680
Another answer of 85.

02:35.710 --> 02:37.870
Two others that failed to parse.

02:37.870 --> 02:41.020
And then this one came back in the wrong format.

02:41.020 --> 02:43.930
We could go back and figure out the reasoning steps.

02:43.930 --> 02:47.350
We could go back and see what the answer was and why it failed to parse.

02:47.350 --> 02:53.890
But the nice thing here is that despite all these errors, we still got the right answer, 85.

02:54.010 --> 03:00.910
So this is something that is really helpful with LLMs, where the answers are non-deterministic. Sometimes

03:00.910 --> 03:03.430
it just doesn't follow the format for whatever reason.

03:03.430 --> 03:05.920
So this can be really helpful.

03:05.920 --> 03:11.110
And you know, if you're not getting a consistent response across five you could just change this.

03:11.110 --> 03:14.050
You could say, now I want ten samples.

03:14.050 --> 03:20.410
And this is a way that you can essentially trade cost for quality, because it doesn't really

03:20.410 --> 03:24.310
increase your latency very much to generate five things or ten things at a time.

03:24.310 --> 03:30.190
But it does improve the quality, because now if something is only successful two out of five times,

03:30.190 --> 03:32.680
you're very likely to get at least one successful answer across several samples.

03:32.680 --> 03:38.740
Whereas if you only run it once, you only have a one- or two-in-five chance of success.

03:38.860 --> 03:43.330
So if you scale this up to 10 or 20, then you can increase your odds even more.

03:43.330 --> 03:46.750
And all you're doing is paying more for more tokens.
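
NOTE
A small worked example of the odds described above, assuming samples are independent and each
succeeds with the same probability. The function name is made up for illustration.
With a two-in-five success rate, five samples give about a 0.92 chance of at least one success.
Majority voting additionally needs the right answer to be the most common one among the samples.
```python
def chance_of_at_least_one_success(p: float, n: int) -> float:
    """Probability that at least one of n independent samples succeeds, given per-sample rate p."""
    return 1.0 - (1.0 - p) ** n
# Per-sample success rate of two in five, at the sample counts mentioned in the talk.
for n in (1, 5, 10, 20):
    print(n, round(chance_of_at_least_one_success(0.4, n), 4))
```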

03:46.750 --> 03:50.320
So this is a really good example of self-consistency sampling.

03:50.320 --> 03:58.030
And you can have this either as a quality step in your AI, or you can use this to generate synthetic

03:58.030 --> 03:58.930
data.

03:58.930 --> 04:06.040
Whatever it is, you can really improve the performance of your output by, you know, generating

04:06.040 --> 04:07.300
multiple answers.

04:07.570 --> 04:13.770
The other pattern I see, by the way, is not just counting the most common answer, but having some

04:13.770 --> 04:17.970
evaluation metric for checking whether the response is correct.

04:18.060 --> 04:21.210
You could also do a multi generation step.

04:21.210 --> 04:28.650
So in this sample, if the answer comes back as nothing, then you could tell it to go back and generate again

04:28.650 --> 04:30.150
until it comes back with something.
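
NOTE
One possible sketch of the regenerate-until-something-comes-back idea described above.
The names sample_until_answer and make_fake_sampler are hypothetical, and the max_attempts cap
is an added assumption so the loop cannot run forever if the model never returns an answer.
```python
import asyncio
from typing import Awaitable, Callable, Optional
async def sample_until_answer(sample: Callable[[], Awaitable[Optional[str]]], max_attempts: int = 3) -> Optional[str]:
    """Call `sample` until it returns a non-empty answer, giving up after max_attempts tries."""
    for _ in range(max_attempts):
        answer = await sample()
        if answer:
            return answer
    return None  # still nothing after max_attempts tries
# Hypothetical stand-in for the model call: returns nothing twice, then an answer.
def make_fake_sampler(canned):
    remaining = iter(canned)
    async def sample():
        return next(remaining)
    return sample
answer = asyncio.run(sample_until_answer(make_fake_sampler([None, "", "85"])))
print(answer)
```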

04:30.360 --> 04:34.620
So you can usually mix this with other types of techniques.

04:34.620 --> 04:38.070
And that's a pretty common use of this function.