WEBVTT

00:00.120 --> 00:02.040
Okay, let's talk about five tricks.

00:02.040 --> 00:02.920
One trap.

00:02.920 --> 00:06.920
Let's get started with trick number one, the illusion of memory.

00:06.960 --> 00:09.600
Something which you've already seen basically before.

00:09.600 --> 00:09.880
But.

00:09.880 --> 00:14.160
But, uh, it's always worth taking a minute to explain this.

00:14.440 --> 00:22.080
So, look, we saw before, when we make a prompt to an LM, like, my name is Ed, and the LM has to

00:22.120 --> 00:28.200
complete this prompt, it comes up with a most likely, uh, text to come after this.

00:28.240 --> 00:31.560
Or as you may know, it's really generating tokens.

00:31.600 --> 00:35.160
There's little fragments of text and it responds, hi, Ed.

00:35.440 --> 00:37.800
And then we give it a separate prompt.

00:37.840 --> 00:38.880
What's my name?

00:39.240 --> 00:43.760
Every time you call an LM, as I explained a million times, it's stateless.

00:43.960 --> 00:48.560
This is GPT, the model, the LM, not ChatGPT the product.

00:48.720 --> 00:52.600
You make a call to GPT, the model, it takes an input sequence.

00:52.600 --> 00:53.440
What's my name?

00:53.640 --> 00:58.720
It predicts the most likely thing to come after that, based on all of the training data examples.

00:58.720 --> 01:04.230
And of course it says I don't know your name or something more charming than that in GPT style.

01:04.630 --> 01:12.190
Okay, so the illusion of memory is this very simple trick, which is that when you prompt an LLM,

01:12.230 --> 01:19.830
you might say, my name is Ed and it might say hi Ed, the next time that you send a prompt to the LLM.

01:20.030 --> 01:22.030
If the user has said, what's my name?

01:22.150 --> 01:26.230
You do not send the LLM the prompt, what's my name?

01:26.750 --> 01:27.590
Do you know what you do?

01:27.630 --> 01:28.070
Send it.

01:28.110 --> 01:31.630
You probably do, but if not, this is the answer.

01:31.630 --> 01:35.390
What you send it is the whole conversation so far.

01:35.590 --> 01:36.630
My name is Ed.

01:36.670 --> 01:37.390
Hi, Ed.

01:37.470 --> 01:38.550
What's my name?

01:38.710 --> 01:42.070
And what it is predicting is what comes next.

01:42.070 --> 01:48.910
And in all of its training data, it was given inputs that have this kind of set of message response

01:48.910 --> 01:50.030
message response.

01:50.030 --> 01:52.470
And so it's expecting something like that.

01:52.470 --> 01:58.910
So when it generates the next tokens, the next few the the text to come at the end of that, it generates

01:58.910 --> 02:02.510
something consistent with the full conversation so far.

02:02.630 --> 02:04.060
And that's why it responds.

02:04.060 --> 02:11.300
Your name is Ed, and that's why every time you talk to ChatGPT through the UI, it's sending the underlying

02:11.300 --> 02:15.780
GPT, LLM the entire conversation so far each time.

02:15.780 --> 02:23.100
And it generates the next text, the tokens based on the full history so far, which is called every

02:23.140 --> 02:26.300
time that you press enter when you're using ChatGPT.

02:26.580 --> 02:34.540
So this trick of sending the full conversation gives us the illusion that this data science model remembers

02:34.540 --> 02:36.420
what we said 30s ago.

02:36.620 --> 02:37.500
It doesn't.

02:37.540 --> 02:39.540
It's just included every time.

02:39.540 --> 02:41.180
In the prompt to the LLM.

02:41.420 --> 02:44.580
Many of you knew that already, but it never hurts to go through it again.

02:44.740 --> 02:48.380
And that that is how trick one works.

02:48.500 --> 02:52.100
And so trick two, which arguably is less to do with agentic AI.

02:52.140 --> 02:56.220
It's just a general trick of thinking reasoning.

02:56.380 --> 03:02.420
This started out like like I guess a year and a half ago or so when people discovered something that

03:02.420 --> 03:09.010
at the time was called chain of thought, which was this idea that you could ask an LLM, a question,

03:09.010 --> 03:11.890
and then you could just add to the end of your prompt.

03:12.050 --> 03:18.170
Uh, please think step by step and you would just get better outcomes.

03:18.210 --> 03:20.650
Just as a result of prompting it that way.

03:20.890 --> 03:28.290
And there seemed to be some, some kind of strange feature that if an LLM is generating text that describes

03:28.290 --> 03:34.810
what it should do and then it generates text to do it, you end up getting better outcomes.

03:35.090 --> 03:38.250
Then nothing, nothing is actually thinking through in that way.

03:38.250 --> 03:40.410
It's just generating likely text.

03:40.570 --> 03:46.610
Uh, but uh, but just as a side effect, by virtue of generating text to describe a thought process,

03:46.810 --> 03:49.650
it ends up coming up with better outcomes.

03:49.850 --> 03:55.330
And so that resulted in this idea of a thinking or a reasoning model.

03:55.370 --> 03:56.370
It means the same thing.

03:56.530 --> 04:02.770
And that is a model which is actually been trained so that when it's given any question before it generates

04:02.770 --> 04:10.760
the text for the final output, it generates some text to describe how it will take step by step your

04:10.760 --> 04:13.160
question to come up with an output.

04:13.160 --> 04:17.720
And that is why you sometimes see with some of these models a sort of what they call a thinking trace

04:17.880 --> 04:23.160
text coming up that describes its thought process before you get the final output.

04:23.200 --> 04:27.080
And so to give you a concrete example, and I'm actually taking this from an example that I do in some

04:27.080 --> 04:28.520
of my technical courses.

04:28.720 --> 04:35.840
You can you can prompt an LLM that is not thinking and reasoning, but a very small LLM like, like

04:35.840 --> 04:44.360
for example, GPT 4.1 nano on a mode where it's not reasoning and you can say, uh, toss two coins,

04:44.360 --> 04:46.680
one of them is heads, what's the chances?

04:46.680 --> 04:51.800
The other one is tails, and it's quite likely to get the wrong answer.

04:51.800 --> 04:58.080
In this case it's responding 50 over 50.5 as the as the probability which which happens to be the wrong

04:58.080 --> 04:58.440
answer.

04:58.440 --> 05:03.680
Because this is one of those sneaky little questions that, uh, has has trickery built into it.

05:04.160 --> 05:07.750
If, however, you prompt it with toss two coins.

05:07.750 --> 05:08.710
One is heads.

05:08.710 --> 05:09.590
What's the chance?

05:09.590 --> 05:10.750
The other one is tails.

05:10.750 --> 05:13.190
First, describe your thought process.

05:13.190 --> 05:14.670
Then give the answer.

05:14.910 --> 05:20.390
Or if you use a reasoning model which has been trained to first output the thought process and then

05:20.390 --> 05:25.350
the answer, you will get some kind of thought process that will say, okay, this this is probably

05:25.350 --> 05:27.790
a trick question or this is a sneaky question.

05:27.790 --> 05:31.430
Now I need to realize that they haven't said I toss two coins.

05:31.430 --> 05:32.750
The one on the left is heads.

05:32.750 --> 05:33.470
What's the chances?

05:33.470 --> 05:36.110
The one on the right is tails, which would of course be 5050.

05:36.150 --> 05:38.270
But the question has been raised differently to that.

05:38.270 --> 05:41.310
Okay, there are four different possibilities here blah blah, blah blah blah.

05:41.510 --> 05:44.710
The answer is two thirds and it tends to get that right.

05:44.710 --> 05:49.310
And you could try these experiments and depending on which models you pick and how you do it, you can

05:49.310 --> 05:54.790
you can quite easily recreate this idea that a small model, when it's not given the chance to think

05:54.790 --> 05:56.750
it through or get the wrong answer.

05:56.750 --> 06:03.750
And when reasoning is on and it's giving tokens, it's outputting text to describe the thought process,

06:03.750 --> 06:06.350
you end up with the right answer.

06:06.350 --> 06:13.580
So this reasoning trick ends up working really well to get more, more sophisticated intelligence out

06:13.580 --> 06:19.460
of using using a model which is just generating the most likely text to follow some input text.

06:19.460 --> 06:24.220
And just to add one more piece here, you may you may know this already, but maybe you don't.

06:24.380 --> 06:30.460
The way that these models actually work, they don't generate the output given an input.

06:30.460 --> 06:31.780
The way I've been describing it.

06:31.820 --> 06:32.620
Exactly.

06:32.620 --> 06:38.980
They're given an input, and the first thing they do is they generate the first few letters.

06:39.020 --> 06:44.540
It's called a token, like a chunk of letters that would come after this input, just the first one,

06:44.540 --> 06:51.260
the most likely next thing to come that is then taken and it's stuffed on the end of the input, and

06:51.260 --> 06:57.660
then that whole thing is passed in again to the model, and it's now challenged to come up with the

06:57.660 --> 07:03.860
next little fragment of letters, the next token that comes after all this input and the first token.

07:03.860 --> 07:07.620
And it does that, that gets shoved on the end and it repeats.

07:07.620 --> 07:10.520
That's the way that models generate text.

07:10.560 --> 07:12.600
That process is known as inference.

07:12.760 --> 07:19.160
And when it's doing that, it's generating tokens that are most likely to come next after all of the

07:19.160 --> 07:21.360
input and the output so far.

07:21.880 --> 07:29.160
And that's why you get this curious side effect that if you ask it to generate tokens to describe a

07:29.160 --> 07:32.480
thought process, it will start by doing that.

07:32.480 --> 07:36.640
And then you can generate tokens for the for the outcome.

07:36.680 --> 07:42.640
And it will be consistent with the thought process and it just results in better outcomes.

07:42.760 --> 07:44.280
It sounds counterintuitive.

07:44.280 --> 07:48.200
It sounds like like that's too good to be true, but it actually works.

07:48.200 --> 07:50.040
It's empirical and it works.

07:50.320 --> 07:55.280
And there are some models, uh, which which are designed for being just in a chat mode.

07:55.320 --> 07:58.000
They have been trained to just respond immediately.

07:58.000 --> 08:02.880
And there are other variants, as we say, other other ways of the same kinds of the same family of

08:02.880 --> 08:07.920
models has been trained in order to be in a reasoning mode where it will generate reasoning stuff first

08:07.920 --> 08:09.080
and then the answer.

08:09.400 --> 08:12.870
And there's also some hybrid models that can can do both.

08:13.430 --> 08:14.710
And it's people.

08:14.750 --> 08:19.630
People tend to always sort of gravitate towards reasoning models, because they tend to be much more

08:19.630 --> 08:22.950
intelligent and do better in any of the benchmarks and tests.

08:23.110 --> 08:28.590
But it turns out often chat models are better for different use cases, particularly in agentic AI when

08:28.590 --> 08:31.230
you're already driving a whole system step by step.

08:31.350 --> 08:34.830
So don't always assume that a reasoning model is going to be better.

08:34.870 --> 08:39.550
Sometimes a chatbot will do better, and you may have the question of okay, how do you tell?

08:39.590 --> 08:45.190
And the answer, as is so often the case when working with Llms, is that you should try both and pick

08:45.190 --> 08:47.150
the one that performs better.

08:47.190 --> 08:49.910
This whole field is super experimental.

08:50.230 --> 08:53.750
There are no right and wrong answers except to experiment.

08:53.790 --> 08:58.190
Have some sort of a measurement of success and use that to make your decision.

08:58.230 --> 09:04.510
And then one final thing I'll say on this, which is that some, some models allow you to specify how

09:04.510 --> 09:06.310
long it should think for.

09:06.590 --> 09:11.190
Sometimes called the reasoning effort, sometimes called the thinking budget.

09:11.830 --> 09:17.940
Then in some models, like the latest GPT 5.1, you can choose a budget of none if you want it to be

09:17.940 --> 09:21.980
in chat mode, or minimal or low or medium or high.

09:22.140 --> 09:28.900
Those are all the settings of the reasoning or thinking budget, and you may wonder how that actually

09:28.900 --> 09:36.460
works, like how the model is just just generating these tokens and how do you you sort of what machinery

09:36.500 --> 09:41.860
have you built around one of these models to tell it, hey, I want you to do more thinking.

09:42.100 --> 09:47.500
And there are a few different techniques that are used, but the one that that is most common is unbelievably

09:47.540 --> 09:48.100
hacky.

09:48.300 --> 09:53.420
It's it's so janky that it's hard to believe this is really how it works.

09:53.420 --> 09:58.100
But, but but let me just, just quickly tell you now, I mentioned that when you're running a model

09:58.100 --> 10:00.860
in inference, as it's called, the model is generating tokens.

10:00.860 --> 10:04.340
You do them token at a time, generates a token that gets shoved back in.

10:04.380 --> 10:06.340
You generate the next token, shove back in.

10:06.580 --> 10:11.380
So during the reasoning mode it's doing this and it's coming up with all of this reasoning stuff as

10:11.380 --> 10:12.500
it thinks things through.

10:12.740 --> 10:15.090
And someone had a bright idea.

10:15.610 --> 10:17.010
And this sounds crazy.

10:17.050 --> 10:19.290
Someone had a bright idea that that you can.

10:19.330 --> 10:23.610
Of course you can choose to add in extra tokens in here.

10:23.610 --> 10:27.250
You don't have to take what it's generated and put that back in.

10:27.290 --> 10:29.690
You could put in something else as well if you wanted.

10:29.690 --> 10:32.250
And that becomes the new input and it generates an output.

10:32.290 --> 10:36.730
It doesn't know that it didn't generate all of this, that there's something extra in there that you've

10:36.770 --> 10:37.730
shoved in there.

10:37.770 --> 10:41.250
It just thinks this is the whole input that it's now generating the next token.

10:41.250 --> 10:41.850
Fine.

10:42.090 --> 10:45.810
So what might you shove in there that might that might help this?

10:45.970 --> 10:51.250
Well, someone had this idea that when it's generating tokens, it generates reasoning a couple of sentences

10:51.290 --> 10:52.730
about how it's thinking things through.

10:52.890 --> 10:57.170
You could wait until it has like a full stop when it's like finished what appears to be a some sort

10:57.170 --> 11:03.090
of a thought process and then just insert in the word wait at the end of that.

11:03.090 --> 11:07.210
So it says something like, okay, this appears to be a trick question.

11:07.210 --> 11:09.170
I should consider the different possibilities.

11:09.170 --> 11:11.250
And then the word wait appears there.

11:11.250 --> 11:13.290
And now it has to generate what comes next.

11:13.290 --> 11:17.840
And the tokens that it now generates need to be coherent with the word weight.

11:17.840 --> 11:21.640
And so it tends to generate something that is kind of taking a step back.

11:21.800 --> 11:25.760
And it's like, wait, I should double check that I've really understood the problem.

11:25.840 --> 11:28.480
Or wait, I hope I'm not jumping to conclusions.

11:28.480 --> 11:29.440
Let me review.

11:29.640 --> 11:35.560
And so it tends to generate tokens that are reflective of the thought process so far.

11:35.560 --> 11:38.560
And it goes back and revisits and challenges itself.

11:38.920 --> 11:46.800
So this trick of adding in things like weight and on the other hand, and stuff like that tends to allow

11:46.800 --> 11:51.040
it to rethink through and often explore new territory.

11:51.200 --> 11:55.480
And if you look in some of the reasoning traces you see when you're chatting with Llms, you'll see

11:55.480 --> 11:59.800
the word weight in there and others, and you'll know that it's this trick at play.

11:59.840 --> 12:07.680
This seemingly really, really simplistic trick does really well, and that is one of the core techniques

12:07.680 --> 12:15.280
that's used to sort of force this thinking budget and make a model think through more the train of thought

12:15.320 --> 12:17.240
before it comes up with the final answer.
