WEBVTT

00:01.020 --> 00:02.430
-: Hello and welcome back to the course

00:02.430 --> 00:04.470
on artificial intelligence.

00:04.470 --> 00:07.590
Today, we're going to talk about the Bellman equation.

00:07.590 --> 00:08.670
It's quite a complex topic

00:08.670 --> 00:11.970
and we're going to introduce it in a step-by-step manner

00:11.970 --> 00:14.010
throughout this whole section of the course.

00:14.010 --> 00:15.720
So we're not going to just jump straight

00:15.720 --> 00:17.070
into the most complex version

00:17.070 --> 00:18.150
of the Bellman equation right away,

00:18.150 --> 00:20.580
but instead, we're going to introduce it slowly

00:20.580 --> 00:23.400
in order to gradually understand how it works.

00:23.400 --> 00:25.470
And I hope you're cool with that approach,

00:25.470 --> 00:28.680
if you are, let's get straight into it.

00:28.680 --> 00:31.530
So, we're going to have a couple of key concepts

00:31.530 --> 00:32.817
that we're going to be operating with.

00:32.817 --> 00:36.150
And these concepts are, S stands for state.

00:36.150 --> 00:38.640
So the state in which our agent is,

00:38.640 --> 00:41.730
or any other possible state in which it can be.

00:41.730 --> 00:45.480
A, represents an action that an agent can take.

00:45.480 --> 00:48.540
So an agent can have access to a certain list of actions,

00:48.540 --> 00:50.370
and actions are very important

00:50.370 --> 00:53.610
when they're looked at in a state combination.

00:53.610 --> 00:54.810
So when you're in a certain state

00:54.810 --> 00:55.643
and then you look at actions,

00:55.643 --> 00:57.240
then it starts to make sense

00:57.240 --> 00:59.100
what's going to be the result of those actions.

00:59.100 --> 01:01.470
Because if you look at an action by itself without a state,

01:01.470 --> 01:02.303
doesn't really make sense

01:02.303 --> 01:03.870
because you don't know where you are,

01:03.870 --> 01:06.480
and where you can possibly end up in.

01:06.480 --> 01:08.970
Then we have, we'll have R, which stands for reward.

01:08.970 --> 01:11.430
And that's the reward the agent gets

01:11.430 --> 01:14.310
for entering into a certain state.

01:14.310 --> 01:17.010
And gamma is the discount factor.

01:17.010 --> 01:18.690
And we'll talk about the discount factor in a second,

01:18.690 --> 01:20.010
it'll all make sense just now.

01:20.010 --> 01:22.500
But just take a note, make a mental note,

01:22.500 --> 01:24.720
that we are going to have this letter gamma,

01:24.720 --> 01:26.580
that we'll be operating with later on.

01:26.580 --> 01:28.860
So the person behind the Bellman equation

01:28.860 --> 01:31.350
is Richard Ernest Bellman.

01:31.350 --> 01:34.200
He was an applied mathematician,

01:34.200 --> 01:38.220
and came up with the concept of dynamic programming,

01:38.220 --> 01:41.100
which underlies much of what we now call reinforcement learning

01:41.100 --> 01:43.590
and which gave us what we call the Bellman equation now in...

01:43.590 --> 01:45.480
Well, that's what we call now.

01:45.480 --> 01:48.570
And in 1953 he came up with that concept

01:48.570 --> 01:52.620
and that's when the Bellman equation came to be.

01:52.620 --> 01:56.490
So let's have a look at how this all works.

01:56.490 --> 01:59.130
There's our lovely agent in the bottom left corner

01:59.130 --> 02:00.990
and he is in a maze.

02:00.990 --> 02:03.000
And this is quite a classical maze

02:03.000 --> 02:04.620
where you've got some blocks,

02:04.620 --> 02:06.030
the white blocks are blocks

02:06.030 --> 02:08.250
in which the agent can step into,

02:08.250 --> 02:11.700
the gray block is the one that is just not accessible

02:11.700 --> 02:13.890
so that's like a wall in this maze.

02:13.890 --> 02:16.291
The green is where the agent

02:16.291 --> 02:18.240
should be aiming to end up in.

02:18.240 --> 02:21.210
That's where we want the agent to go, that's the finish.

02:21.210 --> 02:23.130
And the red is a fire pit.

02:23.130 --> 02:25.080
So if the agent falls into a fire pit,

02:25.080 --> 02:26.910
he will lose the game.

02:26.910 --> 02:31.350
So in the fire pit, the reward, which is R, is minus one.

02:31.350 --> 02:34.717
So that's our way of telling the agent,

02:34.717 --> 02:36.390
"That's not something we want you to do."

02:36.390 --> 02:38.640
Like remember an example of when we're training dogs,

02:38.640 --> 02:40.458
we want to tell him like, "Bad dog,"

02:40.458 --> 02:42.780
if it's not doing the right thing that we wanted it to do.

02:42.780 --> 02:43.613
Same thing here,

02:43.613 --> 02:44.446
we want to tell the agent

02:44.446 --> 02:47.010
that this is not something that you should be doing.

02:47.010 --> 02:48.270
You shouldn't be ending up in the square.

02:48.270 --> 02:49.740
So every time it does end up in that square,

02:49.740 --> 02:51.210
it'll get a minus one reward,

02:51.210 --> 02:53.520
so it'll be punished with a minus one reward.

02:53.520 --> 02:55.260
On the other hand, if it ends up in the green square,

02:55.260 --> 02:56.850
it'll get a plus one reward,

02:56.850 --> 02:59.580
meaning that that is what we wanted it to do.

02:59.580 --> 03:00.720
So those are the two rewards

03:00.720 --> 03:02.460
that the agent can possibly get.
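
NOTE
Editor's sketch of the maze and its two rewards. The slide is not part of this
transcript, so the 3 x 4 layout below is an assumption inferred from the values
worked out later in the lecture, and the names MAZE, REWARDS and
reward_for_entering are hypothetical. The zero reward for ordinary squares is
the value the lecture settles on later.
```python
# Assumed layout: 'S' start, '.' white square, '#' gray wall,
# 'G' green finish, 'F' red fire pit.
MAZE = [
    "...G",
    ".#.F",
    "S...",
]
REWARDS = {"G": 1.0, "F": -1.0}  # the two rewards named in the lecture
def reward_for_entering(row, col):
    """Reward the agent receives for stepping into the square at (row, col)."""
    return REWARDS.get(MAZE[row][col], 0.0)  # ordinary squares give 0
print(reward_for_entering(0, 3))  # green finish
print(reward_for_entering(1, 3))  # fire pit
print(reward_for_entering(2, 1))  # ordinary white square
```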

03:02.460 --> 03:06.360
And how does it learn how to operate in this maze?

03:06.360 --> 03:08.370
Just like in that example of the robot dogs

03:08.370 --> 03:09.300
that learn to walk,

03:09.300 --> 03:10.590
we're just gonna let it know...

03:10.590 --> 03:12.480
We'll just tell it that, "Here are the actions you can do,

03:12.480 --> 03:14.640
you can go up, right, left or down.

03:14.640 --> 03:16.890
Those are the four possible actions that you can take,

03:16.890 --> 03:18.354
and that's it.

03:18.354 --> 03:19.770
Have a play around with that,

03:19.770 --> 03:21.420
see what you can come up with."

03:21.420 --> 03:23.610
So the agent might go to the right,

03:23.610 --> 03:25.530
then they might go more to the right,

03:25.530 --> 03:26.670
they might go back to the left,

03:26.670 --> 03:28.260
they're just randomly pressing these buttons

03:28.260 --> 03:30.210
and they're trying to see what happens.

03:30.210 --> 03:31.170
Then they go back here,

03:31.170 --> 03:34.650
they go up, go up, go down, go up, go right.

03:34.650 --> 03:36.150
So for now, they haven't learned anything,

03:36.150 --> 03:36.983
they just...

03:36.983 --> 03:38.430
So far nothing's happened.

03:38.430 --> 03:41.820
They go right and then bam, they end up in the green square.

03:41.820 --> 03:45.570
So they realize, "Wow, I just got a plus one reward."

03:45.570 --> 03:47.910
So as soon as they stepped into the green square,

03:47.910 --> 03:49.080
they got a plus one reward.

03:49.080 --> 03:51.847
And that triggers the algorithm to say,

03:51.847 --> 03:53.820
"Okay, that's really cool.

03:53.820 --> 03:56.460
I am rewarded for ending up in this square.

03:56.460 --> 03:58.920
So I want to end up in the square."

03:58.920 --> 04:00.900
So what does that mean for the agent?

04:00.900 --> 04:02.737
That means it starts asking the question,

04:02.737 --> 04:04.320
"How did I get to the square?

04:04.320 --> 04:07.770
What was the preceding state I was in,

04:07.770 --> 04:09.930
and what action did I take to get into the square?"

04:09.930 --> 04:11.137
And then it looks back and it says,

04:11.137 --> 04:14.940
"Okay, so the preceding state was this one.

04:14.940 --> 04:17.015
It turns out to be valuable to be in that state,

04:17.015 --> 04:19.260
the one that's marked with the red arrow

04:19.260 --> 04:20.793
because from that state,

04:20.793 --> 04:25.530
I'm just one step away from getting the maximum reward

04:25.530 --> 04:28.530
I can possibly dream of, of plus one.

04:28.530 --> 04:30.510
Like a biscuit for a dog.

04:30.510 --> 04:33.240
As soon as I know, if I ever am in that state,

04:33.240 --> 04:35.190
that square marked with the red arrow,

04:35.190 --> 04:37.020
all I'll have to do is press right."

04:37.020 --> 04:39.120
So how do I tell myself,

04:39.120 --> 04:41.460
how do I remember that, that state is valuable?

04:41.460 --> 04:44.490
Well, to me there's no difference, actually.

04:44.490 --> 04:45.900
As the agent, there's no difference

04:45.900 --> 04:48.330
in whether I am in the green square

04:48.330 --> 04:49.920
or in the white square, right?

04:49.920 --> 04:51.630
In the green square I get the reward of one,

04:51.630 --> 04:53.610
so I'm going to mark for myself

04:53.610 --> 04:57.600
that the white square for me, it has the value of one

04:57.600 --> 05:00.210
because it leads exactly to reward one.

05:00.210 --> 05:01.710
So as soon as I'm in the white square,

05:01.710 --> 05:03.330
I know I'll just take one more action,

05:03.330 --> 05:05.400
I'll be in the green square and I'll get a reward of one.

05:05.400 --> 05:07.110
So that's why I'm going to say,

05:07.110 --> 05:09.390
that the value of this square is equal to one

05:09.390 --> 05:14.340
because it leads directly without any sort of subtractions.

05:14.340 --> 05:16.170
As soon as I'm in here, I know my reward will be one,

05:16.170 --> 05:18.570
so I'm gonna mark this square as V equal to one.

05:18.570 --> 05:19.410
That's the value,

05:19.410 --> 05:22.410
that's the perceived value of being in this state.

05:22.410 --> 05:24.277
Next, the agent's going to be like,

05:24.277 --> 05:27.030
"Okay, so how do I get into this square?"

05:27.030 --> 05:30.000
And you know, he might walk around again, and so on,

05:30.000 --> 05:31.447
end up in this square again and be like,

05:31.447 --> 05:33.750
"Okay, how did I get into this square before that?

05:33.750 --> 05:36.810
And the way I got into this square was from this square.

05:36.810 --> 05:37.643
Interesting.

05:37.643 --> 05:40.260
Okay, so as soon as I get into this square,

05:40.260 --> 05:42.990
I know that all I have to do is go right.

05:42.990 --> 05:45.660
And then from here I already know that I'm going to win.

05:45.660 --> 05:48.330
I know exactly how everything's gonna unravel from here.

05:48.330 --> 05:51.030
And I know the value of being in this state is equal to one.

05:51.030 --> 05:52.770
And since there's no,

05:52.770 --> 05:56.460
nothing is stopping me from going here, from here to here,

05:56.460 --> 05:59.490
the value of this one, the perceived value...

05:59.490 --> 06:03.180
I'm going to value being in here, as V equal to one as well,

06:03.180 --> 06:04.227
because as soon as I'm in here,

06:04.227 --> 06:08.160
and I'll be here pretty quickly, so I'm going to win."

06:08.160 --> 06:10.417
And then how do I get into this square before that?

06:10.417 --> 06:13.050
"Well, I got into this square from this square."

06:13.050 --> 06:15.780
So the value, a similar approach,

06:15.780 --> 06:19.230
the value of being here is also equal to one and so on.

06:19.230 --> 06:20.577
So the value of being here is equal to one,

06:20.577 --> 06:22.410
and the value of being here is equal to one

06:22.410 --> 06:24.240
because each one of them leads to the next one

06:24.240 --> 06:26.220
and leads to the finish line.

06:26.220 --> 06:29.940
So that's all like pretty logical at this stage.

06:29.940 --> 06:32.220
This is us pretty much designing

06:32.220 --> 06:33.390
the Bellman equation right now.

06:33.390 --> 06:37.050
So this is, we could possibly think about designing

06:37.050 --> 06:40.470
an equation that helps an agent go through the maze.

06:40.470 --> 06:43.440
So look at the reward, then the preceding state,

06:43.440 --> 06:45.270
give it a value equal to the reward,

06:45.270 --> 06:46.200
then the state preceding that, and so on.

06:46.200 --> 06:48.150
So it kinda like creates this pathway.

06:48.150 --> 06:50.977
It's all great and well, but the problem here is,

06:50.977 --> 06:54.120
"Okay, what happens if our agent for some reason

06:54.120 --> 06:56.790
starts in this state,

06:56.790 --> 06:58.657
instead of starting here and taking these actions,

06:58.657 --> 07:00.600
it actually starts in this state?"

07:00.600 --> 07:02.130
How does it know?

07:02.130 --> 07:04.290
How does it remember which action to take?

07:04.290 --> 07:06.510
Should it go right or should it go down,

07:06.510 --> 07:08.520
or should it maybe go left, or should it go up?

07:08.520 --> 07:09.780
How does it remember

07:09.780 --> 07:13.200
which is the next continuation from here,

07:13.200 --> 07:16.680
if the only values it has are these values equal to one?

07:16.680 --> 07:18.530
So it cannot see what's further away.

07:18.530 --> 07:20.820
It can only see, "All right, what I have here,

07:20.820 --> 07:21.960
and what I have here."

07:21.960 --> 07:23.640
How does it know which way to go?

07:23.640 --> 07:24.870
Well at this stage, it doesn't.

07:24.870 --> 07:27.687
It's pretty identical for the agent which way to go.

07:27.687 --> 07:30.668
And so that's why this approach doesn't really work.

07:30.668 --> 07:32.880
It's a very simplistic explanation.

07:32.880 --> 07:34.530
Of course, there's much more to it,

07:34.530 --> 07:38.100
but in an intuitive way that's why we cannot just assign,

07:38.100 --> 07:40.770
just carry on this value backwards like that,

07:40.770 --> 07:43.770
because one of the reasons is once the agent

07:43.770 --> 07:46.268
is in between these two values, which way is it gonna go?

07:46.268 --> 07:48.600
It can get confused like that.
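
NOTE
Editor's sketch of the tie described above, assuming a hypothetical square
that sits between two squares already marked V = 1. The names
neighbour_values and best_actions are illustrative, not from the lecture.
```python
# Naive idea: copy the reward backwards unchanged, so every marked square has V = 1.
neighbour_values = {"up": 1.0, "down": 1.0}  # hypothetical neighbours of the agent
best = max(neighbour_values.values())
# Collect every action whose neighbour reaches the best value.
best_actions = [action for action, v in neighbour_values.items() if v == best]
print(best_actions)  # more than one action ties, so the values give no direction
```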

07:48.600 --> 07:51.060
And so how do we solve this problem?

07:51.060 --> 07:52.350
What are we going to do here?

07:52.350 --> 07:54.240
And this is where we're going to start introducing,

07:54.240 --> 07:57.060
the Bellman equation in its actual form,

07:57.060 --> 07:58.650
slowly, step by step.

07:58.650 --> 08:01.620
So the Bellman equation looks something like this.

08:01.620 --> 08:02.880
So we've already talked about V,

08:02.880 --> 08:04.710
the value of being in a certain state,

08:04.710 --> 08:08.280
S is your current state or any given state.

08:08.280 --> 08:10.380
And there is S prime as well,

08:10.380 --> 08:13.320
and S prime is the state, the following state,

08:13.320 --> 08:16.980
the state that you will end up in after this state,

08:16.980 --> 08:18.960
and by taking a certain action.

08:18.960 --> 08:22.230
But we know that there's many actions that an agent can take.

08:22.230 --> 08:24.270
And that's why we've got this max over here.

08:24.270 --> 08:27.270
So by taking an action, what will happen to an agent?

08:27.270 --> 08:29.100
So let's say we are in state S,

08:29.100 --> 08:32.760
by taking an action in state S, and we take action A,

08:32.760 --> 08:34.950
what will happen is we'll instantly get a reward

08:34.950 --> 08:36.750
by getting into a new state.

08:36.750 --> 08:40.080
And remember that reward can be one or plus one or minus one

08:40.080 --> 08:41.610
if it's at the end of the game,

08:41.610 --> 08:43.650
or it can be a zero if it's throughout the game.

08:43.650 --> 08:46.260
In this case, our reward throughout the game is zero.

08:46.260 --> 08:47.940
So that's the reward.

08:47.940 --> 08:51.270
Plus, we will get into a new state

08:51.270 --> 08:55.140
which has a value, V of S prime.

08:55.140 --> 08:57.840
So that's the value of the new state and gamma.

08:57.840 --> 08:58.830
We'll talk about gamma in a second,

08:58.830 --> 09:01.140
but the point I'm trying to raise here

09:01.140 --> 09:02.790
or the point I'm raising here is that,

09:02.790 --> 09:04.560
we've got many different actions that we can take,

09:04.560 --> 09:05.820
and that's why we've got the maximum.

09:05.820 --> 09:08.040
So by taking action, we get reward,

09:08.040 --> 09:09.720
plus, we end up in a new state.

09:09.720 --> 09:11.834
And so for every out of the...

09:11.834 --> 09:13.380
in our case, we have four possible actions,

09:13.380 --> 09:15.600
for every one of the possible four actions,

09:15.600 --> 09:17.790
we're going to have an equation like this.

09:17.790 --> 09:20.400
So this is going to have a value for...

09:20.400 --> 09:21.750
They will have a different value

09:21.750 --> 09:23.460
for every one of the four actions.

09:23.460 --> 09:25.890
And we're going to look at only the maximum

09:25.890 --> 09:27.930
because, of course, the agent wants to take

09:27.930 --> 09:28.800
the optimal action.

09:28.800 --> 09:32.130
So if he's in state S, he's going to look at these values,

09:32.130 --> 09:33.480
he's gonna look, find the maximum

09:33.480 --> 09:35.460
based on the action and going to take that action

09:35.460 --> 09:37.620
that leads to the maximum of these values.

09:37.620 --> 09:39.480
So hopefully that makes sense,

09:39.480 --> 09:41.640
why we're taking the maximum here.
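
NOTE
Editor's written form of the equation as it is described in speech here. The
slide is not in the transcript, so the notation R(s, a) for the reward is an
assumption; V, s, s', the max over actions a, and gamma are the symbols the
lecture names.
```latex
V(s) = \max_{a} \bigl( R(s, a) + \gamma \, V(s') \bigr)
```
Here s' is the state reached from s by taking action a, and gamma is the
discount factor that the lecture explains next.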

09:41.640 --> 09:43.530
Then once we've got the reward and the value of the state,

09:43.530 --> 09:45.660
why do we have this gamma parameter here?

09:45.660 --> 09:49.410
Well, it's there exactly to solve that problem of,

09:49.410 --> 09:51.900
where the agent doesn't know which way to go

09:51.900 --> 09:54.555
because it's comparing the values of two states

09:54.555 --> 09:56.790
on both sides and they're the same.

09:56.790 --> 09:58.890
That's why the gamma is called the discounting factor.

09:58.890 --> 09:59.850
So we're going to have a look at that

09:59.850 --> 10:02.070
in practice now to better understand it.

10:02.070 --> 10:03.270
So let's take our formula,

10:03.270 --> 10:04.770
we'll put it here on the top right

10:04.770 --> 10:07.950
and now we will analyze what the values

10:07.950 --> 10:09.120
of these different states are.

10:09.120 --> 10:11.850
And every state here is a square in the maze.

10:11.850 --> 10:14.700
So any one of these white squares

10:14.700 --> 10:16.230
is a state and we're going to calculate

10:16.230 --> 10:18.270
the value of being in that state.

10:18.270 --> 10:19.770
So let's start with this square,

10:19.770 --> 10:21.840
what is the value of being in this state?

10:21.840 --> 10:23.700
Well, we need to take the maximum

10:23.700 --> 10:26.100
of this value across all actions.

10:26.100 --> 10:29.730
And we know that this value is maximized

10:29.730 --> 10:31.170
as we get closer to the finish line.

10:31.170 --> 10:33.600
That's how it's constructed, and just by looking at it,

10:33.600 --> 10:35.760
you can see that, because here it's got the reward,

10:35.760 --> 10:37.530
and here it's got a discounting factor

10:37.530 --> 10:41.040
multiplied by the value of the next state.

10:41.040 --> 10:42.450
And it just makes sense,

10:42.450 --> 10:44.820
that that's how we would construct that equation.

10:44.820 --> 10:47.640
So it makes sense that from here,

10:47.640 --> 10:50.340
the maximum of this value will be if we move to the right.

10:50.340 --> 10:52.140
So, that's how we calculate the value of the state.

10:52.140 --> 10:55.350
The value of this state equals the maximum,

10:55.350 --> 10:59.100
or equals this value if we move to the right.

10:59.100 --> 11:00.990
If we take an action of moving to the right,

11:00.990 --> 11:02.340
so what will this value be?

11:02.340 --> 11:05.040
Well, the reward of moving to the right is equal to one,

11:05.040 --> 11:07.710
and regardless of what gamma is,

11:07.710 --> 11:09.420
we don't have a value in this state

11:09.420 --> 11:11.700
because we are already in the best state possible.

11:11.700 --> 11:12.870
So this is the final state,

11:12.870 --> 11:15.420
it won't have a value, we just get a reward here

11:15.420 --> 11:16.260
and that's the end of the game.

11:16.260 --> 11:20.490
So the value of this maximum will be equal to one

11:20.490 --> 11:23.850
and that's why value of state S here, is equal to one.

11:23.850 --> 11:25.846
Now things get interesting when we move to the left,

11:25.846 --> 11:27.990
when we move backwards a bit.

11:27.990 --> 11:32.850
So now let's calculate the value of being in this state.

11:32.850 --> 11:34.080
And for that, we're going to need gamma,

11:34.080 --> 11:37.410
so let's say our discounting factor is a 0.9,

11:37.410 --> 11:40.110
and it'll make sense what a discounting factor is,

11:40.110 --> 11:40.980
once we calculate this.

11:40.980 --> 11:44.100
So from here, just based on our intuition

11:44.100 --> 11:46.710
and because we know how this maze is working,

11:46.710 --> 11:47.543
how this maze works,

11:47.543 --> 11:50.010
we know that the best possible action is go to the right

11:50.010 --> 11:51.510
because from here we go here.

11:51.510 --> 11:53.580
So that means the maximum will be achieved

11:53.580 --> 11:56.220
when in this state, you go to the right.

11:56.220 --> 11:58.950
And so let's see what happens if we plug it in here,

11:58.950 --> 12:00.630
so if you go from here to here,

12:00.630 --> 12:02.730
you don't get any reward, it'll still be a zero,

12:02.730 --> 12:04.800
but then you'll get gamma, so you get 0.9

12:04.800 --> 12:07.590
times the value of the new state, which is one.

12:07.590 --> 12:09.090
So in this case the value,

12:09.090 --> 12:14.040
the whole result of this is 0.9 times one equals 0.9.

12:14.040 --> 12:16.230
So that's our value, 0.9.

12:16.230 --> 12:18.600
So if we calculate this now, you'll see that from here,

12:18.600 --> 12:20.190
we know just by looking at the maze,

12:20.190 --> 12:22.170
we know because we as humans,

12:22.170 --> 12:24.930
because we're understanding how this equation works,

12:24.930 --> 12:26.640
of course, an AI,

12:26.640 --> 12:28.470
the agent would have to experiment with these things,

12:28.470 --> 12:30.690
but because we have like a crystal ball,

12:30.690 --> 12:32.130
we can see this whole maze,

12:32.130 --> 12:33.870
we have like the bird's eye view right now,

12:33.870 --> 12:36.300
we know that the best action's to go to the right.

12:36.300 --> 12:39.660
So if we plug it all in here, it'll be zero, no reward,

12:39.660 --> 12:44.660
plus 0.9 times the value in this state 0.9, is 0.81,

12:44.910 --> 12:45.743
and so on.

12:45.743 --> 12:50.430
So here it'll be 0.73 and here it'll be 0.66.
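
NOTE
Editor's sketch of the arithmetic just walked through, assuming gamma = 0.9 and
a reward of 0 on every step except the last one into the finish, as stated in
the lecture. The variable names are illustrative.
```python
GAMMA = 0.9  # discount factor used in the lecture
value = 1.0  # square next to the finish: reward 1, and the terminal state adds nothing
values_along_path = [value]
for _ in range(4):  # four more squares, walking back towards the start
    value = 0.0 + GAMMA * value  # reward 0 plus gamma times the value of the next state
    values_along_path.append(value)
print([round(v, 2) for v in values_along_path])  # [1.0, 0.9, 0.81, 0.73, 0.66]
```
The unrounded values are 0.729 and 0.6561, which the lecture rounds to 0.73 and 0.66.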

12:50.430 --> 12:53.640
So you can see that the way the discounting factor works

12:53.640 --> 12:56.970
is it discounts the value of the state

12:56.970 --> 12:58.620
as you are further away.

12:58.620 --> 13:01.530
So if you are familiar with finance theory,

13:01.530 --> 13:05.010
then it's something similar to time value of money.

13:05.010 --> 13:06.210
Like what would you...

13:06.210 --> 13:07.043
Think about it this way.

13:07.043 --> 13:10.200
What would you prefer to have $5 today

13:10.200 --> 13:13.170
or $5 in 10 days from now?

13:13.170 --> 13:15.060
Just if somebody was to give you a choice

13:15.060 --> 13:16.230
I will give you $5 today

13:16.230 --> 13:18.330
or I'll give you $5 10 days from now.

13:18.330 --> 13:20.280
Well of course you would choose $5 today.

13:20.280 --> 13:21.113
Why is that?

13:21.113 --> 13:22.560
Well, because you can take those $5

13:22.560 --> 13:25.590
and you can invest them at a certain interest rate,

13:25.590 --> 13:27.660
which is very similar to gamma

13:27.660 --> 13:32.023
and your $5 in 10 days will actually grow into maybe $5.73,

13:32.880 --> 13:34.050
or something like that.

13:34.050 --> 13:36.420
And that's how time value of money works.

13:36.420 --> 13:38.187
And very similar concept here.

13:38.187 --> 13:39.690
And the important thing to understand here,

13:39.690 --> 13:41.070
this is just a theory,

13:41.070 --> 13:43.320
a way that reinforcement learning works.

13:43.320 --> 13:46.200
So Richard Bellman came up with this equation

13:46.200 --> 13:48.734
and since then, that's how we use it.

13:48.734 --> 13:51.450
So you could go ahead and come up with a different equation.

13:51.450 --> 13:52.530
It doesn't have to have gamma,

13:52.530 --> 13:53.640
it might have some other factor,

13:53.640 --> 13:54.960
it might not even have a factor

13:54.960 --> 13:57.840
but this approach works and that's why we're using it.

13:57.840 --> 14:00.780
And this is what it visually looks like.

14:00.780 --> 14:02.310
So the further away you are,

14:02.310 --> 14:04.860
the less the value of being in this state.

14:04.860 --> 14:06.660
And in terms of time value of money,

14:06.660 --> 14:08.760
if I could say to you, "Where would you rather be?

14:08.760 --> 14:11.280
Would you rather be here, would you rather be here?"

14:11.280 --> 14:12.930
You'd say, "I would rather be here."

14:12.930 --> 14:15.900
So we're creating that same phenomenon

14:15.900 --> 14:17.040
as in time value of money.

14:17.040 --> 14:19.170
We're artificially creating it through gamma

14:19.170 --> 14:22.290
in order to incentivize agents

14:22.290 --> 14:24.690
or inspire agents to be closer to the finish line.

14:24.690 --> 14:26.527
So if an agent were to be asked,

14:26.527 --> 14:28.080
"Would you rather be here or here?"

14:28.080 --> 14:29.910
Because of the way this equation works,

14:29.910 --> 14:31.590
it would choose to be here.

14:31.590 --> 14:33.390
There's nothing more to that, nothing less.

14:33.390 --> 14:36.060
It's not something that the world works this way, no.

14:36.060 --> 14:38.910
It's just something that we're artificially creating

14:38.910 --> 14:41.640
in order for our agents to understand

14:41.640 --> 14:45.060
that this is good, this is good, this is good.

14:45.060 --> 14:47.610
They're all good, but this one is better than this one,

14:47.610 --> 14:48.870
and this one is better than this one,

14:48.870 --> 14:50.070
and this one is better than this one.

14:50.070 --> 14:53.310
And that way, you can see, or the agent can see,

14:53.310 --> 14:54.780
in which direction it needs to go.

14:54.780 --> 14:56.490
So you can see that if I'm standing here,

14:56.490 --> 14:57.720
remember that problem that we had

14:57.720 --> 14:59.940
or was he standing here?

14:59.940 --> 15:02.460
Yeah, so if he's standing here, do I go down

15:02.460 --> 15:05.190
or like, if I'm standing here, do I go up or do I go down?

15:05.190 --> 15:07.230
Well, now it's not a problem anymore

15:07.230 --> 15:09.810
because you can see that it's actually better to go up

15:09.810 --> 15:11.550
because the value is bigger here.

15:11.550 --> 15:12.810
And then from here it's better to go right,

15:12.810 --> 15:14.247
because the value is bigger here than here.

15:14.247 --> 15:15.780
And then from here it's better to go right,

15:15.780 --> 15:17.100
because the value here is bigger than here, than here.

15:17.100 --> 15:19.500
And then from here he already knows

15:19.500 --> 15:20.333
that he needs to go right

15:20.333 --> 15:22.650
because he'll get a reward here of one.

15:22.650 --> 15:24.960
So that's how this whole approach works.
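
NOTE
Editor's sketch of how the discounted values break the earlier tie, using the
lecture's numbers for the squares above (0.81) and below (0.66) the agent. The
names next_state_value, scores and best_action are hypothetical.
```python
GAMMA = 0.9  # discount factor used in the lecture
next_state_value = {"up": 0.81, "down": 0.66}  # values from the lecture's maze
# Reward is 0 for both moves, so each action scores 0 + gamma * V(next state).
scores = {action: 0.0 + GAMMA * v for action, v in next_state_value.items()}
best_action = max(scores, key=scores.get)  # action with the highest score
print(best_action)  # up
```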

15:24.960 --> 15:27.600
Now let's have a quick look at the rest of the squares.

15:27.600 --> 15:30.030
So how do we calculate the value in this square?

15:30.030 --> 15:32.490
Well, here is where things get a bit tricky.

15:32.490 --> 15:36.360
So from here, you might not actually go left, right?

15:36.360 --> 15:37.380
You might actually go right.

15:37.380 --> 15:38.970
So we can't just keep going like that

15:38.970 --> 15:41.520
because it might actually be shorter to go this way.

15:41.520 --> 15:42.360
So what we're going to do,

15:42.360 --> 15:45.000
is we're going to calculate the value in this square first.

15:45.000 --> 15:48.210
And obviously from here the best way to go is up.

15:48.210 --> 15:49.380
Again, that's because we see the whole maze,

15:49.380 --> 15:51.330
we have the crystal ball, we can see things.

15:51.330 --> 15:53.460
And you'll see further down in this section,

15:53.460 --> 15:55.440
you'll see how the agent actually explores this,

15:55.440 --> 15:58.080
understands this like through experimentation.

15:58.080 --> 16:00.150
But for us we know that it's better to go this way

16:00.150 --> 16:02.190
so we're going to calculate the value here

16:02.190 --> 16:03.780
and that's why we're going to calculate

16:03.780 --> 16:06.390
the value in this square first.

16:06.390 --> 16:09.240
So, here we have three possible actions,

16:09.240 --> 16:10.530
in reality, we actually have four.

16:10.530 --> 16:11.610
We can also go left,

16:11.610 --> 16:13.830
the agent could hypothetically press left

16:13.830 --> 16:15.420
and bump into the wall and stay here.

16:15.420 --> 16:17.970
But for simplicity's sake we're just going to show

16:17.970 --> 16:20.580
the actions that, knowing what we know

16:20.580 --> 16:21.900
and having the crystal ball, we know

16:21.900 --> 16:24.720
are the ones that actually lead to something

16:24.720 --> 16:26.850
other than the same state again.

16:26.850 --> 16:29.100
And so from here, we know that the...

16:29.100 --> 16:31.020
Again, just because we have a crystal ball

16:31.020 --> 16:33.330
we know that the best way to go is this way.

16:33.330 --> 16:35.130
An agent, of course, would have to experiment

16:35.130 --> 16:35.963
and find the best way.

16:35.963 --> 16:37.560
And you'll see how that happens,

16:37.560 --> 16:38.550
further down in this section,

16:38.550 --> 16:40.380
you'll see actually how an agent walks around

16:40.380 --> 16:43.620
and how you would experiment trying to find these values.

16:43.620 --> 16:45.360
But for us, we know it's that way.

16:45.360 --> 16:47.490
So here if we plug everything in,

16:47.490 --> 16:50.580
so the maximum, the best output is when you go up,

16:50.580 --> 16:55.580
up there is a one, gamma is 0.9, the reward is zero, so you plug that in, you get 0.9.

16:56.190 --> 16:57.510
Okay, so we calculate that one.

16:57.510 --> 16:59.814
Let's calculate this one, same approach.

16:59.814 --> 17:02.070
You have three ways you can go,

17:02.070 --> 17:03.420
actually four for the agent,

17:03.420 --> 17:05.880
but for us we can see it's only three.

17:05.880 --> 17:10.880
So 0.81, from here you have 0.73,

17:11.100 --> 17:13.320
and it actually ties in nicely with this value

17:13.320 --> 17:15.990
because if you discount again, you get 0.66.

17:15.990 --> 17:20.100
And here you have 0.73 because this is the optimal route.

17:20.100 --> 17:21.240
So there you go.

17:21.240 --> 17:23.760
Those are the values of all of these states.

17:23.760 --> 17:25.110
And now you can see that

17:25.110 --> 17:27.150
because we've created this equation,

17:27.150 --> 17:30.450
we've created synthetically this whole concept

17:30.450 --> 17:33.540
of the closer you are to the finish line,

17:33.540 --> 17:36.990
the more valuable that state is.

17:36.990 --> 17:39.570
Now, because we've created that, now it's pretty obvious

17:39.570 --> 17:41.910
for the agent which way it should go.
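
NOTE
Editor's sketch that recomputes every value in the maze by applying the
equation to each square over and over (value-iteration-style sweeps). This is
not the lecture's procedure, which fills the values in by hand with the
"crystal ball". The 3 x 4 layout is an assumption inferred from the lecture's
numbers, and MAZE, REWARD, MOVES, step and V are hypothetical names.
```python
GAMMA = 0.9  # discount factor from the lecture
MAZE = ["...G", ".#.F", "S..."]  # assumed layout: '#' wall, 'G' finish, 'F' fire pit
REWARD = {"G": 1.0, "F": -1.0}  # reward for entering a square; all others give 0
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
def step(r, c, dr, dc):
    """Square reached from (r, c); bumping into the wall or an edge means staying put."""
    nr, nc = r + dr, c + dc
    if 0 <= nr < 3 and 0 <= nc < 4 and MAZE[nr][nc] != "#":
        return nr, nc
    return r, c
# Terminal squares keep value 0: entering them only pays the reward and ends the game.
V = {(r, c): 0.0 for r in range(3) for c in range(4) if MAZE[r][c] != "#"}
for _ in range(50):  # repeated sweeps, far more than this small maze needs
    for (r, c) in V:
        if MAZE[r][c] in REWARD:
            continue  # game over here, nothing to update
        candidates = []
        for dr, dc in MOVES:
            nr, nc = step(r, c, dr, dc)
            candidates.append(REWARD.get(MAZE[nr][nc], 0.0) + GAMMA * V[(nr, nc)])
        V[(r, c)] = max(candidates)  # keep the best action's score
for r in range(3):
    print([round(V[(r, c)], 2) if (r, c) in V else None for c in range(4)])
```
Under these assumptions the non-terminal squares come out as 0.81, 0.9, 1.0 on
the top row, 0.73 and 0.9 on the middle row, and 0.66, 0.73, 0.81, 0.73 on the
bottom row, matching the numbers in the lecture.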

17:41.910 --> 17:44.880
And we'll talk more about that in the coming tutorials.

17:44.880 --> 17:47.550
I hope you enjoyed today's session.

17:47.550 --> 17:49.350
And I know it's a bit...

17:49.350 --> 17:52.320
it might sound a bit basic at this stage,

17:52.320 --> 17:54.210
but as we go through this section

17:54.210 --> 17:56.700
we will add a bit more complexity to it.

17:56.700 --> 17:58.560
At the same time, if you cannot wait,

17:58.560 --> 17:59.880
if you wanna jump into it,

17:59.880 --> 18:02.010
then there's a paper which you can look at

18:02.010 --> 18:04.320
and it is the original paper by Richard Bellman,

18:04.320 --> 18:08.340
it's called "The Theory of Dynamic Programming" from 1954.

18:08.340 --> 18:10.200
And you can find it at this link.

18:10.200 --> 18:11.280
And there you go.

18:11.280 --> 18:12.870
So you can jump straight into it

18:12.870 --> 18:16.620
and read from the author of the Bellman equation.

18:16.620 --> 18:18.180
But just bear in mind that

18:18.180 --> 18:20.970
this is quite a mathematically heavy paper.

18:20.970 --> 18:22.830
And on that note, I look forward to seeing you next time.

18:22.830 --> 18:24.753
And until then, enjoy AI.