WEBVTT

00:01.170 --> 00:02.700
Instructor: Hello and welcome back to the course

00:02.700 --> 00:04.740
on artificial intelligence.

00:04.740 --> 00:08.070
Today we're talking about the temporal difference.

00:08.070 --> 00:10.050
Now, it's a very important tutorial

00:10.050 --> 00:13.110
because temporal difference is the heart and soul

00:13.110 --> 00:15.090
of the Q-learning algorithm.

00:15.090 --> 00:18.720
This is actually how everything we've learned so far

00:18.720 --> 00:22.410
comes together inside Q-learning.

00:22.410 --> 00:23.910
So let's have a look.

00:23.910 --> 00:26.190
Remember the time when we talked about deterministic

00:26.190 --> 00:28.071
versus non-deterministic search?

00:28.071 --> 00:30.743
And remember how we said in this case

00:30.743 --> 00:33.619
that when the agent wants to go up, he definitely goes up,

00:33.619 --> 00:36.150
and in this case, when he wants to go up,

00:36.150 --> 00:37.650
there's a 10% chance he'll go a little left,

00:37.650 --> 00:39.180
10% chance he'll go right

00:39.180 --> 00:42.420
and then 80% chance he'll go straight up?

00:42.420 --> 00:45.030
Well, these numbers are, of course, arbitrary

00:45.030 --> 00:46.410
and can be different.

00:46.410 --> 00:49.230
And this whole concept could be different

00:49.230 --> 00:50.670
in different problems.

00:50.670 --> 00:53.250
So, it doesn't have to concern which way he's moving

00:53.250 --> 00:55.050
just that there's some randomness,

00:55.050 --> 00:57.240
something that's out of the control of the agent

00:57.240 --> 01:00.030
happening inside this environment.
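
NOTE
A minimal sketch of the 80/10/10 slip described above, assuming a grid where "left" and "right" are taken relative to the intended direction. The names SLIP_LEFT, SLIP_RIGHT and actual_move are illustrative and not from the course, and the probabilities are the arbitrary ones from the example.
```python
import random
# Hypothetical lookup tables: the direction the agent drifts into when it slips.
SLIP_LEFT = {"north": "west", "west": "south", "south": "east", "east": "north"}
SLIP_RIGHT = {"north": "east", "east": "south", "south": "west", "west": "north"}
def actual_move(intended, u):
    """Map an intended move and a uniform draw u in [0, 1) to the move that happens."""
    if u < 0.8:
        return intended              # 80%: the agent goes where it wanted
    if u < 0.9:
        return SLIP_LEFT[intended]   # 10%: slips to the left of the intended direction
    return SLIP_RIGHT[intended]      # 10%: slips to the right of the intended direction
# Usage: sample one noisy step for an agent that wants to go north.
print(actual_move("north", random.random()))
```
Passing the random draw in as an argument keeps the function itself deterministic, which makes the three cases easy to check by hand.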

01:00.030 --> 01:01.740
And the effect that had,

01:01.740 --> 01:06.450
as you remember, was that in the deterministic example

01:06.450 --> 01:09.240
it was very easy to calculate the V-values.

01:09.240 --> 01:11.040
Well, not necessarily always very easy,

01:11.040 --> 01:13.080
but in our case we could just simply calculate them

01:13.080 --> 01:15.150
by using the Bellman equation,

01:15.150 --> 01:17.340
and we had the exact values.

01:17.340 --> 01:21.036
And then, as you remember, I very carefully mentioned

01:21.036 --> 01:26.010
that these values for the non-deterministic search example

01:26.010 --> 01:27.810
are off the top of my head.

01:27.810 --> 01:28.710
They're not calculated.

01:28.710 --> 01:29.580
You know, last...

01:29.580 --> 01:31.950
That time I said we're not just going to calculate them

01:31.950 --> 01:34.710
because it's very complex, but the computer could do it.

01:34.710 --> 01:37.140
And we just went along with these values

01:37.140 --> 01:39.630
that are just values that I made up,

01:39.630 --> 01:41.310
but they did get the job done,

01:41.310 --> 01:43.260
they helped us understand the concepts.

01:43.260 --> 01:45.120
Well, now we're going to return to that a little bit

01:45.120 --> 01:47.820
and understand what exactly is going on here.

01:47.820 --> 01:51.960
Why is it so much harder to calculate these values

01:51.960 --> 01:55.440
in the non-deterministic example or generally speaking

01:55.440 --> 01:58.290
in these problems in these environments

01:58.290 --> 01:59.580
and the agent going through them,

01:59.580 --> 02:03.000
why can it be so hard to calculate these values?

02:03.000 --> 02:04.740
Well, when you think about it,

02:04.740 --> 02:06.389
it's because when the agent moves,

02:06.389 --> 02:08.970
for instance from here to the right,

02:08.970 --> 02:11.370
he doesn't necessarily always move that way.

02:11.370 --> 02:12.750
Sometimes there's a chance

02:12.750 --> 02:13.710
that he'll go to...

02:13.710 --> 02:16.020
When instead of going straight,

02:16.020 --> 02:18.663
so let's call these north, east, south, west,

02:18.663 --> 02:23.663
instead of going west, the agent might sometimes go south.

02:24.750 --> 02:27.330
And for instance, from here, instead of going north,

02:27.330 --> 02:29.460
he might sometimes go east.

02:29.460 --> 02:30.293
So sorry.

02:30.293 --> 02:33.210
So here, instead of going east, he might sometimes go south.

02:33.210 --> 02:34.680
And here instead of going north,

02:34.680 --> 02:36.960
he might sometimes go east or west.

02:36.960 --> 02:38.220
And here, instead of going north,

02:38.220 --> 02:41.193
he might sometimes go east or west, and so on.

02:42.270 --> 02:44.580
And therefore, in order to calculate this value,

02:44.580 --> 02:46.740
you would need to know what this value is.

02:46.740 --> 02:47.820
But the interesting thing

02:47.820 --> 02:49.380
is that in order to calculate this value,

02:49.380 --> 02:51.090
you need to know what this value is.

02:51.090 --> 02:53.610
So there's a lot of recursion happening here

02:53.610 --> 02:57.360
and therefore you cannot just define what these values are.

02:57.360 --> 03:01.140
And on top of that, this recursion is not deterministic.

03:01.140 --> 03:03.000
Sometimes it happens this way,

03:03.000 --> 03:04.980
sometimes instead of going up, he'll go right,

03:04.980 --> 03:07.170
sometimes instead of going up, he'll go left.

03:07.170 --> 03:08.790
Sometimes instead of...

03:08.790 --> 03:10.530
When he wants to go up, he will go up.

03:10.530 --> 03:12.870
So it is subject to chance.

03:12.870 --> 03:16.650
And so maybe many times the agent will go through this path

03:16.650 --> 03:17.940
and he'll go up, up, up, up, up.

03:17.940 --> 03:19.650
And he'll think that from here

03:19.650 --> 03:20.820
he always kind of goes up,

03:20.820 --> 03:22.980
and so the value of the state will be good,

03:22.980 --> 03:25.161
and then all of a sudden he'll drop into the pit

03:25.161 --> 03:27.286
and this value will go down.

03:27.286 --> 03:29.250
And so therefore, you can see

03:29.250 --> 03:31.980
how there is some stochasticity or randomness

03:31.980 --> 03:33.930
to this whole calculation of these values

03:33.930 --> 03:35.340
because they're all interlinked,

03:35.340 --> 03:37.560
plus on top you've got that randomness

03:37.560 --> 03:39.810
that's inherent in the environment

03:39.810 --> 03:42.540
because it's a Markov decision process.

03:42.540 --> 03:45.270
So, that's where all this comes together

03:45.270 --> 03:46.800
and that's where we're going to introduce

03:46.800 --> 03:48.690
the concept of the temporal difference

03:48.690 --> 03:52.470
which will allow the agent to calculate these values.

03:52.470 --> 03:55.560
And here we were dealing with V-values

03:55.560 --> 03:57.600
and since then we've already moved on to Q-values,

03:57.600 --> 03:59.310
so that's what we're going to be working with,

03:59.310 --> 04:01.980
we're going to be looking at Q-values.

04:01.980 --> 04:06.180
So as you recall, this is our Bellman equation for Q-values.

04:06.180 --> 04:11.180
So a Q-value or the value of performing

04:11.190 --> 04:14.220
an action A in state S is equal

04:14.220 --> 04:17.250
to the reward that you get after performing that action,

04:17.250 --> 04:19.470
so immediately after performing that action.

04:19.470 --> 04:23.880
Plus you get gamma times

04:23.880 --> 04:26.910
the sum over all the possible next states.

04:26.910 --> 04:29.880
So you kind of get the expected value of the state

04:29.880 --> 04:31.650
that you will end up in.

04:31.650 --> 04:33.060
So, as you recall, that is our formula

04:33.060 --> 04:35.280
for the Bellman equation.

04:35.280 --> 04:37.410
And now just for simplicity's sake

04:37.410 --> 04:39.923
we're going to rewrite it in the old-fashioned way

04:39.923 --> 04:43.680
in the way that we used to talk about the Bellman equation

04:43.680 --> 04:45.840
before we knew about stochasticity.

04:45.840 --> 04:48.900
So as you remember, this was our Bellman equation

04:48.900 --> 04:52.650
in the sense of a deterministic search example.

04:52.650 --> 04:55.200
Because here you don't have that expected value,

04:55.200 --> 04:57.690
you don't have the sum across all probabilities

04:57.690 --> 05:00.360
you just have that as if it's determined

05:00.360 --> 05:01.590
where you're going to end up,

05:01.590 --> 05:03.030
what state you're going to end up

05:03.030 --> 05:05.400
and then you're taking the max in that one state.

05:05.400 --> 05:07.230
And the reason why we are rewriting it

05:07.230 --> 05:09.090
is simply the only reason

05:09.090 --> 05:12.210
is because it is just easier to write it

05:12.210 --> 05:14.580
and it'll be easier for us to follow along with the formula.

05:14.580 --> 05:16.080
So, we're going to just remember

05:16.080 --> 05:19.410
that we replaced this part with this part

05:19.410 --> 05:23.340
and also you'll find this notation in a lot of literature,

05:23.340 --> 05:25.410
so it'll be easier for you to follow along

05:25.410 --> 05:28.380
with other sources if you're studying those.

05:28.380 --> 05:31.140
But do remember that in fact what we mean

05:31.140 --> 05:33.600
is this probabilistic approach here

05:33.600 --> 05:35.460
instead of this notation.

05:35.460 --> 05:37.560
It's just easier for us to operate with this

05:37.560 --> 05:39.120
and understand what's going on,

05:39.120 --> 05:40.650
and just kind of, like, look at the equations

05:40.650 --> 05:42.810
so that they're not too cluttered.

05:42.810 --> 05:44.850
But once again, just remember that in fact

05:44.850 --> 05:47.926
what we mean is this probabilistic approach over here.
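
NOTE
For reference, the two forms being compared can be written out as follows. This is a sketch in standard notation, assuming R(s, a) is the reward, gamma the discount factor, s' the next state and P(s, a, s') the transition probability; the symbols on the slide may differ slightly.
```latex
% Probabilistic form: expected value over the states the agent may end up in
Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s, a, s') \max_{a'} Q(s', a')
% Simplified, deterministic-style notation used for the rest of the tutorial
Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')
```
The second line is only shorthand: the sum over next states is still what is meant whenever the environment is random.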

05:47.926 --> 05:50.160
And so, we are actually nearly done.

05:50.160 --> 05:52.170
So, let's have a look at what's going on.

05:52.170 --> 05:56.460
So, here is our blank state of the maze.

05:56.460 --> 05:58.350
We don't have any Q-values.

05:58.350 --> 06:00.000
Let's see, or we may...

06:00.000 --> 06:01.770
But let's just keep it blank for now.

06:01.770 --> 06:05.580
Let's just look at one of the states, one of the cells,

06:05.580 --> 06:06.680
this one specifically.

06:07.800 --> 06:11.220
And here we have for instance, for the action of going up

06:11.220 --> 06:14.340
we have a Q-value that we've calculated.

06:14.340 --> 06:17.010
So it's not that we don't have any Q-values yet,

06:17.010 --> 06:19.920
we do but we're just not illustrating anything,

06:19.920 --> 06:22.650
we're just keeping it blank for simplicity's sake.

06:22.650 --> 06:25.560
But the agent's been walking around for some time,

06:25.560 --> 06:27.720
and let's say hypothetically

06:27.720 --> 06:32.640
somehow he's calculated this Q-value of going up or north

06:32.640 --> 06:35.520
from this state, from this specific cell.

06:35.520 --> 06:38.130
And the value is Q of S and A.

06:38.130 --> 06:41.730
And so now what we have, so he is currently

06:41.730 --> 06:43.080
where this blue arrow is pointing,

06:43.080 --> 06:45.540
the agent is sitting in this cell,

06:45.540 --> 06:48.570
and now he needs to make a choice, where is he gonna go?

06:48.570 --> 06:51.990
And he knows the value of this, of the action going north.

06:51.990 --> 06:54.720
And that is Q of S and A.

06:54.720 --> 06:56.787
And here I'm saying before,

06:56.787 --> 06:57.960
and the reason for that

06:57.960 --> 07:00.240
is because that is before he takes action,

07:00.240 --> 07:03.360
he hasn't taken the action yet so he's still in the cell,

07:03.360 --> 07:05.171
and before he's taken the action,

07:05.171 --> 07:08.910
the value here is Q of S and A.

07:08.910 --> 07:11.370
And now he actually takes the action.

07:11.370 --> 07:13.650
So, let's say he decides this is the best one,

07:13.650 --> 07:16.710
he takes the action and he moves up to this cell.

07:16.710 --> 07:20.790
Well, now what happens is now comes after.

07:20.790 --> 07:22.320
So after he has taken the action,

07:22.320 --> 07:24.360
we can measure what is this value.

07:24.360 --> 07:25.620
Let's just calculate this value.

07:25.620 --> 07:29.070
The reward for taking that action,

07:29.070 --> 07:32.580
plus gamma times the maximum Q-value of this new state

07:32.580 --> 07:35.580
that he's just gotten into, S-Prime.

07:35.580 --> 07:39.060
And so the maximum across all possible actions in S-Prime.

07:39.060 --> 07:44.060
And so what we have here is the value before of that action,

07:44.790 --> 07:47.670
and then we've calculated this metric afterwards.

07:47.670 --> 07:50.880
But as you can recall from the previous formula...

07:50.880 --> 07:53.040
So if we could go back very quickly

07:53.040 --> 07:55.650
from the previous formula,

07:55.650 --> 07:58.920
what we just calculated is indeed the...

07:58.920 --> 08:02.190
That is how Q of S and A is calculated.

08:02.190 --> 08:05.820
So this right part, we've just calculated it separately

08:05.820 --> 08:08.310
but after we've taken action,

08:08.310 --> 08:12.930
so, once again, before, we knew a Q of S and A value,

08:12.930 --> 08:14.310
something that we've calculated

08:14.310 --> 08:15.800
through our iterations previously.

08:15.800 --> 08:16.980
So something...

08:16.980 --> 08:19.980
So a value that's stored in our memory,

08:19.980 --> 08:22.200
so just like a number that we know.

08:22.200 --> 08:25.230
And now, after the action's been performed,

08:25.230 --> 08:28.140
we know what reward he actually got,

08:28.140 --> 08:30.122
what reward the agent actually got,

08:30.122 --> 08:33.330
and we can calculate this new value.

08:33.330 --> 08:36.960
So in essence, we're kind of recalculating this value

08:36.960 --> 08:39.030
but now with new information.

08:39.030 --> 08:41.580
The new information is the reward that we got,

08:41.580 --> 08:43.590
and plus what state we ended up in

08:43.590 --> 08:45.257
and what the maximum across that state is,

08:45.257 --> 08:50.130
what this new value is for that specific state

08:50.130 --> 08:50.963
we're looking at.

08:50.963 --> 08:54.450
So, the value of being in that state

08:54.450 --> 08:56.520
is basically the Q of S and A

08:56.520 --> 08:58.470
but given new information.

08:58.470 --> 09:02.130
And now the temporal difference is defined

09:02.130 --> 09:07.130
as TD of A and S, the difference between these two.

09:07.680 --> 09:11.760
So, here the first element is your after value

09:11.760 --> 09:14.670
so the kind of like Q of S and A

09:14.670 --> 09:16.500
but calculated afterwards

09:16.500 --> 09:18.780
and the second is the previous Q of S and A

09:18.780 --> 09:22.050
which you had stored in your memory.
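
NOTE
A minimal sketch of the temporal difference just defined, assuming the Q-values are kept in a nested dictionary q[state][action]. The function name, the dictionary layout and the numbers are illustrative, not from the course.
```python
def temporal_difference(q, s, a, reward, s_next, gamma):
    """TD(a, s) = [reward + gamma * max over a' of Q(s', a')] - Q(s, a)."""
    after = reward + gamma * max(q[s_next].values())  # value recomputed after the move
    before = q[s][a]                                  # value stored from earlier iterations
    return after - before
# Usage with made-up numbers: a stored estimate for going north, and a better-looking next state.
q = {"s": {"north": 0.5}, "s_next": {"north": 1.0, "east": 0.2}}
td = temporal_difference(q, "s", "north", reward=0.0, s_next="s_next", gamma=0.9)
print(td)
```
A positive result means the move turned out better than the stored estimate suggested; a negative one means it turned out worse.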

09:22.050 --> 09:24.270
And so the question is are they different?

09:24.270 --> 09:26.220
So ideally they should be the same,

09:26.220 --> 09:29.070
ideally this should be the same as this,

09:29.070 --> 09:31.800
simply because this is the formula for calculating this.

09:31.800 --> 09:35.040
But the thing is that this is not something we calculate.

09:35.040 --> 09:38.100
This is something that we have from empirical evidence

09:38.100 --> 09:39.690
something that we have from just going

09:39.690 --> 09:41.340
through the maze many times and calculating.

09:41.340 --> 09:44.370
So this is something we've come up with so far

09:44.370 --> 09:46.830
it's not related to the current iteration.

09:46.830 --> 09:48.870
It's something that we came up with previously

09:48.870 --> 09:49.703
a long time...

09:49.703 --> 09:52.110
Not a long time ago, but in one of our previous iterations

09:52.110 --> 09:53.490
going through the maze,

09:53.490 --> 09:55.890
whereas this is something we've calculated just now

09:55.890 --> 09:59.007
and there's no guarantee that they're going to be the same

09:59.007 --> 10:02.490
because of the randomness that exists in the maze,

10:02.490 --> 10:04.740
because this could have been calculated

10:04.740 --> 10:07.680
when certain random events were triggered

10:07.680 --> 10:08.790
and this could be calculated

10:08.790 --> 10:11.730
when different random events were triggered.

10:11.730 --> 10:14.070
And so now let's rewrite that over here,

10:14.070 --> 10:15.690
let's just move it up there.

10:15.690 --> 10:16.890
So how do we use this?

10:16.890 --> 10:20.400
The question is, okay, so we have this temporal difference.

10:20.400 --> 10:21.360
How do we use this?

10:21.360 --> 10:23.550
And why is it called a temporal difference?

10:23.550 --> 10:25.320
Well, the reason it's called a temporal difference

10:25.320 --> 10:29.010
is because you're basically calculating the same thing.

10:29.010 --> 10:30.840
You're calculating Q of S and A,

10:30.840 --> 10:33.630
so the Q-value of that action.

10:33.630 --> 10:36.360
You're calculating it here and you're calculating it here,

10:36.360 --> 10:38.310
but the difference is time.

10:38.310 --> 10:41.850
This is your Q of S and A previously,

10:41.850 --> 10:46.140
this is your Q of S and A now, your new Q of S and A.

10:46.140 --> 10:49.080
And the question is, has there been a difference?

10:49.080 --> 10:52.080
Has there been a shift between them in time?

10:52.080 --> 10:54.480
And how can we use this to our advantage

10:54.480 --> 10:57.030
if there indeed has been a shift in time?

10:57.030 --> 10:59.347
Well, one thing we could do is we could say,

10:59.347 --> 11:03.990
"Okay, well, you know, our Q of S and A, this new value

11:03.990 --> 11:04.860
doesn't equal the old,

11:04.860 --> 11:05.970
so we're gonna get rid of the old,

11:05.970 --> 11:07.290
we'll forget about the old

11:07.290 --> 11:09.990
and we'll just use this as a new value."

11:09.990 --> 11:11.910
But that would not be smart.

11:11.910 --> 11:14.909
And the reason for that is that in our environment

11:14.909 --> 11:18.090
random events sometimes happen.

11:18.090 --> 11:22.200
And what if our old Q of S and A was something

11:22.200 --> 11:25.740
that, you know, consistently happens, like, 80% of the time

11:25.740 --> 11:27.390
and then, like, was represented

11:27.390 --> 11:28.320
by what happens 80% of the time,

11:28.320 --> 11:33.270
and then this new one is just what happened due to randomness?

11:33.270 --> 11:35.040
In that case we're gonna throw away

11:35.040 --> 11:39.750
the one that is responsible for the bulk of the situation

11:39.750 --> 11:41.190
and we're going to replace it with something

11:41.190 --> 11:43.890
that happens only 10 or 20% of the time.

11:43.890 --> 11:46.980
That wouldn't be the best approach to go.

11:46.980 --> 11:49.140
And that's exactly why

11:49.140 --> 11:52.020
we don't want to completely change our Q-values.

11:52.020 --> 11:55.620
We want to, like, change them step by step,

11:55.620 --> 11:56.880
a little bit by little bit.

11:56.880 --> 11:58.230
And that's why we're going to use

11:58.230 --> 12:00.810
this temporal difference in a specific way.

12:00.810 --> 12:02.790
So we're going to say, here's a formula,

12:02.790 --> 12:05.550
we're going to take our Q of S and A,

12:05.550 --> 12:07.140
and we're going to update it in such a way

12:07.140 --> 12:09.570
we're gonna take the old value of Q of S and A

12:09.570 --> 12:13.380
and we are going to add alpha times the temporal difference.

12:13.380 --> 12:15.720
So alpha is going to be our learning rate.

12:15.720 --> 12:17.400
That's a new parameter that we're introducing.

12:17.400 --> 12:20.040
That's how quickly the algorithm is learning.

12:20.040 --> 12:22.500
So, basically we're taking this difference

12:22.500 --> 12:24.990
and whatever it is we're adding it on

12:24.990 --> 12:27.210
to our previous Q of S and A.

12:27.210 --> 12:29.097
Now, this formula probably doesn't make any sense

12:29.097 --> 12:31.770
or, like, just by looking at it, it doesn't make any sense

12:31.770 --> 12:33.990
because you've got Q of S and A here, and Q of S and A here,

12:33.990 --> 12:34.823
it's the same thing,

12:34.823 --> 12:36.870
so they probably should negate each other.

12:36.870 --> 12:40.350
But we're going to rewrite this in a bit of a different way.

12:40.350 --> 12:41.610
So I'm just gonna show you again.

12:41.610 --> 12:44.160
So I'm just adding time to these formulas.

12:44.160 --> 12:46.470
So here is QT minus one, the previous.

12:46.470 --> 12:48.870
Here is QT minus one, the previous.

12:48.870 --> 12:51.150
Here is QT, the new.

12:51.150 --> 12:53.100
There should be a circle here, a circle here as well,

12:53.100 --> 12:54.210
but never mind.

12:54.210 --> 12:56.430
And here we've got alpha temporal difference,

12:56.430 --> 12:58.740
the new current temporal difference.

12:58.740 --> 13:00.487
So you can see what we're doing.

13:00.487 --> 13:04.890
We're saying, "Okay, our current Q

13:04.890 --> 13:06.990
is going to be equal to our previous Q

13:06.990 --> 13:11.130
plus whatever temporal difference we found times alpha."
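
NOTE
The update just described can be sketched in a few lines. This assumes the simplified notation from earlier; the function name q_update and the numbers are made up for illustration.
```python
def q_update(q_prev, reward, max_q_next, alpha, gamma):
    """Return Q_t(s, a) = Q_{t-1}(s, a) + alpha * TD_t(a, s)."""
    td = reward + gamma * max_q_next - q_prev   # temporal difference for this step
    return q_prev + alpha * td                  # move the old estimate a fraction alpha of the way
# Usage with made-up numbers: nudge an old estimate toward the new evidence.
print(q_update(q_prev=0.5, reward=0.0, max_q_next=1.0, alpha=0.1, gamma=0.9))
```
With a small alpha the stored value moves only slightly, so a single unlucky step cannot wipe out what was learned before.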

13:11.130 --> 13:14.790
This formula over here is the heart and soul

13:14.790 --> 13:16.290
of the Q-learning algorithm.

13:16.290 --> 13:18.330
This is how the Q-values are updated

13:18.330 --> 13:21.870
and it's good that we've already learned what Q-values are

13:21.870 --> 13:25.410
what gamma is, what R is, and what all of this stuff is.

13:25.410 --> 13:27.990
And now all we need to see is that

13:27.990 --> 13:30.450
you have a previous Q-value.

13:30.450 --> 13:31.950
Yes, it's good.

13:31.950 --> 13:34.529
And then what can happen is that

13:34.529 --> 13:37.410
when you actually do take the action,

13:37.410 --> 13:39.030
when the agent takes action,

13:39.030 --> 13:40.920
he will know he'll get a reward,

13:40.920 --> 13:42.248
and he'll end up in a state.

13:42.248 --> 13:45.907
And so based on that, he can calculate,

13:45.907 --> 13:46.740
"Uh-huh, okay.

13:46.740 --> 13:50.100
So, what is, what would've, what should have been

13:50.100 --> 13:53.520
the Q-value of that move that I made?"

13:53.520 --> 13:56.430
And now that is this part of the equation.

13:56.430 --> 13:59.340
Subtracting the old Q-value gives you your temporal difference,

13:59.340 --> 14:03.738
and now you need to take alpha times the temporal difference.

14:03.738 --> 14:05.850
And that's how you're going to adjust your Q-value,

14:05.850 --> 14:08.190
that's what you're going to adjust your Q-value by.

14:08.190 --> 14:09.990
And now just to finish off,

14:09.990 --> 14:11.820
this is kind of, like, this is sufficient

14:11.820 --> 14:12.990
to understand what's going on,

14:12.990 --> 14:15.150
but just to clarify things even more

14:15.150 --> 14:18.450
or perhaps maybe confuse things even more,

14:18.450 --> 14:19.650
what we're going to do is we're going to take

14:19.650 --> 14:21.000
this temporal difference

14:21.000 --> 14:21.890
or this temporal difference over here,

14:21.890 --> 14:24.210
we're going to plug it into this formula.

14:24.210 --> 14:26.400
So we're going to take all of this part

14:26.400 --> 14:28.110
and plug it into this formula

14:28.110 --> 14:29.910
and end up with a huge equation.

14:29.910 --> 14:32.610
So here we go, there's our equation.

14:32.610 --> 14:35.010
So this is the full equation

14:35.010 --> 14:38.550
with the temporal difference written out completely.

14:38.550 --> 14:41.130
And the reason I wrote this out,

14:41.130 --> 14:42.630
well, first of all you'll probably find this

14:42.630 --> 14:45.720
in other literature if you study it.

14:45.720 --> 14:47.850
And the second thing is that it makes something

14:47.850 --> 14:48.683
a bit more complex.

14:48.683 --> 14:50.220
Yes, the formula's longer

14:50.220 --> 14:52.290
but it also makes some things a bit clearer.

14:52.290 --> 14:55.980
So, for instance, you can see here the role alpha plays

14:55.980 --> 14:58.230
you can see it better, because look at this,

14:58.230 --> 15:00.750
here you've got QT minus one

15:00.750 --> 15:03.750
and here you've got QT minus one with a negative sign.

15:03.750 --> 15:06.930
So if you plug in alpha equal to one,

15:06.930 --> 15:11.930
if you put a one in here, then this will cancel with this,

15:12.120 --> 15:13.860
so they'll destroy each other.

15:13.860 --> 15:16.470
And all you'll have left is this part.

15:16.470 --> 15:18.840
And what that means is exactly that situation

15:18.840 --> 15:22.860
where we said, "All right, so we've got a new value

15:22.860 --> 15:24.810
what it should have been,

15:24.810 --> 15:27.270
let's update our Q-value with the new value

15:27.270 --> 15:29.700
and forget about whatever we had previously."

15:29.700 --> 15:31.440
And as we discussed, it's not the best approach

15:31.440 --> 15:34.620
because there are random events here

15:34.620 --> 15:37.500
and we want to update things step by step.

15:37.500 --> 15:41.160
And on the other hand, if you set alpha equal to zero

15:41.160 --> 15:43.350
what happens then is that you completely forget

15:43.350 --> 15:45.939
about this whole part and your QT,

15:45.939 --> 15:47.820
the new one or the current one

15:47.820 --> 15:49.500
is always going to be equal to the previous one.

15:49.500 --> 15:51.690
So you're not gonna be learning anything.

15:51.690 --> 15:53.250
And that means whatever's happening

15:53.250 --> 15:54.600
in the maze doesn't matter

15:54.600 --> 15:57.810
because you've decided on your QT value a long time ago

15:57.810 --> 15:58.857
and you're just gonna keep it.

15:58.857 --> 16:00.870
And so that's why alpha shouldn't be zero

16:00.870 --> 16:03.210
or shouldn't be one, it should be somewhere between,

16:03.210 --> 16:06.600
and it's going to allow you to learn slowly, step by step.
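
NOTE
A small sketch of the role alpha plays in the expanded equation, assuming "target" stands for the freshly computed R + gamma * max Q part. The name blend_update and the numbers are made up for illustration.
```python
def blend_update(q_prev, target, alpha):
    """Expanded update: Q_t = Q_{t-1} + alpha * (target - Q_{t-1})."""
    return q_prev + alpha * (target - q_prev)
q_prev, target = 2.0, 6.0   # made-up stored value and freshly computed target
print(blend_update(q_prev, target, alpha=1.0))   # old value is discarded, result equals the target
print(blend_update(q_prev, target, alpha=0.0))   # nothing is learned, result equals the old value
print(blend_update(q_prev, target, alpha=0.5))   # result lands between the two
```
Any alpha strictly between zero and one gives a weighted mix of the old value and the new evidence.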

16:06.600 --> 16:08.640
It's going to allow you to as your...

16:08.640 --> 16:10.950
Or the agent, as it goes through the maze,

16:10.950 --> 16:12.960
is gonna get this temporal difference,

16:12.960 --> 16:15.810
and slowly but surely this value

16:15.810 --> 16:18.000
is going to get updated and updated, updated.

16:18.000 --> 16:21.360
And what will happen eventually is that

16:21.360 --> 16:25.650
at some point hopefully the algorithm will converge.

16:25.650 --> 16:28.260
And what that means is that this temporal difference

16:28.260 --> 16:30.840
will start becoming closer and closer to zero

16:30.840 --> 16:32.520
and eventually it'll be just, well

16:32.520 --> 16:35.274
very close to zero or even zero, zero, zero, zero, zero.

16:35.274 --> 16:38.077
And what that means is that every single time

16:38.077 --> 16:43.077
your new QT value or your new calculated value

16:43.260 --> 16:45.210
what it should have been, so not this one,

16:45.210 --> 16:47.190
but what it hypothetically should have been

16:47.190 --> 16:49.590
after you take the step will be just equal

16:49.590 --> 16:52.380
to your previous QT minus one value, and then it's zero.

16:52.380 --> 16:54.630
And that means when your temporal difference is zero,

16:54.630 --> 16:57.000
it means your algorithm has converged

16:57.000 --> 16:59.220
and it's not really necessary

16:59.220 --> 17:02.670
to continue updating what's going on,

17:02.670 --> 17:06.270
it's not necessary to continue updating your Q-values.
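
NOTE
A rough sketch of the convergence idea, assuming a tiny deterministic corridor instead of the maze so that the temporal difference really can shrink to zero. The corridor, the reward of 1 at the goal, and the stopping threshold are all assumptions for illustration; in a random environment with a fixed alpha the temporal difference would typically keep fluctuating rather than settle exactly.
```python
GAMMA, ALPHA = 0.9, 0.5
N = 4                                   # states 0..3; state 3 is the goal (terminal)
def step(s, a):
    """Deterministic move: a is -1 (left) or +1 (right); reward 1 on reaching the goal."""
    s_next = min(max(s + a, 0), N - 1)
    return s_next, (1.0 if s_next == N - 1 else 0.0)
q = {s: {-1: 0.0, +1: 0.0} for s in range(N)}
sweeps = 0
while True:
    biggest_td = 0.0
    for s in range(N - 1):              # the terminal state is never updated, so its Q-values stay 0
        for a in (-1, +1):
            s_next, r = step(s, a)
            td = r + GAMMA * max(q[s_next].values()) - q[s][a]
            q[s][a] += ALPHA * td       # the Q-learning update
            biggest_td = max(biggest_td, abs(td))
    sweeps += 1
    if biggest_td < 1e-6:               # temporal difference is numerically zero: stop updating
        break
print(sweeps, q[0][+1])
```
Once the largest temporal difference in a full sweep is essentially zero, further updates no longer change the Q-values, which is what convergence means here.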

17:06.270 --> 17:08.970
The caveat here is that the only time,

17:08.970 --> 17:10.980
yeah, probably one of the only times

17:10.980 --> 17:12.810
when you would still want to continue

17:12.810 --> 17:15.420
performing this whole, you know,

17:15.420 --> 17:17.220
updating of the Q-values,

17:17.220 --> 17:19.140
is if the environment is constantly changing.

17:19.140 --> 17:21.270
So it's not that it just has some

17:21.270 --> 17:23.250
random stochastic events in it,

17:23.250 --> 17:26.310
but the environment itself is modifying,

17:26.310 --> 17:29.010
is morphing, is changing with time,

17:29.010 --> 17:30.840
so you continuously need to learn

17:30.840 --> 17:33.450
because it's not possible for you to learn everything

17:33.450 --> 17:35.640
and come up with the optimal policy,

17:35.640 --> 17:37.560
because the optimal policy is also changing

17:37.560 --> 17:39.080
with the environment all the time.

17:39.080 --> 17:41.580
In that case, you will need to continue calculating

17:41.580 --> 17:44.384
temporal difference and calculating the Q-values.

17:44.384 --> 17:45.420
But other than that,

17:45.420 --> 17:46.830
that's kind of, like, an extra complication.

17:46.830 --> 17:49.410
Other than that, this is how Q-values are updated.

17:49.410 --> 17:54.090
So this is the main formula of the Q-learning algorithm

17:54.090 --> 17:56.250
and this is kind of, like, the expanded version of that.

17:56.250 --> 17:58.800
And now it should all come together

17:58.800 --> 18:00.990
and make sense why we have the Bellman equation

18:00.990 --> 18:03.390
and not only what it represents, the Q-values,

18:03.390 --> 18:08.100
but also how the agent goes about updating its Q-values

18:08.100 --> 18:12.150
and finding exactly what is going on in that environment

18:12.150 --> 18:14.610
so it can come up with the optimal policy.

18:14.610 --> 18:16.530
So, I know this is quite a lot to take in,

18:16.530 --> 18:19.260
but hopefully you enjoyed today's tutorial

18:19.260 --> 18:22.101
and hopefully you were able to take away

18:22.101 --> 18:25.860
the underlying concepts and the intuition behind Q-values,

18:25.860 --> 18:29.759
and what the whole notion of temporal difference is

18:29.759 --> 18:31.110
and why it's important,

18:31.110 --> 18:34.740
why it helps us slowly train our agents

18:34.740 --> 18:37.770
and get them to understand the environments

18:37.770 --> 18:39.240
that they're operating in.

18:39.240 --> 18:40.800
And if you'd like to learn a bit more

18:40.800 --> 18:42.270
about temporal differences,

18:42.270 --> 18:45.097
then a very popular paper is

18:45.097 --> 18:48.450
"Learning to Predict by the Method of Temporal Differences"

18:48.450 --> 18:51.003
by Richard Sutton, from 1988.

18:52.650 --> 18:55.170
We've already had a reference by Richard Sutton as well,

18:55.170 --> 18:56.400
but this is another one.

18:56.400 --> 18:57.540
And actually he has a book,

18:57.540 --> 19:00.870
so if you get into, you know, his writing style

19:00.870 --> 19:03.510
and his style of communication,

19:03.510 --> 19:05.820
then check out his book as well.

19:05.820 --> 19:07.740
It's kind of, like, a more expanded version

19:07.740 --> 19:08.640
of all of these things.

19:08.640 --> 19:11.880
I haven't read the book, but that's what I'm imagining.

19:11.880 --> 19:14.640
At the same time, this is the link to the paper

19:14.640 --> 19:18.780
and you can learn a bit more about or probably a lot more

19:18.780 --> 19:21.270
about temporal differences there.

19:21.270 --> 19:23.010
And I hope you enjoyed today's tutorial

19:23.010 --> 19:24.240
and I look forward to seeing you next time.

19:24.240 --> 19:26.433
Until then, enjoy AI.