WEBVTT

00:00.300 --> 00:02.490
-: Hello and welcome to this tutorial.

00:02.490 --> 00:04.950
This special tutorial is gonna be super exciting

00:04.950 --> 00:08.670
because we are getting closer to the A3C algorithm.

00:08.670 --> 00:09.570
You're gonna see

00:09.570 --> 00:11.490
that what we're about to implement,

00:11.490 --> 00:14.240
and that is called Eligibility Trace, or Sarsa.

00:14.240 --> 00:17.400
It is actually an algorithm of the asynchronous

00:17.400 --> 00:19.320
actor-critic agents algorithms.

00:19.320 --> 00:21.450
But we cannot consider it an A3C because

00:21.450 --> 00:23.460
we're still going to have one agent.

00:23.460 --> 00:24.840
But still, you're gonna see

00:24.840 --> 00:26.460
that what we're about to implement

00:26.460 --> 00:29.130
is actually taken from the following paper,

00:29.130 --> 00:31.590
which is this paper: Asynchronous Methods

00:31.590 --> 00:33.750
for Deep Reinforcement Learning.

00:33.750 --> 00:35.716
And it is in this paper that we'll find

00:35.716 --> 00:39.150
the A3C algorithm that we'll implement

00:39.150 --> 00:41.040
as the final bonus of this course.

00:41.040 --> 00:42.420
But as I said,

00:42.420 --> 00:45.090
we are getting closer to it because the model

00:45.090 --> 00:47.549
that we will implement right now is actually

00:47.549 --> 00:50.007
this one: The asynchronous

00:50.007 --> 00:51.507
n-step Q-learning.

00:51.507 --> 00:52.830
That's the one.

00:52.830 --> 00:54.630
So that's almost the A3C,

00:54.630 --> 00:58.170
which is the one after that, but with one agent.

00:58.170 --> 00:59.970
And the powerful thing about this

00:59.970 --> 01:02.160
is this n-step Q-learning.

01:02.160 --> 01:04.650
We are going to learn the cumulative rewards

01:04.650 --> 01:06.450
and learn the cumulative target

01:06.450 --> 01:09.277
on n-steps instead of one step like previously.

01:09.277 --> 01:11.640
And that's what will make the training

01:11.640 --> 01:12.660
much more performant

01:12.660 --> 01:15.270
and therefore our AI much more powerful.

01:15.270 --> 01:17.520
So we actually have the pseudocode

01:17.520 --> 01:18.810
for this algorithm.

01:18.810 --> 01:20.784
It's this algorithm S2 right here.

01:20.784 --> 01:23.250
So let's click on it, and there we go.

01:23.250 --> 01:25.860
That's the algorithm we are about to implement.

01:25.860 --> 01:28.140
But remember with only one agent.

01:28.140 --> 01:30.777
The difference is that here they take an action

01:30.777 --> 01:34.410
a_t according to the epsilon-greedy policy based

01:34.410 --> 01:36.840
on the Q values for the current state

01:36.840 --> 01:38.190
and the action played.

01:38.190 --> 01:39.390
But in our case,

01:39.390 --> 01:42.030
we didn't implement an epsilon-greedy policy.

01:42.030 --> 01:43.890
We implemented a Softmax,

01:43.890 --> 01:45.180
but the rest is the same.

01:45.180 --> 01:46.080
As you can see,

01:46.080 --> 01:48.090
we are gonna compute a cumulative reward

01:48.090 --> 01:49.110
on n-steps.

01:49.110 --> 01:50.220
Actually 10 steps,

01:50.220 --> 01:52.440
remember that n-steps is equal to 10.

01:52.440 --> 01:55.140
And so we will implement this line of code

01:55.140 --> 01:56.760
in our algorithm that we're about

01:56.760 --> 01:58.140
to implement right now.

01:58.140 --> 01:59.010
We're gonna get this

01:59.010 --> 02:00.870
and mostly we are gonna implement

02:00.870 --> 02:02.130
this as well.

02:02.130 --> 02:03.780
You'll see that we will get

02:03.780 --> 02:05.460
the maximum of the Q values

02:05.460 --> 02:07.590
for the current state and the current action.

02:07.590 --> 02:11.094
And this theta here, is just a target parameter.

02:11.094 --> 02:12.870
So let's do this.

02:12.870 --> 02:15.330
Let's attack this algorithm.

02:15.330 --> 02:17.610
This one is called the asynchronous

02:17.610 --> 02:18.900
n-step Q-learning,

02:18.900 --> 02:20.670
but we don't have the right to say

02:20.670 --> 02:22.770
asynchronous as far as we're concerned

02:22.770 --> 02:24.660
because we only have one agent.

02:24.660 --> 02:25.860
But therefore we can call it

02:25.860 --> 02:27.572
n-step Q-learning, Eligibility Trace,

02:27.572 --> 02:29.759
or even Sarsa.

02:29.759 --> 02:31.770
All right, so let's do this.

02:31.770 --> 02:33.060
It's gonna be pretty fun.

02:33.060 --> 02:36.120
We can basically follow the pseudocode here

02:36.120 --> 02:37.350
and that's what we're gonna do.

02:37.350 --> 02:38.790
And so as you can see,

02:38.790 --> 02:40.560
a parameter that we'll need is

02:40.560 --> 02:42.263
a gamma, the gamma parameter.

02:42.263 --> 02:44.250
That is the decay parameter.

02:44.250 --> 02:46.050
And therefore we will start

02:46.050 --> 02:47.760
by introducing a variable

02:47.760 --> 02:50.790
for this gamma parameter and choosing a value.

02:50.790 --> 02:51.870
So let's do this.

02:51.870 --> 02:54.030
We actually don't need a class to implement this.

02:54.030 --> 02:56.310
We can simply implement this with a function

02:56.310 --> 02:57.150
because you know,

02:57.150 --> 02:58.800
we don't really need to create objects

02:58.800 --> 03:00.840
for this Eligibility Trace model,

03:00.840 --> 03:02.160
a function will be enough

03:02.160 --> 03:03.780
because basically what we wanna do

03:03.780 --> 03:05.790
is to return the inputs

03:05.790 --> 03:06.930
and the targets

03:06.930 --> 03:09.240
so that later when training the AI,

03:09.240 --> 03:11.070
we are ready to minimize the distance

03:11.070 --> 03:13.410
between the predictions and the targets.

03:13.410 --> 03:15.480
And to get the predictions, we need the inputs

03:15.480 --> 03:17.430
because we are gonna apply our brain

03:17.430 --> 03:19.710
on the input to get the output signals.

03:19.710 --> 03:21.240
That will be our predictions.

03:21.240 --> 03:22.950
And then once we have our predictions

03:22.950 --> 03:23.910
and our targets

03:23.910 --> 03:26.160
we will be ready to train the AI

03:26.160 --> 03:28.376
by trying to minimize the squared distance

03:28.376 --> 03:30.660
between the predictions and the targets.

03:30.660 --> 03:32.820
So that's the whole point of doing this right now

03:32.820 --> 03:35.130
we are implementing this function to be able to

03:35.130 --> 03:37.500
return these inputs and these targets

03:37.500 --> 03:39.420
so that we can be ready for the training to

03:39.420 --> 03:41.160
minimize the squared distance,

03:41.160 --> 03:43.020
predictions minus targets.

03:43.020 --> 03:44.280
All right, so let's do this.

03:44.280 --> 03:46.350
As we said, we want to implement a function.

03:46.350 --> 03:48.270
So we start with def, this function

03:48.270 --> 03:52.830
we're gonna call it eligibility_trace,

03:52.830 --> 03:54.210
you can also call it Sarsa,

03:54.210 --> 03:56.460
you can also call it n-step Q-learning,

03:56.460 --> 03:57.420
whatever you want,

03:57.420 --> 03:59.910
but let's call it Eligibility Trace.

03:59.910 --> 04:02.305
And this function is gonna take one argument

04:02.305 --> 04:05.220
which is going to be a batch.

04:05.220 --> 04:06.210
And why a batch?

04:06.210 --> 04:07.740
It's because we're gonna get

04:07.740 --> 04:10.125
some inputs and some targets

04:10.125 --> 04:13.530
because we're gonna train the AI on batches.

04:13.530 --> 04:15.870
And so the inputs and the targets will go

04:15.870 --> 04:18.390
inside some batches, and therefore the input

04:18.390 --> 04:20.670
argument here is this batch

04:20.670 --> 04:22.380
that will contain several inputs

04:22.380 --> 04:25.470
and then several targets that we will compute.

04:25.470 --> 04:26.760
So, there we go.

04:26.760 --> 04:28.500
That's the only argument we need.

04:28.500 --> 04:30.090
Now, let's go inside the function

04:30.090 --> 04:32.310
and let's define what we need to do.

04:32.310 --> 04:35.413
So as we saw in the pseudocode of the paper,

04:35.413 --> 04:37.530
we need a gamma parameter.

04:37.530 --> 04:38.730
So as we said,

04:38.730 --> 04:42.150
we start by introducing this gamma parameter.

04:42.150 --> 04:42.990
So gamma equals,

04:42.990 --> 04:45.420
and we can already decide on a value

04:45.420 --> 04:48.390
and we are gonna choose 0.99.

04:48.390 --> 04:49.800
That's a classic good value

04:49.800 --> 04:51.750
for the gamma, and no worries,

04:51.750 --> 04:55.337
I checked that this is a good value for our AI.

04:55.337 --> 04:57.180
All right then, next step.

04:57.180 --> 04:59.050
Next step is to prepare

05:00.073 --> 05:02.433
our inputs and our targets

05:02.433 --> 05:05.220
because that's exactly what we want to return.

05:05.220 --> 05:06.330
We want to return the inputs

05:06.330 --> 05:08.940
and the targets to prepare the training.

05:08.940 --> 05:11.430
And so we can already initialize them

05:11.430 --> 05:13.080
with an empty list,

05:13.080 --> 05:13.913
because of course,

05:13.913 --> 05:15.750
in these inputs inside the batch

05:15.750 --> 05:18.480
we're gonna have several inputs all into a list.

05:18.480 --> 05:20.520
And that's why I'm initializing the inputs

05:20.520 --> 05:23.610
as a list as well as the targets.

05:23.610 --> 05:24.990
There we go.

05:24.990 --> 05:27.540
So we initialized our inputs and our targets

05:27.540 --> 05:30.104
and in the end, this Eligibility Trace function

05:30.104 --> 05:33.330
will return exactly these inputs

05:33.330 --> 05:35.895
and these targets, which will of course be filled in.

05:35.895 --> 05:37.500
We will have several inputs

05:37.500 --> 05:39.330
and the associated several targets

05:39.330 --> 05:41.433
in what will be returned by the function.

05:42.300 --> 05:43.320
All right, next step.

05:43.320 --> 05:45.900
Next step is to start a for loop.

05:45.900 --> 05:46.733
And that's exactly

05:46.733 --> 05:48.870
because we're following the pseudocode

05:48.870 --> 05:50.250
of the paper.

05:50.250 --> 05:52.710
This pseudocode, and as you can see,

05:52.710 --> 05:54.937
there is this repeat code section.

05:54.937 --> 05:56.910
And repeat is exactly a for loop

05:56.910 --> 05:58.440
from pseudocode.

05:58.440 --> 05:59.580
We are gonna compute

05:59.580 --> 06:02.100
the cumulative reward right here

06:02.100 --> 06:03.930
accumulated over the 10 steps.

06:03.930 --> 06:05.280
And how is it computed?

06:05.280 --> 06:08.040
Well, in each step that is not the last step,

06:08.040 --> 06:10.260
We're gonna get the maximum of the Q values

06:10.260 --> 06:11.093
of the current state

06:11.093 --> 06:13.230
we're in during this n-steps run.

06:13.230 --> 06:16.200
And if we reach the last state of the 10 steps,

06:16.200 --> 06:17.880
well, this will be equal to zero.

06:17.880 --> 06:20.040
That is, we don't want to update it anymore.

06:20.040 --> 06:21.270
And then we have this for loop

06:21.270 --> 06:23.430
which is going to be another for loop.

06:23.430 --> 06:25.560
They don't say repeat here, but that's the same.

06:25.560 --> 06:26.430
It's going to be

06:26.430 --> 06:28.560
a second for loop in our algorithm.

06:28.560 --> 06:31.050
Well, we will update the reward this way

06:31.050 --> 06:33.253
by multiplying it by the decay parameter, gamma,

06:33.253 --> 06:35.970
and adding the reward.

06:35.970 --> 06:36.960
So let's do this.

06:36.960 --> 06:38.310
Let's go back to Python

06:38.310 --> 06:40.710
and let's start our for loop.

06:40.710 --> 06:41.543
So for,

06:41.543 --> 06:45.060
And what is going to be the iterator variable?

06:45.060 --> 06:47.910
Well, that's going to be our 10 step series.

06:47.910 --> 06:50.430
You know, our series of 10 transitions.

06:50.430 --> 06:53.406
So we're gonna call this variable 'series'.

06:53.406 --> 06:56.880
So that represents a series of 10 transitions,

06:56.880 --> 06:58.830
like a sequence of 10 transitions.

06:58.830 --> 07:00.846
So for series in,

07:00.846 --> 07:02.580
And then what do you think?

07:02.580 --> 07:05.262
Well, our series will belong to our batch

07:05.262 --> 07:08.250
that is the batches on which we'll train the AI.

07:08.250 --> 07:10.980
And so for series in batch,

07:10.980 --> 07:12.030
that is for all the series

07:12.030 --> 07:14.790
of 10 transitions in our input batch.

07:14.790 --> 07:17.160
Well, what are we going to do?

07:17.160 --> 07:18.840
Well, to get the cumulative reward,

07:18.840 --> 07:20.070
you will see in the pseudocode

07:20.070 --> 07:21.390
that we need the state

07:21.390 --> 07:23.490
of the first transition of the series

07:23.490 --> 07:25.230
and also the state of the last

07:25.230 --> 07:26.700
transition of the series.

07:26.700 --> 07:28.110
So what we have to do right now

07:28.110 --> 07:29.772
is get these input states.

07:29.772 --> 07:32.711
And so we are gonna put these two input states

07:32.711 --> 07:36.270
into a variable that we're gonna call 'input'.

07:36.270 --> 07:39.360
And we will get these two input states

07:39.360 --> 07:40.590
the first one of the series

07:40.590 --> 07:42.540
and the last one, that we're gonna put

07:42.540 --> 07:44.433
into a NumPy array.

07:45.780 --> 07:46.770
But no worries,

07:46.770 --> 07:48.360
we will not stay with this NumPy array.

07:48.360 --> 07:49.860
We will of course convert that

07:49.860 --> 07:51.090
into a PyTorch variable.

07:51.090 --> 07:52.856
But the first step is

07:52.856 --> 07:54.240
to put these two input states,

07:54.240 --> 07:55.530
the first one and the last one

07:55.530 --> 07:57.240
into a NumPy array.

07:57.240 --> 07:59.550
And so right here in this NumPy array,

07:59.550 --> 08:01.188
we add the first input,

08:01.188 --> 08:02.730
which is the input state

08:02.730 --> 08:04.830
of the first transition of the series,

08:04.830 --> 08:06.453
and that is series.

08:07.800 --> 08:09.240
And then to take the first transition

08:09.240 --> 08:11.550
we take the index zero of the series

08:11.550 --> 08:12.812
that's the first transition,

08:12.812 --> 08:14.372
and then we can access it

08:14.372 --> 08:18.270
by taking its attribute, which is state.

08:18.270 --> 08:20.760
And that's because in our experience replay file

08:20.760 --> 08:22.920
we defined a special structure

08:22.920 --> 08:24.420
for each of the transitions.

08:24.420 --> 08:25.710
And you know, in this structure

08:25.710 --> 08:27.150
each transition is composed

08:27.150 --> 08:29.188
of a state, an action, a reward,

08:29.188 --> 08:32.100
and then the last element, which is done.

08:32.100 --> 08:34.290
So this special structure that we're allowed to

08:34.290 --> 08:35.610
use right now comes

08:35.610 --> 08:36.930
from the way we defined

08:36.930 --> 08:39.270
a transition in experience replay.

08:39.270 --> 08:40.440
All right, so with this

08:40.440 --> 08:43.470
we get the input state of the first transition.

08:43.470 --> 08:45.349
And now let's get also the input state

08:45.349 --> 08:48.780
of the last transition of the series.

08:48.780 --> 08:50.010
And to do this, that's the same.

08:50.010 --> 08:52.030
We can just copy this

08:53.520 --> 08:56.880
and paste it and replace the zero here

08:56.880 --> 08:59.310
by the last index of the series,

08:59.310 --> 09:00.143
which we can access

09:00.143 --> 09:02.160
with this trick, minus one.

09:02.160 --> 09:03.930
series[-1].state

09:03.930 --> 09:05.340
will get the input state

09:05.340 --> 09:08.250
of the last transition of the series.
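
NOTE
A minimal sketch of the indexing described above, not the course's actual file.
It assumes each transition is a namedtuple with the fields state, action,
reward and done; the name Step and the toy values are hypothetical, and plain
lists stand in for the game frames.
```python
from collections import namedtuple
# Hypothetical stand-in for the transition structure from the experience replay file.
Step = namedtuple("Step", ["state", "action", "reward", "done"])
# A toy series of three transitions instead of ten.
series = [
    Step(state=[0.0, 0.1], action=1, reward=0.0, done=False),
    Step(state=[0.2, 0.3], action=0, reward=1.0, done=False),
    Step(state=[0.4, 0.5], action=2, reward=0.5, done=True),
]
first_state = series[0].state   # input state of the first transition
last_state = series[-1].state   # index -1 selects the last transition
print(first_state, last_state)
```
Negative indexing works on any Python sequence, so the same two lines apply
whatever the length of the series.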

09:08.250 --> 09:09.083
All right?

09:09.083 --> 09:11.146
Then we need to put these two elements

09:11.146 --> 09:14.170
inside some square brackets

09:15.240 --> 09:16.830
because that's what is expected

09:16.830 --> 09:18.900
by the NumPy array function.

09:18.900 --> 09:20.910
And then an important thing to do,

09:20.910 --> 09:24.240
since we are going to convert that into a torch

09:24.240 --> 09:26.280
tensor in a torch variable,

09:26.280 --> 09:27.960
well remember a torch tensor is

09:27.960 --> 09:29.742
by definition a special array

09:29.742 --> 09:31.890
containing one single type.

09:31.890 --> 09:34.740
And so we need to force having one single type

09:34.740 --> 09:37.770
and as usual, we're gonna choose the float type.

09:37.770 --> 09:39.670
And so I'm adding this parameter here

09:40.806 --> 09:44.965
dtype equals np.float32,

09:44.965 --> 09:47.640
can take this one,

09:47.640 --> 09:49.830
and now we can convert that

09:49.830 --> 09:52.530
into a torch tensor in a torch variable.

09:52.530 --> 09:53.730
So let's do this.

09:53.730 --> 09:56.010
To do this, well first let's convert

09:56.010 --> 09:57.570
that into a torch tensor.

09:57.570 --> 10:02.570
And remember, we can use torch.from_numpy.

10:03.586 --> 10:04.920
There we go.

10:04.920 --> 10:07.830
And we put all the array of the two input states

10:07.830 --> 10:10.320
inside this torch tensor

10:10.320 --> 10:12.570
with the torch.from_numpy function.

10:12.570 --> 10:13.403
Perfect.

10:13.403 --> 10:15.630
So that will convert this array

10:15.630 --> 10:17.952
of the two input states into a torch tensor.

10:17.952 --> 10:20.071
And now we put this torch tensor

10:20.071 --> 10:24.423
into a torch variable using the Variable class.

10:25.500 --> 10:28.440
So input will be an object of the Variable class.

10:28.440 --> 10:30.660
And in fact, as you understood

10:30.660 --> 10:33.540
this Variable class takes all this

10:33.540 --> 10:36.279
as an argument and that creates the object.

10:36.279 --> 10:38.460
All right, so now we should be good.

10:38.460 --> 10:40.980
We have our two inputs that we need.

10:40.980 --> 10:43.230
That is the input state of the first transition

10:43.230 --> 10:45.750
and the input state of the last transition.

10:45.750 --> 10:47.880
And now, now that we have the input,

10:47.880 --> 10:49.020
well, what can we get?

10:49.020 --> 10:51.270
We can get the output signal

10:51.270 --> 10:52.620
of the brain of the AI,

10:52.620 --> 10:54.000
that is the prediction.

10:54.000 --> 10:56.273
But we're gonna call it 'output'.

10:56.273 --> 10:58.110
That's the output signal.

10:58.110 --> 10:59.310
And to get the output,

10:59.310 --> 11:00.480
Well, now that's very easy

11:00.480 --> 11:03.180
because we already have a brain created

11:03.180 --> 11:05.550
which is our convolutional neural network.

11:05.550 --> 11:09.409
And so we can simply take our brain CNN applied

11:09.409 --> 11:13.920
to the input, which will return the prediction.

11:13.920 --> 11:16.800
That is the output, as simple as that.

11:16.800 --> 11:18.300
And now we are already ready

11:18.300 --> 11:19.700
to move on to the next step.

11:20.700 --> 11:22.501
And the next step is to start to

11:22.501 --> 11:25.050
compute this cumulative reward.

11:25.050 --> 11:26.880
So now we're gonna do exactly the same

11:26.880 --> 11:29.227
as the S2 algorithm, the Sarsa,

11:29.227 --> 11:30.630
or should we call it

11:30.630 --> 11:32.250
n-step Q-learning.

11:32.250 --> 11:33.750
We are going to introduce

11:33.750 --> 11:36.690
the cumul_reward variable

11:36.690 --> 11:38.970
which will be the cumulative reward.

11:38.970 --> 11:41.280
And let's go back to the paper.

11:41.280 --> 11:43.290
As you can see right now, what we have to do to

11:43.290 --> 11:46.200
get this cumulative reward, which is R here, well

11:46.200 --> 11:49.260
at each step of the 10 steps run,

11:49.260 --> 11:50.490
we need to update it

11:50.490 --> 11:53.910
by adding a zero to this cumulative reward.

11:53.910 --> 11:55.380
If we reached the last state

11:55.380 --> 11:58.950
of the series or the maximum of the Q values,

11:58.950 --> 12:00.450
if we haven't reached the last state

12:00.450 --> 12:02.340
of the series that is for all the steps

12:02.340 --> 12:03.750
except the last step.

12:03.750 --> 12:05.720
So let's simply implement this.

12:05.720 --> 12:07.620
So let's go back to Python.

12:07.620 --> 12:10.181
So this cumulative reward, as we just saw,

12:10.181 --> 12:11.860
is going to be equal to

12:13.107 --> 12:17.250
0.0 if we reached the last state.

12:17.250 --> 12:19.440
And we can write this condition this way,

12:19.440 --> 12:23.070
if series of index minus one,

12:23.070 --> 12:26.220
that is the last transition of the series,

12:26.220 --> 12:28.200
then we add .done.

12:28.200 --> 12:30.870
Because done actually is an attribute of, you know

12:30.870 --> 12:32.280
this transition structure

12:32.280 --> 12:33.510
that we defined in experience

12:33.510 --> 12:35.760
replay, our experience replay file.

12:35.760 --> 12:37.879
And this done comes from actually

12:37.879 --> 12:39.900
the OpenAI structures

12:39.900 --> 12:42.540
because if we go to the OpenAI Gym website

12:42.540 --> 12:44.905
which is actually right here, I prepared it.

12:44.905 --> 12:47.310
So that's DoomCorridor-v0.

12:47.310 --> 12:50.580
And if we go to documentation

12:50.580 --> 12:53.160
and then if we, so that's the tutorial

12:53.160 --> 12:55.440
I really encourage you to have a look at it.

12:55.440 --> 12:56.970
You can run an environment,

12:56.970 --> 12:58.380
but mostly you can see

12:58.380 --> 13:00.090
that our observations

13:00.090 --> 13:02.730
that is, our transitions, are defined

13:02.730 --> 13:06.990
by an observation, a reward, and this done here.

13:06.990 --> 13:08.760
And this done means exactly

13:08.760 --> 13:12.090
that a transition or a step is over.

13:12.090 --> 13:14.010
And so we're gonna use this done here

13:14.010 --> 13:15.570
for our if condition.

13:15.570 --> 13:18.990
Therefore, if series[-1].done means

13:18.990 --> 13:21.480
if the last transition of the series

13:21.480 --> 13:23.363
is over. Is completed.

13:23.363 --> 13:25.380
And so this cumulative reward

13:25.380 --> 13:26.910
is going to be equal to zero

13:26.910 --> 13:29.070
if the last transition of the series is done.

13:29.070 --> 13:30.810
And else, if,

13:30.810 --> 13:33.060
we haven't reached the last transition,

13:33.060 --> 13:36.794
well cumulative reward is going to be updated with,

13:36.794 --> 13:41.190
as we said, the maximum of the Q values.

13:41.190 --> 13:43.920
And since this output here

13:43.920 --> 13:45.120
is the output of the brain,

13:45.120 --> 13:47.490
that is the predictions of the neural network.

13:47.490 --> 13:49.575
And as you know, the predictions

13:49.575 --> 13:51.600
of the neural network are the predicted Q values.

13:51.600 --> 13:55.110
Well, this output contains the Q values.

13:55.110 --> 13:57.840
And since we need to take the max of the Q values

13:57.840 --> 14:00.480
well we need to add first this index

14:00.480 --> 14:03.300
because the structure contains the Q values

14:03.300 --> 14:07.020
in the index one, and then we need to add .data to

14:07.020 --> 14:09.480
access the data of this output structure.

14:09.480 --> 14:10.313
You know,

14:10.313 --> 14:12.240
it has the special structure of a torch variable.

14:12.240 --> 14:14.400
So with this, we get our Q values

14:14.400 --> 14:15.390
and then we want to take

14:15.390 --> 14:17.190
the maximum of our Q values

14:17.190 --> 14:19.600
and so simply we add .max()

14:20.490 --> 14:23.580
and now we get exactly what we want,

14:23.580 --> 14:24.750
as in the paper.

14:24.750 --> 14:26.820
This maximum of the Q values,

14:26.820 --> 14:28.648
for the non-terminal state s_t.
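
NOTE
A minimal sketch of the starting value of the cumulative reward, under stated
assumptions. In the video the Q-values come from the network output, a torch
Variable; here a plain list of floats stands in for them, and the names Step,
initial_cumul_reward and q_values_last_state are hypothetical.
```python
from collections import namedtuple
# Hypothetical stand-in for the transition structure from the experience replay file.
Step = namedtuple("Step", ["state", "action", "reward", "done"])
def initial_cumul_reward(series, q_values_last_state):
    """Return 0.0 if the series ended on a done step, else the largest Q-value."""
    return 0.0 if series[-1].done else max(q_values_last_state)
# Two toy series: one still running, one whose last transition is done.
running = [Step("s0", 0, 0.0, False), Step("s1", 1, 1.0, False)]
finished = [Step("s0", 0, 0.0, False), Step("s1", 1, 1.0, True)]
q_values = [0.2, 1.5, -0.3]
print(initial_cumul_reward(running, q_values))
print(initial_cumul_reward(finished, q_values))
```
The conditional expression mirrors the two cases of the pseudocode: zero for a
terminal state, the maximum Q-value otherwise.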

14:28.648 --> 14:30.240
Perfect.

14:30.240 --> 14:31.590
And so now what we're gonna do

14:31.590 --> 14:33.900
is make this second for loop.

14:33.900 --> 14:36.240
That is for the 10 steps of the series.

14:36.240 --> 14:38.010
We are going to update

14:38.010 --> 14:39.690
the cumulative reward this way:

14:39.690 --> 14:43.560
by multiplying first by gamma, the decay parameter,

14:43.560 --> 14:44.850
which we already have,

14:44.850 --> 14:46.380
and then adding the reward.

14:46.380 --> 14:47.580
So let's do this.

14:47.580 --> 14:49.590
We are actually going to do exactly the same

14:49.590 --> 14:51.000
as in the pseudocode.

14:51.000 --> 14:53.010
-: As you can notice, they start from the right

14:53.010 --> 14:54.000
so they're not starting

14:54.000 --> 14:56.730
with the first step and going to the last step.

14:56.730 --> 15:00.120
They start with the last step T minus one

15:00.120 --> 15:02.250
up to the first step, t_start.

15:02.250 --> 15:03.390
So that's exactly what we're gonna do

15:03.390 --> 15:05.550
and that's because we want to get in the end

15:05.550 --> 15:07.584
the cumulative reward being

15:07.584 --> 15:10.705
equal to R = r0 + gamma * r1

15:10.705 --> 15:14.610
+ gamma^2 * r2 + ...

15:14.610 --> 15:17.610
+ gamma^10 * r10,

15:17.610 --> 15:20.051
where r1, r2, ..., r10

15:20.051 --> 15:22.601
are the rewards obtained

15:22.601 --> 15:25.590
in each of the n steps of the series.
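
NOTE
A minimal sketch of why the loop runs from the last step back to the first,
assuming plain Python floats for the rewards. The names discounted_return and
bootstrap are hypothetical; bootstrap plays the role of the starting value of
the cumulative reward (zero, or the maximum Q-value of the last state).
```python
def discounted_return(rewards, gamma=0.99, bootstrap=0.0):
    """Accumulate rewards from the last step back to the first."""
    cumul_reward = bootstrap
    for reward in reversed(rewards):
        # Same update as in the pseudocode: R = r_i + gamma * R.
        cumul_reward = reward + gamma * cumul_reward
    return cumul_reward
rewards = [1.0, 0.0, 2.0]
gamma = 0.5
backward = discounted_return(rewards, gamma)
# The same quantity written out as the explicit sum r0 + gamma*r1 + gamma^2*r2.
explicit = sum(gamma ** k * r for k, r in enumerate(rewards))
print(backward, explicit)
```
Each pass multiplies everything accumulated so far by gamma once more, so the
earliest reward ends up with no discount and later rewards get higher powers.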

15:25.590 --> 15:26.700
So let's take a quick break

15:26.700 --> 15:28.350
before attacking this second for loop

15:28.350 --> 15:30.000
and I'll see you in the next tutorial.

15:30.000 --> 15:31.833
Until then, enjoy AI.
