WEBVTT

00:00.750 --> 00:02.820
Speaker: Hello and welcome to this tutorial.

00:02.820 --> 00:06.630
So now the agent has done its exploration and then

00:06.630 --> 00:10.200
what it's about to do is to update the shared network.

00:10.200 --> 00:13.560
So, the first thing we're gonna do is initialize

00:13.560 --> 00:15.000
the cumulative reward.

00:15.000 --> 00:20.000
We're gonna call it R, capital R, and we will initialize it

00:20.070 --> 00:24.810
as a torch tensor, but that will have dimensions one by one.

00:24.810 --> 00:27.900
Because it's just a value, but we want it to be a tensor.

00:27.900 --> 00:32.900
And so, I'm using here dot zeros and then one, one.

00:33.750 --> 00:36.350
So basically the cumulative reward is initialized

00:36.350 --> 00:38.520
to zero, okay?
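
NOTE
A minimal sketch of the initialization just described, assuming PyTorch is
imported as torch. The name R follows the lecture; nothing else is assumed.
```python
import torch
# Cumulative reward as a 1x1 tensor holding the single value 0.
R = torch.zeros(1, 1)
print(R.shape)
```
The 1x1 shape keeps R a tensor rather than a plain Python number.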

00:38.520 --> 00:42.330
Then, if we are not done, that is

00:42.330 --> 00:45.060
if the game is not over, what we want right now

00:45.060 --> 00:48.480
is the cumulative reward to be equal to the value

00:48.480 --> 00:51.840
of the last state reached by the shared network.

00:51.840 --> 00:54.690
So we're gonna get the value output, you know

00:54.690 --> 00:57.900
the value of the V function, output of our model

00:57.900 --> 01:01.710
and this is the value we will give to the cumulative reward.

01:01.710 --> 01:04.320
So, let's first get this value.

01:04.320 --> 01:07.710
We can get it this way: value, then, you know,

01:07.710 --> 01:10.680
since we only want the value, we can add here

01:10.680 --> 01:13.200
underscore and then underscore again.

01:13.200 --> 01:17.190
And then we get our model because it'll output this value

01:17.190 --> 01:19.740
but only the first output of the model.

01:19.740 --> 01:21.840
Thanks to these two underscores here.

01:21.840 --> 01:25.440
And here we can just copy paste what we have here.

01:25.440 --> 01:30.440
That is, the inputs of the model with the input images

01:30.510 --> 01:33.210
and the tuple of the hidden states and the cell states.

01:33.210 --> 01:35.850
So I'm just pasting that, and there we go.

01:35.850 --> 01:37.800
We will get the value.

01:37.800 --> 01:42.800
And so now, what we're gonna do is give this value to R.

01:43.470 --> 01:46.020
So R will be equal to value

01:46.020 --> 01:49.740
and to access the value we add this dot data here.

01:49.740 --> 01:50.573
All right.
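
NOTE
A rough sketch of this bootstrapping step. TinyActorCritic is a hypothetical
stand-in for the lecture's shared model, and its sizes, the random state and
the done flag are all made up for illustration. The only thing taken from the
lecture is that the model returns the value first and takes the input plus the
(hidden state, cell state) tuple.
```python
import torch
import torch.nn as nn
# Hypothetical stand-in model: returns (value, action scores, (hx, cx)).
class TinyActorCritic(nn.Module):
    def __init__(self, num_inputs=4, hidden=8, num_actions=2):
        super().__init__()
        self.lstm = nn.LSTMCell(num_inputs, hidden)
        self.critic = nn.Linear(hidden, 1)
        self.actor = nn.Linear(hidden, num_actions)
    def forward(self, inputs):
        x, (hx, cx) = inputs
        hx, cx = self.lstm(x, (hx, cx))
        return self.critic(hx), self.actor(hx), (hx, cx)
model = TinyActorCritic()
state = torch.randn(4)    # placeholder for the input image
hx = torch.zeros(1, 8)    # hidden state of the LSTM
cx = torch.zeros(1, 8)    # cell state of the LSTM
done = False              # assume the game is not over
R = torch.zeros(1, 1)
if not done:
    # Keep only the first output; the two underscores discard the others.
    value, _, _ = model((state.unsqueeze(0), (hx, cx)))
    # .data is what the lecture uses; value.detach() is the form
    # recommended in recent PyTorch versions.
    R = value.data
```
If the game is over, R simply stays at zero, since there is no future reward
left to estimate.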

01:50.573 --> 01:53.820
Now, the if condition is done, and now what we're gonna do,

01:53.820 --> 01:57.660
since we just got a new value by, you know

01:57.660 --> 01:59.040
getting the output of the model,

01:59.040 --> 02:00.810
the first output of the model.

02:00.810 --> 02:04.080
Well let's already append this new value

02:04.080 --> 02:05.490
to the values list.

02:05.490 --> 02:09.600
Therefore, we can take directly our values list then

02:09.600 --> 02:13.700
dot append, and we input variable R.

02:15.240 --> 02:18.180
Because R contains this last value.

02:18.180 --> 02:19.890
So great, that is done.
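
NOTE
A small sketch of the append, with made-up numbers standing in for the values
collected during exploration. The lecture wraps R in Variable here; in recent
PyTorch versions a plain tensor behaves the same way, so the wrapper is left
out of this sketch.
```python
import torch
# Made-up value predictions from earlier steps of the exploration.
values = [torch.tensor([[0.3]]), torch.tensor([[0.5]])]
# Made-up value of the last state reached.
R = torch.tensor([[0.7]])
values.append(R)
```
After this, values holds one more entry than the number of steps played.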

02:19.890 --> 02:23.040
Now we are going to initialize the losses

02:23.040 --> 02:25.230
and remember the intuition lectures.

02:25.230 --> 02:26.340
You have two losses.

02:26.340 --> 02:29.580
You have the loss of the policy that is the loss related

02:29.580 --> 02:31.830
to the predictions of the agent.

02:31.830 --> 02:33.930
And then you have the loss of the value, which is

02:33.930 --> 02:36.180
the loss related to the predictions of the critic.

02:36.180 --> 02:38.220
So we are going to introduce these two variables

02:38.220 --> 02:39.840
and initialize them to zero.

02:39.840 --> 02:42.690
And therefore I'm gonna take here policy, first

02:42.690 --> 02:46.320
variable policy loss, initialize it to zero

02:46.320 --> 02:49.470
and then value loss, the loss of the value

02:49.470 --> 02:51.900
and same initialize it to zero.

02:51.900 --> 02:55.170
Then let's not forget to set the cumulative reward

02:55.170 --> 02:56.790
as a torch variable.

02:56.790 --> 02:59.520
Because we will need it to be a torch variable, because

02:59.520 --> 03:01.950
we will be computing a gradient with respect to it,

03:01.950 --> 03:04.650
because the cumulative reward is going to be a term

03:04.650 --> 03:05.850
of the value loss.

03:05.850 --> 03:07.650
So with this variable, it is now attached

03:07.650 --> 03:10.500
to the dynamic graph with a gradient.
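
NOTE
A sketch of these three initializations, assuming the variable names
policy_loss and value_loss. Variable still exists in torch.autograd, but since
PyTorch 0.4 it is deprecated and simply returns a tensor. It is kept here only
to mirror the lecture.
```python
import torch
from torch.autograd import Variable
R = torch.zeros(1, 1)
policy_loss = 0   # loss related to the policy predictions
value_loss = 0    # loss related to the critic's value predictions
# Deprecated wrapper, kept to match the lecture; R stays a tensor.
R = Variable(R)
```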

03:10.500 --> 03:13.230
And now finally, the last thing we need to do

03:13.230 --> 03:15.510
before starting the big training loop, you know

03:15.510 --> 03:17.460
when we apply stochastic gradient descent,

03:17.460 --> 03:19.680
to reduce this loss between the predictions

03:19.680 --> 03:20.820
and the targets.

03:20.820 --> 03:23.520
Well we need to initialize the GAE.

03:23.520 --> 03:26.430
The generalized advantage estimation

03:26.430 --> 03:29.370
and not gated autoencoder, be careful with that.

03:29.370 --> 03:32.610
GAE, the variable that we're about to initialize right now

03:32.610 --> 03:35.490
is generalized advantage estimation.

03:35.490 --> 03:39.120
So as a reminder, generalized advantage estimation

03:39.120 --> 03:41.520
is by definition the advantage

03:41.520 --> 03:45.180
of playing the action A by observing the state S.

03:45.180 --> 03:48.240
So it's a function of the action A and the state S

03:48.240 --> 03:50.190
and it is equal to the difference

03:50.190 --> 03:54.780
between the Q-value, Q(A, S), and the value of the V function.

03:54.780 --> 03:57.510
So actually I can write it here.

03:57.510 --> 04:01.170
The generalized advantage estimation is a function A

04:01.170 --> 04:04.770
of the action and the state S, and that is equal

04:04.770 --> 04:09.390
to the Q values of the action A and the state S,

04:09.390 --> 04:13.500
minus the value of the V function applied to the state S.
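
NOTE
The definition just spoken, written out with the lecture's argument order. The
second and third lines are not from the lecture. They are the usual form of the
generalized advantage estimator, which is an estimate of this advantage built
from one-step TD errors; gamma is the discount factor and lambda is the
smoothing parameter.
```latex
A(a, s) = Q(a, s) - V(s)
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
\hat{A}_t = \sum_{l \ge 0} (\gamma \lambda)^l \, \delta_{t+l}
```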

04:13.500 --> 04:15.840
That's the generalized advantage estimation

04:15.840 --> 04:19.170
and that's what we want to initialize right now,

04:19.170 --> 04:21.480
and we will initialize it to zero.

04:21.480 --> 04:22.980
But it has to be a torch tensor,

04:22.980 --> 04:24.870
so we're gonna use the same trick

04:24.870 --> 04:27.690
as what we just did right here.

04:27.690 --> 04:30.300
We are gonna take the torch library

04:30.300 --> 04:33.630
and apply the zeros function to set it

04:33.630 --> 04:37.170
as a tensor of only one value, which is zero.

04:37.170 --> 04:40.666
And we are going to introduce this new variable GAE

04:40.666 --> 04:43.860
and that will be equal to torch

04:43.860 --> 04:46.560
dot zeros one, one, so initialized to zero.

04:46.560 --> 04:48.780
So this will be initialized to zero

04:48.780 --> 04:50.940
and therefore the Q values of the action A

04:50.940 --> 04:52.710
in the state S, will be equal

04:52.710 --> 04:55.830
to the value of the V function of the state S.
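
NOTE
A short sketch of the GAE initialization and of what a zero advantage implies.
The name gae follows the lecture; the value 0.7 is made up for illustration.
```python
import torch
gae = torch.zeros(1, 1)      # advantage estimate starts at zero
V = torch.tensor([[0.7]])    # made-up value of some state
Q = V + gae                  # rearranging A = Q - V
```
With gae at zero, Q and V hold the same number, which is the point made above.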

04:55.830 --> 04:58.830
All right, and now we are ready to start the for loop.

04:58.830 --> 05:00.480
So we're gonna have some adventure here,

05:00.480 --> 05:01.680
so take a good break

05:01.680 --> 05:04.770
and I'll see you in the next tutorial, to attack that.

05:04.770 --> 05:06.453
Until then, enjoy AI!