WEBVTT

00:00.300 --> 00:02.460
-: Hello and welcome to this tutorial.

00:02.460 --> 00:03.690
All right, so after we made

00:03.690 --> 00:06.450
these four convolutions and the LSTM,

00:06.450 --> 00:08.460
we now have an encoded state

00:08.460 --> 00:10.590
that is going to be the input

00:10.590 --> 00:12.840
of these two neural networks that we're gonna make

00:12.840 --> 00:14.760
for the actor and the critic.

00:14.760 --> 00:15.930
And speaking of them,

00:15.930 --> 00:17.640
the only thing that we have to do now

00:17.640 --> 00:20.730
is just create two linear full connections,

00:20.730 --> 00:23.190
one for the actor and one for the critic.

00:23.190 --> 00:24.390
But before we do that,

00:24.390 --> 00:27.270
we need to get the number of possible actions.

00:27.270 --> 00:29.910
And so I'm going to call a new variable here

00:29.910 --> 00:32.490
that is not gonna be a variable of the object.

00:32.490 --> 00:34.290
So I'm not gonna use self here

00:34.290 --> 00:36.240
but I'm going to create the variable

00:36.240 --> 00:37.950
num_outputs
(keyboard clicks)

00:37.950 --> 00:40.680
which will represent the number of possible actions.

00:40.680 --> 00:41.940
And to get it, well,

00:41.940 --> 00:44.490
we can get it from the action space.

00:44.490 --> 00:47.046
So we take our action space,

00:47.046 --> 00:47.879
(keyboard clicks)

00:47.879 --> 00:49.920
which will be the input of the __init__ function

00:49.920 --> 00:51.480
when we create the object

00:51.480 --> 00:54.840
and then we add .n to get this number

00:54.840 --> 00:56.550
of possible actions.
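
NOTE
A minimal sketch of reading the number of actions, assuming the action space is an object with an integer attribute n, as discrete action spaces in Gym-style environments have. The SimpleNamespace stand-in and the value 6 are made up for illustration.
```python
from types import SimpleNamespace
# Hypothetical stand-in for a discrete action space; a real environment supplies its own object.
action_space = SimpleNamespace(n=6)
# Local variable (no "self."): it is only needed while the layers are being built.
num_outputs = action_space.n
print(num_outputs)
```
Keeping num_outputs local rather than on self matches what the video describes: the value is only used to size the actor's output layer.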

00:56.550 --> 00:59.160
And now the actor and the critic

00:59.160 --> 01:01.230
will take separately the same input

01:01.230 --> 01:04.050
that is the output of this whole process here

01:04.050 --> 01:06.480
with the convolutions and the LSTM.

01:06.480 --> 01:10.140
So it will take the same input, which is an encoded state

01:10.140 --> 01:12.510
but then they will have two different

01:12.510 --> 01:13.860
linear full connections

01:13.860 --> 01:17.040
so that we get eventually actually two neural networks.

01:17.040 --> 01:18.990
One for the actor and one for the critic.

01:18.990 --> 01:21.060
So let's make these two separate neural networks,

01:21.060 --> 01:22.950
but since we already did the big job

01:22.950 --> 01:24.810
with the encoding here,

01:24.810 --> 01:26.280
well, what we simply need to do

01:26.280 --> 01:28.440
is just create two objects,

01:28.440 --> 01:31.110
one linear full connection for the actor,

01:31.110 --> 01:33.630
and one other linear full connection for the critic.

01:33.630 --> 01:35.460
And so that's exactly what I'm gonna do.

01:35.460 --> 01:38.220
I'm gonna create two objects now.

01:38.220 --> 01:42.240
A first object for the linear full connection of the critic

01:42.240 --> 01:43.845
which I'm gonna call,

01:43.845 --> 01:46.500
critic_linear.
(keyboard clicks)

01:46.500 --> 01:49.140
And to create this linear full connection,

01:49.140 --> 01:50.280
well, you know how to do it,

01:50.280 --> 01:53.040
we simply need to take the nn module

01:53.040 --> 01:54.600
and then the linear class

01:54.600 --> 01:57.180
to which we have to input, well, the input neurons,

01:57.180 --> 02:00.690
which are the outputs of all this encoding here

02:00.690 --> 02:02.280
with the convolutions and the LSTM.

02:02.280 --> 02:03.113
That is

02:03.113 --> 02:04.320
256

02:04.320 --> 02:05.153
neurons.

02:05.153 --> 02:06.030
So we input

02:06.030 --> 02:08.580
256 here,
(keyboard clicks)

02:08.580 --> 02:11.190
and then we are gonna have one output,

02:11.190 --> 02:13.920
because remember the output of the neural network

02:13.920 --> 02:16.530
for the critic is the value of the V function

02:16.530 --> 02:18.090
applied to the input state,

02:18.090 --> 02:21.150
to the input encoded state that we made here.

02:21.150 --> 02:24.360
So if we call the input state S

02:24.360 --> 02:26.070
that is the output of all this.

02:26.070 --> 02:28.680
Well, the output of the neural network

02:28.680 --> 02:30.720
of the critic will be V(S).

02:30.720 --> 02:33.840
And therefore it has one dimension, it's just a value.

02:33.840 --> 02:36.480
And so here we input one.
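
NOTE
A minimal sketch of the critic layer just described, assuming PyTorch is installed. The sizes 256 and 1 come from the video; the random tensor is only a placeholder for a real encoded state.
```python
import torch
import torch.nn as nn
# 256 inputs (the encoded state) and 1 output (the state value V(S)).
critic_linear = nn.Linear(256, 1)
# Hypothetical batch holding a single encoded state, filled with random numbers.
encoded_state = torch.randn(1, 256)
value = critic_linear(encoded_state)
print(value.shape)
```
The result has one row per state in the batch and a single column, which is the one-dimensional value mentioned in the video.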

02:36.480 --> 02:40.470
And remember, V(S) is what is shared among the actors

02:40.470 --> 02:42.870
so that they can get some common information

02:42.870 --> 02:45.210
that they can use to play their action

02:45.210 --> 02:46.950
in a more relevant way.

02:46.950 --> 02:50.550
Okay. So that's for the neural network of the critic.

02:50.550 --> 02:53.760
And now let's make the neural network of the actor

02:53.760 --> 02:54.930
and therefore I'm adding here

02:54.930 --> 02:59.700
self.actor_linear.
(keyboard clicks)

02:59.700 --> 03:02.820
-: And same, we already have the input encoded state.

03:02.820 --> 03:05.670
So now we simply need to add a linear full connection.

03:05.670 --> 03:08.160
And therefore same we take the nn module

03:08.160 --> 03:09.840
then the linear class.

03:09.840 --> 03:11.310
And now same.

03:11.310 --> 03:12.720
This neural network of the actor

03:12.720 --> 03:15.900
will take the encoded state that has a size

03:15.900 --> 03:17.550
of 256.

03:17.550 --> 03:20.070
So 256 here.
(keyboard clicks)

03:20.070 --> 03:22.830
But then the output is gonna be different

03:22.830 --> 03:24.330
because, of course, you know it,

03:24.330 --> 03:26.400
the output of the neural network

03:26.400 --> 03:28.650
for the actor are the Q values.

03:28.650 --> 03:30.840
The Q values of the input state

03:30.840 --> 03:33.810
the one that we encoded here, and the action played.

03:33.810 --> 03:37.950
So again, if we call this encoded state that we made here S

03:37.950 --> 03:39.660
and the action played A,

03:39.660 --> 03:42.810
the output of this neural network, actor_linear,

03:42.810 --> 03:44.850
will be Q(S, A).

03:44.850 --> 03:45.810
And since you know,

03:45.810 --> 03:48.330
we have one Q value for each action,

03:48.330 --> 03:51.030
then we have num_outputs Q values.

03:51.030 --> 03:53.280
And therefore the output here is gonna be

03:53.280 --> 03:56.160
num_outputs
(keyboard clicks)

03:56.160 --> 03:59.820
because num_outputs is actually the number of Q values.

03:59.820 --> 04:01.110
Okay, perfect.

04:01.110 --> 04:04.140
So if you want, I can write for you

04:04.140 --> 04:05.250
output here

04:05.250 --> 04:08.670
for the critic is V(S)
(keyboard clicks)

04:08.670 --> 04:10.800
where S is the encoded state.

04:10.800 --> 04:12.900
And for the actor,

04:12.900 --> 04:17.900
the output is Q(S, A).
(keyboard clicks)

04:18.090 --> 04:19.410
Alright? So that's very important

04:19.410 --> 04:21.840
to understand this distinction here

04:21.840 --> 04:23.490
and to understand that we therefore

04:23.490 --> 04:25.350
have two separate neural networks,

04:25.350 --> 04:27.400
one for the critic and one for the actor.
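
NOTE
A minimal sketch of the two output layers reading the same encoded state, assuming PyTorch is installed. The class name TwoHeads, the 4-action stand-in space and the forward method are placeholders for illustration; the tutorial writes its own forward function in a later video.
```python
import torch
import torch.nn as nn
from types import SimpleNamespace
class TwoHeads(nn.Module):  # hypothetical name; only the two output layers are shown
    def __init__(self, action_space):
        super().__init__()
        num_outputs = action_space.n
        self.critic_linear = nn.Linear(256, 1)           # output: V(S)
        self.actor_linear = nn.Linear(256, num_outputs)  # output: one Q(S, A) per action
    def forward(self, encoded_state):
        # Both layers receive the same encoded state but own separate weights.
        return self.critic_linear(encoded_state), self.actor_linear(encoded_state)
heads = TwoHeads(SimpleNamespace(n=4))  # made-up action space with 4 actions
value, q_values = heads(torch.randn(1, 256))
print(value.shape, q_values.shape)
```
The critic returns a single number per state while the actor returns one number per action, which is the distinction the video stresses.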

04:28.290 --> 04:29.370
Okay, perfect.

04:29.370 --> 04:32.160
So we are almost done with this __init__ function.

04:32.160 --> 04:34.020
Now the most important thing is done.

04:34.020 --> 04:35.850
The only remaining thing that we have to do

04:35.850 --> 04:38.040
is to initialize all the weights

04:38.040 --> 04:40.860
of those two neural networks and all the biases.

04:40.860 --> 04:42.810
And of course to do that we're gonna use

04:42.810 --> 04:44.970
the two functions that we made earlier.

04:44.970 --> 04:47.040
That is the normalized_columns_initializer

04:47.040 --> 04:48.450
and weights_init.

04:48.450 --> 04:49.830
So let's do that quickly.

04:49.830 --> 04:52.140
It's gonna be pretty easy and pretty fast.

04:52.140 --> 04:54.840
So first we're gonna initialize some random weights.

04:54.840 --> 04:55.673
And to do this,

04:55.673 --> 04:58.380
we're gonna apply the weights_init function to our object.

04:58.380 --> 05:01.637
So, here we have to take self to get our object

05:01.637 --> 05:03.970
and to our object, we apply

05:04.830 --> 05:06.450
the weights_init function.

05:06.450 --> 05:10.080
So therefore inside we just need to input

05:10.080 --> 05:12.450
the weights_init function.
(keyboard clicks)

05:12.450 --> 05:13.320
And there we go.

05:13.320 --> 05:15.840
That will apply this function to our object.
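
NOTE
A minimal sketch of how applying an initialization function to the whole object works, assuming PyTorch is installed. The function simple_init below is a simplified, hypothetical stand-in; the tutorial's real weights_init was written in an earlier video and is not reproduced here.
```python
import torch
import torch.nn as nn
def simple_init(m):
    # Hypothetical stand-in for weights_init: it only handles Linear layers.
    if isinstance(m, nn.Linear):
        nn.init.uniform_(m.weight, -0.1, 0.1)
        m.bias.data.fill_(0)
class Tiny(nn.Module):  # hypothetical module with just the two output layers
    def __init__(self):
        super().__init__()
        self.critic_linear = nn.Linear(256, 1)
        self.actor_linear = nn.Linear(256, 4)
        # apply() calls the function on every submodule, then on the module itself.
        self.apply(simple_init)
model = Tiny()
print(model.actor_linear.bias)
```
Because apply() walks through all submodules, a single call in __init__ is enough to reach every layer of the model.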

05:15.840 --> 05:16.830
And by doing this,

05:16.830 --> 05:19.110
we are just initializing some random weights

05:19.110 --> 05:22.020
to get a future optimal learning of these weights.

05:22.020 --> 05:24.300
And now, what we have to do is make

05:24.300 --> 05:27.660
a special normalization for the actor and the critic.

05:27.660 --> 05:30.810
But remember, I think I mentioned this in the previous tutorials,

05:30.810 --> 05:33.210
we are not gonna set the same variance

05:33.210 --> 05:35.340
for the actor and the critic.

05:35.340 --> 05:37.800
The actor will get a small standard deviation,

05:37.800 --> 05:40.830
a small variance, and the critic will get a big one.

05:40.830 --> 05:41.790
And why do we do this?

05:41.790 --> 05:43.800
What's the purpose of giving

05:43.800 --> 05:46.680
a small standard deviation of the weights for the actor,

05:46.680 --> 05:49.470
and a large standard deviation of the weight for the critic?

05:49.470 --> 05:50.760
Well, that allows us to manage

05:50.760 --> 05:53.820
the trade-off between exploration and exploitation.

05:53.820 --> 05:55.890
That's exactly the purpose of doing this.

05:55.890 --> 05:57.750
By giving a small variance to the actor

05:57.750 --> 05:59.520
and a large variance to the critic,

05:59.520 --> 06:01.770
we will have a good management

06:01.770 --> 06:04.560
of exploration vs exploitation.
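
NOTE
One way to picture the exploration argument, as a sketch in plain Python. It assumes the actor's outputs are later turned into action probabilities with a softmax, which this video does not show, and the score values are made up.
```python
import math
def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
small = softmax([0.01, -0.02, 0.015])  # scores of the size that small weights tend to give
large = softmax([1.0, -2.0, 1.5])      # same pattern, 100 times larger
print([round(p, 3) for p in small])
print([round(p, 3) for p in large])
```
With small scores the probabilities are almost equal, so every action is tried at the start; with large scores one action already dominates before anything has been learned.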

06:04.560 --> 06:05.820
So let's do this.

06:05.820 --> 06:07.500
Let's first take care of the actor.

06:07.500 --> 06:10.170
So we take self, our object

06:10.170 --> 06:12.300
then we're gonna take the neural network of our actor,

06:12.300 --> 06:15.090
which is actor_linear.
(keyboard clicks)

06:15.090 --> 06:17.190
Then we are gonna access the weights

06:17.190 --> 06:19.380
of this neural network of the actor.

06:19.380 --> 06:20.790
And remember to access the data

06:20.790 --> 06:23.370
of the weights we need to add .data.

06:23.370 --> 06:24.270
Alright.

06:24.270 --> 06:25.860
So with this we get the weights,

06:25.860 --> 06:27.120
and now we're gonna use

06:27.120 --> 06:29.320
our function normalized_columns_initializer.

06:31.380 --> 06:34.700
So I'm copying this, pasting that here.

06:34.700 --> 06:36.480
And we are going to enter as argument

06:36.480 --> 06:39.150
the standard deviation we want these weights to have.

06:39.150 --> 06:42.120
But first, remember, this function takes two arguments.

06:42.120 --> 06:45.270
First, are the weights we want to initialize.

06:45.270 --> 06:47.970
So we simply need to take that again

06:47.970 --> 06:49.830
and paste that here.

06:49.830 --> 06:53.130
And the second argument is this standard deviation,

06:53.130 --> 06:54.570
we want these weights to have.

06:54.570 --> 06:55.890
So as we said,

06:55.890 --> 06:58.800
we want a small standard deviation for the actor,

06:58.800 --> 07:02.130
and a small one is going to be 0.01.
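
NOTE
A minimal sketch of the actor weight initialization, assuming PyTorch is installed. The body of normalized_columns_initializer was written in an earlier video and is not shown here, so the version below is an assumption: it draws Gaussian noise and rescales each row (one row per output neuron) so that its norm equals std. The 4 actions are a made-up number.
```python
import torch
import torch.nn as nn
def normalized_columns_initializer(weights, std=1.0):
    # Assumed implementation: random noise, rescaled so each row has L2 norm std.
    out = torch.randn(weights.size())
    out *= std / torch.sqrt(out.pow(2).sum(1, keepdim=True))
    return out
actor_linear = nn.Linear(256, 4)  # 4 actions is a made-up number
# First argument: the weights to initialize. Second argument: the target scale.
actor_linear.weight.data = normalized_columns_initializer(actor_linear.weight.data, 0.01)
print(actor_linear.weight.data.norm(dim=1))
```
Under this assumed implementation the second argument controls the overall size of each output neuron's weights, so 0.01 gives the actor very small initial weights.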

07:02.130 --> 07:02.963
Perfect.

07:02.963 --> 07:04.020
So that's for the weights

07:04.020 --> 07:05.850
of the neural network of the actor.

07:05.850 --> 07:07.860
Now, let's take care of the bias

07:07.860 --> 07:09.810
of the neural network of the actor.

07:09.810 --> 07:12.060
And therefore here we're gonna do almost the same thing.

07:12.060 --> 07:14.400
We're going to copy this,

07:14.400 --> 07:15.480
paste that below,

07:15.480 --> 07:17.250
replace weight

07:17.250 --> 07:20.340
by bias to access the bias.

07:20.340 --> 07:25.340
And after data, we're simply going to add .fill_.

07:25.350 --> 07:28.050
And remember, inside we input zero,

07:28.050 --> 07:31.470
because we want all the biases to be initialized with zero.

07:31.470 --> 07:34.560
So, actually I don't think this line is necessary

07:34.560 --> 07:36.270
because as you remember,

07:36.270 --> 07:38.640
the biases are already initialized to zero

07:38.640 --> 07:41.580
with this fill_ function in the weights_init function.

07:41.580 --> 07:45.120
So, you know, we're doing this just to make sure

07:45.120 --> 07:47.580
that the biases are actually initialized to zero

07:47.580 --> 07:49.680
but I think it's already done here.

07:49.680 --> 07:53.040
But anyway, now we are 100% sure.
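
NOTE
A minimal sketch of zeroing the actor's bias, assuming PyTorch is installed. The layer is created standalone here, with a made-up 4 actions, instead of as an attribute of the model.
```python
import torch
import torch.nn as nn
actor_linear = nn.Linear(256, 4)  # 4 actions is a made-up number
# fill_ (with a trailing underscore) overwrites the tensor in place.
actor_linear.bias.data.fill_(0)
print(actor_linear.bias.data)
```
Running the line a second time after an initializer that already zeroes the bias changes nothing, which is why it is harmless even if redundant.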

07:53.040 --> 07:56.010
All right, and now we're gonna do the same for the critic.

07:56.010 --> 07:59.469
So, let's be efficient and let's copy these two lines.

07:59.469 --> 08:00.302
(keyboard clicks)

08:00.302 --> 08:01.800
Let's paste them here.

08:01.800 --> 08:04.140
And here we are just going to replace

08:04.140 --> 08:07.140
actor by critic.
(keyboard clicks)

08:07.140 --> 08:08.520
Same here.

08:08.520 --> 08:10.590
And now the only thing that we have to change

08:10.590 --> 08:13.530
is just the standard deviation we want the weights

08:13.530 --> 08:15.960
of the neural network for the critic to have.

08:15.960 --> 08:17.190
And, as you remember,

08:17.190 --> 08:20.040
we want this time a large standard deviation

08:20.040 --> 08:21.750
and instead of 0.01,

08:21.750 --> 08:23.790
we will input one.

08:23.790 --> 08:24.623
So there we go.

08:24.623 --> 08:26.430
We have a small standard deviation

08:26.430 --> 08:29.190
for the weights of the neural network of the actor

08:29.190 --> 08:31.590
and a large standard deviation for the weights

08:31.590 --> 08:33.360
of the neural network of the critic.

08:33.360 --> 08:34.530
And, of course, let's not forget

08:34.530 --> 08:37.950
to replace actor here by critic.

08:37.950 --> 08:39.780
Alright. And now we're good.

08:39.780 --> 08:42.480
Cool. So now we have two remaining things to do.

08:42.480 --> 08:45.900
First, is to initialize also the bias of the LSTM.

08:45.900 --> 08:48.510
And to do this, we take our object self

08:48.510 --> 08:50.553
because the LSTM belongs to our object.

08:50.553 --> 08:53.880
Then we take our LSTM then dot.

08:53.880 --> 08:56.190
And then we're gonna get the two types

08:56.190 --> 08:58.140
of bias that are in the LSTM.

08:58.140 --> 09:01.830
That's bias_ih.
(keyboard clicks)

09:01.830 --> 09:04.590
And the other one is bias_hh.

09:04.590 --> 09:07.830
So that's the two types of bias in the LSTM.

09:07.830 --> 09:09.870
And same, they're gonna be initialized to zero.

09:09.870 --> 09:12.840
So first we access the data,

09:12.840 --> 09:13.673
and then we use

09:13.673 --> 09:16.890
the fill_ function
(keyboard clicks)

09:16.890 --> 09:19.770
to fill all these biases with zeros,

09:19.770 --> 09:21.930
initialize them with zeros.

09:21.930 --> 09:22.763
Alright.

09:22.763 --> 09:26.490
And now for the second group of bias,

09:26.490 --> 09:27.810
we add here

09:27.810 --> 09:28.643
the same

09:28.643 --> 09:31.500
but we replace ih by hh.
(keyboard clicks)

09:31.500 --> 09:32.490
Alright.

09:32.490 --> 09:36.390
So that initializes the bias of the LSTM with zeros.
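
NOTE
A minimal sketch of zeroing the two LSTM bias vectors, assuming PyTorch is installed and that the LSTM is an nn.LSTMCell, whose attributes are named bias_ih and bias_hh (nn.LSTM uses suffixed names such as bias_ih_l0 instead). The input size 288 is made up for illustration; 256 is the hidden size from the tutorial.
```python
import torch
import torch.nn as nn
# Input size 288 is an assumption for illustration; hidden size 256 comes from the tutorial.
lstm = nn.LSTMCell(288, 256)
# bias_ih: input-to-hidden biases. bias_hh: hidden-to-hidden biases.
lstm.bias_ih.data.fill_(0)
lstm.bias_hh.data.fill_(0)
print(lstm.bias_ih.shape, lstm.bias_hh.shape)
```
Each bias vector has four times the hidden size because the cell stores the biases of its four gates side by side.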

09:36.390 --> 09:38.520
And now, the last thing we need to do

09:38.520 --> 09:41.850
is to use a method that is inherited from nn.Module

09:41.850 --> 09:43.170
that is the train method.

09:43.170 --> 09:46.320
And basically that is just a method that puts the module

09:46.320 --> 09:47.490
in train mode.

09:47.490 --> 09:49.344
So what's the use of it?

09:49.344 --> 09:52.200
Well, the use is that it allows us to activate,

09:52.200 --> 09:54.000
if there is any, the dropouts

09:54.000 --> 09:55.530
and the batch normalizations.

09:55.530 --> 09:57.360
And so to use it, we just add

09:57.360 --> 10:00.030
self.train()
(keyboard clicks)

10:00.030 --> 10:02.520
and that puts the module in train mode.
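
NOTE
A minimal sketch of what train mode changes, assuming PyTorch is installed. The small model with a dropout layer is hypothetical and only there to make the effect visible; the tutorial's model may contain no dropout at all.
```python
import torch
import torch.nn as nn
# Hypothetical module with a dropout layer, used only to show the two modes.
model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
model.train()                   # training mode: dropout is active
was_training = model.training
model.eval()                    # evaluation mode: dropout does nothing
x = torch.ones(1, 8)
same = torch.equal(model(x), model(x))
print(was_training, model.training, same)
```
In evaluation mode two passes over the same input give identical outputs, while in train mode dropout would randomly zero some of them.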

10:02.520 --> 10:03.353
Perfect.

10:03.353 --> 10:05.700
So we are done with the __init__ function.

10:05.700 --> 10:08.310
We have our convolutions, we have the LSTM,

10:08.310 --> 10:10.410
we have our two separate neural networks

10:10.410 --> 10:12.150
for the critic and the actor.

10:12.150 --> 10:15.600
And all the weights and biases are well initialized.

10:15.600 --> 10:16.890
So that's all good.

10:16.890 --> 10:18.900
We are ready to move on to the next step

10:18.900 --> 10:20.820
which is to make the forward function

10:20.820 --> 10:23.490
that will, of course, forward propagate the signal

10:23.490 --> 10:26.160
from the very beginning with the original input images

10:26.160 --> 10:28.980
throughout all the brain until we get the output.

10:28.980 --> 10:30.810
So let's do that in the next tutorial.

10:30.810 --> 10:32.583
And until then, enjoy AI.