WEBVTT

00:00.510 --> 00:02.730
Hello and welcome to this tutorial.

00:02.730 --> 00:04.650
Now we have the brain of the model.

00:04.650 --> 00:06.330
We also have the optimizer.

00:06.330 --> 00:09.990
So basically we're ready to train our different agents.

00:09.990 --> 00:11.700
That is our different brains.

00:11.700 --> 00:13.426
So starting from now,

00:13.426 --> 00:15.690
that's where we'll make this big train function

00:15.690 --> 00:18.480
which will contain the whole A3C algorithm,

00:18.480 --> 00:21.720
and therefore what we're about to implement in this

00:21.720 --> 00:25.230
train.py is just this huge train function.

00:25.230 --> 00:27.660
There will just be this big train function.

00:27.660 --> 00:28.530
Nothing else.

00:28.530 --> 00:31.230
No class, but we will use this train function

00:31.230 --> 00:34.080
in the last step of this module, with the main code.

00:34.080 --> 00:35.040
So there we go.

00:35.040 --> 00:36.900
But before we start, you can notice

00:36.900 --> 00:39.120
that first we import some libraries.

00:39.120 --> 00:41.970
So those are the classic libraries, with the torch module,

00:41.970 --> 00:43.500
I mean the torch library.

00:43.500 --> 00:46.948
Then the envs library to create the Atari environment,

00:46.948 --> 00:48.849
which will be Breakout.

00:48.849 --> 00:52.950
Then we will of course import our actor-critic class

00:52.950 --> 00:55.050
from our model file.

00:55.050 --> 00:56.370
This one.

00:56.370 --> 01:00.691
And finally we will use Variable from torch.autograd

01:00.691 --> 01:04.080
to run highly performing computations of the gradient

01:04.080 --> 01:05.670
thanks to the dynamic graphs.

01:05.670 --> 01:08.839
And then we have this ensure_shared_grads function,

01:08.839 --> 01:12.060
which I didn't want to spend too much time on,

01:12.060 --> 01:15.150
because well first this is just a function that will

01:15.150 --> 01:16.980
make sure everything works correctly

01:16.980 --> 01:18.780
if the model used by the agent

01:18.780 --> 01:20.400
doesn't have any shared gradient.

01:20.400 --> 01:23.190
So that's why it's called ensure_shared_grads.

01:23.190 --> 01:24.450
And the other reason is that

01:24.450 --> 01:26.640
I don't think this function is necessary

01:26.640 --> 01:29.460
but we never know, and at least with this

01:29.460 --> 01:33.030
we'll be 100% sure that the code will execute properly.
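
NOTE
The function itself is not shown in this part of the video. What follows is
only a minimal sketch of what such a helper commonly looks like in PyTorch
A3C code, assuming it copies each local gradient to the shared model when the
shared model has none yet. Param and TinyModel are hypothetical stand-ins for
real torch parameters and models, used so the sketch has no dependencies.
```python
class Param:
    # Hypothetical stand-in for a torch parameter: it only holds a gradient.
    def __init__(self, grad=None):
        self.grad = grad
class TinyModel:
    # Hypothetical stand-in for a model exposing parameters().
    def __init__(self, grads):
        self._params = [Param(g) for g in grads]
    def parameters(self):
        return self._params
def ensure_shared_grads(model, shared_model):
    # Pair each local parameter with its shared counterpart.
    for param, shared_param in zip(model.parameters(), shared_model.parameters()):
        if shared_param.grad is not None:
            return  # the shared model already has gradients, nothing to do
        shared_param.grad = param.grad  # hand over the local gradient
local = TinyModel([0.5, -1.0])
shared = TinyModel([None, None])
ensure_shared_grads(local, shared)
shared_grads = [p.grad for p in shared.parameters()]
```
In this sketch a second call with another local model leaves the shared
gradients untouched, because the function returns as soon as it finds one.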

01:33.030 --> 01:34.830
But that's not really important.

01:34.830 --> 01:38.090
What we must focus on is this train function

01:38.090 --> 01:40.980
that we'll start making right now.

01:40.980 --> 01:41.813
So here we go.

01:41.813 --> 01:42.754
def

01:42.754 --> 01:44.348
(Instructor typing)

01:44.348 --> 01:45.391
and train.

01:45.391 --> 01:46.230
We'll simply call it train.

01:46.230 --> 01:48.990
And this train function will take several arguments.

01:48.990 --> 01:50.370
The first one is rank.

01:50.370 --> 01:52.920
I'm gonna explain what it is in a second.

01:52.920 --> 01:54.930
The second one is params,

01:54.930 --> 01:57.870
so that's all the parameters of the environment.

01:57.870 --> 02:02.870
Then the third parameter is going to be shared_model.

02:03.150 --> 02:05.458
So you know, the shared model is what the agent will get to

02:05.458 --> 02:09.930
run its little exploration on a certain number of steps.

02:09.930 --> 02:13.020
And then finally, the last argument is going to

02:13.020 --> 02:17.790
be the optimizer that is the one we made earlier.

02:17.790 --> 02:20.070
So perfect, that's our four arguments.

02:20.070 --> 02:23.250
And now we are ready to start implementing this train

02:23.250 --> 02:27.000
function. So the first thing we'll do is, you know

02:27.000 --> 02:28.950
you remember what A3C stands for.

02:28.950 --> 02:32.147
It stands for asynchronous advantage actor-critic.

02:32.147 --> 02:34.950
So in A3C there is asynchronous.

02:34.950 --> 02:37.560
So as you understood, we have to desynchronize

02:37.560 --> 02:40.650
each training agent, and to desynchronize them

02:40.650 --> 02:44.931
we're gonna use the rank to shift each seed with this rank.

02:44.931 --> 02:47.642
So this rank parameter here, is just to shift the seed

02:47.642 --> 02:52.560
so that each training agent is desynchronized.

02:52.560 --> 02:55.920
So for example, if there are N training agents,

02:55.920 --> 02:58.860
then the ranks will go from one to N,

02:58.860 --> 03:02.309
and there will be one integer per agent from one to N.

03:02.309 --> 03:05.690
So when we shift the seed by one for each thread,

03:05.690 --> 03:07.770
all the pseudo-random numbers created

03:07.770 --> 03:09.930
by the thread will be totally independent

03:09.930 --> 03:11.460
from the other threads.

03:11.460 --> 03:13.959
However, the seeds are fixed numbers

03:13.959 --> 03:16.479
so when we reproduce the experiment

03:16.479 --> 03:19.830
we will find exactly the same events.

03:19.830 --> 03:22.200
And that's because it's deterministic

03:22.200 --> 03:23.670
with respect to the seed.

03:23.670 --> 03:25.530
So it's important to understand that

03:25.530 --> 03:28.470
and that's why the first thing we need to do is to

03:28.470 --> 03:30.780
desynchronize each training agent

03:30.780 --> 03:34.320
by using the rank here to shift the seed with the rank.

03:34.320 --> 03:35.153
So let's do this.

03:35.153 --> 03:39.150
To do that, we're going to take our torch library.

03:39.150 --> 03:40.770
Then we're gonna get the seed

03:40.770 --> 03:42.050
with manual

03:42.050 --> 03:42.900
(Instructor typing)

03:42.900 --> 03:45.360
underscore seed parenthesis.

03:45.360 --> 03:46.680
This is a function.

03:46.680 --> 03:49.890
And now we're gonna take the seeds of all the agents

03:49.890 --> 03:52.141
which we can access with params.seed,

03:52.141 --> 03:52.974
(Instructor typing)

03:52.974 --> 03:55.680
and to shift them by the rank to desynchronize

03:55.680 --> 03:56.820
each of these agents,

03:56.820 --> 04:00.510
we will just add here plus rank.

04:00.510 --> 04:04.500
And that will shift the seed with the rank

04:04.500 --> 04:06.660
to desynchronize each training agent

04:06.660 --> 04:09.960
because there is one seed for each training agent.
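
NOTE
A minimal sketch of the seed shift described above. The line typed in the
video is torch.manual_seed(params.seed + rank); here Python's random module
stands in for torch so the sketch runs without dependencies. The base seed
value and the helper name agent_stream are assumptions made for illustration.
```python
import random
from types import SimpleNamespace
params = SimpleNamespace(seed=1)  # assumed base seed
def agent_stream(rank, n=3):
    # Stand-in for torch.manual_seed(params.seed + rank):
    # each agent gets its own generator seeded with the shifted seed.
    rng = random.Random(params.seed + rank)
    return [rng.random() for _ in range(n)]
stream_1 = agent_stream(rank=1)
stream_2 = agent_stream(rank=2)
stream_1_again = agent_stream(rank=1)
```
Two different ranks give two different random streams, while rerunning the
same rank reproduces the same stream, which is the determinism with respect
to the seed mentioned earlier.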

04:09.960 --> 04:11.610
All right, first thing done.

04:11.610 --> 04:13.260
And now next step.

04:13.260 --> 04:15.510
The next step is to get the environment.

04:15.510 --> 04:17.400
So we're gonna create a new variable that we're

04:17.400 --> 04:18.667
gonna call env.

04:18.667 --> 04:19.500
(Instructor typing)

04:19.500 --> 04:22.230
And now we will use the create_atari_env function

04:22.230 --> 04:26.130
from the envs module to create the environment for Breakout.

04:26.130 --> 04:28.230
That is to get the environment of Breakout.

04:28.230 --> 04:29.252
So we take this function,

04:29.252 --> 04:30.849
(Instructor typing)

04:30.849 --> 04:35.849
create_atari_env, and now we have to input just one argument,

04:36.630 --> 04:38.811
which are the parameters of the environment.

04:38.811 --> 04:41.040
And we have them because this is one

04:41.040 --> 04:42.720
of the inputs of the train function.

04:42.720 --> 04:44.280
This is the params here

04:44.280 --> 04:47.550
which will be the parameters of the environment of Breakout.

04:47.550 --> 04:50.669
And therefore to get the Breakout environment

04:50.669 --> 04:54.690
we take this params argument, then dot

04:54.690 --> 04:59.460
and then we get env_name, which in the future, that is,

04:59.460 --> 05:02.040
in the next code with the main function that will execute

05:02.040 --> 05:05.634
the whole code, will be Breakout, Breakout-v0.

05:05.634 --> 05:06.467
All right?

05:06.467 --> 05:08.310
So that gets us the environment.

05:08.310 --> 05:12.550
Perfect. And now the next step is to align the seed

05:12.550 --> 05:15.490
of the environment with the seed of the agent.

05:15.490 --> 05:17.610
And why do we do that?

05:17.610 --> 05:19.830
It's because remember, each agent

05:19.830 --> 05:22.041
of the A3C model has its own vision

05:22.041 --> 05:25.277
of the environment, like its own copy of the environment.

05:25.277 --> 05:28.200
And therefore we need to align each

05:28.200 --> 05:32.100
of the agents on one specific version of the environment.

05:32.100 --> 05:33.840
And to do that, we're gonna use the seed

05:33.840 --> 05:37.090
because each seed determines a specific environment.

05:37.090 --> 05:40.410
So by associating a different seed to each agent,

05:40.410 --> 05:42.150
well we'll get exactly what we want.

05:42.150 --> 05:46.177
That is that each agent will have its own environment.

05:46.177 --> 05:47.910
And so how can we do that?

05:47.910 --> 05:50.850
We can take our environment, then add

05:50.850 --> 05:53.852
dot then use the seed function to,

05:53.852 --> 05:57.510
you know choose the seed we want to get for the environment.

05:57.510 --> 05:59.700
And so now to align the seed of the environment

05:59.700 --> 06:01.560
to the seed of the agent,

06:01.560 --> 06:05.010
well, we simply need to get this, because this corresponds

06:05.010 --> 06:09.360
to the seed of the agent that was shifted thanks to rank

06:09.360 --> 06:12.150
to get desynchronized training agents,

06:12.150 --> 06:13.624
because they're all on a different seed.

06:13.624 --> 06:15.159
(Instructor typing)

06:15.159 --> 06:16.560
So we just need to paste that here.

06:16.560 --> 06:18.259
And this will align the seed

06:18.259 --> 06:20.742
of the environment with the seed of the agent.
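
NOTE
A minimal sketch of aligning the environment seed with the agent seed. The
call described in the video is env.seed(params.seed + rank). FakeEnv is a
hypothetical stand-in for the Breakout environment, and the base seed and the
helper name first_frame are assumptions made for illustration.
```python
import random
from types import SimpleNamespace
class FakeEnv:
    # Hypothetical stand-in for the Breakout environment.
    def __init__(self):
        self._rng = random.Random()
    def seed(self, value):
        # Same role as env.seed(...) in the tutorial: fix the randomness.
        self._rng = random.Random(value)
    def reset(self):
        # A few random numbers standing in for the first frame.
        return [self._rng.randint(0, 255) for _ in range(4)]
params = SimpleNamespace(seed=1)  # assumed base seed
def first_frame(rank):
    env = FakeEnv()
    env.seed(params.seed + rank)  # same shifted seed as the agent
    return env.reset()
frame_a = first_frame(rank=1)
frame_b = first_frame(rank=2)
```
Each rank gets its own reproducible copy of the environment: calling
first_frame(rank=1) again returns the same values as frame_a.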

06:20.742 --> 06:21.942
(Instructor typing)

06:21.942 --> 06:23.051
Okay?

06:23.051 --> 06:26.720
Now we are gonna get our model, that is, our A3C brain.

06:26.720 --> 06:28.859
And so it's now that we're gonna use

06:28.859 --> 06:31.971
the actor-critic class from our model file.

06:31.971 --> 06:35.468
So we're basically going to create an object

06:35.468 --> 06:37.680
of this actor-critic class.

06:37.680 --> 06:40.657
And we're gonna call this object model or brain if you like.

06:40.657 --> 06:44.147
But basically this object will contain all the convolutions,

06:44.147 --> 06:46.721
the LSTM, the linear full connections

06:46.721 --> 06:49.188
and the forward function to propagate the signal.

06:49.188 --> 06:52.800
So it will basically contain the brains of the actor

06:52.800 --> 06:55.786
and the critic with the ability to propagate the signal

06:55.786 --> 06:58.749
throughout the brain to get the final output.

06:58.749 --> 07:00.120
So let's do this.

07:00.120 --> 07:02.346
Let's create our model.

07:02.346 --> 07:06.330
So as we said, we want to call this object model.

07:06.330 --> 07:10.559
And so we create an object of the actor-critic class

07:10.559 --> 07:11.637
and therefore we take our class

07:11.637 --> 07:13.212
(Instructor typing)

07:13.212 --> 07:16.260
ActorCritic, and now remember what arguments

07:16.260 --> 07:17.340
we need to input.

07:17.340 --> 07:20.307
That's actually the arguments of the init function.

07:20.307 --> 07:22.620
So self, we don't have to input it, you know

07:22.620 --> 07:25.050
that's what we have to do to use the object

07:25.050 --> 07:26.490
in the init method.

07:26.490 --> 07:30.012
But then the arguments we have to input are num_inputs,

07:30.012 --> 07:32.910
which is the input shape, that is the dimensions

07:32.910 --> 07:36.930
of our input images, and the action_space that contains,

07:36.930 --> 07:38.670
you know, the set of actions.

07:38.670 --> 07:42.130
So let's input these arguments in the train function.

07:42.130 --> 07:45.900
So the first one, we can get it with our environment

07:45.900 --> 07:49.316
env, dot, and then we use observation_space.

07:49.316 --> 07:50.149
(Instructor typing)

07:50.149 --> 07:52.561
So that's the space of observations.

07:52.561 --> 07:54.150
Then dot.

07:54.150 --> 07:55.560
And then to get the number of inputs,

07:55.560 --> 07:59.160
we get shape bracket zero.

07:59.160 --> 08:01.290
All right, so that's for num_inputs.

08:01.290 --> 08:04.147
And now for action_space.

08:04.147 --> 08:07.050
Well, that's almost the same: we need to get it

08:07.050 --> 08:09.810
from our environment that we already created.

08:09.810 --> 08:11.880
Then dot and then action_space.

08:11.880 --> 08:12.870
(Instructor typing)

08:12.870 --> 08:15.410
All right, and that gives us the arguments

08:15.410 --> 08:17.880
we need to input when creating an object,

08:17.880 --> 08:20.370
the model of the actor-critic class.
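
NOTE
A minimal sketch of the call built above, namely
ActorCritic(env.observation_space.shape[0], env.action_space). The class
below is a hypothetical stub, not the real model, and the stand-in
environment, its shape and its number of actions are assumptions made for
illustration.
```python
from types import SimpleNamespace
class ActorCritic:
    # Hypothetical stub: the real class builds convolutions, an LSTM
    # and linear layers from these two arguments.
    def __init__(self, num_inputs, action_space):
        self.num_inputs = num_inputs
        self.num_actions = action_space.n
# Stand-in environment exposing the two attributes used in the tutorial.
env = SimpleNamespace(
    observation_space=SimpleNamespace(shape=(1, 42, 42)),
    action_space=SimpleNamespace(n=4),  # assumed number of actions
)
model = ActorCritic(env.observation_space.shape[0], env.action_space)
```
With a 1 by 42 by 42 observation, shape[0] is 1, so num_inputs is the number
of image channels.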

08:20.370 --> 08:22.023
Okay, so now we have our model.

08:23.171 --> 08:25.140
And now the next step is to prepare our input states.

08:25.140 --> 08:27.570
So remember we're still doing deep reinforcement learning.

08:27.570 --> 08:30.480
So the input states are the input images

08:30.480 --> 08:34.140
and therefore this will be originally a NumPy array

08:34.140 --> 08:35.520
which will contain one channel

08:35.520 --> 08:37.020
because we will work with black

08:37.020 --> 08:40.980
and white images and it'll have dimensions of 42 by 42.

08:40.980 --> 08:43.440
But it's important to understand and to keep in mind here

08:43.440 --> 08:46.650
that the input states are the input images.

08:46.650 --> 08:49.241
So first what we have to do is to get the NumPy array,

08:49.241 --> 08:51.508
then we will convert it into a torch tensor.

08:51.508 --> 08:54.529
But the first step, as we did previously,

08:54.529 --> 08:56.790
is to get a NumPy array.

08:56.790 --> 08:57.810
And to get it,

08:57.810 --> 08:58.950
it's actually quite simple.

08:58.950 --> 09:00.870
Well, first we need to create a variable

09:00.870 --> 09:03.440
for the input state, which we will call state.

09:03.440 --> 09:06.418
And then, to get the NumPy array,

09:06.418 --> 09:08.920
we simply need to take our environment

09:08.920 --> 09:12.378
and then add dot and then use the reset function.

09:12.378 --> 09:15.041
And this will initialize state

09:15.041 --> 09:20.041
as a NumPy array of dimensions one by 42 by 42.

09:20.190 --> 09:23.186
One means one channel, so black and white image.

09:23.186 --> 09:27.180
And 42 by 42 is of course the dimensions of the image.

09:27.180 --> 09:28.547
The number of pixels on the width

09:28.547 --> 09:30.840
and the number of pixels on the height.

09:30.840 --> 09:32.640
So basically that's just the dimensions

09:32.640 --> 09:34.770
and that's the ones we'll work with.
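
NOTE
A minimal sketch of the 1 by 42 by 42 layout described above. In the
tutorial the state is a NumPy array returned by env.reset(); nested lists
are used here only to keep the sketch free of dependencies, and the helper
name shape_of is an assumption made for illustration.
```python
def shape_of(nested):
    # Walk down the first element of each level to read off the dimensions.
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)
# Stand-in for the state: 1 channel, 42 rows, 42 columns.
state = [[[0.0 for _ in range(42)] for _ in range(42)] for _ in range(1)]
channels, height, width = shape_of(state)
```
The first dimension is the channel count, which is the value passed to the
model as num_inputs.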

09:34.770 --> 09:38.100
And now that we actually have this NumPy array,

09:38.100 --> 09:41.001
because this will get us these images of such dimensions

09:41.001 --> 09:44.960
as NumPy arrays, now we can convert them into torch tensors.

09:44.960 --> 09:48.210
And to do this, well, we're going to update state again,

09:48.210 --> 09:50.652
because we don't need to keep the NumPy arrays,

09:50.652 --> 09:54.710
and that's where we use torch, the torch module.

09:54.710 --> 09:57.810
And remember we already did that for Doom.

09:57.810 --> 10:02.509
We used the from_numpy function, with parentheses.

10:02.509 --> 10:05.850
And inside this function, we need to input the NumPy array

10:05.850 --> 10:08.580
that we want to convert into a torch tensor.

10:08.580 --> 10:10.110
And that is the state.

10:10.110 --> 10:13.830
The previous version of the state, a NumPy array, will become,

10:13.830 --> 10:16.940
by applying the from_numpy function, a torch tensor.

10:16.940 --> 10:20.098
So that just creates a tensor from the state.

10:20.098 --> 10:24.870
And now we just need to initialize the done variable.

10:24.870 --> 10:27.360
Remember the done variable is generally the variable

10:27.360 --> 10:30.161
that says if an episode is over or if the game is over.

10:30.161 --> 10:33.457
Well, here, we just want to introduce this done variable

10:33.457 --> 10:36.660
and initialize it to true to specify

10:36.660 --> 10:38.370
that this done variable will be equal

10:38.370 --> 10:41.160
to true when the game is done.

10:41.160 --> 10:43.080
So that will be useful for later so

10:43.080 --> 10:46.000
that the AI doesn't play Breakout indefinitely.

10:46.000 --> 10:50.460
All right, so that was basically the beginning

10:50.460 --> 10:53.280
of this train function with some initialization

10:53.280 --> 10:55.320
and some things that we have to do.

10:55.320 --> 10:58.178
The most important part here was

10:58.178 --> 11:00.480
that we have to desynchronize each training agent.

11:00.480 --> 11:03.420
So that's one first principle of the A3C model

11:03.420 --> 11:05.160
we have to apply.

11:05.160 --> 11:07.470
And now in the next tutorial, we will proceed to

11:07.470 --> 11:09.810
the synchronization with the shared model.

11:09.810 --> 11:11.910
Let's not forget that there are the different models

11:11.910 --> 11:14.340
but also the shared model which is a model

11:14.340 --> 11:16.170
that all the agents share.

11:16.170 --> 11:18.660
And so we have to synchronize with this shared model

11:18.660 --> 11:22.098
so that each agent can get this shared model to proceed

11:22.098 --> 11:25.068
to a small exploration of a certain number of steps.

11:25.068 --> 11:28.110
So that's what we'll do in the next tutorial.

11:28.110 --> 11:30.033
And until then, enjoy AI.