WEBVTT

00:00.510 --> 00:02.790
Instructor: Hello and welcome to this tutorial,

00:02.790 --> 00:05.070
almost the final tutorial of this module.

00:05.070 --> 00:07.020
I'm just going to explain the main code

00:07.020 --> 00:08.730
that will execute the whole thing

00:08.730 --> 00:10.740
before we get the exciting results

00:10.740 --> 00:12.060
and watch the videos.

00:12.060 --> 00:13.890
So, this is the main code,

00:13.890 --> 00:15.870
and as you can see, this is pretty short.

00:15.870 --> 00:18.780
We start by importing the libraries and the modules,

00:18.780 --> 00:21.480
and also the different classes and functions we made,

00:21.480 --> 00:24.360
like ActorCritic from our model file,

00:24.360 --> 00:26.640
the train function from the train file,

00:26.640 --> 00:28.740
and the test function from the test file,

00:28.740 --> 00:31.920
and of course, we import our optimizer.

00:31.920 --> 00:33.930
Then we start with the first section

00:33.930 --> 00:36.090
where we gather into a class

00:36.090 --> 00:37.500
all the parameters.

00:37.500 --> 00:39.690
And this is the Params class.

00:39.690 --> 00:41.460
Remember, params is the object

00:41.460 --> 00:43.980
that we create from this Params class.

00:43.980 --> 00:46.620
Each of its attributes is a parameter,

00:46.620 --> 00:49.170
like the learning rate, gamma, or tau.

00:49.170 --> 00:50.820
So let's go through them quickly.

00:50.820 --> 00:54.450
This first one, lr here, is the learning rate.

00:54.450 --> 00:57.060
As you can see, we choose a small learning rate, 0.0001.

00:57.060 --> 00:59.160
This second one is gamma, the discount factor.

00:59.160 --> 01:02.160
Again, we set it to 0.99.

01:02.160 --> 01:04.590
We take a tau parameter of 1,

01:04.590 --> 01:05.850
a seed of 1,

01:05.850 --> 01:07.380
16 processes,

01:07.380 --> 01:08.760
20 steps,

01:08.760 --> 01:10.650
and a maximum episode length of 10,000.

01:10.650 --> 01:12.510
Remember we spoke about that.

01:12.510 --> 01:14.580
This is the parameter we set

01:14.580 --> 01:17.700
to make sure the agent doesn't get stuck indefinitely

01:17.700 --> 01:18.990
in one state of the environment.

01:18.990 --> 01:21.000
So this will stop the game

01:21.000 --> 01:22.470
if the episode length

01:22.470 --> 01:25.020
goes over this maximum episode length.

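NOTE A minimal sketch of the episode cutoff just described, assuming a toy environment that never ends an episode on its own. ToyEnv and run_episode are hypothetical illustrations, not the course's train function; only the value 10,000 comes from the parameters above.
```python
class ToyEnv:
    """Hypothetical environment that never signals the end of an episode."""
    def reset(self):
        return 0
    def step(self, action):
        # Returns (state, reward, done); done is always False here.
        return 0, 0.0, False
def run_episode(env, max_episode_length=10000):
    """Play one episode and return its length."""
    env.reset()
    episode_length = 0
    done = False
    while not done:
        _, _, done = env.step(action=0)
        episode_length += 1
        # Force the episode to end once it reaches the maximum length,
        # so the agent cannot stay stuck in the same state forever.
        done = done or episode_length >= max_episode_length
    return episode_length
```
Without the cutoff, run_episode(ToyEnv()) would loop forever; with it, the episode stops after 10,000 steps.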
01:25.020 --> 01:27.390
And finally, of course,

01:27.390 --> 01:29.760
we set the name of our environment,

01:29.760 --> 01:30.960
Breakout-v0.

01:30.960 --> 01:33.150
And by the way, you can also play

01:33.150 --> 01:34.800
in another environment

01:34.800 --> 01:38.370
just by changing the environment name here.

01:38.370 --> 01:41.430
So if you want to play other versions of Breakout,

01:41.430 --> 01:43.470
or even other Atari games,

01:43.470 --> 01:47.190
you can simply replace Breakout-v0 here

01:47.190 --> 01:48.720
with the name of another game.

01:48.720 --> 01:51.240
But I can tell you that Breakout-v0

01:51.240 --> 01:53.700
is already very challenging.

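NOTE The parameters above could be gathered in a plain class like this minimal sketch. The attribute names (lr, gamma, tau, seed, num_processes, num_steps, max_episode_length, env_name) are assumptions about the code on screen; the values are the ones read out in this tutorial.
```python
class Params:
    """Container for the A3C hyperparameters (attribute names assumed)."""
    def __init__(self):
        self.lr = 0.0001                 # learning rate
        self.gamma = 0.99                # discount factor
        self.tau = 1.0                   # tau parameter, as read out above
        self.seed = 1                    # random seed
        self.num_processes = 16          # number of parallel processes
        self.num_steps = 20              # forward steps before each update
        self.max_episode_length = 10000  # cutoff so an episode cannot run forever
        self.env_name = 'Breakout-v0'    # Gym environment id
params = Params()
```
To try another game, only env_name needs to change, for example params.env_name = 'Pong-v0', another Gym Atari environment id.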
01:53.700 --> 01:56.130
All right, so those are all the parameters.

01:56.130 --> 01:59.370
And then there is the main code for the main run.

01:59.370 --> 02:01.470
And so here, let's see what we do.

02:01.470 --> 02:05.400
In this first line, we set one thread per core.

02:05.400 --> 02:08.430
Then in the second line, we get all our parameters

02:08.430 --> 02:12.150
by creating an object of the Params class,

02:12.150 --> 02:15.600
which initializes all these parameters here,

02:15.600 --> 02:18.090
because they are the attributes attached

02:18.090 --> 02:19.500
to this params object.

02:19.500 --> 02:20.970
Then we set the seed,

02:20.970 --> 02:22.980
then we get our environment

02:22.980 --> 02:26.490
using the create_atari_env function

02:26.490 --> 02:28.680
with the name of our environment,

02:28.680 --> 02:30.000
which is Breakout-v0.

02:30.000 --> 02:31.830
You see self.env_name in the class,

02:31.830 --> 02:33.930
and therefore params.env_name

02:33.930 --> 02:35.370
is Breakout-v0.

02:35.370 --> 02:37.860
So that gets us the Breakout environment.

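NOTE The first lines of the main run can be sketched as below. Two assumptions: that "one thread per core" is done through the OMP_NUM_THREADS environment variable (the exact call is not shown in this transcript), and that Python's random module stands in for the framework's seeding function (in PyTorch that would be torch.manual_seed).
```python
import os
import random
# Limit each process to a single OpenMP thread. With 16 worker processes,
# letting each one start its own pool of threads would oversubscribe the CPU.
# This usually has to be set before the numerical library is imported.
os.environ['OMP_NUM_THREADS'] = '1'
def seeded_draws(seed, n=3):
    """Seed the generator, then return n pseudo-random numbers."""
    random.seed(seed)
    return [random.random() for _ in range(n)]
```
Seeding with the same value (the seed of 1 from the parameters) reproduces the same sequence on every run, so seeded_draws(1) == seeded_draws(1) holds.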
02:37.860 --> 02:39.990
And by the way, this is not the usual way

02:39.990 --> 02:41.640
of creating an environment,

02:41.640 --> 02:43.680
but to improve the whole process

02:43.680 --> 02:45.660
and the performance,

02:45.660 --> 02:46.590
we use this function

02:46.590 --> 02:50.010
to create an optimized environment.

02:50.010 --> 02:52.710
And we do this thanks to Universe.

02:52.710 --> 02:54.330
Universe is a package

02:54.330 --> 02:56.580
that we installed along with

02:56.580 --> 02:57.840
the OpenAI Gym packages.

02:57.840 --> 03:01.560
Well, thanks to Universe, we get an optimized environment.

03:01.560 --> 03:04.020
That's what this line is all about.

03:04.020 --> 03:06.150
Then we get our shared model

03:06.150 --> 03:09.180
by creating an object of the ActorCritic class.

03:09.180 --> 03:10.860
And so here it's important to understand

03:10.860 --> 03:12.900
that this shared model

03:12.900 --> 03:15.150
is the model shared by the different agents.

03:15.150 --> 03:18.210
So we have different processes on different cores.

03:18.210 --> 03:19.680
And speaking of processes,

03:19.680 --> 03:21.180
on the next line,

03:21.180 --> 03:23.370
shared_model.share_memory(),

03:23.370 --> 03:26.070
what we do is store the model

03:26.070 --> 03:28.140
in the shared memory of the computer

03:28.140 --> 03:30.840
so that all the processes can access it

03:30.840 --> 03:32.970
even if they run on different cores.

03:32.970 --> 03:34.500
So that's what we do here.

03:34.500 --> 03:36.810
This is what makes that possible.

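NOTE shared_model.share_memory() moves the model's parameter tensors into shared memory. The idea can be seen without PyTorch: the standard-library sketch below, an analogue rather than PyTorch's internal mechanism, creates a named shared-memory block, attaches a second handle to it by name as a worker process would, and checks that a write through one handle is visible through the other. shared_write_visible is a hypothetical helper name.
```python
from multiprocessing import shared_memory
import struct
def shared_write_visible(value):
    """Write a float through one handle and read it back through another."""
    # Shared block large enough for one 8-byte float (a stand-in "weight").
    owner = shared_memory.SharedMemory(create=True, size=8)
    try:
        # Second handle attached by name, as another process would do.
        worker = shared_memory.SharedMemory(name=owner.name)
        try:
            struct.pack_into('d', owner.buf, 0, value)        # update via owner
            return struct.unpack_from('d', worker.buf, 0)[0]  # read via worker
        finally:
            worker.close()
    finally:
        owner.close()
        owner.unlink()  # free the block once nobody needs it
```
In the course code, the single call shared_model.share_memory() plays this role for every parameter, so processes that receive the model all see the same storage.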
03:36.810 --> 03:39.030
Then we get our optimizer

03:39.030 --> 03:42.060
linked to the parameters of our shared model

03:42.060 --> 03:45.900
and with a learning rate of 0.0001.

03:45.900 --> 03:47.520
And again, it's important to understand

03:47.520 --> 03:49.980
that the optimizer is also shared

03:49.980 --> 03:52.200
because it's gonna act on the shared model.

03:52.200 --> 03:56.160
Likewise, on the next line, optimizer.share_memory(),

03:56.160 --> 03:58.830
we store the optimizer in the shared memory

03:58.830 --> 04:01.140
so that all the agents can access it

04:01.140 --> 04:02.880
to optimize the model.

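NOTE Why the optimizer must be shared too: an optimizer such as Adam keeps its own state (a step counter and running averages of the gradients), and every agent should update that same state. This transcript does not say which update rule our optimizer implements, so the sketch below assumes an Adam-style rule and keeps the state in multiprocessing.Value and multiprocessing.Array objects, which are allocated in shared memory. SharedAdamSketch is a hypothetical name, not the course's class.
```python
import math
from multiprocessing import Array, Value
class SharedAdamSketch:
    """Adam-style optimizer whose state lives in shared memory (hypothetical)."""
    def __init__(self, n_params, lr=0.0001, betas=(0.9, 0.999), eps=1e-8):
        self.lr, self.betas, self.eps = lr, betas, eps
        self.step_count = Value('i', 0)         # shared step counter
        self.exp_avg = Array('d', n_params)     # first moment, starts at zero
        self.exp_avg_sq = Array('d', n_params)  # second moment, starts at zero
    def step(self, params, grads):
        """Update params in place from grads, using the shared state."""
        beta1, beta2 = self.betas
        with self.step_count.get_lock():
            self.step_count.value += 1
            t = self.step_count.value
        # The moment updates below are not locked; workers may interleave.
        for i, g in enumerate(grads):
            self.exp_avg[i] = beta1 * self.exp_avg[i] + (1 - beta1) * g
            self.exp_avg_sq[i] = beta2 * self.exp_avg_sq[i] + (1 - beta2) * g * g
            m_hat = self.exp_avg[i] / (1 - beta1 ** t)     # bias-corrected mean
            v_hat = self.exp_avg_sq[i] / (1 - beta2 ** t)  # bias-corrected variance
            params[i] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)
```
Every agent holding the same SharedAdamSketch object advances one common step counter and one set of moment estimates, which is the point of sharing the optimizer along with the model.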
04:02.880 --> 04:05.730
Then we initialize our processes.

04:05.730 --> 04:09.300
The test process doesn't update the shared model;

04:09.300 --> 04:11.820
it just uses it to play the game,

04:11.820 --> 04:14.790
print the score, and record the videos.

04:14.790 --> 04:16.170
That's exactly what is done here

04:16.170 --> 04:17.910
with target=test.

04:17.910 --> 04:19.530
That's the test process.

04:19.530 --> 04:20.820
And this Process class here

04:20.820 --> 04:24.420
comes from torch.multiprocessing.

04:24.420 --> 04:28.410
What it does is basically

04:28.410 --> 04:31.830
run a function in an independent process.

04:31.830 --> 04:34.320
So when we call p.start(),

04:34.320 --> 04:35.850
we start a new process,

04:35.850 --> 04:38.340
the one initialized on this line.

04:38.340 --> 04:41.520
And then with processes.append(p),

04:41.520 --> 04:45.270
we add the process to the list of processes.

04:45.270 --> 04:48.240
And finally, in this loop here,

04:48.240 --> 04:51.180
we run all the other processes,

04:51.180 --> 04:54.750
the training ones, which update the shared model.

04:54.750 --> 04:56.400
And that's basically what happens

04:56.400 --> 04:58.350
in the last lines of code here.

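NOTE The process-launching lines described above can be sketched with the standard library's multiprocessing, whose Process interface torch.multiprocessing mirrors. dummy_test and dummy_train are hypothetical stand-ins for the course's test and train functions, and a shared counter stands in for the shared model. The sketch uses the "fork" start method (available on Linux and macOS, not Windows) to stay short; with "spawn", the launching code must sit under an if __name__ == '__main__': guard.
```python
import multiprocessing as mp
def dummy_test(rank, counter):
    """Stand-in for test(): would play with the shared model, never update it."""
    with counter.get_lock():
        counter.value += 1
def dummy_train(rank, counter):
    """Stand-in for train(): would update the shared model."""
    with counter.get_lock():
        counter.value += 1
def launch(num_processes):
    ctx = mp.get_context('fork')  # POSIX-only start method, keeps the sketch short
    counter = ctx.Value('i', 0)   # shared object every process can reach
    processes = []
    # The test process, given a rank after the training ranks (an assumption).
    p = ctx.Process(target=dummy_test, args=(num_processes, counter))
    p.start()            # start the new process
    processes.append(p)  # keep track of it in the list of processes
    # One training process per rank.
    for rank in range(num_processes):
        p = ctx.Process(target=dummy_train, args=(rank, counter))
        p.start()
        processes.append(p)
    # Wait for every process to finish.
    for p in processes:
        p.join()
    return counter.value, [p.exitcode for p in processes]
```
The final join loop is part of this sketch only; it waits for every worker so that the counter can be checked: launch(4) starts five processes, each of which increments the counter once.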
04:58.350 --> 05:00.930
So, if you don't want to get into the details,

05:00.930 --> 05:02.850
the important thing to understand is that

05:02.850 --> 05:06.210
this runs all the processes in parallel,

05:06.210 --> 05:09.150
so we should be all set to execute this code,

05:09.150 --> 05:12.810
get a trained model, and eventually watch the results.

05:12.810 --> 05:14.190
So I can't wait to do that.

05:14.190 --> 05:16.080
This is going to be pretty exciting.

05:16.080 --> 05:17.370
I will try to find (indistinct) now

05:17.370 --> 05:19.320
so that we can all watch it together.

05:19.320 --> 05:21.723
And so until next time, enjoy AI.