WEBVTT

00:00.360 --> 00:02.550
Instructor: Hello and welcome to this tutorial.

00:02.550 --> 00:05.550
All right, so in the previous tutorials we made the brain

00:05.550 --> 00:07.950
or if you want, the brains for the A3C.

00:07.950 --> 00:09.660
Now we need to train this brain,

00:09.660 --> 00:11.460
but in order to train these brains,

00:11.460 --> 00:13.140
we need an optimizer.

00:13.140 --> 00:15.840
That's the tool we'll use in stochastic gradient descent

00:15.840 --> 00:18.810
to update the weights according to how much they contribute

00:18.810 --> 00:22.170
to the error between the predictions and the targets.

00:22.170 --> 00:26.640
And what we did up until now in the first and second module,

00:26.640 --> 00:30.840
we used the Adam Optimizer from Torch in the training.

00:30.840 --> 00:33.000
But as I told you, we are dealing

00:33.000 --> 00:36.210
with a very challenging problem, that is Breakout.

00:36.210 --> 00:39.240
And the A3C algorithm by itself

00:39.240 --> 00:41.520
is not enough to solve this problem.

00:41.520 --> 00:44.520
We need some customized optimizers

00:44.520 --> 00:47.790
and a lot of different tricks to solve this problem

00:47.790 --> 00:49.500
without waiting for ages.

00:49.500 --> 00:51.870
So, that's the purpose of doing this

00:51.870 --> 00:54.330
and that is why we have a separate

00:54.330 --> 00:58.200
custom optimizer based on the Adam Optimizer.

00:58.200 --> 01:01.350
And that is contained in this SharedAdam class.

01:01.350 --> 01:02.850
And why SharedAdam?

01:02.850 --> 01:05.550
It's because it is actually the Adam Optimizer

01:05.550 --> 01:08.190
but that will work on shared states.

01:08.190 --> 01:11.220
So, we're going to explain how it works in this tutorial.

01:11.220 --> 01:13.890
So we're gonna go through the different functions here

01:13.890 --> 01:15.840
without coding them because, you know,

01:15.840 --> 01:18.570
we want to keep some energy for the next implementation

01:18.570 --> 01:20.550
that is train.py,

01:20.550 --> 01:23.160
which will take more than 100 lines of code.

01:23.160 --> 01:24.570
So be ready for that.

01:24.570 --> 01:27.660
And therefore we will try to explain what's going

01:27.660 --> 01:29.310
on here in one tutorial,

01:29.310 --> 01:30.450
this tutorial.

01:30.450 --> 01:32.790
And let's start right now.

01:32.790 --> 01:35.040
All right, so first we introduce this class

01:35.040 --> 01:37.560
SharedAdam, that will contain three functions.

01:37.560 --> 01:39.690
The init function, the share_memory function,

01:39.690 --> 01:41.160
and the step function.

01:41.160 --> 01:42.780
So, what we do first

01:42.780 --> 01:45.210
is that we inherit from

01:45.210 --> 01:46.650
optim.Adam,

01:46.650 --> 01:48.450
which is of course the Adam Optimizer

01:48.450 --> 01:50.790
that we get from the Optim module

01:50.790 --> 01:52.260
from the Torch library.

01:52.260 --> 01:54.930
So here we apply inheritance to get the tools

01:54.930 --> 01:57.270
all related to the Adam Optimizer.

01:57.270 --> 01:59.280
And then we start with the init function.

01:59.280 --> 02:01.140
So what happens here,

02:01.140 --> 02:05.190
first we use a super function to inherit from all the tools

02:05.190 --> 02:09.150
and all the basic parameters from the optim.Adam class.

02:09.150 --> 02:11.340
And these basic parameters are here

02:11.340 --> 02:14.970
params, learning rate, betas, epsilon,

02:14.970 --> 02:16.230
and weight decay.

02:16.230 --> 02:17.940
And then we start a for loop.

02:17.940 --> 02:21.840
This first for loop: for group in self.param_groups.

02:21.840 --> 02:24.180
So first, what is param_groups?

02:24.180 --> 02:27.300
self.param_groups contains all the attributes

02:27.300 --> 02:28.500
of the optimizer.

02:28.500 --> 02:31.140
And among these attributes we have the parameters

02:31.140 --> 02:32.670
that we have to optimize.

02:32.670 --> 02:35.880
These parameters that we want to optimize are the weights

02:35.880 --> 02:37.770
of the network that are contained

02:37.770 --> 02:40.140
in self.param_groups

02:40.140 --> 02:40.973
params.

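NOTE
A minimal sketch of what param_groups looks like, assuming PyTorch is installed. The tiny Linear model and the learning rate are hypothetical and chosen only for illustration.
```python
import torch
import torch.optim as optim
# Hypothetical tiny model: one weight tensor and one bias tensor.
model = torch.nn.Linear(4, 2)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
# param_groups is a list of dicts; each dict holds the settings for one group.
group = optimizer.param_groups[0]
setting_names = sorted(k for k in group if k != "params")
# The "params" entry lists the tensors of weights to optimize.
num_tensors = len(group["params"])
print(setting_names, num_tensors)
```
Settings such as lr, betas, eps, and weight_decay sit next to the params entry in each group.
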
02:40.973 --> 02:42.240
So, there we go.

02:42.240 --> 02:44.940
Group belongs to self.param_groups.

02:44.940 --> 02:46.860
And here we have this second for loop,

02:46.860 --> 02:50.520
which will get these parameters that we want to optimize

02:50.520 --> 02:54.900
and that are exactly contained in self.param_groups params.

02:54.900 --> 02:55.770
So basically,

02:55.770 --> 02:57.450
we go through self.param_groups

02:57.450 --> 02:59.520
that contains all the parameters.

02:59.520 --> 03:03.810
And for each group of parameters in self.param_groups,

03:03.810 --> 03:05.220
we are gonna go through

03:05.220 --> 03:07.500
the parameters that we want to optimize.

03:07.500 --> 03:11.160
Therefore, for p in group params here means

03:11.160 --> 03:14.400
for each tensor of weights that we want to optimize.

03:14.400 --> 03:17.160
So for each tensor of weights that we want to optimize

03:17.160 --> 03:19.470
and then what happens inside this loop

03:19.470 --> 03:21.780
with these four lines of code?

03:21.780 --> 03:23.220
Basically what happens is

03:23.220 --> 03:26.850
that the update made by the Adam Optimizer

03:26.850 --> 03:29.850
is based on an exponential moving average

03:29.850 --> 03:31.170
of the gradient.

03:31.170 --> 03:33.030
That's this line of code here.

03:33.030 --> 03:35.520
That's the exponential moving average of the gradient

03:35.520 --> 03:38.490
of moment one, that is of order one.

03:38.490 --> 03:41.640
But the update made by Adam is not only based on that;

03:41.640 --> 03:45.180
it is also based on an exponential moving average

03:45.180 --> 03:47.220
of the square of the gradient.

03:47.220 --> 03:48.990
That is an exponential moving average

03:48.990 --> 03:52.020
of the gradient of moment two or order two.

03:52.020 --> 03:55.470
So here is the exponential moving average of order one

03:55.470 --> 03:57.270
and here is the exponential moving average

03:57.270 --> 03:58.200
of order two.

03:58.200 --> 04:00.750
For each of them, the EMA of the gradient.

04:00.750 --> 04:02.070
So that's what happens here.

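NOTE
A minimal sketch of the constructor walked through above, assuming PyTorch is installed. The class name SharedAdam, the default values, and the state keys step, exp_avg, and exp_avg_sq are assumptions modeled on torch.optim.Adam, not necessarily the course's exact code.
```python
import torch
import torch.optim as optim
class SharedAdam(optim.Adam):
    """Adam whose per-parameter state is allocated up front (sketch)."""
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0):
        # Reuse the parent constructor for all the standard Adam settings.
        super().__init__(params, lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        for group in self.param_groups:      # each group is a dict of settings
            for p in group["params"]:        # each p is one tensor of weights
                state = self.state[p]
                state["step"] = torch.zeros(1)
                # Exponential moving average of the gradient (order one).
                state["exp_avg"] = torch.zeros_like(p.data)
                # Exponential moving average of the squared gradient (order two).
                state["exp_avg_sq"] = torch.zeros_like(p.data)
# Hypothetical tiny model, used only to show the pre-allocated state.
model = torch.nn.Linear(4, 2)
optimizer = SharedAdam(model.parameters(), lr=1e-4)
state_keys = sorted(optimizer.state[model.weight].keys())
print(state_keys)
```
Allocating the state in the constructor, rather than lazily on the first step, is what lets it be shared before any training starts.
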
04:02.070 --> 04:04.500
And now, if you want to get more in depth

04:04.500 --> 04:06.840
of how the exponential moving average works,

04:06.840 --> 04:08.580
well I highly encourage you to have a look

04:08.580 --> 04:10.387
at this research paper,

04:10.387 --> 04:13.530
"Adam: A Method for Stochastic Optimization"

04:13.530 --> 04:14.850
because basically,

04:14.850 --> 04:17.550
the Adam Optimizer that we're implementing right now

04:17.550 --> 04:20.880
is based on Algorithm 1 here.

04:20.880 --> 04:22.925
So if you want to have more details on

04:22.925 --> 04:24.810
how the algorithm works,

04:24.810 --> 04:27.660
well this paper will definitely be helpful.

04:27.660 --> 04:29.880
And then you have some further explanations

04:29.880 --> 04:32.850
on the algorithm with Adam's update rules.

04:32.850 --> 04:33.720
And so, you know, that's

04:33.720 --> 04:36.270
only if you want to attack this before

04:36.270 --> 04:39.390
attacking the big train function that we'll make afterwards.

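NOTE
A minimal sketch of one Adam update for a single scalar parameter, following Algorithm 1 of the paper mentioned above. The function name adam_step and the example values are hypothetical; the defaults are the ones suggested in the paper.
```python
import math
def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    t += 1
    # Exponential moving average of the gradient (order one).
    m = beta1 * m + (1 - beta1) * grad
    # Exponential moving average of the squared gradient (order two).
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction: both averages start at zero, so early values are too small.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v, t
# Hypothetical starting point: parameter 1.0, gradient 0.5, empty averages.
theta, m, v, t = adam_step(theta=1.0, grad=0.5, m=0.0, v=0.0, t=0)
print(theta, m, v, t)
```
On the very first step the bias correction cancels the scale of the gradient, so the parameter moves by roughly lr.
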
04:39.390 --> 04:42.180
Okay, so let's go back to Python.

04:42.180 --> 04:44.820
And now let's move on to the second function,

04:44.820 --> 04:46.170
share_memory.

04:46.170 --> 04:48.000
So now I'm just gonna say a few words.

04:48.000 --> 04:50.880
The idea of this share_memory function is kind of

04:50.880 --> 04:52.830
like tensor.cuda.

04:52.830 --> 04:55.860
You know CUDA is an accelerator based on the GPU.

04:55.860 --> 04:58.140
And so basically what happens here is that

04:58.140 --> 05:02.790
we have these tensors of the state with .share_memory_()

05:02.790 --> 05:05.130
here, here, and here

05:05.130 --> 05:08.160
that behave a little bit like tensor.cuda.

05:08.160 --> 05:10.410
So you know, accelerated computations.

05:10.410 --> 05:12.510
But the difference is

05:12.510 --> 05:13.350
that here,

05:13.350 --> 05:15.180
the tensor.share_memory_

05:15.180 --> 05:17.760
sends the computations to a part of the GPU

05:17.760 --> 05:18.840
or the CPU

05:18.840 --> 05:22.140
that is accessible to all the parallel threads.

05:22.140 --> 05:23.580
So that's basically what is done here.

05:23.580 --> 05:25.980
That's a little bit like tensor.cuda,

05:25.980 --> 05:27.420
but it's only sent

05:27.420 --> 05:30.360
to a part of the GPU or CPU accessible to the

05:30.360 --> 05:32.100
parallel threads.

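NOTE
A minimal sketch of sharing optimizer state, assuming PyTorch is installed. The helper name share_optimizer_state and the toy tensor are hypothetical. Tensor.share_memory_() moves a tensor's storage into shared memory so that several processes can see the same data; it concerns where the data lives, not how fast it is computed.
```python
import torch
import torch.optim as optim
def share_optimizer_state(optimizer):
    """Move every tensor held in the optimizer state into shared memory (hypothetical helper)."""
    for group in optimizer.param_groups:
        for p in group["params"]:
            for value in optimizer.state[p].values():
                if torch.is_tensor(value):
                    value.share_memory_()
# Hypothetical toy parameter with one pre-allocated state tensor.
w = torch.zeros(3, requires_grad=True)
optimizer = optim.Adam([w])
optimizer.state[w]["exp_avg"] = torch.zeros_like(w)
before = optimizer.state[w]["exp_avg"].is_shared()
share_optimizer_state(optimizer)
after = optimizer.state[w]["exp_avg"].is_shared()
print(before, after)
```
Workers started with torch.multiprocessing can then read and update the same moving averages.
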
05:32.100 --> 05:35.070
All right, and then we have the last function, step.

05:35.070 --> 05:36.240
So you know this function,

05:36.240 --> 05:39.690
it's like the step method of the Adam Optimizer

05:39.690 --> 05:41.790
that we already used in this course.

05:41.790 --> 05:42.630
And so again,

05:42.630 --> 05:46.170
this is based on Algorithm 1 of the same paper

05:46.170 --> 05:47.160
that we saw before.

05:47.160 --> 05:48.810
So this algorithm,

05:48.810 --> 05:50.250
so again, if you want to understand

05:50.250 --> 05:52.380
in detail the following lines of code,

05:52.380 --> 05:54.510
well again I encourage you to have a look

05:54.510 --> 05:57.570
at Algorithm 1 of this paper.

05:57.570 --> 06:01.620
And besides what is done here is not totally compulsory

06:01.620 --> 06:03.840
because this is actually a copy paste

06:03.840 --> 06:07.170
of the step method of the optim.Adam class.

06:07.170 --> 06:09.630
So, basically what is done here

06:09.630 --> 06:12.720
we could have done it by using our inheritance

06:12.720 --> 06:15.630
because here we inherit from optim.Adam.

06:15.630 --> 06:18.240
And so to use our inheritance,

06:18.240 --> 06:20.400
well, what we can do instead of doing all this

06:20.400 --> 06:21.300
is just-

06:21.300 --> 06:22.830
I'm gonna write it here as a comment,

06:22.830 --> 06:25.140
is just use the super function

06:25.140 --> 06:28.410
that we apply to our

06:28.410 --> 06:29.670
SharedAdam class,

06:29.670 --> 06:31.560
then our object self.

06:31.560 --> 06:34.770
And here we just add step with parenthesis.

06:34.770 --> 06:39.150
Step is the method of the optim.Adam class.

06:39.150 --> 06:40.710
And that's exactly the same.

06:40.710 --> 06:43.650
That's why I was just saying that here is just a copy paste

06:43.650 --> 06:46.860
of the step method of the optim.Adam class.

06:46.860 --> 06:49.680
So I think that if you replace all this

06:49.680 --> 06:52.500
by this super function applied to SharedAdam

06:52.500 --> 06:53.820
and the step method,

06:53.820 --> 06:55.983
well we might get exactly the same thing.

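NOTE
A minimal sketch of delegating step to the parent class, assuming PyTorch is installed. The toy parameter and learning rate are hypothetical. Whether this matches a hand-written step exactly may depend on the PyTorch version and on how the state was pre-allocated, so treat the equivalence as an assumption to check.
```python
import torch
import torch.optim as optim
class SharedAdam(optim.Adam):
    """Adam subclass whose step reuses the inherited implementation (sketch)."""
    def step(self, closure=None):
        # Delegate to optim.Adam.step instead of copying its body.
        return super().step(closure)
# Hypothetical toy parameter: two weights, both starting at 1.0.
w = torch.ones(2, requires_grad=True)
optimizer = SharedAdam([w], lr=0.1)
# The gradient of the sum with respect to each weight is 1.
w.sum().backward()
optimizer.step()
updated = w.detach().tolist()
print(updated)
```
On the first step each weight moves by about lr, here from 1.0 to roughly 0.9.
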
06:57.210 --> 06:58.043
All right. So,

06:58.043 --> 06:59.880
that was interesting to have a quick look at it.

06:59.880 --> 07:02.850
Basically, you can see this as the Adam Optimizer.

07:02.850 --> 07:04.620
It's like we had a deeper look at it.

07:04.620 --> 07:05.580
But again,

07:05.580 --> 07:07.650
if you want to go into more detail on all this

07:07.650 --> 07:10.830
and if you want to understand what happens behind the scenes,

07:10.830 --> 07:14.160
well I encourage you to have a look at this research paper.

07:14.160 --> 07:16.050
I'll put the link in the comments here.

07:16.050 --> 07:18.960
You know, remember you will have all the code commented

07:18.960 --> 07:19.950
in great detail.

07:19.950 --> 07:22.560
So it's really good if you can have a look at it.

07:22.560 --> 07:25.410
And now I hope you have some great energy

07:25.410 --> 07:29.880
because we are going to move on to the train file,

07:29.880 --> 07:32.460
which will contain this huge train function.

07:32.460 --> 07:34.710
And that will basically train our brains,

07:34.710 --> 07:37.650
which now we can do because we have our optimizer.

07:37.650 --> 07:39.240
So have a good break now.

07:39.240 --> 07:40.230
Have a good sleep.

07:40.230 --> 07:41.940
And whenever you feel in great shape,

07:41.940 --> 07:44.400
let's move on to the next step.

07:44.400 --> 07:46.173
Until then, enjoy AI.