WEBVTT

00:00.420 --> 00:02.490
Hello and welcome to this tutorial.

00:02.490 --> 00:04.650
All right, so now we have our AI.

00:04.650 --> 00:06.300
It is ready to be trained,

00:06.300 --> 00:07.740
and the first step of the training

00:07.740 --> 00:09.870
is to set up experience replay.

00:09.870 --> 00:12.090
So we're slowly getting to the training,

00:12.090 --> 00:15.390
and the good news is that we have an implemented version

00:15.390 --> 00:18.750
of experience replay that is already adapted

00:18.750 --> 00:20.280
to eligibility trace,

00:20.280 --> 00:22.620
which, I remind you, is a technique that

00:22.620 --> 00:25.350
instead of learning the Q-values every transition,

00:25.350 --> 00:27.210
learns them every 10 transitions.

00:27.210 --> 00:29.460
So basically that's exactly the same as before,

00:29.460 --> 00:31.650
but instead of having a single target,

00:31.650 --> 00:33.390
a single reward for each step,

00:33.390 --> 00:35.970
we're gonna have a cumulative target on 10 steps

00:35.970 --> 00:38.160
and a cumulative reward on 10 steps,

00:38.160 --> 00:40.980
and we will learn on the 10 steps each time.

00:40.980 --> 00:42.810
So we are learning on 10 transitions,

00:42.810 --> 00:45.150
10 steps instead of one like before.
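
NOTE
A minimal sketch of the cumulative reward over 10 steps described above.
The helper name cumulative_reward and the discount factor gamma are
assumptions for illustration; gamma=1.0 gives the plain sum of rewards.
```python
# Hypothetical helper: accumulates the rewards observed over one window of steps.
# gamma is an assumed discount factor; gamma=1.0 is the plain sum.
def cumulative_reward(rewards, gamma=1.0):
    total = 0.0
    # Walk backwards so each earlier step discounts everything after it.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
window = [1.0] * 10  # ten steps, each with a made-up reward of 1
plain_sum = cumulative_reward(window)
discounted = cumulative_reward(window, gamma=0.5)
```
With gamma below 1, rewards that come later in the window count for less.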

00:45.150 --> 00:47.820
And with this, our AI will work wonders,

00:47.820 --> 00:50.640
especially for the training process.

00:50.640 --> 00:52.620
The training should take much less time

00:52.620 --> 00:53.940
thanks to this technique.

00:53.940 --> 00:57.000
But we have to specify in experience replay

00:57.000 --> 00:58.950
that we're learning every 10 steps.

00:58.950 --> 01:01.860
So that's why this experience replay is not

01:01.860 --> 01:04.290
a classic implementation of experience replay

01:04.290 --> 01:06.180
like the one for the self-driving car.

01:06.180 --> 01:08.820
It is an experience replay implementation

01:08.820 --> 01:12.000
that takes into account this 10-step learning.

01:12.000 --> 01:13.050
And therefore you will find

01:13.050 --> 01:16.080
in this experience replay file two classes,

01:16.080 --> 01:19.890
one class that makes your AI progress during 10 steps

01:19.890 --> 01:23.430
so that it can sum the rewards observed on these 10 steps.

01:23.430 --> 01:24.600
That's the first class.

01:24.600 --> 01:25.530
And we need this class

01:25.530 --> 01:27.870
because we need to include these 10 steps

01:27.870 --> 01:29.310
in the replay memory class,

01:29.310 --> 01:31.830
which is the class we implement for experience replay.

01:31.830 --> 01:34.980
And that's how we make sure that the memory also takes

01:34.980 --> 01:38.010
into account the fact that we're learning on 10 steps.

01:38.010 --> 01:39.630
So that's why you will find two classes

01:39.630 --> 01:41.820
in this implementation of experience replay.

01:41.820 --> 01:44.280
But that's only to take into account

01:44.280 --> 01:45.930
that we're learning on 10 steps

01:45.930 --> 01:49.380
and that must be taken into account also in the memory.

01:49.380 --> 01:51.930
So speaking of our memory, let's create it.

01:51.930 --> 01:55.410
We're gonna call our memory memory.

01:55.410 --> 01:58.350
And so memory is going to be an object

01:58.350 --> 02:00.510
of the replay memory class,

02:00.510 --> 02:02.460
and the replay memory class is a class

02:02.460 --> 02:05.040
of this experience replay Python file.

02:05.040 --> 02:08.943
And so I'm taking first this file experience replay,

02:10.200 --> 02:15.200
then dot, and that's where I take the replay memory class.

02:15.930 --> 02:18.030
Perfect. And now as you can see,

02:18.030 --> 02:19.860
we have to input two arguments.

02:19.860 --> 02:22.050
The first argument is n steps,

02:22.050 --> 02:25.020
which corresponds exactly to the number of steps

02:25.020 --> 02:27.480
on which we're gonna learn the Q-values.

02:27.480 --> 02:30.330
So, you know, the number of steps over which we accumulate

02:30.330 --> 02:32.070
the target and the reward.

02:32.070 --> 02:34.320
So we're gonna have a cumulative target

02:34.320 --> 02:35.910
and a cumulative reward.

02:35.910 --> 02:38.190
And then the second argument is the capacity.

02:38.190 --> 02:39.810
That is the size of the memory.

02:39.810 --> 02:42.780
So, for example, here we can see 10,000.

02:42.780 --> 02:45.180
So if the capacity is equal to 10,000,

02:45.180 --> 02:47.790
that means that the memory will have a size of 10,000,

02:47.790 --> 02:50.850
and therefore that means that we will get a memory

02:50.850 --> 02:54.360
made of the last 10,000 steps performed by the AI.
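
NOTE
A rough sketch of what a capacity of 10,000 means in practice. BoundedMemory
is a hypothetical stand-in, not the replay memory class from the course file;
it only shows that the oldest entries drop out once the capacity is reached.
```python
from collections import deque
# Hypothetical stand-in for a capacity-bounded memory.
class BoundedMemory:
    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry once it is full.
        self.buffer = deque(maxlen=capacity)
    def push(self, item):
        self.buffer.append(item)
    def __len__(self):
        return len(self.buffer)
memory = BoundedMemory(capacity=10000)
for step in range(12000):
    memory.push(step)  # each integer stands in for one stored step
oldest, newest = memory.buffer[0], memory.buffer[-1]
```
After 12,000 pushes only the most recent 10,000 entries remain in the buffer.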

02:54.360 --> 02:57.360
But again, we're not gonna learn every transition.

02:57.360 --> 03:00.210
We're gonna learn every 10 steps among these last

03:00.210 --> 03:01.890
10,000 steps of the memory,

03:01.890 --> 03:03.750
and that's exactly this new feature

03:03.750 --> 03:06.150
that we introduced here compared to before.

03:06.150 --> 03:09.150
Before we only had this replay memory trick,

03:09.150 --> 03:11.190
and here we have this replay memory trick

03:11.190 --> 03:14.190
plus this trick of learning every 10 steps.

03:14.190 --> 03:16.020
And we're gonna learn every 10 steps,

03:16.020 --> 03:17.640
and we're gonna do it in the memory

03:17.640 --> 03:20.160
composed of the last 10,000 steps.

03:20.160 --> 03:23.250
And this, that is, experience replay combined

03:23.250 --> 03:27.390
with an eligibility trace over 10 steps, will considerably

03:27.390 --> 03:29.490
improve the training performance.

03:29.490 --> 03:31.170
So let's input these two arguments.

03:31.170 --> 03:33.810
The first one is n steps,

03:33.810 --> 03:36.450
and that will be equal to, well, for now,

03:36.450 --> 03:38.510
let's say n steps.

03:38.510 --> 03:41.670
We will specify what n steps is right after that.

03:41.670 --> 03:43.380
It will actually be an object

03:43.380 --> 03:46.590
of the other class of this experience replay file,

03:46.590 --> 03:48.480
which is the n step progress class,

03:48.480 --> 03:52.170
and that allows us to make the AI progress during 10 steps.

03:52.170 --> 03:53.820
And remember, during the 10 steps,

03:53.820 --> 03:56.280
we will sum the rewards on the 10 steps

03:56.280 --> 03:59.490
to get the cumulative reward over 10 steps.

03:59.490 --> 04:02.250
And that is exactly eligibility trace.

04:02.250 --> 04:06.180
So now what we have to do is create this n steps here,

04:06.180 --> 04:09.360
and we create it with the second class that we have

04:09.360 --> 04:11.340
in this experience replay file,

04:11.340 --> 04:13.110
which is n step progress.

04:13.110 --> 04:17.370
So now we're gonna create n steps like this,

04:17.370 --> 04:19.110
and this will be an object

04:19.110 --> 04:24.030
of the n step progress class

04:24.030 --> 04:29.030
that we take again from our experience replay file.

04:30.930 --> 04:31.830
There we go.

04:31.830 --> 04:33.660
So that's the n step progress class.

04:33.660 --> 04:35.760
And now we have to input three arguments.

04:35.760 --> 04:38.340
As you can see, we have to input the environment,

04:38.340 --> 04:41.220
which is the Doom environment here that we imported.

04:41.220 --> 04:43.680
Then the second argument is our AI,

04:43.680 --> 04:45.000
and this will be, of course,

04:45.000 --> 04:49.050
the AI that we built right here in the previous section.

04:49.050 --> 04:51.420
And the last argument is n step,

04:51.420 --> 04:55.560
and that's where we'll specify that we want 10 steps,

04:55.560 --> 04:57.180
to learn every 10 steps,

04:57.180 --> 04:59.190
that is, every 10 transitions.

04:59.190 --> 05:01.170
So let's input these arguments.

05:01.170 --> 05:02.790
The first one is the environment,

05:02.790 --> 05:04.920
and that's Doom,

05:04.920 --> 05:06.900
and all right,

05:06.900 --> 05:09.180
then the second one is our AI,

05:09.180 --> 05:11.820
and we called it AI.

05:11.820 --> 05:12.720
That's the one here.

05:12.720 --> 05:14.730
So this is just the name of the argument

05:14.730 --> 05:16.710
of the n step progress class,

05:16.710 --> 05:19.740
and this AI here is our AI,

05:19.740 --> 05:21.330
the one that we built.

05:21.330 --> 05:25.290
And then the last argument is n step,

05:25.290 --> 05:27.120
and that is equal to 10.
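
NOTE
A sketch of the idea behind progressing over 10 steps, under assumptions.
The function n_step_windows and the toy transitions are hypothetical, not the
n step progress class itself, and sliding (overlapping) windows are assumed.
```python
from collections import deque
# Hypothetical sketch: groups a stream of transitions into sliding windows
# of n_step consecutive transitions.
def n_step_windows(transitions, n_step):
    window = deque(maxlen=n_step)
    for transition in transitions:
        window.append(transition)
        if len(window) == n_step:
            # Emit a copy so later appends do not change what was yielded.
            yield tuple(window)
# Toy transitions: (state, action, reward) with made-up values.
toy_transitions = [(s, 0, 1.0) for s in range(12)]
windows = list(n_step_windows(toy_transitions, n_step=10))
rewards_per_window = [sum(r for _, _, r in w) for w in windows]
```
Each window holds 10 transitions, so its rewards can be summed in one go.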

05:27.120 --> 05:30.210
All right, so right now we are just taking into account

05:30.210 --> 05:33.480
in the memory that the learning happens over 10 steps,

05:33.480 --> 05:37.200
and this learning on 10 steps is called eligibility trace.

05:37.200 --> 05:39.630
So we're really working on the advanced stuff here,

05:39.630 --> 05:42.210
but remember that's because we're trying to beat Doom,

05:42.210 --> 05:44.250
and that's no piece of cake,

05:44.250 --> 05:47.640
so we need these advanced techniques to make it work.

05:47.640 --> 05:50.760
So now we're almost ready to move on to the next step,

05:50.760 --> 05:54.810
which will be actually about implementing eligibility trace.

05:54.810 --> 05:57.060
The only thing that we have to include

05:57.060 --> 05:59.250
is the capacity, of course,

05:59.250 --> 06:02.130
and that is, let's say, 10,000.

06:02.130 --> 06:04.830
The memory will have a size of 10,000,

06:04.830 --> 06:06.720
meaning that the memory will contain

06:06.720 --> 06:10.440
the last 10,000 steps performed by the AI,

06:10.440 --> 06:13.890
and that will allow us to generate some mini batches,

06:13.890 --> 06:16.650
as you remember, with a sample function.

06:16.650 --> 06:19.020
The memory contains 10,000 transitions,

06:19.020 --> 06:20.550
but to train the AI,

06:20.550 --> 06:24.060
we're going to sample some mini batches of 10 transitions,

06:24.060 --> 06:25.740
not one like before,

06:25.740 --> 06:27.180
10 transitions this time,

06:27.180 --> 06:30.000
and we will sample these mini batches of 10 transitions

06:30.000 --> 06:33.480
from the memory composed of the last 10,000 steps.
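
NOTE
A minimal sketch of drawing mini batches from a memory, assuming a shuffle
and cut strategy. The function sample_batches, the seeded generator and the
toy buffer are hypothetical; the sampling in the course file may differ.
```python
import random
# Hypothetical sketch of mini-batch sampling from stored experiences.
def sample_batches(buffer, batch_size, rng):
    # Shuffle the indices, then cut them into consecutive batches of batch_size.
    indices = list(range(len(buffer)))
    rng.shuffle(indices)
    for start in range(0, len(indices) - batch_size + 1, batch_size):
        yield [buffer[i] for i in indices[start:start + batch_size]]
rng = random.Random(0)  # seeded so the run is repeatable
buffer = list(range(100))  # stand-in for stored experiences
batches = list(sample_batches(buffer, batch_size=10, rng=rng))
```
Every stored item lands in exactly one batch; a leftover partial batch is dropped.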

06:33.480 --> 06:35.430
All right, so now I guess we're ready

06:35.430 --> 06:36.960
to move on to the next step,

06:36.960 --> 06:39.480
which is about implementing eligibility trace.

06:39.480 --> 06:41.640
So we're gonna have some adventure here.

06:41.640 --> 06:43.710
This will not be a simple implementation.

06:43.710 --> 06:45.270
So have a good break,

06:45.270 --> 06:47.880
and when you're ready, we can attack this.

06:47.880 --> 06:49.533
Until then, enjoy AI.