WEBVTT

00:00.420 --> 00:03.210
Hello and welcome to this Python tutorial.

00:03.210 --> 00:06.180
So now we are ready to train the network

00:06.180 --> 00:08.160
to minimize the square distance

00:08.160 --> 00:09.960
between the outputs and the target,

00:09.960 --> 00:12.660
thanks to what we did with eligibility trace

00:12.660 --> 00:13.980
in the previous section.

00:13.980 --> 00:17.130
Well, basically, we are ready to start the whole training

00:17.130 --> 00:19.440
by, you know, getting our inputs, our targets,

00:19.440 --> 00:21.930
our predictions, then computing the loss error

00:21.930 --> 00:23.880
between the predictions and the target,

00:23.880 --> 00:26.010
and then doing the backward propagation

00:26.010 --> 00:28.440
with stochastic gradient descent to update the weights.

00:28.440 --> 00:30.480
So we are ready to do all this,

00:30.480 --> 00:33.780
but since we want to compute the moving average

00:33.780 --> 00:37.050
on 100 steps, you know, to keep track of the average

00:37.050 --> 00:38.130
during the training.

00:38.130 --> 00:41.010
Well, just before we do this whole training,

00:41.010 --> 00:44.130
we are going to make a class right now

00:44.130 --> 00:47.520
that will get this moving average on 100 steps.

00:47.520 --> 00:49.470
So no worries, we will do it quickly.

00:49.470 --> 00:51.120
We will make a class with three functions,

00:51.120 --> 00:53.790
but we will do all this in this single tutorial.

00:53.790 --> 00:54.930
So we will do it quickly.

00:54.930 --> 00:56.460
We already did it, and besides,

00:56.460 --> 00:58.650
we want to focus on the training right now

00:58.650 --> 01:00.810
because that's the most important.

01:00.810 --> 01:04.470
So let's make this class right now in this single tutorial.

01:04.470 --> 01:07.290
All right, so we are going to introduce a new class

01:07.290 --> 01:11.190
which we're gonna call MA for moving average.

01:11.190 --> 01:13.290
And then here we go with our first function.

01:13.290 --> 01:16.350
So that, of course, the init function,

01:16.350 --> 01:18.113
that never changes, init.

01:19.020 --> 01:22.110
And this init function is going to take two arguments.

01:22.110 --> 01:25.020
The first one is self, for the future

01:25.020 --> 01:27.960
moving average object, and size,

01:27.960 --> 01:32.700
which will correspond to the size of the list of the rewards

01:32.700 --> 01:34.620
of which we're going to compute the average.

01:34.620 --> 01:36.840
So this is going to be 100.

01:36.840 --> 01:39.447
All right, so we have our arguments for the init function,

01:39.447 --> 01:42.090
and now, let's go inside the function.

01:42.090 --> 01:43.140
Now, you know what to do.

01:43.140 --> 01:47.430
We have to initialize the variables specific to the object.

01:47.430 --> 01:48.900
And these are, well, first,

01:48.900 --> 01:53.900
the first one is going to be a list of rewards.

01:54.330 --> 01:57.480
So that's gonna be the list containing the 100 rewards

01:57.480 --> 01:59.310
of which we're going to compute the average.

01:59.310 --> 02:03.690
So here right now, we're just simply initializing this list

02:03.690 --> 02:05.820
with this empty list here.

02:05.820 --> 02:06.840
So list of rewards,

02:06.840 --> 02:10.650
and then the second variable of our future object

02:10.650 --> 02:13.440
is going to be, of course, the size.

02:13.440 --> 02:16.350
And the size is going to be equal to the argument

02:16.350 --> 02:18.330
we will input when creating

02:18.330 --> 02:20.520
the future moving average object.

02:20.520 --> 02:22.380
So size here,

02:22.380 --> 02:26.760
and already, we are ready to move on to the next function,

02:26.760 --> 02:29.490
which is going to be the add function.

02:29.490 --> 02:32.220
And that will add the cumulative rewards.

02:32.220 --> 02:34.080
Be careful, it's not the simple reward,

02:34.080 --> 02:35.580
it's the cumulative reward,

02:35.580 --> 02:38.280
and that's because, you know, we are doing eligibility trace

02:38.280 --> 02:40.740
and therefore, learning every 10 steps,

02:40.740 --> 02:43.020
and therefore, learning with cumulative rewards

02:43.020 --> 02:44.670
and not a simple reward.

02:44.670 --> 02:47.550
So this add function that we're about to make

02:47.550 --> 02:52.550
will add the cumulative reward to that list of rewards.

02:52.800 --> 02:56.670
So def, we're gonna call it add, of course.

02:56.670 --> 02:59.340
And this add function is gonna take two arguments.

02:59.340 --> 03:00.750
The first one is self

03:00.750 --> 03:04.350
because we're gonna use this list of rewards here.

03:04.350 --> 03:07.080
Because simply, we're gonna append the cumulative rewards

03:07.080 --> 03:08.460
to this list of rewards

03:08.460 --> 03:11.010
so we need this self to be able to get this.

03:11.010 --> 03:11.843
So self.

03:11.843 --> 03:15.150
And the second one is going to be the rewards

03:15.150 --> 03:18.510
which will represent the cumulative reward.

03:18.510 --> 03:21.590
All right, so that's our two arguments of the add function.

03:21.590 --> 03:23.520
So now, let's go inside the function

03:23.520 --> 03:25.680
and let's define what it has to do.

03:25.680 --> 03:28.140
Okay, so very simply, the first thing it has to do

03:28.140 --> 03:32.190
is whenever we get a cumulative reward, a new one,

03:32.190 --> 03:34.560
you know, when we progress on 10 new steps,

03:34.560 --> 03:35.490
well, what we have to do

03:35.490 --> 03:39.240
is add this cumulative reward to the list.

03:39.240 --> 03:40.500
And that's exactly what we're gonna do.

03:40.500 --> 03:42.300
We're going to write a line of code

03:42.300 --> 03:44.400
that will add this new cumulative reward

03:44.400 --> 03:47.160
that we're getting after progressing on 10 steps

03:47.160 --> 03:49.260
to this list of rewards here.

03:49.260 --> 03:53.370
And to do this, we have to separate two conditions

03:53.370 --> 03:56.100
because, since we will be working with batches,

03:56.100 --> 03:57.990
the rewards will sometimes be in a list,

03:57.990 --> 03:59.430
but in some other cases,

03:59.430 --> 04:02.250
the rewards can also be a single element.

04:02.250 --> 04:06.570
And the syntax to add an element to a list,

04:06.570 --> 04:08.970
which is the list of rewards here, is not the same

04:08.970 --> 04:13.320
whether you're adding a list or a single element.

04:13.320 --> 04:14.670
So we just have to make this condition

04:14.670 --> 04:17.010
that will separate these two cases.

04:17.010 --> 04:18.750
And let's start with the first case

04:18.750 --> 04:21.270
which is the case when what we're adding

04:21.270 --> 04:23.700
to this list of rewards is a list.

04:23.700 --> 04:27.180
And to do this, we're gonna add isinstance,

04:27.180 --> 04:29.820
and in parenthesis, we input two arguments,

04:29.820 --> 04:33.150
the first one is our rewards that we're adding,

04:33.150 --> 04:34.650
so rewards.

04:34.650 --> 04:37.620
And the second one is list.

04:37.620 --> 04:40.980
And so if isinstance(rewards, list)

04:40.980 --> 04:45.030
means if the rewards are in a list.

04:45.030 --> 04:48.450
And so if the rewards are in a list,

04:48.450 --> 04:53.430
what we do is very simply self dot.

04:53.430 --> 04:56.470
We take our list of rewards

04:57.540 --> 05:01.230
and we are going to add this list.

05:01.230 --> 05:02.880
Because since this is a list,

05:02.880 --> 05:06.210
what we can do is use a simple addition operation

05:06.210 --> 05:08.700
because we can sum two lists together.

05:08.700 --> 05:09.930
The rewards here is a list

05:09.930 --> 05:11.910
because this will be equal to true,

05:11.910 --> 05:13.380
I mean, if we're in this case.

05:13.380 --> 05:18.380
And so we can simply sum this list to our list of rewards.

05:18.630 --> 05:21.870
And therefore, we can simply add here list of rewards

05:21.870 --> 05:24.840
plus equals rewards.

05:24.840 --> 05:27.750
And by doing this, we're just extending the list

05:27.750 --> 05:31.050
by summing these two lists together.

05:31.050 --> 05:31.883
All right.

05:31.883 --> 05:36.270
And then second condition, so we can simply add else.

05:36.270 --> 05:39.270
So that's if the rewards is not a list,

05:39.270 --> 05:42.960
and therefore, if it's a single element, and so else.

05:42.960 --> 05:44.460
What happens in that case?

05:44.460 --> 05:45.300
Well, that's the same.

05:45.300 --> 05:49.380
We want to add the reward to our list of rewards

05:49.380 --> 05:51.060
but we cannot use the same syntax

05:51.060 --> 05:53.620
because rewards will no longer be a list,

05:53.620 --> 05:55.500
it'll be a single element.

05:55.500 --> 05:58.470
And so what we need to use is another syntax,

05:58.470 --> 06:00.090
which is the append function.

06:00.090 --> 06:02.820
When you want to add a single element to a list,

06:02.820 --> 06:05.850
you cannot sum the two, you have to use the append function.

06:05.850 --> 06:07.890
And so this is exactly what we're gonna do now,

06:07.890 --> 06:12.890
we are going to take our list of rewards of the object

06:13.410 --> 06:18.300
and paste that here, and then add dot append.

06:18.300 --> 06:19.133
There we go.

06:19.133 --> 06:20.550
First one.

06:20.550 --> 06:21.810
And of course, in parenthesis,

06:21.810 --> 06:24.780
we input the element we want to append.

06:24.780 --> 06:26.430
And this is, of course, rewards,

06:26.430 --> 06:29.460
but rewards in that case, will not be a list,

06:29.460 --> 06:30.900
it will be a single element

06:30.900 --> 06:34.890
like a single cumulative reward, not in a list.
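
NOTE
A minimal sketch of the two cases described above, using plain Python only.
The variable name list_of_rewards and the sample values are assumptions for
illustration; the transcript only says "list of rewards".
```python
# Hypothetical starting state: two cumulative rewards already stored.
list_of_rewards = [1.0, 2.0]
# First a batch (a list), then a single cumulative reward (a float).
for rewards in ([3.0, 4.0], 5.0):
    if isinstance(rewards, list):
        # A list is concatenated onto the stored list.
        list_of_rewards += rewards
    else:
        # A single element has to be appended instead.
        list_of_rewards.append(rewards)
print(list_of_rewards)  # [1.0, 2.0, 3.0, 4.0, 5.0]
# Concatenating a list with a bare float raises TypeError,
# which is why the two cases need different syntax.
try:
    [1.0] + 5.0
    concat_failed = False
except TypeError:
    concat_failed = True
print(concat_failed)  # True
```
Note that isinstance(rewards, list) is False for tuples and NumPy arrays, so
with this check they would be appended as one single element.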

06:34.890 --> 06:37.680
All right, and then we want to do this

06:37.680 --> 06:39.480
but now, we have to add something more,

06:39.480 --> 06:43.320
which is what happens when this list of rewards

06:43.320 --> 06:45.540
gets more than 100 elements?

06:45.540 --> 06:47.610
Well, in that case, what we have to do

06:47.610 --> 06:50.880
is delete the first element of this list of rewards

06:50.880 --> 06:53.640
to make sure that this list of rewards

06:53.640 --> 06:56.400
always contains no more than 100 elements.

06:56.400 --> 06:58.860
So exactly like what we did for the self-driving car

06:58.860 --> 07:00.600
when making the score window.

07:00.600 --> 07:02.070
And so to make sure of this,

07:02.070 --> 07:05.070
we're gonna add a while loop,

07:05.070 --> 07:10.070
specifying that whenever the length of our list of rewards

07:11.520 --> 07:15.060
that is, the number of elements in our list of rewards,

07:15.060 --> 07:18.930
whenever this number is larger than self dot size,

07:18.930 --> 07:21.300
that is the size that we set here

07:21.300 --> 07:24.030
and which later will be equal to 100

07:24.030 --> 07:25.530
when we create the object.

07:25.530 --> 07:27.540
Well, as soon as the number of elements

07:27.540 --> 07:30.360
of this list of rewards is larger than 100,

07:30.360 --> 07:35.360
well, what we want to do is delete the first element

07:35.490 --> 07:37.200
of our list of rewards,

07:37.200 --> 07:40.410
which we can get by taking the index zero.

07:40.410 --> 07:43.560
That is the first index of our list of rewards.

07:43.560 --> 07:47.160
This is the first element of our list of rewards

07:47.160 --> 07:50.880
and we want to delete it whenever our list of rewards

07:50.880 --> 07:53.550
contains more than 100 elements.

07:53.550 --> 07:56.670
So that with this condition here,

07:56.670 --> 07:59.010
we make sure that our list of rewards

07:59.010 --> 08:02.220
never contains more than 100 elements.
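
NOTE
A small sketch of the trimming step described above, on a plain list. The
window size of 3 and the sample values are assumptions chosen to keep the
example short; the tutorial uses a size of 100.
```python
size = 3  # hypothetical small window, the tutorial uses 100
list_of_rewards = [10.0, 20.0, 30.0, 40.0, 50.0]
# Drop the oldest element (index 0) until the list fits the window.
while len(list_of_rewards) > size:
    del list_of_rewards[0]
print(list_of_rewards)  # [30.0, 40.0, 50.0]
```
A while loop rather than a single if matters here: adding a batch can push the
list more than one element over the limit, so several deletions may be needed.
A collections.deque with maxlen would be one alternative design, but it is not
what the tutorial uses.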

08:02.220 --> 08:04.530
And therefore, now what we can do

08:04.530 --> 08:08.130
is make a new function to compute the average

08:08.130 --> 08:09.510
of our list of rewards,

08:09.510 --> 08:12.570
which will contain up to 100 elements as we go.

08:12.570 --> 08:15.720
And therefore, we will compute the moving average

08:15.720 --> 08:18.060
on 100 steps each time.

08:18.060 --> 08:19.680
So let's make this new function.

08:19.680 --> 08:20.730
It's gonna be very easy

08:20.730 --> 08:23.190
because there is the mean function in Python,

08:23.190 --> 08:24.870
which is a function from NumPy,

08:24.870 --> 08:27.660
to compute the average of a list.

08:27.660 --> 08:29.790
And so let's introduce our last function here,

08:29.790 --> 08:32.160
which we're gonna call average.

08:32.160 --> 08:34.200
And this function is gonna take one argument,

08:34.200 --> 08:36.330
which is going to be self,

08:36.330 --> 08:37.800
because we're gonna use, of course,

08:37.800 --> 08:39.300
still our list of rewards,

08:39.300 --> 08:41.670
which is a variable of our object.

08:41.670 --> 08:44.760
So self and colon.

08:44.760 --> 08:46.950
And now, let's compute the average.

08:46.950 --> 08:50.340
And so directly, we will return the average

08:50.340 --> 08:53.520
because we can get it with the mean function

08:53.520 --> 08:55.830
which, of course, we are applying

08:55.830 --> 08:58.260
to what we want to compute the mean of,

08:58.260 --> 09:00.180
that is our list of rewards.

09:00.180 --> 09:01.950
I think I still copied it.

09:01.950 --> 09:03.150
Yes, there we go.

09:03.150 --> 09:07.350
So we simply return the mean of our list of rewards.

09:07.350 --> 09:11.430
And the mean, as I said, is a function from NumPy.

09:11.430 --> 09:12.810
So here I'm adding the shortcut,

09:12.810 --> 09:16.740
np.mean(self.list_of_rewards).

09:16.740 --> 09:21.030
And there we go, we have our average on 100 steps.

09:21.030 --> 09:21.863
Perfect.

09:21.863 --> 09:24.180
So we made the class very efficiently.

09:24.180 --> 09:25.680
Now we get the instructions

09:25.680 --> 09:29.250
on how to obtain a moving average on 100 steps.

09:29.250 --> 09:33.600
And since we're gonna use one moving average object

09:33.600 --> 09:34.770
when doing the training,

09:34.770 --> 09:38.730
well, let's already create this moving average object.

09:38.730 --> 09:40.890
And so we're gonna call it ma,

09:40.890 --> 09:45.890
and simply ma is going to be an object of the MA class.

09:46.560 --> 09:51.540
And as we said, we want the size to be 100

09:51.540 --> 09:55.170
because we want to compute the moving average on 100 steps.

09:55.170 --> 09:57.090
So perfect. There we go.
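
NOTE
Putting the three functions together, the class described in this tutorial
might look like the sketch below. The attribute spelling list_of_rewards, the
lowercase object name ma, and the sample rewards are assumptions; the class
name MA, the size argument, and the use of np.mean come from the narration.
```python
import numpy as np
class MA:
    # Moving average over at most `size` cumulative rewards.
    def __init__(self, size):
        self.list_of_rewards = []
        self.size = size
    # Add a batch (list) of cumulative rewards or a single one.
    def add(self, rewards):
        if isinstance(rewards, list):
            self.list_of_rewards += rewards
        else:
            self.list_of_rewards.append(rewards)
        # Keep only the most recent `size` elements.
        while len(self.list_of_rewards) > self.size:
            del self.list_of_rewards[0]
    # Average of the rewards currently in the window.
    def average(self):
        return np.mean(self.list_of_rewards)
# Usage with hypothetical values.
ma = MA(100)
ma.add([1.0, 2.0])
ma.add(3.0)
print(ma.average())  # 2.0
```
One caveat with this sketch: calling average() before any reward has been
added passes an empty list to np.mean, which returns nan and emits a
RuntimeWarning.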

09:57.090 --> 10:01.830
We are now ready to train our AI to finally be intelligent.

10:01.830 --> 10:02.940
It's about time.

10:02.940 --> 10:06.210
It is from this point that our AI will become smart.

10:06.210 --> 10:07.890
So I can't wait to train it.

10:07.890 --> 10:09.210
It's gonna be quite easy

10:09.210 --> 10:11.370
because this is something we already did.

10:11.370 --> 10:12.720
But this is gonna be fun.

10:12.720 --> 10:16.230
And besides after that, it'll be time to have even more fun

10:16.230 --> 10:19.710
because basically, our AI will be fully ready, that is, built

10:19.710 --> 10:21.570
and also intelligent.

10:21.570 --> 10:23.670
And therefore, we will execute the code

10:23.670 --> 10:25.590
and then our AI will play "Doom."

10:25.590 --> 10:28.530
And eventually, we will watch the videos

10:28.530 --> 10:30.210
of our AI playing "Doom."

10:30.210 --> 10:33.480
And we will see if it manages to reach the vest.

10:33.480 --> 10:34.530
So I can't wait.

10:34.530 --> 10:35.820
Let's do the training.

10:35.820 --> 10:37.683
And until then, enjoy AI.