WEBVTT

00:00.420 --> 00:02.820
-: Hello and welcome to this Python tutorial.

00:02.820 --> 00:05.370
All right, so we just updated the memory

00:05.370 --> 00:06.900
after reaching the new state

00:06.900 --> 00:09.300
and now let's take care of the next update.

00:09.300 --> 00:10.230
According to you now,

00:10.230 --> 00:12.810
what is going to be the next update?

00:12.810 --> 00:16.800
Well, basically we are done with one transition.

00:16.800 --> 00:19.800
We updated the last element of the transition

00:19.800 --> 00:21.150
which is the new state.

00:21.150 --> 00:23.490
So now it's like we're starting all over again.

00:23.490 --> 00:26.070
And when we are starting all over again, well, it's like,

00:26.070 --> 00:28.380
you know, we are in this new state of the environment

00:28.380 --> 00:31.500
and so what do we need to do now naturally?

00:31.500 --> 00:33.840
Well, of course it's to play an action.

00:33.840 --> 00:37.320
Because we already got the observation of the new state.

00:37.320 --> 00:40.830
So now, the next thing that we have to do is play an action.

00:40.830 --> 00:43.290
And therefore what we need to do now is of course

00:43.290 --> 00:47.010
use the select action function to play the action.

00:47.010 --> 00:47.843
So let's do it.

00:47.843 --> 00:50.640
Let's create a new variable, action,

00:50.640 --> 00:54.240
and let's play the action with the select action function.

00:54.240 --> 00:57.990
So I'm taking, well, first self to specify

00:57.990 --> 01:01.230
that the select action function is a method

01:01.230 --> 01:04.440
of the object of the DQN class that will be created.

01:04.440 --> 01:09.440
So self dot select action. Here we go. Self dot

01:10.080 --> 01:12.480
select action. And then of course,

01:12.480 --> 01:16.530
since the select action function takes the state as input

01:16.530 --> 01:18.690
because of course the select action function

01:18.690 --> 01:22.350
will return the output of the neural network when

01:22.350 --> 01:25.470
the current input state enters the neural network.

01:25.470 --> 01:27.810
So we have to input the input state here

01:27.810 --> 01:30.870
and since that's the state that we just reached

01:30.870 --> 01:33.510
in the environment right now, well, the input state is

01:33.510 --> 01:37.140
of course new state, because the state that we just reached

01:37.140 --> 01:40.260
at the time we are at right now is new state.

01:40.260 --> 01:42.090
So in the select action function

01:42.090 --> 01:45.150
I'm inputting new state.

01:45.150 --> 01:47.250
All right, so with this line of code

01:47.250 --> 01:52.140
we simply play the new action after reaching the new state.
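
NOTE
Editor's sketch of the line just described, action = self.select_action(new_state).
Everything besides that call is assumed for illustration: the Dqn class name, the toy
q_values stand-in for the neural network, and the pick-the-highest-score rule. The
actual selection rule is not shown in this passage.
```python
class Dqn:
    def __init__(self, nb_action):
        self.nb_action = nb_action
    def q_values(self, state):
        # Assumed toy "network": reuses the state components as one score per action.
        return [state[i % len(state)] for i in range(self.nb_action)]
    def select_action(self, state):
        # Assumed rule: return the index of the highest score.
        scores = self.q_values(state)
        return max(range(self.nb_action), key=lambda i: scores[i])
    def play(self, new_state):
        # The line from the tutorial: play an action from the state just reached.
        action = self.select_action(new_state)
        return action
brain = Dqn(nb_action=3)
action = brain.play([1.0, 0.0, 2.0])
```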

01:52.140 --> 01:54.480
Okay? And now that we played an action, well,

01:54.480 --> 01:57.000
we get the reward and therefore we

01:57.000 --> 01:58.980
get a feedback with the reward.

01:58.980 --> 02:02.190
And therefore, if we have more than 100 elements

02:02.190 --> 02:04.830
in the memory, well it would be time to learn.

02:04.830 --> 02:08.610
And therefore, what we must do now is what logically comes

02:08.610 --> 02:11.880
after selecting an action, which is of course to learn.

02:11.880 --> 02:13.890
The AI needs to start learning

02:13.890 --> 02:15.840
if it is doing the things the right way.

02:15.840 --> 02:18.870
And now since it just played the action, well

02:18.870 --> 02:20.850
we're gonna make the AI learn

02:20.850 --> 02:24.150
from its actions in the last 100 events.

02:24.150 --> 02:26.430
But before we apply this learn function

02:26.430 --> 02:30.330
we have to make this if condition to make sure

02:30.330 --> 02:33.810
that we already have reached more than 100 events.

02:33.810 --> 02:34.830
Because you know, we're learning

02:34.830 --> 02:37.500
from the random samples of the memory.

02:37.500 --> 02:41.160
You know, we have this huge memory of 10,000 elements.

02:41.160 --> 02:42.840
We're taking some random samples

02:42.840 --> 02:45.420
of the memory of 100 elements

02:45.420 --> 02:49.530
and the AI is learning from the information contained

02:49.530 --> 02:52.830
in this sample of 100 random transitions.

02:52.830 --> 02:55.110
So let's just make this if condition

02:55.110 --> 02:57.570
to make sure that the number

02:57.570 --> 03:02.190
of elements of the memory, self dot memory,

03:02.190 --> 03:04.290
and then be careful. Just a little trick here,

03:04.290 --> 03:07.440
self dot memory is the object of your replay memory class.

03:07.440 --> 03:10.920
But then the replay memory class has an attribute

03:10.920 --> 03:12.480
which is memory.

03:12.480 --> 03:17.280
So in fact, we need to take self dot memory dot memory.

03:17.280 --> 03:22.280
The first memory is the object of the replay memory class.

03:23.010 --> 03:26.400
And the second memory is the attribute here,

03:26.400 --> 03:28.200
self dot memory.
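
NOTE
Editor's sketch of why the check reads self.memory.memory: the first memory is the
replay-memory object, the second is the list it holds. The class layout, the push
method and the ready_to_learn helper are assumptions for illustration; the 10,000
capacity and the 100 threshold are the figures quoted in the talk.
```python
class ReplayMemory:
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []  # the attribute: a plain list of transitions
    def push(self, event):
        self.memory.append(event)
        if len(self.memory) > self.capacity:
            del self.memory[0]  # drop the oldest transition
class Dqn:
    def __init__(self):
        self.memory = ReplayMemory(capacity=10_000)  # the object
    def ready_to_learn(self):
        # len() is applied to the list attribute, not to the ReplayMemory object.
        return len(self.memory.memory) > 100
brain = Dqn()
ready_before = brain.ready_to_learn()
for step in range(101):
    brain.memory.push(("state", "next_state", 0.0, 0))
ready_after = brain.ready_to_learn()
```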

03:28.200 --> 03:32.820
So, if the number of elements in the memory is, well

03:32.820 --> 03:36.599
we want it to be larger than 100, then colon

03:36.599 --> 03:38.970
and then what happens?

03:38.970 --> 03:41.670
Well, we can learn, but before learning

03:41.670 --> 03:45.720
we need to get this random sample of 100 transitions.

03:45.720 --> 03:48.810
And this we can get with the sample function.

03:48.810 --> 03:52.530
And since the sample function returns the different batches,

03:52.530 --> 03:55.260
the states at time T, the states at time T plus one

03:55.260 --> 03:58.140
the actions at time T and rewards at time T.

03:58.140 --> 04:01.890
Well, what we need to do now is create some new variables

04:01.890 --> 04:04.140
which are gonna be the batch of the states at time T,

04:04.140 --> 04:05.490
the batch of the next states,

04:05.490 --> 04:08.160
the batch of the rewards and the batch of the actions.

04:08.160 --> 04:10.830
And we can simply give the same names

04:10.830 --> 04:15.830
as we gave for the arguments here and paste that here.

04:16.170 --> 04:19.170
And these variables will be equal

04:19.170 --> 04:22.800
to what the sample function returns

04:22.800 --> 04:26.100
because it returns exactly these batches of the states,

04:26.100 --> 04:28.320
next states, rewards and actions.

04:28.320 --> 04:32.340
So what we simply need to do now is get first

04:32.340 --> 04:35.520
our memory object, and then from this memory object

04:35.520 --> 04:38.850
we are gonna use the sample method

04:38.850 --> 04:40.470
which will take as input,

04:40.470 --> 04:43.170
well, the number of transitions we want our

04:43.170 --> 04:46.590
AI to learn from, that is 100.

04:46.590 --> 04:47.640
That's why we made sure

04:47.640 --> 04:50.580
that the memory had more than 100 transitions.

04:50.580 --> 04:54.810
So it's gonna learn from 100 transitions of the memory

04:54.810 --> 04:56.610
so the learning will be much better.

04:56.610 --> 04:59.670
And so now, let's make this learning happen.

04:59.670 --> 05:02.760
Well, since the learn method is a method

05:02.760 --> 05:05.430
of our DQN class, well we need to

05:05.430 --> 05:08.670
access this learn method from the future

05:08.670 --> 05:12.030
objects that will be created from the DQN class,

05:12.030 --> 05:14.490
and therefore what we need to take is self.

05:14.490 --> 05:17.430
Self refers to that object of the DQN class

05:17.430 --> 05:21.240
and then learn is this learn method.

05:21.240 --> 05:26.040
Learn method to which we input of course, these guys here,

05:26.040 --> 05:29.100
the batch state, the batch next state, the batch reward

05:29.100 --> 05:30.630
and the batch action.

05:30.630 --> 05:34.890
These are our batches, sampled from our memory

05:34.890 --> 05:39.180
and we get 100 of them because we have 100 transitions

05:39.180 --> 05:42.930
and from this 100 transitions, we take 100 states

05:42.930 --> 05:46.323
100 next states, 100 rewards, and 100 actions.

05:47.340 --> 05:48.840
So let's paste that here.

05:48.840 --> 05:51.810
And there we go. Now the learning will happen.

05:51.810 --> 05:54.570
It will happen from all these random batches.
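
NOTE
Editor's sketch of the if condition, the sample call and the learn call described
above. Assumptions for illustration: the (state, next_state, reward, action) layout
of a stored transition, the zip-based regrouping inside sample, the maybe_learn
helper, and the placeholder learn method that only counts its calls. The field order
unpacked from sample has to match the order used when transitions are pushed.
```python
import random
class ReplayMemory:
    def __init__(self):
        self.memory = []
    def push(self, event):
        self.memory.append(event)
    def sample(self, batch_size):
        # Pick batch_size random transitions, then regroup them field by field:
        # one tuple of states, one of next states, one of rewards, one of actions.
        transitions = random.sample(self.memory, batch_size)
        return tuple(zip(*transitions))
class Dqn:
    def __init__(self):
        self.memory = ReplayMemory()
        self.learn_calls = 0
    def learn(self, batch_state, batch_next_state, batch_reward, batch_action):
        # Placeholder: a real learn method would update the network here.
        self.learn_calls += 1
        return len(batch_state)
    def maybe_learn(self):
        # Learn only once the memory holds more than 100 transitions.
        if len(self.memory.memory) > 100:
            batch_state, batch_next_state, batch_reward, batch_action = self.memory.sample(100)
            return self.learn(batch_state, batch_next_state, batch_reward, batch_action)
        return 0
brain = Dqn()
for t in range(150):
    brain.memory.push((t, t + 1, 0.1, t % 3))
learned_from = brain.maybe_learn()
```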

05:54.570 --> 05:59.160
Perfect. And now what we need to do is the very

05:59.160 --> 06:01.350
last updates after, you know

06:01.350 --> 06:04.290
reaching a new state and playing an action.

06:04.290 --> 06:06.360
Well, we got the action to play

06:06.360 --> 06:10.230
but we still didn't update the action, that is, our self

06:10.230 --> 06:11.730
dot last action variable.

06:11.730 --> 06:13.740
So let's make sure we don't forget this.

06:13.740 --> 06:15.150
Let's do it right now.

06:15.150 --> 06:20.150
We will update the last action self dot last action equals

06:21.840 --> 06:23.790
and of course, action.

06:23.790 --> 06:25.380
The action that we just played here

06:25.380 --> 06:27.180
with the select action function.

06:27.180 --> 06:30.810
All right, so now the last action is updated, then same

06:30.810 --> 06:34.140
for the new state. We reached the new state

06:34.140 --> 06:37.410
but we haven't updated the last state yet, because

06:37.410 --> 06:40.740
of course the last state was before the state at time T.

06:40.740 --> 06:43.260
But since now we reached the new state T

06:43.260 --> 06:45.030
plus one at time T plus one.

06:45.030 --> 06:48.090
Well, the last state becomes this new state here

06:48.090 --> 06:50.280
and therefore we need to update it as well.

06:50.280 --> 06:55.280
Self dot last state equals our new state.

06:57.270 --> 06:59.760
There we go. And now what do we need to update?

06:59.760 --> 07:01.650
Well, there is only one thing left.

07:01.650 --> 07:03.660
That's of course the reward.

07:03.660 --> 07:08.165
And the reward is exactly the reward we get in reality.

07:08.165 --> 07:12.150
That will be the argument of this update function

07:12.150 --> 07:14.880
which if we make the connection to our map,

07:14.880 --> 07:19.740
will be the last reward. That is the reward we get right

07:19.740 --> 07:23.430
after playing the action in this reached new state.

07:23.430 --> 07:25.470
So if we go onto some sand

07:25.470 --> 07:28.350
this last reward will be bad minus one.

07:28.350 --> 07:30.180
If we get further from the goal

07:30.180 --> 07:33.630
we will get a slightly bad reward minus 0.2.

07:33.630 --> 07:35.670
If we get closer to the goal

07:35.670 --> 07:38.455
we will get a slightly good reward, 0.1.

07:38.455 --> 07:41.190
And if we get too close to one edge of the map

07:41.190 --> 07:43.500
well that will be a bad punishment.

07:43.500 --> 07:45.480
We will get minus one for each edge.

07:45.480 --> 07:47.100
So that's the last reward we get.
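
NOTE
Editor's sketch of the reward values just listed. The four numbers come from the talk;
the function name, its parameters and the order in which the checks are applied are
assumptions for illustration, not the map's actual code.
```python
def compute_last_reward(on_sand, distance, last_distance, near_edge):
    if on_sand:
        reward = -1.0   # driving onto sand
    elif distance < last_distance:
        reward = 0.1    # got closer to the goal
    else:
        reward = -0.2   # did not get closer to the goal
    if near_edge:
        reward = -1.0   # too close to an edge of the map
    return reward
on_sand_reward = compute_last_reward(True, 5.0, 6.0, False)
closer_reward = compute_last_reward(False, 5.0, 6.0, False)
```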

07:47.100 --> 07:49.110
In reality, that is the one that happens

07:49.110 --> 07:50.820
for real on the map.

07:50.820 --> 07:53.580
And this will be the argument of the update function.

07:53.580 --> 07:56.220
This last reward here, that's exactly this one.

07:56.220 --> 07:59.550
And so, since this is the argument of the update function

07:59.550 --> 08:02.160
well, that corresponds to this reward here

08:02.160 --> 08:07.160
and therefore our self dot last reward variable initialized

08:08.970 --> 08:11.010
at the beginning in this init function

08:11.010 --> 08:13.770
becomes the new reward we get.

08:13.770 --> 08:18.770
In reality that is reward, or that's the same last reward.

08:20.730 --> 08:23.970
All right, so now we updated our last reward.

08:23.970 --> 08:27.450
And now since we just got our last reward,

08:27.450 --> 08:29.850
well we can now update the reward window.

08:29.850 --> 08:33.840
You remember? The reward window we initialized here

08:33.840 --> 08:37.170
as one of the variables of the object of our DQN class.

08:37.170 --> 08:39.420
That's the window that's going to keep track

08:39.420 --> 08:40.740
of how this training is going

08:40.740 --> 08:44.280
by taking the average of the last 100 rewards.

08:44.280 --> 08:46.500
So you know, it'll be like a sliding window

08:46.500 --> 08:50.160
showing us how the mean of the reward is evolving.

08:50.160 --> 08:52.890
And so since we just got our last reward

08:52.890 --> 08:55.530
well we can update the reward window.

08:55.530 --> 08:57.060
And so how do we update it?

08:57.060 --> 08:59.760
Well, we simply need to append this last reward

08:59.760 --> 09:00.690
to the window.

09:00.690 --> 09:05.250
And therefore what I'm gonna do is take my reward window

09:05.250 --> 09:08.520
self dot reward window. Here it is.

09:08.520 --> 09:12.510
And then I'm going to use the append function.

09:12.510 --> 09:14.603
And inside the append function, we need to

09:14.603 --> 09:18.780
input the element we want to append to the reward window.

09:18.780 --> 09:21.480
And that's of course the reward.

09:21.480 --> 09:22.590
All right, perfect.

09:22.590 --> 09:25.590
And then, since this reward window

09:25.590 --> 09:27.900
is going to have a fixed size

09:27.900 --> 09:29.880
you know it's not going to be a growing window

09:29.880 --> 09:32.700
it's going to be a window of fixed size, sliding

09:32.700 --> 09:35.490
with time to show us the evolution of the reward.

09:35.490 --> 09:38.730
And so now we need to decide on a size for this window

09:38.730 --> 09:41.340
and it's simply the number of means

09:41.340 --> 09:43.890
of the rewards we will have in this window.

09:43.890 --> 09:46.140
And so for example, let's get, you know

09:46.140 --> 09:49.290
the last 1,000 means of the last 100 rewards.

09:49.290 --> 09:50.280
And so to make sure of it,

09:50.280 --> 09:54.900
we're gonna add if, then len, then we'd say

09:54.900 --> 09:59.900
our reward window and we simply add here, if the number

10:01.260 --> 10:06.000
of elements in the reward window is larger than 1,000

10:06.000 --> 10:11.000
well what we wanna do is delete the first element

10:12.540 --> 10:13.830
of this reward window.

10:13.830 --> 10:16.380
And the first element of this reward window

10:16.380 --> 10:19.170
has the index zero.

10:19.170 --> 10:20.003
All right?

10:20.003 --> 10:20.836
And now we make sure

10:20.836 --> 10:22.890
that this reward window will never get more

10:22.890 --> 10:24.570
than 1,000 elements.

10:24.570 --> 10:27.960
That is 1,000 means of the last 100 rewards.

10:27.960 --> 10:30.270
So that's perfect. This will be a window

10:30.270 --> 10:32.460
of fixed size, so that we can see if the mean

10:32.460 --> 10:35.880
of the reward is increasing and therefore if the training

10:35.880 --> 10:39.810
is going well and accordingly the car does what we want.
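
NOTE
Editor's sketch of the fixed-size sliding window: append the newest reward, and once
there are more than 1,000 elements delete the one at index 0. The RewardWindow wrapper
class and its mean helper are assumptions for illustration; in this sketch the window
stores the raw rewards that are appended, and the mean is computed over them.
```python
class RewardWindow:
    def __init__(self, max_size=1000):
        self.max_size = max_size
        self.reward_window = []
    def add(self, reward):
        self.reward_window.append(reward)
        if len(self.reward_window) > self.max_size:
            del self.reward_window[0]  # index 0 is the oldest entry
    def mean(self):
        # Average of whatever is currently in the window (0.0 when empty).
        if not self.reward_window:
            return 0.0
        return sum(self.reward_window) / len(self.reward_window)
window = RewardWindow()
for r in range(1500):
    window.add(float(r))
```
A collections.deque created with maxlen=1000 would do the same trimming automatically.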

10:39.810 --> 10:44.430
Perfect. And now one tiny little thing to do left according

10:44.430 --> 10:46.140
to you, what is it going to be?

10:46.140 --> 10:49.590
Well, remember this update function not only

10:49.590 --> 10:51.960
updates the different elements of the transition

10:51.960 --> 10:55.080
and the reward window, but also it returns the

10:55.080 --> 10:58.170
action that was played when reaching this new state.

10:58.170 --> 11:02.850
That's why we have in the map action equals brain dot update

11:02.850 --> 11:04.500
last reward, last signal,

11:04.500 --> 11:07.290
and therefore it's supposed to return something

11:07.290 --> 11:09.360
and the something it is supposed to return is

11:09.360 --> 11:10.800
of course the action.

11:10.800 --> 11:13.800
So the simple last thing we need to do here,

11:13.800 --> 11:17.160
is just return action.

11:17.160 --> 11:20.580
The action that was played when reaching the new state

11:20.580 --> 11:21.690
and that's it.

11:21.690 --> 11:23.820
Our update function is ready.

11:23.820 --> 11:25.920
It's going to do all the required updates

11:25.920 --> 11:29.460
and it'll return the action when reaching the new state.
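
NOTE
Editor's sketch putting the whole update function together in plain Python, in the
order walked through in this video. Assumptions for illustration: tuples stand in for
real state objects, select_action is a random placeholder, learn only counts its
calls, and the contents of the memory push (done in the previous step of the
tutorial) are guessed as (last_state, new_state, last_reward, last_action).
```python
import random
class ReplayMemory:
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
    def push(self, event):
        self.memory.append(event)
        if len(self.memory) > self.capacity:
            del self.memory[0]
    def sample(self, batch_size):
        return tuple(zip(*random.sample(self.memory, batch_size)))
class Dqn:
    def __init__(self, nb_action):
        self.nb_action = nb_action
        self.memory = ReplayMemory(capacity=10_000)
        self.reward_window = []
        self.last_state = (0.0,)
        self.last_action = 0
        self.last_reward = 0.0
        self.learn_calls = 0
    def select_action(self, state):
        # Placeholder policy (random); the real one queries the neural network.
        return random.randrange(self.nb_action)
    def learn(self, batch_state, batch_next_state, batch_reward, batch_action):
        # Placeholder: only counts calls instead of training a network.
        self.learn_calls += 1
    def update(self, reward, new_signal):
        new_state = tuple(new_signal)
        # Store the transition that ends in the state just reached.
        self.memory.push((self.last_state, new_state, self.last_reward, self.last_action))
        # Play an action from the new state.
        action = self.select_action(new_state)
        # Learn from 100 random transitions once more than 100 are stored.
        if len(self.memory.memory) > 100:
            batch_state, batch_next_state, batch_reward, batch_action = self.memory.sample(100)
            self.learn(batch_state, batch_next_state, batch_reward, batch_action)
        # Roll the "last" variables forward.
        self.last_action = action
        self.last_state = new_state
        self.last_reward = reward
        # Slide the reward window, capped at 1,000 elements.
        self.reward_window.append(reward)
        if len(self.reward_window) > 1000:
            del self.reward_window[0]
        return action
brain = Dqn(nb_action=3)
for step in range(1200):
    action = brain.update(0.1, [float(step)])
```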

11:29.460 --> 11:30.630
That's perfect.

11:30.630 --> 11:33.450
That was the last difficult function to make

11:33.450 --> 11:35.250
for all this AI process.

11:35.250 --> 11:37.170
Now the rest will be kid stuff.

11:37.170 --> 11:40.140
We will just make a score function to return the means

11:40.140 --> 11:42.180
of the rewards in the reward window,

11:42.180 --> 11:44.820
then we will make a save function to save the brain

11:44.820 --> 11:47.580
of the car whenever you want to quit the application

11:47.580 --> 11:49.380
and go back to it and of course,

11:49.380 --> 11:51.840
since you want to be able to load the brain of your car,

11:51.840 --> 11:54.990
when you get back to it, get back to the application.

11:54.990 --> 11:58.050
Well, we will end up by making a load function

11:58.050 --> 12:00.000
which will load your model

12:00.000 --> 12:02.820
after you saved your model with the save function.

12:02.820 --> 12:06.480
So three functions to do left, but it's going to be simple.

12:06.480 --> 12:08.910
And then we'll have the most exciting section

12:08.910 --> 12:11.700
of this first module. That is the demo.

12:11.700 --> 12:13.830
We will see if the AI works,

12:13.830 --> 12:15.900
we will see if the car reaches the goals

12:15.900 --> 12:18.210
and we will see how we can improve it.

12:18.210 --> 12:22.140
And then eventually you will have built your first AI.

12:22.140 --> 12:24.120
So I can't wait to start the demo.

12:24.120 --> 12:26.820
Let's make these three functions first, and until then,

12:26.820 --> 12:27.903
enjoy AI.