WEBVTT

00:00.240 --> 00:02.460
Hello, and welcome to this tutorial.

00:02.460 --> 00:03.450
I'm super excited

00:03.450 --> 00:05.820
because we're about to make the A3C brain

00:05.820 --> 00:07.560
that is the brain of our AI.

00:07.560 --> 00:09.180
And speaking about brains,

00:09.180 --> 00:10.860
I would like to highlight something.

00:10.860 --> 00:12.030
Remember in the first module

00:12.030 --> 00:15.810
we made a simple brain with only fully connected layers.

00:15.810 --> 00:17.610
Then in the second module for Doom,

00:17.610 --> 00:21.810
we made a brain that not only had fully connected layers,

00:21.810 --> 00:25.380
but also eyes, because we added the convolutional layers

00:25.380 --> 00:26.790
which gave eyes to the AI

00:26.790 --> 00:29.640
because it could observe the images

00:29.640 --> 00:31.800
and understand what's going on inside.

00:31.800 --> 00:34.266
And now we're gonna take it to an even higher level

00:34.266 --> 00:36.780
because we are gonna make a brain

00:36.780 --> 00:39.840
that not only will have eyes and fully connected layers,

00:39.840 --> 00:41.430
but also memory.

00:41.430 --> 00:43.830
Because as I said in the previous tutorial,

00:43.830 --> 00:45.900
we're gonna add a recurrent neural network

00:45.900 --> 00:47.370
inside this big brain,

00:47.370 --> 00:50.550
and that will give a long memory to our brain

00:50.550 --> 00:53.580
so that it can understand the temporal relationships,

00:53.580 --> 00:57.090
the temporal properties of the input images.

00:57.090 --> 00:57.960
So there we go,

00:57.960 --> 00:59.790
and an even more powerful brain.

00:59.790 --> 01:00.990
I can tell you that the model

01:00.990 --> 01:02.640
we're about to implement right now

01:02.640 --> 01:04.980
is really, really powerful.

01:04.980 --> 01:07.470
And we can see how building AIs

01:07.470 --> 01:08.460
and doing deep learning,

01:08.460 --> 01:10.110
doing deep reinforcement learning

01:11.195 --> 01:12.600
is all about getting closer and closer

01:12.600 --> 01:14.970
to how the human brain works.

01:14.970 --> 01:17.340
We started with the basic relationships in the brain

01:17.340 --> 01:19.500
with the linear full connections,

01:19.500 --> 01:20.580
then we added eyes,

01:20.580 --> 01:22.200
then we added the memory.

01:22.200 --> 01:25.230
Who knows what we're gonna add in the future models?

01:25.230 --> 01:27.150
In 2018, maybe they will add

01:27.150 --> 01:29.460
something that will make the brain look

01:29.460 --> 01:31.650
even more like a human brain.

01:31.650 --> 01:35.280
But already with fully connected layers, eyes, and a memory,

01:35.280 --> 01:38.730
we have already a really good and functional brain.

01:38.730 --> 01:39.750
So let's do it.

01:39.750 --> 01:41.280
Let's make this brain.

01:41.280 --> 01:43.500
So as usual, we're gonna make a class for that

01:43.500 --> 01:46.320
because it's gonna have a lot of properties

01:46.320 --> 01:48.840
with the convolutions and the LSTM.

01:48.840 --> 01:50.490
So we're gonna make an __init__ function

01:50.490 --> 01:52.170
to initialize all this,

01:52.170 --> 01:53.743
create all these connections,

01:53.743 --> 01:55.380
and then of course

01:55.380 --> 01:56.970
we'll have the forward function

01:56.970 --> 01:59.622
that will of course propagate the signal inside the brain

01:59.622 --> 02:02.550
so that we can get eventually the outputs.

02:02.550 --> 02:04.140
All right, are you ready?

02:04.140 --> 02:05.370
Let's do this.

02:05.370 --> 02:08.310
So class, we introduce a new class

02:08.310 --> 02:11.400
which we call ActorCritic,

02:11.400 --> 02:13.740
because of course I'm talking about brains here.

02:13.740 --> 02:16.493
But let's not forget that we're making the A3C model

02:16.493 --> 02:19.590
which is based on the actor-critic principle

02:19.590 --> 02:22.080
with separately the actor and the critic.

02:22.080 --> 02:23.490
So we will make actually

02:23.490 --> 02:25.560
one linear full connection for the actor

02:25.560 --> 02:27.780
and one linear full connection for the critic.

02:27.780 --> 02:28.720
You'll see how we will do that.

02:28.720 --> 02:30.870
It'll be actually quite simple.

02:30.870 --> 02:32.280
So ActorCritic,

02:32.280 --> 02:36.720
and this ActorCritic class is going to inherit

02:36.720 --> 02:38.430
from nn.Module

02:38.430 --> 02:40.830
so that we can use all the PyTorch tools.

02:40.830 --> 02:44.140
So to inherit from nn.Module,

02:48.282 --> 02:50.400
well, we need to take first the torch library,

02:50.400 --> 02:53.070
then dot nn, and then dot Module.

02:53.070 --> 02:54.030
All right.

02:54.030 --> 02:55.623
So that way we inherit from it.

02:56.970 --> 02:57.803
All right.

02:57.803 --> 02:59.250
So there we go with our first function

02:59.250 --> 03:01.503
which will be of course the init function.

03:02.410 --> 03:03.573
So we start with init,

03:04.745 --> 03:07.200
init with double underscores.

03:07.200 --> 03:09.150
Then this init function is going to take

03:09.150 --> 03:12.270
as argument self for the object.

03:12.270 --> 03:15.210
Then the input shape that is the dimensions

03:15.210 --> 03:19.170
of our input images, and we call it num_inputs.

03:19.170 --> 03:21.600
And the action space

03:21.600 --> 03:25.950
which is basically the space that contains all the actions.

03:25.950 --> 03:27.840
But also, you know, from this action space

03:27.840 --> 03:29.880
we can get the number of actions.

03:29.880 --> 03:31.680
That is, the number of possible actions,

03:31.680 --> 03:33.870
which we will actually get very soon.

03:33.870 --> 03:36.360
So that's why we also need it.

03:36.360 --> 03:39.630
So that's for the arguments, that's all we need.

03:39.630 --> 03:41.730
And then let's go inside the function

03:41.730 --> 03:45.810
and let's create all the variables specific to our brain.

03:45.810 --> 03:46.890
But before we do that

03:46.890 --> 03:49.740
remember what we have to do to activate

03:49.740 --> 03:53.040
in some way the inheritance, so that we can use all the tools

03:53.040 --> 03:54.990
from nn.Module,

03:54.990 --> 03:58.590
we have to use the super function this way

03:58.590 --> 04:03.590
inside of which we input ActorCritic, that is our class,

04:04.500 --> 04:08.130
and then, comma, self for the object.

04:08.130 --> 04:08.963
All right?

04:08.963 --> 04:10.200
Then dot.

04:10.200 --> 04:14.910
And there we go again with the init function,

04:14.910 --> 04:15.743
there we go.

04:15.743 --> 04:18.180
That gives us all the tools that we'll need

04:18.180 --> 04:20.370
from torch to build our brain.
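
NOTE
A minimal sketch of the class skeleton described so far, not the course's
final code. It shows only the inheritance from torch.nn.Module and the
super call. Storing the two arguments on self and passing action_space=None
are assumptions made so the sketch runs on its own.
```python
import torch
# The brain inherits from torch.nn.Module to get the PyTorch tools.
class ActorCritic(torch.nn.Module):
    def __init__(self, num_inputs, action_space):
        # Run the parent constructor first, so that layers assigned to
        # self afterwards are registered as parts of the model.
        super(ActorCritic, self).__init__()
        # Kept only for illustration; the layers are added after this point.
        self.num_inputs = num_inputs
        self.action_space = action_space
# action_space=None is a placeholder for this sketch only.
model = ActorCritic(num_inputs=1, action_space=None)
print(isinstance(model, torch.nn.Module))  # True
```
Without the super call, assigning a layer to self would raise an error,
because nn.Module has not yet set up its internal registries.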

04:20.370 --> 04:21.360
All right?

04:21.360 --> 04:24.750
Then, well, it's time to make the eyes of the AI

04:24.750 --> 04:26.430
that is the convolutions.

04:26.430 --> 04:28.230
So we're gonna do it very quickly

04:28.230 --> 04:30.090
because we already explained this

04:30.090 --> 04:32.610
in detail for Doom, because remember

04:32.610 --> 04:35.370
the AI for Doom had eyes, so it's exactly the same.

04:35.370 --> 04:37.260
We're gonna make some convolutions

04:37.260 --> 04:39.450
and we will use a very simple architecture

04:39.450 --> 04:43.200
with 32 feature detectors of size three by three,

04:43.200 --> 04:45.390
a stride of two, and a padding of one.

04:45.390 --> 04:46.980
So that's a pretty classic architecture

04:46.980 --> 04:49.710
but that will actually be enough to, you know

04:49.710 --> 04:52.294
make sure that the AI understands

04:52.294 --> 04:54.450
what's going on in the Breakout game.

04:54.450 --> 04:56.790
All right, so let's make those convolutions.

04:56.790 --> 04:58.230
So we start with self

04:58.230 --> 05:01.800
because the convolutions will be variables of the object.

05:01.800 --> 05:05.100
So self.conv, we can call it conv,

05:05.100 --> 05:06.810
and there's gonna be four convolutions.

05:06.810 --> 05:09.510
So I'm calling this one conv1.

05:09.510 --> 05:11.070
And there we go.

05:11.070 --> 05:14.010
We take the nn module dot

05:14.010 --> 05:17.250
and then we take the Conv2d class

05:17.250 --> 05:21.510
because actually conv1 will be an object of this class.

05:21.510 --> 05:22.343
And then

05:22.343 --> 05:25.920
inside first we input the input shape of the images.

05:25.920 --> 05:27.300
So that's exactly what we have here.

05:27.300 --> 05:32.300
So we can copy this and enter it as the first input.

05:33.060 --> 05:35.610
Then the second argument is the number

05:35.610 --> 05:38.640
of feature detectors or also the number of kernels.

05:38.640 --> 05:42.810
So we're gonna take 32, as we just said, classic choice.

05:42.810 --> 05:45.090
Then we need to choose the size of the kernel.

05:45.090 --> 05:46.830
That is the number of cells

05:46.830 --> 05:50.160
that will slide over the input image.

05:50.160 --> 05:51.690
And so remember, we can either take

05:51.690 --> 05:54.240
3, 4, or 5; those are common choices.

05:54.240 --> 05:55.940
And here we're gonna choose three.

05:56.790 --> 06:01.320
And then we're gonna choose a stride of two

06:01.320 --> 06:05.220
and a padding of one.

06:05.220 --> 06:06.150
There we go.

06:06.150 --> 06:07.890
So that's for the first convolution

06:07.890 --> 06:10.770
that goes from the input image

06:10.770 --> 06:13.440
to the first convolutional layer,

06:13.440 --> 06:16.320
composed of 32 convoluted images.

06:16.320 --> 06:18.900
So now we're ready to make the second convolution.

06:18.900 --> 06:21.450
So it's actually going to be almost the same.

06:21.450 --> 06:26.310
So I'm copying this line and pasting that below

06:26.310 --> 06:29.730
then pasting it below again, and pasting it one last time,

06:29.730 --> 06:32.190
because we're gonna have four convolutions

06:32.190 --> 06:34.650
with almost nothing to change.

06:34.650 --> 06:37.500
So we can already replace here conv1

06:37.500 --> 06:40.500
by conv2; conv1 by conv3;

06:40.500 --> 06:42.630
and conv1 by conv4.

06:42.630 --> 06:45.150
That will be our four convolutions.

06:45.150 --> 06:47.430
And now of course, we need to change some things here

06:47.430 --> 06:49.410
but not much because we're gonna keep a stride

06:49.410 --> 06:52.026
of two for each and a padding of one

06:52.026 --> 06:55.080
and they will all have 32 feature detectors.

06:55.080 --> 06:58.740
That is 32 output convoluted images.

06:58.740 --> 07:01.170
But then here remember, this corresponds

07:01.170 --> 07:04.140
to the left part of the convolution.

07:04.140 --> 07:05.970
So actually that corresponds to what was

07:05.970 --> 07:08.580
at the right part of the previous convolution.

07:08.580 --> 07:10.290
You know, remember, it's like a domino,

07:10.290 --> 07:11.520
so it's really easy.

07:11.520 --> 07:14.820
And therefore here we have to input 32

07:14.820 --> 07:19.820
and here as well. So we're gonna set, very easily, 32 and 32.

07:21.991 --> 07:24.870
All right, so to sum up, we start

07:24.870 --> 07:29.340
with our input images, which have num_inputs dimensions;

07:29.340 --> 07:33.510
with the first convolution, we get 32 convoluted images

07:33.510 --> 07:35.760
each one detecting a specific feature.

07:35.760 --> 07:38.700
Then from these 32 convoluted images, we apply

07:38.700 --> 07:43.380
the second convolution to get 32 new convoluted images.

07:43.380 --> 07:46.500
Then same from these 32 new convoluted images

07:46.500 --> 07:48.660
we apply the third convolution to get

07:48.660 --> 07:51.120
32 new convoluted images again.

07:51.120 --> 07:53.880
And then eventually from these 32 convoluted images

07:53.880 --> 07:57.450
we apply the fourth convolution to get features.

07:57.450 --> 07:59.400
All right, and this will be enough;

07:59.400 --> 08:01.530
with this, our AI will have super vision,

08:01.530 --> 08:03.660
it will detect the ball very well.

08:03.660 --> 08:05.700
All right, so that's it for the convolution.

08:05.700 --> 08:07.380
So that's it for the eyes.
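
NOTE
A hedged sketch of the four convolutions just described, written outside
the class so it runs on its own. The single input channel and the 42x42
frame size are assumptions for illustration; the transcript does not state
them. Activations between layers are left out because they are not covered
in this part.
```python
import torch
import torch.nn as nn
# Assumption for this sketch only: one input channel.
num_inputs = 1
# Four convolutions: 32 feature detectors, 3x3 kernel, stride 2, padding 1.
conv1 = nn.Conv2d(num_inputs, 32, 3, stride=2, padding=1)
conv2 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
conv3 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
conv4 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
# Dummy batch with a single 42x42 frame: (batch, channels, height, width).
frame = torch.zeros(1, num_inputs, 42, 42)
# Chain the layers like dominoes: each output feeds the next input.
features = conv4(conv3(conv2(conv1(frame))))
print(features.shape)  # torch.Size([1, 32, 3, 3])
```
The 32 passed as the first argument of conv2, conv3 and conv4 matches the
32 output images of the layer before it, which is the domino rule.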

08:07.380 --> 08:09.600
And now let's take care of the memory.

08:09.600 --> 08:10.607
This is the new feature of the brain we're implementing,

08:10.607 --> 08:14.670
as opposed to before with Doom,

08:14.670 --> 08:16.500
not only will it have super vision,

08:16.500 --> 08:18.840
but also it will have a super memory,

08:18.840 --> 08:20.880
a long memory, because we are going to

08:20.880 --> 08:23.970
implement an LSTM, long short-term memory,

08:23.970 --> 08:26.250
which is a kind of recurrent neural network

08:26.250 --> 08:29.850
that gives your model some kind of long memory

08:29.850 --> 08:33.000
so that it can learn some long temporal relationships

08:33.000 --> 08:34.470
from the past.

08:34.470 --> 08:37.650
So same, we're going to create a new variable.

08:37.650 --> 08:40.050
So I'm starting with self, and this new variable

08:40.050 --> 08:42.960
we're gonna call it simply LSTM

08:42.960 --> 08:44.040
because this will correspond

08:44.040 --> 08:47.250
to the LSTM network inside the brain.

08:47.250 --> 08:52.200
So LSTM, and now, before we write the code for the LSTM,

08:52.200 --> 08:54.990
let's make sure we understand what this LSTM

08:54.990 --> 08:56.490
part of the brain will do.

08:56.490 --> 09:00.180
So as we understood, this LSTM is used

09:00.180 --> 09:02.040
to learn the temporal properties

09:02.040 --> 09:04.050
of the input, of the input images.

09:04.050 --> 09:07.050
So for example, if the ball hits a brick,

09:07.050 --> 09:09.720
the LSTM will encode the bounce.

09:09.720 --> 09:11.370
So that's the first thing to understand.

09:11.370 --> 09:15.030
It will kind of encode what's happening in the game.

09:15.030 --> 09:17.310
Then the next important thing to understand when

09:17.310 --> 09:20.970
we implement an LSTM is that we need to choose an order

09:20.970 --> 09:22.920
of the temporal dependencies.

09:22.920 --> 09:25.950
And here, since we're gonna feed our neural network

09:25.950 --> 09:28.680
with a sequence of four images, then that means

09:28.680 --> 09:32.220
that we can already learn some temporal dependencies

09:32.220 --> 09:33.540
of order four.

09:33.540 --> 09:34.583
That is some temporal dependencies,

09:34.583 --> 09:38.340
where what happens at time T plus one,

09:38.340 --> 09:40.380
depends on what happens at time T

09:40.380 --> 09:43.230
T minus one, T minus two, and T minus three.

09:43.230 --> 09:45.383
So we can definitely do that.

09:45.383 --> 09:48.690
But the good news is that we're gonna use an LSTM

09:48.690 --> 09:51.150
and therefore we will be able to learn some

09:51.150 --> 09:54.420
even more complex temporal relationships.

09:54.420 --> 09:55.620
That is, for example, we can learn

09:55.620 --> 09:58.140
some temporal properties where what happens

09:58.140 --> 10:01.350
at time T plus one will depend on what happens at time T

10:01.350 --> 10:05.940
T minus one, T minus two, T minus three down to T minus N.

10:05.940 --> 10:09.000
And that's the long part in the LSTM,

10:09.000 --> 10:10.650
long short-term memory.

10:10.650 --> 10:11.580
With this LSTM

10:11.580 --> 10:15.480
we can learn some very complex temporal relationships.

10:15.480 --> 10:18.150
All right, so let's add our LSTM.

10:18.150 --> 10:21.090
So to do this, we're gonna use the nn module

10:21.090 --> 10:25.560
and then we're gonna add the LSTMCell class,

10:25.560 --> 10:28.350
which will create this LSTM object,

10:28.350 --> 10:31.560
which will represent the LSTM part of the neural network.

10:31.560 --> 10:33.960
Because right now, what's also important to understand

10:33.960 --> 10:37.340
is that we're making a CRNN, you know,

10:37.340 --> 10:39.720
a convolutional recurrent neural network,

10:39.720 --> 10:42.990
and the RNN part comes after the CNN part.

10:42.990 --> 10:45.600
And therefore, right now what we need to input

10:45.600 --> 10:47.520
in this LSTM cell

10:47.520 --> 10:51.360
is first the size of the output after the convolution.

10:51.360 --> 10:56.190
So that is 32 times three times three.

10:56.190 --> 11:00.413
So this 32 times three times three is actually the output

11:00.413 --> 11:02.820
after the four convolutions here.

11:02.820 --> 11:07.820
But that becomes the input of the RNN, the LSTM network.

11:07.860 --> 11:10.830
And now why does the output of the four convolutions

11:10.830 --> 11:13.830
have the size 32 times three times three?

11:13.830 --> 11:16.050
Well, don't worry, it's not that direct.

11:16.050 --> 11:18.000
It's actually not a simple formula

11:18.000 --> 11:20.940
but there is a formula to compute this number

11:20.940 --> 11:23.580
of output neurons after flattening the pooled

11:23.580 --> 11:26.700
and convoluted images of the convolutions.

11:26.700 --> 11:29.610
But if we gather the terms of this big formula,

11:29.610 --> 11:32.160
well, we get 32 times three times three.

11:32.160 --> 11:33.540
I didn't wanna spend too much time

11:33.540 --> 11:36.180
on this because we have a lot more to do.

11:36.180 --> 11:37.860
And besides we already made a function

11:37.860 --> 11:39.480
to compute this number.

11:39.480 --> 11:40.950
Remember, it was for Doom

11:40.950 --> 11:43.860
when we made this count_neurons function.

11:43.860 --> 11:46.950
So you can reuse it if you want, if you're not convinced

11:46.950 --> 11:50.190
but that is just what we get after gathering the terms

11:50.190 --> 11:53.490
of this big formula computing the number of outputs.
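
NOTE
A small sketch of where 32 times 3 times 3 can come from, using the
standard output-size formula for a convolution:
out = floor((in + 2 * padding - kernel) / stride) + 1.
The 42x42 input size and the helper name conv_output_size are assumptions
for illustration; this is not the count_neurons function from the Doom
module.
```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Side length of a square image after one convolution."""
    return (size + 2 * padding - kernel) // stride + 1
# Assumption for this sketch: square input frames of 42x42 pixels.
size = 42
sizes = []
for _ in range(4):  # four identical convolutions
    size = conv_output_size(size)
    sizes.append(size)
print(sizes)  # [21, 11, 6, 3]
# 32 feature maps of 3x3 each, flattened into one vector.
flat = 32 * size * size
print(flat)  # 288
```
With a different input resolution the final side length changes, so the
first argument of the LSTM would have to change with it.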

11:53.490 --> 11:55.770
So that's for the first argument.

11:55.770 --> 11:58.470
And then the second argument is going to be the number

11:58.470 --> 12:03.470
of output neurons of the LSTM, and we're gonna go for 256.
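
NOTE
A minimal sketch of the LSTM cell with the two arguments just named, run
for a single time step. The zero tensors standing in for the flattened
convolution output and for the initial hidden and cell states are
assumptions for illustration.
```python
import torch
import torch.nn as nn
# Input size 32 * 3 * 3 (flattened convolution output), 256 output neurons.
lstm = nn.LSTMCell(32 * 3 * 3, 256)
# Stand-in for the flattened features of one frame (batch of 1).
x = torch.zeros(1, 32 * 3 * 3)
# Hidden state and cell state, both starting at zero for this sketch.
hx = torch.zeros(1, 256)
cx = torch.zeros(1, 256)
# One time step: the cell returns the new hidden state and cell state.
hx, cx = lstm(x, (hx, cx))
print(hx.shape)  # torch.Size([1, 256])
```
Feeding hx and cx back in at the next step is what carries the memory from
one frame to the next.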

12:04.620 --> 12:07.140
Okay? And so what does that mean now?

12:07.140 --> 12:09.750
That means that now we have a vector

12:09.750 --> 12:12.510
that encodes each event of the game.

12:12.510 --> 12:15.870
Or in other words, we have an encoded state.

12:15.870 --> 12:18.750
And so it is now that we can make the separation

12:18.750 --> 12:21.630
between the actor and the critic, because you know

12:21.630 --> 12:25.200
we're gonna make actually two separate neural networks,

12:25.200 --> 12:27.270
one for the actor and one for the critic,

12:27.270 --> 12:29.190
but there will be the same encoding

12:29.190 --> 12:32.460
of the images and the temporal relationships

12:32.460 --> 12:33.990
for these two neural networks.

12:33.990 --> 12:36.570
So this is the common part that we do

12:36.570 --> 12:37.740
for these two neural networks.

12:37.740 --> 12:41.460
This will be the same beginning for the two neural networks

12:41.460 --> 12:44.435
but now things are gonna change for the actor and the critic

12:44.435 --> 12:46.725
because we're gonna make one linear full connection

12:46.725 --> 12:48.630
for the actor

12:48.630 --> 12:52.200
and one different linear full connection for the critic.
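
NOTE
A hedged preview of the two linear full connections, which are built in the
next tutorial. The layer names, the single output for the critic and the
four actions are assumptions for illustration only; both heads read the
same 256-value encoded state.
```python
import torch
import torch.nn as nn
# Hypothetical number of actions; the source takes it from the action space.
num_actions = 4
# Both heads start from the same 256-value output of the LSTM.
critic_linear = nn.Linear(256, 1)           # one value estimate per state
actor_linear = nn.Linear(256, num_actions)  # one score per possible action
# Stand-in for the encoded state of one frame (batch of 1).
encoded_state = torch.zeros(1, 256)
value = critic_linear(encoded_state)
action_scores = actor_linear(encoded_state)
print(value.shape, action_scores.shape)
```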

12:52.200 --> 12:53.850
So let's take a quick break

12:53.850 --> 12:56.400
and let's do that in the next tutorial.

12:56.400 --> 12:58.173
Until then, enjoy AI.