WEBVTT

00:00.240 --> 00:02.670
Instructor: Hello and welcome to this Python tutorial.

00:02.670 --> 00:04.890
Now we have to define the five variables

00:04.890 --> 00:06.240
of this __init__ function.

00:06.240 --> 00:08.040
That is the three convolutions,

00:08.040 --> 00:09.513
and the two full connections.

00:10.636 --> 00:11.469
So let's start with the first one.

00:11.469 --> 00:16.080
Convolution one applies convolution to the input images.

00:16.080 --> 00:18.210
So that's the original images.

00:18.210 --> 00:20.905
And now you're gonna see how simple

00:20.905 --> 00:22.740
it is to create this convolution.

00:22.740 --> 00:25.590
Well, what we have to do is actually create an object

00:25.590 --> 00:27.030
of some specific class.

00:27.030 --> 00:30.240
And this class is taken from nn,

00:30.240 --> 00:32.523
and the class is Conv2d.

00:34.560 --> 00:37.800
2D because we're working with 2D images.

00:37.800 --> 00:40.890
And now as you can see, we need to input several arguments.

00:40.890 --> 00:45.890
The first one is in_channels, so let's input in_channels.

00:45.960 --> 00:49.320
The second one is out_channels.

00:49.320 --> 00:52.650
The third one is kernel_size.

00:52.650 --> 00:55.140
And the rest of them are stride, padding,

00:55.140 --> 00:57.060
dilation, groups, and bias.

00:57.060 --> 00:59.220
And all of these have default values.

00:59.220 --> 01:00.510
So we're not going to input them.

01:00.510 --> 01:02.460
We're gonna keep the default values.
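
NOTE
Keeping the defaults (stride=1, padding=0, dilation=1) means each convolution shrinks the image. The sketch below applies the output-size formula from the PyTorch nn.Conv2d documentation to one spatial dimension. The function name conv2d_output_size and the 80-pixel image size are hypothetical, chosen only for illustration.
```python
def conv2d_output_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output height (or width) of a 2D convolution, following the nn.Conv2d formula."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1
# With all defaults kept, a 5x5 kernel removes 4 pixels from each dimension.
print(conv2d_output_size(80, 5))  # 76
print(conv2d_output_size(80, 5, padding=2))  # 80
```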

01:02.460 --> 01:05.018
But what's important are these three arguments: in_channels,

01:05.018 --> 01:06.777
out_channels, and kernel_size.

01:06.777 --> 01:09.870
So can you guess what they correspond to?

01:09.870 --> 01:12.030
Well, very simply, in_channels corresponds

01:12.030 --> 01:14.400
to the number of channels going into the convolution,

01:14.400 --> 01:16.650
and out_channels corresponds to the number of channels coming out

01:16.650 --> 01:17.910
of the convolution.

01:17.910 --> 01:20.070
So what is in_channels going to be?

01:20.070 --> 01:22.500
Well, very simply, that's going to be the number

01:22.500 --> 01:24.570
of channels in our images.

01:24.570 --> 01:26.310
And actually we are gonna work

01:26.310 --> 01:28.860
with black and white images because basically

01:28.860 --> 01:31.410
we don't need colors to recognize the monsters.

01:31.410 --> 01:32.880
The AI is totally capable

01:32.880 --> 01:35.430
of recognizing the monsters in black and white.

01:35.430 --> 01:36.930
So we don't need the colors,

01:36.930 --> 01:39.180
it'll just recognize them by their shape.

01:39.180 --> 01:41.700
And therefore we're gonna use one channel.

01:41.700 --> 01:44.220
So one channel is when you have black and white images

01:44.220 --> 01:46.890
and three channels is when you have colored images.

01:46.890 --> 01:48.360
And therefore, since we're working with

01:48.360 --> 01:50.340
black and white images, in_channels

01:50.340 --> 01:51.783
is going to be equal to one.
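
NOTE
To show where the single channel comes from, here is a hypothetical helper that turns an RGB image (three channels) into a one-channel image, which is the shape in_channels=1 expects. The luma weights 0.299, 0.587 and 0.114 (ITU-R BT.601) are one common choice, assumed here. Images are plain nested lists rather than tensors.
```python
def to_single_channel(rgb_image):
    """rgb_image: [height][width] of (r, g, b) tuples. Returns [1][height][width]."""
    gray = [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb_image]
    return [gray]  # one channel, so in_channels = 1
image = [[(255, 255, 255), (0, 0, 0)]]  # a 1x2 image: one white pixel, one black pixel
single = to_single_channel(image)
print(len(single))  # 1
```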

01:52.770 --> 01:54.210
Then out_channels.

01:54.210 --> 01:57.208
So out_channels is going to be equal to the number of images

01:57.208 --> 02:00.330
you want to have in the convolutional layer,

02:00.330 --> 02:02.940
which is the output of this first convolution.

02:02.940 --> 02:05.820
And so basically this is equal to the number

02:05.820 --> 02:09.210
of features you want to detect in your original images,

02:09.210 --> 02:12.150
because we will create one image per feature

02:12.150 --> 02:15.090
we want to detect. Basically, you know how it works.

02:15.090 --> 02:18.690
We apply one feature detector to the input image to

02:18.690 --> 02:21.240
detect a specific feature in the input image,

02:21.240 --> 02:23.070
and therefore the number of output images

02:23.070 --> 02:25.988
here is the number of features we want to detect.
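
NOTE
To make "one feature detector, one output image" concrete, here is a minimal pure-Python sketch (no PyTorch) that slides each kernel over a single-channel image with stride 1 and no padding. Like nn.Conv2d, it computes a cross-correlation, so the kernel is not flipped. The names feature_map and apply_detectors and the two example kernels are hypothetical. In nn.Conv2d each kernel also spans every input channel and adds a bias, and this sketch leaves both out.
```python
def feature_map(image, kernel):
    """Slide a k x k kernel over a 2D image (stride 1, no padding)."""
    k = len(kernel)
    height, width = len(image), len(image[0])
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(k) for b in range(k))
            for j in range(width - k + 1)
        ]
        for i in range(height - k + 1)
    ]
def apply_detectors(image, kernels):
    """One output image per feature detector."""
    return [feature_map(image, kernel) for kernel in kernels]
image = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]  # a vertical edge on the right
vertical = [[-1, 1], [-1, 1]]
horizontal = [[1, 1], [-1, -1]]
maps = apply_detectors(image, [vertical, horizontal])
print(len(maps))  # 2 detectors give 2 output images
print(maps[0])    # [[0, 2], [0, 2]]: the vertical edge is detected
print(maps[1])    # [[0, 0], [0, 0]]: no horizontal edge
```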

02:25.988 --> 02:28.830
So now the question is how many features

02:28.830 --> 02:30.210
do we want to detect?

02:30.210 --> 02:32.460
Well, a common practice is to start

02:32.460 --> 02:35.040
with 32 feature detectors.

02:35.040 --> 02:39.660
And so that will lead us to 32 processed images

02:39.660 --> 02:41.460
in this first convolutional layer.

02:41.460 --> 02:45.537
So the input is one real black and white image,

00:45.537 --> 00:46.920
and the output,

00:46.920 --> 00:51.450
that is, the first convolutional layer, is 32 processed images.

02:51.450 --> 02:53.370
And by processed, I mean of course,

02:53.370 --> 02:55.080
that the convolution was applied

02:55.080 --> 02:58.230
to the input image to get 32 new images

02:58.230 --> 03:00.120
with detected features.

03:00.120 --> 03:03.360
And then we need to specify a kernel_size,

03:03.360 --> 03:04.680
which is nothing other than

03:04.680 --> 03:06.660
the dimensions of this square

03:06.660 --> 03:09.600
that will go through the original image.

03:09.600 --> 03:12.180
And in common practice we use either

03:12.180 --> 03:15.660
two by two, or three by three, or five by five.

03:15.660 --> 03:17.137
And for the first one

03:17.137 --> 03:21.090
we're gonna use a five by five feature detector.

03:21.090 --> 03:22.800
That is a feature detector that will have

03:22.800 --> 03:24.780
five by five dimensions.

03:24.780 --> 03:26.460
And then we will reduce the size

03:26.460 --> 03:29.310
of this kernel for the next convolution layers.

03:29.310 --> 03:30.750
And speaking of it, this is

03:30.750 --> 03:32.370
exactly what we're gonna do now.

03:32.370 --> 03:34.556
We are going to copy this to

03:34.556 --> 03:37.253
define the second convolution,

03:37.253 --> 03:40.140
and therefore I'm pasting that here,

03:40.140 --> 03:42.180
and now it's fun and very easy.

03:42.180 --> 03:43.470
It's like a domino.

03:43.470 --> 03:44.370
The number of input channels

03:44.370 --> 03:47.910
of the second convolution is the number of output channels

03:47.910 --> 03:49.920
of the first convolution.

03:49.920 --> 03:52.080
So this number of outputs 32 here,

03:52.080 --> 03:55.440
is the same number of inputs, 32 here.

03:55.440 --> 03:57.660
And that's because we have 32 images

03:57.660 --> 04:01.410
in the input convolution layer of the second convolution.
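
NOTE
The domino rule can be checked mechanically. A minimal sketch, assuming each convolution is described by a hypothetical (in_channels, out_channels, kernel_size) tuple, with a kernel size of 3 assumed for the second convolution:
```python
def check_chain(specs, image_channels=1):
    """Verify that each in_channels equals the previous out_channels."""
    expected = image_channels
    for index, (in_channels, out_channels, _kernel_size) in enumerate(specs, start=1):
        if in_channels != expected:
            raise ValueError(f"convolution {index}: in_channels={in_channels}, expected {expected}")
        expected = out_channels
    return expected  # number of images produced by the last convolution
print(check_chain([(1, 32, 5), (32, 32, 3)]))  # 32
```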

04:01.410 --> 04:04.170
And so the second convolution is applied

04:04.170 --> 04:07.620
to this first convolutional layer,

04:07.620 --> 04:10.500
to return a second convolutional layer.

04:10.500 --> 04:13.410
And so now the question is how many new images do we want?

04:13.410 --> 04:15.990
Same as before, let's create 32 new images.

04:15.990 --> 04:18.360
32 is actually a very common number

04:18.360 --> 04:19.830
in convolutional neural networks.

04:19.830 --> 04:21.540
If you look at the architectures

04:21.540 --> 04:23.970
you will find 32 in many of them.

04:23.970 --> 04:26.160
And then for the kernel size, well

04:26.160 --> 04:28.050
we need to reduce the kernel size.

04:28.050 --> 04:30.750
That is the dimensions of our feature detector.

04:30.750 --> 04:32.817
And so now we're gonna go from five to

04:32.817 --> 04:35.910
either four or even three,

04:35.910 --> 04:37.890
and then we'll go even smaller.

04:37.890 --> 04:40.800
All right, so our second convolution is ready.

04:40.800 --> 04:43.860
It takes as input 32 processed images,

04:43.860 --> 04:46.770
each one detecting a first feature

04:46.770 --> 04:48.600
of the original input image

04:48.600 --> 04:51.240
and it creates 32 new images,

04:51.240 --> 04:55.110
using a feature detector of reduced dimensions.

04:55.110 --> 04:57.300
And so now let's push this even more.

04:57.300 --> 04:59.007
So I'm copying this,

04:59.007 --> 05:01.110
and pasting that here to

05:01.110 --> 05:05.460
create a third convolution to detect some features.

05:05.460 --> 05:08.010
And now it's the same: in_channels

05:08.010 --> 05:11.340
here is the number of input images at the left

05:11.340 --> 05:13.710
of the convolution connection, and that is the number

05:13.710 --> 05:15.719
of processed images that was at the right

05:15.719 --> 05:18.314
of the previous convolution connection.

05:18.314 --> 05:20.040
So that's 32, therefore we keep 32 here.

05:20.040 --> 05:21.180
That's perfect.

05:21.180 --> 05:23.910
And now the question is again, how many new images

05:23.910 --> 05:25.290
do we want to create?

05:25.290 --> 05:27.567
We are gonna take now 64,

05:27.567 --> 05:31.230
and therefore 64 output processed images.

05:31.230 --> 05:34.830
And of course now we take a smaller kernel size,

05:34.830 --> 05:36.630
and we're gonna take two.

05:36.630 --> 05:39.510
And so that's a very classic architecture

05:39.510 --> 05:42.480
for the convolutional layers, and it's very effective at

04:42.480 --> 04:45.856
achieving a high level of feature detection in images.
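
NOTE
To see what this stack costs, the weight count of a convolution is out_channels × in_channels × kernel_size × kernel_size, plus one bias per output channel (assuming the default groups=1 and bias=True). A small sketch, assuming kernel sizes 5, 3 and 2 for the three convolutions; conv_params is a hypothetical name:
```python
def conv_params(in_channels, out_channels, kernel_size, bias=True):
    """Number of learnable parameters in a square-kernel Conv2d with groups=1."""
    weights = out_channels * in_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)
conv_specs = [(1, 32, 5), (32, 32, 3), (32, 64, 2)]
counts = [conv_params(*spec) for spec in conv_specs]
print(counts)       # [832, 9248, 8256]
print(sum(counts))  # 18336
```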

05:45.856 --> 05:48.930
All right, and so, now that we have our

05:48.930 --> 05:51.960
three convolution layers, thanks to our three

05:51.960 --> 05:53.610
convolution connections here,

05:53.610 --> 05:56.790
well now it's time to make our two full connections, which,

05:56.790 --> 06:00.300
as a reminder, will take the huge vector that we obtain

06:00.300 --> 06:05.010
after flattening all the 64 processed images

06:05.010 --> 06:08.130
that we got from all these convolutions.

06:08.130 --> 06:10.890
So we flatten all the pixels of these images

06:10.890 --> 06:14.190
and we get one huge vector that will become the input

06:14.190 --> 06:16.800
of a new, fully connected neural network.
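
NOTE
Flattening simply lays every pixel of every processed image end to end. A minimal pure-Python sketch, assuming the images are stored as nested lists [channels][height][width]. The order (channel, then row, then column) matches the row-major order that Tensor.flatten uses in PyTorch.
```python
def flatten(feature_maps):
    """Turn [channels][height][width] nested lists into one flat vector."""
    return [value for channel in feature_maps for row in channel for value in row]
# Two 3x4 processed images, each pixel tagged with 100*channel + 10*row + column.
maps = [[[100 * c + 10 * r + w for w in range(4)] for r in range(3)] for c in range(2)]
vector = flatten(maps)
print(len(vector))  # 24 = 2 * 3 * 4
print(vector[12])   # 100: the first pixel of the second image
```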

06:16.800 --> 06:19.320
And so that's when we have to make these full connections

06:19.320 --> 06:22.440
between first this huge vector and a hidden layer,

06:22.440 --> 06:25.620
and then a second full connection between the hidden layer

06:25.620 --> 06:28.320
and the output layer composed of the output neurons

06:28.320 --> 06:31.950
each one corresponding to a Q value of the possible actions.

06:31.950 --> 06:33.930
So let's make these two full connections.

06:33.930 --> 06:35.190
You know how to do that.

06:35.190 --> 06:37.530
That's exactly what we did for the self-driving car.

06:37.530 --> 06:38.970
So let's do that again.

06:38.970 --> 06:41.472
Well, first we take our nn module,

06:41.472 --> 06:45.660
then we take the Linear class because, again,

06:45.660 --> 06:47.730
the full connection we create is an object

06:47.730 --> 06:49.260
of the Linear class.

06:49.260 --> 06:51.960
And then in parentheses, well, it's the same.

06:51.960 --> 06:54.180
First, we input in_features,

06:54.180 --> 06:58.830
that is, the number of input features, then out_features.
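
NOTE
Each full connection computes y = Wx + b, where nn.Linear stores W with shape (out_features, in_features). So a layer has in_features × out_features weights plus out_features biases. A pure-Python sketch of that computation, with a hypothetical function name and toy numbers:
```python
def linear(x, weight, bias):
    """weight is a list of out_features rows, each of length in_features."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]
x = [1, 2, 3]                    # in_features = 3
weight = [[1, 0, 0], [0, 1, 1]]  # out_features = 2
bias = [0.5, -1]
print(linear(x, weight, bias))   # [1.5, 4]
```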

06:58.830 --> 07:02.100
So the in_features for the first full connection,

07:02.100 --> 07:03.330
what is it going to be?

07:03.330 --> 07:05.520
Well, that's going to be equal to the number

07:05.520 --> 07:09.390
of pixels there are in this huge vector obtained

07:09.390 --> 07:11.940
after flattening all the processed images

07:11.940 --> 07:13.830
after the three convolutions.

07:13.830 --> 07:15.180
And so what is this number?

07:15.180 --> 07:17.340
Well, actually, there is a trick here.

07:17.340 --> 07:19.650
This number is actually hard to get.

07:19.650 --> 07:22.920
We actually need to make a function to compute that number.

07:22.920 --> 07:25.560
We don't have a variable that will get us this number.

07:25.560 --> 07:26.940
We have to compute it.
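
NOTE
One way to compute the missing number analytically is to follow the image size through the three convolutions and multiply channels × height × width at the end. A minimal sketch, assuming 80x80 grayscale input frames, the default stride, padding and dilation, kernel sizes 5, 3 and 2, and no pooling between convolutions. count_flat_features is a hypothetical name. Another common approach is to pass a dummy input through the layers and count the elements of the result.
```python
def count_flat_features(height, width, conv_specs):
    """Length of the flattened vector after a stack of stride-1, unpadded convolutions."""
    channels = conv_specs[0][0]
    for _in_channels, out_channels, kernel_size in conv_specs:
        height = height - kernel_size + 1
        width = width - kernel_size + 1
        channels = out_channels
    return channels * height * width
number_neurons = count_flat_features(80, 80, [(1, 32, 5), (32, 32, 3), (32, 64, 2)])
print(number_neurons)  # 341056 = 64 * 73 * 73
```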

07:26.940 --> 07:29.100
And therefore, let's see what we're gonna do now.

07:29.100 --> 07:31.230
It's very important here to understand

07:31.230 --> 07:34.020
the programming mindset we need to have,

07:34.020 --> 07:36.960
and I'm trying to bring you that mindset, that is,

07:36.960 --> 07:38.850
what you should be thinking right now

07:38.850 --> 07:41.130
to overcome this obstacle.

07:41.130 --> 07:43.170
Because at first you might say, hey,

07:43.170 --> 07:45.720
I don't have this number of neurons in the flattened vector.

07:45.720 --> 07:46.620
What should I do?

07:46.620 --> 07:47.760
I'm stuck here.

07:47.760 --> 07:49.080
Well, no, actually

07:49.080 --> 07:52.950
because what you can do now is simply input any

07:52.950 --> 07:56.250
name here that will represent this number of neurons.

07:56.250 --> 07:59.610
So I'm calling it number_neurons, for number of neurons.

07:59.610 --> 08:02.551
And then we will simply make a function that will store

08:02.551 --> 08:05.160
in this number_neurons variable

08:05.160 --> 08:07.320
the number of pixels we're looking for.

08:07.320 --> 08:08.880
So we can totally do that.

08:08.880 --> 08:10.620
We can totally put this variable here.

08:10.620 --> 08:13.020
Well, of course we'll get a warning

08:13.020 --> 08:14.160
because it doesn't exist yet,

08:14.160 --> 08:17.310
but we will create it afterwards with a function

08:17.310 --> 08:19.470
and we are totally allowed to do that even

08:19.470 --> 08:21.120
if the function comes afterwards.

08:21.120 --> 08:23.370
So that's typical programming thinking

08:23.370 --> 08:25.980
you must have when you get that kind of obstacle.

08:25.980 --> 08:28.863
Well, you can make a function to get what you're missing.
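
NOTE
This works in Python because a method body is only executed when it is called, so __init__ may call a method that is defined further down in the class. A plain variable, on the other hand, must exist by the time __init__ runs, or Python raises NameError. A sketch with a hypothetical count_neurons method and made-up sizes:
```python
class Network:
    def __init__(self):
        # count_neurons appears later in the class body; the lookup happens at call time.
        self.number_neurons = self.count_neurons(4, 4)
    def count_neurons(self, height, width):
        """Hypothetical helper returning the number of pixels in a height x width image."""
        return height * width
print(Network().number_neurons)  # 16
```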

08:29.730 --> 08:31.890
All right, and then out_features,

08:31.890 --> 08:34.170
and out_features is the number of neurons

08:34.170 --> 08:37.830
in the hidden layer, and this time it's up to you.

08:37.830 --> 08:39.270
That depends on the architecture

08:39.270 --> 08:41.220
of the neural network you want to create.

08:41.220 --> 08:44.220
A good choice is a number that isn't too small.

08:44.220 --> 08:46.980
So for example, 40 neurons might be fine.

08:46.980 --> 08:48.750
We could try to increase it:

08:48.750 --> 08:51.300
if training isn't too slow, a larger number is worth trying.

08:51.300 --> 08:53.760
Maybe that will improve the predictions.

08:53.760 --> 08:54.900
But let's start with 40,

08:54.900 --> 08:57.120
maybe we'll increase that afterwards.

08:57.120 --> 09:00.060
All right, so that's it for the first full connection.

09:00.060 --> 09:02.283
Then we'll copy this,

09:03.300 --> 09:05.370
paste that here for the second full connection.

09:05.370 --> 09:06.203
That is the connection

09:06.203 --> 09:09.330
between the hidden layer and the output layer.

09:09.330 --> 09:13.050
And so the in_features here becomes the out_features

09:13.050 --> 09:15.750
of the previous layer, and that is 40.

09:15.750 --> 09:18.150
So here we input 40.

09:18.150 --> 09:20.850
That's of course the number of neurons in the hidden layer.

09:20.850 --> 09:24.060
And out_features here is going to be equal to the number

09:24.060 --> 09:27.300
of output neurons there should be in our neural network.

09:27.300 --> 09:29.490
And since each output neuron corresponds

09:29.490 --> 09:30.780
to one Q value,

09:30.780 --> 09:33.180
and one Q value corresponds to one action,

09:33.180 --> 09:35.640
well, the number of output neurons here is of course

09:35.640 --> 09:36.930
the number of actions.

09:36.930 --> 09:39.990
And we have one variable for this, which is number_actions.

09:39.990 --> 09:44.813
And therefore here we input number_actions.
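
NOTE
Since the network outputs one Q-value per action, a purely greedy agent would simply pick the action whose output neuron has the largest Q-value. A minimal sketch with made-up Q-values and a hypothetical function name. How actions are actually selected (for example, softmax sampling to keep exploring) is a separate design choice.
```python
def best_action(q_values):
    """Index of the output neuron (action) with the highest Q-value."""
    return max(range(len(q_values)), key=lambda action: q_values[action])
print(best_action([0.1, 2.5, -1.0]))  # 1
```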

09:44.813 --> 09:46.080
And there we go.

09:46.080 --> 09:47.370
Congratulations.

09:47.370 --> 09:49.410
We defined the architecture

09:49.410 --> 09:50.820
of our neural network.

09:50.820 --> 09:52.830
Our neural network is composed

09:52.830 --> 09:56.100
of three convolutional layers and one hidden layer.

09:56.100 --> 09:58.680
All this in one big CNN.

09:58.680 --> 10:01.380
And this CNN will detect the features in the game

10:01.380 --> 10:03.960
so that the AI will know what it has to do

10:03.960 --> 10:06.870
where it has to go, and where it needs to shoot.
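
NOTE
Putting the pieces together, here is a minimal PyTorch sketch of the architecture described above. Several details are assumptions: the attribute names, kernel size 3 for the second convolution, and number_neurons passed in as an argument (the lecture plans to compute it with a separate function). The forward pass, with ReLU activations and no pooling, is also assumed, since this lecture does not define it.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
class CNN(nn.Module):
    def __init__(self, number_actions, number_neurons):
        super().__init__()
        # Three convolutions: 1 grayscale channel, then 32, 32 and 64 feature maps
        self.convolution1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=5)
        self.convolution2 = nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3)
        self.convolution3 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=2)
        # Two full connections: flattened vector to 40 hidden neurons, then one Q-value per action
        self.fc1 = nn.Linear(in_features=number_neurons, out_features=40)
        self.fc2 = nn.Linear(in_features=40, out_features=number_actions)
    def forward(self, x):
        x = F.relu(self.convolution1(x))
        x = F.relu(self.convolution2(x))
        x = F.relu(self.convolution3(x))
        x = x.flatten(start_dim=1)  # keep the batch dimension, flatten channels x height x width
        x = F.relu(self.fc1(x))
        return self.fc2(x)  # raw Q-values, one per action
# 20x20 input: 20 - 4 = 16, 16 - 2 = 14, 14 - 1 = 13, so 64 * 13 * 13 flattened values
net = CNN(number_actions=5, number_neurons=64 * 13 * 13)
q_values = net(torch.zeros(2, 1, 20, 20))  # a batch of 2 blank single-channel frames
print(q_values.shape)  # torch.Size([2, 5])
```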

10:06.870 --> 10:08.310
So there we go for this step,

10:08.310 --> 10:10.650
that's a first very important step done.

10:10.650 --> 10:13.290
Now we're gonna move on to the next step, which is of course

10:13.290 --> 10:16.980
to get this number of neurons that is still missing.

10:16.980 --> 10:18.900
That's actually why we have the warning here:

10:18.900 --> 10:21.930
undefined name number_neurons. But no worries.

10:21.930 --> 10:24.600
Now we will make a function that will return the number

10:24.600 --> 10:27.450
of neurons in this huge vector, and we will put that number

10:27.450 --> 10:30.240
in a variable that we'll call number_neurons.

10:30.240 --> 10:32.070
So let's do this in the next tutorial.

10:32.070 --> 10:33.180
That's our next step.

10:33.180 --> 10:34.893
And until then, enjoy AI.