WEBVTT

00:00.210 --> 00:02.940
-: Hello and welcome to this Python tutorial.

00:02.940 --> 00:05.460
In this tutorial, we're gonna make the first step

00:05.460 --> 00:08.220
into implementing the Deep Q-Learning model.

00:08.220 --> 00:10.380
So basically, we're about to implement

00:10.380 --> 00:13.800
the whole process of the Deep Q-Learning algorithm.

00:13.800 --> 00:16.590
And so we are gonna use what we created before

00:16.590 --> 00:18.990
that is the architecture of the neural network

00:18.990 --> 00:20.130
and the replay memory,

00:20.130 --> 00:24.120
to integrate this into the whole Deep Q-Learning process.

00:24.120 --> 00:26.070
And this whole Deep Q-Learning algorithm

00:26.070 --> 00:28.290
is going to fit into one class.

00:28.290 --> 00:30.000
So, that's the last class we're making

00:30.000 --> 00:32.310
to implement artificial intelligence.

00:32.310 --> 00:35.700
And this class will just contain different functions.

00:35.700 --> 00:37.260
So, we will have the init function

00:37.260 --> 00:41.460
which will create and initialize all the variables attached

00:41.460 --> 00:43.740
to our future Deep Q-Learning objects

00:43.740 --> 00:46.560
which will represent the Deep Q-Learning model itself.

00:46.560 --> 00:48.480
And then, we'll have some other functions.

00:48.480 --> 00:49.680
One of them will, of course,

00:49.680 --> 00:52.890
be to select the right action at each time.

00:52.890 --> 00:55.230
We will also have an update function,

00:55.230 --> 00:57.480
a score function to get the score

00:57.480 --> 00:59.820
and have an idea of how the learning is going,

00:59.820 --> 01:01.050
if it's going well,

01:01.050 --> 01:03.270
if the exploration is going well

01:03.270 --> 01:05.610
and if it can move on to exploitation.

01:05.610 --> 01:07.950
And then we'll have a save function to save the model,

01:07.950 --> 01:10.410
that is to save the brain of the car,

01:10.410 --> 01:12.450
and then eventually, a load function.

01:12.450 --> 01:14.610
So, we have a couple of functions to make.

01:14.610 --> 01:17.490
We're gonna make one function for each tutorial.

01:17.490 --> 01:20.970
And today, we're gonna start with the init function

01:20.970 --> 01:23.010
as usual when we're making a class.

01:23.010 --> 01:26.820
But first, let's not forget to introduce the class.

01:26.820 --> 01:29.850
So, we're gonna call it DQN for Deep Q-Network,

01:33.480 --> 01:36.000
then some parentheses, colon,

01:36.000 --> 01:39.480
and there we go with our first function.

01:39.480 --> 01:41.610
So, let's do this.

01:41.610 --> 01:45.240
Def then double underscore, init,

01:45.240 --> 01:48.150
double underscore again and parentheses.

01:48.150 --> 01:50.490
So as you understood in this init function,

01:50.490 --> 01:52.140
we are going to introduce the variables

01:52.140 --> 01:53.550
attached to our objects.

01:53.550 --> 01:56.820
So, we're gonna have a couple of lines all starting with self.

01:56.820 --> 01:59.310
And we will basically create and initialize

01:59.310 --> 02:01.260
all the variables that are needed

02:01.260 --> 02:03.510
to implement a Deep Q-Network.

02:03.510 --> 02:07.110
So we will, for example, create an object of our network

02:07.110 --> 02:10.290
because of course, we need our deep neural network.

02:10.290 --> 02:12.300
Then we will need our memory,

02:12.300 --> 02:14.730
we will create another variable for the memory.

02:14.730 --> 02:17.850
So we'll have another variable, self dot memory.

02:17.850 --> 02:20.460
But then that's not all, we will have to create as well

02:20.460 --> 02:22.470
some variables for the last state,

02:22.470 --> 02:24.930
the last action and the last reward.

02:24.930 --> 02:27.570
That's of course, you know, the variables

02:27.570 --> 02:30.870
that you see in the Deep Q-Learning algorithm.

02:30.870 --> 02:31.920
And then, what else?

02:31.920 --> 02:34.710
Well, we will also need an optimizer, you know,

02:34.710 --> 02:37.110
to perform stochastic gradient descent

02:37.110 --> 02:38.490
to update the weights

02:38.490 --> 02:42.300
according to how much they will contribute to the error

02:42.300 --> 02:44.490
when the AI is making a mistake.

02:44.490 --> 02:46.110
And then, I think that's all.

02:46.110 --> 02:48.150
That's basically, the variables

02:48.150 --> 02:50.490
we now need to create and initialize.

02:50.490 --> 02:52.140
But in this init function,

02:52.140 --> 02:54.480
we will input a couple of arguments.

02:54.480 --> 02:56.880
First, as usual, self,

02:56.880 --> 03:00.180
which is the argument referring to our object,

03:00.180 --> 03:02.850
then since, you know, we're going to create

03:02.850 --> 03:05.370
an object of the network class,

03:05.370 --> 03:07.470
well, since the network class

03:07.470 --> 03:10.140
takes as argument in the init function,

03:10.140 --> 03:12.240
input_size and nb_action.

03:12.240 --> 03:13.830
Well, that's the same here,

03:13.830 --> 03:15.750
when creating an object of the network class,

03:15.750 --> 03:18.210
we will need to choose an input size argument

03:18.210 --> 03:20.160
and an nb_action argument.

03:20.160 --> 03:22.300
Therefore, we can just copy them

03:23.820 --> 03:27.690
and paste them here and here we go.

03:27.690 --> 03:30.570
So, these arguments will now become

03:30.570 --> 03:33.150
also some arguments of the DQN class.

03:33.150 --> 03:36.660
Whenever we create some future objects of the DQN class

03:36.660 --> 03:39.390
that is some future Deep Q-Learning models,

03:39.390 --> 03:41.850
well, we will need to specify the input size

03:41.850 --> 03:44.910
which I remind is the number of dimensions in the vectors

03:44.910 --> 03:47.880
that are encoding your states, your input states,

03:47.880 --> 03:49.620
and the number of actions

03:49.620 --> 03:53.130
which is the number of possible actions the car can make.

03:53.130 --> 03:55.890
So I remind these are, either go left,

03:55.890 --> 03:58.170
go straight or go right.

03:58.170 --> 03:59.220
Okay, perfect.

03:59.220 --> 04:00.900
Then you know, we will be creating

04:00.900 --> 04:03.030
an object of the replay memory class

04:03.030 --> 04:04.680
to create the memory object

04:04.680 --> 04:07.530
to get our memory of the transitions.

04:07.530 --> 04:10.830
And in the init function, we have the capacity argument.

04:10.830 --> 04:13.140
But since we will only be using it once,

04:13.140 --> 04:14.670
actually, when we create the memory

04:14.670 --> 04:15.990
and not anywhere after,

04:15.990 --> 04:20.010
well, we won't need to specify a capacity argument.

04:20.010 --> 04:21.240
We could do this,

04:21.240 --> 04:24.000
but we will directly input the number of transitions

04:24.000 --> 04:26.160
we want our memory to have.

04:26.160 --> 04:28.770
But then, we need one last argument

04:28.770 --> 04:32.430
which is the gamma parameter in the Deep Q-Learning model.

04:32.430 --> 04:36.360
Remember, this gamma parameter is the delay coefficient.

04:36.360 --> 04:38.190
That's a parameter of the equation

04:38.190 --> 04:40.140
and therefore we will put it here

04:40.140 --> 04:43.140
because we will be using it afterwards several times.

04:43.140 --> 04:45.030
So, let's put it here.

04:45.030 --> 04:47.130
We are gonna call it gamma

04:47.130 --> 04:49.590
so far, that is just the name of the argument.

04:49.590 --> 04:52.470
And there we go, that's all the arguments we will need

04:52.470 --> 04:53.730
for this init function.

04:53.730 --> 04:54.780
So that means,

04:54.780 --> 04:58.140
that whenever we create our Deep Q-Learning model,

04:58.140 --> 05:01.470
that is whenever we create an object of the DQN class

05:01.470 --> 05:03.810
well, we will need to specify as argument,

05:03.810 --> 05:06.090
the input size, the number of actions

05:06.090 --> 05:08.190
and the gamma parameter

05:08.190 --> 05:11.160
and we'll input the real values for them soon.
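
NOTE
A minimal sketch of the class header and constructor signature described so far.
The argument names input_size, nb_action and gamma follow the narration; storing
input_size and nb_action as attributes and the gamma value of 0.9 are assumptions
made only so the sketch can be run and inspected, not the course's final code.
```python
class DQN:
    """Skeleton of the Deep Q-Network class (illustrative sketch only)."""
    def __init__(self, input_size, nb_action, gamma):
        # Stored here only for inspection; the real body is built step by step later.
        self.input_size = input_size
        self.nb_action = nb_action
        self.gamma = gamma
# 5 state dimensions and 3 actions come from the narration; gamma=0.9 is an assumed value.
brain = DQN(input_size=5, nb_action=3, gamma=0.9)
print(brain.input_size, brain.nb_action, brain.gamma)
```
Every future DQN object has to be given these three values when it is created.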

05:11.160 --> 05:14.190
All right, so now let's go inside the init function.

05:14.190 --> 05:16.500
Okay, so now basically this is going to be easy.

05:16.500 --> 05:19.410
We are just about to create and initialize

05:19.410 --> 05:21.150
all the variables that we'll need.

05:21.150 --> 05:22.860
And so, let's start with the first one

05:22.860 --> 05:26.310
let's start with gamma, actually, the delay coefficient.

05:26.310 --> 05:28.410
So since this is the variable

05:28.410 --> 05:30.360
we want to be attached to our object,

05:30.360 --> 05:31.860
we'll start with self

05:31.860 --> 05:36.090
so gamma is going to be a variable of our DQN model.

05:36.090 --> 05:40.500
So self dot gamma equals the argument that will be input

05:40.500 --> 05:42.930
when creating an object of the DQN class.

05:42.930 --> 05:46.920
So gamma, and there we go with the second variable.

05:46.920 --> 05:50.820
So the second variable is going to be the reward window.

05:50.820 --> 05:52.500
So what is this reward window?

05:52.500 --> 05:54.660
Well, that's gonna be the sliding window

05:54.660 --> 05:57.150
of the mean of the last 100 rewards,

05:57.150 --> 05:58.890
which we'll use just to evaluate

05:58.890 --> 06:01.080
the evolution of the AI performance.

06:01.080 --> 06:03.450
You know, we'll have the mean of the reward

06:03.450 --> 06:06.420
into this reward window that will slide over time

06:06.420 --> 06:08.880
and what we want to observe is a mean

06:08.880 --> 06:11.910
of the last 100 rewards increasing with time.

06:11.910 --> 06:13.200
So let's initialize it

06:13.200 --> 06:18.200
with self dot reward underscore window.

06:18.690 --> 06:21.600
And so since this is going to be a sliding window

06:21.600 --> 06:24.810
of the evolving mean of the last 100 rewards,

06:24.810 --> 06:28.590
well, we are going to initialize it as an empty list

06:28.590 --> 06:31.773
and then we will append the mean of the rewards over time.
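
NOTE
One possible way to keep such a sliding window, as a sketch. It assumes the raw
rewards are stored and the mean is computed on demand, which is a variant of what
is narrated. The helper names push_reward and score and the max_size parameter are
hypothetical and do not come from the tutorial.
```python
def push_reward(reward_window, reward, max_size=100):
    """Append a reward and keep only the most recent max_size entries."""
    reward_window.append(reward)
    if len(reward_window) > max_size:
        del reward_window[0]
def score(reward_window):
    """Mean of the rewards currently in the window (0.0 when the window is empty)."""
    return sum(reward_window) / len(reward_window) if reward_window else 0.0
reward_window = []  # initialized as an empty list, as in the narration
# Dummy rewards: 150 small rewards followed by 50 large ones.
for r in [0.1] * 150 + [1.0] * 50:
    push_reward(reward_window, r)
print(len(reward_window), score(reward_window))
```
A rising score over time is the signal that the learning is going well.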

06:32.850 --> 06:33.683
All right.

06:33.683 --> 06:37.530
Then more exciting, let's create our neural network.

06:37.530 --> 06:41.520
So, we're gonna call it self dot model

06:41.520 --> 06:43.650
because basically that's the heart of the model,

06:43.650 --> 06:45.870
so I'm calling it model.

06:45.870 --> 06:48.750
And this model is going to be nothing else

06:48.750 --> 06:51.570
than an object of the network class.

06:51.570 --> 06:53.250
And to create such an object,

06:53.250 --> 06:58.200
we take our class network then parenthesis,

06:58.200 --> 07:01.890
and here, we just input the arguments of the network class.

07:01.890 --> 07:03.840
But we put these arguments

07:03.840 --> 07:05.940
in the arguments of the init function

07:05.940 --> 07:10.140
and therefore, we just need to copy them right here

07:10.140 --> 07:13.170
and just paste them in the network class.

07:13.170 --> 07:15.330
And there we go, with this line of code

07:15.330 --> 07:20.130
we create one neural network for our Deep Q-Learning model.

07:20.130 --> 07:22.860
Perfect. Then, let's create a memory.

07:22.860 --> 07:25.950
So again, we're going to create a new variable

07:25.950 --> 07:28.983
that we call self dot memory.

07:29.910 --> 07:32.790
And again, this is going to be an object

07:32.790 --> 07:34.200
of the replay memory class.

07:34.200 --> 07:36.660
So let's just take the name of our class,

07:36.660 --> 07:40.440
let's copy it, let's paste that here.

07:40.440 --> 07:44.250
And in some parentheses, we need to input the capacity

07:44.250 --> 07:47.490
because the capacity is an argument of the init function

07:47.490 --> 07:50.190
and that's the only argument we need to input here.

07:50.190 --> 07:52.020
So, what capacity are we going to choose?

07:52.020 --> 07:54.660
Remember that corresponds to the number of transitions

07:54.660 --> 07:57.570
the number of events, last state, new state,

07:57.570 --> 07:59.820
last action and last reward.

07:59.820 --> 08:02.910
And so, as mentioned in one of the previous tutorials,

08:02.910 --> 08:04.593
we're gonna take 100,000.

08:06.974 --> 08:10.020
100,000 transitions into memory

08:10.020 --> 08:12.720
and then we will sample from this memory

08:12.720 --> 08:15.210
to get a smaller number of random transitions.

08:15.210 --> 08:18.180
And that's what the model will learn on.
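
NOTE
A rough sketch of a memory with a capacity of 100,000 transitions and random
sampling. ReplayMemorySketch is a hypothetical stand-in for the replay memory class
built in the earlier tutorial, whose actual code is not shown in this transcript.
The batch size of 100 and the dummy transitions are assumptions.
```python
import random
from collections import deque
class ReplayMemorySketch:
    """Hypothetical stand-in for the replay memory class from the earlier tutorial."""
    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once capacity is reached.
        self.memory = deque(maxlen=capacity)
    def push(self, transition):
        self.memory.append(transition)
    def sample(self, batch_size):
        # random.sample needs a sequence, so the deque is copied into a list first.
        return random.sample(list(self.memory), batch_size)
memory = ReplayMemorySketch(100000)
for t in range(100500):
    # (last state, new state, last action, last reward) with dummy values
    memory.push((t, t + 1, 0, 0.0))
batch = memory.sample(100)
print(len(memory.memory), len(batch))
```
Once the capacity is exceeded, the oldest transitions are the ones that get dropped.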

08:18.180 --> 08:20.280
Okay, so now we have our memory.

08:20.280 --> 08:23.430
Perfect. Now let's get our optimizer.

08:23.430 --> 08:24.990
So again, self,

08:24.990 --> 08:28.803
we create a new variable that we call optimizer.

08:29.640 --> 08:33.630
So optimizer is another variable of our future DQN object.

08:33.630 --> 08:35.490
So, self dot optimizer.

08:35.490 --> 08:38.160
And now if we go back up,

08:38.160 --> 08:41.700
you can see that we imported torch dot optim

08:41.700 --> 08:45.120
which is a module of torch that contains all the tools

08:45.120 --> 08:47.100
to perform stochastic gradient descent.

08:47.100 --> 08:49.650
So of course, it contains some optimizers

08:49.650 --> 08:52.620
and we gave it the shortcut optim.

08:52.620 --> 08:55.290
And therefore, here's what we're gonna do

08:55.290 --> 08:59.220
is take the module optim, which is torch dot optim,

08:59.220 --> 09:00.660
and from this module,

09:00.660 --> 09:03.150
we're gonna take one of the optimizers.

09:03.150 --> 09:05.850
So as you can see, they're all listed here.

09:05.850 --> 09:07.500
Many of them are excellent.

09:07.500 --> 09:10.590
For example, RMSprop is an excellent optimizer

09:10.590 --> 09:11.423
that is for example,

09:11.423 --> 09:14.280
highly recommended for recurrent neural networks

09:14.280 --> 09:16.230
or unsupervised deep learning

09:16.230 --> 09:18.990
but the other one that is excellent

09:18.990 --> 09:22.470
and that we will choose is the Adam optimizer,

09:22.470 --> 09:23.303
that's the one.

09:23.303 --> 09:24.510
You'll see that with this one,

09:24.510 --> 09:26.550
we'll get a good self-driving car.

09:26.550 --> 09:29.220
But again, you are totally welcome to try other ones,

09:29.220 --> 09:30.720
you can try the RMSprop

09:30.720 --> 09:32.790
but for our model, we will choose Adam.

09:32.790 --> 09:34.560
So, I'm pressing enter.

09:34.560 --> 09:37.260
And in fact, you notice there is the capital A here,

09:37.260 --> 09:39.480
that's because we are creating an object

09:39.480 --> 09:41.670
of the Adam class, this is a class,

09:41.670 --> 09:45.030
but the object will be an Adam optimizer itself.

09:45.030 --> 09:48.120
But since this is a class, we need to input some arguments,

09:48.120 --> 09:50.160
the arguments of the Adam class.

09:50.160 --> 09:52.530
And the arguments are all the parameters

09:52.530 --> 09:55.350
that can customize your Adam optimizer.

09:55.350 --> 09:57.960
So for example, that's typically the learning rate,

09:57.960 --> 10:00.480
the decay or some other parameters.

10:00.480 --> 10:03.660
And besides taking all the parameters of our model,

10:03.660 --> 10:05.940
we will specify a learning rate.

10:05.940 --> 10:08.670
So, speaking of the parameters of our model,

10:08.670 --> 10:12.840
we can get them with self dot model.

10:12.840 --> 10:14.940
So, that's the model we created here,

10:14.940 --> 10:17.550
self dot model from our network class.

10:17.550 --> 10:19.020
So, self dot model.

10:19.020 --> 10:21.990
And then, to access the parameters of the model,

10:21.990 --> 10:25.470
we add another dot and then parameters

10:25.470 --> 10:28.200
with some parentheses, very simply.

10:28.200 --> 10:31.440
So, that's just to connect the Adam optimizer

10:31.440 --> 10:35.520
to our neural network, the one that we created here.

10:35.520 --> 10:38.010
Okay, and then as we just mentioned,

10:38.010 --> 10:39.900
we're gonna add a learning rate,

10:39.900 --> 10:43.140
and the argument for this is L R.

10:43.140 --> 10:46.020
And we will set it equal to a value

10:46.020 --> 10:49.380
such that the learning doesn't happen too fast.

10:49.380 --> 10:51.780
If we get a learning rate too large,

10:51.780 --> 10:53.850
then the AI won't learn properly.

10:53.850 --> 10:57.300
We want to give our AI some time to explore,

10:57.300 --> 10:58.860
learn from its mistakes.

10:58.860 --> 11:01.380
You know, when we punish it, when it's making some mistakes,

11:01.380 --> 11:05.730
like going onto some sand or getting too close to a wall.

11:05.730 --> 11:08.850
Well, we want to give the AI some time to learn.

11:08.850 --> 11:10.500
We want the weights of the neural network

11:10.500 --> 11:12.150
to update correctly.

11:12.150 --> 11:15.180
And so, a good value for the learning rate

11:15.180 --> 11:18.150
I ended up with, after trying several of them,

11:18.150 --> 11:19.683
is 0.001.

11:21.180 --> 11:24.690
All right, and that's all we need to create an optimizer.

11:24.690 --> 11:28.530
So basically, we are creating an object of the Adam class.
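
NOTE
A short sketch of the optimizer line, assuming PyTorch is installed. The small
nn.Sequential model is only a stand-in for the network class from the earlier
tutorial, and its hidden size of 30 is an assumption. optim.Adam, model.parameters()
and the lr argument are standard PyTorch.
```python
import torch.nn as nn
import torch.optim as optim
# Stand-in network: 5 inputs (state dimensions) and 3 outputs (one Q-value per action).
model = nn.Sequential(nn.Linear(5, 30), nn.ReLU(), nn.Linear(30, 3))
# Connect Adam to the model's weights and set the learning rate from the narration.
optimizer = optim.Adam(model.parameters(), lr=0.001)
print(optimizer.param_groups[0]["lr"])
```
Swapping in another optimizer such as optim.RMSprop only changes the class name here.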

11:28.530 --> 11:32.010
Great, and then, the last three variables we need

11:32.010 --> 11:36.150
are the variables composing our transition events.

11:36.150 --> 11:38.940
So that's the last state, the last action,

11:38.940 --> 11:40.170
and the last reward.

11:40.170 --> 11:42.900
And so, that's basically, what we'll create now

11:42.900 --> 11:45.270
and we will just need to initialize them.

11:45.270 --> 11:47.010
So, let's start with the last state.

11:47.010 --> 11:48.060
The last state,

11:48.060 --> 11:53.060
we're gonna call it self dot last underscore state.

11:53.460 --> 11:56.160
And then, how are we going to initialize it?

11:56.160 --> 11:58.620
Well, remember, the last state

11:58.620 --> 12:01.200
is a vector of five dimensions,

12:01.200 --> 12:04.950
a vector that is encoding one state of the environment.

12:04.950 --> 12:07.110
And as a reminder, these five dimensions

12:07.110 --> 12:09.960
are the three signals of the three sensors,

12:09.960 --> 12:11.580
left, straight and right,

12:11.580 --> 12:15.180
and orientation and minus orientation.

12:15.180 --> 12:18.420
So, this is a vector in the intuitive sense

12:18.420 --> 12:21.210
but for PyTorch, it needs to be more than a vector,

12:21.210 --> 12:23.820
it actually needs to be a torch tensor.

12:23.820 --> 12:26.400
But not only it needs to be a torch tensor

12:26.400 --> 12:29.520
but also it needs to have one more dimension

12:29.520 --> 12:31.650
that I like to call a fake dimension

12:31.650 --> 12:33.600
that corresponds to the batch.

12:33.600 --> 12:35.850
And that's because the last state

12:35.850 --> 12:38.490
will be the input of the neural network.

12:38.490 --> 12:41.040
But when working with neural networks in general,

12:41.040 --> 12:44.430
whether it is with TensorFlow, Keras or PyTorch,

12:44.430 --> 12:48.180
well, the input cannot be a single vector by itself,

12:48.180 --> 12:49.800
it has to be in a batch.

12:49.800 --> 12:54.180
The network can only accept batches of input observations.

12:54.180 --> 12:57.870
And therefore, not only we will create a tensor

12:57.870 --> 13:00.450
for our input state vectors

13:00.450 --> 13:03.090
but also we will create this fake dimension

13:03.090 --> 13:05.160
corresponding to the batch.

13:05.160 --> 13:06.180
So let's do this

13:06.180 --> 13:09.720
and let's start by initializing a torch tensor.

13:09.720 --> 13:12.480
So to do this, there is nothing more simple.

13:12.480 --> 13:15.570
We take our torch library,

13:15.570 --> 13:20.490
then dot and then we're gonna use the tensor class

13:20.490 --> 13:22.830
because as you might have guessed,

13:22.830 --> 13:25.800
this will create an object of the tensor class

13:25.800 --> 13:28.140
that is a tensor object.

13:28.140 --> 13:31.050
And in this tensor class, we need to input one argument

13:31.050 --> 13:34.620
which will specify the size of your tensor.

13:34.620 --> 13:37.170
You can picture a tensor like an array,

13:37.170 --> 13:38.880
having one single type.

13:38.880 --> 13:41.370
But basically what this will represent now

13:41.370 --> 13:44.280
is of course, this input state,

13:44.280 --> 13:46.170
which you can see as a vector.

13:46.170 --> 13:48.120
And so, to specify the number of elements

13:48.120 --> 13:49.530
this tensor must have,

13:49.530 --> 13:52.020
well, we need to use of course the input size

13:52.020 --> 13:54.390
because the input size is exactly the number

13:54.390 --> 13:57.510
of dimensions of our input state vectors.

13:57.510 --> 13:59.160
Now, I should say tensors.

13:59.160 --> 14:03.870
And so, what we simply need to input in our tensor class

14:03.870 --> 14:07.500
to create a tensor object, well, that's input size.

14:07.500 --> 14:10.383
And later on, input size will be equal to five.

14:11.340 --> 14:12.420
All right, so that's good.

14:12.420 --> 14:13.710
That's the first thing done.

14:13.710 --> 14:17.520
We just initialized a tensor as it should be.

14:17.520 --> 14:20.040
But then remember, we need to do another thing,

14:20.040 --> 14:22.500
we need to create that fake dimension

14:22.500 --> 14:26.190
because this is what the network will expect for its inputs.

14:26.190 --> 14:29.010
And to create this one fake dimension,

14:29.010 --> 14:32.220
which by the way, has to be the first dimension,

14:32.220 --> 14:34.620
you know, the fake dimension corresponding to the batch

14:34.620 --> 14:38.220
will be the first dimension of this last state variable.

14:38.220 --> 14:42.957
Well, to do this, we simply need to add dot then unsqueeze.

14:44.190 --> 14:46.020
And then, in some parentheses,

14:46.020 --> 14:49.590
we need to input the index of this fake dimension.

14:49.590 --> 14:51.690
And as I just said, this fake dimension

14:51.690 --> 14:54.840
has to be the first dimension of the last state.

14:54.840 --> 14:57.300
And since indexes in Python start at zero,

14:57.300 --> 15:01.560
we need to input zero, so that this new fake dimension

15:01.560 --> 15:03.540
is becoming the first dimension.

15:03.540 --> 15:06.660
So, we have a first dimension corresponding to the batch

15:06.660 --> 15:10.140
and then the dimension corresponding to the tensor

15:10.140 --> 15:13.410
which will contain the five elements of your input states,

15:13.410 --> 15:17.340
these three signals, orientation and minus orientation.

15:17.340 --> 15:18.270
And there we go,

15:18.270 --> 15:21.960
we initialized our input states properly.
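
NOTE
A small sketch of the last state initialization, assuming PyTorch is installed.
The intermediate variable name state is introduced only for illustration. Note that
torch.Tensor(input_size) leaves the values uninitialized; torch.zeros(input_size) is
an alternative when defined starting values are wanted.
```python
import torch
input_size = 5  # three sensor signals, orientation and minus orientation
# Allocate a 1-D float tensor with input_size elements (values are uninitialized).
state = torch.Tensor(input_size)
# unsqueeze(0) inserts the batch ("fake") dimension at index 0.
last_state = state.unsqueeze(0)
print(state.shape, last_state.shape)
```
The result has one row (the batch of size one) and five columns (the state itself).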

15:21.960 --> 15:23.280
Perfect.

15:23.280 --> 15:27.600
And then two variables to go and that's gonna be much easier

15:27.600 --> 15:32.130
because the next variable is the last action.

15:32.130 --> 15:35.220
That's a new variable we're creating for our object,

15:35.220 --> 15:36.510
last action.

15:36.510 --> 15:40.260
And remember, in the first tutorial of this section,

15:40.260 --> 15:42.720
I told you that the actions

15:42.720 --> 15:45.390
are gonna be either zero, one or two.

15:45.390 --> 15:48.990
And then, using the action to rotation vector,

15:48.990 --> 15:52.140
we will convert these indexes of these actions

15:52.140 --> 15:54.180
into the angles of the rotation

15:54.180 --> 15:58.020
which I remind are zero, 20 or minus 20.

15:58.020 --> 16:01.170
We can actually, refresh our memory with that.

16:01.170 --> 16:04.800
Well, it is exactly here, action to rotation.

16:04.800 --> 16:06.360
If the action is zero,

16:06.360 --> 16:10.170
well, this will correspond to the first index here, so zero.

16:10.170 --> 16:11.670
If the action is one,

16:11.670 --> 16:14.310
this will correspond to index one of this vector,

16:14.310 --> 16:15.660
so 20 degrees.

16:15.660 --> 16:19.200
And if the action is two, we will get minus 20 degrees.

16:19.200 --> 16:21.600
That's gonna be the rotation angle of our car

16:21.600 --> 16:23.490
when we play the action.
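
NOTE
The mapping from action index to rotation angle can be sketched as a plain list
lookup. The values 0, 20 and -20 come from the narration; the spelling
action2rotation is an assumed name for the "action to rotation" vector.
```python
# Index 0 maps to 0 degrees, index 1 to 20 degrees, index 2 to -20 degrees.
action2rotation = [0, 20, -20]
for action in range(len(action2rotation)):
    rotation = action2rotation[action]
    print(action, rotation)
```
Because the action is only an index into this list, a plain number is enough to store it.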

16:23.490 --> 16:24.330
All right?

16:24.330 --> 16:27.900
And therefore, since the action is going to be either zero,

16:27.900 --> 16:31.830
one or two, well the action is therefore a simple number.

16:31.830 --> 16:35.640
And so very simply, we can initialize it to zero.

16:35.640 --> 16:38.340
We don't need to create any tensor here or anything else,

16:38.340 --> 16:41.400
we just need to initialize it with zero.

16:41.400 --> 16:44.610
And finally, well, that's the last reward.

16:44.610 --> 16:48.840
So, self dot last underscore reward.

16:48.840 --> 16:49.920
There we go.

16:49.920 --> 16:53.490
And again, the reward is a float number

16:53.490 --> 16:56.490
which I remind is between minus one and plus one.

16:56.490 --> 16:57.840
So, that's a number again.

16:57.840 --> 17:02.130
And as for the action, we will initialize it to zero.

17:02.130 --> 17:02.963
And there we go.

17:02.963 --> 17:06.300
Congratulations, our init function is ready.

17:06.300 --> 17:08.910
So, now we are ready to move on to the exciting stuff.

17:08.910 --> 17:12.330
And actually, the most important thing for our AI

17:12.330 --> 17:15.690
that's deciding which action to play at each time,

17:15.690 --> 17:16.980
at each time T.

17:16.980 --> 17:19.890
And that's exactly what we're gonna do in the next tutorial

17:19.890 --> 17:23.490
by creating the select action method.

17:23.490 --> 17:25.410
So, let's do this in the next tutorial.

17:25.410 --> 17:27.513
And until then, enjoy AI.
