WEBVTT

00:00.450 --> 00:03.180
Hello and welcome to this Python tutorial.

00:03.180 --> 00:05.130
All right, so in this tutorial, we're gonna make

00:05.130 --> 00:08.820
the function that will select the right action at each time.

00:08.820 --> 00:11.070
So basically we're gonna implement the part

00:11.070 --> 00:14.610
that will make the car do the right move at each time,

00:14.610 --> 00:16.020
that is going left, going straight,

00:16.020 --> 00:18.060
or going right to reach the goal.

00:18.060 --> 00:21.000
And to avoid the obstacles, that is, the sand.

00:21.000 --> 00:22.037
So let's do this right now.

00:22.037 --> 00:26.010
We are gonna start as usual with a def to define a function.

00:26.010 --> 00:28.320
And then we give a name to our function,

00:28.320 --> 00:32.520
which we're gonna call select_action.

00:32.520 --> 00:33.900
Then some parentheses.

00:33.900 --> 00:37.440
And this select_action function will take two arguments.

00:37.440 --> 00:41.400
The first one is self, as usual, to refer to the object.

00:41.400 --> 00:43.350
And a second argument, which,

00:43.350 --> 00:46.380
according to you is going to be which one?

00:46.380 --> 00:47.820
Well, what could it be?

00:47.820 --> 00:50.539
If you think about it, the action we select comes

00:50.539 --> 00:53.340
from the outputs of the neural network.

00:53.340 --> 00:56.220
Because the outputs of the neural network are the q-values

00:56.220 --> 00:58.500
for each of the three possible actions.

00:58.500 --> 01:00.480
And therefore, the action that we'll play,

01:00.480 --> 01:03.480
the action that will be the output of the neural network,

01:03.480 --> 01:05.820
depends on the input state.

01:05.820 --> 01:09.720
And the input state is exactly the second argument we need

01:09.720 --> 01:11.850
for the select_action function.

01:11.850 --> 01:15.240
It's because we are literally going to take the output

01:15.240 --> 01:16.470
of the neural network,

01:16.470 --> 01:18.630
and of course the output of the neural network

01:18.630 --> 01:22.320
directly depends on the input of the neural network.

01:22.320 --> 01:24.240
So that's gonna be our argument.

01:24.240 --> 01:26.640
And now we can give it any name.

01:26.640 --> 01:29.070
We will actually call it state.

01:29.070 --> 01:31.950
Because the inputs of the neural networks

01:31.950 --> 01:34.560
are the input states that are encoded

01:34.560 --> 01:36.480
by a vector of five dimensions.

01:36.480 --> 01:40.080
The three signals, orientation, and minus orientation.
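
To make the five-dimensional input concrete, here is a minimal sketch of how such a state could be assembled. The helper name encode_state and the sample sensor values are assumptions made for illustration; they do not come from the tutorial's code.

```python
def encode_state(signal_1, signal_2, signal_3, orientation):
    """Build the 5-dimensional input state described in the tutorial.

    The three sensor signals come first, followed by the orientation
    and its negative. (Hypothetical helper, for illustration only.)
    """
    return [signal_1, signal_2, signal_3, orientation, -orientation]


# Made-up readings: three sensor signals and an orientation value.
state = encode_state(0.0, 0.2, 0.1, 0.35)
print(len(state))  # 5
```

In the tutorial this vector is later turned into a torch tensor before it is passed to select_action.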

01:40.080 --> 01:42.120
And so now, things are gonna be easy.

01:42.120 --> 01:45.450
We are gonna feed the input states into the neural network,

01:45.450 --> 01:47.640
the one that we built right above.

01:47.640 --> 01:49.590
Right here, with the Network class.

01:49.590 --> 01:52.530
And then we're gonna get the outputs.

01:52.530 --> 01:54.120
Which are the q-values

01:54.120 --> 01:56.160
for each of the three possible actions.

01:56.160 --> 01:58.230
And then using the softmax method,

01:58.230 --> 01:59.987
which I'm going to explain in this tutorial,

01:59.987 --> 02:03.240
we are gonna get the final action to play.

02:03.240 --> 02:04.470
So let's do this.

02:04.470 --> 02:08.430
Let's go into the function and let's implement all this.

02:08.430 --> 02:11.310
So, the first thing we need to start with is

02:11.310 --> 02:12.780
about what I've just mentioned.

02:12.780 --> 02:14.190
Softmax.

02:14.190 --> 02:17.430
The idea of the softmax is that we're gonna try to

02:17.430 --> 02:20.730
get the best action to play at each time.

02:20.730 --> 02:22.290
But, at the same time,

02:22.290 --> 02:24.960
we will be exploring the different actions.

02:24.960 --> 02:25.980
And how can we do that?

02:25.980 --> 02:28.860
How can we get the best action to play

02:28.860 --> 02:31.230
while still exploring the other actions?

02:31.230 --> 02:33.660
Well, we use this idea of softmax.

02:33.660 --> 02:37.260
Which consists of generating a distribution

02:37.260 --> 02:42.120
of probabilities for each of the q-values, Q(state, action).

02:42.120 --> 02:45.150
You know we have one q-value for each action.

02:45.150 --> 02:46.800
Go left, go straight, or go right.

02:46.800 --> 02:49.680
But this q-value also depends on the input state.

02:49.680 --> 02:51.450
That's exactly the Q-function you saw

02:51.450 --> 02:52.830
in the intuition lectures.

02:52.830 --> 02:56.280
This Q-function is a function of the state, and the action.

02:56.280 --> 02:58.830
So since we have here one input state,

02:58.830 --> 03:00.270
which is the state here.

03:00.270 --> 03:01.890
And three possible actions.

03:01.890 --> 03:03.420
We have three q-values.

03:03.420 --> 03:04.590
Q(state, action one),

03:04.590 --> 03:05.820
Q(state, action two),

03:05.820 --> 03:07.500
and Q(state, action three).

03:07.500 --> 03:09.840
And we are gonna generate a distribution

03:09.840 --> 03:13.860
of probabilities with respect to these three q-values.

03:13.860 --> 03:15.870
That is, we're gonna have one probability

03:15.870 --> 03:17.790
for the first q-value.

03:17.790 --> 03:20.250
One other probability for the second q-value.

03:20.250 --> 03:22.830
And a third probability for the third q-value.

03:22.830 --> 03:25.650
And all these three probabilities will sum up to one.
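
As a rough illustration of what softmax computes, here is a minimal pure-Python sketch, independent of PyTorch. The three Q-values are made-up numbers, not outputs of the tutorial's network.

```python
import math


def softmax(q_values):
    """Turn a list of Q-values into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing;
    # it does not change the resulting probabilities.
    highest = max(q_values)
    exps = [math.exp(q - highest) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical Q-values for: go left, go straight, go right.
probs = softmax([1.0, 2.0, 3.0])
print([round(p, 3) for p in probs])  # [0.09, 0.245, 0.665]
print(round(sum(probs), 6))          # 1.0
```

The largest Q-value receives the largest probability, but the other two actions keep a non-zero chance of being picked.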

03:25.650 --> 03:28.410
And so, we're gonna do all this with softmax.

03:28.410 --> 03:32.160
And softmax will attribute a large probability

03:32.160 --> 03:33.780
to the highest q-value.

03:33.780 --> 03:37.950
That's why an alternative to softmax is a simple argmax.

03:37.950 --> 03:41.520
You know, directly taking the maximum of the q-values.

03:41.520 --> 03:44.940
But in that case, we're not exploring the other actions.

03:44.940 --> 03:46.350
Thanks to these probabilities

03:46.350 --> 03:50.220
we can explore somewhere else using a temperature parameter

03:50.220 --> 03:52.170
that we're gonna see very quickly.

03:52.170 --> 03:53.580
We can still explore them

03:53.580 --> 03:55.980
by configuring this temperature parameter.

03:55.980 --> 03:58.650
That's why in general, for deep Q-learning,

03:58.650 --> 04:01.380
I highly recommend using a softmax

04:01.380 --> 04:03.450
rather than a simple argmax.
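
The difference between argmax and softmax sampling can be sketched in a few lines of plain Python. The probabilities below are assumed values for illustration, and random.choices stands in for the sampling step that the tutorial later performs with PyTorch.

```python
import random

# Assumed softmax output for the three actions (left, straight, right).
probs = [0.09, 0.245, 0.665]

# Argmax: always picks the action with the highest probability.
greedy_action = max(range(len(probs)), key=lambda a: probs[a])

# Softmax sampling: draws actions in proportion to their probabilities,
# so the weaker actions are still explored from time to time.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
counts = [0, 0, 0]
for _ in range(10_000):
    action = rng.choices(range(3), weights=probs, k=1)[0]
    counts[action] += 1

print(greedy_action)  # 2
print(counts)         # every action appears, action 2 most often
```

With argmax the car would never try actions 0 or 1 in this state; with sampling it occasionally does, which is the exploration the tutorial refers to.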

04:03.450 --> 04:05.280
All right, so let's implement softmax,

04:05.280 --> 04:07.050
and therefore as you understood,

04:07.050 --> 04:10.200
since softmax returns the probabilities of each

04:10.200 --> 04:12.910
of the three q-values for the three possible actions,

04:12.910 --> 04:17.130
well, the first variable we're going to create is probs,

04:17.130 --> 04:20.430
referring of course to these probabilities.

04:20.430 --> 04:22.290
So probs equals.

04:22.290 --> 04:25.020
And now we're gonna take our softmax function.

04:25.020 --> 04:28.290
And according to you, where are we going to take it from?

04:28.290 --> 04:30.900
Well, of course, remember we imported

04:30.900 --> 04:34.500
the torch.nn.functional submodule.

04:34.500 --> 04:37.020
Which, I remind you, is the module that contains most

04:37.020 --> 04:39.810
of the functions to implement a neural network.

04:39.810 --> 04:41.490
We gave it the shortcut F.

04:41.490 --> 04:44.520
And so that's exactly from this functional submodule

04:44.520 --> 04:47.280
that we're gonna take our softmax function.

04:47.280 --> 04:49.470
But, since we gave it the shortcut F,

04:49.470 --> 04:52.410
we start here with an F representing functional.

04:52.410 --> 04:56.070
From which we take our softmax function.

04:56.070 --> 04:58.230
Here it is, that's the first one.

04:58.230 --> 04:59.760
And parentheses.

04:59.760 --> 05:00.593
All right.

05:00.593 --> 05:04.110
And now, what do we need to input in this softmax function?

05:04.110 --> 05:06.330
Well, that's of course the entities

05:06.330 --> 05:10.140
for which we want to generate the probability distribution.

05:10.140 --> 05:11.550
And where are these entities?

05:11.550 --> 05:13.830
Well, these are of course the q-values.

05:13.830 --> 05:16.920
So now the question is, how can we get the q-values?

05:16.920 --> 05:19.950
Well, of course the q-values are the output

05:19.950 --> 05:21.150
of the neural network.

05:21.150 --> 05:23.550
And to get these outputs of the neural network,

05:23.550 --> 05:24.600
well, here we go.

05:24.600 --> 05:27.120
We need to take our neural network.

05:27.120 --> 05:29.250
But in fact, we already have it.

05:29.250 --> 05:33.480
Because that's what we initialized in the __init__ function.

05:33.480 --> 05:36.090
You know, we created self.model,

05:36.090 --> 05:38.340
which is nothing else than our neural network.

05:38.340 --> 05:41.580
Because it is an object of the Network class.

05:41.580 --> 05:42.780
And so that's perfect.

05:42.780 --> 05:45.384
We can just take our model here in softmax.

05:45.384 --> 05:48.270
Apply this model to the input state.

05:48.270 --> 05:49.860
Which is the argument here.

05:49.860 --> 05:53.070
And that will return the outputs that we're looking for.

05:53.070 --> 05:54.510
That is the q-values.

05:54.510 --> 05:57.630
And so now your intuition, why we had to take the model here

05:57.630 --> 06:00.840
to introduce it in the __init__ function might get better.

06:00.840 --> 06:04.440
For those of you starting with object oriented programming,

06:04.440 --> 06:07.170
you will see that all this will become natural.

06:07.170 --> 06:12.170
So softmax, then, so we take our model, self.model.

06:12.360 --> 06:14.310
Because this must be the model

06:14.310 --> 06:17.160
of the object that we created here.

06:17.160 --> 06:20.760
But then, we need to get the output

06:20.760 --> 06:22.740
of our neural network model.

06:22.740 --> 06:25.650
And therefore we're gonna add here some parentheses.

06:25.650 --> 06:27.960
In which we're going to input, well,

06:27.960 --> 06:30.600
the input state, named state here.

06:30.600 --> 06:35.490
So, what we wanna do at first is enter state.

06:35.490 --> 06:37.890
But now we must be careful about something.

06:37.890 --> 06:40.770
State looks like a simple state right now.

06:40.770 --> 06:42.990
But remember, that state is actually going to

06:42.990 --> 06:44.820
be a torch tensor.

06:44.820 --> 06:46.440
Because later we're gonna use

06:46.440 --> 06:48.570
this self.last_state,

06:48.570 --> 06:52.170
to put it as the argument of the select_action function.

06:52.170 --> 06:54.270
The state argument that is here is actually

06:54.270 --> 06:57.660
going to become later this self.last_state.

06:57.660 --> 06:59.820
And since this is a torch tensor,

06:59.820 --> 07:01.740
well the model will accept it.

07:01.740 --> 07:02.820
So that's fine.

07:02.820 --> 07:05.130
But now we can improve the algorithm.

07:05.130 --> 07:08.550
So, as you understood, state is a torch tensor.

07:08.550 --> 07:11.851
And, as we said earlier, most of the tensors are wrapped

07:11.851 --> 07:15.630
into a variable that will also contain a gradient.

07:15.630 --> 07:17.730
So right now what we're gonna do, first,

07:17.730 --> 07:21.150
is wrap this input state that is a tensor,

07:21.150 --> 07:22.800
into a torch variable.

07:22.800 --> 07:25.530
But since this is the input state,

07:25.530 --> 07:28.140
well, there is not going to be any differentiation.

07:28.140 --> 07:29.850
We will not be using the gradient

07:29.850 --> 07:33.900
of this state torch variable in the computations.

07:33.900 --> 07:38.040
And therefore, what we're gonna do now is convert

07:38.040 --> 07:42.813
this torch tensor state into a torch variable.

07:44.760 --> 07:45.810
Like so.

07:45.810 --> 07:48.930
But then to specify that we don't want the gradient

07:48.930 --> 07:52.560
in the graph of all the computations of the nn module,

07:52.560 --> 07:57.560
well, we will add here: comma, volatile equals True.

07:58.140 --> 08:01.770
So that now we have our state torch tensor

08:01.770 --> 08:03.450
wrapped into a torch variable.

08:03.450 --> 08:07.350
But, thanks to this volatile equals true parameter,

08:07.350 --> 08:10.590
well, we won't be including the gradient associated

08:10.590 --> 08:12.060
to this input state,

08:12.060 --> 08:16.830
to the graph of all the computations of the nn.module.

08:16.830 --> 08:18.510
So that's another technical trick.

08:18.510 --> 08:20.190
This will save us some memory,

08:20.190 --> 08:23.130
and therefore this will improve the performance.

08:23.130 --> 08:25.140
So I highly recommend doing this.
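
A note for readers on recent PyTorch versions: since PyTorch 0.4, Variable has been merged into Tensor and the volatile flag no longer has any effect, with torch.no_grad() as the replacement. The sketch below shows that modern equivalent. The small stand-in model (5 inputs, 30 hidden units, 3 outputs) is an assumption made so the example runs on its own; it is not the tutorial's Network class.

```python
import torch
import torch.nn as nn

# Stand-in network: 5 inputs, 3 Q-values out.
# The hidden size of 30 is an assumption for this sketch.
model = nn.Sequential(nn.Linear(5, 30), nn.ReLU(), nn.Linear(30, 3))

# One 5-dimensional state with a leading "fake batch" dimension.
state = torch.rand(1, 5)

# Modern counterpart of Variable(state, volatile=True):
# no computation graph is recorded inside this block,
# which saves memory when we only need the outputs.
with torch.no_grad():
    q_values = model(state)

print(q_values.shape)          # torch.Size([1, 3])
print(q_values.requires_grad)  # False
```

The intent is the same as in the tutorial: the forward pass used to pick an action does not need gradients, so none are tracked.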

08:25.140 --> 08:27.870
And now we're gonna add something more fun.

08:27.870 --> 08:29.520
It's about this temperature parameter

08:29.520 --> 08:30.810
that I've just mentioned.

08:30.810 --> 08:33.420
So this temperature parameter is the parameter

08:33.420 --> 08:36.180
that will allow us to modulate how the neural network

08:36.180 --> 08:40.170
will be sure of which action it should decide to play.

08:40.170 --> 08:43.440
So, this temperature parameter will be a positive number.

08:43.440 --> 08:45.990
And the closer it is to zero,

08:45.990 --> 08:48.000
the less sure the neural network

08:48.000 --> 08:49.770
will be when playing an action.

08:49.770 --> 08:51.854
And the higher this temperature parameter is,

08:51.854 --> 08:54.257
the more sure the neural network will be

08:54.257 --> 08:56.880
of the action it decides to play.

08:56.880 --> 08:58.530
And to add this parameter,

08:58.530 --> 09:01.170
I'm going to multiply the output,

09:01.170 --> 09:02.640
which are the q-values,

09:02.640 --> 09:05.490
by this temperature parameter.

09:05.490 --> 09:08.160
So let's start, for example, with seven.

09:08.160 --> 09:09.480
And I'm going to specify here

09:09.480 --> 09:13.440
the little comment, t equals seven.

09:13.440 --> 09:15.540
So that's the temperature parameter

09:15.540 --> 09:17.220
that I'm setting equal to seven.

09:17.220 --> 09:18.390
We're gonna try some other ones,

09:18.390 --> 09:20.430
but I just wanna start with a small one.

09:20.430 --> 09:22.138
Because you're gonna see that with a small one,

09:22.138 --> 09:25.950
our car will still behave like some kind of an insect.

09:25.950 --> 09:28.500
But then by increasing this temperature parameter,

09:28.500 --> 09:30.480
our car will look more like a car.

09:30.480 --> 09:34.440
And besides, the self driving will be much, much better.

09:34.440 --> 09:35.340
And so that makes sense,

09:35.340 --> 09:38.850
because the higher is this temperature parameter,

09:38.850 --> 09:42.150
the higher will be the probability of the winning q-value.

09:42.150 --> 09:43.380
Because for example,

09:43.380 --> 09:48.150
if we have softmax of the q-values,

09:48.150 --> 09:52.170
let's take some simple numbers, 1, 2, 3.

09:52.170 --> 09:55.303
If softmax of 1, 2, 3 equals, for example,

09:55.303 --> 09:57.897
0.04, 0.11, and 0.85.

10:01.260 --> 10:04.020
Then by increasing the temperature, you know,

10:04.020 --> 10:05.670
by taking a higher temperature.

10:05.670 --> 10:07.620
Right now the temperature equals one.

10:07.620 --> 10:10.410
By taking a higher temperature, like for example, two.

10:10.410 --> 10:15.410
So, softmax, let's copy this and multiply it by,

10:15.930 --> 10:18.180
for example, two or three.

10:18.180 --> 10:20.670
Softmax of the same q-values,

10:20.670 --> 10:24.330
but multiplied by this temperature parameter of three.

10:24.330 --> 10:28.470
Well, we will get something like zero for the first q-value.

10:28.470 --> 10:31.170
Because this had a very low probability.

10:31.170 --> 10:33.330
So that's something around zero.

10:33.330 --> 10:36.480
Then, something very small for the second probability,

10:36.480 --> 10:39.390
because this was still a low probability.

10:39.390 --> 10:43.320
So let's say, for example, 0.02.

10:43.320 --> 10:46.170
But then, the third probability,

10:46.170 --> 10:48.360
since it was the largest one,

10:48.360 --> 10:50.100
and a pretty high one.

10:50.100 --> 10:52.140
Well, by increasing the temperature,

10:52.140 --> 10:54.180
this probability will be even larger.

10:54.180 --> 10:56.310
Because we're gonna be even more sure

10:56.310 --> 10:59.220
that this is the right q-value corresponding

10:59.220 --> 11:00.900
to the action we must play.

11:00.900 --> 11:05.900
And therefore, this is gonna be something like 0.98.

11:05.940 --> 11:08.490
Now, by increasing this temperature parameter,

11:08.490 --> 11:10.530
well we are now even more sure

11:10.530 --> 11:13.440
that the third action here should be the action to play.

11:13.440 --> 11:15.750
Because the probability for the q-value

11:15.750 --> 11:18.450
of this action is not only the largest one,

11:18.450 --> 11:19.800
but also very high.

11:19.800 --> 11:22.650
So that's what this temperature parameter is all about.

11:22.650 --> 11:24.720
It's about the certainty

11:24.720 --> 11:27.330
of which action we should decide to play.
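
The effect of the temperature multiplier can be checked numerically. The sketch below reuses the Q-values 1, 2, 3 from the explanation and a hand-written softmax. The figures spoken in the tutorial are rough illustrations, so the exact values printed here differ somewhat, but the trend is the same. Note also that this multiplier behaves like an inverse of the temperature found in many textbooks, where the Q-values are divided by the temperature instead.

```python
import math


def softmax(values):
    """Plain-Python softmax, shifted by the maximum for numerical stability."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


q_values = [1.0, 2.0, 3.0]

# Multiply the Q-values by increasing temperature parameters, as in the tutorial.
results = {}
for scale in (1, 3, 7):
    results[scale] = softmax([q * scale for q in q_values])
    print(scale, [round(p, 3) for p in results[scale]])
# 1 [0.09, 0.245, 0.665]
# 3 [0.002, 0.047, 0.95]
# 7 [0.0, 0.001, 0.999]
```

The larger the multiplier, the closer the distribution gets to a plain argmax, and the less the agent explores.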

11:27.330 --> 11:29.460
All right, so, I'm gonna remove this comment.

11:29.460 --> 11:31.170
This was just to explain.

11:31.170 --> 11:33.480
And now, let's get our action.

11:33.480 --> 11:35.520
So, how are we going to do that?

11:35.520 --> 11:38.280
Well, the principle of this softmax method is

11:38.280 --> 11:41.040
not only to generate a probability distribution

11:41.040 --> 11:43.230
for each of the q-values, but also,

11:43.230 --> 11:46.470
and that's the second step of this softmax method.

11:46.470 --> 11:50.130
We take a random draw from this distribution

11:50.130 --> 11:51.990
to get our final action.

11:51.990 --> 11:54.330
And of course, we will have a high chance to

11:54.330 --> 11:56.190
get the action that corresponds

11:56.190 --> 11:58.800
to the q-value that has the highest probability.

11:58.800 --> 12:01.299
Because that's exactly how a distribution works.

12:01.299 --> 12:02.550
So there we go.

12:02.550 --> 12:04.050
Let's get our action.

12:04.050 --> 12:06.180
So we're going to introduce a new variable

12:06.180 --> 12:07.923
that we're gonna call action.

12:08.760 --> 12:12.930
And this action is going to be a random draw

12:12.930 --> 12:14.850
of the probability distribution

12:14.850 --> 12:17.520
that we just created at this line before.

12:17.520 --> 12:20.160
And so how do we get such a random draw?

12:20.160 --> 12:22.290
Well, we're gonna take our probs,

12:22.290 --> 12:24.480
probabilities of each of the q-values.

12:24.480 --> 12:26.610
We take probs, and then dot,

12:26.610 --> 12:31.530
and then we're gonna use the multinomial function.

12:31.530 --> 12:33.570
And that will give us a random draw

12:33.570 --> 12:36.150
from this distribution, probs.
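
Here is a minimal sketch of the random draw on its own. In current PyTorch versions, multinomial requires the number of samples to be passed explicitly, so num_samples=1 is written out below. The probabilities are assumed values rather than real network outputs.

```python
import torch

torch.manual_seed(0)  # fixed seed so repeated runs draw the same action

# Assumed probabilities for the three actions, with a fake batch dimension.
probs = torch.tensor([[0.09, 0.245, 0.665]])  # shape (1, 3)

# Draw one action index per row, in proportion to the probabilities.
action = probs.multinomial(num_samples=1)

print(action.shape)  # torch.Size([1, 1])
print(action)        # a 1x1 tensor holding 0, 1, or 2
```

The result keeps the batch dimension, so the drawn index still has to be pulled out of a 1x1 tensor before it can be used as a plain action number.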

12:36.150 --> 12:36.983
So that's all.

12:36.983 --> 12:38.460
That will give us the action.

12:38.460 --> 12:39.480
Perfect.

12:39.480 --> 12:42.750
And now of course, we are going to return the action.

12:42.750 --> 12:44.820
But there is a little trick here.

12:44.820 --> 12:47.250
Well, it's the fact that this probs.multinomial

12:47.250 --> 12:50.190
returns a PyTorch variable

12:50.190 --> 12:51.450
with a fake batch.

12:51.450 --> 12:54.570
You know, this fake dimension corresponding to the batch.

12:54.570 --> 12:58.140
And therefore to get the right result that we want,

12:58.140 --> 13:00.461
that is the action zero, one or two.

13:00.461 --> 13:05.461
We just need to add here .data, and then some brackets.

13:05.670 --> 13:09.180
And the action zero, one, or two that we're looking for

13:09.180 --> 13:13.560
is contained at the indexes zero and zero.

13:13.560 --> 13:14.760
All right, and there we go.

13:14.760 --> 13:16.860
Now we have our action.

13:16.860 --> 13:18.900
Thanks to this select_action function,

13:18.900 --> 13:22.770
the AI will now know which action to play at each time.
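
Putting the steps together, here is a sketch of how select_action might look on a recent PyTorch version. Several details are assumptions made so the example runs on its own: the layer sizes of Network, the temperature attribute, the use of torch.no_grad() instead of volatile=True, and .item() instead of .data[0, 0]. Treat it as an illustration, not as the tutorial's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Network(nn.Module):
    """Small stand-in network: state in, one Q-value per action out."""

    def __init__(self, input_size, nb_action):
        super().__init__()
        self.fc1 = nn.Linear(input_size, 30)  # hidden size is an assumption
        self.fc2 = nn.Linear(30, nb_action)

    def forward(self, state):
        return self.fc2(F.relu(self.fc1(state)))


class Dqn:
    def __init__(self, input_size, nb_action, temperature=7.0):
        self.model = Network(input_size, nb_action)
        self.temperature = temperature  # multiplier applied to the Q-values

    def select_action(self, state):
        # No gradients are needed just to pick an action.
        with torch.no_grad():
            q_values = self.model(state)
            # dim=1 applies softmax across the actions of each batch row.
            probs = F.softmax(q_values * self.temperature, dim=1)
        # Random draw from the distribution, shape (1, 1).
        action = probs.multinomial(num_samples=1)
        # Strip the fake batch dimension and return a plain int.
        return action[0, 0].item()


brain = Dqn(input_size=5, nb_action=3)
chosen = brain.select_action(torch.rand(1, 5))
print(chosen)  # 0, 1, or 2
```

The state passed in must already carry the fake batch dimension, that is, have shape (1, 5), which matches how the tutorial describes the tensor handed to select_action.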

13:22.770 --> 13:25.440
Terrific. So now we can move on to the next function

13:25.440 --> 13:27.510
which will be the learn function.

13:27.510 --> 13:30.359
And that's where we will train the whole neural network

13:30.359 --> 13:32.460
you know, with all the forward propagation and

13:32.460 --> 13:33.960
then the back propagation,

13:33.960 --> 13:35.910
using stochastic gradient descent.

13:35.910 --> 13:39.120
Well, basically, we will implement the whole training

13:39.120 --> 13:40.980
of the deep learning model that is

13:40.980 --> 13:43.440
at the heart of our artificial intelligence.

13:43.440 --> 13:44.700
So I can't wait to do that.

13:44.700 --> 13:47.190
This is going to be an exciting tutorial.

13:47.190 --> 13:49.500
And so, I'll see you in the next tutorial.

13:49.500 --> 13:51.213
Until then, enjoy AI.