WEBVTT

00:00.600 --> 00:03.150
-: Hello and welcome to this Python tutorial.

00:03.150 --> 00:04.170
So let's do this.

00:04.170 --> 00:06.120
Let's make this for loop,

00:06.120 --> 00:08.730
starting from the right and going to the left.

00:08.730 --> 00:11.910
And to do this, we're gonna add a for.

00:11.910 --> 00:15.930
So this time the iterative variable is going to be our step

00:15.930 --> 00:17.910
because we're gonna go from the last step

00:17.910 --> 00:20.670
to the first step of the series of transitions.

00:20.670 --> 00:22.920
And so: for step, and the trick

00:22.920 --> 00:24.810
to go from the right to the left

00:24.810 --> 00:28.980
is to use for step in reversed.

00:28.980 --> 00:32.070
Reversed, and now we just need to input a sequence.

00:32.070 --> 00:35.340
And this sequence is going to be of course, our series.

00:35.340 --> 00:39.030
So we input our series, but as you can see in the paper

00:39.030 --> 00:41.910
we go from T minus one to T start.

00:41.910 --> 00:43.980
So we don't start from the very last step,

00:43.980 --> 00:46.980
that is, the terminal state, but from the state before that,

00:46.980 --> 00:48.360
that is T minus one,

00:48.360 --> 00:50.430
down to T start, that is, the first step.

00:50.430 --> 00:53.160
And so here to go from

00:53.160 --> 00:55.200
not the last state but the state before,

00:55.200 --> 00:59.370
we need to add, in brackets, colon minus one.

00:59.370 --> 01:00.960
I'm sure that for those of you

01:00.960 --> 01:02.130
who followed the machine learning

01:02.130 --> 01:04.560
and the deep learning course you know this trick,

01:04.560 --> 01:07.950
colon minus one means that you're going up to

01:07.950 --> 01:10.320
the element before the last element

01:10.320 --> 01:12.090
but not up to the last element.

01:12.090 --> 01:14.820
And therefore we get the sequence we want.

01:14.820 --> 01:17.430
That is we're gonna go from the element

01:17.430 --> 01:20.520
before the last element up to the first element

01:20.520 --> 01:22.170
and that we do thanks to reversed.

01:22.170 --> 01:24.150
To go from the right to the left.

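NOTE
A minimal sketch of the slicing trick just described, not the tutorial's actual code.
The list of labels is a made-up stand-in for the series of transitions.
```python
# Hypothetical stand-in for the series: five labelled steps, the last one terminal.
series = ["t_start", "t1", "t2", "t_minus_1", "terminal"]
# series[:-1] drops the last element; reversed() walks what is left from right to left.
visited = [step for step in reversed(series[:-1])]
print(visited)  # ['t_minus_1', 't2', 't1', 't_start']
```
The loop starts at the step before the terminal one and ends at the first step.
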
01:24.150 --> 01:27.150
All right, so we are ready to enter the for loop.

01:27.150 --> 01:29.790
And so inside this for loop, what are we going to do?

01:29.790 --> 01:32.880
Well, we're gonna do exactly as in the paper.

01:32.880 --> 01:35.610
We are going to update the cumulative reward

01:35.610 --> 01:37.380
by multiplying it by gamma

01:37.380 --> 01:40.500
and adding the reward obtained in the current step.

01:40.500 --> 01:42.660
That is, in the current step of the for loop.

01:42.660 --> 01:44.160
All right, so let's do this.

01:44.160 --> 01:45.570
Going back to Python.

01:45.570 --> 01:50.570
And so we want to update our cumulative reward

01:50.610 --> 01:55.610
the following way, by first multiplying it by gamma.

01:57.780 --> 01:58.650
There we go.

01:58.650 --> 02:00.510
Here, we multiply it by gamma.

02:00.510 --> 02:05.510
And then we want to add the reward of the step

02:05.730 --> 02:09.420
which we can access this way with this special structure.

02:09.420 --> 02:13.110
Remember that reward is an attribute of the step object.

02:13.110 --> 02:16.260
And so here of course, we add a plus.

02:16.260 --> 02:19.320
So cumulative reward equals the reward of this step

02:19.320 --> 02:20.850
we are in right now in the loop,

02:20.850 --> 02:24.330
plus gamma times the previous cumulative reward

02:24.330 --> 02:26.130
before it is updated.

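NOTE
A hedged sketch of the backward update described above, under stated assumptions.
The Step structure, the name cumul_reward, the gamma value, and the starting value 0.0 are illustrative and not taken from the tutorial's code.
```python
from collections import namedtuple
# Hypothetical Step structure with the fields named in the tutorial.
Step = namedtuple("Step", ["state", "action", "reward", "done"])
gamma = 0.5  # illustrative discount factor, not the tutorial's value
series = [Step("s0", 0, 1.0, False), Step("s1", 1, 2.0, False), Step("s2", 0, 0.0, True)]
cumul_reward = 0.0  # assumed starting value because the last step here is terminal
for step in reversed(series[:-1]):
    # New value = reward of the current step + gamma * previous cumulative reward.
    cumul_reward = step.reward + gamma * cumul_reward
print(cumul_reward)  # 2.0
```
Step s1 gives 2.0 + 0.5 * 0.0 = 2.0, then step s0 gives 1.0 + 0.5 * 2.0 = 2.0.
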
02:26.130 --> 02:28.170
Perfect, so now I think we're good.

02:28.170 --> 02:30.480
We're thoroughly following the algorithm.

02:30.480 --> 02:32.820
And now time for the next steps.

02:32.820 --> 02:35.070
Well, now it's going to become pretty easy.

02:35.070 --> 02:37.110
We go back to the first for loop

02:37.110 --> 02:38.790
because this for loop is just to

02:38.790 --> 02:41.040
compute the cumulative reward,

02:41.040 --> 02:42.690
going from the right to the left

02:42.690 --> 02:45.870
by updating this way, following the algorithm.

02:45.870 --> 02:48.780
And now as you remember, the goal of doing all this

02:48.780 --> 02:52.140
is to get our inputs ready and our targets ready

02:52.140 --> 02:53.910
so that we can minimize the squared difference

02:53.910 --> 02:55.920
between the two for the training.

02:55.920 --> 02:58.440
And so right now the only thing that we have left to do

02:58.440 --> 03:01.200
is get these inputs and targets ready.

03:01.200 --> 03:02.910
So let's do this.

03:02.910 --> 03:04.860
First, what we need to do is add

03:04.860 --> 03:08.490
the first state of the series to our inputs list.

03:08.490 --> 03:11.370
So far this input state is in this input variable

03:11.370 --> 03:14.250
but that was just to compute the output.

03:14.250 --> 03:16.500
So we're gonna get this input state

03:16.500 --> 03:18.210
of the first step separately

03:18.210 --> 03:20.130
because that's exactly what we need to append

03:20.130 --> 03:21.360
in our inputs list.

03:21.360 --> 03:23.340
So let's get this separately

03:23.340 --> 03:25.980
and we're gonna call it state.

03:25.980 --> 03:28.260
And so exactly the same as here,

03:28.260 --> 03:29.430
we can get it this way

03:29.430 --> 03:32.730
by taking the first index of the series

03:32.730 --> 03:34.530
which contains the first transition,

03:34.530 --> 03:36.000
and then adding dot state

03:36.000 --> 03:38.340
to get the state of this first transition.

03:38.340 --> 03:40.230
So that's the state we need.

03:40.230 --> 03:43.140
Then, in the same way, we're gonna separately get the target

03:43.140 --> 03:46.860
associated with this input state of the first transition.

03:46.860 --> 03:50.130
And so I'm introducing a new variable here, target,

03:50.130 --> 03:53.520
which will be equal to the Q value of the first step.

03:53.520 --> 03:57.060
And since the Q value is returned by the neural network

03:57.060 --> 03:58.860
it is contained in output.

03:58.860 --> 04:02.730
And since output is the output associated with this input

04:02.730 --> 04:05.910
which contains the first element of the transition

04:05.910 --> 04:08.940
well, we can get this Q value of the first state

04:08.940 --> 04:13.050
by just taking output here and taking the index zero,

04:13.050 --> 04:16.470
and then we add dot data.

04:16.470 --> 04:19.530
That will simply get us the Q value of the input state

04:19.530 --> 04:21.030
of the first transition.

04:21.030 --> 04:23.070
And that is exactly the target Q value.

04:23.070 --> 04:25.320
So that's why we're taking it.

04:25.320 --> 04:28.650
Then we are going to update this target variable

04:28.650 --> 04:31.170
but only for the action that was selected

04:31.170 --> 04:33.180
in the first step of the series.

04:33.180 --> 04:35.850
And to access this first step of the series

04:35.850 --> 04:39.300
well, we first need to take series zero

04:39.300 --> 04:40.560
because this is exactly

04:40.560 --> 04:42.990
the first step of the series, series zero.

04:42.990 --> 04:46.500
And to access the action corresponding to this first step

04:46.500 --> 04:51.150
of the series, well, we need to add here, dot action.

04:51.150 --> 04:55.110
Again, that is this attribute structure that we're using.

04:55.110 --> 04:56.760
You know, action is an attribute

04:56.760 --> 04:58.860
of the first step of the series,

04:58.860 --> 05:01.170
that is the first transition of the series

05:01.170 --> 05:03.300
because each transition of the series

05:03.300 --> 05:04.710
has the following structure:

05:04.710 --> 05:07.200
state, action, reward, and done.

05:07.200 --> 05:09.840
So action here, this attribute action here,

05:09.840 --> 05:11.490
means that we're simply getting

05:11.490 --> 05:14.310
the action of this first step.

05:14.310 --> 05:17.400
And so the target for that specific action

05:17.400 --> 05:20.970
of the first step is exactly what needs to be updated

05:20.970 --> 05:22.830
by the cumulative reward.

05:22.830 --> 05:25.320
So basically here we're just gonna write that

05:25.320 --> 05:29.490
the target associated with the action that was played

05:29.490 --> 05:34.380
in the first step of the series is this cumulative reward

05:34.380 --> 05:36.120
that we just computed.

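NOTE
A rough sketch of the state and target step just described, using plain lists instead of tensors.
The Q-values, the cumulative reward, and the Step structure are made up for illustration; in the tutorial, output comes from the neural network.
```python
from collections import namedtuple
# Hypothetical Step structure with the fields named in the tutorial.
Step = namedtuple("Step", ["state", "action", "reward", "done"])
series = [Step("s0", 2, 1.0, False), Step("s1", 0, 0.0, True)]
output = [[0.1, 0.4, 0.3], [0.0, 0.2, 0.5]]  # made-up Q-values, one row per input state
cumul_reward = 1.5  # made-up value standing in for the result of the backward loop
state = series[0].state  # input state of the first transition
target = list(output[0])  # copy of the Q-values predicted for the first state
target[series[0].action] = cumul_reward  # overwrite only the action that was played
print(state, target)  # s0 [0.1, 0.4, 1.5]
```
Only the entry for the played action (index 2 here) changes; the other Q-values stay as predicted.
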
05:36.120 --> 05:40.830
All right, and now we're finally ready to update our inputs

05:40.830 --> 05:43.860
by appending this first input state here,

05:43.860 --> 05:46.920
and this first target here for the first step.

05:46.920 --> 05:50.070
We only need to update the first step of the series

05:50.070 --> 05:52.950
because you know, we train the AI on 10 steps

05:52.950 --> 05:56.160
and therefore the input is the first step of the 10 steps.

05:56.160 --> 05:58.470
And also we get the target in this first step.

05:58.470 --> 06:00.180
But then we don't need to get any inputs

06:00.180 --> 06:03.180
or any targets in the following steps of the 10 steps

06:03.180 --> 06:06.420
because basically the learning happens 10 steps afterwards.

06:06.420 --> 06:09.000
So that's why right now we are only getting the state

06:09.000 --> 06:11.820
and the target of the first step of the series.

06:11.820 --> 06:13.410
So it's important to understand that.

06:13.410 --> 06:15.570
And therefore, if we understand that

06:15.570 --> 06:17.820
then now we understand that we have to input them

06:17.820 --> 06:20.640
in our list of inputs and our list of targets.

06:20.640 --> 06:21.570
So let's do this.

06:21.570 --> 06:25.170
First, let's append the state to our inputs.

06:25.170 --> 06:27.840
So we take our inputs list

06:27.840 --> 06:32.520
and we use the append function to add the state

06:32.520 --> 06:34.020
which, remember, is the input state

06:34.020 --> 06:36.093
of the first step of the series.

06:36.960 --> 06:39.570
And then we are going to append the target

06:39.570 --> 06:42.480
of the first step to our list of targets.

06:42.480 --> 06:45.090
And to do this, we take our list of targets.

06:45.090 --> 06:47.070
And in the same way, we use the append function

06:47.070 --> 06:49.470
to append this first target.

06:49.470 --> 06:51.360
There we go, almost done.

06:51.360 --> 06:54.240
And now we need to return the last things

06:54.240 --> 06:56.070
which are of course what we need

06:56.070 --> 06:58.530
that's what we said at the beginning of this tutorial.

06:58.530 --> 07:02.040
The inputs and the targets that are now updated.

07:02.040 --> 07:04.170
So we're gonna add here return

07:04.170 --> 07:06.480
then we're gonna get our inputs first

07:06.480 --> 07:08.970
but then, same as before, we need to convert them

07:08.970 --> 07:11.940
into a NumPy array first,

07:11.940 --> 07:14.220
then do a type conversion to make sure we have

07:14.220 --> 07:17.850
a single type with dtype equals

07:17.850 --> 07:21.870
np.float32, same as before.

07:21.870 --> 07:25.320
And then we convert this into a torch tensor

07:25.320 --> 07:27.540
because of course we're working with PyTorch

07:27.540 --> 07:29.550
so that's totally compulsory.

07:29.550 --> 07:34.550
And so I'm using the torch.from_numpy function again,

07:37.140 --> 07:39.120
and that gives us our inputs.

07:39.120 --> 07:42.480
Perfect, and now let's do the same for the targets.

07:42.480 --> 07:44.670
Now we're gonna use this trick, which is quicker.

07:44.670 --> 07:46.890
We're gonna stack the targets together.

07:46.890 --> 07:50.670
And to do this, we need to take first our torch library

07:50.670 --> 07:54.060
because we're gonna use the stack function

07:54.060 --> 07:57.540
from torch to stack the targets.

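NOTE
A small sketch of the two conversions just described, assuming NumPy and PyTorch are installed.
The shapes, values, and the names inputs_tensor and targets_tensor are made up for illustration.
```python
import numpy as np
import torch
# Made-up data: two input states of shape (1, 2, 2) and two target vectors over three actions.
inputs = [np.zeros((1, 2, 2)), np.ones((1, 2, 2))]
targets = [torch.tensor([0.1, 0.4, 1.5]), torch.tensor([0.7, 0.2, 0.0])]
# List of arrays to one float32 NumPy array, then to a torch tensor.
inputs_tensor = torch.from_numpy(np.array(inputs, dtype=np.float32))
# List of tensors stacked along a new first dimension.
targets_tensor = torch.stack(targets)
print(inputs_tensor.shape, targets_tensor.shape)
```
Both results have the batch as their first dimension, so each input lines up with its target.
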
07:57.540 --> 08:01.530
And so this line of code basically returns the inputs

08:01.530 --> 08:03.660
and the targets that were just updated

08:03.660 --> 08:07.170
through this eligibility trace (indistinct) algorithm,

08:07.170 --> 08:09.240
or we can call it N-step Q-learning.

08:09.240 --> 08:10.770
And so now congratulations,

08:10.770 --> 08:12.780
we are ready to do the final training

08:12.780 --> 08:15.060
because basically the training consists

08:15.060 --> 08:17.550
of minimizing the squared differences

08:17.550 --> 08:21.120
between the predictions of our inputs and the targets.

08:21.120 --> 08:22.830
So let's get our AI smart.

08:22.830 --> 08:25.170
It will become smart in the next tutorial.

08:25.170 --> 08:27.213
And so until then, enjoy AI.