WEBVTT

00:00.300 --> 00:02.370
Hello and welcome to this tutorial.

00:02.370 --> 00:04.290
Okay, so we just computed the entropy

00:04.290 --> 00:06.240
and added it to the entropies list,

00:06.240 --> 00:07.410
and now what we're gonna do

00:07.410 --> 00:09.510
is take a random draw of an action

00:09.510 --> 00:11.700
according to the distribution of probabilities

00:11.700 --> 00:13.170
of the softmax.

00:13.170 --> 00:14.550
So let's do this.

00:14.550 --> 00:15.810
That's the next step.

00:15.810 --> 00:17.340
We are still in the loop

00:17.340 --> 00:20.130
because we're still looping over the steps here.

00:20.130 --> 00:22.560
And so you now know how to play the action.

00:22.560 --> 00:24.870
We will first introduce a variable

00:24.870 --> 00:27.150
for the action, called action

00:27.150 --> 00:30.150
and then we take our distribution of probabilities

00:30.150 --> 00:34.440
and we're gonna use the multinomial function

00:34.440 --> 00:37.380
to take a random draw from this distribution

00:37.380 --> 00:39.210
of probabilities.

00:39.210 --> 00:41.490
And then we add .data.

00:41.490 --> 00:43.110
So it's important to note

00:43.110 --> 00:46.470
that the action will actually be a tensor

00:46.470 --> 00:47.790
with only one value

00:47.790 --> 00:50.970
but you should not see this as a simple value.

00:50.970 --> 00:53.880
You should see this as a tensor of dimensions

00:53.880 --> 00:57.150
one by one that contains this value for the action

00:57.150 --> 00:59.760
and that's because it is unsqueezed.

00:59.760 --> 01:02.940
And now, still in the same for loop,

01:02.940 --> 01:06.990
we are gonna get the log probability

01:06.990 --> 01:10.170
associated with the action that was selected.

01:10.170 --> 01:13.560
And so I'm updating my log probability here

01:13.560 --> 01:15.000
by taking the previous one

01:15.000 --> 01:18.240
the previous log prob that we computed here.

01:18.240 --> 01:21.810
And then I'm gonna use the gather method

01:21.810 --> 01:24.120
to which I'm going to input one

01:24.120 --> 01:27.510
and the action that was selected, because we want to

01:27.510 --> 01:31.470
get the log probability that is associated with this action.

01:31.470 --> 01:35.490
And so as a second argument here, I'm gonna input my action

01:35.490 --> 01:38.820
but it has to be passed as a torch Variable,

01:38.820 --> 01:40.830
as required by the gather function

01:40.830 --> 01:44.490
and the gather function just indexes with an integer tensor.
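
NOTE
A minimal sketch of the sampling and log-probability lookup described above.
It assumes a current PyTorch release, where multinomial takes an explicit
num_samples argument and plain tensors are used instead of the Variable
wrapper mentioned in the narration. The action scores are made up.
```python
import torch
import torch.nn.functional as F
# Hypothetical action scores for a batch of one state and four actions.
action_values = torch.tensor([[0.5, 1.0, -0.3, 0.1]])
prob = F.softmax(action_values, dim=1)          # probabilities, shape (1, 4)
log_prob = F.log_softmax(action_values, dim=1)  # log-probabilities, shape (1, 4)
# Random draw of one action index per row; the result is a (1, 1) integer tensor.
action = torch.multinomial(prob, num_samples=1)
# Pick, along dimension 1, the log-probability of the sampled action.
selected_log_prob = log_prob.gather(1, action)
print(action.shape, selected_log_prob.shape)
```
Both printed shapes are one by one, which matches the remark that the action
is a tensor holding a single value rather than a plain number.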

01:44.490 --> 01:47.310
All right, so now we just got the log prob associated

01:47.310 --> 01:48.683
with the action that was selected,

01:48.683 --> 01:52.350
and now the next step is to append what we got

01:52.350 --> 01:53.790
to the list here.

01:53.790 --> 01:55.740
So we got the value

01:55.740 --> 01:58.800
that's what we got here as the output of the model.

01:58.800 --> 02:00.900
Then we also got the log prob.

02:00.900 --> 02:04.140
So we are gonna add the log prob to the log probs list.

02:04.140 --> 02:06.900
We already appended the entropy to the entropies list.

02:06.900 --> 02:07.740
So, we're good.

02:07.740 --> 02:09.660
And the reward, we will get it afterwards.

02:09.660 --> 02:12.960
So we will now append the value and the log prob

02:12.960 --> 02:15.480
to the values list and the log probs list.

02:15.480 --> 02:16.313
Let's do this.

02:16.313 --> 02:18.930
We take our values list, we add dot,

02:18.930 --> 02:22.740
we use the append function and we add the value

02:22.740 --> 02:24.960
that was just returned by the model.

02:24.960 --> 02:28.830
Perfect. Then same for the log probs.

02:28.830 --> 02:31.350
We just got our new log prob,

02:31.350 --> 02:36.180
and we are going to append it to the log probs list.

02:36.180 --> 02:40.140
And so in this append function, we input log_prob.

02:40.140 --> 02:42.663
Our log prob, that was just computed here.

02:43.950 --> 02:47.280
All right, so our lists are now well updated.

02:47.280 --> 02:49.740
So now what we're gonna do is play the action

02:49.740 --> 02:52.920
because actually right here we selected the action,

02:52.920 --> 02:55.380
by taking a random draw from this distribution

02:55.380 --> 02:56.670
of probabilities here,

02:56.670 --> 02:58.470
but we actually haven't played it yet.

02:58.470 --> 03:00.330
And we are gonna play it now

03:00.330 --> 03:03.180
so that we can reach the new state

03:03.180 --> 03:05.160
and therefore get the new transition.

03:05.160 --> 03:08.040
And to play it, we're gonna take our environment

03:08.040 --> 03:10.500
because we play the action in our environment.

03:10.500 --> 03:13.200
Then we're gonna use the step method.

03:13.200 --> 03:16.020
And inside we specify the action

03:16.020 --> 03:18.180
that was selected to play it.

03:18.180 --> 03:20.520
And to do this, we take our action

03:20.520 --> 03:22.830
and we add .numpy()

03:22.830 --> 03:26.040
because that's what is expected by the step function. Okay?

03:26.040 --> 03:27.670
But this returns

03:28.860 --> 03:31.470
actually the new state

03:31.470 --> 03:35.970
and also the new reward because by reaching a new state

03:35.970 --> 03:37.860
we get a new reward,

03:37.860 --> 03:40.410
and also we get a new value

03:40.410 --> 03:43.500
for done to know if the game is done or not.

03:43.500 --> 03:44.333
All right?

03:44.333 --> 03:46.080
So with this, we play the action

03:46.080 --> 03:48.510
we reach the new state and we get the new reward

03:48.510 --> 03:50.490
and we know if we're done with the game.

03:50.490 --> 03:52.950
And speaking of being done with the game,

03:52.950 --> 03:55.140
well, we're just gonna add something here

03:55.140 --> 03:58.290
that will make sure that the agent is not stuck

03:58.290 --> 03:59.220
in some state.

03:59.220 --> 04:00.053
And to do that

04:00.053 --> 04:03.873
we're gonna update the done variable the following way.

04:04.830 --> 04:07.440
Well, it's gonna be equal to done

04:07.440 --> 04:09.600
or we're gonna add a condition

04:09.600 --> 04:12.210
saying that the episode of the game

04:12.210 --> 04:14.370
should not last too long.

04:14.370 --> 04:16.590
And you will see in the main function

04:16.590 --> 04:20.160
that there will be a max episode length parameter

04:20.160 --> 04:22.140
which will be equal to 10,000.

04:22.140 --> 04:25.830
And we don't want an episode to last more than 10,000 steps.

04:25.830 --> 04:29.190
So we're gonna add here episode,

04:29.190 --> 04:30.150
length,

04:30.150 --> 04:32.820
which is the length of an episode,

04:32.820 --> 04:34.830
and we're gonna write the condition

04:34.830 --> 04:36.090
larger,

04:36.090 --> 04:38.310
than max,

04:38.310 --> 04:40.560
episode length.

04:40.560 --> 04:41.880
There we go and actually

04:41.880 --> 04:45.990
max length, we are getting it from our parameters,

04:45.990 --> 04:49.293
therefore I'm adding here params: params.max_episode_length.

04:50.580 --> 04:54.900
So this means that if the game is done,

04:54.900 --> 04:57.660
or the length of the episode is larger

04:57.660 --> 05:00.570
than the maximum episode length we set,

05:00.570 --> 05:03.510
which will be equal to 10,000, then the game will be done

05:03.510 --> 05:04.960
and we will start a new game.
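
NOTE
The update of done can be sketched as a small function. This is only an
illustration: the helper name update_done is hypothetical, and because the
narration describes the comparison as "larger than", a strict > is assumed.
```python
MAX_EPISODE_LENGTH = 10_000  # value quoted in the narration
def update_done(done, episode_length, max_episode_length=MAX_EPISODE_LENGTH):
    # The episode ends if the environment says so or if it has run too long.
    return done or episode_length > max_episode_length
print(update_done(False, 42))      # False
print(update_done(False, 10_001))  # True
print(update_done(True, 42))       # True
```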

05:05.910 --> 05:08.160
Okay? So that's just a precaution.

05:08.160 --> 05:10.470
And speaking of precaution, we're gonna add

05:10.470 --> 05:11.730
another precaution.

05:11.730 --> 05:15.420
It's to clamp the reward between minus one and plus one.

05:15.420 --> 05:17.160
We already got the reward here,

05:17.160 --> 05:19.200
but we want to make sure that the reward

05:19.200 --> 05:21.090
is between minus one and plus one.

05:21.090 --> 05:24.150
And to do this, we simply need to update the reward

05:24.150 --> 05:26.400
by doing this: taking the max,

05:26.400 --> 05:29.610
then taking the min of reward

05:29.610 --> 05:31.170
and one.

05:31.170 --> 05:34.350
And here we take the max of the minimum of reward and one

05:34.350 --> 05:36.150
and minus one.

05:36.150 --> 05:38.670
And that will make sure the reward is between minus one

05:38.670 --> 05:40.860
and plus one. All right?
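
NOTE
The clamping step can be checked in isolation. A minimal sketch, where the
function name clamp_reward and its low/high parameters are hypothetical
names introduced for illustration.
```python
def clamp_reward(reward, low=-1, high=1):
    # min caps the reward at `high`, then max lifts it to at least `low`.
    return max(min(reward, high), low)
print(clamp_reward(5.0))   # 1
print(clamp_reward(-3.2))  # -1
print(clamp_reward(0.25))  # 0.25
```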

05:40.860 --> 05:42.360
So another precaution,

05:42.360 --> 05:46.470
and now we just want to check if the game is done

05:46.470 --> 05:49.200
in which case we will restart the environment.

05:49.200 --> 05:50.550
And why do we need to check that now?

05:50.550 --> 05:53.040
It's because we just reached a new state.

05:53.040 --> 05:54.870
We just passed a new transition.

05:54.870 --> 05:58.080
So we need to check, after passing this new transition,

05:58.080 --> 05:59.880
whether the game is done.

05:59.880 --> 06:04.880
So if done, then in that case,

06:05.220 --> 06:07.260
we will restart the environment

06:07.260 --> 06:09.670
by setting the episode

06:10.830 --> 06:11.663
length,

06:13.140 --> 06:14.310
to zero.

06:14.310 --> 06:18.060
And also the state will be reinitialized.

06:18.060 --> 06:19.320
And to reinitialize it,

06:19.320 --> 06:23.403
we take our environment and we use the reset function.

06:24.570 --> 06:27.480
Okay, now we get out of this if condition.

06:27.480 --> 06:29.190
That was just a check.

06:29.190 --> 06:32.820
And now what we're gonna do is since we reached a new state

06:32.820 --> 06:35.190
well, this new state is right now a NumPy array

06:35.190 --> 06:38.130
because remember, the states are the input images

06:38.130 --> 06:40.560
which originally are NumPy arrays.

06:40.560 --> 06:43.260
And so now what we have to do is to convert the new state

06:43.260 --> 06:44.580
into a torch tensor.

06:44.580 --> 06:47.010
So we are going to update our state

06:47.010 --> 06:50.610
and we're gonna use the torch library,

06:50.610 --> 06:54.090
and of course the from_numpy

06:54.090 --> 06:55.350
function

06:55.350 --> 07:00.180
to convert this NumPy array state, the input images,

07:00.180 --> 07:02.130
into a torch tensor.
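
NOTE
A short sketch of the conversion just described. The array below is a
hypothetical stand-in for an observation; its 1 x 42 x 42 shape and float32
type are assumptions, not values taken from the tutorial.
```python
import numpy as np
import torch
# Hypothetical stand-in for an input image returned by the environment.
state = np.zeros((1, 42, 42), dtype=np.float32)
state = torch.from_numpy(state)  # shares memory with the NumPy array, no copy
print(type(state), state.dtype, state.shape)
```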

07:02.130 --> 07:03.210
Perfect.

07:03.210 --> 07:05.190
And now the last thing we need to do

07:05.190 --> 07:07.170
before getting out of this for loop,

07:07.170 --> 07:08.760
that is the loop on our steps.

07:08.760 --> 07:10.140
Well, it's to of course

07:10.140 --> 07:12.990
append the reward to the rewards list.

07:12.990 --> 07:15.270
That's the last thing that needs to be updated.

07:15.270 --> 07:18.300
We updated all the lists here except for the rewards.

07:18.300 --> 07:19.530
So we're gonna do that right now.

07:19.530 --> 07:23.640
We take our rewards and we use the append function

07:23.640 --> 07:27.420
to append the last reward that was just received.

07:27.420 --> 07:28.253
Perfect.

07:28.253 --> 07:31.350
And just before we get out of the for loop

07:31.350 --> 07:35.400
we just need to do one last check.

07:35.400 --> 07:39.480
If it's done, then we want to stop the exploration.

07:39.480 --> 07:42.540
And so we're simply going to add here a break.

07:42.540 --> 07:46.140
Meaning that if it's done, we stop the exploration

07:46.140 --> 07:48.210
and we directly move on to the next step,

07:48.210 --> 07:51.570
which will be the update of the shared model.

07:51.570 --> 07:54.510
And now we are done with this for loop.
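
NOTE
The control flow of the loop can be sketched without the model. ToyEnv and
explore are hypothetical stand-ins, not the course's environment or training
function. Counting episode_length inside the loop, the strict > comparison,
and the random placeholder action are assumptions made for this illustration.
```python
import random
class ToyEnv:
    """Hypothetical environment whose episodes end after a fixed number of steps."""
    def __init__(self, episode_steps=5):
        self.episode_steps = episode_steps
        self.t = 0
    def reset(self):
        self.t = 0
        return 0  # initial state
    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_steps
        return self.t, 3.0, done, {}  # state, raw reward, done flag, info
def explore(env, num_steps=20, max_episode_length=10_000):
    rewards, episode_length = [], 0
    state = env.reset()
    for _ in range(num_steps):
        episode_length += 1                       # assumption: counted per step here
        action = random.randrange(2)              # placeholder for the sampled action
        state, reward, done, _ = env.step(action)
        done = done or episode_length > max_episode_length
        reward = max(min(reward, 1), -1)          # clamp to [-1, 1]
        if done:
            episode_length = 0
            state = env.reset()                   # restart the environment
        rewards.append(reward)                    # last list to update
        if done:
            break                                 # stop exploring, go update the model
    return rewards, episode_length
rewards, episode_length = explore(ToyEnv())
print(len(rewards), rewards[-1], episode_length)
```
With the toy environment the loop stops after five steps, every stored reward
has been clamped to 1, and the episode length is back at zero.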

07:54.510 --> 07:58.170
Now that the agent has done its exploration,

07:58.170 --> 08:00.810
it'll update the shared model.

08:00.810 --> 08:03.450
And we will take care of that in the next tutorial.

08:03.450 --> 08:05.163
Until then, enjoy A.I.