WEBVTT

00:00.270 --> 00:02.130
Instructor: Hello and welcome to this tutorial.

00:02.130 --> 00:05.460
Congratulations again for being done with the A3C.

00:05.460 --> 00:08.400
We made it, we made the brains and we trained them

00:08.400 --> 00:11.310
but now we still have to make a test agent

00:11.310 --> 00:14.160
which will not update the model at all,

00:14.160 --> 00:17.160
but it will just use the shared model

00:17.160 --> 00:19.500
to make its own explorations.

00:19.500 --> 00:22.470
And of course, in this code we will record some videos

00:22.470 --> 00:24.180
and these will be the videos

00:24.180 --> 00:27.843
of this test agent playing Breakout with a certain score.

00:28.800 --> 00:30.840
So let's go through this code.

00:30.840 --> 00:32.400
The most important part is done, so

00:32.400 --> 00:35.220
as I told you, we're not going to code it line by line

00:35.220 --> 00:36.270
but I think it's important

00:36.270 --> 00:38.760
that you understand what's going on here.

00:38.760 --> 00:40.980
So there we go with this code.

00:40.980 --> 00:42.750
In the first section, as you noticed

00:42.750 --> 00:44.610
we import the libraries

00:44.610 --> 00:47.520
and then we define this test function

00:47.520 --> 00:50.730
which will make this test agent do its own exploration

00:50.730 --> 00:52.560
and play the Breakout game.

00:52.560 --> 00:53.393
So there we go.

00:53.393 --> 00:55.350
This test function takes three arguments.

00:55.350 --> 00:56.340
The first one is rank,

00:56.340 --> 00:59.790
and that is still to desynchronize the test agent

00:59.790 --> 01:02.190
as we did for the training agents.

01:02.190 --> 01:03.900
Then we have our parameters of course

01:03.900 --> 01:05.370
because we will need some.

01:05.370 --> 01:07.830
And of course we have the shared model

01:07.830 --> 01:11.070
because this test agent will use the shared model

01:11.070 --> 01:13.350
to do its own exploration.

01:13.350 --> 01:15.540
All right, then there we go inside the function

01:15.540 --> 01:16.710
in this line of code

01:16.710 --> 01:18.810
we desynchronize the test agent

01:18.810 --> 01:20.760
exactly as we did before.
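
NOTE
A minimal sketch of the desynchronization idea, assuming it amounts to offsetting a base seed by the agent's rank. The transcript does not show the actual seeding call, so Python's random module stands in for it here, and make_rng and base_seed are hypothetical names.
```python
import random
def make_rng(seed, rank):
    # Hypothetical helper: shift the shared base seed by the agent's rank
    # so that every agent gets its own random stream.
    return random.Random(seed + rank)
base_seed = 1  # illustrative value, not taken from the course code
train_rng = make_rng(base_seed, rank=0)
test_rng = make_rng(base_seed, rank=1)
train_draws = [train_rng.random() for _ in range(3)]
test_draws = [test_rng.random() for _ in range(3)]
# Rebuilding the generator with the same seed and rank replays the same stream.
replay_rng = make_rng(base_seed, rank=1)
replay_draws = [replay_rng.random() for _ in range(3)]
print(train_draws != test_draws, test_draws == replay_draws)
```
Different ranks give different streams, while the same rank stays reproducible from run to run.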

01:20.760 --> 01:23.130
Then we import the environment.

01:23.130 --> 01:25.410
So I remind you that in the main code

01:25.410 --> 01:27.210
which will be in the next tutorial,

01:27.210 --> 01:29.850
the env_name here will be replaced by

01:29.850 --> 01:30.750
Breakout-v0,

01:30.750 --> 01:34.050
so that we can go into the Breakout-v0 environment

01:34.050 --> 01:35.220
and play the game.

01:35.220 --> 01:36.600
And video=True means

01:36.600 --> 01:40.140
that we will get the videos of our AI playing Breakout.

01:40.140 --> 01:43.020
So basically this line of code in total

01:43.020 --> 01:46.643
means that we run one environment with a video.

01:46.643 --> 01:48.810
Then at the next line of code

01:48.810 --> 01:51.300
we desynchronize this environment.

01:51.300 --> 01:54.660
So exact same principle as in the train function.

01:54.660 --> 01:56.820
Then we get our model.

01:56.820 --> 02:00.690
And to do this, we create an object of the ActorCritic class

02:00.690 --> 02:02.730
and we input the input shape

02:02.730 --> 02:06.390
with env.observation_space.shape[0].

02:06.390 --> 02:08.610
So exactly like for the train function.

02:08.610 --> 02:13.610
And our outputs, which are the actions, with env.action_space.

02:13.890 --> 02:17.550
So exactly like before. Then something new here:

02:17.550 --> 02:19.830
since we are done with the training

02:19.830 --> 02:22.800
we don't want to put the model in train mode

02:22.800 --> 02:24.750
because simply we don't want it to train.

02:24.750 --> 02:27.150
We want to put it in eval mode.

02:27.150 --> 02:29.880
And that's what we do here with model.eval.

02:29.880 --> 02:33.330
So that's just basically to put the test agent

02:33.330 --> 02:36.060
in a mode that will basically test it

02:36.060 --> 02:38.640
test its performance, evaluate it.

02:38.640 --> 02:41.490
Then here we get our input states

02:41.490 --> 02:43.830
which are the input images from the game

02:43.830 --> 02:46.800
which at this point are NumPy arrays.

02:46.800 --> 02:49.470
Then here we convert them into torch tensors.

02:49.470 --> 02:52.260
Here we initialize the sum of the rewards

02:52.260 --> 02:55.200
here we initialize done to true.

02:55.200 --> 02:58.920
So still just like last time, then something new again.

02:58.920 --> 03:01.620
We introduce this start_time variable

03:01.620 --> 03:03.810
with the time.time function

03:03.810 --> 03:06.150
to measure the time of computations.

03:06.150 --> 03:08.820
And that's because we want to get the starting point.

03:08.820 --> 03:12.120
Then here for the actions, we use a very practical type

03:12.120 --> 03:14.940
of queue that allows us to add an element to the queue

03:14.940 --> 03:16.590
from the right or from the left.

03:16.590 --> 03:17.760
So that's very practical

03:17.760 --> 03:19.980
and I'll give you the reference link

03:19.980 --> 03:22.170
in the commented version of the code.

03:22.170 --> 03:25.140
So you'll have a look at what this deque is

03:25.140 --> 03:27.450
and that's what allows to do that.
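
NOTE
A short sketch of the double-ended queue described above, using collections.deque from the Python standard library. The maxlen of 3 and the stored values are chosen purely for illustration and are not taken from the course code.
```python
from collections import deque
actions = deque(maxlen=3)        # illustrative capacity only
actions.append(1)                # add on the right
actions.append(2)                # add on the right again
actions.appendleft(0)            # add on the left
snapshot_full = list(actions)    # left-to-right contents while full
actions.append(3)                # full deque: the leftmost item is dropped
snapshot_after = list(actions)
actions.clear()                  # empty the queue, as done when a game ends
print(snapshot_full, snapshot_after, len(actions))
```
With a maxlen set, appending on one side of a full deque silently discards the item at the opposite end.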

03:27.450 --> 03:30.240
Then we initialize the length of an episode

03:30.240 --> 03:31.500
with zero of course,

03:31.500 --> 03:34.830
and then we will increment the size in this while loop.

03:34.830 --> 03:36.600
So we use the same trick here.

03:36.600 --> 03:40.920
while True, and in the loop we increment the length

03:40.920 --> 03:42.480
of the episode by one.

03:42.480 --> 03:45.000
When the game is done, when the game is over,

03:45.000 --> 03:48.870
we reload the last state of the shared model,

03:48.870 --> 03:51.450
the shared model that was updated by the other models.

03:51.450 --> 03:55.170
Remember that here, the shared model is no longer updated.

03:55.170 --> 03:59.040
Then still if the game is over, if the game is done,

03:59.040 --> 04:03.690
we re-init, we re-initialize the cell states, CX

04:03.690 --> 04:06.720
and the hidden states, HX.

04:06.720 --> 04:10.140
And else if the game is not over,

04:10.140 --> 04:13.830
well we keep the same cell states and hidden states

04:13.830 --> 04:15.990
but we make sure they are in a torch variable

04:15.990 --> 04:18.210
so that they can be attached to a gradient.

04:18.210 --> 04:20.370
Okay, so that's something we already did

04:20.370 --> 04:24.270
in the train function and then still in the while loop.

04:24.270 --> 04:27.480
And after having updated the cell states

04:27.480 --> 04:29.130
and the hidden states the right way,

04:29.130 --> 04:31.860
depending on the two cases here, well what do we do?

04:31.860 --> 04:34.350
We get the predictions of the model.

04:34.350 --> 04:37.740
So that's exactly what we do here with this line of code.

04:37.740 --> 04:38.790
So we get the value

04:38.790 --> 04:40.590
which is the output of the critic,

04:40.590 --> 04:43.590
the action value which is the output of the actor,

04:43.590 --> 04:46.050
and then the tuple of the hidden states, HX

04:46.050 --> 04:48.240
and the cell states CX.

04:48.240 --> 04:50.880
Then we generate a distribution of probabilities

04:50.880 --> 04:54.240
of the actions that is of the Q values, action value here.

04:54.240 --> 04:56.430
And we do this with the softmax function.

04:56.430 --> 04:58.020
And of course we don't need to get

04:58.020 --> 04:59.670
the log probabilities here,

04:59.670 --> 05:02.640
because this is just for the training. For the test agent,

05:02.640 --> 05:04.530
it will just play the actions.

05:04.530 --> 05:06.450
It will just use, you know,

05:06.450 --> 05:08.850
like Doom, a softmax body to play it.

05:08.850 --> 05:10.920
But we are not doing any training here.

05:10.920 --> 05:12.540
So we have just the prob.

05:12.540 --> 05:15.870
And from this prob, we play the action

05:15.870 --> 05:19.050
by taking directly the argmax of these probabilities.

05:19.050 --> 05:20.580
That is, it takes the action

05:20.580 --> 05:22.770
that has the highest probability.

05:22.770 --> 05:24.000
And the reason is that

05:24.000 --> 05:26.820
the test agent doesn't do any exploration.

05:26.820 --> 05:29.460
Remember that we want to give a chance to pick

05:29.460 --> 05:32.400
up some actions that have low probabilities

05:32.400 --> 05:35.670
when we want to do some exploration of these other actions

05:35.670 --> 05:37.890
and you know, not taking each time the action

05:37.890 --> 05:39.840
that has the highest probability.

05:39.840 --> 05:42.450
But here the test agent won't do any exploration

05:42.450 --> 05:44.280
and therefore that's why we directly

05:44.280 --> 05:47.370
take the action that has the maximum probability.
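
NOTE
A minimal sketch of the softmax-then-argmax selection described above, written in plain Python so it runs without any deep learning library. The scores in action_value are made up, and softmax and greedy_action are hypothetical helper names, not functions from the course code.
```python
import math
def softmax(scores):
    # Subtract the maximum before exponentiating, for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
def greedy_action(probs):
    # Index of the largest probability: no sampling, hence no exploration.
    return max(range(len(probs)), key=lambda i: probs[i])
action_value = [0.5, 2.0, -1.0, 0.1]  # made-up actor outputs for four actions
prob = softmax(action_value)
action = greedy_action(prob)
print(prob, action)
```
A training agent would instead sample an action from prob, which is what gives low-probability actions a chance to be tried.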

05:47.370 --> 05:49.740
Okay, and then once we play the action

05:49.740 --> 05:53.460
we reach the next state and we get the next reward.

05:53.460 --> 05:57.150
And done is updated whether or not the game is over.

05:57.150 --> 05:58.620
So this, we get all this

05:58.620 --> 05:59.550
with this line of code

05:59.550 --> 06:02.040
by playing the action after

06:02.040 --> 06:04.830
having selected it with the argmax here.

06:04.830 --> 06:08.280
So we play the action here and we get the state

06:08.280 --> 06:11.460
we get the reward and done is updated.

06:11.460 --> 06:14.430
Okay, and then since we just got a new reward

06:14.430 --> 06:16.830
we're gonna update the sum of the reward

06:16.830 --> 06:19.170
by simply adding this new reward

06:19.170 --> 06:21.510
and finally whenever the game is done,

06:21.510 --> 06:24.300
so if done means when the game is done

06:24.300 --> 06:27.000
when the AI finishes playing the game,

06:27.000 --> 06:29.250
well we're gonna print the results

06:29.250 --> 06:30.660
with the time,

06:30.660 --> 06:32.430
the episode reward,

06:32.430 --> 06:33.660
the length of the episode,

06:33.660 --> 06:37.440
that is how much time did it last playing Breakout.

06:37.440 --> 06:41.130
And this is just how we print all these variables

06:41.130 --> 06:42.840
using these time tricks.

06:42.840 --> 06:44.070
So that's for the time.

06:44.070 --> 06:46.140
Then reward sum is just a variable

06:46.140 --> 06:47.250
for the sum of the rewards.

06:47.250 --> 06:49.140
And episode length is the variable

06:49.140 --> 06:51.210
for the length of an episode.
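
NOTE
One possible way to print the three values named above, assuming the elapsed time is formatted with time.gmtime and time.strftime from the standard library. The transcript does not show the exact format string, so the layout, the format_elapsed helper, and the sample numbers are all illustrative.
```python
import time
def format_elapsed(start_time, now):
    # Treat the elapsed seconds as a clock time since the epoch.
    # Caveat: the hour field wraps around after 24 hours.
    return time.strftime("%Hh %Mm %Ss", time.gmtime(now - start_time))
reward_sum = 21.0       # made-up episode reward
episode_length = 840    # made-up number of steps
line = "Time {}, episode reward {}, episode length {}".format(
    format_elapsed(0.0, 3725.0), reward_sum, episode_length)
print(line)
```
In a real run, start_time would come from time.time() before the loop and now from time.time() at the end of each game.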

06:51.210 --> 06:54.720
Okay, and then once we printed all the results,

06:54.720 --> 06:58.170
well since the game is over and we wanna start a new game

06:58.170 --> 07:00.450
we're gonna re-initialize everything that is,

07:00.450 --> 07:02.190
the sum of the reward to zero,

07:02.190 --> 07:04.170
the length of an episode to zero.

07:04.170 --> 07:06.180
We are gonna re-init all the actions

07:06.180 --> 07:08.160
by using this clear function,

07:08.160 --> 07:10.110
reset the input images you know,

07:10.110 --> 07:13.590
by re-putting all the bricks all together.

07:13.590 --> 07:17.160
And finally, we use this time.sleep,

07:17.160 --> 07:19.680
60 seconds to just do a break

07:19.680 --> 07:22.830
of one minute to let the other agents practice.

07:22.830 --> 07:25.230
And that's if the game is over.

07:25.230 --> 07:29.160
Okay, and finally, we have this last line of code,

07:29.160 --> 07:30.780
which will get us the new state

07:30.780 --> 07:32.250
and then we can move forward.

07:32.250 --> 07:34.530
We can continue in this new game.
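
NOTE
A rough sketch of the bookkeeping in the test loop walked through above, with a toy environment in place of Breakout and a fixed policy in place of the model. ToyEnv, run_test_agent, and max_episodes are hypothetical, and the model reload, the LSTM states, and the one-minute pause are left out.
```python
class ToyEnv:
    """Hypothetical stand-in for the game: an episode ends after `horizon` steps."""
    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0
    def reset(self):
        self.t = 0
        return self.t
    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.horizon
        return self.t, reward, done
def run_test_agent(env, policy, max_episodes=2):
    results = []
    state = env.reset()
    reward_sum, episode_length = 0.0, 0
    while len(results) < max_episodes:  # the walkthrough loops forever instead
        episode_length += 1
        state, reward, done = env.step(policy(state))
        reward_sum += reward
        if done:
            results.append((reward_sum, episode_length))
            reward_sum, episode_length = 0.0, 0  # start the next game from scratch
            state = env.reset()
    return results
results = run_test_agent(ToyEnv(horizon=5), policy=lambda state: 1)
print(results)
```
The loop is capped at max_episodes only so that the sketch terminates.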

07:34.530 --> 07:35.850
So there we go.

07:35.850 --> 07:37.440
That's the test function,

07:37.440 --> 07:39.090
thanks to which we will see the videos

07:39.090 --> 07:40.500
in one or two tutorials.

07:40.500 --> 07:42.780
I hope we will be all together like last time

07:42.780 --> 07:43.980
to watch the results,

07:43.980 --> 07:46.380
that is with you, Kirill, and me.

07:46.380 --> 07:47.490
That will be fun.

07:47.490 --> 07:50.370
And I'm telling you, expect to see good results.

07:50.370 --> 07:51.540
But keep in mind,

07:51.540 --> 07:55.080
that this Breakout game was super challenging.

07:55.080 --> 07:57.300
We thought it was the simplest game to play first,

07:57.300 --> 07:58.440
but not at all.

07:58.440 --> 08:01.650
It actually turned out to be much more difficult than Doom

08:01.650 --> 08:04.170
and that's why we put it in the last module.

08:04.170 --> 08:07.569
But anyway, let's make this main function

08:07.569 --> 08:09.600
in the next tutorial.

08:09.600 --> 08:12.690
Same thing, this is not the most important here now that

08:12.690 --> 08:14.640
the A3C is all implemented.

08:14.640 --> 08:16.170
So we will not code it line by line

08:16.170 --> 08:17.550
and we'll explain the code

08:17.550 --> 08:20.520
and very quickly we will get to the results.

08:20.520 --> 08:22.263
Until then, enjoy AI.