WEBVTT

00:00.720 --> 00:02.010
-: Hello and welcome back to the course

00:02.010 --> 00:03.720
on artificial intelligence.

00:03.720 --> 00:04.860
In today's tutorial, we're talking

00:04.860 --> 00:07.470
about the final A in A3C.

00:07.470 --> 00:09.540
We're talking about advantage.

00:09.540 --> 00:10.373
So, there it is.

00:10.373 --> 00:12.330
We've already spoken about actor critic

00:12.330 --> 00:14.220
and asynchronous previously,

00:14.220 --> 00:16.320
and slowly built our way

00:16.320 --> 00:18.690
to what we are going to be looking at today.

00:18.690 --> 00:21.720
And with advantage, we're going to put everything together.

00:21.720 --> 00:23.640
So, this is what we have so far.

00:23.640 --> 00:25.170
We've got a neural network

00:25.170 --> 00:27.751
which is shared between the agents,

00:27.751 --> 00:29.550
the asynchronous agents,

00:29.550 --> 00:31.920
and then we've got the critic

00:31.920 --> 00:33.270
which is also shared between agents.

00:33.270 --> 00:34.620
So, how does this all play out

00:34.620 --> 00:36.450
and why is this critic shared between the agents?

00:36.450 --> 00:37.620
Let's have a look at that.

00:37.620 --> 00:38.940
Well, to understand this better,

00:38.940 --> 00:39.900
we're going to look at an example.

00:39.900 --> 00:42.330
We're going to look at this agent, for instance

00:42.330 --> 00:44.760
and see what happens when he's in a certain state

00:44.760 --> 00:46.830
and he needs to make a decision of what action to play.

00:46.830 --> 00:50.910
So, this agent is in a state, he sees this image,

00:50.910 --> 00:53.490
and then what happens is this information goes

00:53.490 --> 00:54.720
into the neural network.

00:54.720 --> 00:56.640
It goes through the convolutional layer,

00:56.640 --> 00:58.920
then it goes into the pooling layer, then it goes

00:58.920 --> 01:01.710
into the flattening layer, and then from there it goes

01:01.710 --> 01:03.840
into the hidden layers of the neural network.

01:03.840 --> 01:07.997
And then as an output, he gets all of these policy values

01:07.997 --> 01:09.930
the Q values, or the policy.

01:09.930 --> 01:14.100
And also he gets the V value, the critic value.

01:14.100 --> 01:17.670
And so, as we know, neural networks, in order to operate

01:17.670 --> 01:21.090
they need to propagate certain errors

01:21.090 --> 01:22.740
or losses back through the network.

01:22.740 --> 01:25.110
That is how the weights get updated.

01:25.110 --> 01:28.290
So what losses

01:28.290 --> 01:29.580
are we going to be dealing with here?

01:29.580 --> 01:30.540
Well, we've got two losses.

01:30.540 --> 01:32.940
We've got the value loss and the policy loss.

01:32.940 --> 01:34.890
So, the value loss is linked to the V value,

01:34.890 --> 01:36.960
policy loss is linked to policy.

01:36.960 --> 01:41.340
And so the value loss, we've already dealt with it before.

01:41.340 --> 01:44.700
We know that we have rewards

01:44.700 --> 01:46.440
and we know that we have a discount factor.

01:46.440 --> 01:50.670
So, basically this is very similar to what we were talking

01:50.670 --> 01:54.720
about in the deep Q-learning tutorials.

01:54.720 --> 01:58.830
Basically, the network predicts a certain value, V

01:58.830 --> 02:03.450
and at the same time, based

02:03.450 --> 02:05.640
on what we know about the environment so far,

02:05.640 --> 02:09.210
we can estimate what the value V should be in the state,

02:09.210 --> 02:11.580
and by comparing the two, we can calculate the value loss

02:11.580 --> 02:13.860
and then backpropagate it through the network to update the weights.
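
NOTE
A minimal sketch of the value loss idea described above. It assumes a
squared-error loss and a discounted sum of observed rewards as the target;
the tutorial does not give the exact formula. The function names, the
reward list and gamma=0.5 are hypothetical, chosen only for illustration.
```python
def discounted_return(rewards, gamma, bootstrap_value=0.0):
    # Fold the rewards from last to first: R = r + gamma * R.
    ret = bootstrap_value
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret
def value_loss(predicted_v, target_v):
    # Assumed form: half the squared error between prediction and target.
    return 0.5 * (target_v - predicted_v) ** 2
rewards = [1.0, 0.0, 2.0]  # hypothetical rewards observed after the state
target = discounted_return(rewards, gamma=0.5)
loss = value_loss(predicted_v=1.0, target_v=target)
print(target, loss)
```
The further the critic's prediction is from the target, the larger the
loss that gets backpropagated.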

02:13.860 --> 02:15.150
So that's pretty straightforward.

02:15.150 --> 02:17.760
The new thing here is the policy loss.

02:17.760 --> 02:21.690
And so what is this policy loss and how does it work?

02:21.690 --> 02:25.560
Well, this is the part where this whole situation

02:25.560 --> 02:28.500
where the critic is shared between the actors

02:28.500 --> 02:32.580
or between the agents, is going to finally emerge.

02:32.580 --> 02:34.650
So, to understand policy loss

02:34.650 --> 02:36.900
we need to introduce a value called advantage

02:36.900 --> 02:40.740
hence the name of this part of this tutorial,

02:40.740 --> 02:43.320
and this whole part of the A3C algorithm, the advantage.

02:43.320 --> 02:47.910
And the advantage is calculated as Q(s, a) minus V(s).

02:47.910 --> 02:51.843
So basically, the Q value

02:51.843 --> 02:53.730
of the action that you chose to play

02:53.730 --> 02:55.200
in the state that you were in,

02:55.200 --> 02:57.630
state S minus the value of that state.

02:57.630 --> 02:59.790
So, this is the difference between the two,

02:59.790 --> 03:00.720
and that is called advantage.
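
NOTE
A small worked example of the advantage as defined above,
A(s, a) = Q(s, a) - V(s). The action names and all numbers are
hypothetical; in A3C they would come from the network's outputs.
```python
def advantage(q_value, state_value):
    # A(s, a) = Q(s, a) - V(s)
    return q_value - state_value
q_values = {"left": 1.0, "right": 3.0}  # hypothetical Q(s, a) per action
v = 2.0  # hypothetical critic estimate V(s)
advantages = {a: advantage(q, v) for a, q in q_values.items()}
print(advantages)
```
A positive advantage means the chosen action did better than the critic's
estimate for the state; a negative one means it did worse.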

03:00.720 --> 03:03.090
And advantage is used

03:03.090 --> 03:04.980
in the calculation of the policy loss.

03:04.980 --> 03:06.420
Now, we won't go into the formula

03:06.420 --> 03:09.450
of the policy loss calculation because it's quite complex.

03:09.450 --> 03:11.700
It uses entropy, or it can use entropy,

03:11.700 --> 03:12.605
it doesn't have to.

03:12.605 --> 03:14.730
We're not going to dissect that formula

03:14.730 --> 03:17.040
but we're going to understand this on an intuitive level.

03:17.040 --> 03:17.873
Why are we doing this?

03:17.873 --> 03:20.040
Why are we calculating this advantage

03:20.040 --> 03:21.630
and how is it going to help us?

03:21.630 --> 03:24.150
Well, let's look at this for a second.

03:24.150 --> 03:26.010
The Q value here,

03:26.010 --> 03:30.990
comes from what the neural network predicted for this agent.

03:30.990 --> 03:34.470
So, it predicted Q values

03:34.470 --> 03:36.510
in this specific state for the actions that it can play.

03:36.510 --> 03:39.210
So, it's got these actions and it can select one

03:39.210 --> 03:41.430
of them and play it.

03:41.430 --> 03:44.150
Whereas the V value,

03:44.150 --> 03:46.440
or the value that is dictated by the critic,

03:46.440 --> 03:50.070
it is the value that we have here in this shared part.

03:50.070 --> 03:52.350
And that's the key here, that this part is shared.

03:52.350 --> 03:55.890
So, the critic, because this is how the critic comes

03:55.890 --> 03:58.740
into play, because we've got a value that we choose

03:58.740 --> 04:00.300
or the action that we choose to play

04:00.300 --> 04:02.010
for this agent in that state.

04:02.010 --> 04:06.360
But then the critic can tell us what is the known value

04:06.360 --> 04:07.200
of that state.

04:07.200 --> 04:10.948
What is overall the known value for this whole group

04:10.948 --> 04:14.130
of agents that are performing together?

04:14.130 --> 04:15.900
Not necessarily

04:15.900 --> 04:16.860
because they're sharing the network,

04:16.860 --> 04:19.950
but because they're sharing the critic, they're all contributing

04:19.950 --> 04:22.680
to these V values that are being calculated

04:22.680 --> 04:23.513
for different states.

04:23.513 --> 04:25.410
So, the whole A3C algorithm says,

04:25.410 --> 04:29.130
okay so, the critic knows a V value.

04:29.130 --> 04:32.310
How much better is your Q value that you're

04:32.310 --> 04:35.340
selecting compared to the known V value?

04:35.340 --> 04:36.630
That's what it's saying.

04:36.630 --> 04:37.980
So, that's basically it.

04:37.980 --> 04:40.050
So okay, I'm going to select

04:40.050 --> 04:43.488
a Q value here based on my policy,

04:43.488 --> 04:45.210
based on whatever we use,

04:45.210 --> 04:47.200
like a softmax function

04:48.120 --> 04:50.037
or an epsilon-greedy policy or something like that.
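
NOTE
A minimal sketch of softmax action selection as mentioned above, using
only the Python standard library. The scores, the seed and the function
names are hypothetical; this is one way to sample, not necessarily the
one used later in the course.
```python
import math
import random
def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
def sample_action(scores, rng):
    # Draw an action index with probability given by the softmax.
    probs = softmax(scores)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]
scores = [1.0, 2.0, 3.0]  # hypothetical per-action outputs of the network
rng = random.Random(0)
probs = softmax(scores)
action = sample_action(scores, rng)
```
Higher scores get higher probability, but every action can still be
picked, which is where the exploration comes from.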

04:50.037 --> 04:51.780
And of course, there'll be exploration

04:51.780 --> 04:54.120
plus exploitation combined in there

04:54.120 --> 04:55.530
but we select a Q value.

04:55.530 --> 04:57.150
And now the question is,

04:57.150 --> 05:00.270
what is the advantage?

05:00.270 --> 05:01.103
Hence it's called advantage.

05:01.103 --> 05:02.070
What is the advantage

05:02.070 --> 05:04.920
that your selected action brings compared

05:04.920 --> 05:07.830
to the known value of that state?

05:07.830 --> 05:09.450
And that is the essence of the advantage.

05:09.450 --> 05:12.090
And basically then that is used to

05:12.090 --> 05:13.680
calculate the policy loss.

05:13.680 --> 05:16.600
And then the policy loss is backpropagated

05:16.600 --> 05:17.940
back through the network.

05:17.940 --> 05:20.730
So they're both backpropagated through the network

05:20.730 --> 05:23.370
and the weights are adjusted in order

05:23.370 --> 05:26.160
for the network to better represent the value of the critic.

05:26.160 --> 05:28.410
And also, so that's this top part,

05:28.410 --> 05:30.889
but then also, the key here is that

05:30.889 --> 05:33.780
when this,

05:33.780 --> 05:35.130
this policy loss is backpropagated,

05:35.130 --> 05:36.330
the weights are adjusted

05:36.330 --> 05:40.814
in such a way so that this advantage is maximized.

05:40.814 --> 05:43.440
So that's the intuitive side,

05:43.440 --> 05:44.820
or the intuitive understanding of it,

05:44.820 --> 05:47.387
that we are backpropagating this policy loss

05:47.387 --> 05:50.700
through the network, in order to help

05:50.700 --> 05:52.050
maximize this advantage.

05:52.050 --> 05:53.694
And what that means is basically

05:53.694 --> 05:56.820
that when an agent comes across bad actions

05:56.820 --> 05:58.530
like actions where the Q value is less

05:58.530 --> 06:00.612
than the known value for this state.

06:00.612 --> 06:03.360
So basically the whole A3C algorithm knows

06:03.360 --> 06:05.674
that the value for this state is something X

06:05.674 --> 06:08.190
and then all of a sudden you came across a very

06:08.190 --> 06:11.670
bad action, and you chose that bad action.

06:11.670 --> 06:12.743
And what that means

06:12.743 --> 06:14.388
for the A3C algorithm is that, well,

06:14.388 --> 06:16.380
why would we do something like that

06:16.380 --> 06:17.970
when it's worse than

06:17.970 --> 06:20.247
what we already know about this whole environment

06:20.247 --> 06:22.050
and what we could have done?

06:22.050 --> 06:24.406
So we shouldn't do more of that, and therefore

06:24.406 --> 06:27.690
the weights are adjusted in a way so that happens more rarely.

06:27.690 --> 06:29.760
So that happens less often.

06:29.760 --> 06:32.070
So that's a less frequent occurrence that we

06:32.070 --> 06:33.111
choose that bad action.

06:33.111 --> 06:34.020
On the other hand

06:34.020 --> 06:36.600
if you choose a very good action where Q value is greater

06:36.600 --> 06:40.310
than V, or much greater, then during this backpropagation

06:40.310 --> 06:42.090
of the policy loss through the network

06:42.090 --> 06:43.410
the weights are gonna be updated

06:43.410 --> 06:47.130
in such a way as to reinforce that, to encourage

06:47.130 --> 06:49.710
that to happen again.

06:49.710 --> 06:51.882
So that the weights will be adjusted in such a way

06:51.882 --> 06:54.120
so the A3C algorithm will think, Oh wow

06:54.120 --> 06:55.140
that was really cool.

06:55.140 --> 06:58.290
The advantage was very high there, I should do more of that

06:58.290 --> 07:00.270
and therefore it will update the weights

07:00.270 --> 07:04.320
in such a way that it will be more likely to occur

07:04.320 --> 07:05.246
in the future, that action.
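
NOTE
The tutorial skips the policy loss formula. As an illustration only, one
common form in policy-gradient methods is minus the log-probability of
the chosen action times the advantage, with the advantage treated as a
constant and the optional entropy term left out. This form and the
function names below are assumptions, not taken from the tutorial.
```python
import math
def policy_loss(action_prob, advantage):
    # Assumed form: -log(pi(a|s)) * A(s, a).
    return -math.log(action_prob) * advantage
def grad_wrt_prob(action_prob, advantage):
    # Derivative of -log(p) * A with respect to p is -A / p.
    return -advantage / action_prob
good = grad_wrt_prob(0.25, advantage=2.0)   # negative slope
bad = grad_wrt_prob(0.25, advantage=-2.0)   # positive slope
print(good, bad)
```
Gradient descent moves against the slope, so a positive advantage pushes
the action's probability up and a negative advantage pushes it down.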

07:05.246 --> 07:08.790
So, and therefore that is, you know

07:08.790 --> 07:12.089
that's how the network is slowly, slowly going to adapt

07:12.089 --> 07:15.710
and slowly going to construct itself into something that

07:15.710 --> 07:18.561
on one hand calculates the value correctly,

07:18.561 --> 07:20.130
or as correctly as possible,

07:20.130 --> 07:22.656
and on the other hand

07:22.656 --> 07:27.656
it encourages actions which have a high advantage.

07:28.350 --> 07:30.570
So there we go. That's this part.

07:30.570 --> 07:32.730
And now let's have a look at another one just

07:32.730 --> 07:34.770
to kind of reinforce what we just discussed.

07:34.770 --> 07:35.970
Let's look at the top one.

07:35.970 --> 07:36.981
So same thing here.

07:36.981 --> 07:40.920
The top agent sees a situation,

07:40.920 --> 07:43.050
a state, is in a state, and

07:43.050 --> 07:44.190
then needs to decide what to do.

07:44.190 --> 07:46.650
So it sends this information to the network.

07:46.650 --> 07:47.940
So this image goes into the network,

07:47.940 --> 07:50.130
goes to the convolution layer, pooling layer,

07:50.130 --> 07:52.723
flattening layer, goes into the hidden layers,

07:52.723 --> 07:54.720
and then from here we get an output.

07:54.720 --> 07:57.722
We get the Q values of the policy, we get the V values.

07:57.722 --> 07:59.160
Again, the same thing.

07:59.160 --> 08:00.960
We've got two losses.

08:00.960 --> 08:02.670
We've got the value loss, which is here,

08:02.670 --> 08:03.752
policy loss which is here.

08:03.752 --> 08:06.204
Value loss we already know how it's calculated

08:06.204 --> 08:09.150
and we discussed this in deep Q-learning

08:09.150 --> 08:10.632
and discussed this just now as well.

08:10.632 --> 08:12.840
So that's how the value loss is calculated.

08:12.840 --> 08:15.000
And then the policy loss, again,

08:15.000 --> 08:16.596
in order to calculate that

08:16.596 --> 08:18.420
which we are not going to go into for now

08:18.420 --> 08:21.660
but on an intuitive level, we're calculating the advantage

08:21.660 --> 08:25.530
which is, okay, so we took a certain action

08:25.530 --> 08:28.029
we chose a certain action based on our selection policy

08:28.029 --> 08:30.870
whether it's softmax or epsilon-greedy

08:30.870 --> 08:34.188
or whatever other selection policy that we're using.

08:34.188 --> 08:37.800
And then what's the action we took?

08:37.800 --> 08:41.334
Now let's compare it to the known value of the state

08:41.334 --> 08:44.580
which comes from the shared critic.

08:44.580 --> 08:46.560
So this critic is kind of like, if you think about it

08:46.560 --> 08:50.190
he's kind of observing all of these agents at the same time.

08:50.190 --> 08:51.990
He's looking at this one, looking at this one, this one

08:51.990 --> 08:53.557
they're all contributing towards the critic to

08:53.557 --> 08:56.250
get the critic more up to speed

08:56.250 --> 08:57.480
with the environment to make sure

08:57.480 --> 08:59.640
that the critic is representative

08:59.640 --> 09:02.520
of what's going on in the actual environment.

09:02.520 --> 09:04.020
And that's,

09:04.020 --> 09:05.250
that's where the value loss comes in:

09:05.250 --> 09:08.698
so that the weights of the actual neural network

09:08.698 --> 09:12.120
reflect very well

09:12.120 --> 09:16.230
the actual situation of things in the environment so

09:16.230 --> 09:20.100
that they can then rely on this value and then use it here.

09:20.100 --> 09:22.926
And so basically,

09:22.926 --> 09:25.672
all of these agents are contributing

09:25.672 --> 09:27.660
to this critic

09:27.660 --> 09:29.852
through this value loss, but at the same time,

09:29.852 --> 09:32.996
the critic is observing the decisions

09:32.996 --> 09:35.550
or the policies of these agents.

09:35.550 --> 09:37.357
Like, it's like going, looking back at the,

09:37.357 --> 09:39.238
like I'm trying to draw like an arrow to the policy.

09:39.238 --> 09:40.800
An arrow. An arrow.

09:40.800 --> 09:42.360
So looking back at them at the

09:42.360 --> 09:43.410
decisions that they're making.

09:43.410 --> 09:46.230
It's criticizing these decisions through the advantage.

09:46.230 --> 09:48.120
It's saying, Okay, you made a decision.

09:48.120 --> 09:50.220
You chose this, you chose this action.

09:50.220 --> 09:51.210
That's great.

09:51.210 --> 09:52.470
Now let's calculate the advantage.

09:52.470 --> 09:53.303
What is the advantage?

09:53.303 --> 09:56.195
The advantage is A equals the Q value of

09:56.195 --> 09:58.810
the decision I made or the choice I made.

09:58.810 --> 10:00.842
The action I chose to take,

10:00.842 --> 10:04.736
minus the value known to the critic,

10:04.736 --> 10:05.913
the value known to the critic.

10:05.913 --> 10:07.319
So calculate the difference.

10:07.319 --> 10:08.619
If it's a low difference,

10:08.619 --> 10:11.640
then when your policy loss

10:11.640 --> 10:13.438
is backpropagated through the network,

10:13.438 --> 10:15.330
the weights are going to be adjusted,

10:15.330 --> 10:17.160
it's gonna encourage the weights to be adjusted

10:17.160 --> 10:19.140
in such a way that, that doesn't happen again.

10:19.140 --> 10:22.424
That Q value is gonna be lower.

10:22.424 --> 10:26.580
So that, because our policy selects the actions

10:26.580 --> 10:28.920
based on the Q values, the higher the Q value

10:28.920 --> 10:30.570
the more likely it's going to be selected.

10:30.570 --> 10:32.500
So if we were using like an argmax policy

10:32.500 --> 10:34.140
then we'd just always select

10:34.140 --> 10:35.220
the one with the highest as,

10:35.220 --> 10:36.300
as you remember, we discussed this

10:36.300 --> 10:38.880
then we'd always select the one with the highest Q value.

10:38.880 --> 10:41.010
But actually, we are using a probabilistic approach,

10:41.010 --> 10:44.520
where we're either using, like, a softmax or an epsilon-greedy policy.

10:44.520 --> 10:45.641
And then, so we're basically selecting,

10:45.641 --> 10:47.550
we can select any one of them

10:47.550 --> 10:49.230
but the higher the Q value the better.

10:49.230 --> 10:50.864
So if we selected something

10:50.864 --> 10:52.794
and then the advantage was very low,

10:52.794 --> 10:55.710
then bam, the network's gonna be updated

10:55.710 --> 10:58.237
in such a way that, next time

10:58.237 --> 11:00.956
this Q value of that certain action is gonna be less

11:00.956 --> 11:02.334
and maybe something else will be more.

11:02.334 --> 11:06.090
So that's how that is played out.

11:06.090 --> 11:08.760
And on the other hand if we select something

11:08.760 --> 11:10.800
where the advantage is gonna be high

11:10.800 --> 11:13.770
then this is gonna go into the policy loss

11:13.770 --> 11:15.090
and then the network's gonna be updated

11:15.090 --> 11:19.560
so that that is a more commonly observed scenario.

11:19.560 --> 11:22.662
And so basically this whole policy loss helps

11:22.662 --> 11:25.394
the network adapt or morph in such a way

11:25.394 --> 11:27.716
that we do more of the good stuff

11:27.716 --> 11:29.698
good actions and good things,

11:29.698 --> 11:31.680
and do less of the bad things.

11:31.680 --> 11:33.720
And that's how these two losses come into play

11:33.720 --> 11:35.100
and that's how they backpropagate.

11:35.100 --> 11:39.270
So hopefully that clears things up in a very intuitive way.

11:39.270 --> 11:40.690
Of course, we didn't go into the formulas

11:40.690 --> 11:43.566
into the complex mathematics behind all of this

11:43.566 --> 11:46.608
and like into the very intricate details

11:46.608 --> 11:48.866
but at the same time, hopefully,

11:48.866 --> 11:50.076
in an intuitive way,

11:50.076 --> 11:53.775
all of this clears up why

11:53.775 --> 11:56.400
we have the actor and the critic

11:56.400 --> 11:58.324
and how they interact together that,

11:58.324 --> 12:01.110
you know, you have these asynchronous agents,

12:01.110 --> 12:02.959
so this is the asynchronous side of things.

12:02.959 --> 12:06.210
Then this is your actor and your critic,

12:06.210 --> 12:07.200
and this is your advantage.

12:07.200 --> 12:08.490
And how that all comes into play.

12:08.490 --> 12:10.379
So these asynchronous agents

12:10.379 --> 12:14.910
they're exploring the environment

12:14.910 --> 12:15.891
and working through the environment,

12:15.891 --> 12:20.190
and they're all, altogether contributing to a critic

12:20.190 --> 12:23.640
which is then observing their policies

12:23.640 --> 12:27.053
observing the actors, which is what this is called.

12:27.053 --> 12:30.690
And through the advantage,

12:30.690 --> 12:32.879
it's therefore coming up with this policy loss,

12:32.879 --> 12:34.729
and then the policy and value losses,

12:34.729 --> 12:37.500
they're backpropagated to adjust the network

12:37.500 --> 12:39.450
in order to on one hand represent

12:39.450 --> 12:43.560
the true way of things in the environment.

12:43.560 --> 12:47.910
And on the other hand, to improve the actors' performance.

12:47.910 --> 12:48.754
So there we go.

12:48.754 --> 12:52.442
That's a quick recap of the intuition we discussed.

12:52.442 --> 12:54.990
Once again, hopefully this is all

12:54.990 --> 12:57.000
coming together on an intuitive level.

12:57.000 --> 12:59.070
And of course in the practical tutorials

12:59.070 --> 13:02.610
we'll talk more about how all of this works.

13:02.610 --> 13:05.580
I will walk you through this, the process of building this

13:05.580 --> 13:08.250
but having this image in your mind

13:08.250 --> 13:10.680
and this kind of roadmap of everything,

13:10.680 --> 13:12.840
how it comes together, will be, well, should be,

13:12.840 --> 13:15.030
I hope it'll be very helpful for you

13:15.030 --> 13:17.634
to better navigate the practical side of things.

13:17.634 --> 13:21.930
And in terms of additional reading for today

13:21.930 --> 13:23.370
we've got two elements.

13:23.370 --> 13:25.740
So the first one is on the advantage.

13:25.740 --> 13:26.820
So here we've got

13:26.820 --> 13:29.291
High-Dimensional Continuous Control Using

13:29.291 --> 13:32.994
Generalized Advantage Estimation, by John Schulman.

13:32.994 --> 13:36.930
And this is an image of a stick figure

13:36.930 --> 13:38.384
getting up, like standing up.

13:38.384 --> 13:41.760
And here you can find even more about advantage,

13:41.760 --> 13:43.320
and you'll find out about

13:43.320 --> 13:44.880
the different types of advantages.

13:44.880 --> 13:47.545
So you've got the generalized advantage estimation,

13:47.545 --> 13:50.400
you've got advantages that you use actually

13:50.400 --> 13:52.290
in the formulas in the calculation.

13:52.290 --> 13:55.200
So if you want to find out more about advantage

13:55.200 --> 13:57.630
and exactly how it works, the formulas behind it

13:57.630 --> 14:02.630
and some of the top elements or formulas

14:02.880 --> 14:06.120
and know-hows in the space of

14:06.120 --> 14:07.818
this advantage that we discussed

14:07.818 --> 14:09.765
then this is the article to go to.

14:09.765 --> 14:13.813
And one other element,

14:13.813 --> 14:16.097
or piece of work, that we wanted to

14:16.097 --> 14:20.010
remind you about is the blog.

14:20.010 --> 14:21.960
A series of blog posts by Arthur Juliani

14:21.960 --> 14:23.575
which we've mentioned a couple times already.

14:23.575 --> 14:27.498
This is part eight, which is specifically about A3C.

14:27.498 --> 14:30.494
So here you can get

14:30.494 --> 14:33.203
another explanation.

14:33.203 --> 14:36.240
So, with a bit more mathematics about what's going on,

14:36.240 --> 14:37.073
maybe you can pick up

14:37.073 --> 14:39.480
some additional things from here.

14:39.480 --> 14:41.340
Just two things to keep in mind.

14:41.340 --> 14:44.987
First of all, as always, this blog uses TensorFlow,

14:44.987 --> 14:46.110
whereas we are using PyTorch.

14:46.110 --> 14:46.943
So keep that in mind.

14:46.943 --> 14:50.550
And the second thing is that the way we structured

14:50.550 --> 14:52.931
our approach is, we talked about actor critic first

14:52.931 --> 14:54.839
then we talked about asynchronous

14:54.839 --> 14:58.043
and then we talked about advantage.

14:58.043 --> 15:01.290
Whereas in his blog, Arthur first talks

15:01.290 --> 15:03.930
about asynchronous and actor critic then advantage.

15:03.930 --> 15:05.550
So keep that in mind.

15:05.550 --> 15:07.350
So hopefully that doesn't throw you off

15:07.350 --> 15:09.990
but other than that of course it's a great piece

15:09.990 --> 15:12.360
of content and we do highly recommend

15:12.360 --> 15:14.790
checking it out for some additional information.

15:14.790 --> 15:16.800
So there we go. Hopefully you enjoyed today's tutorial

15:16.800 --> 15:18.690
and I look forward to seeing you next time.

15:18.690 --> 15:20.673
Until then, enjoy AI.