WEBVTT

00:00.690 --> 00:01.523
-: Hello and welcome back

00:01.523 --> 00:03.240
to the course on artificial intelligence.

00:03.240 --> 00:05.910
Today we're talking about the first part

00:05.910 --> 00:08.370
of A3C, the actor-critic part.

00:08.370 --> 00:10.500
So, here we've got asynchronous advantage

00:10.500 --> 00:12.390
actor-critic algorithm,

00:12.390 --> 00:13.680
and we're going to be talking

00:13.680 --> 00:15.540
about that underlined actor-critic part.

00:15.540 --> 00:17.370
That's where we're going to start.

00:17.370 --> 00:18.600
You could technically start anywhere,

00:18.600 --> 00:20.790
but it just makes a lot more sense to start

00:20.790 --> 00:23.280
from actor-critic, because that way

00:23.280 --> 00:26.820
we'll have a very sequential explanation

00:26.820 --> 00:29.640
or intuitive understanding of what's going on.

00:29.640 --> 00:30.690
It's going to facilitate that better

00:30.690 --> 00:33.090
if we start, surprisingly, at the end

00:33.090 --> 00:34.170
of the abbreviation.

00:34.170 --> 00:37.680
All right, so, so far in this course we've come

00:37.680 --> 00:40.830
up with Deep Convolutional Q-learning,

00:40.830 --> 00:42.270
which is illustrated over here.

00:42.270 --> 00:46.290
So we've got the computer seeing the pixels

00:46.290 --> 00:48.810
so the actual image and pixels, not just the vector.

00:48.810 --> 00:49.650
So it's not cheating.

00:49.650 --> 00:52.110
It's actually seeing exactly what a human sees.

00:52.110 --> 00:54.420
It sees the monsters, it sees the health,

00:54.420 --> 00:58.620
it sees the parameters at the bottom, it sees the corridor,

00:58.620 --> 01:00.720
it sees the gun, it sees exactly the same thing

01:00.720 --> 01:03.090
as a human would see when playing this game.

01:03.090 --> 01:07.135
Then that image is passed through a convolutional layer.

01:07.135 --> 01:09.240
Then it's passed through a pooling layer.

01:09.240 --> 01:13.140
Then it's flattened and goes into a neural network.

01:13.140 --> 01:16.080
And then at the output we've got actions; as you remember,

01:16.080 --> 01:17.640
we've got those Q values.

01:17.640 --> 01:20.220
Then we apply an action-selection policy to them.

01:20.220 --> 01:22.470
So for instance, we apply a softmax

01:22.470 --> 01:24.510
and we find out which action we want to take.

01:24.510 --> 01:26.100
And so there's some exploration

01:26.100 --> 01:30.300
plus exploitation going on there, a combination of the two.

01:30.300 --> 01:33.660
So that is how Deep Convolutional Q-learning works.
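
NOTE
A minimal sketch of the softmax action selection described above, in plain Python.
The `temperature` parameter, the function names and the example Q-values are
assumptions made for illustration; they are not taken from the lecture.
```python
import math
import random
#
def softmax(q_values, temperature=1.0):
    """Turn Q-values into action probabilities (temperature is an assumed knob)."""
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(q_values)
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]
#
def select_action(q_values, temperature=1.0, rng=random):
    """Sample an action index in proportion to its softmax probability."""
    probs = softmax(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
#
# Hypothetical Q-values for three actions: the highest one is chosen most often
# (exploitation), but the others are still sampled sometimes (exploration).
q = [1.0, 2.0, 0.5]
probs = softmax(q)
action = select_action(q, rng=random.Random(0))
```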

01:33.660 --> 01:35.010
But now let's see what we're going to do with this.

01:35.010 --> 01:37.290
So for simplicity's sake, just so that it's easier

01:37.290 --> 01:38.790
for us to operate with this, because we're going

01:38.790 --> 01:41.790
to be adjusting this image and moving it around,

01:41.790 --> 01:44.377
we're going to replace the circles with squares,

01:44.377 --> 01:46.148
with these rectangular boxes.

01:46.148 --> 01:47.700
And we're also going to get rid

01:47.700 --> 01:51.270
of those lines in between, just changing them to arrows.

01:51.270 --> 01:54.450
So this doesn't change the essence.

01:54.450 --> 01:56.610
This is just the representation on this chart.

01:56.610 --> 01:59.070
Even this representation is still Deep

01:59.070 --> 02:01.950
Convolutional Q-learning; it's just going to be easier

02:01.950 --> 02:05.520
for us to modify it and show exactly what A3C is.

02:05.520 --> 02:07.800
So that's just how we're going to represent things

02:07.800 --> 02:12.000
from here on. As for what A3C does, or this specific part,

02:12.000 --> 02:13.410
so we're starting, remember

02:13.410 --> 02:15.120
we're starting step by step.

02:15.120 --> 02:16.680
We're starting with the actor-critic part.

02:16.680 --> 02:19.020
So we're going to see how we go

02:19.020 --> 02:22.860
from Deep Convolutional Q-learning to A3C step by step.

02:22.860 --> 02:24.449
And as the first step, we're going to introduce

02:24.449 --> 02:26.220
this actor-critic part over here.

02:26.220 --> 02:27.750
So we're going to talk about that.

02:27.750 --> 02:32.640
So the first thing that happens is this last bit

02:32.640 --> 02:33.870
the output is actually,

02:33.870 --> 02:35.190
we're just going to redraw it like this.

02:35.190 --> 02:36.480
So it's exactly the same output,

02:36.480 --> 02:40.410
exactly the same Q values or exactly the same actions.

02:40.410 --> 02:43.260
So if you had eight possible actions,

02:43.260 --> 02:44.610
you still have eight possible actions.

02:44.610 --> 02:45.930
We're just gonna put them

02:45.930 --> 02:47.250
at the top so they take up less space.

02:47.250 --> 02:49.230
So far, nothing has changed.

02:49.230 --> 02:52.080
So far, this and this are exactly the same.

02:52.080 --> 02:55.110
But now this is where the actor-critic part comes in.

02:55.110 --> 02:56.610
We're going to have a second output.

02:56.610 --> 02:59.940
So the first one is a set of outputs.

02:59.940 --> 03:02.790
And here we're going to have a separate individual output.

03:02.790 --> 03:06.183
So technically we're going to be using our neural network.

03:07.146 --> 03:12.146
So once the image and everything, like the values, go

03:12.390 --> 03:14.700
through the network from left to right over here,

03:14.700 --> 03:16.590
they don't just spit out one set of values,

03:16.590 --> 03:17.880
they actually spit out two sets.

03:17.880 --> 03:20.670
And so the top set, we already know what it is,

03:20.670 --> 03:22.920
it's the possible actions,

03:22.920 --> 03:25.350
but here we're actually going to have another extra value.

03:25.350 --> 03:26.670
So let's have a look at that.

03:26.670 --> 03:28.620
What is that value?

03:28.620 --> 03:29.850
So here we go.

03:29.850 --> 03:31.350
That's the top.

03:31.350 --> 03:33.750
So we just kind of reduced the size

03:33.750 --> 03:35.340
of this illustration.

03:35.340 --> 03:38.940
The top outputs are the Q values

03:38.940 --> 03:41.250
as we discussed previously for the actions.

03:41.250 --> 03:43.440
So it's the same thing; everything is the same.

03:43.440 --> 03:45.420
But then now this bottom part, oh

03:45.420 --> 03:47.160
and the top part is actually called the actor.

03:47.160 --> 03:47.993
We're going to give it a name.

03:47.993 --> 03:50.430
That's the actor, because that's the part

03:50.430 --> 03:52.560
where the agent chooses what it wants to do.

03:52.560 --> 03:55.680
So it's like it's acting; it's as if

03:55.680 --> 03:56.580
it's performing on stage.

03:56.580 --> 03:57.480
And it'll make more sense

03:57.480 --> 04:00.900
once we have the second name up on the screen as well.

04:00.900 --> 04:04.260
And then the second output is just like one value

04:04.260 --> 04:06.060
and that is V of S.

04:06.060 --> 04:07.953
So that is the value of the state.

04:09.166 --> 04:12.690
So if Q of S and A

04:12.690 --> 04:15.210
is the Q value of a certain action.

04:15.210 --> 04:17.580
And as you can see, that's why there's action one

04:17.580 --> 04:19.680
action two, action three and up to action six,

04:19.680 --> 04:21.990
so however many actions there possibly are in that state.

04:21.990 --> 04:23.460
So in a given state S,

04:23.460 --> 04:25.290
what is the Q value of taking action A?

04:25.290 --> 04:28.410
Action one, action two, and so on.

04:28.410 --> 04:30.510
Then here we're also predicting,

04:30.510 --> 04:34.560
we're also using the neural network to predict the value

04:34.560 --> 04:36.630
of the state we're actually in.

04:36.630 --> 04:40.800
And this part is called the critic.

04:40.800 --> 04:43.860
And so that's the intuition,

04:43.860 --> 04:45.810
or not even the full intuition,

04:45.810 --> 04:48.510
that's just like the start of the intuition

04:48.510 --> 04:51.120
behind actor-critic: there are two outputs now

04:51.120 --> 04:53.610
from the neural network, not just one.

04:53.610 --> 04:55.380
Before, we just had that one output

04:55.380 --> 04:56.430
which we now call the actor.

04:56.430 --> 04:59.220
But now we have two outputs, actor and critic.

04:59.220 --> 05:00.930
And there's gonna be a dynamic between them

05:00.930 --> 05:02.370
which we'll explore further.

05:02.370 --> 05:03.840
But for now, it's important to understand

05:03.840 --> 05:06.780
that we are predicting not just the Q values

05:06.780 --> 05:10.320
of the actions that the agent can take

05:10.320 --> 05:11.153
from this particular state.

05:11.153 --> 05:13.590
But we're also predicting the value of being

05:13.590 --> 05:15.217
in this current state using that same neural network.
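
NOTE
A minimal sketch of the two-output idea described above: one shared network body
feeding an actor output (one value per action) and a critic output (a single V(s)).
It assumes plain Python with small dense layers; the class name, layer sizes and
random weights are hypothetical, and `features` stands in for the flattened
output of the convolution and pooling layers.
```python
import random
#
def linear(inputs, weights, biases):
    """Dense layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]
#
def relu(values):
    return [max(0.0, v) for v in values]
#
def init(rows, cols, rng):
    """Small random weight matrix (rows x cols) plus zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return weights, [0.0] * rows
#
class ActorCriticNet:
    """Shared body with two heads: actor and critic."""
    def __init__(self, n_features, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)
        self.body = init(n_hidden, n_features, rng)
        self.actor_head = init(n_actions, n_hidden, rng)
        self.critic_head = init(1, n_hidden, rng)
    def forward(self, features):
        hidden = relu(linear(features, *self.body))
        # Top output (the actor): one value per possible action.
        action_values = linear(hidden, *self.actor_head)
        # Bottom output (the critic): a single value of the state, V(s).
        state_value = linear(hidden, *self.critic_head)[0]
        return action_values, state_value
#
net = ActorCriticNet(n_features=4, n_hidden=8, n_actions=6)
action_values, state_value = net.forward([0.2, -0.1, 0.5, 0.3])
```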

05:15.217 --> 05:20.217
So that's the core of the first step into actor-critic.

05:20.760 --> 05:22.620
And now we're going to need to talk about asynchronous

05:22.620 --> 05:23.760
which we'll do in the next tutorial,

05:23.760 --> 05:25.380
in order to understand exactly

05:25.380 --> 05:27.000
what's going on between the actor and the critic.

05:27.000 --> 05:29.550
And the final thing for today is that all

05:29.550 --> 05:32.670
of these Q values, as we know, are also called the policy.

05:32.670 --> 05:36.330
So in some literature, in some blogs and some discussions

05:36.330 --> 05:41.130
about actor-critic, you might find the

05:41.130 --> 05:44.820
author talking about Q values on this side of the actor.

05:44.820 --> 05:48.810
In other literature, blog posts and discussions,

05:48.810 --> 05:51.960
you will find the author talking about the policy.

05:51.960 --> 05:56.220
So usually they use a Greek letter, pi,

05:56.220 --> 05:57.720
for representing the policy,

05:57.720 --> 06:00.090
or it might just say policy of state.

06:00.090 --> 06:02.100
So altogether, these are the policy

06:02.100 --> 06:05.310
of the state, because as we remember, the policy is,

06:05.310 --> 06:08.010
if you put all the actions together, the possible actions

06:08.010 --> 06:11.670
and then it's deciding which action to take.

06:11.670 --> 06:12.630
So these are gonna be

06:12.630 --> 06:14.280
like the probabilities of taking each action.

06:14.280 --> 06:15.510
So that's the policy.

06:15.510 --> 06:17.430
So don't be thrown off

06:17.430 --> 06:19.440
if you see one or the other.

06:19.440 --> 06:21.090
In this context they play basically the same role.

06:21.090 --> 06:23.340
So on one hand here, you've got the policy or the Q values,

06:23.340 --> 06:25.260
on the other hand, you've got the actual value of the state

06:25.260 --> 06:27.750
and they're being predicted from the neural network.

06:27.750 --> 06:29.310
So that's the start of the actor-critic.

06:29.310 --> 06:31.320
We'll continue with this in the next tutorial

06:31.320 --> 06:33.180
when we're talking about asynchronous,

06:33.180 --> 06:34.380
and I look forward to seeing you there.

06:34.380 --> 06:36.063
Until then, enjoy AI.