WEBVTT

00:00.660 --> 00:03.000
Instructor: Hello, and welcome back to the course on AI.

00:03.000 --> 00:04.170
In the previous part,

00:04.170 --> 00:06.990
we talked about the key learning intuition.

00:06.990 --> 00:08.340
We started there.

00:08.340 --> 00:12.240
And in fact, we actually got all the way to this part

00:12.240 --> 00:14.490
where we talked about learning.

00:14.490 --> 00:18.180
And now we're going to move on to the actual acting part.

00:18.180 --> 00:21.000
So there's two parts, two distinct parts

00:21.000 --> 00:22.260
that we have to remember.

00:22.260 --> 00:23.550
So, that's the learning part,

00:23.550 --> 00:26.370
but now he's done all of this, that's beautiful.

00:26.370 --> 00:27.900
Now he actually has to take an action.

00:27.900 --> 00:29.550
He has to decide what is he gonna do?

00:29.550 --> 00:31.710
Is he gonna do action one, two, three, or four.

00:31.710 --> 00:33.000
And so, how does he do that?

00:33.000 --> 00:37.140
Well, the way he does it is now given those same Q-values,

00:37.140 --> 00:38.310
so the Q-values don't change.

00:38.310 --> 00:40.620
After we have these Q-values, we've compared them,

00:40.620 --> 00:42.480
we've calculated the loss, we've propagated the error,

00:42.480 --> 00:43.313
we've updated the weights,

00:43.313 --> 00:45.780
but the Q-values don't change in that whole process.

00:45.780 --> 00:48.390
So after we've got the Q-values, they're fixed.

00:48.390 --> 00:49.350
We know what they are.

00:49.350 --> 00:51.600
So this happens, the network's updated.

00:51.600 --> 00:53.940
And now using those same Q-values that we had

00:53.940 --> 00:55.560
what we're going to do is we're going to pass them

00:55.560 --> 00:58.620
through a softmax function.

00:58.620 --> 01:02.190
And again, softmax is described in, I think, annex two.

01:02.190 --> 01:03.720
And we'll talk a bit more

01:03.720 --> 01:06.240
about softmax further down,

01:06.240 --> 01:09.420
or we'll talk about this action selection policy

01:09.420 --> 01:12.120
further down in the rest of this section.

01:12.120 --> 01:13.740
So just in the future tutorials,

01:13.740 --> 01:16.260
but for now, we're just gonna say we're passing it

01:16.260 --> 01:17.220
through the softmax function.

01:17.220 --> 01:18.900
And basically what it does is it allows,

01:18.900 --> 01:20.160
it helps select the best one.

01:20.160 --> 01:22.200
It selects the best action possible.

01:22.200 --> 01:23.670
And there's a small caveat to that.

01:23.670 --> 01:25.846
It's not just the best one possible.

01:25.846 --> 01:26.940
We'll talk about that

01:26.940 --> 01:28.950
in the action selection policy tutorial.

01:28.950 --> 01:29.850
But for now, let's just say

01:29.850 --> 01:31.800
it selects the best action from here.
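
NOTE
A minimal sketch of the softmax selection step described in this part, in plain Python.
The names softmax, select_action and temperature, and the example Q-values, are
assumptions for illustration and are not taken from the course code. Sampling from the
softmax probabilities usually picks the highest Q-value but not always, which is the
"small caveat" mentioned here.
```python
import math
import random
def softmax(q_values, temperature=1.0):
    """Turn Q-values into action probabilities (temperature is an assumed knob)."""
    # Subtract the largest Q-value before exponentiating for numerical stability.
    top = max(q_values)
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]
def select_action(q_values, temperature=1.0, rng=random):
    """Sample an action index in proportion to its softmax probability."""
    probs = softmax(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
q_values = [0.1, 0.5, 2.0, -0.3]  # hypothetical Q1..Q4 for the four actions
probs = softmax(q_values)
action = select_action(q_values, rng=random.Random(0))
```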

01:31.800 --> 01:36.180
It says, okay, so Q1, you know, the likelihood

01:36.180 --> 01:37.680
basically we know the Q-values,

01:37.680 --> 01:38.820
so it predicted the Q-values.

01:38.820 --> 01:40.770
So it can look at them and say,

01:40.770 --> 01:43.050
okay, so the highest Q-value out of these,

01:43.050 --> 01:46.260
just as we did in the simple Q-learning algorithm,

01:46.260 --> 01:47.460
it'll just look at all these four,

01:47.460 --> 01:48.960
and say the highest Q-value is this one,

01:48.960 --> 01:50.160
and I'm gonna select that action.

01:50.160 --> 01:50.993
We're gonna take that one.

01:50.993 --> 01:52.200
And that's pretty much it.

01:52.200 --> 01:53.640
That's how it chooses which action to take,

01:53.640 --> 01:54.963
takes the action

01:54.963 --> 01:57.840
and then all of this process happens again

01:57.840 --> 01:59.280
for the next state.

01:59.280 --> 02:00.360
The agent ends up,

02:00.360 --> 02:02.130
in our case, in the next square of the maze,

02:02.130 --> 02:04.620
but generally speaking in the next state.

02:04.620 --> 02:05.453
So there we go.

02:05.453 --> 02:07.590
That's how we feed

02:07.590 --> 02:12.590
in a reinforcement learning problem into a neural network

02:12.660 --> 02:16.170
through a vector describing the state that we're in.

02:16.170 --> 02:17.520
And once we feed it in,

02:17.520 --> 02:20.580
there's two parts of the process that happen.

02:20.580 --> 02:22.410
Part one is the learning.

02:22.410 --> 02:25.170
So, remember that part where we compare each of the Q-values

02:25.170 --> 02:27.390
with the targets and then we back propagate the loss

02:27.390 --> 02:29.040
through the network to update the weights

02:29.040 --> 02:31.620
so that our network is learning

02:31.620 --> 02:33.570
as we go through this maze

02:33.570 --> 02:35.190
or through this environment.

02:35.190 --> 02:37.740
And also the second part is, of course, we have to act,

02:37.740 --> 02:39.360
we have to select an action,

02:39.360 --> 02:42.720
and that is where we pass the Q-values

02:42.720 --> 02:44.400
through the softmax function,

02:44.400 --> 02:46.620
or basically an action selection policy,

02:46.620 --> 02:48.450
which we'll talk about further down.

02:48.450 --> 02:51.750
And then we simply select the action that we want to take

02:51.750 --> 02:52.860
and we perform that action.
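
NOTE
A rough sketch of the two parts summarized here, learning and then acting, in plain Python.
A linear function stands in for the neural network, and the targets are passed in as given
numbers, because this part of the lecture does not say how they are computed. All names,
the learning rate and the example values are assumptions for illustration only.
```python
import math
import random
def q_values_for(weights, state):
    """Linear stand-in for the network: one Q-value per action."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]
def learn(weights, state, q_values, targets, lr=0.1):
    """One gradient step on the squared error between Q-values and targets."""
    for a, row in enumerate(weights):
        error = q_values[a] - targets[a]
        for i, s in enumerate(state):
            row[i] -= lr * error * s
def act(q_values, rng):
    """Softmax over the same Q-values, then sample one action index."""
    top = max(q_values)
    exps = [math.exp(q - top) for q in q_values]
    return rng.choices(range(len(q_values)), weights=exps, k=1)[0]
rng = random.Random(0)
weights = [[0.0, 0.0] for _ in range(4)]  # 4 actions, 2-value state vector
state = [1.0, 2.0]                         # hypothetical encoded state
targets = [0.0, 1.0, 0.0, 0.0]             # hypothetical targets
q_values = q_values_for(weights, state)    # computed once, before the update
learn(weights, state, q_values, targets)   # part one: learning
action = act(q_values, rng)                # part two: acting on those Q-values
```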

02:52.860 --> 02:54.750
And then this whole process starts again.

02:54.750 --> 02:56.310
And then maybe the agent gets to the end,

02:56.310 --> 02:59.310
maybe the agent doesn't pass the game,

02:59.310 --> 03:01.260
in any case, the game ends.

03:01.260 --> 03:04.830
And then once again, the whole process repeats.

03:04.830 --> 03:06.690
The agent plays the whole game again

03:06.690 --> 03:08.280
and then that stops.

03:08.280 --> 03:12.210
So basically, that's another epoch every time the agent,

03:12.210 --> 03:14.310
you know, every time the game ends,

03:14.310 --> 03:15.510
whether favorably or unfavorably,

03:15.510 --> 03:16.680
that's the end of an epoch.

03:16.680 --> 03:17.730
And then he starts again,

03:17.730 --> 03:18.570
and then he starts again,

03:18.570 --> 03:20.430
and then he starts again, and so on.

03:20.430 --> 03:23.160
So, that happens and this process happens

03:23.160 --> 03:24.750
for every single time

03:24.750 --> 03:26.550
the agent is in a new state.

03:26.550 --> 03:28.380
So the state is encoded here.

03:28.380 --> 03:29.400
So that's important.

03:29.400 --> 03:31.350
So not just for every single game that he plays,

03:31.350 --> 03:33.030
but for every single state.

03:33.030 --> 03:35.100
So he's in a state, it goes through this process,

03:35.100 --> 03:36.480
it updates, and so on,

03:36.480 --> 03:38.130
and happens every single time.

03:38.130 --> 03:39.270
And so the learning happens,

03:39.270 --> 03:41.730
and then the acting happens as well.

03:41.730 --> 03:45.000
So that is Deep Q-Learning in,

03:45.000 --> 03:47.040
or rather, the intuition behind Deep Q-Learning.

03:47.040 --> 03:49.740
We've got lots more to cover off.

03:49.740 --> 03:51.510
And then of course we've got practical.

03:51.510 --> 03:52.350
And in the meantime,

03:52.350 --> 03:55.710
if you'd like to get some additional information

03:55.710 --> 03:59.580
on Deep Q-Learning, we've got a recommended reading.

03:59.580 --> 04:01.860
So, we've already spoken about

04:01.860 --> 04:05.160
Arthur Juliani's series of blog posts.

04:05.160 --> 04:08.250
If you look at Simple Reinforcement Learning

04:08.250 --> 04:10.470
with Tensorflow, Part 4,

04:10.470 --> 04:12.630
you will find the part that's relevant

04:12.630 --> 04:14.280
to what we discussed today.

04:14.280 --> 04:18.330
Note that here, he talks about convolutions.

04:18.330 --> 04:20.970
We are not covering convolutions in this section.

04:20.970 --> 04:21.960
We're gonna be talking about them

04:21.960 --> 04:23.730
in the next section of the course.

04:23.730 --> 04:25.710
So, the difference here is that,

04:25.710 --> 04:28.140
so just kind of skip the convolutions part for now

04:28.140 --> 04:30.660
and we'll talk about them in the next part of the course.

04:30.660 --> 04:33.930
But the difference is in convolutions you're like looking,

04:33.930 --> 04:36.570
your agent is looking at the image

04:36.570 --> 04:38.880
and therefore he has to process an image.

04:38.880 --> 04:40.350
So, an additional complication.

04:40.350 --> 04:43.560
For now, we're slowly gradually building up to that.

04:43.560 --> 04:47.580
For now, we are encoding our environment through.

04:47.580 --> 04:49.680
So if you look here, we're encoding our environment

04:49.680 --> 04:51.060
or maybe like look at this one,

04:51.060 --> 04:54.660
probably encoding our environment as

04:54.660 --> 04:58.710
a or encoding a state the agent is in as a vector.

04:58.710 --> 05:01.440
So in our case, it was a very simple vector of two values.

05:01.440 --> 05:04.050
But sometimes people even in that simple maze,

05:04.050 --> 05:06.270
sometimes, or as you'll see from this blog post,

05:06.270 --> 05:08.370
sometimes people prefer the one-hot

05:08.370 --> 05:10.200
encoded version of that state.

05:10.200 --> 05:13.590
So basically, where every single box of the maze has a,

05:13.590 --> 05:15.150
so you have like a vector,

05:15.150 --> 05:16.620
for in our case it would be 12 values,

05:16.620 --> 05:17.730
three by four.

05:17.730 --> 05:19.560
So, it's like either one or a zero,

05:19.560 --> 05:20.776
depending on which elements,

05:20.776 --> 05:23.040
which box you're in in the environment.
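
NOTE
A small sketch of the one-hot encoding described here for the three-by-four maze, in plain
Python. The function name one_hot_state and the row-major ordering of the boxes are
assumptions for illustration. The lecture only says that the vector has 12 values, each
either one or zero depending on which box the agent is in.
```python
ROWS, COLS = 3, 4  # maze size mentioned in the lecture: three by four
def one_hot_state(row, col, rows=ROWS, cols=COLS):
    """Return a vector of rows*cols values with a single 1 at the agent's box."""
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError("position is outside the maze")
    vector = [0] * (rows * cols)
    vector[row * cols + col] = 1  # row-major indexing is an assumption
    return vector
state = one_hot_state(1, 2)  # hypothetical position: second row, third column
```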

05:23.040 --> 05:27.390
So whichever way you decide to encode your environment

05:27.390 --> 05:29.430
and the state of your environment,

05:29.430 --> 05:30.480
that's how we're encoding it.

05:30.480 --> 05:31.500
So, it's basically a vector.

05:31.500 --> 05:33.450
The key here is that it's not a convolution,

05:33.450 --> 05:35.160
so it's not like an image

05:35.160 --> 05:36.420
and there's no convolution involved.

05:36.420 --> 05:37.830
So this part will come later.

05:37.830 --> 05:39.570
For us, it starts over here.

05:39.570 --> 05:41.340
And that just simplifies the process

05:41.340 --> 05:43.530
for us to gradually understand better.

05:43.530 --> 05:45.840
And of course, don't forget that this blog post

05:45.840 --> 05:46.800
is written in TensorFlow

05:46.800 --> 05:50.100
and we are using PyTorch in our tutorials.

05:50.100 --> 05:52.890
So, hopefully you enjoy this quick intro

05:52.890 --> 05:57.180
into a Deep Convolutional, not convolutional,

05:57.180 --> 05:59.310
yet Deep Q-Learning.

05:59.310 --> 06:02.910
And on that note, I look forward to seeing you next time.

06:02.910 --> 06:05.583
And until then, enjoy artificial intelligence.