WEBVTT

00:00.630 --> 00:01.463
-: Hello and welcome back

00:01.463 --> 00:03.900
to the course on artificial intelligence.

00:03.900 --> 00:07.020
And finally, we are onto the fun stuff.

00:07.020 --> 00:08.966
We're onto deep Q learning.

00:08.966 --> 00:10.710
All right, so let's have a look.

00:10.710 --> 00:13.680
Previously we spoke about Q learning and what it's all

00:13.680 --> 00:16.560
about and we learned about the agent, the environment

00:16.560 --> 00:20.793
and how the agent will look at the state he or she is in

00:20.793 --> 00:24.720
take an action, get a reward, enter into a new state.

00:24.720 --> 00:27.090
And based on that feedback loop

00:27.090 --> 00:28.380
they will continue taking actions

00:28.380 --> 00:29.460
and they'll learn from that.

00:29.460 --> 00:32.250
Understand what are the better actions to take.

00:32.250 --> 00:34.715
And so we looked at this basic example of a maze.

00:34.715 --> 00:37.920
We understood that the agent explores the environment

00:37.920 --> 00:40.530
and understands what the values of the states are

00:40.530 --> 00:43.350
then we moved on from dealing with the values of the states

00:43.350 --> 00:46.407
to dealing with the values of the actions or the Q values.

00:46.407 --> 00:49.164
And then based on that, we understood how plans

00:49.164 --> 00:51.930
in non-stochastic environments work

00:51.930 --> 00:55.260
and how policies work in stochastic environments,

00:55.260 --> 00:57.090
and this is an example of policy.

00:57.090 --> 00:58.740
So that's a quick recap

00:58.740 --> 01:01.020
of everything we discussed in the basic Q learning.

01:01.020 --> 01:02.580
And now let's have a look

01:02.580 --> 01:05.460
at how this can be taken to the next level

01:05.460 --> 01:08.250
through deep learning, through adding deep learning.

01:08.250 --> 01:10.650
Okay, so this is our environment,

01:10.650 --> 01:14.460
and what we're going to do now is we are going to add,

01:14.460 --> 01:18.210
instead of just doing basic calculations

01:18.210 --> 01:21.870
in this matrix that we have, which is pretty, pretty simple

01:21.870 --> 01:24.300
what we're going to do is we're going to add two axes.

01:24.300 --> 01:27.750
We're going to add an x and y axis, or we'll call them X one

01:27.750 --> 01:30.480
and X two, just to make things even more general.

01:30.480 --> 01:34.430
And here we'll number the columns 1, 2, 3, 4,

01:34.430 --> 01:36.780
here we'll number the rows 1, 2, 3.

01:36.780 --> 01:41.010
And so now every single state can be described

01:41.010 --> 01:43.800
by a pair of two values, X one and X two.

01:43.800 --> 01:45.420
So any one of these squares

01:45.420 --> 01:47.891
in which the agent can possibly be

01:47.891 --> 01:50.970
can be described by X one, X two.

01:50.970 --> 01:54.420
So for instance, right now he's in the square

01:54.420 --> 01:58.470
with X one equal to one and X two equal to two.

01:58.470 --> 02:00.450
And therefore, in that same way,

02:00.450 --> 02:02.100
we can describe any square,

02:02.100 --> 02:03.540
meaning we can describe any state.

02:03.540 --> 02:06.000
And of course this is a very simplified version

02:06.000 --> 02:08.940
of an environment of describing states, but nevertheless

02:08.940 --> 02:10.290
it works in this case.

02:10.290 --> 02:14.190
And that means that now we can feed

02:14.190 --> 02:17.400
these states into a neural network.

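NOTE
A minimal sketch of how a square of the maze could be turned into the pair
(X1, X2) that gets fed into the network. The helper name encode_state and
the 4-by-3 bounds check are assumptions made for illustration; the lecture
itself only fixes the column and row numbering.
```python
# Hypothetical helper: turn a maze square into the input vector (X1, X2).
def encode_state(column, row):
    # Columns are numbered 1..4 and rows 1..3, as in the lecture's grid.
    if not (1 <= column <= 4 and 1 <= row <= 3):
        raise ValueError("square is outside the 4x3 maze")
    return [float(column), float(row)]
state = encode_state(1, 2)  # the square with X1 = 1 and X2 = 2
print(state)  # [1.0, 2.0]
```
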
02:17.400 --> 02:19.380
And by the way, here, I would just like to mention

02:19.380 --> 02:21.720
that at the end of the course we've got annexes

02:21.720 --> 02:24.240
we've got annex number one and annex number two.

02:24.240 --> 02:26.345
In order to proceed successfully with this section

02:26.345 --> 02:29.010
it's highly advisable that you check out annex number one,

02:29.010 --> 02:31.740
which is on artificial neural networks

02:31.740 --> 02:32.790
so you understand how they work

02:32.790 --> 02:35.697
so that we don't have to delve into that here

02:35.697 --> 02:37.470
and we can just use the benefits

02:37.470 --> 02:40.800
of the knowledge of how artificial neural networks work.

02:40.800 --> 02:43.200
And so we feed in this information

02:43.200 --> 02:45.268
on the state into a neural network

02:45.268 --> 02:49.770
and then it will process this information.

02:49.770 --> 02:51.060
So X one and X two

02:51.060 --> 02:52.650
depending on the structure of the neural network

02:52.650 --> 02:55.410
it might have multiple hidden layers and so on.

02:55.410 --> 02:57.360
So that's something that you'll figure out

02:57.360 --> 02:58.920
in the practical tutorials.

02:58.920 --> 03:00.930
But at the end, we will structure it

03:00.930 --> 03:03.630
in such a way that it spits out four values.

03:03.630 --> 03:06.630
And these four values are actually going to be our Q values.

03:06.630 --> 03:09.900
So the values which dictate which action we need to take.

03:09.900 --> 03:11.790
And further down in this tutorial, we'll see exactly

03:11.790 --> 03:14.637
how these Q values are used to decide which action is taken.

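NOTE
A rough sketch of a network that takes (X1, X2) and spits out four values,
one Q value per action. The layer sizes, the tanh activation, the random
starting weights and the order of the actions are all assumptions made for
illustration; the actual architecture is left to the practical tutorials.
```python
import math
import random
ACTIONS = ["up", "right", "left", "down"]  # the order is an assumption
def make_layer(n_in, n_out, rng):
    # One row of weights per output neuron, with small random starting values.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
def forward(state, hidden_w, output_w):
    # Hidden layer with tanh activation, then a linear output layer (no biases).
    hidden = [math.tanh(sum(w * x for w, x in zip(row, state))) for row in hidden_w]
    return [sum(w * h for w, h in zip(row, hidden)) for row in output_w]
rng = random.Random(0)
hidden_w = make_layer(2, 5, rng)  # 2 inputs (X1, X2), 5 hidden neurons
output_w = make_layer(5, 4, rng)  # 4 outputs, one Q value per action
q_values = forward([1.0, 2.0], hidden_w, output_w)
print(dict(zip(ACTIONS, q_values)))
```
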
03:14.637 --> 03:19.020
But the main point here is that we no longer look

03:19.020 --> 03:22.318
at just this maze from a Q learning perspective.

03:22.318 --> 03:24.660
We're now taking the states

03:24.660 --> 03:26.550
of the maze and we're feeding them into

03:26.550 --> 03:30.808
a deep neural network in order to get these Q values.

03:30.808 --> 03:32.310
And, and at the end of the day

03:32.310 --> 03:33.930
we are still going to come up with an action.

03:33.930 --> 03:35.460
We're still going to understand

03:35.460 --> 03:37.140
what action we need to take

03:37.140 --> 03:39.000
and we'll discuss all this in more detail.

03:39.000 --> 03:40.380
But the question right now is why.

03:40.380 --> 03:42.093
Why are we doing all of this?

03:43.200 --> 03:44.310
Why are we making things so

03:44.310 --> 03:46.650
much more complicated when that initial approach

03:46.650 --> 03:48.214
of Q learning was working already?

03:48.214 --> 03:52.110
Well, the reason for that is the Q learning was working

03:52.110 --> 03:54.210
in this very simplistic environment

03:54.210 --> 03:55.410
and we're continuing to deal

03:55.410 --> 03:57.004
for now with this very simplistic environment

03:57.004 --> 04:00.030
in order to better understand the concepts.

04:00.030 --> 04:02.760
But at the same time, that simple Q learning

04:02.760 --> 04:05.626
will no longer work in more complex environments.

04:05.626 --> 04:08.401
And we're talking about, for instance

04:08.401 --> 04:10.680
the self-driving cars which you'll be creating,

04:10.680 --> 04:13.920
or playing Doom, when the other artificial intelligence

04:13.920 --> 04:16.170
is playing Doom or other Atari games

04:16.170 --> 04:19.260
like Breakout or even self-driving cars

04:19.260 --> 04:23.490
and more advanced reinforcement learning things such

04:23.490 --> 04:26.167
as robots walking around and performing actions.

04:26.167 --> 04:29.640
In all those cases, basic Q learning is insufficient,

04:29.640 --> 04:32.100
is not strong, is not powerful enough to

04:32.100 --> 04:34.710
be able to master those challenges.

04:34.710 --> 04:38.100
And just like we've seen in the deep learning course

04:38.100 --> 04:39.630
if you've been in our deep learning course,

04:39.630 --> 04:42.240
or if you've done the annex sections

04:42.240 --> 04:44.550
annex number one and annex two, you will already know

04:44.550 --> 04:48.510
that deep learning is by far superior to any type

04:48.510 --> 04:51.660
of machine learning, let alone basic, simple Q learning.

04:51.660 --> 04:53.250
And that's why we're leveraging the power

04:53.250 --> 04:54.210
of deep learning here.

04:54.210 --> 04:55.800
So we're feeding in the information

04:55.800 --> 04:58.530
about the environment as a vector of values.

04:58.530 --> 05:01.350
So in this case, just two values into a deep neural network.

05:01.350 --> 05:04.290
And then we're using that to perform the actions

05:04.290 --> 05:05.490
that we want to decide

05:05.490 --> 05:07.410
which actions the agent's going to take.

05:07.410 --> 05:08.243
So that's kind of

05:08.243 --> 05:11.850
like a high level overview of why we're doing this.

05:11.850 --> 05:13.080
And now let's have a look

05:13.080 --> 05:15.270
in a bit more detail at what happens

05:15.270 --> 05:17.949
to the concepts of Q learning when we

05:17.949 --> 05:21.480
make this transition from

05:21.480 --> 05:24.120
simple Q learning into deep Q learning.

05:24.120 --> 05:27.123
So as you saw in the previous intuition tutorials

05:27.123 --> 05:29.130
we had a slide like this

05:29.130 --> 05:33.660
which is the foundation of temporal difference learning.

05:33.660 --> 05:35.790
This is the formula for temporal difference.

05:35.790 --> 05:37.500
And basically, so let's go through this.

05:37.500 --> 05:40.500
So basically we had an agent who was

05:40.500 --> 05:43.260
in this state over here, which is

05:43.260 --> 05:45.060
indicated by the blue arrow.

05:45.060 --> 05:48.660
And we were understanding how temporal difference works

05:48.660 --> 05:51.750
for this Q value of for instance, going up.

05:51.750 --> 05:54.330
And so what we saw here was before

05:54.330 --> 05:55.620
this is in the simple Q learning,

05:55.620 --> 05:57.630
not the deep Q learning, this is in the simple Q learning.

05:57.630 --> 05:58.463
What we saw was

05:58.463 --> 06:03.463
before the agent had a certain Q value that he had learned

06:03.602 --> 06:06.017
about this action of going up.

06:06.017 --> 06:08.967
And so then he decides to take this action to go up.

06:08.967 --> 06:10.830
And right after he takes this action

06:10.830 --> 06:14.820
he gets a reward for taking this action in this state.

06:14.820 --> 06:16.500
And that is that reward.

06:16.500 --> 06:18.780
Plus, now he can evaluate the value

06:18.780 --> 06:21.690
of the current state he's in, which is the maximum

06:21.690 --> 06:23.910
of all of the new Q values

06:23.910 --> 06:25.860
of all of the Q values of the new actions

06:25.860 --> 06:29.190
he can take, a prime, in the new state, s prime.

06:29.190 --> 06:32.460
And we multiply it by the decay factor, gamma.

06:32.460 --> 06:35.488
So that is essentially the Q

06:35.488 --> 06:38.457
the new Q value or kind of like the

06:38.457 --> 06:42.030
the empirical Q value that he has just received

06:42.030 --> 06:43.230
for taking that action.

06:43.230 --> 06:45.630
And ideally these two should be the same.

06:45.630 --> 06:49.230
So the Q value that he had in his memory

06:49.230 --> 06:51.480
about this action in this state should equate

06:51.480 --> 06:54.750
to the actual reward plus the gamma

06:54.750 --> 06:57.600
times the value of the state that he ended up in.

06:57.600 --> 06:59.070
And therefore that's how we calculated the

06:59.070 --> 06:59.903
temporal difference.

06:59.903 --> 07:03.750
We take what he got after, minus what he had in

07:03.750 --> 07:06.000
mind, what he was expecting, you'd subtract one

07:06.000 --> 07:07.710
from the other and that's your temporal difference.

07:07.710 --> 07:11.130
And then you use your learning rate alpha to

07:11.130 --> 07:14.250
adjust your Q value your new Q value

07:14.250 --> 07:17.100
by the temporal difference, but with a coefficient of alpha.

07:17.100 --> 07:20.490
So that is the essence of the simple Q learning.

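NOTE
The recap above can be written as TD = R + gamma * max Q(s', a') - Q(s, a),
followed by the update Q(s, a) = Q(s, a) + alpha * TD. Below is a small
worked example of that update. The helper name td_update and all of the
numbers are made up for illustration.
```python
def td_update(q_old, reward, next_q_values, alpha, gamma):
    # What the agent got after the move: reward plus the discounted best next Q value.
    observed = reward + gamma * max(next_q_values)
    td = observed - q_old  # temporal difference: what he got minus what he expected
    return q_old + alpha * td, td
new_q, td = td_update(q_old=0.5, reward=0.0, next_q_values=[0.2, 1.0, 0.1, 0.4], alpha=0.1, gamma=0.9)
print(round(td, 3), round(new_q, 3))  # 0.4 0.54
```
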
07:20.490 --> 07:22.800
Now let's have a look at how it changes

07:22.800 --> 07:24.540
in deep Q learning.

07:24.540 --> 07:26.040
And so we're still gonna work with the slide,

07:26.040 --> 07:29.580
but we are going to just see exactly what's happening.

07:29.580 --> 07:34.050
So in deep Q learning, the neural network will predict

07:34.050 --> 07:35.520
four values, as we saw in the previous slide,

07:35.520 --> 07:36.992
and as we'll see further down in this tutorial,

07:36.992 --> 07:40.350
the neural network will predict four values,

07:40.350 --> 07:41.760
or it might predict more values

07:41.760 --> 07:44.790
if there's more possible actions in a given state.

07:44.790 --> 07:46.230
But in this case we know

07:46.230 --> 07:48.660
that there's only four actions: up, right, left, down.

07:48.660 --> 07:52.802
And so the neural network will predict four of these values.

07:52.802 --> 07:56.790
So in a deep Q learning situation,

07:56.790 --> 07:58.980
it's important to understand there's no before or after,

07:58.980 --> 08:01.710
and this is how we'll get to know this a bit better.

08:01.710 --> 08:05.130
So the neural network will predict four of these values,

08:05.130 --> 08:09.030
and it will compare not to what will happen after,

08:09.030 --> 08:11.820
but the neural network will compare to this exact value,

08:11.820 --> 08:15.360
but it was this value which was calculated

08:15.360 --> 08:17.760
in the previous step.

08:17.760 --> 08:20.490
So in the previous time when the agent was

08:20.490 --> 08:25.110
in this exact square, so let's say, I don't know,

08:25.110 --> 08:27.540
some time ago the agent was again,

08:27.540 --> 08:29.730
was in this exact square as well,

08:29.730 --> 08:34.410
and it calculated this value previously.

08:34.410 --> 08:36.930
So in the previous time, long time ago

08:36.930 --> 08:38.700
the agent calculated this value

08:38.700 --> 08:42.060
then the agent stored this value for the future

08:42.060 --> 08:43.710
and now the future has come.

08:43.710 --> 08:45.570
So now he's in the square again

08:45.570 --> 08:48.090
and now he's got these Q values, which he's predicted,

08:48.090 --> 08:50.341
and one of them is for going up.

08:50.341 --> 08:51.590
So now what he's gonna do

08:51.590 --> 08:56.310
he is gonna compare the predicted value of Q to this value

08:56.310 --> 08:58.792
which he had recorded from the previous step.

08:58.792 --> 09:01.020
And we'll understand exactly why

09:01.020 --> 09:01.890
this is important right now.

09:01.890 --> 09:04.410
So just important to understand here is there's no before

09:04.410 --> 09:06.780
and after. In this specific square,

09:06.780 --> 09:09.960
this specific time, we're taking the Q value

09:09.960 --> 09:13.680
that he's predicted using the neural network this time.

09:13.680 --> 09:17.430
And we're comparing it to this value which he had

09:17.430 --> 09:20.430
from the previous time, from the previous time he was

09:20.430 --> 09:23.010
in this square assessing all of the situation.

09:23.010 --> 09:25.074
And you know, like the previous time

09:25.074 --> 09:27.882
he actually performed this action.

09:27.882 --> 09:29.310
So there we go.

09:29.310 --> 09:31.980
Now let's have a look at how this all works

09:31.980 --> 09:35.160
out in the neural network and why is it like that?

09:35.160 --> 09:36.720
I know it sounds a bit complicated right now,

09:36.720 --> 09:39.360
but we'll break it down into simple terms

09:39.360 --> 09:40.193
in just a second.

09:40.193 --> 09:42.480
So this is our neural network. We're feeding in the states

09:42.480 --> 09:44.580
of the environment into the neural network. It's going

09:44.580 --> 09:46.500
through the hidden layers, then it's coming out

09:46.500 --> 09:50.790
with these outputs, q1, q2, q3, q4 in that specific state.

09:50.790 --> 09:52.200
These are the Q values

09:52.200 --> 09:55.200
that the neural network is predicting

09:55.200 --> 09:57.390
for the possible actions.

09:57.390 --> 09:58.410
Those are the Q values.

09:58.410 --> 10:00.420
So then we're comparing to a target,

10:00.420 --> 10:02.280
and this target is exactly,

10:02.280 --> 10:04.710
so if we go back here, this is the target.

10:04.710 --> 10:07.290
So this is the value that was predicted.

10:07.290 --> 10:09.690
But we also know that we have a target

10:09.690 --> 10:11.790
from the last time we were in the square.

10:11.790 --> 10:15.150
We have a target for this same action,

10:15.150 --> 10:16.650
which is up for instance.

10:16.650 --> 10:18.840
So here we've got a target and we're going to compare

10:18.840 --> 10:20.718
So we're comparing Q1 versus that target.

10:20.718 --> 10:23.040
We're comparing Q2 versus that target

10:23.040 --> 10:24.856
the target that we had from previously,

10:24.856 --> 10:28.380
Q3 versus the target, Q4 versus the target.

10:28.380 --> 10:32.280
And so this is the part where the neural network

10:32.280 --> 10:35.100
or the agent is now learning

10:35.100 --> 10:38.670
through deep learning how to better go through space.

10:38.670 --> 10:40.530
And the key point here is

10:40.530 --> 10:42.540
that we are still applying Q learning,

10:42.540 --> 10:43.448
but in simple Q learning

10:43.448 --> 10:47.040
you learn through temporal differences,

10:47.040 --> 10:48.360
which are pretty straightforward,

10:48.360 --> 10:50.940
which we've already discussed and we know quite well by now.

10:50.940 --> 10:53.010
But at the same time in deep learning

10:53.010 --> 10:54.600
how do neural networks learn?

10:54.600 --> 10:56.940
Well, neural networks learn through adjusting their weights.

10:56.940 --> 11:01.940
So we have to adapt the concepts of simple Q learning

11:04.050 --> 11:08.360
to the way neural networks actually work.

11:08.360 --> 11:10.677
And that is through updating their weights.

11:10.677 --> 11:12.570
And so this is what we're trying to figure out here

11:12.570 --> 11:13.980
how do we adapt that concept

11:13.980 --> 11:16.710
of temporal difference to a neural network

11:16.710 --> 11:20.877
so that we can leverage the full power of neural networks?

11:20.877 --> 11:22.230
And so far we've gotten this.

11:22.230 --> 11:26.332
So we enter our environment state here as a vector goes

11:26.332 --> 11:29.550
through a neural network, we get predictions of Q values

11:29.550 --> 11:33.240
and then from the previous time the agent was in that state

11:33.240 --> 11:36.840
we have these Q targets: Q target 1, 2, 3 and 4,

11:36.840 --> 11:39.450
for each of these respective actions.

11:39.450 --> 11:41.070
And so now we are up to, okay

11:41.070 --> 11:42.926
let's compare each one with each one.

11:42.926 --> 11:47.250
And from here it becomes pretty straightforward

11:47.250 --> 11:50.490
if you are up to speed with neural networks.

11:50.490 --> 11:52.980
Once again, that's all in annex number one.

11:52.980 --> 11:56.550
we're going to calculate a loss which is L here,

11:56.550 --> 12:01.137
and it's going to be Q target, this one, minus Q, this one,

12:01.137 --> 12:03.000
we're going to square that.

12:03.000 --> 12:03.986
So the squared difference

12:03.986 --> 12:06.810
of each one of these, and we're gonna sum them.

12:06.810 --> 12:09.570
So we're gonna take the sum of the squared differences

12:09.570 --> 12:11.670
of these Q values and their targets

12:11.670 --> 12:14.010
and we're gonna sum them up and that's gonna be our loss.

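NOTE
The loss just described is L = sum over the four actions of
(Q target - Q) squared. A minimal sketch of that calculation follows.
The function name squared_error_loss and the example numbers are
assumptions made for illustration.
```python
def squared_error_loss(q_targets, q_predicted):
    # L = sum over the actions of (Q_target - Q)^2
    return sum((t - q) ** 2 for t, q in zip(q_targets, q_predicted))
q_predicted = [0.5, 0.2, -0.1, 0.0]  # made-up network outputs
q_targets = [0.9, 0.2, 0.1, 0.0]     # made-up targets
print(round(squared_error_loss(q_targets, q_predicted), 3))  # 0.2
```
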
12:14.010 --> 12:16.620
And so ideally, just as we had

12:16.620 --> 12:17.910
in the temporal difference learning,

12:17.910 --> 12:18.905
so if we go back for a second,

12:18.905 --> 12:23.010
remember we said ideally we want this to be equal to this

12:23.010 --> 12:25.170
so we want the temporal difference to be zero.

12:25.170 --> 12:27.480
So that means basically the agent is

12:27.480 --> 12:30.960
predicting correctly what's, you know

12:30.960 --> 12:34.680
the Q values that the agent is predicting are exactly,

12:34.680 --> 12:37.140
or that he has in memory, are exactly descriptive

12:37.140 --> 12:41.314
of the environment and therefore the agent can navigate

12:41.314 --> 12:43.020
the environment pretty well, right?

12:43.020 --> 12:44.814
There are no surprises.

12:44.814 --> 12:48.900
As long as the temporal difference is highly positive

12:48.900 --> 12:51.360
or highly negative, then we've got some surprises.

12:51.360 --> 12:52.950
But if the temporal difference is zero

12:52.950 --> 12:54.540
then he knows the environment so well

12:54.540 --> 12:56.282
that he can predict what's going on,

12:56.282 --> 12:59.730
and therefore his policy is going to be very good

12:59.730 --> 13:01.350
and he's going to be able to navigate it.

13:01.350 --> 13:02.700
So here, same thing.

13:02.700 --> 13:05.010
So we want this loss to be as close

13:05.010 --> 13:07.680
to zero as possible, as small as possible.

13:07.680 --> 13:09.390
And that's why now we're going to

13:09.390 --> 13:11.580
this is the part where we are going to

13:11.580 --> 13:15.720
leverage the real true power of neural networks.

13:15.720 --> 13:17.100
So we're gonna take this loss

13:17.100 --> 13:19.140
and we're going to use back propagation

13:19.140 --> 13:24.090
and stochastic gradient descent to take this loss and pass it

13:24.090 --> 13:27.060
through the network, pass it back or back propagate it

13:27.060 --> 13:28.912
through the network and, through stochastic gradient descent,

13:28.912 --> 13:33.510
update the weights of these synapses in the network

13:33.510 --> 13:36.900
so that next time we go through this network

13:36.900 --> 13:39.300
the weights are already a bit more descriptive

13:39.300 --> 13:40.133
of the environment.

13:40.133 --> 13:41.100
And that's exactly how it works.

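NOTE
A sketch of the weight update, under a big simplifying assumption: the
network is replaced by a single linear layer with no hidden layers, so the
gradient of the loss can be written by hand instead of being back
propagated through several layers. The names predict, loss and sgd_step,
the learning rate and the numbers are all made up for illustration.
```python
def predict(weights, state):
    # Linear stand-in for the network: one weight row per action, no hidden layers.
    return [sum(w * x for w, x in zip(row, state)) for row in weights]
def loss(q_targets, q_predicted):
    return sum((t - q) ** 2 for t, q in zip(q_targets, q_predicted))
def sgd_step(weights, state, q_targets, learning_rate):
    q_predicted = predict(weights, state)
    # dL/dw[i][j] = -2 * (target_i - q_i) * x_j, so we step in the opposite direction.
    return [[w + learning_rate * 2 * (t - q) * x for w, x in zip(row, state)]
            for row, t, q in zip(weights, q_targets, q_predicted)]
state = [1.0, 2.0]
q_targets = [0.9, 0.2, 0.1, 0.0]
weights = [[0.0, 0.0] for _ in range(4)]
before = loss(q_targets, predict(weights, state))
for _ in range(20):
    weights = sgd_step(weights, state, q_targets, learning_rate=0.05)
after = loss(q_targets, predict(weights, state))
print(before > after)  # True
```
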
13:41.100 --> 13:44.670
So here you have, if we go back, this is calculated,

13:44.670 --> 13:47.610
the loss is calculated and it gets back propagated

13:47.610 --> 13:49.290
through the network, the weights are updated,

13:49.290 --> 13:51.966
then the next time we get here, this happens again

13:51.966 --> 13:54.433
and we get here, this happens again and so on.

13:54.433 --> 13:57.562
And so it keeps happening and that's how

13:57.562 --> 14:02.130
this agent learns or basically now it's the neural network

14:02.130 --> 14:05.100
which is the brain of the agent is learning,

14:05.100 --> 14:06.646
is becoming more

14:06.646 --> 14:09.900
and more descriptive of the environment and therefore

14:09.900 --> 14:12.150
the agent is able to navigate the environment better.

14:12.150 --> 14:14.220
When we say descriptive of the environment, that basically means

14:14.220 --> 14:16.665
that when we put in the states

14:16.665 --> 14:19.045
of the environment that the agent is in

14:19.045 --> 14:21.335
we are more likely to get closer

14:21.335 --> 14:24.554
and closer to the actual Q values.

14:24.554 --> 14:27.567
Those are the Q values that we want

14:27.567 --> 14:29.280
in order to find the right action.

14:29.280 --> 14:31.500
And that happens because these Q targets

14:31.500 --> 14:33.630
are actually empirically derived.

14:33.630 --> 14:36.870
So how does he find these Q targets?

14:36.870 --> 14:38.490
That's actually this.

14:38.490 --> 14:40.027
So he actually observes,

14:40.027 --> 14:41.700
"okay, so once I do take this step,

14:41.700 --> 14:43.020
what's the reward I get?"

14:43.020 --> 14:45.060
And then, "what's the value of this state?"

14:45.060 --> 14:46.650
So same thing as we saw previously

14:46.650 --> 14:48.840
in Q learning, in the simple Q learning intuition.

14:48.840 --> 14:50.700
So he learns this through trial

14:50.700 --> 14:53.122
and error and then he constructs his network

14:53.122 --> 14:54.870
or updates the weights of his network

14:54.870 --> 14:58.328
in such a way that the predicted Q values are closer

14:58.328 --> 15:01.830
and closer approximating the target Q values.

15:01.830 --> 15:05.160
So very similar to the concept we discussed here

15:05.160 --> 15:07.410
in the simple temporal difference learning

15:07.410 --> 15:09.572
of the simple Q learning algorithm.

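NOTE
A sketch of how the Q targets could be put together from what the agent
observes: the reward, plus gamma times the value of the state he ends up
in. Copying the predictions for the actions he did not take, so that they
add nothing to the loss, is a common convention and an assumption here,
not something the lecture states. The name build_targets and the numbers
are made up for illustration.
```python
def build_targets(q_predicted, action_index, reward, next_q_values, gamma):
    # Start from the predictions so the actions not taken contribute zero loss.
    targets = list(q_predicted)
    # Empirical target for the action actually taken: R + gamma * max Q(s', a').
    targets[action_index] = reward + gamma * max(next_q_values)
    return targets
q_predicted = [0.5, 0.2, -0.1, 0.0]   # made-up predictions for the current state
next_q_values = [0.2, 1.0, 0.1, 0.4]  # made-up predictions for the next state
targets = build_targets(q_predicted, action_index=0, reward=0.0, next_q_values=next_q_values, gamma=0.9)
print(targets)  # [0.9, 0.2, -0.1, 0.0]
```
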
15:09.572 --> 15:10.440
So there we go.

15:10.440 --> 15:12.540
That's how the agent learns.

15:12.540 --> 15:14.280
So we're up to here.

15:14.280 --> 15:15.530
That's the learning part.
