WEBVTT

00:00.660 --> 00:01.800
Speaker: Hello, welcome back to the course

00:01.800 --> 00:03.540
on artificial intelligence.

00:03.540 --> 00:04.560
In today's tutorial,

00:04.560 --> 00:06.900
we're going to cover quite a complex topic

00:06.900 --> 00:08.310
called Eligibility Trace

00:08.310 --> 00:10.680
or N-step Q-learning.

00:10.680 --> 00:13.020
And, this is something that we're going to implement

00:13.020 --> 00:14.610
in the practical side of things.

00:14.610 --> 00:16.140
So, that's why we need to cover it off.

00:16.140 --> 00:18.450
And, at the same time, it is quite a complex topic.

00:18.450 --> 00:21.660
So, I've got a very interesting approach

00:21.660 --> 00:24.960
to getting us up to speed with the intuition behind this.

00:24.960 --> 00:25.793
So, I have like

00:25.793 --> 00:28.290
a different approach in mind than we're used to.

00:28.290 --> 00:30.780
So, let's have a look at that and see how that goes.

00:30.780 --> 00:34.050
So, I'm going to give you an example to start off with,

00:34.050 --> 00:36.330
like, I'm going to give you an example in this tutorial,

00:36.330 --> 00:40.020
and, that will demonstrate the power of eligibility trace.

00:40.020 --> 00:42.510
And, it'll give us the intuition behind things.

00:42.510 --> 00:45.090
And, then, if you like to delve further

00:45.090 --> 00:46.290
into eligibility trace,

00:46.290 --> 00:48.893
I'll give you the best place where you can read about it.

00:48.893 --> 00:51.360
I'll give you a reference to a book.

00:51.360 --> 00:52.560
But, otherwise,

00:52.560 --> 00:53.730
so, why this is gonna be different

00:53.730 --> 00:55.080
is because we're going to first,

00:55.080 --> 00:57.420
rather than delving into the intuition,

00:57.420 --> 00:58.620
we're going to look at an example,

00:58.620 --> 01:00.270
and, the intuition will become obvious

01:00.270 --> 01:01.620
after we talk about it.

01:01.620 --> 01:03.180
And, that's my hope for this tutorial.

01:03.180 --> 01:04.013
So, let's have a look.

01:04.013 --> 01:06.000
Let's see if we can do this.

01:06.000 --> 01:07.860
So, here we've got two agents.

01:07.860 --> 01:10.380
And, they're navigating the same environment.

01:10.380 --> 01:13.680
And, we are going to see how these two agents work.

01:13.680 --> 01:16.260
The first one's gonna work without eligibility trace.

01:16.260 --> 01:18.240
Second one is going to work with eligibility trace.

01:18.240 --> 01:21.900
And, hopefully we'll see why the second one

01:21.900 --> 01:24.600
is going to be so much more powerful than the first one.

01:24.600 --> 01:26.220
So, let's have a look.

01:26.220 --> 01:28.320
We're going to look at this agent first,

01:28.320 --> 01:30.090
and, the way he operates is

01:30.090 --> 01:34.530
the exact way that we've discussed deep Q-learning, so far.

01:34.530 --> 01:37.020
So, the agent is going to take a step,

01:37.020 --> 01:38.340
or is going to move,

01:38.340 --> 01:40.230
take an action, move into a new state.

01:40.230 --> 01:41.700
It's going to get a certain reward,

01:41.700 --> 01:44.700
it's going to put that reward through its algorithm,

01:44.700 --> 01:48.120
update the neural network that's running this agent,

01:48.120 --> 01:50.610
or that's running in the mind of this agent.

01:50.610 --> 01:53.193
So, that's basically how it's learning from the environment.

01:53.193 --> 01:54.870
Then it's going to take a new step.

01:54.870 --> 01:57.360
So, from this new state, it's gonna take a new action

01:57.360 --> 01:59.643
based on what its neural network is telling it to do.

01:59.643 --> 02:00.840
It's going to get rewards,

02:00.840 --> 02:01.777
it's gonna update, and, so on.

02:01.777 --> 02:03.900
And, it's going to keep doing that.
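
NOTE
Editor's sketch, not from the course material: the one-step update described above, with a small table standing in for the neural network. The names GAMMA, ALPHA, and one_step_update, and all the numbers, are assumptions made for illustration.
```python
# Hypothetical tabular stand-in for the neural network: q[state][action].
GAMMA = 0.9  # assumed discount factor
ALPHA = 0.5  # assumed learning rate
def one_step_update(q, state, action, reward, next_state, done):
    """Move q[state][action] toward reward + GAMMA * max over actions of q[next_state]."""
    bootstrap = 0.0 if done else max(q[next_state])
    target = reward + GAMMA * bootstrap
    q[state][action] += ALPHA * (target - q[state][action])
    return target
q = [[0.0, 0.0] for _ in range(3)]  # 3 states, 2 actions, estimates start at zero
q[1][0] = 2.0                       # pretend the agent already values state 1
target = one_step_update(q, state=0, action=1, reward=1.0, next_state=1, done=False)
```
With these assumed numbers the target is 1.0 + 0.9 * 2.0 = 2.8, so the estimate for state 0, action 1 moves from 0.0 to 1.4. Only the single reward just received enters the update.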

02:03.900 --> 02:06.630
So, obviously this agent's going to do quite a good job.

02:06.630 --> 02:08.340
And, as we've seen previously

02:08.340 --> 02:10.170
from the previous practical tutorials,

02:10.170 --> 02:12.690
we're gonna get some quite good results here.

02:12.690 --> 02:15.450
But, now we're going to add a new feature.

02:15.450 --> 02:17.580
Now, this agent number two,

02:17.580 --> 02:18.600
this guy over here,

02:18.600 --> 02:21.540
he's going to navigate the same environment,

02:21.540 --> 02:23.910
but, he's going to use eligibility trace.

02:23.910 --> 02:25.170
And, this is what it means.

02:25.170 --> 02:28.080
What he is going to do is he's going to take N-steps.

02:28.080 --> 02:30.240
He's gonna take, in this case, five, four steps.

02:30.240 --> 02:31.920
He's gonna take four steps.

02:31.920 --> 02:34.890
And, then only after taking these steps

02:34.890 --> 02:36.990
will he get,

02:36.990 --> 02:40.650
calculate the total reward that he got from those steps,

02:40.650 --> 02:42.720
and, he will put it through his network.

02:42.720 --> 02:45.180
He will put it through his neural network

02:45.180 --> 02:47.430
that's governing the decision-making process.

02:47.430 --> 02:50.730
And, then, the neural network will learn from that.
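
NOTE
Editor's sketch, not from the course material: one common way to turn the rewards of N steps into a single learning target. The talk says "total reward"; the discount factor gamma and the bootstrap value added at the end are assumptions taken from the usual N-step formulation, and n_step_target is a hypothetical name.
```python
def n_step_target(rewards, bootstrap_value, gamma=0.9):
    """Discounted sum of the collected rewards plus a discounted bootstrap value."""
    target = bootstrap_value
    for reward in reversed(rewards):  # fold from the last step back to the first
        target = reward + gamma * target
    return target
rewards = [0.0, 0.0, 0.0, 1.0]  # four steps, the reward only arrives at the end
target = n_step_target(rewards, bootstrap_value=0.0)
```
With these assumed rewards the target for the first step is about 0.729 (0.9 cubed), so the outcome four steps later already informs the first action, whereas a one-step target there would see only the immediate reward of 0.0.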

02:50.730 --> 02:52.050
So, which one, right away,

02:52.050 --> 02:54.120
like, which one do you think is more powerful?

02:54.120 --> 02:56.400
The guy that is just taking one step at a time

02:56.400 --> 02:59.070
and, kind of like, poking in the blind or in the dark,

02:59.070 --> 02:59.903
and, he's like, okay,

02:59.903 --> 03:01.470
so, I'm gonna take a step, see what happens.

03:01.470 --> 03:02.700
We're gonna take a step, see what happens.

03:02.700 --> 03:03.990
We're gonna take a step, see what happens.

03:03.990 --> 03:05.220
The guy at the top.

03:05.220 --> 03:07.200
Or, the guy that takes just,

03:07.200 --> 03:11.130
very courageously marches through four steps in a row,

03:11.130 --> 03:12.090
and, then,

03:12.090 --> 03:14.400
he decides whether those were good steps or not

03:14.400 --> 03:15.600
all together.

03:15.600 --> 03:17.220
And, why you can see here,

03:17.220 --> 03:18.990
or, why you're probably getting a sense for why

03:18.990 --> 03:21.420
the second guy is better or is more powerful

03:21.420 --> 03:25.140
is because the second guy actually knows what's at the end.

03:25.140 --> 03:26.760
The first guy, when he's,

03:26.760 --> 03:28.860
when he's assessing whether this step is good or not,

03:28.860 --> 03:31.320
he's only looking at the reward that he's getting.

03:31.320 --> 03:33.060
And, so he's only guided by the reward

03:33.060 --> 03:34.440
the environment is giving him.

03:34.440 --> 03:36.120
Same thing here, he's only guided

03:36.120 --> 03:39.233
by the reward that this environment is giving him here.

03:39.233 --> 03:43.830
So, every time, that's his only kind of compass that he has.

03:43.830 --> 03:45.603
The reward, the reward, the reward.

03:46.560 --> 03:48.480
Whereas here, the,

03:48.480 --> 03:51.300
he actually can assess, after taking all these steps,

03:51.300 --> 03:52.230
he can assess, oh, okay,

03:52.230 --> 03:53.970
so, I did get to the finish line.

03:53.970 --> 03:56.730
So, this combination of steps was good.

03:56.730 --> 03:57.870
All of them were good.

03:57.870 --> 04:00.240
Or, oh no, I ended up in the fire pit.

04:00.240 --> 04:03.810
Or, oh no, I didn't win the,

04:03.810 --> 04:05.520
my car didn't get to the finish line,

04:05.520 --> 04:07.110
or, I crossed the sand wall,

04:07.110 --> 04:08.640
or, I lost the game of doom,

04:08.640 --> 04:09.473
or, something like that,

04:09.473 --> 04:11.070
and, then, he decides,

04:11.070 --> 04:13.650
this whole combination of steps is bad.

04:13.650 --> 04:14.910
And, therefore,

04:14.910 --> 04:16.950
for these steps that are earlier on,

04:16.950 --> 04:18.150
he has more information,

04:18.150 --> 04:20.340
he has more insights.

04:20.340 --> 04:23.490
Like, in a very intuitive approach.

04:23.490 --> 04:24.930
Again, this is a much more complex topic

04:24.930 --> 04:26.010
than we're portraying here.

04:26.010 --> 04:28.950
But, in an intuitive way, for example, if we take this step

04:28.950 --> 04:31.323
this step only has information,

04:31.323 --> 04:32.460
to update it,

04:32.460 --> 04:34.722
you only have information coming back from this reward here.

04:34.722 --> 04:36.797
And, for this step, in this case,

04:36.797 --> 04:38.610
the same exact step,

04:38.610 --> 04:39.810
it has more information.

04:39.810 --> 04:41.820
It has information coming all the way from,

04:41.820 --> 04:43.920
okay, so what was the outcome after four steps

04:43.920 --> 04:45.480
or five steps or whatever.

04:45.480 --> 04:46.950
Yeah. So, that is,

04:46.950 --> 04:48.605
that is how it works.

04:48.605 --> 04:50.100
And, why it's called eligibility trace

04:50.100 --> 04:51.960
is because during this process,

04:51.960 --> 04:55.860
not only does he look at the cumulative reward of this,

04:55.860 --> 04:57.000
of what's going on,

04:57.000 --> 04:58.170
and, then, the cumulative loss,

04:58.170 --> 04:59.736
and, then, all that is

04:59.736 --> 05:00.660
(indistinct) propagated through a network.

05:00.660 --> 05:03.810
But, actually there's a trace of eligibility.

05:03.810 --> 05:05.190
That's why it's called eligibility trace.

05:05.190 --> 05:08.700
There's a trace that is kept in,

05:08.700 --> 05:11.280
in the algorithm which says,

05:11.280 --> 05:14.220
okay, so if we do get a,

05:14.220 --> 05:15.480
let's say we get a punishment,

05:15.480 --> 05:17.340
we get a negative reward,

05:17.340 --> 05:19.110
then which of these steps

05:19.110 --> 05:23.130
is most likely to be eligible for that punishment?

05:23.130 --> 05:27.390
So, not only do we know what, overall, this whole pattern,

05:27.390 --> 05:29.100
or, this combination of steps is,

05:29.100 --> 05:32.640
but, we also keep a trace of eligibility.

05:32.640 --> 05:36.360
Which steps are we going to update if we get a reward?

05:36.360 --> 05:37.950
So, for instance, if it's a negative reward,

05:37.950 --> 05:40.830
we might have an eligibility trace that indicates to us

05:40.830 --> 05:44.010
that this is the step that is most responsible

05:44.010 --> 05:45.840
for what we got in the end.

05:45.840 --> 05:47.130
Or, if it's a positive reward,

05:47.130 --> 05:48.629
again, we might know,

05:48.629 --> 05:51.720
the algorithm helps us keep track.

05:51.720 --> 05:53.790
This eligibility trace algorithm

05:53.790 --> 05:56.190
helps us keep track of what's,

05:56.190 --> 05:59.281
what step or what action needs to be,

05:59.281 --> 06:02.040
is eligible to be updated

06:02.040 --> 06:03.840
based on that reward that we get.

06:03.840 --> 06:06.150
And, that's why it's called eligibility trace.
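
NOTE
Editor's sketch, not from the course material: the textbook backward-view trace (tabular TD(lambda) with accumulating traces), which is related to, but not the same mechanism as, the N-step target. It shows how a trace decides which earlier steps are eligible for a punishment. GAMMA, LAMBDA, ALPHA, td_lambda_episode, and the four-state episode are all assumptions.
```python
GAMMA = 0.9   # assumed discount factor
LAMBDA = 0.8  # assumed trace-decay rate
ALPHA = 0.5   # assumed learning rate
def td_lambda_episode(values, transitions):
    """Run backward-view TD(lambda) over (state, reward, next_state, done) tuples."""
    trace = [0.0] * len(values)
    for state, reward, next_state, done in transitions:
        next_value = 0.0 if done else values[next_state]
        td_error = reward + GAMMA * next_value - values[state]
        trace = [GAMMA * LAMBDA * e for e in trace]  # older steps become less eligible
        trace[state] += 1.0                          # the state just visited is fully eligible
        for s in range(len(values)):
            values[s] += ALPHA * td_error * trace[s]  # share the error by eligibility
    return trace
values = [0.0, 0.0, 0.0, 0.0]
episode = [(0, 0.0, 1, False), (1, 0.0, 2, False), (2, 0.0, 3, False), (3, -1.0, 3, True)]
trace = td_lambda_episode(values, episode)
```
After the final punishment of -1.0, every visited state is blamed, but the most recent one the most: under these assumptions state 3 drops to -0.5 and state 0 to roughly -0.19.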

06:06.150 --> 06:07.980
And, so, that's the basic intuition

06:07.980 --> 06:08.910
behind eligibility trace.

06:08.910 --> 06:11.070
And, hopefully these two examples

06:11.070 --> 06:13.980
of these two agents make it quite obvious

06:13.980 --> 06:15.690
or quite intuitive

06:15.690 --> 06:18.390
why eligibility trace can be so powerful.

06:18.390 --> 06:21.660
And, if, as promised, if you'd like to delve further

06:21.660 --> 06:25.410
into the topic of eligibility traces or N-step Q-learning,

06:25.410 --> 06:29.220
then a wonderful, amazing book,

06:29.220 --> 06:30.270
which you can find, is called

06:30.270 --> 06:32.010
Reinforcement Learning: An Introduction.

06:32.010 --> 06:36.690
It's by Richard Sutton and Andrew Barto, 1998.

06:36.690 --> 06:39.600
I think they're in the process of creating a second edition

06:39.600 --> 06:40.830
or, they're creating a second edition,

06:40.830 --> 06:45.830
but, this is the most common, or the most popular,

06:45.840 --> 06:49.320
or the most referenced book on reinforcement learning.

06:49.320 --> 06:53.160
It's got a ridiculous number of citations,

06:53.160 --> 06:56.820
I think like tens of thousands, if I'm not mistaken.

06:56.820 --> 07:01.110
And, also the chapter you need for this is chapter seven.

07:01.110 --> 07:04.740
So, in order to look at eligibility traces,

07:04.740 --> 07:05.820
there's a whole chapter about it.

07:05.820 --> 07:06.900
Chapter seven.

07:06.900 --> 07:08.190
You can read about it and

07:08.190 --> 07:10.260
it goes into lots of detail,

07:10.260 --> 07:13.680
forward, backward eligibility traces,

07:13.680 --> 07:14.910
and, also, how, you know,

07:14.910 --> 07:16.950
you've got temporal difference on one hand,

07:16.950 --> 07:18.360
and, on the other end of the spectrum

07:18.360 --> 07:20.160
you have Monte Carlo Methods,

07:20.160 --> 07:22.440
in between, you have eligibility traces.

07:22.440 --> 07:24.420
So, eligibility traces are your link

07:24.420 --> 07:26.010
to go from temporal differences,

07:26.010 --> 07:27.270
to Monte Carlo Methods.
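
NOTE
Editor's sketch, not from the book or the course: one way to see the spectrum just described is through the N-step return, where N = 1 gives one-step temporal difference and N = episode length gives Monte Carlo. The function n_step_return, the discount of 0.9, and the value estimates are assumptions.
```python
def n_step_return(rewards, values, n, gamma=0.9):
    """Return from time 0: n real rewards, then the estimate of the state reached."""
    steps = min(n, len(rewards))
    g = sum(gamma ** k * rewards[k] for k in range(steps))
    if steps < len(rewards):  # episode not finished yet: bootstrap from the estimate
        g += gamma ** steps * values[steps]
    return g
rewards = [0.0, 0.0, 0.0, 1.0]  # reward received after each of four steps
values = [0.0, 0.5, 0.5, 0.5]   # assumed value estimates for states 0 to 3
td_return = n_step_return(rewards, values, n=1)   # one real reward, then an estimate
mid_return = n_step_return(rewards, values, n=2)  # two real rewards, then an estimate
mc_return = n_step_return(rewards, values, n=4)   # real rewards only, no estimate
```
With these assumed numbers the three returns are about 0.45, 0.405, and 0.729: the longer the agent waits, the more the target relies on real rewards and the less on its own estimates.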

07:27.270 --> 07:28.680
Very interesting read.

07:28.680 --> 07:31.620
Lots of pictures, which I really, really appreciated.

07:31.620 --> 07:34.230
Very intuitive explanations.

07:34.230 --> 07:36.870
So, there's lots of things that you can learn from this book

07:36.870 --> 07:41.250
about artificial intelligence and reinforcement learning.

07:41.250 --> 07:44.280
But, specifically eligibility traces are,

07:44.280 --> 07:46.710
like, a very good place to go to

07:46.710 --> 07:49.320
is this book for eligibility traces.

07:49.320 --> 07:51.810
And, the second reference for today

07:51.810 --> 07:54.120
is something that (indistinct) going to show you

07:54.120 --> 07:56.640
in the practical tutorials,

07:56.640 --> 07:58.800
the deep learning,

07:58.800 --> 08:02.130
or, the Google DeepMind research paper

08:02.130 --> 08:05.280
on Asynchronous Methods for Deep Reinforcement Learning.

08:05.280 --> 08:06.780
Yes, that's the paper.

08:06.780 --> 08:08.340
That's the one paper that,

08:08.340 --> 08:11.040
the A3C paper that we're going to be discussing

08:11.040 --> 08:12.270
further down in this course.

08:12.270 --> 08:14.520
We're getting closer and closer to it.

08:14.520 --> 08:15.353
And,

08:15.353 --> 08:18.390
as you can tell, we're pretty excited about this.

08:18.390 --> 08:22.073
So, we're going to be looking a little bit

08:22.073 --> 08:25.590
at how they implemented eligibility traces in this paper.

08:25.590 --> 08:27.750
So, we're going to be using this more

08:27.750 --> 08:29.400
for the practical side of things.

08:29.400 --> 08:31.050
So, hopefully you enjoyed today's tutorial,

08:31.050 --> 08:32.340
and, now you're a bit more comfortable

08:32.340 --> 08:33.990
with eligibility traces.

08:33.990 --> 08:35.910
And, I can't wait to see you next time.

08:35.910 --> 08:37.803
Until then, enjoy AI.