WEBVTT

00:00.930 --> 00:01.763
Hello and welcome back

00:01.763 --> 00:04.140
to the course on Artificial Intelligence.

00:04.140 --> 00:07.050
Alright, so I hope you're enjoying the tutorial so far.

00:07.050 --> 00:08.550
We're nearly done with the intuition.

00:08.550 --> 00:10.560
You'll soon, very soon get to the practical side of things.

00:10.560 --> 00:12.570
We've just got a few little things that we need

00:12.570 --> 00:13.500
to cover off.

00:13.500 --> 00:15.330
Alright, so previously we talked

00:15.330 --> 00:20.010
about how we add neural networks into this whole equation

00:20.010 --> 00:22.230
of Q-Learning and take Q-Learning

00:22.230 --> 00:25.680
to the next step and turn it into deep Q-Learning.

00:25.680 --> 00:28.020
And today we're going to add

00:28.020 --> 00:32.250
an extra important feature which you will be coding

00:32.250 --> 00:33.420
in the practical side of things.

00:33.420 --> 00:35.820
So Hadelin and I decided

00:35.820 --> 00:38.070
that it's important for us to cover it off

00:38.070 --> 00:40.440
in the intuition side of things so that you're more prepared

00:40.440 --> 00:42.420
for it when it comes in the coding side of things.

00:42.420 --> 00:46.470
So as we discussed, we've got the network there.

00:46.470 --> 00:47.970
There's two parts that happen.

00:47.970 --> 00:49.140
First of all, it's the learning.

00:49.140 --> 00:52.920
So the network actually learns with every new state.

00:52.920 --> 00:55.680
It slowly updates its weights to get better

00:55.680 --> 00:58.830
and better and better at dealing with this environment.

00:58.830 --> 01:01.980
And then there is the acting inside the state.

01:01.980 --> 01:06.600
So after the Q-values have been calculated in the state

01:06.600 --> 01:08.220
then one Q-value is selected.

01:08.220 --> 01:11.820
So today we're still going to talk about the learning part.

01:11.820 --> 01:13.350
We're going to come up

01:13.350 --> 01:16.620
with an interesting feature that's going to, well,

01:16.620 --> 01:19.140
we're not going to come up with this feature ourselves,

01:19.140 --> 01:23.910
but we'll talk about a feature that is very important

01:23.910 --> 01:26.040
for deep Q-Learning

01:26.040 --> 01:29.700
and that feature is called Experience Replay.

01:29.700 --> 01:31.860
Alright, so here is our network.

01:31.860 --> 01:34.590
So we've just copied it over here.

01:34.590 --> 01:36.870
We've got that loss that is calculated

01:36.870 --> 01:39.090
at the bottom and is backpropagated through our network.

01:39.090 --> 01:41.190
And let's have a look at an example

01:41.190 --> 01:44.640
of what happens to understand the problem that we're dealing

01:44.640 --> 01:45.660
with a bit better.

01:45.660 --> 01:49.140
So here is an example actually from this course.

01:49.140 --> 01:53.100
This is a screenshot exactly from this course.

01:53.100 --> 01:54.780
This is what you'll be programming.

01:54.780 --> 01:56.700
This is a self-driving car,

01:56.700 --> 02:00.750
that is driving through this, along this road.

02:00.750 --> 02:03.780
And it has to learn how to navigate this road.

02:03.780 --> 02:06.750
And so what is, as we discussed previously,

02:06.750 --> 02:09.300
what is this state?

02:09.300 --> 02:12.269
And of course the state is not just gonna be x1 and x2.

02:12.269 --> 02:14.160
Hadelin will describe

02:14.160 --> 02:16.620
in a lot more detail what the state is.

02:16.620 --> 02:20.370
It is gonna be a couple of parameters which relate

02:20.370 --> 02:24.960
to the angle of the car and some relative parameters,

02:24.960 --> 02:26.490
what the sensors are reading and so on.

02:26.490 --> 02:28.770
So there's gonna be more parameters than that

02:28.770 --> 02:29.850
to describe the state,

02:29.850 --> 02:31.620
but nevertheless, there's going to be a vector of values.

02:31.620 --> 02:33.780
It's gonna go through a neural network.

02:33.780 --> 02:36.480
And then on the output, you're gonna have some Q-values.

02:36.480 --> 02:38.310
Again, there'll be a different,

02:38.310 --> 02:40.085
depending on the environment,

02:40.085 --> 02:43.260
there can be a different number of actions,

02:43.260 --> 02:44.430
possible actions.

02:44.430 --> 02:46.200
But we're just going to, for simplicity's sake,

02:46.200 --> 02:48.090
leave it at four just for us to be able

02:48.090 --> 02:50.790
to understand a bit better what's going on here.

02:50.790 --> 02:52.173
So in this case,

02:53.520 --> 02:54.480
the question is so far,

02:54.480 --> 02:58.560
what is this input into this neural network,

02:58.560 --> 03:00.390
or more specifically,

03:00.390 --> 03:03.510
how often do we trigger this neural network?

03:03.510 --> 03:05.130
How often does this neural network go through?

03:05.130 --> 03:07.350
Well, every time the car ends up in a new state,

03:07.350 --> 03:10.830
so the car makes a move, it ends up in a new state,

03:10.830 --> 03:13.500
and then everything goes, all that data,

03:13.500 --> 03:15.330
all that information about the state goes

03:15.330 --> 03:16.290
through the network,

03:16.290 --> 03:18.264
Q-values are calculated,

03:18.264 --> 03:20.021
this error is calculated,

03:20.021 --> 03:22.589
based on what we discussed in previous tutorials.

03:22.589 --> 03:25.320
This error is backpropagated through the network,

03:25.320 --> 03:26.153
weights are updated,

03:26.153 --> 03:28.260
then the car selects which action it wants to take,

03:28.260 --> 03:31.387
makes that move, ends up in a new state,

03:31.387 --> 03:34.440
in the new state, everything starts over again.

03:34.440 --> 03:37.310
And so basically this happens every time the car is

03:37.310 --> 03:38.430
in a new state.

03:38.430 --> 03:39.900
Well, have a look at this example.

03:39.900 --> 03:42.630
I specifically took this screenshot because

03:42.630 --> 03:46.980
it illustrates very well the problem that is addressed

03:46.980 --> 03:48.480
through Experience Replay.

03:48.480 --> 03:50.700
And Experience Replay is not just something that we use

03:50.700 --> 03:52.770
in this course or in this specific problem.

03:52.770 --> 03:56.645
It is something that you will see used throughout,

03:56.645 --> 03:59.490
like on and on and over and over again,

03:59.490 --> 04:02.850
in artificial intelligence algorithms,

04:02.850 --> 04:05.130
because it is so powerful and it's so important.

04:05.130 --> 04:06.360
So look at this car,

04:06.360 --> 04:09.600
this car in this problem or in this environment,

04:09.600 --> 04:12.420
its goal is to go from here to here and back.

04:12.420 --> 04:14.310
Its goal is to navigate its way from here to here

04:14.310 --> 04:17.790
without crossing these walls, which are made of sand.

04:17.790 --> 04:19.650
And so the car started over here,

04:19.650 --> 04:23.790
it went down and like its reward is based on, you know,

04:23.790 --> 04:25.140
how close it is to start.

04:25.140 --> 04:27.570
So the car went from here, it went down,

04:27.570 --> 04:29.610
and kept going like this, like this, like this, like this.

04:29.610 --> 04:31.560
Along this wall, along this wall.

04:31.560 --> 04:33.480
And what it's going to do next is turn,

04:33.480 --> 04:34.920
it's going to keep going.

04:34.920 --> 04:37.710
Well, what you want it to do is keep going here.

04:37.710 --> 04:39.570
But let's think about it for a second.

04:39.570 --> 04:41.430
Once it got to this wall,

04:41.430 --> 04:43.440
every single time it moves forward,

04:43.440 --> 04:45.030
it moves forward, it moves forward,

04:45.030 --> 04:46.830
it moves forward, it moves forward, it moves forward

04:46.830 --> 04:48.570
it moves forward and so on, it moves forward.

04:48.570 --> 04:50.100
So there might be like,

04:50.100 --> 04:51.630
depending on the structure of the environment,

04:51.630 --> 04:54.663
it could be like 100 moves here, or 50 moves here,

04:54.663 --> 04:56.100
where it just keeps moving forward,

04:56.100 --> 04:57.750
forward, forward, forward, forward, forward.

04:57.750 --> 05:00.390
And nothing changes, nothing really changes.

05:00.390 --> 05:02.400
Yes, it gets further away from this target,

05:02.400 --> 05:04.230
closer to this target, that's lovely,

05:04.230 --> 05:06.750
but in terms of the surrounding environment,

05:06.750 --> 05:08.430
not many things are changing.

05:08.430 --> 05:10.050
It's still that same wall.

05:10.050 --> 05:11.460
If you are sitting in the car,

05:11.460 --> 05:12.750
you've probably seen this situation,

05:12.750 --> 05:16.050
when you're driving, whatever you're seeing,

05:16.050 --> 05:18.660
is like the environment is so monotonous,

05:18.660 --> 05:20.940
that you're just seeing kind of the same thing

05:20.940 --> 05:21.870
just passing by.

05:21.870 --> 05:24.660
But like imagine you're driving through a desert

05:24.660 --> 05:26.190
and you're just seeing the same thing,

05:26.190 --> 05:27.750
it's the same sand, it's the same sand,

05:27.750 --> 05:30.510
nothing is happening, nothing is changing.

05:30.510 --> 05:34.530
And so every single time we're putting that state,

05:34.530 --> 05:36.960
that new state, into here.

05:36.960 --> 05:38.970
Yes, of course, something might be changing.

05:38.970 --> 05:40.110
For instance, you're driving the car,

05:40.110 --> 05:43.530
and your GPS is showing you're closer to your destination.

05:43.530 --> 05:45.990
So one of these inputs is changing,

05:45.990 --> 05:47.520
but a lot of these other inputs,

05:47.520 --> 05:50.640
the sensors for instance which are on the car,

05:50.640 --> 05:51.930
they're not changing.

05:51.930 --> 05:53.310
And therefore as you're driving,

05:53.310 --> 05:55.860
so in this state you input the inputs

05:55.860 --> 05:57.060
into your neural network here,

05:57.060 --> 06:00.900
here, here, here, here, here, here, here and here,

06:00.900 --> 06:03.240
all the time, the inputs are pretty much the same.

06:03.240 --> 06:07.470
And so if you keep inputting the same inputs,

06:07.470 --> 06:08.820
the same values, the same vector,

06:08.820 --> 06:12.180
or very similar vectors into your network,

06:12.180 --> 06:14.280
because there is no variety,

06:14.280 --> 06:17.400
the car will learn very well one thing.

06:17.400 --> 06:19.950
It'll learn very well how to drive along this wall,

06:19.950 --> 06:21.720
which is on its right.

06:21.720 --> 06:23.970
And so that's how the network will update,

06:23.970 --> 06:24.960
and it will get rewarded,

06:24.960 --> 06:27.960
will slowly start getting rewarded for driving so well.

06:27.960 --> 06:28.890
It'll be like, oh, okay.

06:28.890 --> 06:30.907
So from here it'll start learning,

06:30.907 --> 06:33.060
"Oh, I'm doing so good, I'm doing even better.

06:33.060 --> 06:33.990
I'm doing it better."

06:33.990 --> 06:38.220
It will have this false perception

06:38.220 --> 06:40.350
that it's actually doing very well,

06:40.350 --> 06:43.530
even though it only learned how to drive along this wall.

06:43.530 --> 06:46.470
And so the neural network will become very adapted

06:46.470 --> 06:47.550
to driving along this wall.

06:47.550 --> 06:49.290
And then all of a sudden there's this curve

06:49.290 --> 06:51.300
and the car doesn't know what to do.

06:51.300 --> 06:55.410
And it completely doesn't fit in with this neural network.

06:55.410 --> 06:58.860
And even if it does adjust, somehow,

06:58.860 --> 07:00.900
let's hypothetically say it passes this part,

07:00.900 --> 07:02.250
and then it ends up on this wall,

07:02.250 --> 07:03.210
same thing is gonna happen,

07:03.210 --> 07:05.310
it's gonna drive from here, here, here.

07:05.310 --> 07:07.920
Okay. Now the neural network is restructuring itself

07:07.920 --> 07:10.890
to adapt to this wall, and then bam, this thing happens.

07:10.890 --> 07:13.320
And then even if somehow it gets past that,

07:13.320 --> 07:14.670
it'll drive past this thing.

07:14.670 --> 07:16.290
And then same thing along these lines.

07:16.290 --> 07:20.250
So basically it's like a very vivid example

07:20.250 --> 07:22.800
of the problem that we have is that,

07:22.800 --> 07:25.500
because of the way we're using the neural network,

07:25.500 --> 07:27.870
updating it with every single state,

07:27.870 --> 07:29.820
once we have lots of consecutive states,

07:29.820 --> 07:30.900
they don't even have to be the same,

07:30.900 --> 07:34.530
but there is, in environments it's normal

07:34.530 --> 07:39.240
that consecutive states are somehow correlated,

07:39.240 --> 07:40.980
or somehow interdependent.

07:40.980 --> 07:45.570
And we don't want that interdependency to bias our network.

07:45.570 --> 07:49.224
We don't want the car to just learn how to drive along

07:49.224 --> 07:52.290
like a straight line or along a curved line,

07:52.290 --> 07:57.290
or like anything that you can think of in life

07:59.070 --> 08:01.800
where an agent would be navigating an environment,

08:01.800 --> 08:03.900
wherever you can think of correlated

08:03.900 --> 08:08.010
or interdependent states that come one after another,

08:08.010 --> 08:12.150
that can really mess up your neural network,

08:12.150 --> 08:15.390
if you're just going to let the agent learn from that.

08:15.390 --> 08:18.120
And that's where Experience Replay comes in.

08:18.120 --> 08:20.730
What happens in Experience Replay is,

08:20.730 --> 08:23.940
these experiences, so these states that it's in

08:23.940 --> 08:26.850
one, two, three, however many, say 50 states here in a row,

08:26.850 --> 08:31.470
they don't get put through the network right away.

08:31.470 --> 08:35.612
They're actually saved into the memory of the agent.

08:35.612 --> 08:37.860
And so for instance, it saves all these,

08:37.860 --> 08:38.790
and saves all these,

08:38.790 --> 08:40.050
and at some point,

08:40.050 --> 08:42.180
once it reaches a certain threshold,

08:42.180 --> 08:43.350
which you'll be able to code

08:43.350 --> 08:45.120
and Hadlan will show how to do that,

08:45.120 --> 08:47.250
once it reaches a certain threshold,

08:47.250 --> 08:50.287
then the agent decides for itself,

08:50.287 --> 08:51.300
"Okay, it's time to learn.

08:51.300 --> 08:54.780
I have this batch of experiences

08:54.780 --> 08:56.580
and now I'm going to learn from that batch."

08:56.580 --> 09:00.450
And so it randomly selects a uniformly distributed sample,

09:00.450 --> 09:02.910
and uniformly is key here,

09:02.910 --> 09:05.913
because that's something we'll talk about on the next slide.

09:06.768 --> 09:08.160
We'll mention that,

09:08.160 --> 09:12.510
but it takes a uniformly distributed sample,

09:12.510 --> 09:15.660
so basically all experiences are considered to be equal.

09:15.660 --> 09:17.670
It takes a uniformly distributed sample

09:17.670 --> 09:20.580
from that batch of experiences that it has

09:20.580 --> 09:24.750
and then it goes through them and it learns from them.

09:24.750 --> 09:26.630
So it doesn't take all the experiences

09:26.630 --> 09:28.350
it just takes a uniformly distributed sample.

09:28.350 --> 09:29.580
So it might take a couple from here,

09:29.580 --> 09:31.554
couple from here, couple from here.

09:31.554 --> 09:34.950
And each experience is characterized by the state

09:34.950 --> 09:37.500
it was in, the action that it took,

09:37.500 --> 09:40.099
the state it ended up in, and the reward

09:40.099 --> 09:44.820
it achieved through that action in that specific state.

09:44.820 --> 09:46.770
So four elements in each experience,

09:46.770 --> 09:50.160
state one, action, state two and reward.

09:50.160 --> 09:52.050
And so it takes all those experiences

09:52.050 --> 09:54.660
and then it passes them through the network and it learns,

09:54.660 --> 09:59.660
and that way it breaks the pattern of that bias

10:00.180 --> 10:04.110
which comes from the sequential nature

10:04.110 --> 10:06.210
of the experiences if you were to put them

10:06.210 --> 10:08.340
through the network one after the other.

10:08.340 --> 10:11.910
So that's the main focus of Experience Replay.

10:11.910 --> 10:14.370
That's the problem it addresses.
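
NOTE
Editor's sketch, not the course code: one way the idea above could look in Python.
The names ReplayMemory, threshold, push, ready and sample are hypothetical and chosen
only for illustration. Each experience holds the four elements named in the lesson, and
learning draws a uniformly random sample instead of consecutive steps.
```python
import random
from collections import namedtuple
# One experience: state, action taken, state it ended up in, reward received.
Experience = namedtuple("Experience", ["state", "action", "next_state", "reward"])
class ReplayMemory:
    """Hypothetical memory that stores experiences and samples them uniformly."""
    def __init__(self, threshold):
        self.threshold = threshold  # assumed: experiences needed before learning starts
        self.memory = []
    def push(self, state, action, next_state, reward):
        self.memory.append(Experience(state, action, next_state, reward))
    def ready(self):
        return len(self.memory) >= self.threshold
    def sample(self, batch_size):
        # random.sample draws without replacement; every experience is equally likely.
        return random.sample(self.memory, batch_size)
# 50 near-identical steps along the wall, then a single corner experience.
memory = ReplayMemory(threshold=40)
for step in range(50):
    memory.push(("wall", step), "forward", ("wall", step + 1), 0.1)
memory.push(("corner", 50), "right", ("wall", 51), 1.0)
if memory.ready():
    batch = memory.sample(batch_size=8)  # scattered, non-consecutive experiences
```
The sampled batch is what would be passed through the network, so the update no longer
sees 50 correlated wall states in a row.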

10:14.370 --> 10:16.650
And another benefit of experience replay is

10:16.650 --> 10:19.650
that sometimes in an environment like this,

10:19.650 --> 10:22.410
you might have very valuable rare experiences.

10:22.410 --> 10:24.960
So for instance, I don't know, let's say,

10:24.960 --> 10:26.370
let's look at this corner, right?

10:26.370 --> 10:28.740
This is a right corner, right?

10:28.740 --> 10:29.760
And a very sharp one.

10:29.760 --> 10:30.870
How many sharp?

10:30.870 --> 10:32.580
So it'll be coming from here,

10:32.580 --> 10:35.640
assuming it's going to be hugging this corner.

10:35.640 --> 10:38.220
So how many sharp right corners do we have

10:38.220 --> 10:39.330
in this whole environment?

10:39.330 --> 10:40.950
We only have one right corner here,

10:40.950 --> 10:42.273
and one right corner here,

10:43.650 --> 10:45.060
Right? So when it's coming this way,

10:45.060 --> 10:46.380
that's the right corner.

10:46.380 --> 10:47.400
And then when it's going back,

10:47.400 --> 10:48.630
it's a sharp right corner here.

10:48.630 --> 10:50.190
So, and this one's not sharp, this one is sharp.

10:50.190 --> 10:51.750
So there's only one opportunity

10:51.750 --> 10:56.750
in the whole environment to learn from a sharp right corner.

10:56.940 --> 10:59.970
And that's an important experience

10:59.970 --> 11:01.650
because it might get really good

11:01.650 --> 11:03.090
at driving along straight lines,

11:03.090 --> 11:05.190
get really good at doing like soft corners

11:05.190 --> 11:06.480
like that, like that,

11:06.480 --> 11:08.910
but then it'll keep messing up

11:08.910 --> 11:10.410
this sharp right corner simply

11:10.410 --> 11:14.970
because it doesn't have that much opportunity

11:14.970 --> 11:15.803
to learn from it.

11:15.803 --> 11:17.940
And so therefore it'll learn everything else very quickly

11:17.940 --> 11:18.990
but it'll take a long time

11:18.990 --> 11:20.220
to learn this right corner.

11:20.220 --> 11:21.900
It's a very simplified example,

11:21.900 --> 11:24.180
it's a very simplified explanation,

11:24.180 --> 11:25.860
but it illustrates the concept

11:25.860 --> 11:28.350
that sometimes there are rare experiences

11:28.350 --> 11:30.240
which can be valuable.

11:30.240 --> 11:32.670
And if you are just doing a simple neural network

11:32.670 --> 11:35.130
where you're putting in your values here,

11:35.130 --> 11:37.890
and you know they're going through, and you know,

11:37.890 --> 11:39.900
like even if we forget about that problem

11:39.900 --> 11:41.670
of the sequential nature of experiences

11:41.670 --> 11:44.438
and how they can be interdependent or correlated,

11:44.438 --> 11:46.770
even if we forget about that for a second,

11:46.770 --> 11:49.170
what happens is once you put an experience in,

11:49.170 --> 11:51.030
it goes through, network's updated,

11:51.030 --> 11:53.340
then you instantly forget about that experience

11:53.340 --> 11:54.420
and you move on to the next one.

11:54.420 --> 11:56.220
That's just how the neural network works.

11:56.220 --> 11:57.840
Then you move on to the next state, the next state,

11:57.840 --> 11:59.070
the next state, the next experience, the next experience,

11:59.070 --> 12:01.170
the next experience and so on.

12:01.170 --> 12:02.430
So this right corner,

12:02.430 --> 12:04.230
as soon as it goes through the network, it's gone.

12:04.230 --> 12:07.560
And you don't have any memory of that valuable experience.

12:07.560 --> 12:09.030
Whereas with Experience Replay,

12:09.030 --> 12:12.063
because you're putting these experiences into batches,

12:13.140 --> 12:15.660
you can organize your batch as a rolling window.

12:15.660 --> 12:18.270
So for instance, you could have like 100 batches,

12:18.270 --> 12:20.370
so 100 experiences in your batch,

12:20.370 --> 12:22.120
so when it's coming back from here,

12:23.160 --> 12:27.360
as soon as it has recorded this experience in its batch,

12:27.360 --> 12:32.280
then like at some point it takes a uniformly distributed sample

12:32.280 --> 12:33.900
from its batch of experiences,

12:33.900 --> 12:35.070
and then there's a rolling window.

12:35.070 --> 12:36.420
So it forgets these experiences,

12:36.420 --> 12:37.980
but then it keeps these experiences,

12:37.980 --> 12:40.440
and then again it learns from, once it's here,

12:40.440 --> 12:42.570
it learns from this batch.

12:42.570 --> 12:45.420
And then once it's here, it forgets all the way up to here.

12:45.420 --> 12:48.030
But then it has a batch of experiences like that.

12:48.030 --> 12:50.700
So therefore now it learns from these experiences.

12:50.700 --> 12:54.180
And that way what you're getting is that

12:54.180 --> 12:58.140
this right hand corner might come up several times

12:58.140 --> 13:00.840
in its learning process because it was in that batch

13:00.840 --> 13:02.430
when the batch was like this,

13:02.430 --> 13:04.440
around there, then it was in the batch here,

13:04.440 --> 13:05.273
in the batch here.

13:05.273 --> 13:07.200
So it came up in several batches,

13:07.200 --> 13:08.790
because a batch might be updated

13:08.790 --> 13:11.520
as a rolling window of experiences.

13:11.520 --> 13:13.047
So the older experiences get kicked out,

13:13.047 --> 13:14.610
the newer experiences are added,

13:14.610 --> 13:16.410
and then again, older experience kicked out,

13:16.410 --> 13:17.610
so an experience,

13:17.610 --> 13:19.590
it stays in the batch for quite some time

13:19.590 --> 13:22.860
and the car or agent can learn

13:22.860 --> 13:24.210
from that experience several times.

13:24.210 --> 13:27.540
So that's another advantage of Experience Replay.
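
NOTE
Editor's sketch, under assumptions: the rolling window described above can be modelled
with a fixed-size deque. The capacity of 100, the batch size of 16 and the choice to
learn every 10 steps are illustrative values, not taken from the course code.
```python
import random
from collections import deque
random.seed(0)  # fixed seed so the run is repeatable
window = deque(maxlen=100)  # oldest experience is dropped once capacity is reached
times_sampled = 0
for step in range(300):
    # A single rare experience occurs at step 120; everything else is routine.
    window.append("sharp_right_corner" if step == 120 else f"step_{step}")
    if len(window) == window.maxlen and step % 10 == 0:
        batch = random.sample(list(window), 16)  # uniform sample from the window
        times_sampled += "sharp_right_corner" in batch
# The corner stays in the window for 100 steps, so it can be drawn in up to
# 10 learning rounds here before it is finally kicked out.
```
Learning one state at a time would see the corner exactly once; here it has several
chances to be replayed while it remains inside the window.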

13:27.540 --> 13:29.400
And of course the final advantage is

13:29.400 --> 13:32.280
Experience Replay gives you an opportunity to learn

13:32.280 --> 13:35.190
from more experiences than if you were just learning

13:35.190 --> 13:36.360
from one at a time,

13:36.360 --> 13:37.800
because you have that batch

13:37.800 --> 13:40.230
and therefore, and it's a rolling window,

13:40.230 --> 13:43.200
and therefore, even if your environment is limited

13:43.200 --> 13:47.220
in experiences, the Experience Replay approach

13:47.220 --> 13:49.380
can help you learn faster.

13:49.380 --> 13:51.990
And instead of just redoing the environment

13:51.990 --> 13:54.090
many many times, you can learn faster

13:54.090 --> 13:55.710
because you don't have to redo it.

13:55.710 --> 13:57.810
You have those experiences saved.

13:57.810 --> 13:59.910
So those are the main advantages of Experience Replay.

13:59.910 --> 14:01.542
Let's recap, then: we've got the,

14:01.542 --> 14:03.540
we're breaking the pattern of interdependence

14:03.540 --> 14:07.514
and correlation of sequential experiences.

14:07.514 --> 14:10.680
We save rare experiences, which might be important,

14:10.680 --> 14:13.140
and therefore we can learn from them more often.

14:13.140 --> 14:16.890
And we can learn in environments,

14:16.890 --> 14:21.360
we can learn faster in environments

14:21.360 --> 14:24.030
which have a shortage of experiences

14:24.030 --> 14:26.370
or which don't have that many experiences

14:26.370 --> 14:27.300
that the agent goes through.

14:27.300 --> 14:29.480
And we're still able to learn there.

14:29.480 --> 14:32.017
So that is what Experience Replay is all about.

14:32.017 --> 14:34.115
If you'd like to read a bit more,

14:34.115 --> 14:36.660
there's an interesting article published

14:36.660 --> 14:39.000
by DeepMind in 2016.

14:39.000 --> 14:41.550
It's called "Prioritized Experience Replay".

14:41.550 --> 14:42.880
And it talks about

14:43.836 --> 14:46.710
why are we using a uniform distribution

14:46.710 --> 14:50.520
to select our experiences from the experience batch,

14:50.520 --> 14:53.460
why don't we find a better way to select our experiences

14:53.460 --> 14:55.350
and prioritize some of the experiences

14:55.350 --> 14:57.030
which we feel are important.

14:57.030 --> 14:58.170
And so it's quite an interesting thing.

14:58.170 --> 15:03.170
So in this case, you will be able to not only reinforce,

15:03.270 --> 15:08.270
not only reinforce your knowledge on Experience Replay,

15:08.340 --> 15:10.290
but you'll actually be able to move

15:10.290 --> 15:12.660
with the cutting edge of technology.

15:12.660 --> 15:15.060
So this is 2016 and published by DeepMind.

15:15.060 --> 15:17.550
So it's a very recent, very powerful paper.

15:17.550 --> 15:20.640
So you'll be able to actually explore the limits,

15:20.640 --> 15:23.130
or explore even further this algorithm,

15:23.130 --> 15:24.540
and take it to the next level.

15:24.540 --> 15:26.010
So I'll leave it up to you to find out

15:26.010 --> 15:30.150
why and how we can change the uniform distribution

15:30.150 --> 15:32.070
to a different approach to Experience Replay

15:32.070 --> 15:33.930
from this paper if you'd like.

15:33.930 --> 15:35.670
And I hope you enjoy today's tutorial,

15:35.670 --> 15:37.710
and now we know what Experience Replay is

15:37.710 --> 15:41.430
and we can confidently use it in our practical tutorials.

15:41.430 --> 15:42.900
And I look forward to seeing you next time.

15:42.900 --> 15:44.733
Until then, enjoy AI.
