WEBVTT

00:00.630 --> 00:01.463
-: Hello and welcome back

00:01.463 --> 00:03.990
to the course on artificial intelligence.

00:03.990 --> 00:05.940
In today's tutorial, we're going to have some fun.

00:05.940 --> 00:07.470
We're going to have a look

00:07.470 --> 00:11.460
at an artificial intelligence actually going through that maze

00:11.460 --> 00:13.680
that we've been talking about so long.

00:13.680 --> 00:17.130
And it's going to use Q-learning to navigate its way,

00:17.130 --> 00:18.450
and find the way out,

00:18.450 --> 00:21.570
and we'll see what happens to the Q values,

00:21.570 --> 00:24.360
what's gonna happen to the policy, and so on.

00:24.360 --> 00:26.310
So, let's have a look.

00:26.310 --> 00:29.760
We're going to be using some materials kindly provided

00:29.760 --> 00:31.890
by UC Berkeley.

00:31.890 --> 00:36.890
So if you go to ai.berkeley.edu, that's b-e-r-k-e-l-e-y,

00:38.190 --> 00:40.770
if you just go to that link, click enter,

00:40.770 --> 00:42.450
you'll see this website.

00:42.450 --> 00:46.470
And here, what we're going to be looking at is,

00:46.470 --> 00:48.270
you need to go to, where do we need to go to?

00:48.270 --> 00:49.743
PacMan projects, I think.

00:50.610 --> 00:52.200
Yeah, PacMan projects.

00:52.200 --> 00:55.650
And here, if you scroll down and you look

00:55.650 --> 00:59.160
at Reinforcement Learning, this is what we're working with.

00:59.160 --> 01:01.710
So here you can download the ZIP Archive.

01:01.710 --> 01:04.320
So, that's if you want to,

01:04.320 --> 01:06.360
you don't have to; we're not going to go

01:06.360 --> 01:08.160
through the solution together in this tutorial.

01:08.160 --> 01:09.780
Just letting you know where this is all from

01:09.780 --> 01:13.290
because we really appreciate that

01:13.290 --> 01:16.200
UC Berkeley has made these materials available.

01:16.200 --> 01:19.320
But, if you do wish to experiment with this on your own

01:19.320 --> 01:21.330
just bear in mind it's not gonna be part

01:21.330 --> 01:23.280
of our course, this is part of the Berkeley course.

01:23.280 --> 01:24.990
I'm just gonna show you how it works

01:24.990 --> 01:26.160
for illustration purposes.

01:26.160 --> 01:27.660
But if you do wanna experiment with this

01:27.660 --> 01:28.830
you can find it here,

01:28.830 --> 01:31.380
the ZIP archive and all the instructions as well.

01:31.380 --> 01:35.070
And we are just going to go into Python right away.

01:35.070 --> 01:37.500
And first thing I wanted to show you is

01:37.500 --> 01:41.130
that here we've got the licensing information.

01:41.130 --> 01:43.590
So, this is what I mean we're very lucky

01:43.590 --> 01:46.530
that they said we are free to use or extend these projects

01:46.530 --> 01:48.510
for educational purposes provided you do not

01:48.510 --> 01:51.180
distribute or publish solutions, which we're not going to do.

01:51.180 --> 01:53.190
You retain this notice, which we have

01:53.190 --> 01:56.010
and you provide a clear attribution to UC Berkeley

01:56.010 --> 01:57.870
including a link, which we also have.

01:57.870 --> 02:00.120
So once again, if you'd like to learn more

02:00.120 --> 02:02.490
there's a link you can have a look and thank you very much

02:02.490 --> 02:05.460
to all of these people who have worked on this project.

02:05.460 --> 02:08.280
So, here's the grid world that we're gonna be working with.

02:08.280 --> 02:09.450
There's a solution there.

02:09.450 --> 02:11.310
You would have to, in order to make it work

02:11.310 --> 02:12.540
you'd have to either solve it yourself

02:12.540 --> 02:15.090
or possibly find a solution.

02:15.090 --> 02:16.740
Maybe some of your, some people,

02:16.740 --> 02:19.140
somebody you know might help you out with that.

02:19.140 --> 02:20.970
If again, you want to, you don't have to

02:20.970 --> 02:22.140
because we are just going to look

02:22.140 --> 02:25.140
at it on this screen right now.

02:25.140 --> 02:27.870
So, after we've created all those files,

02:27.870 --> 02:29.700
we can just launch it over here.

02:29.700 --> 02:32.730
So, there are some parameters that are involved

02:32.730 --> 02:34.860
in this whole world.

02:34.860 --> 02:37.410
And now I'm going to just show you how,

02:37.410 --> 02:39.090
what it looks like if we launch it.

02:39.090 --> 02:41.580
So, let's try launching it in manual mode.

02:41.580 --> 02:45.125
So, if I pass -m, one of these parameters here, manual,

02:45.125 --> 02:47.100
so I can manually control the agent.

02:47.100 --> 02:50.190
So here you can see our grid, so I can go up, up,

02:50.190 --> 02:52.290
so you can see that it's taking action,

02:52.290 --> 02:54.000
starting with "Started in state".

02:54.000 --> 02:56.280
So where I was, and then you saw

02:56.280 --> 02:58.380
that I pressed up, took action north

02:58.380 --> 03:01.500
and the first time I ended up in (0, 1), so I did go up.

03:01.500 --> 03:03.450
But second time I took action north,

03:03.450 --> 03:05.010
and I ended in the same state, so I didn't move.

03:05.010 --> 03:07.260
So something happened, you know, the randomness happened

03:07.260 --> 03:08.760
I either went left or right,

03:08.760 --> 03:10.890
and by default the parameters are set.

03:10.890 --> 03:12.780
You can see here by default they're set

03:12.780 --> 03:14.460
to exactly what we discussed,

03:14.460 --> 03:16.950
that is, how often an action results

03:16.950 --> 03:19.450
in an unintended direction: 20% of the time,

03:19.450 --> 03:21.210
10% to the left, 10% to the right.
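
NOTE
Editor's sketch of the noise just described: a minimal Python illustration, not the Berkeley code.
It assumes the intended move happens 80% of the time and the agent slips 10% to each side.
The names noisy_direction, LEFT_OF and RIGHT_OF are hypothetical and exist only for this example.
```python
import random
# Hypothetical lookup tables: which way is "left" or "right" of each heading.
LEFT_OF = {"north": "west", "west": "south", "south": "east", "east": "north"}
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}
def noisy_direction(intended, noise=0.2, rng=random):
    """Return the intended direction with probability 1 - noise, else slip left or right."""
    roll = rng.random()
    if roll < noise / 2:
        return LEFT_OF[intended]   # about 10% of the time: veer left
    if roll < noise:
        return RIGHT_OF[intended]  # about 10% of the time: veer right
    return intended                # about 80% of the time: go where asked
# Press "up" many times and count where the agent actually tried to go.
counts = {"north": 0, "west": 0, "east": 0}
sampler = random.Random(0)
for _ in range(10_000):
    counts[noisy_direction("north", rng=sampler)] += 1
print(counts)
```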

03:21.210 --> 03:22.080
So if I go up

03:22.080 --> 03:25.140
you see I went up, I go right, I went right, right,

03:25.140 --> 03:29.790
no, didn't happen, right again, and right, and I'm finished.

03:29.790 --> 03:31.080
But in this implementation,

03:31.080 --> 03:34.290
you have to click again to get out of this final output.

03:34.290 --> 03:37.170
So out of the exit, just click again and you're finished.

03:37.170 --> 03:38.640
That's a terminal state.

03:38.640 --> 03:40.710
So, we can run again, manual.

03:40.710 --> 03:45.710
You can see that if I go right, right, right, left, up.

03:45.720 --> 03:47.520
So here what we saw previously

03:47.520 --> 03:50.070
that the agent wouldn't go straight up, right?

03:50.070 --> 03:51.510
What's the point of going up

03:51.510 --> 03:53.310
if there's a chance of going into the pit.

03:53.310 --> 03:54.600
So let's see, what would the agent do?

03:54.600 --> 03:56.160
It would go left, it'd go west here.

03:56.160 --> 03:59.220
So it'd go west and you see I clicked left, but it went up.

03:59.220 --> 04:00.840
And here I would click right,

04:00.840 --> 04:02.850
and I end up in the final exit state.

04:02.850 --> 04:05.370
And you can see it got reward equal to one.

04:05.370 --> 04:07.170
So that's what it looks like manually.

04:07.170 --> 04:09.720
Now, let's actually hook up an AI

04:09.720 --> 04:12.540
to this and let it go through.

04:12.540 --> 04:16.830
So let's do a -h here and let's add some parameters.

04:16.830 --> 04:19.050
So, let me just see what I typed here.

04:19.050 --> 04:23.173
So hopefully you can see python gridworld.py,

04:23.173 --> 04:27.960
then here, -r means that's the reward for living.

04:27.960 --> 04:30.300
So I've got two of them,

04:30.300 --> 04:32.160
so I probably should remove this one.

04:32.160 --> 04:35.040
So -k is how many iterations?

04:35.040 --> 04:36.690
That's way too many iterations.

04:36.690 --> 04:39.930
Let's do fewer, let's do like 10 iterations,

04:39.930 --> 04:41.160
that should be enough.

04:41.160 --> 04:42.720
-a is the agent.

04:42.720 --> 04:43.867
What type of agent do I want,

04:43.867 --> 04:47.070
do I want some random agent, some value agent, or a Q?

04:47.070 --> 04:51.450
Q, so I want a Q-learning agent doing this.

04:51.450 --> 04:54.420
-s is, what is -s?

04:54.420 --> 04:55.253
Speed.

04:55.253 --> 04:59.040
So that's way too fast, let's just use default speed for now.

04:59.040 --> 05:04.040
-r is the living penalty, so by default it's zero.

05:04.800 --> 05:06.510
So remember at the very start we started

05:06.510 --> 05:07.710
with zero living penalty.

05:07.710 --> 05:10.250
So let's call it also zero here,

05:10.250 --> 05:12.060
we can just remove this parameter.

05:12.060 --> 05:14.910
And -d is, what is -d?

05:14.910 --> 05:16.020
Discount.

05:16.020 --> 05:18.420
So our discount factor, so let's keep it at 0.9.

05:18.420 --> 05:20.460
So very similar to what we were starting

05:20.460 --> 05:21.690
off with in this section of the course,

05:21.690 --> 05:23.340
so let's run that.

05:23.340 --> 05:26.640
Okay, way too fast again, I think.

05:26.640 --> 05:27.690
Oh, actually it's pretty okay.

05:27.690 --> 05:29.390
So you can see how he's exploring.

05:30.570 --> 05:33.450
And so, so far he's hit the negative three times,

05:33.450 --> 05:35.400
and you can see how the Q values have been updated

05:35.400 --> 05:36.660
in these squares.

05:36.660 --> 05:39.270
So these are Q values, they started with zero.

05:39.270 --> 05:40.740
You can see now the Q values,

05:40.740 --> 05:42.390
so he's learned that again,

05:42.390 --> 05:43.800
this one is implemented a bit differently

05:43.800 --> 05:45.150
because once you get to the final state,

05:45.150 --> 05:46.650
you have to get out of it.

05:46.650 --> 05:48.990
You have to just click one more button to exit,

05:48.990 --> 05:51.780
and so it's very close to one, but not exactly one.

05:51.780 --> 05:54.067
But at the same time, you can see that here

05:54.067 --> 05:58.716
you know, the value is slowly kind of crystallizing at 0.8.

05:58.716 --> 06:00.180
It's slowly getting somewhere,

06:00.180 --> 06:01.770
but the rest so far they're kind of zeros

06:01.770 --> 06:03.330
because he doesn't have enough information

06:03.330 --> 06:05.460
to understand what's going on.
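
NOTE
Editor's sketch of how one Q value gets nudged after each move: a minimal tabular Q-learning update.
This is not the Berkeley solution. The learning rate alpha=0.5, the state name "exit" and q_update are assumptions for illustration.
With a learning rate below 1, a value fed a reward of 1 creeps toward 1 without reaching it, which may be part of why the exit shows slightly less than one.
```python
from collections import defaultdict
ACTIONS = ("north", "south", "east", "west")
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9, terminal=False):
    """One tabular Q-learning step: move Q(s, a) toward reward + gamma * max Q(s', a')."""
    best_next = 0.0 if terminal else max(q[(next_state, a)] for a in ACTIONS)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]
q = defaultdict(float)  # every Q value starts at zero, as in the demo
for _ in range(5):      # collect the +1 exit reward five times from a made-up state name
    q_update(q, "exit", "exit", 1.0, None, terminal=True)
print(round(q[("exit", "exit")], 3))
```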

06:05.460 --> 06:08.703
Okay, so let's see what happens here.

06:10.140 --> 06:12.420
Exploring, exploring, exploring.

06:12.420 --> 06:13.770
What's gonna happen?

06:13.770 --> 06:15.600
Whoa, it's been a while.

06:15.600 --> 06:18.090
And don't forget there's some randomness involved here.

06:18.090 --> 06:21.060
So he's hit that good one a few times now.

06:21.060 --> 06:24.153
He only gets 10 iterations, so he's gotta learn fast.

06:25.290 --> 06:29.310
Okay, nearly there, let's see what's going on.

06:29.310 --> 06:31.803
Come on, get out of that maze already.

06:32.850 --> 06:35.730
And yes, 10 episodes.

06:35.730 --> 06:38.910
So, average returns, that doesn't, that's,

06:38.910 --> 06:40.440
we're not really interested in that.

06:40.440 --> 06:42.690
So here, let's see, I've never seen them before.

06:42.690 --> 06:43.830
I click right, there we go,

06:43.830 --> 06:48.000
so you can see this is the policy that he came up with.

06:48.000 --> 06:50.700
Even through just 10 episodes, he's already got a policy.

06:50.700 --> 06:52.650
Okay, I'm gonna go "buh, buh, buh, bum",

06:52.650 --> 06:53.970
and here I'm gonna go down,

06:53.970 --> 06:55.380
here I'm gonna go down,

06:55.380 --> 06:56.790
here I'm gonna go into the wall,

06:56.790 --> 06:58.560
and then I'm gonna bounce over here.

06:58.560 --> 06:59.970
That's pretty cool.

06:59.970 --> 07:02.640
Okay, so now let's increase the speed.

07:02.640 --> 07:04.200
What was the parameter, -s, there?

07:04.200 --> 07:08.400
And let's like double, well, let's quadruple the speed,

07:08.400 --> 07:11.340
and let's increase the number of iterations.

07:11.340 --> 07:13.527
So, let's say 20 iterations this time.

07:13.527 --> 07:16.770
And let's see if he can get through a bit more now.

07:16.770 --> 07:18.720
So you can see he's going a bit faster,

07:19.590 --> 07:22.800
and he's learning, he's learning that it's not really,

07:22.800 --> 07:24.480
you know, out of this state that's,

07:24.480 --> 07:26.310
there's not many good actions

07:26.310 --> 07:28.260
or, you know, these actions, right

07:28.260 --> 07:30.270
and straight are not that good.

07:30.270 --> 07:32.400
This one's definitely not good.

07:32.400 --> 07:33.480
He still needs to learn that.

07:33.480 --> 07:34.437
So from here it's also not good.

07:34.437 --> 07:36.810
You can see that this action's pretty good.

07:36.810 --> 07:38.520
All right, what did he get?

07:38.520 --> 07:40.710
Okay, so interesting policy here.

07:40.710 --> 07:43.260
He decided to go up, just not enough information.

07:43.260 --> 07:45.573
So let's redo that,

07:46.830 --> 07:50.490
and let's increase the speed to like a hundred.

07:50.490 --> 07:52.530
So it's super fast and the number of iterations,

07:52.530 --> 07:54.960
we'll give him a hundred iterations this time.

07:54.960 --> 07:55.920
Let's run that.

07:55.920 --> 07:58.140
See he's like crazy fast.

07:58.140 --> 07:58.973
And you can see that

07:58.973 --> 08:01.200
because there's so many more iterations,

08:01.200 --> 08:04.470
he's got more information, more opportunity to experiment

08:04.470 --> 08:07.440
and to actually build out this matrix,

08:07.440 --> 08:09.090
or not matrix, these Q values

08:09.090 --> 08:10.860
for every single state he now knows.

08:10.860 --> 08:13.230
Okay, you can see that 0.89,

08:13.230 --> 08:16.080
what did we say in our example, it was like 0.86?

08:16.080 --> 08:18.420
Another thing to remember is that the value

08:18.420 --> 08:21.990
of any given state, remember that the formula we had,

08:21.990 --> 08:24.240
is the maximum of the Q values.

08:24.240 --> 08:27.150
Remember that thing that we came up with, shortcut formula.

08:27.150 --> 08:29.520
So what is, what would the value in this state be?

08:29.520 --> 08:30.900
The V of this state?

08:30.900 --> 08:32.070
It would be 0.89

08:32.070 --> 08:34.800
because that's the highest out of the four.

08:34.800 --> 08:37.110
Here, the value of this state is 0.71.

08:37.110 --> 08:40.380
The value of this state is 0.61, and so on.

08:40.380 --> 08:41.430
So that's something to remember.
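
NOTE
Editor's sketch of the shortcut formula mentioned here: the value of a state is the largest of its Q values.
Only 0.89 comes from the demo. The other three numbers, and which action holds 0.89, are made-up placeholders.
state_value and greedy_action are hypothetical helper names for this example.
```python
# Q values for one state; 0.89 is from the demo, the rest are placeholders.
q_values = {"north": 0.62, "south": 0.55, "east": 0.89, "west": 0.70}
def state_value(q_for_state):
    """V(s) = max over actions a of Q(s, a)."""
    return max(q_for_state.values())
def greedy_action(q_for_state):
    """The policy choice for a state: the action with the largest Q value."""
    return max(q_for_state, key=q_for_state.get)
print(state_value(q_values), greedy_action(q_values))
```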

08:41.430 --> 08:42.390
So, and, remember in our example

08:42.390 --> 08:45.750
I think we had like 0.86 or something, so pretty close.

08:45.750 --> 08:48.600
And so, if we go next here,

08:48.600 --> 08:51.690
oh, it just disappeared, where did it disappear?

08:51.690 --> 08:54.513
Let's do it again, and let's make it come back.

08:55.405 --> 08:58.907
Okay, okay, slowly, slowly, slowly filling up some spaces.

09:01.140 --> 09:02.880
I see, and it's also pretty random

09:02.880 --> 09:04.800
because not only the environment has randomness,

09:04.800 --> 09:06.720
but also the way he explores at the start

09:06.720 --> 09:10.173
when he doesn't know the policy, he's exploring at random.

09:11.190 --> 09:13.650
Just keeps disappearing, I don't understand why.

09:13.650 --> 09:14.790
Anyways, so let's see what happens

09:14.790 --> 09:17.760
if we increase the number here and here.

09:17.760 --> 09:20.820
Should pretty much take the same amount of time

09:20.820 --> 09:23.490
if the speed doesn't have a cap on it.

09:23.490 --> 09:24.870
Okay, so you can see it's like,

09:24.870 --> 09:27.600
it has more opportunity to explore things.

09:27.600 --> 09:31.230
Okay, let's see how it all goes.

09:31.230 --> 09:32.580
And you can see the values are converging,

09:32.580 --> 09:34.110
they go up and down depending, you know,

09:34.110 --> 09:36.300
because there's some randomness and he might end up like

09:36.300 --> 09:38.640
in the pit even though he goes like this way.

09:38.640 --> 09:40.860
But at the same time they're slowly starting to converge

09:40.860 --> 09:44.910
to some sort of end values and Q values.

09:44.910 --> 09:47.730
Okay, probably a thousand is a bit too much

09:47.730 --> 09:48.563
in terms of time.

09:48.563 --> 09:50.665
It doesn't look like the speed

09:50.665 --> 09:53.610
is proportionally increasing as well.

09:53.610 --> 09:55.620
So, I might cut that part.

09:55.620 --> 09:57.570
I mean, like, reduce the speed.

09:57.570 --> 09:59.043
Yeah, wow, this is very long.

10:00.030 --> 10:01.933
You don't have to watch through the end of this tutorial.

10:01.933 --> 10:03.420
I just wanna experiment with it quite a bit.

10:03.420 --> 10:05.190
So, to give you some

10:05.190 --> 10:07.440
like examples of what we've been working through.

10:07.440 --> 10:10.920
But you get the point that it goes through all of this,

10:10.920 --> 10:12.660
it has some randomness,

10:12.660 --> 10:14.790
like, randomness built into its behavior.

10:14.790 --> 10:16.740
So even when it has like a policy,

10:16.740 --> 10:18.630
it will still continue exploring.

10:18.630 --> 10:21.480
So it won't just like once it has the basic policy

10:21.480 --> 10:23.430
it won't just continue following its policy.

10:23.430 --> 10:25.950
It will still experiment with other variations once

10:25.950 --> 10:28.830
in a while in order to enhance its policy.

10:28.830 --> 10:31.350
Maybe it hasn't found the best policy right away.

10:31.350 --> 10:33.300
Maybe it can improve the policy,

10:33.300 --> 10:36.300
and that's why even after so many iterations

10:36.300 --> 10:38.820
you can still see some random effects.

10:38.820 --> 10:40.879
It sometimes jumps into random states,

10:40.879 --> 10:43.530
not just because of the randomness in the environment,

10:43.530 --> 10:47.010
but also because there is some level, like a parameter

10:47.010 --> 10:48.960
which you could control, which you could set up

10:48.960 --> 10:52.590
for your agent saying that you know, most of the time,

10:52.590 --> 10:55.500
80% of the time do whatever your policy tells you to do,

10:55.500 --> 10:57.630
but 20% of the time, you know, just have some fun,

10:57.630 --> 10:59.313
experiment, and see what happens.

10:59.313 --> 11:01.620
And, use that information that you gather

11:01.620 --> 11:03.360
to update your policy.
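
NOTE
Editor's sketch of the explore-versus-follow-the-policy rule described here, often called epsilon-greedy.
It assumes the 80/20 split mentioned in the talk (epsilon=0.2); the Q values and the name epsilon_greedy are illustrative only.
This is a minimal illustration, not the agent used in the demo.
```python
import random
def epsilon_greedy(q_for_state, epsilon=0.2, rng=random):
    """With probability epsilon pick a random action, otherwise the best-known one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_for_state))      # explore: "just have some fun"
    return max(q_for_state, key=q_for_state.get)  # exploit: follow the current policy
q_for_state = {"north": 0.1, "south": 0.0, "east": 0.8, "west": 0.2}  # made-up numbers
sampler = random.Random(1)
picks = [epsilon_greedy(q_for_state, rng=sampler) for _ in range(10_000)]
# "east" wins most of the time, but the other actions still get tried now and then.
print(picks.count("east") / len(picks))
```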

11:03.360 --> 11:05.310
Okay, this is taking way too long.

11:05.310 --> 11:06.540
Let's try that again.

11:06.540 --> 11:11.540
Yeah, so that's how the agent learns in different states.

11:11.640 --> 11:14.280
Maybe let's just run one more just out of curiosity.

11:14.280 --> 11:17.510
So, is there anything else we can change about iterations?

11:19.590 --> 11:24.540
Duh, duh, duh, okay, okay, let's have a look.

11:24.540 --> 11:26.700
Yeah, well we could change the discount for example.

11:26.700 --> 11:31.700
So, in this case we could say -k 100,

11:33.600 --> 11:38.040
-a q, -s, and on to -r.

11:38.040 --> 11:39.900
Okay, 2000.

11:39.900 --> 11:43.710
So reward, we want to keep it, maybe let's keep it at 0.04,

11:43.710 --> 11:46.110
but let's say, say it again,

11:46.110 --> 11:49.290
let's keep the reward at minus 0.04 every time.

11:49.290 --> 11:52.740
And then here we're going to say that

11:52.740 --> 11:57.470
D, the discount is not 0.9 but it's like 0.5.

11:59.040 --> 12:01.020
So, it gets discounted quite a lot

12:01.020 --> 12:02.580
as you go through the game.
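
NOTE
Editor's sketch of why a lower discount makes values fall off faster with distance from the exit.
It compares a reward of 1 seen from 0 to 3 moves away under discounts 0.9 and 0.5.
It ignores the living reward and the noise, so the numbers will not match the screen; discounted is a hypothetical helper name.
```python
def discounted(reward, steps, gamma):
    """Value today of a reward received after `steps` further moves."""
    return reward * gamma ** steps
for gamma in (0.9, 0.5):
    row = [round(discounted(1.0, n, gamma), 3) for n in range(4)]
    print(gamma, row)
```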

12:02.580 --> 12:06.300
So it actually, now it'll be incentivized to be closer

12:06.300 --> 12:08.130
to the finish rather than further.

12:08.130 --> 12:10.440
Well, the states closer to the finish will get higher value.

12:10.440 --> 12:12.870
So you can see that the values quickly drop off.

12:12.870 --> 12:15.393
It's not as green as it was before.

12:16.350 --> 12:20.370
So here you can see that this is the policy now.

12:20.370 --> 12:23.280
So it goes like that, like that, like that, like that.

12:23.280 --> 12:25.170
Very similar to what we saw before.

12:25.170 --> 12:26.520
Just probably the only difference

12:26.520 --> 12:28.830
from here is jumping straight into here.

12:28.830 --> 12:29.970
So that's that one.

12:29.970 --> 12:33.570
And okay, let's just run one more, this is so much fun.

12:33.570 --> 12:34.500
Let's just run one more.

12:34.500 --> 12:39.090
So, -k 100, -a q, discount,

12:39.090 --> 12:41.790
keep it as it was, original.

12:41.790 --> 12:45.783
So like just run this basic vanilla setup.

12:46.650 --> 12:49.590
Okay, okay, okay, it's going,

12:49.590 --> 12:52.140
let's see if it will show us the policy at the end.

12:53.460 --> 12:54.870
Yes, we got the policy.

12:54.870 --> 12:56.340
Yes, good finish.

12:56.340 --> 12:58.890
So here we've got the policy.

12:58.890 --> 12:59.850
You know, this is familiar.

12:59.850 --> 13:02.250
Remember that time when we saw that the AI outsmarted

13:02.250 --> 13:05.310
the human, went boom into the wall to go there and boom

13:05.310 --> 13:08.580
into the wall to go like that, to increase the probability.

13:08.580 --> 13:09.413
So there we go.

13:09.413 --> 13:13.860
That is an example of Artificial Intelligence in action.

13:13.860 --> 13:16.290
Very, very basic, simple Q-learning.

13:16.290 --> 13:18.630
So no deep learning at this stage,

13:18.630 --> 13:22.080
but at the same time it's already pretty smart.

13:22.080 --> 13:23.640
And I hope you enjoyed today's tutorial,

13:23.640 --> 13:26.700
and once again thank you to UC Berkeley,

13:26.700 --> 13:28.290
and I hope you enjoyed today's tutorial,

13:28.290 --> 13:29.610
and I look forward to seeing you next time.

13:29.610 --> 13:31.323
Until then, enjoy AI.