WEBVTT

00:00.990 --> 00:02.160
Instructor: Hello and welcome back

00:02.160 --> 00:04.980
to the course on artificial intelligence.

00:04.980 --> 00:08.670
Previously we had quite a strenuous and long tutorial

00:08.670 --> 00:10.230
on Markov decision processes

00:10.230 --> 00:13.750
and hopefully you got along well with that

00:14.622 --> 00:16.230
and hopefully I could explain things

00:16.230 --> 00:19.110
in an approachable and engaging way.

00:19.110 --> 00:22.740
And today we're going to talk about policies versus plans.

00:22.740 --> 00:25.620
It's gonna be a quick and fun tutorial

00:25.620 --> 00:28.101
because now we're entering into a new world.

00:28.101 --> 00:30.660
We're entering into a world of stochastic search,

00:30.660 --> 00:32.040
not deterministic search.

00:32.040 --> 00:34.170
It's not just about getting through the maze

00:34.170 --> 00:35.735
but also accounting

00:35.735 --> 00:37.820
for random factors that might just hit you

00:37.820 --> 00:39.390
in the head when you're going through this maze.

00:39.390 --> 00:41.130
And you need to be prepared for them.

00:41.130 --> 00:44.520
That's the world our agent is living in.

00:44.520 --> 00:46.830
And it's more fun, but it's also more dangerous.

00:46.830 --> 00:48.660
It's less predictable.

00:48.660 --> 00:50.970
So how is our agent going to behave?

00:50.970 --> 00:52.230
Let's have a look.

00:52.230 --> 00:55.080
There's our Markov decision process framework

00:55.080 --> 00:58.230
which is once again our favorite Bellman equation.

00:58.230 --> 01:01.410
However, the more advanced version of the Bellman equation

01:01.410 --> 01:02.243
we were working with.

01:02.243 --> 01:03.541
So from now on

01:03.541 --> 01:04.770
we're just gonna call this the Bellman equation.

01:04.770 --> 01:07.920
And here we've got our maximum across all actions.

01:07.920 --> 01:11.010
So the value of any state S is the maximum

01:11.010 --> 01:13.200
across all actions that an agent can possibly perform

01:13.200 --> 01:14.100
in that state.

01:14.100 --> 01:17.610
And the maximum is taken over the reward

01:17.610 --> 01:20.040
that the agent will get by performing action A

01:20.040 --> 01:22.620
in state S plus a discount factor multiplied

01:22.620 --> 01:26.820
by the expected value of the new state it will be in.

01:26.820 --> 01:29.550
And the expected value is taken here because

01:29.550 --> 01:31.890
it doesn't know exactly what state it will end up in.

01:31.890 --> 01:35.940
There are some random effects that are present

01:35.940 --> 01:40.020
in the environment that might alter the state

01:40.020 --> 01:42.630
and you might not end up in the desired state.

01:42.630 --> 01:44.190
It might end up in a different state.

01:44.190 --> 01:46.380
That's why we're taking the expected value over here.

01:46.380 --> 01:48.000
This sum over here.

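NOTE
The equation on the slide is not captured in the audio. A reconstruction, assuming the usual notation (R for the reward, \gamma for the discount factor, P for the transition probability, s' for the next state):
```latex
V(s) = \max_{a} \Big( R(s, a) + \gamma \sum_{s'} P(s, a, s') \, V(s') \Big)
```
The sum over s' is the expected value of the next state that the instructor points to.
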
01:48.000 --> 01:50.400
So let's have a look at this

01:50.400 --> 01:53.760
as our example or in our example of the maze.

01:53.760 --> 01:56.730
So this is what we had previously.

01:56.730 --> 01:58.950
So previously we were dealing with

01:58.950 --> 02:00.240
deterministic search.

02:00.240 --> 02:03.690
So we knew that. All right, so if I'm here

02:03.690 --> 02:04.830
I definitely need to go here.

02:04.830 --> 02:06.630
If I'm here, I definitely need to go here.

02:06.630 --> 02:09.239
If I'm here, I definitely need to go here.

02:09.239 --> 02:10.072
If I'm here, I'm here.

02:10.072 --> 02:12.600
So it was all pretty straightforward once you have this map

02:12.600 --> 02:14.135
and remember we called it a,

02:14.135 --> 02:14.968
we called it a plan.

02:14.968 --> 02:16.830
Once you have the plan, it's pretty straightforward,

02:16.830 --> 02:18.030
you know what you need to do.

02:18.030 --> 02:19.050
There are our arrows.

02:19.050 --> 02:20.490
So that's the plan with arrows.

02:20.490 --> 02:22.110
And from here it was very straightforward.

02:22.110 --> 02:24.000
These are the routes

02:24.000 --> 02:24.833
that the agent would take

02:24.833 --> 02:26.400
wherever you start on this blue line.

02:26.400 --> 02:28.680
That's exactly the way you would go.

02:28.680 --> 02:31.110
However, now we don't have a plan anymore.

02:31.110 --> 02:33.330
We can't have a plan because you know,

02:33.330 --> 02:36.570
whatever we plan might not happen.

02:36.570 --> 02:37.650
It's not under our control.

02:37.650 --> 02:40.950
A plan is when you know exactly what you need to do next.

02:40.950 --> 02:41.850
You know the steps.

02:41.850 --> 02:44.550
So you have a starting point, you have a goal

02:44.550 --> 02:45.930
and you know every single step.

02:45.930 --> 02:47.040
So you can plan them out.

02:47.040 --> 02:49.244
You're like, I'll do this one, I'll do this one

02:49.244 --> 02:50.077
I'll do this one.

02:50.077 --> 02:50.910
Like your life, like a plan.

02:50.910 --> 02:51.960
But at the same time,

02:51.960 --> 02:54.870
there's so much now randomness going on,

02:54.870 --> 02:55.920
you can't have a plan

02:55.920 --> 02:58.710
because what if you get here and then you click to

02:58.710 --> 03:00.660
the right and it actually takes you down.

03:00.660 --> 03:02.370
So that's not part of your plan.

03:02.370 --> 03:04.170
So that's why it's not called a plan anymore.

03:04.170 --> 03:06.510
And here we're going to calculate the values

03:06.510 --> 03:07.740
or we're actually going to just look at

03:07.740 --> 03:12.060
the calculated values for this same problem,

03:12.060 --> 03:14.790
but based on,

03:14.790 --> 03:16.710
or given that we have this randomness inside.

03:16.710 --> 03:18.780
So these are the new values.

03:18.780 --> 03:21.150
And so why are these values different?

03:21.150 --> 03:22.860
So let's just compare to what we had previously.

03:22.860 --> 03:24.720
This is what we had previously.

03:24.720 --> 03:25.710
These are the new values.

03:25.710 --> 03:28.957
So once again, we had previously, you can see 1, 0.9

03:29.826 --> 03:33.870
0.81, 0.73, 0.66, and this is what we have now: 0.86.

03:33.870 --> 03:34.800
So less than 1,

03:34.800 --> 03:36.898
0.74, 0.71, 0.63.

03:36.898 --> 03:39.486
And so, and by the way, these are not exactly

03:39.486 --> 03:41.730
the correct values; they're off the top of my head.

03:41.730 --> 03:44.700
But if we were to run an agent,

03:44.700 --> 03:46.950
values would be something similar to this.

03:46.950 --> 03:49.230
And the values could change depending

03:49.230 --> 03:51.960
on the gamma that we choose, 0.9 or other value.

03:51.960 --> 03:54.360
But nevertheless, for argument's sake,

03:54.360 --> 03:56.610
these are the values that we're dealing with now.

03:56.610 --> 03:57.840
And they're approximate,

03:57.840 --> 04:01.020
they convey the whole notion in the correct way.

04:01.020 --> 04:02.310
So let's have a look at them.

04:02.310 --> 04:03.420
Why have they changed?

04:03.420 --> 04:06.183
Well, let's start with this one here.

04:07.493 --> 04:09.496
The value was one, why is it all of a sudden

04:09.496 --> 04:11.301
0.86, why is it less than one?

04:11.301 --> 04:12.134
Can't we just go from here?

04:12.134 --> 04:15.149
Well, we actually can't, because from here

04:15.149 --> 04:17.820
if we go right, which is our intention

04:17.820 --> 04:19.860
if we go right, we would actually

04:19.860 --> 04:22.350
with a 10% chance, we'd end up here.

04:22.350 --> 04:25.110
So we'd hit the wall and we'd be back in this state.

04:25.110 --> 04:26.310
And remember, we have a gamma

04:26.310 --> 04:28.470
so the value would be discounted.

04:28.470 --> 04:30.470
Or, with a 10% chance,

04:30.470 --> 04:32.190
we would end up here in this state.

04:32.190 --> 04:34.950
So it's not a hundred percent probability I would get here.

04:34.950 --> 04:37.440
So therefore this value can no longer be a one.

04:37.440 --> 04:38.310
It's something less.

04:38.310 --> 04:41.580
And it's, let's say is 0.86.

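NOTE
A minimal sketch of why the value drops below one once moves can slip. Every number here is an assumption for illustration (an 80/10/10 slip model, a discount factor of 0.9, made-up values for the neighbouring cells); none of it is read off the slide.
```python
GAMMA = 0.9  # assumed discount factor
def action_value(outcomes, gamma=GAMMA):
    """Expected reward plus discounted expected value of the next state.
    Each outcome is a (probability, reward, value_of_next_state) triple."""
    return sum(p * (reward + gamma * value) for p, reward, value in outcomes)
# Deterministic world: going right always reaches the goal (reward 1, game over).
certain = action_value([(1.0, 1.0, 0.0)])
# Stochastic world: 80% reach the goal, 10% bump into the wall and stay put
# (hypothetical value 0.86), 10% slip down into the cell below (hypothetical 0.66).
slippery = action_value([(0.8, 1.0, 0.0), (0.1, 0.0, 0.86), (0.1, 0.0, 0.66)])
print(certain, round(slippery, 4))
```
The slippery case comes out below the deterministic one because 20% of the probability mass lands on cells whose value is discounted and smaller than the goal reward.
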
04:41.580 --> 04:43.707
So that's an example of why it's like this.

04:43.707 --> 04:46.140
And you could get the exact value

04:46.140 --> 04:48.150
if you calculated the Bellman equation,

04:48.150 --> 04:50.255
the full Bellman equation that we have.

04:50.255 --> 04:51.300
Now, the only problem is that

04:51.300 --> 04:52.320
there'd be some recursion

04:52.320 --> 04:54.655
because you would need to know the value

04:54.655 --> 04:55.950
for this and then you would need to know the value for this.

04:55.950 --> 04:56.850
It's quite complex

04:56.850 --> 04:57.683
and that's why we're not doing

04:57.683 --> 04:59.190
the calculations manually here.

04:59.190 --> 05:01.140
But the AI can do them

05:01.140 --> 05:03.060
as it's going through all this.

05:03.060 --> 05:06.000
It's like nothing too complex

05:06.000 --> 05:08.520
for the AI to calculate these things.

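NOTE
One common way to resolve that recursion is value iteration: start from zeros and keep re-applying the Bellman equation until the numbers settle. The sketch below is not the course's code. The 4x3 grid layout, the rewards, the 80/10/10 slip model and gamma = 0.9 are all assumptions, so the resulting numbers will not match the slide.
```python
GAMMA = 0.9                                    # assumed discount factor
WIDTH, HEIGHT = 4, 3                           # assumed grid size, cells are (column, row)
WALLS = {(1, 1)}                               # assumed blocked cell
TERMINAL_REWARD = {(3, 2): 1.0, (3, 1): -1.0}  # assumed goal and fire pit
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
SIDES = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}
STATES = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)
          if (x, y) not in WALLS and (x, y) not in TERMINAL_REWARD]
def step(cell, direction):
    """Move one cell; walls and grid edges leave the agent where it is."""
    dx, dy = MOVES[direction]
    nxt = (cell[0] + dx, cell[1] + dy)
    inside = 0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT
    return nxt if inside and nxt not in WALLS else cell
def action_value(values, cell, action):
    """Expected reward plus discounted expected value of the next state."""
    total = 0.0
    for direction, p in [(action, 0.8), (SIDES[action][0], 0.1), (SIDES[action][1], 0.1)]:
        nxt = step(cell, direction)
        reward = TERMINAL_REWARD.get(nxt, 0.0)  # reward only for entering goal or pit
        total += p * (reward + GAMMA * values.get(nxt, 0.0))  # terminal cells count as 0
    return total
def value_iteration(sweeps=200):
    """Re-apply the Bellman equation to every state for a fixed number of sweeps."""
    values = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        values = {s: max(action_value(values, s, a) for a in MOVES) for s in STATES}
    return values
values = value_iteration()
print(round(values[(2, 2)], 2), round(values[(2, 1)], 2))
```
Under these assumptions the cell next to the goal ends up below one, and the cell next to the pit ends up lower still, which is the qualitative pattern described in the lecture.
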
05:08.520 --> 05:10.110
So that's our value here.

05:10.110 --> 05:12.370
But now, let's take a look at a different one.

05:12.370 --> 05:13.500
So here it used to be 0.9 just

05:13.500 --> 05:15.792
because of the discounting factor.

05:15.792 --> 05:17.946
Remember from here to here again, now from here

05:17.946 --> 05:19.050
we can't just jump from here to here,

05:19.050 --> 05:22.470
simply because even if we jump, if we go like this,

05:22.470 --> 05:24.930
we might end up back here, back here, right?

05:24.930 --> 05:26.760
There's 20% chance that we'll still stay

05:26.760 --> 05:29.730
in the square because we'll hit a wall and again and so on.

05:29.730 --> 05:32.820
So the value of being here is 0.71.

05:32.820 --> 05:35.370
Again, this is the discounting factor.

05:35.370 --> 05:38.040
You know, this might look odd to you that this is even

05:38.040 --> 05:40.020
with the discounting factor, this is too high.

05:40.020 --> 05:42.438
Maybe the discounting factor in this example is not 0.9

05:42.438 --> 05:44.640
maybe it's 0.99 or something like that.

05:44.640 --> 05:46.350
So don't worry about that.

05:46.350 --> 05:47.294
Just kind of like focus on

05:47.294 --> 05:50.177
that the values have indeed changed

05:50.177 --> 05:53.460
that the values are now less

05:53.460 --> 05:56.940
mostly because it's not a hundred percent probability

05:56.940 --> 05:59.220
to get to the state that you want to get.

05:59.220 --> 06:01.560
And one you will find interesting is here,

06:01.560 --> 06:03.420
that here it used to be 0.9

06:03.420 --> 06:05.310
and it actually has dropped very much.

06:05.310 --> 06:06.630
It's dropped substantially.

06:06.630 --> 06:07.463
Why is that?

06:07.463 --> 06:10.800
Well, because if you go from here up, which is our intention

06:10.800 --> 06:12.360
there's a 10% chance of hitting the wall

06:12.360 --> 06:14.970
but there's a 10% chance of actually ending up

06:14.970 --> 06:18.690
in the fire pit and getting a minus one reward.

06:18.690 --> 06:21.818
And basically that means for the agent,

06:21.818 --> 06:23.130
that's end of the game.

06:23.130 --> 06:25.650
And so this is a very bad state to be in.

06:25.650 --> 06:26.670
So all of a sudden,

06:26.670 --> 06:29.692
remember we had 0.9 here, zero point (indistinct).

06:29.692 --> 06:30.747
So they were equivalent.

06:30.747 --> 06:31.653
It doesn't matter whether you're here

06:31.653 --> 06:33.184
or here they're pretty much equal

06:33.184 --> 06:34.980
in terms of value of being in each of these states.

06:34.980 --> 06:37.080
But now all of a sudden, bam.

06:37.080 --> 06:41.550
This state is like nearly twice as good as this one.

06:41.550 --> 06:45.027
Simply just because here, if you go straight to

06:45.027 --> 06:46.980
you go right where you want to go.

06:46.980 --> 06:48.180
The, you know, the consequences

06:48.180 --> 06:51.210
of the randomness occurring is you just stay here.

06:51.210 --> 06:52.710
Here, one of the consequences,

06:52.710 --> 06:55.080
the 10% chance is you end up in the pit.

06:55.080 --> 06:56.460
So as you can see,

06:56.460 --> 06:59.670
this is no longer such a good state anymore.

06:59.670 --> 07:02.190
Simply because of a fluctuation

07:02.190 --> 07:03.540
that could happen.

07:03.540 --> 07:05.370
As you can see, this one is also very bad

07:05.370 --> 07:08.850
because it's as bad as this one in terms of, you know,

07:08.850 --> 07:10.407
it's only 10% chance of ending up in the pit

07:10.407 --> 07:12.690
and 10% chance of ending up in the wall.

07:12.690 --> 07:15.030
But at the same time, there's the discounting factor.

07:15.030 --> 07:16.800
So first of all, the discounting factor.

07:16.800 --> 07:20.670
And also after this one, you'd have to go here

07:20.670 --> 07:22.560
and even if you hypothetically went here,

07:22.560 --> 07:24.686
you could end up in the pit again.

07:24.686 --> 07:26.670
So that chance would also be taken into account

07:26.670 --> 07:28.740
because remember this value is derived from this value

07:28.740 --> 07:32.370
and this value is derived from this value, right?

07:32.370 --> 07:34.080
And therefore it's small.

07:34.080 --> 07:37.530
But in reality, actually that what I said there was wrong.

07:37.530 --> 07:39.780
This value is not derived from this value.

07:39.780 --> 07:41.490
So if you just have a look now,

07:41.490 --> 07:45.900
you will notice that this value via here is actually

07:45.900 --> 07:47.610
greater than this one.

07:47.610 --> 07:50.520
You will notice that for the agent

07:50.520 --> 07:53.148
it's better to go around this way than this way.

07:53.148 --> 07:54.810
And it makes sense, right?

07:54.810 --> 07:57.180
Because this way it doesn't lose,

07:57.180 --> 07:58.590
there's no chance of getting in the pit.

07:58.590 --> 08:00.000
Yes it is a bit longer,

08:00.000 --> 08:03.510
and therefore the discounting factor has a bigger effect.

08:03.510 --> 08:04.560
But at the same time,

08:04.560 --> 08:06.660
simply because there's a chance of getting in the pit here,

08:06.660 --> 08:08.070
if it goes straight, it will

08:08.070 --> 08:09.240
there's a chance of it jumping into the pit.

08:09.240 --> 08:11.400
So it'll take, it'd rather take its time

08:11.400 --> 08:12.840
and it'll just like go around

08:12.840 --> 08:13.980
because that way

08:13.980 --> 08:15.720
there's a much lesser chance of it getting into the pit.

08:15.720 --> 08:17.186
There still is.

08:17.186 --> 08:19.560
So if from here it goes there, from here, it goes there

08:19.560 --> 08:22.051
it could potentially get into the pit

08:22.051 --> 08:23.715
because it could end up there

08:23.715 --> 08:25.413
and then it could end up in the pit,

08:25.413 --> 08:27.036
but nevertheless, it's a lesser chance.

08:27.036 --> 08:28.301
So it would just go around like that.

08:28.301 --> 08:30.240
So very interesting to see how they all changed.

08:30.240 --> 08:32.870
Remember previously from here you would go like that.

08:32.870 --> 08:33.871
From here you'd go like that

08:33.871 --> 08:34.980
and from here you go like that.

08:34.980 --> 08:36.870
And now all of a sudden you can see it's changed.

08:36.870 --> 08:39.540
So let's draw the arrows and see what it looks like now.

08:39.540 --> 08:41.010
And overall, yeah

08:41.010 --> 08:43.770
you see even more random thing, right?

08:43.770 --> 08:46.530
So yes, this is true, but look at what happened here.

08:46.530 --> 08:49.050
Look at this one, look at this one.

08:49.050 --> 08:50.490
Were you expecting that?

08:50.490 --> 08:51.810
That's something definitely like

08:51.810 --> 08:54.445
when I saw this for the first time, I was very impressed.

08:54.445 --> 08:55.922
I was not,

08:55.922 --> 08:59.970
I was surprised and I was not expecting this at all.

08:59.970 --> 09:02.670
And this is, this is an example of, you know

09:02.670 --> 09:05.100
when AI can outsmart a human.

09:05.100 --> 09:07.620
It's like something you can't even, you could

09:07.620 --> 09:09.360
I can't even predict, but the AI

09:09.360 --> 09:10.470
through reinforcement learning,

09:10.470 --> 09:11.790
remember that example of the dogs

09:11.790 --> 09:13.710
that can actually sometimes work better

09:14.782 --> 09:17.070
than normal real life dogs or pre-programmed robot dogs

09:17.070 --> 09:18.420
or can play soccer,

09:18.420 --> 09:20.832
simply because they come up with these ideas

09:20.832 --> 09:22.470
that even we can't see.

09:22.470 --> 09:23.820
And so that's a great example, right?

09:23.820 --> 09:25.590
So you probably weren't expecting that as well.

09:25.590 --> 09:28.295
That the agent instead of going up is like,

09:28.295 --> 09:31.260
why would I, like, if I go up

09:31.260 --> 09:33.150
then there's a 10% chance I'll jump into the pit

09:33.150 --> 09:35.250
but what does it achieve by going into the wall?

09:35.250 --> 09:38.520
Well, 80% of the time it'll bump back and stay in this state

09:38.520 --> 09:40.290
but 10% of the time it'll go here

09:40.290 --> 09:42.330
and 10% of the time it'll go here.

09:42.330 --> 09:46.650
So all of a sudden you can see that now it's actually

09:46.650 --> 09:49.110
in this new approach of jumping into the wall

09:49.110 --> 09:51.480
there is a 0% chance it will go into

09:51.480 --> 09:53.100
the fire pit from this spot.

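NOTE
The argument above can be checked by listing where each intended move can actually land. This sketch assumes an 80/10/10 slip model (80% intended direction, 10% to each side) and a made-up 4x3 layout with a wall to the left of the agent and the fire pit to its right; the coordinates are illustrative, not taken from the slide.
```python
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
SIDES = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}
WIDTH, HEIGHT = 4, 3     # assumed grid size, cells are (column, row)
WALLS = {(1, 1)}         # assumed blocked cell
PIT = (3, 1)             # assumed fire pit
NEXT_TO_PIT = (2, 1)     # the cell squeezed between the wall and the pit
def step(cell, direction):
    """Move one cell; bumping into a wall or the edge leaves the agent in place."""
    dx, dy = MOVES[direction]
    nxt = (cell[0] + dx, cell[1] + dy)
    inside = 0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT
    return nxt if inside and nxt not in WALLS else cell
def outcomes(cell, action):
    """Probability of each resulting cell when the agent intends `action`."""
    dist = {}
    for direction, p in [(action, 0.8), (SIDES[action][0], 0.1), (SIDES[action][1], 0.1)]:
        nxt = step(cell, direction)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist
for action in ("up", "left"):
    print(action, "pit probability:", outcomes(NEXT_TO_PIT, action).get(PIT, 0.0))
```
Heading into the wall keeps the agent in place most of the time, but none of its three possible outcomes is the pit, whereas heading up leaves a 10% chance of slipping sideways into it.
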
09:53.100 --> 09:53.933
So, and it's like

09:53.933 --> 09:55.620
it really doesn't want to go into the fire pit.

09:55.620 --> 09:57.360
So it rather bounces into the wall

09:57.360 --> 10:01.080
a couple of times and then it will go either right or left

10:01.080 --> 10:03.090
at some point because that randomness is gonna happen.

10:03.090 --> 10:05.730
And so it learned that through experimentation

10:05.730 --> 10:08.580
it learned that, okay, when I go forward

10:08.580 --> 10:11.490
the results are not as good as when I go to the wall.

10:11.490 --> 10:14.430
And if you think about it, it's like this, this robot

10:14.430 --> 10:16.672
if you think about it, like there's a fire pit

10:16.672 --> 10:18.810
this square is

10:19.743 --> 10:20.743
like a very tiny ledge

10:20.743 --> 10:22.464
and then this is like a mountain like a cliff,

10:22.464 --> 10:23.940
and this robot is just hugging the cliff

10:23.940 --> 10:26.220
and just, you know, like waiting

10:26.220 --> 10:28.890
until it like pushes it right or left.

10:28.890 --> 10:31.170
Because, well, as a human, you'd probably do the same.

10:31.170 --> 10:33.030
You wouldn't be standing facing that way

10:33.911 --> 10:36.100
that way, you'd be hugging the cliff, right?

10:36.100 --> 10:36.933
Or something like that.

10:36.933 --> 10:38.544
And hopefully you never need to end up in a

10:38.544 --> 10:39.780
you never end up in a situation like that.

10:39.780 --> 10:42.150
But like visually, just visually

10:42.150 --> 10:44.732
if you think about it, same thing here.

10:44.732 --> 10:46.470
And so that's pretty intense, right?

10:46.470 --> 10:48.480
So that, that the AI came up with this idea.

10:48.480 --> 10:50.910
And same here, that instead of going left

10:50.910 --> 10:51.930
and risking getting into the fire,

10:51.930 --> 10:53.640
I'm just gonna try to bounce off the wall.

10:53.640 --> 10:56.130
Like, you know, hug the wall, try to jump into the wall.

10:56.130 --> 10:58.500
And at some point I know that, you know

10:58.500 --> 11:00.120
just there's a probability

11:00.120 --> 11:02.040
there's a 10% chance every time I do that I'll go here

11:02.040 --> 11:03.780
and sometimes it'll happen and I'll end up here

11:03.780 --> 11:06.810
and I'll be safe and then I'll just keep going like that.

11:06.810 --> 11:10.860
So very, very interesting approach that the AI took here.

11:10.860 --> 11:13.080
And as you can see the, the routes are like this.

11:13.080 --> 11:15.120
So from here it might go right and then it'll go right

11:15.120 --> 11:17.670
to the exit or here it'll go left like that.

11:17.670 --> 11:21.150
And here it will at some point it will go left

11:21.150 --> 11:22.290
and it'll go like that again,

11:22.290 --> 11:24.120
this is important to understand: it's a policy, not a plan.

11:24.120 --> 11:28.230
So even when it jumps from here, it will go here maybe.

11:28.230 --> 11:29.550
And then from here it might actually,

11:29.550 --> 11:30.420
instead of going straight,

11:30.420 --> 11:32.490
it might actually go back to the right

11:32.490 --> 11:33.690
and then from here it might go to the left,

11:33.690 --> 11:34.530
it might go to the right.

11:34.530 --> 11:36.450
So there's lots of different options for it to go.

11:36.450 --> 11:38.910
So it might not follow exactly the desired route; it might go the other way.

11:38.910 --> 11:40.860
This is just the desired routes

11:40.860 --> 11:42.600
that it's designed for itself.

11:42.600 --> 11:44.700
But the way it'll work out could actually be different.

11:44.700 --> 11:46.320
It depends on the real world.

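NOTE
A small sketch of the distinction, with made-up cell names and actions: a plan is a fixed sequence of actions indexed by step number, while a policy is a lookup from the current state to an action, so it still has an answer after the world pushes the agent off course.
```python
# Hypothetical cells A, B, C, D; none of these names come from the lecture.
plan = ["right", "right", "up"]                              # step number to action
policy = {"A": "right", "B": "right", "C": "up", "D": "up"}  # state to action
def plan_action(step_index):
    """A plan ignores where the agent actually is."""
    return plan[step_index]
def policy_action(state):
    """A policy only looks at the current state."""
    return policy[state]
# Suppose the agent meant to move from A to B but slipped into D instead.
actual_state, step_index = "D", 1
print("plan says:", plan_action(step_index))
print("policy says:", policy_action(actual_state))
```
After the slip, the plan keeps issuing the action that was meant for B, while the policy returns the action chosen for D.
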
11:46.320 --> 11:47.934
So there we go.

11:47.934 --> 11:50.100
That's the world of artificial intelligence.

11:50.100 --> 11:52.710
That's what a policy versus a plan is.

11:52.710 --> 11:54.697
And hopefully you're slowly

11:54.697 --> 11:58.230
getting excited by what the AI can do,

11:58.230 --> 12:01.320
especially given what we saw over here.

12:01.320 --> 12:05.850
These are some very virtuoso-type decisions

12:05.850 --> 12:07.560
that the AI is coming up with.

12:07.560 --> 12:10.110
And as you can see when you apply AI,

12:10.110 --> 12:12.030
even from this small example, you can see that

12:12.030 --> 12:14.370
even when you apply AI in the real world

12:14.370 --> 12:17.010
maybe you'll come up with ideas and decisions

12:17.010 --> 12:19.290
that even sometimes humans can't come up with.

12:19.290 --> 12:21.390
And that's exactly kind of like what happened

12:21.390 --> 12:25.500
in those games where the Google AlphaGo was playing

12:25.500 --> 12:30.500
versus Lee Sedol champion of Go in Korea back in, you know

12:31.200 --> 12:32.370
the world champion of Go.

12:32.370 --> 12:35.700
And they were playing in Korea back, back in 2016

12:35.700 --> 12:36.990
I think it was March, 2016,

12:36.990 --> 12:39.660
it came up with some moves that humans had never played

12:39.660 --> 12:42.360
in 3000 years or humans were not used to playing.

12:42.360 --> 12:45.720
And this is exactly an example of that.

12:45.720 --> 12:47.880
So once again, hope you are getting excited

12:47.880 --> 12:49.080
and pumped about this course

12:49.080 --> 12:51.188
and about what we're going to create

12:51.188 --> 12:52.710
and I look forward to seeing you next time.

12:52.710 --> 12:54.813
Until then, enjoy AI.
