WEBVTT

00:01.080 --> 00:02.610
Hello and welcome back to the course

00:02.610 --> 00:04.260
on artificial intelligence.

00:04.260 --> 00:07.590
Today we're talking about the living penalty.

00:07.590 --> 00:09.750
All right, so here we've got our Bellman equation

00:09.750 --> 00:13.110
and as we've been going through this course

00:13.110 --> 00:15.990
we've been slowly making it more and more complex.

00:15.990 --> 00:19.260
So far we've already added these probabilities

00:19.260 --> 00:22.890
in here and also we've added the discounting factor.

00:22.890 --> 00:24.900
Now we're going to look in more detail

00:24.900 --> 00:28.200
at this side of the equation where we have the reward.
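
NOTE
A sketch of the equation being referred to, in a common textbook form. The lecture shows it only on a slide, so the notation here is an assumption: s is the current state, a an action, s' a possible next state, P(s, a, s') the transition probability, gamma the discounting factor, and R(s, a) the reward term this lesson focuses on.
```latex
V(s) = \max_{a} \Big( R(s, a) + \gamma \sum_{s'} P(s, a, s') \, V(s') \Big)
```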

00:28.200 --> 00:30.300
Now remember previously

00:30.300 --> 00:33.240
when we talked about how reinforcement learning works

00:33.240 --> 00:34.440
we said we have an agent

00:34.440 --> 00:37.170
and it performs actions in the environment and

00:37.170 --> 00:41.310
in exchange or as a result of that, it gets a new state

00:41.310 --> 00:45.630
in which it's now in and a reward for that action.

00:45.630 --> 00:47.610
Well, so far in our example

00:47.610 --> 00:50.250
we've only been getting rewards at the very end.

00:50.250 --> 00:52.560
Either if we get to the finish line

00:52.560 --> 00:55.680
or if the agent ends up in the fire pit,

00:55.680 --> 00:58.950
he gets a plus one or a minus one reward.

00:58.950 --> 01:01.500
But that is a very simplistic approach

01:01.500 --> 01:02.700
to reinforcement learning.

01:02.700 --> 01:05.280
And in more realistic scenarios

01:05.280 --> 01:08.730
you will likely have rewards throughout the journey

01:08.730 --> 01:09.900
not just at the very end.

01:09.900 --> 01:11.400
You might have rewards throughout the journey.

01:11.400 --> 01:14.700
For instance, if it's an AI playing a game

01:14.700 --> 01:19.700
and if for example, it's like shooting somebody in Doom

01:20.160 --> 01:23.550
it might get points for killing that enemy.

01:23.550 --> 01:26.430
Or it might be, in a different game,

01:26.430 --> 01:30.210
if it overtakes another car or something like that.

01:30.210 --> 01:31.800
Just because of the rules of the game

01:31.800 --> 01:35.430
not because of its way of analyzing the game

01:35.430 --> 01:36.990
but actually the game is structured

01:36.990 --> 01:40.320
in a way that it's reinforcing, it's giving points

01:40.320 --> 01:43.530
for doing certain actions even before the game is over.

01:43.530 --> 01:45.990
So scenarios like that are very common

01:45.990 --> 01:48.600
not just in games but also in real life.

01:48.600 --> 01:51.450
And that's why we're going to introduce something similar

01:51.450 --> 01:54.390
into our example, a simplified version of that

01:54.390 --> 01:57.990
but nevertheless, a reward that is continuously given

01:57.990 --> 02:00.541
to the agent throughout the game, not just at the end.

02:00.541 --> 02:02.141
And the way we're going to do it

02:03.121 --> 02:04.440
is by looking at the other tiles.

02:04.440 --> 02:07.620
So right now we only have reward plus one at the final tile

02:07.620 --> 02:11.820
and reward minus one at the other final tile, the fire pit.

02:11.820 --> 02:14.160
But now we're going to add rewards in every single tile.

02:14.160 --> 02:17.790
We'll add a very small reward, it'll be minus 0.04,

02:17.790 --> 02:18.930
and as you can see, it's negative.

02:18.930 --> 02:23.190
So every time the agent moves, he'll get a negative reward.

02:23.190 --> 02:24.840
And that's why it's called a living penalty.

02:24.840 --> 02:27.300
Because no matter where he goes, he will always

02:27.300 --> 02:29.610
get this negative reward except for these final tiles

02:29.610 --> 02:31.290
because that's the end of the game.

02:31.290 --> 02:33.240
And so here you can see the reward

02:33.240 --> 02:35.190
even on this tile is minus 0.04.

02:35.190 --> 02:37.950
But that doesn't mean that he starts with that reward.

02:37.950 --> 02:41.040
He only gets this reward, and this is important to

02:41.040 --> 02:43.770
remember: he only gets this reward when he enters a tile.

02:43.770 --> 02:46.500
So whenever he performs an action, he goes here

02:46.500 --> 02:49.860
then he will get this reward minus 0.04,

02:49.860 --> 02:51.870
and then if he comes back to this tile, he'll get another

02:51.870 --> 02:53.760
minus 0.04 reward.

02:53.760 --> 02:56.430
And so the longer he walks around, the more

02:56.430 --> 02:58.230
he accumulates this negative reward,

02:59.071 --> 03:00.341
and therefore it is an incentive for him

03:00.341 --> 03:03.840
to finish the game as quickly as possible.
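
NOTE
The accumulation described above can be sketched with a little arithmetic. This is an illustration only: total_reward is a hypothetical helper, discounting is ignored, and the path is assumed to end on the finish line, so every move except the last one costs the living penalty.
```python
# Hypothetical helper, not from the lecture: total undiscounted reward for an
# episode that reaches the finish line after `moves` moves.
def total_reward(moves: int, living_penalty: float, final_reward: float = 1.0) -> float:
    # Every move except the last enters an ordinary tile and costs the penalty;
    # the last move enters the finish line and earns the final reward instead.
    return (moves - 1) * living_penalty + final_reward
# Longer paths collect the penalty more often, so their total is lower.
for moves in (5, 7, 15):
    print(moves, round(total_reward(moves, -0.04), 2))
```
With a penalty of minus 0.04, a 5-move path totals 0.84 while a 15-move path totals 0.44, which is why wandering around becomes unattractive.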

03:03.840 --> 03:06.270
And so now let's have a look at how

03:06.270 --> 03:08.680
our policy or how the agent's policy is going

03:09.701 --> 03:14.400
to change depending on what value we set for this reward.

03:14.400 --> 03:16.380
So here are four environments,

03:16.380 --> 03:18.900
and in each one we're going to explore a different reward.

03:18.900 --> 03:21.120
Now, we're not going to do the calculations

03:21.120 --> 03:23.220
we're just going to project results.

03:23.220 --> 03:25.740
And you will see that intuitively they make total sense.

03:25.740 --> 03:29.100
So here we've got a reward for any step

03:29.100 --> 03:32.850
or for getting into any state, that is equal to zero,

03:32.850 --> 03:35.370
just like what we've seen before. Here, the reward is going

03:35.370 --> 03:38.421
to be minus 0.04, what we just introduced now.

03:38.421 --> 03:40.509
Here, the reward will be minus 0.5,

03:40.509 --> 03:44.190
or the living penalty will be minus 0.5.

03:44.190 --> 03:45.936
So much higher you can see than here,

03:45.936 --> 03:47.760
more than 10 times greater.

03:47.760 --> 03:50.160
And here the living penalty will be minus two.

03:50.160 --> 03:54.900
So even more than the reward you get

03:54.900 --> 03:57.810
for jumping, or rather even less than the reward

03:57.810 --> 04:00.720
that the agent gets for ending up in the fire pit.

04:00.720 --> 04:02.670
So let's have a look at how the actions

04:02.670 --> 04:06.960
or the optimal policy for passing this environment

04:06.960 --> 04:09.180
will change depending on this reward.

04:09.180 --> 04:13.140
So this is our original policy, and as you can remember

04:13.140 --> 04:15.840
we had these two very interesting

04:15.840 --> 04:18.180
and even a little bit weird decisions

04:18.180 --> 04:20.160
by the agent, but which totally make sense

04:20.160 --> 04:23.970
if he can live for as long as he likes,

04:23.970 --> 04:26.310
if he can just travel around for as long as he wants

04:26.310 --> 04:30.791
without being penalized for staying alive very long,

04:30.791 --> 04:34.530
why wouldn't he just go into the corner here

04:34.530 --> 04:38.490
into the wall and just keep doing that until it happens,

04:38.490 --> 04:40.050
it so happens that he goes this way

04:40.050 --> 04:41.460
and then he will walk around?

04:41.460 --> 04:43.380
And same thing here, it's much safer for him

04:43.380 --> 04:46.290
to jump into the wall hoping that one of these will come up

04:46.290 --> 04:49.440
eventually, and then he'll go to the finish line anyway

04:49.440 --> 04:51.690
because by choosing these two actions, he

04:51.690 --> 04:53.640
doesn't risk getting into the fire pit.

04:53.640 --> 04:56.670
Now let's see what happens if we add a reward,

04:56.670 --> 04:58.890
a negative reward for just being alive,

04:58.890 --> 05:01.140
for making a step, making a move.

05:01.140 --> 05:04.980
So here you can see that instantly these two changed.

05:04.980 --> 05:07.920
Now the agent doesn't want to jump into the wall.

05:07.920 --> 05:10.560
He is more likely to risk getting to the fire pit

05:10.560 --> 05:13.500
having a 10% chance of jumping in here, but he

05:13.500 --> 05:16.350
will go forward because every time he jumps

05:16.350 --> 05:18.930
into the wall here, if he was going to be doing it here

05:18.930 --> 05:20.610
as well, every time he jumps

05:20.610 --> 05:22.620
into the wall, he performs an action and ends up

05:22.620 --> 05:25.020
in this state with an 80% chance.

05:25.020 --> 05:28.530
And that means with an 80% chance, he'll get a minus 0.04

05:28.530 --> 05:31.380
reward, meaning that a lot of the time he's going

05:31.380 --> 05:34.950
to be accumulating this negative reward.

05:34.950 --> 05:35.783
Same thing here.

05:35.783 --> 05:39.120
If he jumps into the wall waiting for that moment

05:39.120 --> 05:42.960
when he will actually be randomly moved to the right

05:42.960 --> 05:45.000
if he keeps doing that, he will accumulate this

05:45.000 --> 05:48.030
negative reward and that, the result

05:48.030 --> 05:50.460
of that if you perform the calculations, you'll see

05:50.460 --> 05:54.390
that the result of that, the expected value, like

05:54.390 --> 05:56.240
of that approach, jumping into the wall,

05:57.211 --> 05:58.860
is worse than taking the risk

05:58.860 --> 06:02.850
of going forward and actually ending up in the fire pit.

06:02.850 --> 06:04.500
So he changes his decisions

06:04.500 --> 06:09.450
in these two blocks to instead move forward and here move

06:09.450 --> 06:11.310
to the left even though there's a risk of jumping

06:11.310 --> 06:14.700
in the fire pit, simply because now, the longer he's alive

06:14.700 --> 06:17.760
the more he accumulates this living penalty.

06:17.760 --> 06:18.840
In the next environment,

06:18.840 --> 06:20.580
we're increasing the living penalty

06:20.580 --> 06:23.310
to an even greater magnitude, minus 0.5.

06:23.310 --> 06:24.840
And let's see what changes here.

06:24.840 --> 06:26.160
So now you can see that compared

06:26.160 --> 06:28.920
to this environment the only thing that changed here

06:28.920 --> 06:31.691
is that this arrow is pointing to the right.

06:31.691 --> 06:33.021
And what that means is

06:33.021 --> 06:37.020
that now it's no longer a good option for the agent.

06:37.020 --> 06:38.880
Oh, actually also this arrow is pointing

06:38.880 --> 06:39.960
was pointing to the left

06:39.960 --> 06:42.330
and now it's pointing upwards.

06:42.330 --> 06:45.000
So now it's no longer a good idea

06:45.000 --> 06:48.270
for the agent to go around from here, go around all the way

06:48.270 --> 06:51.150
because if he goes around all the way, yes, he's safer

06:51.150 --> 06:52.740
there's a lesser chance, there's no chance

06:52.740 --> 06:55.740
of getting to the fire pit, but at the same time

06:55.740 --> 06:57.720
or there's less chance of getting to the fire pit.

06:57.720 --> 06:58.620
But at the same time

06:58.620 --> 07:02.070
he will accumulate quite a substantial negative reward

07:02.070 --> 07:03.150
as he walks around.

07:03.150 --> 07:05.520
So it's just that the path is too long.

07:05.520 --> 07:07.800
So that forces him, whether he's here

07:07.800 --> 07:09.660
or here to take the shorter route

07:09.660 --> 07:13.860
to get here even though he has a much higher risk

07:13.860 --> 07:15.390
of getting into the fire pit, because

07:15.390 --> 07:18.120
as soon as he ends up in the square, there's a 10% chance

07:18.120 --> 07:20.100
of getting to the fire pit.

07:20.100 --> 07:21.810
According to his calculations,

07:21.810 --> 07:25.800
it's just that the expected value of this approach is better

07:25.800 --> 07:27.330
than the expected value of going

07:27.330 --> 07:30.720
around simply because we've increased this living penalty.

07:30.720 --> 07:32.220
And finally, we are getting

07:32.220 --> 07:37.110
to the example with the living penalty of minus 2.0.

07:37.110 --> 07:40.350
So here I would encourage you to pause the video

07:40.350 --> 07:42.480
now that you've seen how the policy has changed

07:42.480 --> 07:44.430
as we increased the living penalty.

07:44.430 --> 07:45.750
I encourage you to pause the video

07:45.750 --> 07:49.890
and to think for yourself what will happen in this scenario?

07:49.890 --> 07:53.400
What do you think the optimal policy will be given

07:53.400 --> 07:55.980
that the living penalty is so high?

07:55.980 --> 07:58.470
So I'll let you pause the video if you'd like to.

07:58.470 --> 08:02.400
And now I'm going to jump into showing you the solution.

08:02.400 --> 08:04.960
So in this case, if you increase the living penalty

08:06.461 --> 08:07.830
to minus 2.0, it's so high.

08:07.830 --> 08:10.890
Remember that the penalty here is only minus 1.0.

08:10.890 --> 08:14.100
It's so high that the agent just wants

08:14.100 --> 08:16.410
to get out of the game in any way possible, even

08:16.410 --> 08:18.540
if it's just by jumping into the fire pit.

08:18.540 --> 08:19.373
He will do it.

08:19.373 --> 08:23.400
He will be like, every time I make a step, every time I end

08:23.400 --> 08:25.930
up in a new state, or every time I make

08:26.821 --> 08:30.030
an action I end up getting a minus two reward.

08:30.030 --> 08:32.850
So what's the point of trying to get to the finish line

08:32.850 --> 08:36.750
if from here it will take me two extra steps? I'm just going

08:36.750 --> 08:39.060
to go here and then straight into the fire pit

08:39.060 --> 08:42.171
because that way my reward is going to be less,

08:42.171 --> 08:46.230
the negative reward is not gonna be as bad

08:46.230 --> 08:49.050
as in the case of making additional steps.

08:49.050 --> 08:51.450
So you can see that adding this living reward

08:51.450 --> 08:55.980
and depending on the value of the living reward,

08:55.980 --> 08:57.450
that we are adding,

08:57.450 --> 08:59.280
the results are going to be different.

08:59.280 --> 09:02.670
And the agent is going to select different policies.
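
NOTE
A minimal sketch of how such policies could be computed with value iteration, since the lecture skips the calculations. Everything specific here is an assumption rather than the lecture's own setup: the 3-by-4 grid with the blocked tile at row 1, column 1, the finish line at (0, 3), the fire pit at (1, 3), the 80/10/10 split between the intended move and the two sideways slips, gamma = 0.9, the number of sweeps, and all function names. The lecture does not state the gamma used for its slides, so some tiles may come out differently.
```python
# Sketch of value iteration on an assumed 3x4 grid world; not the lecture's own code.
ROWS, COLS = 3, 4
WALL = (1, 1)                              # assumed position of the blocked tile
TERMINALS = {(0, 3): 1.0, (1, 3): -1.0}    # finish line and fire pit
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SIDEWAYS = {"up": ("left", "right"), "down": ("left", "right"),
            "left": ("up", "down"), "right": ("up", "down")}
def step(state, direction):
    """Tile reached by moving in `direction`; hitting the wall or the edge keeps the agent in place."""
    r, c = state[0] + MOVES[direction][0], state[1] + MOVES[direction][1]
    if (r, c) == WALL or not (0 <= r < ROWS and 0 <= c < COLS):
        return state
    return (r, c)
def q_value(state, action, values, living_penalty, gamma):
    """Expected reward on entering the next tile plus its discounted value."""
    outcomes = [(0.8, action)] + [(0.1, side) for side in SIDEWAYS[action]]
    total = 0.0
    for prob, direction in outcomes:
        nxt = step(state, direction)
        if nxt in TERMINALS:
            total += prob * TERMINALS[nxt]   # game over, nothing follows
        else:
            total += prob * (living_penalty + gamma * values[nxt])
    return total
def optimal_policy(living_penalty, gamma=0.9, sweeps=500):
    """Run a fixed number of value-iteration sweeps, then read off the greedy action per tile."""
    states = [(r, c) for r in range(ROWS) for c in range(COLS)
              if (r, c) != WALL and (r, c) not in TERMINALS]
    values = {s: 0.0 for s in states}
    for _ in range(sweeps):
        values = {s: max(q_value(s, a, values, living_penalty, gamma) for a in MOVES)
                  for s in states}
    return {s: max(MOVES, key=lambda a: q_value(s, a, values, living_penalty, gamma))
            for s in states}
# Compare the four living penalties discussed in the lecture.
for penalty in (0.0, -0.04, -0.5, -2.0):
    print(penalty, optimal_policy(penalty))
```
Under these assumptions, the tile next to the fire pit chooses "right" (straight into the pit) when the living penalty is minus 2.0, matching the behaviour described in the lecture.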

09:02.670 --> 09:06.930
And that's basically how the reward value

09:06.930 --> 09:09.600
is incorporated into the Bellman equation

09:09.600 --> 09:11.550
even when it's not just at the finish line

09:11.550 --> 09:13.770
or at the end of the game, but even throughout the game.

09:13.770 --> 09:15.360
And again, once again, it doesn't have

09:15.360 --> 09:18.210
to be in every single state;

09:18.210 --> 09:21.480
depending on the environment itself, it might be given

09:21.480 --> 09:25.200
to the agent at certain specific states, not at every state

09:25.200 --> 09:28.530
but in our simplistic example, we're just using rewards

09:28.530 --> 09:32.910
at every given state to illustrate this concept.

09:32.910 --> 09:35.250
So I hope you enjoyed today's tutorial, and as you can see

09:35.250 --> 09:38.790
we've already made our Bellman equation quite sophisticated

09:38.790 --> 09:42.270
and now it can be applied to many different scenarios

09:42.270 --> 09:44.340
and I can't wait to see you in the next tutorial.

09:44.340 --> 09:46.353
And until then, enjoy AI.