WEBVTT

00:01.020 --> 00:01.890
Hello, and welcome back

00:01.890 --> 00:04.020
to the course on artificial intelligence.

00:04.020 --> 00:07.050
Today we are finally talking about Q-learning.

00:07.050 --> 00:09.660
All right, so we've already got this equation,

00:09.660 --> 00:10.530
the Bellman Equation,

00:10.530 --> 00:13.110
which we've added lots of components to.

00:13.110 --> 00:15.510
We've got the reward here,

00:15.510 --> 00:17.340
which can be not just at the very end,

00:17.340 --> 00:19.950
but it can be at any given step.

00:19.950 --> 00:23.490
We've got the discount factor, we've got the probability,

00:23.490 --> 00:26.910
because now we're looking at Markov decision processes,

00:26.910 --> 00:28.800
and here we've got the probability

00:28.800 --> 00:31.050
of ending up in a different state,

00:31.050 --> 00:33.330
regardless of what action we take,

00:33.330 --> 00:35.220
or actually given the action we take,

00:35.220 --> 00:38.430
there can be multiple states that we can end up in,

00:38.430 --> 00:40.740
and then we've got the value of the next states,

00:40.740 --> 00:42.990
so you can see this is kind of like a recursive function

00:42.990 --> 00:43.823
and so on.

00:43.823 --> 00:46.800
But you probably still have one question.

00:46.800 --> 00:51.800
The question is, where in all of this is the letter Q?

00:51.870 --> 00:54.330
Why is it all called Q-learning?

00:54.330 --> 00:55.860
So where's the Q?

00:55.860 --> 00:56.730
And that's the question

00:56.730 --> 00:58.920
that we're going to be answering today.

00:58.920 --> 01:01.050
So far we've been dealing with values,

01:01.050 --> 01:04.650
the value of being in a certain state,

01:04.650 --> 01:06.630
and now we're going to look

01:06.630 --> 01:10.050
at how Q fits into all of that as well.

01:10.050 --> 01:12.270
So here we've got two examples.

01:12.270 --> 01:14.580
On the left is what we've been doing so far.

01:14.580 --> 01:16.350
Our agent has been analyzing,

01:16.350 --> 01:18.210
okay, I'm over here.

01:18.210 --> 01:19.830
This is a Markov decision process,

01:19.830 --> 01:21.750
so it doesn't matter how I got here.

01:21.750 --> 01:25.140
The rest of the environment doesn't care about the steps

01:25.140 --> 01:26.400
that it took me to get here.

01:26.400 --> 01:30.150
From now on, I have to make the optimal decision

01:30.150 --> 01:30.983
where to go.

01:30.983 --> 01:34.200
Here, here, or here, based on the current state

01:34.200 --> 01:36.000
and all the future states that come from here,

01:36.000 --> 01:37.470
but not from the past.

01:37.470 --> 01:39.690
And so he can see that there's three options.

01:39.690 --> 01:42.240
There's state one, state two, state three,

01:42.240 --> 01:43.890
and based on his experience,

01:43.890 --> 01:46.898
he has calculated the values in these states.

01:46.898 --> 01:49.890
And now he's going to use the Bellman Equation

01:49.890 --> 01:52.140
so even though this is a stochastic process,

01:52.140 --> 01:53.430
so he knows that he'll go here,

01:53.430 --> 01:56.130
but there's a chance that he'll go left or right and so on.

01:56.130 --> 01:58.950
So based on these values, he's going to make a decision.

01:58.950 --> 02:00.180
That's what we've been doing so far,

02:00.180 --> 02:03.570
and that is a totally legitimate approach here.

02:03.570 --> 02:05.670
But now we're going to modify it a little bit.

02:05.670 --> 02:08.550
We're going to take the same exact concept,

02:08.550 --> 02:10.440
same exact problem,

02:10.440 --> 02:13.689
but here, instead of looking at the values of each state

02:13.689 --> 02:15.360
that he can end up in,

02:15.360 --> 02:18.128
we're going to look at the values

02:18.128 --> 02:21.450
or the value of each action.

02:21.450 --> 02:23.640
So we're not gonna use the letter V anymore

02:23.640 --> 02:25.440
because V is for the value of the state,

02:25.440 --> 02:27.180
we're gonna use Q.

02:27.180 --> 02:30.000
And you might have a question, why the letter Q?

02:30.000 --> 02:32.970
Well, Q, some people speculate that Q,

02:32.970 --> 02:35.430
well, I read this I think on Quora,

02:35.430 --> 02:38.670
somebody mentioned that Q is because of quality,

02:38.670 --> 02:40.020
but at the same time,

02:40.020 --> 02:41.760
I couldn't find any other references to that,

02:41.760 --> 02:42.870
so it might not be because of that,

02:42.870 --> 02:44.460
might be just because that's the letter

02:44.460 --> 02:45.900
that was used at the time

02:45.900 --> 02:47.550
and now it became super popular

02:47.550 --> 02:50.760
because it's all called Q-learning because of that.

02:50.760 --> 02:52.800
So no exact reason why it's called Q,

02:52.800 --> 02:56.190
but nevertheless, at least it helps us distinguish

02:56.190 --> 02:57.090
between V and Q.

02:57.090 --> 03:01.380
So Q here represents, rather than the value of the state,

03:01.380 --> 03:03.390
it represents, let's go with quality.

03:03.390 --> 03:05.491
It represents the quality of the action.

03:05.491 --> 03:07.831
Okay, so I've got four actions.

03:07.831 --> 03:10.860
What are the different qualities of these actions?

03:10.860 --> 03:13.320
What is the value of the action

03:13.320 --> 03:14.280
or the quality of the action?

03:14.280 --> 03:15.750
Which action is more lucrative?

03:15.750 --> 03:17.790
So I need a metric telling me,

03:17.790 --> 03:19.680
all right, how do I quantify this action?

03:19.680 --> 03:20.880
And then I can compare them.

03:20.880 --> 03:22.809
And that is exactly what Q is.

03:22.809 --> 03:25.913
And so here he's got four possible actions.

03:25.913 --> 03:29.220
As always, go up, right, left, or down.

03:29.220 --> 03:32.610
And based on the action, this is gonna be a formula

03:32.610 --> 03:35.490
which tells us the quantifiable value of that action

03:35.490 --> 03:38.610
which we're calling the Q, the Q-value of that action.

03:38.610 --> 03:39.443
So let's have a look

03:39.443 --> 03:42.630
at how we're going to derive this formula for Q.

03:42.630 --> 03:44.520
How does it actually relate to these?

03:44.520 --> 03:46.033
Because as you can imagine,

03:46.033 --> 03:49.470
because actions lead to states,

03:49.470 --> 03:52.350
there has to be some sort of link between the two, right?

03:52.350 --> 03:54.750
We've already determined how to calculate this

03:54.750 --> 03:56.070
and we're pretty good at it.

03:56.070 --> 03:57.600
We know how to use the Bellman Equation

03:57.600 --> 03:59.310
in very different environments

03:59.310 --> 04:01.893
with lots of different complications.

04:01.893 --> 04:03.930
Well, let's leverage that knowledge

04:03.930 --> 04:07.260
to understand how we can now calculate Q

04:07.260 --> 04:09.150
in order to make the same predictions.

04:09.150 --> 04:11.651
Because as you can imagine, the environment doesn't change

04:11.651 --> 04:14.310
depending on what approach we use.

04:14.310 --> 04:16.560
The environment's gonna be the same regardless.

04:16.560 --> 04:18.780
So therefore this approach,

04:18.780 --> 04:21.300
and this approach should always give the same result,

04:21.300 --> 04:23.070
and therefore that's another reason

04:23.070 --> 04:25.050
why these two should be linked.

04:25.050 --> 04:26.280
So let's have a look.

04:26.280 --> 04:28.230
So here's our V approach

04:28.230 --> 04:29.580
where we're just going to look at the value

04:29.580 --> 04:32.400
of any given state, this state or any other state.

04:32.400 --> 04:33.340
And here we're going to,

04:33.340 --> 04:34.830
we're just using the letter S here

04:34.830 --> 04:36.987
because that's the current state,

04:36.987 --> 04:38.880
and so therefore the terminology

04:38.880 --> 04:40.590
will be the same in both equations.

04:40.590 --> 04:42.067
And here we're using Q of S, A.

04:43.009 --> 04:45.840
Q is a function of the state S and the action A,

04:45.840 --> 04:47.073
because action is up,

04:48.390 --> 04:49.500
but in which state did we perform that action?

04:49.500 --> 04:52.950
We performed that action in state S.

04:52.950 --> 04:55.170
Okay, so now we're going to write out the Bellman Equation

04:55.170 --> 04:56.460
for the first approach.

04:56.460 --> 04:59.390
As you can see here, we've got V of S.

04:59.390 --> 05:02.160
So the value of any given state S

05:02.160 --> 05:05.640
is the maximum of the reward that you get,

05:05.640 --> 05:07.830
so maximum based on the actions.

05:07.830 --> 05:08.700
So you have three,

05:08.700 --> 05:10.620
in this case you actually have four actions.

05:10.620 --> 05:12.510
So maximum out of all the possible actions,

05:12.510 --> 05:14.040
and then of this part,

05:14.040 --> 05:15.360
which we've already discussed many times.

05:15.360 --> 05:17.910
So this is our reward that we get

05:17.910 --> 05:20.940
from performing that action in that state,

05:20.940 --> 05:22.740
plus a discount factor multiplied

05:22.740 --> 05:26.220
by the expected value of the new state

05:26.220 --> 05:27.120
that we're going to be in.

05:27.120 --> 05:29.430
And expected value because it is a stochastic process.

05:29.430 --> 05:31.380
We don't know exactly for sure

05:31.380 --> 05:32.870
that we are going to end up over here.

05:32.870 --> 05:34.770
We might end up on the left or the right

05:34.770 --> 05:36.000
with certain probability.

05:36.000 --> 05:38.220
That's why these probabilities are in here.
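
NOTE
For reference, here is the equation being read out, written down as a sketch.
The symbols are an assumption about the slide's notation: R(s, a) is the reward,
gamma is the discount factor, and P(s, a, s') is the probability of landing in
state s' after taking action a in state s.
```latex
V(s) = \max_{a} \Big( R(s, a) + \gamma \sum_{s'} P(s, a, s') \, V(s') \Big)
```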

05:38.220 --> 05:40.290
All right, so that's our value,

05:40.290 --> 05:41.520
and now let's look at Q.

05:41.520 --> 05:43.560
So Q is going to be defined,

05:43.560 --> 05:45.300
we're going to use this to define Q.

05:45.300 --> 05:49.200
So let's say the agent from this location, from this state,

05:49.200 --> 05:50.850
performed the action up.

05:50.850 --> 05:54.510
What is the Q-value going to be equal to?

05:54.510 --> 05:57.570
Well, first of all, let's see what he'll get in return.

05:57.570 --> 05:59.400
For performing this action up,

05:59.400 --> 06:01.992
first thing that you'll get is a reward, right?

06:01.992 --> 06:04.200
No doubt about it.

06:04.200 --> 06:05.610
There's going to be some sort of reward,

06:05.610 --> 06:07.530
might be zero, but we know

06:07.530 --> 06:10.846
that the way the reinforcement learning process works

06:10.846 --> 06:12.630
is that sometimes

06:12.630 --> 06:15.270
for performing certain actions from a given state,

06:15.270 --> 06:16.103
there's a reward.

06:16.103 --> 06:17.460
So we're gonna add that in here.

06:17.460 --> 06:19.830
And then we're going to add, what are we going to add?

06:19.830 --> 06:21.120
Well, let's think about it.

06:21.120 --> 06:23.370
What is the next thing that happens

06:23.370 --> 06:24.870
after he's gotten the reward?

06:24.870 --> 06:25.890
Well, next thing that happens

06:25.890 --> 06:30.000
is that now the agent is in a certain state.

06:30.000 --> 06:33.420
He could end up here with an 80% probability,

06:33.420 --> 06:34.740
or some probability,

06:34.740 --> 06:36.810
but actually he can end up here or here.

06:36.810 --> 06:38.370
But wherever he ends up,

06:38.370 --> 06:41.730
now we already have a quantified metric

06:41.730 --> 06:43.881
for that state he's in,

06:43.881 --> 06:47.190
and that is actually the V-value of that state.

06:47.190 --> 06:49.560
But because he can end up in many different states,

06:49.560 --> 06:51.810
in any of the three possible states,

06:51.810 --> 06:54.180
we have to look at the expected value

06:54.180 --> 06:55.690
of the state that he'll be in.

06:55.690 --> 06:57.810
And so we're going to add that in.

06:57.810 --> 07:00.000
We're going to add, of course, the discount factor

07:00.000 --> 07:01.620
as we previously had,

07:01.620 --> 07:03.851
because that is somewhere in the future.

07:03.851 --> 07:07.230
And then we're going to add the sum

07:07.230 --> 07:09.570
across all possible states,

07:09.570 --> 07:12.000
across all possible states that it could end up in

07:12.000 --> 07:14.220
by taking this action, of each state's value times its probability.

07:14.220 --> 07:16.380
So what we're saying here is that,

07:16.380 --> 07:17.880
okay, so by performing an action,

07:17.880 --> 07:19.380
you're going to get a reward,

07:19.380 --> 07:21.630
plus, which is a quantified metric,

07:21.630 --> 07:23.790
plus you're gonna get, you end up in a state,

07:23.790 --> 07:24.990
we don't know which one.

07:24.990 --> 07:27.030
It could be here, it could be here, it could be here.

07:27.030 --> 07:30.780
But here is the expected value of the state

07:30.780 --> 07:32.250
that you're going to end up in.

07:32.250 --> 07:34.200
And now we're going to multiply it by the discount factor

07:34.200 --> 07:36.360
because that is one move away.

07:36.360 --> 07:40.859
So that is our Q-value for performing this action.
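
NOTE
To make that concrete, here is a minimal sketch of the Q-value for one action.
Every number in it (the discount factor, the reward, the 80/10/10 split, the
V-values) is made up for illustration and does not come from the slides.
```python
# Hypothetical numbers for one state and the action "up"; none come from the lecture.
GAMMA = 0.9      # discount factor (assumed)
reward_up = 0.0  # R(s, up): immediate reward for taking "up" in state s (assumed)
# P(s, up, s'): chance of landing in each next state after choosing "up" (assumed)
transition_up = {"above": 0.8, "left": 0.1, "right": 0.1}
# V(s'): values the agent has already estimated for those next states (assumed)
value = {"above": 1.0, "left": 0.5, "right": -1.0}
#
def q_value(reward, transitions, values, gamma):
    """Q(s, a) = R(s, a) + gamma * sum over s' of P(s, a, s') * V(s')."""
    expected_next_value = sum(p * values[s_next] for s_next, p in transitions.items())
    return reward + gamma * expected_next_value
#
q_up = q_value(reward_up, transition_up, value, GAMMA)
print(q_up)  # roughly 0.675 with these made-up numbers
```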

07:40.859 --> 07:45.859
And what you will notice here right away is that the Q-value

07:46.290 --> 07:48.660
is actually exactly identical

07:48.660 --> 07:51.399
to what's inside these brackets over here.

07:51.399 --> 07:52.710
And why is that?

07:52.710 --> 07:54.270
Well, if you think about it,

07:54.270 --> 07:58.696
here we're taking the maximum of the result we will get,

07:58.696 --> 08:01.020
the maximum across all possible actions,

08:01.020 --> 08:01.890
so we've got four actions,

08:01.890 --> 08:04.290
and we're taking the maximum across all possible actions

08:04.290 --> 08:05.848
of the result that we'll get

08:05.848 --> 08:08.340
by taking each of those actions.

08:08.340 --> 08:10.114
And in Q, we're defining,

08:10.114 --> 08:13.980
interesting, what will we get by taking a certain action?

08:13.980 --> 08:16.290
So if you think about it, it makes sense

08:16.290 --> 08:21.090
that the value of a state, so for instance, this state,

08:21.090 --> 08:26.010
is the maximum of all of the possible Q-values, right?

08:26.010 --> 08:28.260
So here in this state, by being in the state,

08:28.260 --> 08:31.260
the agent has one Q-value, two Q-value,

08:31.260 --> 08:32.880
three Q-value, four Q-value.

08:32.880 --> 08:35.070
So he has four possible Q-values.

08:35.070 --> 08:36.300
Well, the value of this state,

08:36.300 --> 08:38.580
it makes sense that the value of the state

08:38.580 --> 08:42.177
is the maximum of all of those four Q-values,

08:42.177 --> 08:44.430
and that is exactly what we can see here.

08:44.430 --> 08:46.200
That's a good confirmation

08:46.200 --> 08:48.090
of this new formula that we derived.
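
NOTE
A quick sketch of that link between V and Q. The four Q-values below are
hypothetical numbers, chosen only to show that the value of the state is the
largest of them.
```python
# Hypothetical Q-values for the four actions available in one state (made-up numbers).
q_values = {"up": 0.675, "right": 0.2, "left": -0.1, "down": 0.05}
# V(s) = max over a of Q(s, a): the state's value is the value of its best action.
state_value = max(q_values.values())
# The action that achieves that maximum is the one the agent would prefer.
best_action = max(q_values, key=q_values.get)
print(best_action, state_value)
```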

08:48.090 --> 08:51.120
If that wasn't the case, if that didn't match up,

08:51.120 --> 08:52.260
then we would have questions.

08:52.260 --> 08:56.970
We'd be like, so why doesn't it match up

08:56.970 --> 09:01.970
if Q-value is a quantified metric of performing an action

09:03.000 --> 09:05.745
and V depends on the four,

09:05.745 --> 09:10.745
is the maximum of the possible results of the four actions

09:10.890 --> 09:12.090
that it can perform.

09:12.090 --> 09:12.960
Hopefully that makes sense,

09:12.960 --> 09:17.370
and that confirms the formula that we've just derived.

09:17.370 --> 09:21.060
And now we are going to make it even more interesting.

09:21.060 --> 09:22.950
We're going to get rid of the V entirely.

09:22.950 --> 09:24.450
Because you can see here you got V

09:24.450 --> 09:27.120
is a recursive function of V.

09:27.120 --> 09:28.980
And then you got V, and then V, and then V,

09:28.980 --> 09:29.813
and then V, and so on.

09:29.813 --> 09:33.660
So you can express this V through all of the following V's,

09:33.660 --> 09:36.210
the most optimal V's that will come up.

09:36.210 --> 09:40.710
Here, we're expressing Q as a recursive function of V

09:40.710 --> 09:42.840
or as a function of the next V.

09:42.840 --> 09:44.190
And then we'd have to plug in this V

09:44.190 --> 09:45.210
and then we'd get back to the V.

09:45.210 --> 09:47.130
So what we are going to do

09:47.130 --> 09:49.620
is we're actually going to take this V

09:49.620 --> 09:53.190
and we're going to replace it with a Q, right?

09:53.190 --> 09:55.110
So let's have a look at that.

09:55.110 --> 09:58.020
We're going to take this V of the next state

09:58.020 --> 10:01.560
and we're going to plug this into that formula over here.

10:01.560 --> 10:05.580
And as you can see now, so this part doesn't change,

10:05.580 --> 10:07.200
this probability doesn't change,

10:07.200 --> 10:12.200
but as we just discussed, V of S is the maximum

10:12.866 --> 10:16.950
by all actions of Q of S and A, right, over here.

10:16.950 --> 10:19.200
So that's what we're going to replace in here.

10:19.200 --> 10:21.630
So we're going to say a maximum of,

10:21.630 --> 10:22.950
of course it's the new action,

10:22.950 --> 10:24.000
the action that we're going to take,

10:24.000 --> 10:26.760
because here we've got V of S prime.

10:26.760 --> 10:30.720
So here now we've got the maximum across all A prime,

10:30.720 --> 10:32.670
so the actions that we're going to take from this state,

10:32.670 --> 10:35.490
or from whichever other state we end up in.

10:35.490 --> 10:37.800
But the action that we're going to take from there,

10:37.800 --> 10:39.720
and maximum across all those,

10:39.720 --> 10:43.250
and the maximum is of all of the Q-values

10:43.250 --> 10:46.500
that are available to us

10:46.500 --> 10:51.500
in that new state S prime, A prime, and that's the action.

10:51.711 --> 10:54.570
So there's gonna be another four Q-values there.

10:54.570 --> 10:56.586
So now as you can see, let's go through that again.

10:56.586 --> 11:00.030
So from what we derived, from what we discussed,

11:00.030 --> 11:01.953
just through logic and intuition,

11:01.953 --> 11:05.340
we can see that V of S

11:05.340 --> 11:07.350
and Q of S and A are linked.

11:07.350 --> 11:10.230
V of S is the maximum across all actions of Q of S and A.

11:10.230 --> 11:11.160
You can see it right here,

11:11.160 --> 11:14.280
so this part is identical to this part.

11:14.280 --> 11:16.020
And then we're going to leverage that

11:16.020 --> 11:19.770
and we're going to replace this bit with V of S from here,

11:19.770 --> 11:21.540
but not this exact formula.

11:21.540 --> 11:23.220
We're gonna take this internal part

11:23.220 --> 11:26.070
and we're going to replace it with Q of S and A.

11:26.070 --> 11:27.720
So we're gonna plug that in here.

11:27.720 --> 11:31.890
And this part is gonna be Q of S prime, A prime.

11:31.890 --> 11:36.890
So maximum across all A primes of Q of S prime, A prime,

11:37.050 --> 11:39.780
and now we have our formula.

11:39.780 --> 11:43.440
So now we have a recursive formula for the Q-value.
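
NOTE
The substitution just described, written out as a sketch. The notation is
assumed to match the slide: R for the reward, gamma for the discount factor,
P for the transition probability, and primes for the next state and next action.
```latex
\begin{aligned}
Q(s, a) &= R(s, a) + \gamma \sum_{s'} P(s, a, s') \, V(s') \\
V(s') &= \max_{a'} Q(s', a') \\
\Rightarrow \quad Q(s, a) &= R(s, a) + \gamma \sum_{s'} P(s, a, s') \, \max_{a'} Q(s', a')
\end{aligned}
```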

11:43.440 --> 11:47.220
So now the agent can think, what's the value of this action?

11:47.220 --> 11:48.570
What's the quality of this action?

11:48.570 --> 11:50.460
What's the Q-value of this action?

11:50.460 --> 11:52.170
Well, it depends on the reward I get

11:52.170 --> 11:54.120
in the immediate step after that,

11:54.120 --> 11:59.120
plus it depends on the discount factor times the maximum

11:59.520 --> 12:02.430
of all the possible Q-values in that state.

12:02.430 --> 12:04.080
But I don't know if I'm gonna get there,

12:04.080 --> 12:06.510
so I need to also look at that state and that state,

12:06.510 --> 12:09.330
and that's why we have this expected value over here.

12:09.330 --> 12:11.970
So we have this sum, probability times a maximum,

12:11.970 --> 12:13.440
and that's our expected value.

12:13.440 --> 12:15.630
So very similar formula as you can see,

12:15.630 --> 12:18.480
but this time we're expressing things through the Q-values.

12:18.480 --> 12:23.370
And that's why this whole algorithm is called Q-learning,

12:23.370 --> 12:26.970
because this is what is looked at,

12:26.970 --> 12:28.530
this is what the agents actually use.

12:28.530 --> 12:29.610
They don't look at the states,

12:29.610 --> 12:31.320
they look at their possible actions,

12:31.320 --> 12:32.730
and then based on the actions,

12:32.730 --> 12:34.230
on the Q-values of the actions,

12:34.230 --> 12:35.760
they will decide which action to take.

12:35.760 --> 12:38.070
So they'll just look at the maximum Q-value

12:38.070 --> 12:39.170
in this given state.

12:39.170 --> 12:40.320
It has four actions.

12:40.320 --> 12:42.570
What is the best action to take?

12:42.570 --> 12:43.470
So it can compare.

12:43.470 --> 12:46.031
Instead of comparing the different states it can end up in,

12:46.031 --> 12:48.360
it's going to compare the possible actions

12:48.360 --> 12:49.920
that it currently has.

12:49.920 --> 12:52.230
Then by finding the optimal one,

12:52.230 --> 12:53.550
it's going to take that action

12:53.550 --> 12:56.100
and then it's going to repeat that process,

12:56.100 --> 12:57.540
repeat that process, and so on.
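
NOTE
A minimal sketch of that idea on a tiny made-up environment. The states,
actions, probabilities and rewards are all hypothetical. It assumes the
transition probabilities are known and simply repeats the recursive formula
until the Q-values settle, which is a simplification for illustration only.
```python
# Tiny made-up MDP: every state, action, probability and reward here is hypothetical.
GAMMA = 0.9  # discount factor (assumed)
ACTIONS = ["forward", "wait"]
TERMINAL = "goal"  # no actions are taken from here, so its value is 0
# R[(s, a)]: immediate reward for taking action a in state s
R = {("start", "forward"): 0.0, ("start", "wait"): 0.0,
     ("middle", "forward"): 1.0, ("middle", "wait"): 0.0}
# P[(s, a)]: probability of each next state s' after taking a in s
P = {("start", "forward"): {"middle": 0.8, "start": 0.2},
     ("start", "wait"): {"start": 1.0},
     ("middle", "forward"): {"goal": 0.8, "middle": 0.2},
     ("middle", "wait"): {"middle": 1.0}}
#
def best_q(q, state):
    """max over a' of Q(s', a'), with 0 for the terminal state."""
    if state == TERMINAL:
        return 0.0
    return max(q[(state, a)] for a in ACTIONS)
#
def sweep(q):
    """Apply Q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') * max_a' Q(s',a') once per pair."""
    return {(s, a): R[(s, a)] + GAMMA * sum(p * best_q(q, s2) for s2, p in P[(s, a)].items())
            for (s, a) in R}
#
def greedy_action(q, state):
    """Compare the Q-values of the available actions and pick the highest."""
    return max(ACTIONS, key=lambda a: q[(state, a)])
#
q = {pair: 0.0 for pair in R}  # start with every Q-value at zero
for _ in range(200):           # repeat the recursive formula until the numbers settle
    q = sweep(q)
print(greedy_action(q, "start"), greedy_action(q, "middle"))
```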

12:57.540 --> 12:59.610
So now you can see how all of this comes together,

12:59.610 --> 13:02.580
how the reward, the discount factor,

13:02.580 --> 13:05.848
the stochastic Markov decision processes,

13:05.848 --> 13:09.480
and the V-values and the Q-values all come together

13:09.480 --> 13:13.410
in order to give us this one super powerful Bellman Equation

13:13.410 --> 13:16.020
for Q-values, which we can now apply

13:16.020 --> 13:20.370
and let our agents learn how to beat the environment.

13:20.370 --> 13:23.370
And so that is an intuitive explanation of what's going on.

13:23.370 --> 13:26.010
I know we went through the formulas, but it is necessary,

13:26.010 --> 13:28.020
because this is our formula

13:28.020 --> 13:31.440
that we've been going through this whole chapter,

13:31.440 --> 13:35.130
and I think it's a good transition from V to Q,

13:35.130 --> 13:38.790
and it illustrates how they're linked to each other.

13:38.790 --> 13:42.210
And if you'd like to get a bit more

13:42.210 --> 13:45.450
of a rigorous approach, mathematical approach,

13:45.450 --> 13:47.850
and see the mathematics behind it,

13:47.850 --> 13:51.630
and learn a bit more about Q-values and how they work,

13:51.630 --> 13:54.180
then we've got some additional reading for you.

13:54.180 --> 13:56.910
This paper's called "Markov Decision Processes:

13:56.910 --> 14:01.910
Concepts and Algorithms" by Martijn van Otterlo, 2009.

14:02.970 --> 14:05.520
So you've got the link here as always,

14:05.520 --> 14:08.430
and here you can read in a bit more detail

14:08.430 --> 14:10.560
to understand all the nitty gritty

14:10.560 --> 14:12.450
behind Q-values and so on.

14:12.450 --> 14:14.730
And now that we've discussed all of these things

14:14.730 --> 14:16.050
relating to the Bellman Equation,

14:16.050 --> 14:19.680
now we are ready to look at something more complex

14:19.680 --> 14:21.257
such as this paper,

14:21.257 --> 14:25.200
if we want to get some additional information on this

14:25.200 --> 14:27.690
in order to get a deeper understanding.

14:27.690 --> 14:29.940
But even if you don't read through this paper,

14:29.940 --> 14:32.658
already you should have a good working knowledge

14:32.658 --> 14:34.950
of what Q-learning is all about

14:34.950 --> 14:37.620
and how agents come up with the actions

14:37.620 --> 14:40.860
that they need to take in a certain environment.

14:40.860 --> 14:42.330
So I hope you enjoyed today's tutorial

14:42.330 --> 14:43.950
and I look forward to seeing you next time.

14:43.950 --> 14:45.813
Until then, enjoy AI.