1
00:00:11,680 --> 00:00:17,380
In this lecture we are going to finally form the equation which represents a generic reinforcement learning

2
00:00:17,380 --> 00:00:17,950
problem.

3
00:00:18,250 --> 00:00:21,610
And from this we can derive a solution.

4
00:00:21,670 --> 00:00:24,650
Let's begin by discussing expected values.

5
00:00:25,030 --> 00:00:30,100
By the way if you've never taken a probability course then you've probably never heard of expected values

6
00:00:30,520 --> 00:00:36,610
in which case if you want to understand this you'll want to take a course in probability to think about

7
00:00:36,610 --> 00:00:39,010
it simply the expected value is just the mean.

8
00:00:39,640 --> 00:00:45,040
So for example Suppose I've measured the heights of 1000 students and I want to model this as a Gaussian

9
00:00:45,040 --> 00:00:46,400
distribution.

10
00:00:46,450 --> 00:00:51,980
I find that the average height is 70 inches with a standard deviation of 4 inches.

11
00:00:52,000 --> 00:01:02,200
In this case the mean of the distribution is 70 and thus the expected value is 70.

12
00:01:02,400 --> 00:01:06,760
Let's look at a coin as the next example a coin has a binary outcome.

13
00:01:07,240 --> 00:01:10,920
If I flip a coin I may get one or I may get zero.

14
00:01:10,930 --> 00:01:15,220
Suppose the probability of both 1 and 0 is 50 percent.

15
00:01:15,250 --> 00:01:17,140
In this case what's the expected value.

16
00:01:18,070 --> 00:01:19,290
Well since it's the mean.

17
00:01:19,330 --> 00:01:22,720
That means the expected value is zero point five.

18
00:01:22,720 --> 00:01:25,290
Sometimes this terminology confuses people.

19
00:01:25,390 --> 00:01:29,380
They ask how can the value I expect to get be zero point five.

20
00:01:29,380 --> 00:01:32,450
If the result can only be zero or 1.

21
00:01:32,470 --> 00:01:39,520
It's important to realize the expected value despite its name is not the value you expect to get.

22
00:01:39,550 --> 00:01:43,510
So if I flip a coin I don't actually expect to see the value zero point five

23
00:01:48,770 --> 00:01:55,160
the expected value actually has a precise meaning for both continuous and discrete distributions although

24
00:01:55,160 --> 00:01:59,780
for most of this section we will assume that we are working with discrete distributions.

25
00:01:59,780 --> 00:02:06,230
The idea is that the expected value is a weighted sum of all the possible values of a random variable

26
00:02:06,620 --> 00:02:09,230
where the weights are the probabilities of those values

27
00:02:14,410 --> 00:02:20,500
as a simple example we can return to our coin example if we have a biased coin so that the probability

28
00:02:20,500 --> 00:02:22,310
of heads is zero point six.

29
00:02:22,420 --> 00:02:29,110
Then the expected value becomes zero point six times one plus zero point four times zero which is zero

30
00:02:29,110 --> 00:02:35,800
point six because the probability of one is now a little higher than the probability of zero the expected

31
00:02:35,800 --> 00:02:43,090
value also becomes a little higher.

32
00:02:43,140 --> 00:02:43,460
All right.

33
00:02:43,470 --> 00:02:46,470
So what's the significance of expected values.

34
00:02:46,470 --> 00:02:52,320
Well let's remember that our reward and hence also our return which is the sum of rewards are random

35
00:02:52,320 --> 00:02:54,660
variables at a high level.

36
00:02:54,660 --> 00:03:00,840
What this means is that if I play a game one hundred times using the same policy in the same environment

37
00:03:01,110 --> 00:03:03,320
I will get different rewards.

38
00:03:03,330 --> 00:03:10,260
This is because both the environment dynamics and my policy are probabilistic and thus it makes sense

39
00:03:10,530 --> 00:03:15,970
to not think about a single return by itself but the expected value of the return.

40
00:03:16,500 --> 00:03:21,990
I would like to maximize the sum of future rewards but because the result is probabilistic.

41
00:03:22,170 --> 00:03:26,970
What I really want to do is maximize the expected sum of future rewards

42
00:03:32,060 --> 00:03:36,380
the expected sum of future rewards has a special name and reinforcement learning.

43
00:03:36,560 --> 00:03:38,960
We call it the value function.

44
00:03:38,960 --> 00:03:44,980
I always mentioned that this is a very unfortunate name because the word value is such a generic word.

45
00:03:45,200 --> 00:03:51,770
Nonetheless this is what it is called because the return is the sum of future rewards after we have

46
00:03:51,770 --> 00:03:52,940
arrived in some state.

47
00:03:53,390 --> 00:04:00,680
So too is the value function as you can see it's not just a plain expected value but rather it's conditioned

48
00:04:00,710 --> 00:04:04,170
on the fact that we arrived in the state as at times.

49
00:04:05,000 --> 00:04:12,760
Thus we denote this as V of s the value of the state s as a side note we sometimes referred to the return

50
00:04:12,820 --> 00:04:14,240
as simply the reward.

51
00:04:14,410 --> 00:04:19,840
Even though the return should be a sum of rewards so you always have to pay attention to the context

52
00:04:25,110 --> 00:04:30,420
as you recall I said that one of the most important things about the return is that it can be defined

53
00:04:30,420 --> 00:04:38,790
recursively the return at time t g of T is equal to the reward at time T plus 1 plus gamma times the

54
00:04:38,790 --> 00:04:41,050
return at time T plus 1.

55
00:04:41,430 --> 00:04:48,390
And because of this we can also define the value function recursively V of S is the expected value of

56
00:04:48,390 --> 00:04:59,830
the reward at time T plus gamma times V of as prime the value at the next date as prime.

57
00:04:59,920 --> 00:05:02,800
Now you must be wondering what is the purpose of all this.

58
00:05:02,800 --> 00:05:05,840
Are we just defining equations for the sake of it.

59
00:05:05,860 --> 00:05:07,850
Of course the answer is No.

60
00:05:07,930 --> 00:05:13,750
Just keep in mind the high level picture we are setting up a problem from which we can find a solution

61
00:05:14,590 --> 00:05:15,980
in order to find a solution.

62
00:05:15,980 --> 00:05:21,790
We must have a problem and thus our job right now is to continue building up the problem so that it

63
00:05:21,790 --> 00:05:23,800
is well-defined.

64
00:05:23,920 --> 00:05:29,200
The next step is to remember that the expected value is really just the weighted sum where the weights

65
00:05:29,230 --> 00:05:31,520
are the probabilities.

66
00:05:31,630 --> 00:05:33,160
So if we write it out in full.

67
00:05:33,220 --> 00:05:34,870
This is what we get.

68
00:05:34,900 --> 00:05:41,140
This is the sum over all possible actions a and the sum over all possible next states as prime and the

69
00:05:41,140 --> 00:05:45,250
sum over all possible rewards are inside the sum.

70
00:05:45,250 --> 00:05:52,480
We have the probability of performing action a given that we are in state s multiplied by the probability

71
00:05:52,480 --> 00:05:56,150
of landing in the next state as prime and receiving the reward R.

72
00:05:56,800 --> 00:06:03,820
And then this is multiplied by the reward R Plus gamma times V of US prime.

73
00:06:03,880 --> 00:06:09,460
In other words it's our plus gamma times V as prime multiplied by the probability of actually getting

74
00:06:09,460 --> 00:06:17,560
the reward R and arriving in the state as prime In other words the expected value of the return.

75
00:06:17,750 --> 00:06:20,020
This is called the Belmont equation.

76
00:06:20,120 --> 00:06:25,610
It is the centerpiece of all the subsequent solutions to reinforcement learning that we will discuss

77
00:06:30,690 --> 00:06:33,650
it's worth thinking about the Belmont equation a little more.

78
00:06:33,960 --> 00:06:38,940
Notice that in order to come up with this equation we only use the rules of mathematics.

79
00:06:38,940 --> 00:06:43,920
The rules of probability it seems like just a regular all expected value.

80
00:06:43,920 --> 00:06:46,680
However the implications are deep.

81
00:06:46,680 --> 00:06:52,110
It's interesting to consider that the two probabilities here Pi of a given S and P of s prime and are

82
00:06:52,130 --> 00:06:57,240
given us and a come from totally different of physical processes.

83
00:06:57,240 --> 00:07:00,050
Pi is our policy and it represents our Asian.

84
00:07:00,090 --> 00:07:02,790
So you can think of that as like an animal.

85
00:07:02,790 --> 00:07:08,370
The state transition probability represents our environment so you can think of that as the world.

86
00:07:08,370 --> 00:07:14,100
So it's interesting that while we can freely manipulate this equation mathematically these two objects

87
00:07:14,160 --> 00:07:16,590
are actually completely distinct physically

88
00:07:21,720 --> 00:07:22,370
at this point.

89
00:07:22,380 --> 00:07:24,140
You must be extremely tired.

90
00:07:24,360 --> 00:07:30,630
All we are doing is making our formulas progressively more complicated with no end in sight but in fact

91
00:07:30,720 --> 00:07:32,520
this is exactly where we want to be.

92
00:07:33,240 --> 00:07:38,640
Although this may seem complicated I will demonstrate to you that in fact what we have arrived at is

93
00:07:38,640 --> 00:07:39,360
pretty simple

94
00:07:44,470 --> 00:07:50,390
we can now define the first problem in reinforcement learning from which we can find a solution.

95
00:07:50,410 --> 00:07:55,120
Remember that in a reinforcement learning problem you can have multiple policies.

96
00:07:55,120 --> 00:07:57,910
Some may be good but some may be bad.

97
00:07:57,910 --> 00:08:01,960
How can we tell what is a good policy and what is a bad policy.

98
00:08:01,960 --> 00:08:04,190
Well how about the value function.

99
00:08:04,330 --> 00:08:10,450
In other words if I can find the value function for a given policy that tells me how good that policy

100
00:08:10,450 --> 00:08:11,570
is.

101
00:08:11,830 --> 00:08:19,770
We call these subscript Pi of s the value function given the policy PI we call the problem of finding

102
00:08:19,770 --> 00:08:21,950
the value function given a policy.

103
00:08:22,080 --> 00:08:23,220
The prediction problem

104
00:08:28,380 --> 00:08:29,090
as promised.

105
00:08:29,130 --> 00:08:31,860
I said that we've arrived at something pretty simple.

106
00:08:32,070 --> 00:08:34,160
Let's look at the Belmont equation again.

107
00:08:34,590 --> 00:08:37,200
These probabilities are a little deceptive.

108
00:08:37,350 --> 00:08:40,850
They have complicated symbols but they are not complicated.

109
00:08:40,890 --> 00:08:45,540
Remember that the policy pi is just a function which we implement in our own code.

110
00:08:45,540 --> 00:08:47,090
We know this probability.

111
00:08:47,250 --> 00:08:49,740
Therefore it's not a problem.

112
00:08:49,800 --> 00:08:52,260
How about the state transition probability.

113
00:08:52,470 --> 00:08:54,340
Let's assume this is known as well.

114
00:08:54,570 --> 00:08:56,100
This is entirely possible.

115
00:08:56,100 --> 00:09:00,430
Take for example the grid world environment.

116
00:09:00,690 --> 00:09:03,120
Now what happens if we know these two probabilities

117
00:09:08,240 --> 00:09:09,280
surprisingly.

118
00:09:09,440 --> 00:09:14,510
This just becomes a system of linear equations which you know how to solve from high school algebra

119
00:09:15,560 --> 00:09:17,600
suppose we have three states.

120
00:09:17,750 --> 00:09:23,740
Then if we simplify the summation from the Belmont equation we would arrive at something like this.

121
00:09:23,750 --> 00:09:28,930
So here at the B's in the seas are just constants which are derived from the probabilities which we've

122
00:09:28,940 --> 00:09:35,850
assumed that we know from this we can call a simple function like NPR announced that solve and this

123
00:09:35,850 --> 00:09:39,980
would tell us v of s one v of us two and a view of us.

124
00:09:40,050 --> 00:09:46,740
In other words if we know both our policy and the environment dynamics we can solve for the value function

125
00:09:46,830 --> 00:09:48,500
using only linear algebra.
