1
00:00:11,350 --> 00:00:16,660
In this lecture, we are going to finally dive into some real reinforcement learning algorithms.

2
00:00:17,400 --> 00:00:21,720
Previously, we looked at some simple and naive solutions to reinforcement learning.

3
00:00:21,750 --> 00:00:23,430
Let's recap what they were.

4
00:00:23,970 --> 00:00:29,640
First, we recall that there are two types of problems the prediction problem and the control problem.

5
00:00:29,880 --> 00:00:36,120
The prediction problem means, given a policy find V of S, the control problem means find the best

6
00:00:36,120 --> 00:00:39,150
policy which yields the maximum V of s.

7
00:00:39,570 --> 00:00:44,910
To solve the prediction problem, we noted that if we have the policy distribution and the state transition

8
00:00:44,910 --> 00:00:48,570
probabilities, it becomes a simple linear algebra problem.

9
00:00:48,720 --> 00:00:55,200
There are many algorithms and functions we can use to solve a system of linear equations for the control

10
00:00:55,200 --> 00:00:55,620
problem.

11
00:00:55,620 --> 00:01:01,170
We consider the naive method of simply looping through all the possible policies and finding which one

12
00:01:01,170 --> 00:01:02,610
is the best policy.

13
00:01:07,290 --> 00:01:11,400
Let's consider now why both of these approaches are somewhat unrealistic.

14
00:01:12,690 --> 00:01:15,390
First, let's consider the prediction problem.

15
00:01:15,600 --> 00:01:21,720
Previously, the solution required us to know both the policy distribution and the environment dynamics.

16
00:01:21,720 --> 00:01:27,930
But the reality is, if you imagine, say, any Atari game, we won't know the environment dynamics.

17
00:01:27,960 --> 00:01:32,880
All we can really do, practically speaking, is play the game a thousands of times.

18
00:01:33,030 --> 00:01:38,700
But usually the state space is so large that it would be infeasible to measure the state transition

19
00:01:38,700 --> 00:01:39,870
probabilities.

20
00:01:39,870 --> 00:01:44,970
And so unless you're playing with a toy environment, you would not be told these probabilities either.

21
00:01:49,620 --> 00:01:51,720
Now let's consider the control problem.

22
00:01:51,900 --> 00:01:55,680
Is it really possible to enumerate all possible policies?

23
00:01:56,070 --> 00:02:00,990
Consider if we have a big -- possible states and big a possible actions.

24
00:02:00,990 --> 00:02:06,000
In this case, the total number of possible policies is big s to the power big A.

25
00:02:06,180 --> 00:02:12,680
In other words, this grows exponentially and hence it is not feasible to enumerate all possible policies

26
00:02:12,690 --> 00:02:14,580
for most practical problems.

27
00:02:14,910 --> 00:02:20,370
Furthermore, this would not even work in the case where the state space or action space is infinitely

28
00:02:20,370 --> 00:02:21,150
large.

29
00:02:26,000 --> 00:02:26,390
All right.

30
00:02:26,390 --> 00:02:29,690
So now that we've realized there is a problem, what is the solution?

31
00:02:30,260 --> 00:02:35,240
The trick to this is to remember the relationship between the expected value and the mean.

32
00:02:35,660 --> 00:02:41,090
The problem with the expected value is that it requires us to know the probability distribution of the

33
00:02:41,090 --> 00:02:42,800
random variable in question.

34
00:02:43,370 --> 00:02:49,730
But importantly, there is a way for us to estimate the mean or equivalently the expected value.

35
00:02:49,940 --> 00:02:51,980
This is called the sample mean.

36
00:02:52,250 --> 00:02:55,610
This is the basis for many scientific experiments.

37
00:02:55,730 --> 00:03:02,270
For example, if we want to test a drug, we don't know the true values, but we can do an experiment

38
00:03:02,270 --> 00:03:05,930
on a set number of people and calculate the average values.

39
00:03:06,290 --> 00:03:12,530
The sample mean is simply the sum of all the samples we collect divided by the number of samples.

40
00:03:12,920 --> 00:03:18,800
The idea is that as N approaches infinity, this estimate will become more and more accurate.

41
00:03:23,650 --> 00:03:26,620
Okay, so what does this have to do with reinforcement learning?

42
00:03:26,770 --> 00:03:30,670
Well, remember that the value function is simply the expected return.

43
00:03:31,000 --> 00:03:36,490
Therefore, using the sampling approach, what we can do is sample a set of returns for each state in

44
00:03:36,490 --> 00:03:38,980
the state space and take the average.

45
00:03:39,220 --> 00:03:42,430
That will give us an estimate of the value of each state.

46
00:03:43,330 --> 00:03:48,040
You'll notice that I've abused notation a little bit here by using different indices for the return

47
00:03:48,040 --> 00:03:48,700
G.

48
00:03:49,210 --> 00:03:52,390
When I say G of T, I mean a generic return g.

49
00:03:52,420 --> 00:03:59,950
The return for the state at time t, but when I index g using I and s what I mean is this is a sample

50
00:03:59,950 --> 00:04:01,020
of the return.

51
00:04:01,030 --> 00:04:04,510
It's the ith sample return from the state's.

52
00:04:09,220 --> 00:04:12,130
Now one obvious but maybe not so obvious point.

53
00:04:12,160 --> 00:04:14,140
Where do these samples come from?

54
00:04:14,680 --> 00:04:19,690
You may recall that when we're in numpy and we want to sample from, say, the standard normal, we

55
00:04:19,690 --> 00:04:22,750
can simply call the function np.random.rand n.

56
00:04:23,260 --> 00:04:25,960
But what does it mean to sample a return?

57
00:04:26,680 --> 00:04:32,350
Well, remember that when we play an episode, even if we use the exact same policy and the exact same

58
00:04:32,350 --> 00:04:34,720
environment, the result will be different.

59
00:04:35,110 --> 00:04:39,740
This is because both the policy and the environment dynamics are probabilistic.

60
00:04:39,760 --> 00:04:43,900
So just playing the game itself results in collecting samples.

61
00:04:48,650 --> 00:04:53,240
Using this concept, let's now attempt to write some pseudocode to describe this algorithm.

62
00:04:53,750 --> 00:04:58,850
By the way, this approach is called the Monte Carlo approach, since this is a form of Monte Carlo

63
00:04:58,850 --> 00:04:59,630
sampling.

64
00:05:00,350 --> 00:05:05,600
First, we're going to describe the prediction problem or in other words, given a policy, find the

65
00:05:05,600 --> 00:05:06,560
value function.

66
00:05:07,160 --> 00:05:09,800
Let's just think about this at a high level first.

67
00:05:09,920 --> 00:05:12,470
Suppose we play one episode of a game.

68
00:05:12,590 --> 00:05:18,440
This consists of entering a series of states and corresponding rewards so we can call them s one up

69
00:05:18,440 --> 00:05:20,780
to s, t and r one up to r t.

70
00:05:21,170 --> 00:05:24,770
From this, how do we calculate the return of each state?

71
00:05:29,410 --> 00:05:31,860
Well, it's helpful to actually go backwards.

72
00:05:31,870 --> 00:05:34,300
For example, G of Big T is zero.

73
00:05:34,600 --> 00:05:38,980
Importantly, note that the return is only the sum of future rewards.

74
00:05:39,190 --> 00:05:44,200
Since a terminal state is the end of an episode, that means there are no future states.

75
00:05:44,230 --> 00:05:47,350
No future states means no future rewards.

76
00:05:47,650 --> 00:05:53,260
Therefore, the return of a terminal state is always zero, and hence the value of a terminal state

77
00:05:53,260 --> 00:05:55,210
is also always zero.

78
00:05:56,820 --> 00:05:58,740
In any case, let's continue.

79
00:05:58,920 --> 00:06:02,280
How do we calculate the return of the previous time step?

80
00:06:02,640 --> 00:06:07,170
Well, by definition, G of big T minus one is equal to R of Big T.

81
00:06:07,860 --> 00:06:08,240
Okay.

82
00:06:08,250 --> 00:06:09,000
The next one.

83
00:06:09,000 --> 00:06:11,160
How about G of big T minus two?

84
00:06:11,280 --> 00:06:13,500
It's G of big T minus two.

85
00:06:13,500 --> 00:06:18,240
Equal to R of big T, minus one plus gamma times R of t.

86
00:06:18,660 --> 00:06:22,110
But importantly, remember that the return is recursive.

87
00:06:22,140 --> 00:06:28,290
So this is also equal to R of big T minus one plus gamma times g of big T minus one.

88
00:06:28,890 --> 00:06:30,660
We can repeat this pattern.

89
00:06:30,750 --> 00:06:37,410
So G of big T minus three is equal to R of big T minus two plus gamma times G of big T minus two.

90
00:06:37,530 --> 00:06:45,000
And of course, this follows for any value of T, so we can say g of t is equal to r of t plus one plus

91
00:06:45,000 --> 00:06:46,950
gamma times g of T plus one.

92
00:06:47,220 --> 00:06:51,840
So you can keep repeating the same pattern until you calculate the return for each state.

93
00:06:56,500 --> 00:06:59,380
Practically speaking, here's how you would do it in code.

94
00:06:59,560 --> 00:07:04,870
First, you would play an episode using your policy and you get back a series of states and rewards.

95
00:07:05,110 --> 00:07:11,200
Next, we initialize a list of returns to an empty list, and we initialize the return to zero.

96
00:07:11,650 --> 00:07:17,720
Then, and this is the important part you loop through the rewards in reverse inside the loop.

97
00:07:17,740 --> 00:07:20,020
We just use the recursive definition of g.

98
00:07:20,500 --> 00:07:23,590
G equals to R plus gamma times G.

99
00:07:23,920 --> 00:07:26,740
Then we append G to our list of returns.

100
00:07:31,610 --> 00:07:32,000
All right.

101
00:07:32,000 --> 00:07:34,970
So how do we put the above algorithm into pseudocode?

102
00:07:35,120 --> 00:07:36,460
Well, that's simple.

103
00:07:36,470 --> 00:07:39,470
Everything we just did, now we just do it in a loop.

104
00:07:39,560 --> 00:07:43,450
Do the thing we just did a several hundred or several thousand times.

105
00:07:43,460 --> 00:07:46,700
Then for each state, take the average return.

106
00:07:46,970 --> 00:07:49,580
First, we assume we are given some policy.

107
00:07:49,760 --> 00:07:54,440
Then we instantiate an empty dictionary to store our sample returns.

108
00:07:54,560 --> 00:07:59,990
The key of this dictionary will be the state and the value will be a list of returns encountered throughout

109
00:07:59,990 --> 00:08:01,070
each episode.

110
00:08:01,400 --> 00:08:06,020
Then we enter a loop which goes on for a predetermined number of episodes.

111
00:08:06,320 --> 00:08:10,340
Inside the Loop we first play one episode using the given policy.

112
00:08:10,490 --> 00:08:14,210
This returns us a list of states and corresponding rewards.

113
00:08:14,930 --> 00:08:17,880
From that, we can calculate the return for each state.

114
00:08:17,900 --> 00:08:19,790
As previously described.

115
00:08:20,450 --> 00:08:24,110
Next, we loop through each state and return in corresponding order.

116
00:08:24,350 --> 00:08:28,850
For each return we encounter, we append that to the list of returns for that state.

117
00:08:28,880 --> 00:08:31,710
This loop is how we collect our samples.

118
00:08:36,400 --> 00:08:40,570
Finally, when the first loop is complete, we are done collecting our samples.

119
00:08:40,600 --> 00:08:45,610
Now we can calculate the average return, which is by definition the value function.

120
00:08:45,790 --> 00:08:49,450
So we loop through each of the items in our sample returns dictionary.

121
00:08:49,660 --> 00:08:54,970
For each iteration we get the state's and a list of sample returns g list.

122
00:08:55,240 --> 00:08:59,770
Inside the loop V of S is simply assign the sample mean of g list.

123
00:09:04,570 --> 00:09:10,390
It's important to realize that there are some complications with the algorithm as described first,

124
00:09:10,390 --> 00:09:16,240
because we're only sampling, how do we ensure that we actually encounter every possible state in the

125
00:09:16,240 --> 00:09:17,950
number of episodes we played?

126
00:09:18,100 --> 00:09:24,190
In fact, we cannot, although we may surmise that because we didn't encounter a particular state,

127
00:09:24,220 --> 00:09:28,120
we don't need to know its value because our policy does not allow us to go there.

128
00:09:28,480 --> 00:09:33,790
You could simply ignore the value of those states in any subsequent function you plug the value function

129
00:09:33,790 --> 00:09:38,260
into, but there is a better solution to understand it.

130
00:09:38,290 --> 00:09:43,600
We're going to move on to the second problem the problem of control or finding the best policy.
