1
00:00:11,630 --> 00:00:15,890
At last we are ready to study the famous Q learning algorithm.

2
00:00:15,890 --> 00:00:22,830
Let's recap what we've done so far because it's been quite a long process to even get to Q learning.

3
00:00:22,850 --> 00:00:29,540
First we defined all the relevant terms such as agent environment state action reward and so forth.

4
00:00:29,540 --> 00:00:33,970
Having these definitions allowed us to structure our problem mathematically.

5
00:00:34,280 --> 00:00:42,950
Specifically we can model reinforcement learning problems as MDD Markoff decision processes.

6
00:00:43,090 --> 00:00:47,700
Next we considered how we would solve an MVP meaning two things.

7
00:00:47,740 --> 00:00:53,180
First we solve the problem of prediction finding the value function given a policy.

8
00:00:53,200 --> 00:00:58,210
Second we solve the problem of control finding the best policy in a given environment.

9
00:00:59,260 --> 00:01:03,880
We learn that if we know all the probabilities in an MVP this is quite easy.

10
00:01:04,720 --> 00:01:14,270
But when we don't know the probabilities we can use sampling methods which we call Monte Carlo.

11
00:01:14,280 --> 00:01:20,700
There is one major problem with Monte Carlo is that in order to calculate returns we must wait until

12
00:01:20,700 --> 00:01:22,580
an episode is over.

13
00:01:22,590 --> 00:01:27,530
This is because the return is defined as the sum of all future rewards.

14
00:01:27,540 --> 00:01:33,690
We can't know the sum of future rewards until we have collected them and therefore we must reach a terminal

15
00:01:33,690 --> 00:01:37,780
state before calculating the sample returns.

16
00:01:37,970 --> 00:01:39,860
Why is this a problem.

17
00:01:39,860 --> 00:01:45,110
Well imagine the scenario where you have very long episodes or the scenario where there is no terminal

18
00:01:45,110 --> 00:01:45,560
state.

19
00:01:47,340 --> 00:01:51,480
In the latter case Monte Carlo is not an option in the former.

20
00:01:51,480 --> 00:01:56,960
It's still not ideal because that means your agent has to spend a very long time performing suboptimal

21
00:01:56,970 --> 00:02:01,710
Lee even though it has already collected a lot of data from which it may improve

22
00:02:06,820 --> 00:02:10,540
the answer to this is temporal difference methods.

23
00:02:10,540 --> 00:02:15,760
If you recall I said earlier that one of the most important features of the return is that it can be

24
00:02:15,760 --> 00:02:23,810
defined recursively the return at time t can be expressed in terms of the return at time T plus 1.

25
00:02:23,890 --> 00:02:28,360
You saw how this helped us both create and solve the Belmont equation.

26
00:02:28,360 --> 00:02:30,750
Now it's going to help us once again.

27
00:02:31,180 --> 00:02:38,720
One simple way to think about it is this Monte Carlo methods where an approximation to an expected value

28
00:02:38,720 --> 00:02:44,360
problem Temporal Difference methods are simply an approximation to Monte Carlo.

29
00:02:44,420 --> 00:02:47,690
So in other words they are an approximation of an approximation

30
00:02:52,810 --> 00:02:59,230
recall this one neat trick with Monte Carlo methods when we're updating Q or V which is the sample mean

31
00:02:59,230 --> 00:03:01,100
of all the samples we've collected.

32
00:03:01,180 --> 00:03:06,690
We don't have to express the update as a sample mean that would mean we have to keep all the samples

33
00:03:06,690 --> 00:03:11,530
around which can take up lots of memory and take lots of time to calculate.

34
00:03:11,550 --> 00:03:17,730
Instead we can express the current sample mean in terms of the past the sample mean we noted that this

35
00:03:17,730 --> 00:03:19,350
looks a lot like gradient descent

36
00:03:24,450 --> 00:03:26,760
Well why not test out this theory.

37
00:03:26,760 --> 00:03:32,550
Let's get our squared error J to be the squared error between my true target G and my value for the

38
00:03:32,550 --> 00:03:39,260
state s v of s.

39
00:03:39,390 --> 00:03:45,930
Now let's say I want to update V of s using the latest sample G which I've just collected in order to

40
00:03:45,930 --> 00:03:46,890
perform the update.

41
00:03:46,890 --> 00:03:53,010
I'm going to use gradient descent set V of us to be the old value of the of S minus the learning rate

42
00:03:53,190 --> 00:03:55,090
times the gradient.

43
00:03:55,160 --> 00:04:00,390
Well what is the gradient if we plug this gradient into our gradient descent update.

44
00:04:00,780 --> 00:04:06,360
Adding nor the two since it can be absorbed into the learning rate we get back the exact opposite equation

45
00:04:06,360 --> 00:04:09,480
we would use for the exponentially decaying average

46
00:04:14,740 --> 00:04:15,610
as a side note.

47
00:04:15,730 --> 00:04:20,470
I want to mention that there is ultimately no difference whether we call what we are doing gradient

48
00:04:20,560 --> 00:04:24,250
descent or gradient descent because there's a plus sign here.

49
00:04:24,250 --> 00:04:26,730
You might think of it as gradient descent.

50
00:04:26,890 --> 00:04:31,650
The plus sign is natural to use if you derive this expression from the sample mean update.

51
00:04:31,990 --> 00:04:36,730
But if you derive this expression from the squared error you will get a negative sign and you might

52
00:04:36,730 --> 00:04:39,610
think of that as gradient descent.

53
00:04:39,610 --> 00:04:45,100
Ultimately however this is just basic algebra and you should confirm to yourself that both expressions

54
00:04:45,100 --> 00:04:45,870
are equivalent

55
00:04:51,080 --> 00:04:53,670
so what's so significant about this.

56
00:04:53,690 --> 00:05:00,170
Well now we're going to put together these two ideas idea number one is that updating the value function

57
00:05:00,230 --> 00:05:06,770
using the exponentially the king average is just like gradient descent idea number two is that the return

58
00:05:06,770 --> 00:05:13,230
it can be defined recursively.

59
00:05:13,350 --> 00:05:14,800
So how about this.

60
00:05:14,970 --> 00:05:20,580
We're going to continue using gradient descent but instead of using the full return we are going to

61
00:05:20,580 --> 00:05:24,340
estimate the return we collect the next reward.

62
00:05:24,370 --> 00:05:30,820
Ah but instead of waiting to collect any future awards we simply guess that they will be close to the

63
00:05:30,820 --> 00:05:31,840
vest prime.

64
00:05:31,930 --> 00:05:33,920
The value of the state where we landed.

65
00:05:34,570 --> 00:05:41,200
So instead of using G equal to R Plus gamma times the X reward plus gamma squared times an X reward.

66
00:05:41,200 --> 00:05:42,200
And so on.

67
00:05:42,310 --> 00:05:51,400
We recognize that G is just equal to our plus gamma times the next return g prime but g prime has the

68
00:05:51,400 --> 00:05:59,990
expected value V of s prime so we instead just say G is approximately equal to our plus gamma times

69
00:05:59,990 --> 00:06:01,620
V of as prime.

70
00:06:01,730 --> 00:06:06,360
In this way we only have to wait a 1 step before updating our model.

71
00:06:06,440 --> 00:06:12,910
We no longer have to wait until the end of the episode we call our plus gamma times V of us prime.

72
00:06:12,980 --> 00:06:15,620
Bootstrapped estimate of the return.

73
00:06:16,040 --> 00:06:22,310
It allows us to update V of s immediately after obtaining the next reward R instead of having to wait

74
00:06:22,340 --> 00:06:25,440
until we've collected all the future rewards.

75
00:06:25,460 --> 00:06:28,120
This method is called the temporal difference method

76
00:06:33,270 --> 00:06:38,510
Here's some pseudocode so you can conceptualize how this will work as a side note.

77
00:06:38,520 --> 00:06:43,380
This should also give you some insight into what goes on inside our play episode function which we haven't

78
00:06:43,380 --> 00:06:44,540
really discussed yet.

79
00:06:45,240 --> 00:06:46,320
And this pseudocode.

80
00:06:46,320 --> 00:06:52,230
We don't have any need to abstract away the play episode function because playing the episode is part

81
00:06:52,230 --> 00:06:54,100
of this loop.

82
00:06:54,250 --> 00:07:00,280
We have things to do on each step of the episode and therefore we can encapsulate it or delegated to

83
00:07:00,280 --> 00:07:04,360
some other function to start.

84
00:07:04,450 --> 00:07:10,180
Assume we are given some environment and some policy we initialize the value function it to be random

85
00:07:11,840 --> 00:07:16,160
then we enter a loop which plays a predetermined number of episodes.

86
00:07:16,160 --> 00:07:21,020
Alternatively you could also run the loop until you find that view of US converges or in other words

87
00:07:21,080 --> 00:07:23,720
settles on some value and doesn't deviate from it.

88
00:07:25,780 --> 00:07:28,990
Inside the loop we begin to playing in episode.

89
00:07:28,990 --> 00:07:35,020
The first thing we do is call it EMV dot reset which resets the environment and puts us back into the

90
00:07:35,020 --> 00:07:40,230
initial state and returns that initial state we'll call it s.

91
00:07:40,300 --> 00:07:47,290
Next we initialize a done boolean flag to false this boolean flag will get set to true when we've completed

92
00:07:47,290 --> 00:07:50,080
an episode.

93
00:07:50,100 --> 00:07:55,380
Next we enter a while loop that completes when done becomes true inside the loop.

94
00:07:55,380 --> 00:08:02,540
We grab the action from our policy we then perform the action in the environment by calling Ian Vidot

95
00:08:02,540 --> 00:08:03,870
step.

96
00:08:03,870 --> 00:08:08,990
This returns three things the next state as prime the reward R and the next.

97
00:08:08,990 --> 00:08:10,550
Done flag.

98
00:08:10,560 --> 00:08:16,290
Note that I've designed this pseudocode to be similar to the opening game API which has become somewhat

99
00:08:16,290 --> 00:08:18,660
standard over the past few years.

100
00:08:18,720 --> 00:08:22,290
It's easy to understand and it will help you in the future.

101
00:08:22,470 --> 00:08:27,260
If you ever do start using open a gym the reverse is also true.

102
00:08:27,330 --> 00:08:32,010
If you've had experience with open a gym in the past they should make things easier to understand.

103
00:08:34,590 --> 00:08:36,470
Next we do the big update.

104
00:08:36,690 --> 00:08:41,130
The temporal difference update we've been discussing throughout this lecture.

105
00:08:41,130 --> 00:08:43,700
Finally and this is important not to forget.

106
00:08:43,920 --> 00:08:47,920
We must update the variable s for the next iteration of the loop.

107
00:08:48,210 --> 00:08:54,640
What is currently the next date as prime will become the current state s in the next iteration.

108
00:08:54,640 --> 00:08:58,240
Lastly when the loop is complete V of S has converged

109
00:09:03,380 --> 00:09:06,750
there is one odd thing about temporal difference learning.

110
00:09:06,770 --> 00:09:12,800
Look carefully at this so-called gradient descent update the target is our plus gamma times view of

111
00:09:12,800 --> 00:09:14,030
s prime.

112
00:09:14,120 --> 00:09:21,380
The prediction is v of s if we relate this back to supervised learning we notice something strange in

113
00:09:21,440 --> 00:09:22,370
supervised learning.

114
00:09:22,370 --> 00:09:28,230
We are given the target as part of the dataset but here we are doing something surprising.

115
00:09:28,340 --> 00:09:32,360
We are predicting the target itself part of the target is given.

116
00:09:32,390 --> 00:09:38,930
That's the reward R but the other part of us prime is actually a model prediction.

117
00:09:38,930 --> 00:09:43,860
Thus it's more correct to say that what we are doing is not quite gradient descent.

118
00:09:43,910 --> 00:09:48,050
It's called a semi gradient instead but this is just the name.

119
00:09:48,120 --> 00:09:50,420
It is the principle that's important.

120
00:09:50,520 --> 00:09:52,930
The principle is we don't know the true target.

121
00:09:53,070 --> 00:09:54,030
We are estimating it.

122
00:09:54,560 --> 00:10:02,050
And this makes it very different from supervised learning.

123
00:10:02,130 --> 00:10:02,490
All right.

124
00:10:02,530 --> 00:10:05,900
So what we looked at so far was the prediction problem.

125
00:10:05,900 --> 00:10:08,540
Now it's finally time for the big reveal.

126
00:10:08,660 --> 00:10:14,950
We are going to solve the control problem using the famous Q learning algorithm at this point we spent

127
00:10:14,950 --> 00:10:19,720
so long building up the prerequisites for Q learning that you should hardly be surprised that what you

128
00:10:19,720 --> 00:10:20,690
see.

129
00:10:20,950 --> 00:10:22,360
Nonetheless let's have a look

130
00:10:27,500 --> 00:10:33,410
as before because Q learning is a control algorithm we are interested in updating Q rather than updating

131
00:10:33,410 --> 00:10:35,860
V at a high level.

132
00:10:35,870 --> 00:10:39,300
We are mostly interested in the innermost part of the loop.

133
00:10:39,680 --> 00:10:44,480
That is where we choose and action and take a step in the environment and where we update the Q table.

134
00:10:45,230 --> 00:10:50,910
So these are the two pieces we are going to focus on here when we choose an action and we are again

135
00:10:51,180 --> 00:10:53,630
going to use an epsilon greedy approach.

136
00:10:53,700 --> 00:10:57,190
So with a small probability Epsilon we will choose a random action.

137
00:10:57,390 --> 00:11:04,930
Otherwise we'll take the ARG Max over Q given a state s once we choose our action we then take that

138
00:11:04,930 --> 00:11:06,270
action in the environment

139
00:11:11,400 --> 00:11:18,360
when we update Q we do something very subtle but important that is when we calculate the target we ignore

140
00:11:18,360 --> 00:11:21,220
whatever action we are going to take next.

141
00:11:21,330 --> 00:11:28,320
Instead we assume that we will take the greedy action and take the max over Q Given the state as prime.

142
00:11:28,410 --> 00:11:30,450
This has two advantages.

143
00:11:30,450 --> 00:11:36,660
First it means that in order to update Q We don't have to wait until we obtain the next action a prime

144
00:11:36,990 --> 00:11:39,720
in the next iteration of the loop.

145
00:11:39,720 --> 00:11:44,000
Second it makes Q learning what is called an off policy algorithm.

146
00:11:44,010 --> 00:11:55,770
This means that I can freely explore but my algorithm will update the Q table as if I had acted greedily.

147
00:11:55,870 --> 00:11:59,380
So here's what Q learning looks like when we put it all together.

148
00:11:59,380 --> 00:12:03,670
Luckily it looks quite similar to the prediction problem pseudocode.

149
00:12:03,680 --> 00:12:06,290
First we are given some environment object.

150
00:12:06,470 --> 00:12:09,770
Then we initialize queue with random values.

151
00:12:09,770 --> 00:12:16,010
Then we enter a loop that goes for a predetermined number of episodes Optionally you can keep going

152
00:12:16,070 --> 00:12:20,860
until Q or the policy converges next.

153
00:12:20,860 --> 00:12:27,120
It's the same as what we had before we reset the environment and start back at the initial stage.

154
00:12:27,340 --> 00:12:30,250
We initialize the done flag to false.

155
00:12:30,580 --> 00:12:34,780
Then we enter a loop for the episode quitting only when we are done.

156
00:12:34,990 --> 00:12:38,370
Next we use epsilon greedy to choose an action.

157
00:12:38,380 --> 00:12:40,530
Then we take a step in the environment.

158
00:12:40,810 --> 00:12:45,580
We get back the state as prime the reward R and the done flag.

159
00:12:45,580 --> 00:12:48,020
Next we create the target for the Q update.

160
00:12:48,220 --> 00:12:56,130
Taking the max over Q Given the state as prime next we update Q of DNA using the target we just calculate

161
00:12:56,130 --> 00:12:57,560
it.

162
00:12:57,570 --> 00:13:02,030
Lastly we update the current state s to be the next state as prime.

163
00:13:02,100 --> 00:13:05,490
Finally when we exit the loop we have found the optimal policy

164
00:13:10,660 --> 00:13:12,020
let's summarize what we just did.

165
00:13:12,040 --> 00:13:18,790
Since that was pretty law first we started by noting that Monte Carlo will not work for infinitely long

166
00:13:18,820 --> 00:13:25,270
episodes and that it's problematic in that we have to wait until an episode is over in order to do any

167
00:13:25,270 --> 00:13:25,720
learning.

168
00:13:26,860 --> 00:13:31,520
Instead we make use of the fact that the return can be defined recursively.

169
00:13:31,540 --> 00:13:39,360
This allows us to approximate the return using only the next reward and the value at the next state.

170
00:13:39,440 --> 00:13:44,930
We also learn that reinforcement learning ultimately starts to look like supervised learning where our

171
00:13:44,930 --> 00:13:50,740
target is not a real target but rather an estimation made by our own model.

172
00:13:50,760 --> 00:13:56,130
We are essentially doing gradient descent on a Q table using this approach.

173
00:13:56,130 --> 00:14:02,550
It allows us to update our Q table on every step because we only have to wait until we receive a single

174
00:14:02,550 --> 00:14:06,900
reward to make an update we call such an approach.

175
00:14:06,920 --> 00:14:10,850
Online learning because the agent learns while it collects data.
