1
00:00:11,720 --> 00:00:18,800
In this lecture we are going to extend our knowledge of basic learning to deep learning that is Q learning

2
00:00:18,830 --> 00:00:23,380
that uses neural networks to understand the motivation behind this.

3
00:00:23,390 --> 00:00:26,240
Consider what we have been doing so far.

4
00:00:26,240 --> 00:00:31,510
Q. learning involves finding the optimal Q of S and a and the corresponding optimal policy.

5
00:00:32,720 --> 00:00:38,270
As you recall just like with supervised learning targets we can think of the states as categories and

6
00:00:38,270 --> 00:00:41,380
the actions as categories in a computer.

7
00:00:41,390 --> 00:00:49,610
We would represent them with integers starting from zero.

8
00:00:49,730 --> 00:00:56,480
For this reason we call Q of DNA a Q table it's a table as in something that looks like a data frame

9
00:00:56,480 --> 00:00:57,900
or a matrix.

10
00:00:58,010 --> 00:01:02,910
Q zero zero means the action value for state zero and action zero.

11
00:01:03,020 --> 00:01:09,650
This is nothing but a two dimensional array methods which use Q tables are called tabular methods for

12
00:01:09,650 --> 00:01:11,570
obvious reasons.

13
00:01:11,570 --> 00:01:16,970
In this section we haven't spent too much time explaining in detail how to implement tabular methods

14
00:01:17,300 --> 00:01:21,580
because my goal was to get you to deep Q learning as quickly as possible.

15
00:01:23,470 --> 00:01:28,840
However it's important to realize that what we have been discussing so far was in the form of tabular

16
00:01:28,840 --> 00:01:32,020
methods so you can understand why we want to move on from them

17
00:01:37,110 --> 00:01:42,690
consider the scenario where your state space or your action space is continuous or infinite.

18
00:01:43,050 --> 00:01:47,650
It should be clear that storing a Q table is no longer a viable option.

19
00:01:47,670 --> 00:01:51,090
We cannot have an infinitely sized table.

20
00:01:51,090 --> 00:01:56,520
One option which would be relatively simple to implement is to use the building approach to force your

21
00:01:56,520 --> 00:01:59,700
state space or action space to be discreet.

22
00:01:59,730 --> 00:02:05,250
For example suppose you're playing the inverted pendulum game in one of your states is the angle of

23
00:02:05,250 --> 00:02:10,340
the pendulum the angle of course must be between zero and 2 pi.

24
00:02:11,240 --> 00:02:17,900
Thus we can do something like break this up into equal sized bins of size pi over four so any angle

25
00:02:17,930 --> 00:02:24,320
between a zero and pi over four would be considered category 0 pi over for it's a pi over 2 would be

26
00:02:24,320 --> 00:02:27,180
considered Category 1 and so on.

27
00:02:27,200 --> 00:02:30,160
This is a fine approach but it can only take you so far

28
00:02:35,320 --> 00:02:36,230
instead.

29
00:02:36,280 --> 00:02:41,100
If we want to do something more flexible we have to make use of machine learning.

30
00:02:41,170 --> 00:02:48,530
We call these approximation methods because we are using machine learning models as function as for

31
00:02:48,530 --> 00:02:49,700
deep Q learning.

32
00:02:49,790 --> 00:02:56,030
We will consider the case where the state is continuous but the action is discrete because the state

33
00:02:56,030 --> 00:02:56,840
is continuous.

34
00:02:56,840 --> 00:02:59,150
We can't use a cue table.

35
00:02:59,150 --> 00:03:06,200
Instead let's suppose our state is represented by a continuous vector s and let's say for simplicity's

36
00:03:06,200 --> 00:03:14,610
sake that we have two actions a zero and a one in this case we can represent U of S a zero as the dot

37
00:03:14,610 --> 00:03:23,890
product between W 0 and s plus B zero we can similarly represent Q of S and a 1 as w 1 transpose s plus

38
00:03:23,890 --> 00:03:25,430
b 1.

39
00:03:25,480 --> 00:03:35,570
In other words we can approximate the Q value for both actions as linear regressions.

40
00:03:35,660 --> 00:03:39,470
Of course there is no need to represent the weights separately.

41
00:03:39,470 --> 00:03:45,080
Instead we can combine all the weights into a weight matrix and combine all the biased terms into one

42
00:03:45,080 --> 00:03:46,900
bias vector.

43
00:03:47,000 --> 00:03:52,940
If the state is a continuous vector of size D and we have k different actions then the weight matrix

44
00:03:52,940 --> 00:03:58,640
will be of size D by K and the bias vector will be of size k.

45
00:03:58,640 --> 00:04:04,930
Notice how I'm abusing notation here a little bit by mixing it num pi notation into our math the coal

46
00:04:04,940 --> 00:04:08,210
in it means select all values in this dimension.

47
00:04:08,210 --> 00:04:12,800
So this is the CU value given the state s overall actions

48
00:04:17,870 --> 00:04:22,180
one example of this is the inverted pendulum task in this task.

49
00:04:22,190 --> 00:04:28,430
Your state consists of four components the position of the car the velocity of the car the angle of

50
00:04:28,430 --> 00:04:32,470
the pendulum and the angular velocity of the pendulum.

51
00:04:32,480 --> 00:04:37,660
We can write this using the notation x x Di theta theta.

52
00:04:38,180 --> 00:04:43,160
The action in this environment is the amount of force to apply to the car.

53
00:04:43,160 --> 00:04:49,310
Although this could be continuous since force is a continuous variable environments like carpool in

54
00:04:49,310 --> 00:04:52,580
open a gym usually descriptors this force.

55
00:04:52,670 --> 00:04:59,940
So you can only apply a force of either plus 1 or minus 1 in this case for the action value.

56
00:05:00,040 --> 00:05:05,480
The weight matrix would be a size 4 by 2 and the bias vector would be a size 2.

57
00:05:05,740 --> 00:05:10,530
If we're predicting the state value V of S then the output is a scalar.

58
00:05:10,690 --> 00:05:15,210
In this case the weight would be a vector of size 4 and the bias would be a scalar

59
00:05:20,280 --> 00:05:25,890
previously when we discussed tabular temporal difference loading we would update V of S and queue of

60
00:05:25,890 --> 00:05:31,170
assigned a directly because those were just entries in a table now.

61
00:05:31,200 --> 00:05:33,780
Since Q and V are not tables we can't do that.

62
00:05:34,590 --> 00:05:35,940
So what do we do instead.

63
00:05:36,900 --> 00:05:40,980
Well it's clear that what has to be updated are the weight matrix and biased vector

64
00:05:46,070 --> 00:05:52,130
let's start with the prediction problem finding V of s here is where treating the problem as a supervised

65
00:05:52,130 --> 00:05:53,900
learning problem becomes important.

66
00:05:54,770 --> 00:05:56,240
What is the target.

67
00:05:56,240 --> 00:05:59,580
What is the prediction and what are we trying to update.

68
00:05:59,600 --> 00:06:05,520
Let's list these out the target is the return or in the case of Q learning.

69
00:06:05,640 --> 00:06:11,010
It's the estimation of the return r plus gamma times V of us Prime.

70
00:06:11,010 --> 00:06:14,630
One exception to this is if s prime is a terminal state.

71
00:06:14,700 --> 00:06:18,110
We know that a terminal state represents the end of an episode.

72
00:06:18,330 --> 00:06:22,150
So there are no further states and hence no further rewards.

73
00:06:22,170 --> 00:06:25,800
Therefore if s prime is terminal the target is simply the reward.

74
00:06:25,800 --> 00:06:30,440
R next what is the prediction.

75
00:06:30,440 --> 00:06:34,310
The prediction is v of S and what do we want to update.

76
00:06:34,910 --> 00:06:38,240
We want to update the linear regression parameters W and V

77
00:06:43,380 --> 00:06:44,750
since this is a regression.

78
00:06:44,760 --> 00:06:49,800
We know that we should use the squared error and do gradient descent to update the parameters.

79
00:06:50,610 --> 00:06:51,600
Let's see how this will work.

80
00:06:52,870 --> 00:07:00,430
First we set up our loss J equals R Plus gamma times V of as prime minus V of us all squared.

81
00:07:00,430 --> 00:07:02,360
Next we find the gradient of J.

82
00:07:02,530 --> 00:07:04,240
With respect to the parameters

83
00:07:06,840 --> 00:07:12,450
one important thing to realize here is that even though both a V of s and view of us Prime depend on

84
00:07:12,450 --> 00:07:19,650
the model parameters we only differentiate V of s with respect to W and b we pretend that V of US prime

85
00:07:19,650 --> 00:07:20,550
is constant.

86
00:07:20,700 --> 00:07:24,800
Since it's part of the target and we do not differentiate with respect to the target

87
00:07:29,900 --> 00:07:35,030
so let's see what temporal difference learning for the prediction problem would look like in code.

88
00:07:35,030 --> 00:07:40,700
Basically this loop is exactly the same as what we had before except for the fact that instead of updating

89
00:07:40,700 --> 00:07:44,260
v directly we are updating W and B.

90
00:07:44,470 --> 00:07:50,860
So first assume we are given an environment and a policy assume that the policy is a function which

91
00:07:50,860 --> 00:07:59,970
accepts as input a state and returns an action next we initialize the weight and bias to be random.

92
00:07:59,990 --> 00:08:04,820
Next we enter a loop for a predetermined number of episodes inside the loop.

93
00:08:04,820 --> 00:08:12,610
We play one episode we call Ian VDI reset to get the first state s which is now a continuous vector.

94
00:08:12,680 --> 00:08:20,180
As usual we set the Dunn flag to false and enter a loop which exits when done becomes true inside the

95
00:08:20,180 --> 00:08:20,540
loop.

96
00:08:20,540 --> 00:08:27,320
We use our given policy function to get the action a we perform this action in the environment to get

97
00:08:27,320 --> 00:08:28,730
the next state as prime.

98
00:08:28,850 --> 00:08:31,510
The reward R and the done flag.

99
00:08:31,820 --> 00:08:33,490
Then it's time for our big update

100
00:08:38,630 --> 00:08:39,440
in this update.

101
00:08:39,560 --> 00:08:45,230
We'll need the values of both V of s envy of US prime which we can calculate using our linear regression

102
00:08:45,230 --> 00:08:52,600
model then we run our gradient descent step using the gradients we'd arrived earlier.

103
00:08:52,610 --> 00:08:59,770
Lastly we update us to be as prime when the loop is done w and B will have converged

104
00:09:05,000 --> 00:09:10,970
as you'll see very shortly approximating V of us is not that useful for the prediction problem.

105
00:09:10,970 --> 00:09:12,900
We would be given the policy.

106
00:09:13,040 --> 00:09:19,210
But what if we are trying to use the value function to help us choose the action then what is the policy.

107
00:09:19,220 --> 00:09:26,870
Well normally it would involve taking the ARG Max over Q of S and a overall actions but if we have only

108
00:09:26,870 --> 00:09:30,420
V of S this is not indexed by the action a.

109
00:09:30,650 --> 00:09:35,270
This is problematic because how can we figure out what the best action to take is

110
00:09:40,360 --> 00:09:43,060
here's one way we could do it which is not optimal.

111
00:09:44,020 --> 00:09:49,710
Imagine you are in the status and you want to know what is the best action to perform.

112
00:09:49,720 --> 00:09:55,240
Imagine that you are in a very special environment where you can try an action and then go back one

113
00:09:55,240 --> 00:09:58,780
step to put yourself back into the status.

114
00:09:58,900 --> 00:10:04,990
In this case what you can do is try every single action so you arrive at all the possible next states

115
00:10:05,020 --> 00:10:11,650
as prime and receive the reward R then out of all of those actions you choose the one that gives you

116
00:10:11,650 --> 00:10:16,660
the maximum value of our plus gamma times V of US prime.

117
00:10:16,660 --> 00:10:22,750
Of course this approach is not at all ideal because in any realistic environment you cannot simply try

118
00:10:22,750 --> 00:10:25,970
actions and then go back to your original state.

119
00:10:26,050 --> 00:10:34,960
It would certainly be nice if such functionality existed in the real world.

120
00:10:35,160 --> 00:10:37,100
So what do we do instead.

121
00:10:37,170 --> 00:10:41,610
Of course the answer is to use the action value Q In this case.

122
00:10:41,610 --> 00:10:49,390
What is the target as before if as prime is terminal then the return is just are the reward.

123
00:10:49,530 --> 00:10:56,910
If s prime is not terminal then the target is R Plus gamma times the max over all actions a prime of

124
00:10:56,910 --> 00:11:04,090
Q of s prime a prime the prediction is Q of S and a and the parameters we want to update are still w

125
00:11:04,090 --> 00:11:09,610
N.B..

126
00:11:09,820 --> 00:11:12,920
So here's what the update looks like in code.

127
00:11:12,920 --> 00:11:15,740
First the squared error loss is our plus Gamma.

128
00:11:15,760 --> 00:11:24,380
Times the maximum are a prime few of us prime a prime minus Q of s in a all squared the gradient is

129
00:11:24,380 --> 00:11:28,230
Q of S and a minus the target times S.

130
00:11:28,490 --> 00:11:35,900
However an important caveat is that this gradient only applies to the action a which was the action

131
00:11:35,930 --> 00:11:40,010
actually taken since that's where the prediction came from.

132
00:11:40,010 --> 00:11:43,280
The weights for the other actions didn't make predictions.

133
00:11:43,310 --> 00:11:45,630
So they are not updated.

134
00:11:45,680 --> 00:11:50,540
This is of course equivalent to saying that their error is zero which will come into play during the

135
00:11:50,540 --> 00:11:52,400
implementation.

136
00:11:52,580 --> 00:11:58,850
We do the same thing for the bias term only the bias term which influences the prediction for the action

137
00:11:58,880 --> 00:12:05,470
a is updated in other words we index both w N.B. by the action a

138
00:12:10,690 --> 00:12:15,940
finally here's what Q learning looks like when we use approximation methods.

139
00:12:16,150 --> 00:12:20,060
Let's start again with what the choose action function would look like.

140
00:12:20,080 --> 00:12:27,610
Again we use epsilon greedy for exploration First we generate a random number between 0 and 1.

141
00:12:27,610 --> 00:12:31,580
If this number is less than epsilon we perform a random action.

142
00:12:31,690 --> 00:12:38,650
If it's not then we calculate the model outputs for Q Given the state s to choose the action and we

143
00:12:38,650 --> 00:12:41,920
take the ARG Max over Q at the status.

144
00:12:42,190 --> 00:12:47,390
Again this works because the actions are encoded as integers starting from 0

145
00:12:52,500 --> 00:12:53,030
next.

146
00:12:53,040 --> 00:13:00,090
Let's look at the Q learning loop as before we start by initializing W and b to be random.

147
00:13:00,090 --> 00:13:07,580
Then we enter a loop and play one episode per iteration we reset the environment and set the Dunn flag

148
00:13:07,580 --> 00:13:10,600
to false inside the second loop.

149
00:13:10,610 --> 00:13:13,720
We choose an action using the function we defined earlier.

150
00:13:14,990 --> 00:13:21,240
Then we take a step in the environment to arrive in the next state and receive our reward R.

151
00:13:21,980 --> 00:13:23,940
Next is our big update.

152
00:13:24,020 --> 00:13:30,040
It helps the first calculate the target since it looks pretty complicated so we'll call that y it's

153
00:13:30,040 --> 00:13:35,990
equal to our plus gamma times the max over all actions a prime of Q of s primate prime.

154
00:13:36,250 --> 00:13:38,730
And remember AQ itself is a model prediction.

155
00:13:38,770 --> 00:13:45,250
So anytime you see Q here that means we're calling model that predict next.

156
00:13:45,300 --> 00:13:52,440
We run our gradient descent updates for WSB but importantly we only update the components of W and B

157
00:13:52,770 --> 00:14:00,540
which correspond to the chosen action a lastly we update the state s to become as prime.
