1
00:00:11,680 --> 00:00:18,130
In this lecture we are going to continue our discussion of deep Q learning and approximation methods.

2
00:00:18,130 --> 00:00:23,080
You'll notice that in the last lecture we didn't invoke the use of deep neural networks at all.

3
00:00:23,890 --> 00:00:26,140
Well first it hopes to do something simple.

4
00:00:26,140 --> 00:00:31,330
Before we do something complicated lay down the foundation before you start adding all the bells and

5
00:00:31,330 --> 00:00:37,100
whistles.

6
00:00:37,140 --> 00:00:42,170
Now you might be wondering is it not trivial to use neural networks for Cuellar.

7
00:00:42,260 --> 00:00:48,150
Is it not the case that all we have to do is replace the linear model with a neural network regression

8
00:00:48,150 --> 00:00:49,200
model.

9
00:00:49,200 --> 00:00:53,230
This is exactly the rule all machine learning interfaces are the same.

10
00:00:53,970 --> 00:00:58,920
So instead of W transpose x plus B we could have a two layer neural network.

11
00:00:58,950 --> 00:01:03,050
As I've written down here this seems perfectly reasonable.

12
00:01:03,120 --> 00:01:09,090
In fact it seems like it would be better than a linear model a linear model can only approximate linear

13
00:01:09,090 --> 00:01:10,940
functions.

14
00:01:11,010 --> 00:01:15,250
How do we know that Q is a linear function of the state s.

15
00:01:15,810 --> 00:01:19,430
In fact it's not clear at all that we can make this assumption.

16
00:01:19,560 --> 00:01:22,430
It doesn't even seem like a good assumption.

17
00:01:22,470 --> 00:01:28,590
Neural networks are more flexible and they can handle all kinds of inputs such as images text time series

18
00:01:28,590 --> 00:01:29,360
and so forth.

19
00:01:34,600 --> 00:01:35,760
So here's what it would look like.

20
00:01:35,770 --> 00:01:38,590
Conceptually our loss is still the same.

21
00:01:38,600 --> 00:01:40,080
It's just the squared error.

22
00:01:40,420 --> 00:01:45,210
Any parameters of the model would be updated using gradient descent.

23
00:01:45,220 --> 00:01:50,610
The only difference now is that the gradient will be different since in this case the gradients are

24
00:01:50,610 --> 00:01:51,940
more complicated.

25
00:01:51,960 --> 00:01:54,580
We don't want to derive them ourselves.

26
00:01:54,600 --> 00:01:59,990
Instead we would like to make use of automatic differentiation which is included in all modern Deep

27
00:01:59,990 --> 00:02:01,950
Learning libraries.

28
00:02:01,950 --> 00:02:07,380
In this case I've written the gradient in terms of a generic parameter theta but you can imagine that

29
00:02:07,650 --> 00:02:17,440
it applies to all the weights and biases of whatever neural network we decide to use.

30
00:02:17,440 --> 00:02:21,790
The problem is training neural networks is inherently unstable.

31
00:02:21,880 --> 00:02:24,130
You probably haven't noticed this yourself.

32
00:02:24,130 --> 00:02:30,550
If you only just started studying deep learning but after a while you'll realize you can't just choose

33
00:02:30,580 --> 00:02:35,200
any learning rate and any optimizer in any model architecture.

34
00:02:35,200 --> 00:02:37,480
Sometimes your cost will explode.

35
00:02:37,480 --> 00:02:40,340
Sometimes your model will not train at all.

36
00:02:40,380 --> 00:02:46,630
Neural networks are pretty sensitive to these choices whereas linear regression usually isn't.

37
00:02:46,760 --> 00:02:51,770
Now if you're a beginner and you're just taking a course on deep learning then the instructor has figured

38
00:02:51,770 --> 00:02:56,300
out all this stuff for you so you don't really recognize the hard work that goes into this

39
00:03:01,480 --> 00:03:04,060
on top of this we have another problem.

40
00:03:04,150 --> 00:03:08,410
Temporal Difference learning itself is also inherently unstable.

41
00:03:08,620 --> 00:03:14,400
Technically we are doing gradient descent but remember that the target is not a real target.

42
00:03:14,440 --> 00:03:20,200
Therefore I lost is not a real loss and our gradient is not a real gradient of a real loss with a real

43
00:03:20,200 --> 00:03:21,120
target.

44
00:03:21,220 --> 00:03:23,200
It's approximations all the way through

45
00:03:28,370 --> 00:03:32,900
what happens when you put together these two inherently unstable algorithms.

46
00:03:32,930 --> 00:03:36,740
Well you get something that barely works if it works at all.

47
00:03:36,770 --> 00:03:41,690
Now that's not to say there won't be instances where you can simply plug in a neuron that work into

48
00:03:41,690 --> 00:03:48,020
your Q learning script and make it work but generally speaking this is not going to work as nicely as

49
00:03:48,230 --> 00:03:51,740
say linear regression for a normal supervised learning dataset

50
00:03:56,910 --> 00:04:02,010
so there are actually a number of approaches called Deep Kulkarni that have been developed over the

51
00:04:02,010 --> 00:04:02,770
years.

52
00:04:02,910 --> 00:04:07,890
But in this lecture my goal is to introduce you to one of these approaches.

53
00:04:07,950 --> 00:04:13,920
The idea is we're going to use something called an experience replay buffer or an experienced replay

54
00:04:13,920 --> 00:04:15,540
memory.

55
00:04:15,540 --> 00:04:20,690
You will notice that in our previous updates we were doing stochastic gradient descent.

56
00:04:20,850 --> 00:04:27,150
In other words looking at one sample at a time what we want to do in order to make this process a little

57
00:04:27,150 --> 00:04:32,390
more stable is Batch gradient descent which looks at multiple samples at a time.

58
00:04:32,550 --> 00:04:34,830
In this way we hope to stabilize the gradient

59
00:04:38,110 --> 00:04:44,010
as a side note in libraries like tensor flow it's called Stochastic gradient descent or SGI.

60
00:04:44,260 --> 00:04:50,140
Even when you use batches however I like to make the distinction that batch gradient descent refers

61
00:04:50,140 --> 00:04:55,570
to looking at multiple samples whereas stochastic gradient descent refers to looking at one sample

62
00:05:00,700 --> 00:05:07,120
you're incurs to try this yourself to gain some intuition grab any supervised learning dataset and compare

63
00:05:07,120 --> 00:05:12,330
the learning curve you get when you look at one sample at a time versus when you look at say thirty

64
00:05:12,330 --> 00:05:13,420
two samples at a time.

65
00:05:14,530 --> 00:05:20,420
You can also compare an entire range of samples from one up to the size of the dataset.

66
00:05:20,650 --> 00:05:26,650
You should find that the lost per iteration is much more stable for batch gradient descent than stochastic

67
00:05:26,650 --> 00:05:28,630
gradient descent.

68
00:05:28,640 --> 00:05:34,130
The other extreme is that you can use the full data set on each round of gradient descent in which case

69
00:05:34,130 --> 00:05:36,440
your last function should decrease Mont atomically

70
00:05:41,700 --> 00:05:47,670
another point is that it's not good to have each sample be very correlated to the previous samples when

71
00:05:47,670 --> 00:05:49,740
you are doing gradient descent.

72
00:05:49,740 --> 00:05:56,700
This is why we like to shuffle the data on each epoch when we are doing gradient descent and reinforcement

73
00:05:56,700 --> 00:05:57,050
learning.

74
00:05:57,060 --> 00:06:02,640
The data is very correlated because you're going from one state to the next in a Markoff process so

75
00:06:02,640 --> 00:06:05,310
you can almost even think of it as like a time series.

76
00:06:05,430 --> 00:06:11,110
Every sample is correlated with the previous sample we'll see how we can randomize the data from the

77
00:06:11,110 --> 00:06:12,570
replay buffer later on

78
00:06:17,750 --> 00:06:18,380
next.

79
00:06:18,410 --> 00:06:20,410
Let's look at how a replay buffer would work.

80
00:06:20,420 --> 00:06:27,740
Conceptually you can imagine it like a python list that stores tuples of the form as they are as prime

81
00:06:27,740 --> 00:06:30,570
and done as we encounter them.

82
00:06:30,680 --> 00:06:34,690
We can call these tuples transitions on this slide.

83
00:06:34,700 --> 00:06:40,480
I visualize what such a list might look like in memory and how we might update it in code.

84
00:06:40,490 --> 00:06:46,100
Basically it's just a one line addition in our loop every time we take a step in the environment.

85
00:06:46,190 --> 00:06:50,450
We just add the latest SARS prime done to our replay list

86
00:06:55,560 --> 00:07:02,170
one important characteristic of the replay buffer is that it has a fixed size at some point.

87
00:07:02,280 --> 00:07:07,650
Some of the transitions in the replay buffer will be old and they will correspond to a policy that we

88
00:07:07,650 --> 00:07:09,610
are no longer following.

89
00:07:09,780 --> 00:07:15,830
In this case we may not want to use these to form batches of data when we perform gradient descent.

90
00:07:15,930 --> 00:07:18,510
So here's a new addition we want to add.

91
00:07:18,990 --> 00:07:25,020
Instead of only appending new transitions to the replay buffer we are also going to remove transitions

92
00:07:25,380 --> 00:07:28,830
when the replay buffer reaches its maximum size.

93
00:07:28,950 --> 00:07:34,290
Since this is a list where we are always appending to the end the transition we want to remove will

94
00:07:34,290 --> 00:07:36,110
always be located at the beginning.

95
00:07:36,130 --> 00:07:37,500
The oldest transition

96
00:07:42,610 --> 00:07:48,350
the last thing we have to consider is what happens when we actually want to perform the update.

97
00:07:48,520 --> 00:07:53,860
In this slide I will describe conceptually what we want to do although in code we will make this a little

98
00:07:53,860 --> 00:07:55,520
more efficient.

99
00:07:55,630 --> 00:08:00,130
So let's suppose we have some function called train in this function.

100
00:08:00,130 --> 00:08:04,700
We first to sample a batch of transitions from our replay buffer.

101
00:08:04,720 --> 00:08:12,000
This is now a list of tuples where each tuple Contains SARS prime done transitions.

102
00:08:12,070 --> 00:08:16,490
Next we initialize our inputs and targets as empty lists.

103
00:08:16,600 --> 00:08:20,280
Then we loop through each transition if done is true.

104
00:08:20,290 --> 00:08:21,960
That means as prime is terminal.

105
00:08:22,060 --> 00:08:23,900
So the return is just ah.

106
00:08:24,340 --> 00:08:31,710
Otherwise the target is our plus gamma times the max over Q of s prime a private after we've created

107
00:08:31,710 --> 00:08:38,760
the target we can append that to our targets list then we add the input the state s to our inputs list

108
00:08:39,720 --> 00:08:42,720
and then once we have a batch of inputs and targets.

109
00:08:42,810 --> 00:08:45,110
This is nothing more than gradient descent.

110
00:08:45,240 --> 00:08:56,010
So all we need to do is the usual gradient descent update parameter minus learning rate times gradient.

111
00:08:56,060 --> 00:08:59,520
So here's what the pseudocode for deep Q learning would look like.

112
00:08:59,780 --> 00:09:05,030
It's very similar to Q learning with approximation methods except now we are going to make use of our

113
00:09:05,030 --> 00:09:06,970
replay buffer.

114
00:09:06,980 --> 00:09:12,050
In fact this code is a little simpler because we already describe the functions to update the replay

115
00:09:12,050 --> 00:09:13,550
buffer and train the model.

116
00:09:14,540 --> 00:09:18,750
So to start we initialize our replay buffer to an empty list.

117
00:09:18,800 --> 00:09:24,240
We also instantiate a model which of course comes with weights which are randomly initialized.

118
00:09:24,290 --> 00:09:30,290
You can obviously picture it as a neuron that we're next we enter a loop that runs for a predetermined

119
00:09:30,350 --> 00:09:33,260
number of episodes inside the loop.

120
00:09:33,260 --> 00:09:38,580
We start a new episode by calling in VDI reset and setting done defaults.

121
00:09:38,810 --> 00:09:44,610
Then we loop through each episode first we choose an action using Epsilon greedy.

122
00:09:44,750 --> 00:09:50,100
Next we take that action in the environment to get s prime R and done.

123
00:09:50,210 --> 00:09:53,390
Then we add the latest transition as a r as prime.

124
00:09:53,390 --> 00:10:01,540
Done to our replay buffer this function also removes the oldest transition if the buffer has reached

125
00:10:01,570 --> 00:10:04,280
its maximum size.

126
00:10:04,310 --> 00:10:09,920
Next we call the train function which samples a batch of data from the replay buffer calculates the

127
00:10:09,920 --> 00:10:14,540
corresponding targets and runs one step of gradient descent.

128
00:10:14,540 --> 00:10:19,900
Lastly we set the state s to the next state as prime for the next iteration of the loop.