1
00:00:02,310 --> 00:00:07,720
Hey everyone, and welcome back to this class, Modern Deep Learning in Python: Deep Learning in Python,

2
00:00:07,720 --> 00:00:08,380
Part two

3
00:00:12,230 --> 00:00:13,080
In this lecture,

4
00:00:13,100 --> 00:00:16,400
we are going to discuss variable and adaptive learning rates.

5
00:00:17,240 --> 00:00:23,750
So far we've looked at momentum as a modification to vanilla backpropagation that can greatly speed

6
00:00:23,750 --> 00:00:24,840
up training.

7
00:00:24,980 --> 00:00:29,090
And from my perspective I find momentum to be the most impactful.

8
00:00:29,360 --> 00:00:35,450
If you compare the loss per iteration for standard gradient descent versus gradient descent with momentum,

9
00:00:35,750 --> 00:00:37,490
the difference is huge.

10
00:00:37,670 --> 00:00:42,740
At the same time, momentum is nice because you don't really have to play with the hyperparameters that

11
00:00:42,740 --> 00:00:45,250
much. Just go with 0.9

12
00:00:45,350 --> 00:00:46,380
and it's probably fine.

13
00:00:47,120 --> 00:00:51,140
So momentum is often my go-to. At the same time,

14
00:00:51,260 --> 00:00:54,320
some of these adaptive learning rate techniques are very powerful.

15
00:00:54,380 --> 00:01:00,670
So let's have a look.

16
00:01:00,800 --> 00:01:06,770
The first thing we'll talk about is variable learning rates, or in other words, learning rates as a function

17
00:01:06,920 --> 00:01:08,940
of iteration or time.

18
00:01:09,050 --> 00:01:12,230
Sometimes people call this learning rate scheduling.

19
00:01:12,520 --> 00:01:15,850
They are very simple but they give you a lot to play with.

20
00:01:15,860 --> 00:01:19,650
So the first one we're going to look at is called step decay.

21
00:01:19,730 --> 00:01:26,450
Essentially what we do is periodically say every 100 steps we reduce the learning rate by a constant

22
00:01:26,450 --> 00:01:27,320
factor.

23
00:01:27,320 --> 00:01:31,650
So for example, we divide by two, or halve it, each time.

24
00:01:31,850 --> 00:01:34,970
If we plot it, you can see that it kind of looks like a staircase.

25
00:01:40,300 --> 00:01:44,520
A second method is called exponential decay. In this method,

26
00:01:44,630 --> 00:01:51,740
the learning rate follows an exponential curve.

27
00:01:51,780 --> 00:01:56,460
The third method is for the learning rate to decay proportionally to one over time.

28
00:01:56,820 --> 00:02:04,930
You can control how fast or slow the learning rate decays by changing the proportionality constant.

29
00:02:04,960 --> 00:02:13,430
Notice how the drop off is slower than exponential decay.
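As a rough sketch, the three schedules just described could be written in Python like this. The function names, the constant `k`, and the `1 / (1 + k * t)` form of the one-over-time decay (chosen so that nothing divides by zero at `t = 0`) are assumptions for illustration, not something specified in the lecture.

```python
import math

# Hypothetical helpers: lr0 is the initial learning rate, t is the iteration.

def step_decay(lr0, t, drop_every=100, factor=0.5):
    """Multiply the learning rate by `factor` once every `drop_every` steps."""
    return lr0 * factor ** (t // drop_every)

def exponential_decay(lr0, t, k=0.01):
    """Learning rate follows an exponential curve: lr0 * exp(-k * t)."""
    return lr0 * math.exp(-k * t)

def inverse_time_decay(lr0, t, k=0.01):
    """Learning rate decays like one over time: lr0 / (1 + k * t)."""
    return lr0 / (1.0 + k * t)

# Print the three schedules side by side at a few iterations.
for t in (0, 100, 200, 500):
    print(t, step_decay(0.1, t), exponential_decay(0.1, t), inverse_time_decay(0.1, t))
```

With the same `k`, the one-over-time schedule stays above the exponential one for every `t > 0`, which matches the slower drop-off mentioned above.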

30
00:02:13,460 --> 00:02:16,400
So what do all of these have in common?

31
00:02:16,400 --> 00:02:19,750
Well you can see that they all decrease the learning rate with time.

32
00:02:19,760 --> 00:02:20,960
Why would we want to do that?

33
00:02:21,920 --> 00:02:27,050
Well, generally speaking, when we initialize the weights of a neural network randomly, they're going

34
00:02:27,050 --> 00:02:29,550
to be very far from the optimal weights.

35
00:02:29,600 --> 00:02:35,540
So it's good to start with a large learning rate so that we can take bigger steps towards the goal.

36
00:02:35,600 --> 00:02:38,420
This is the motivation behind momentum as well.

37
00:02:38,540 --> 00:02:44,360
We want to pick up speed by accumulating past gradients because we know that if we are very far away

38
00:02:44,360 --> 00:02:47,690
from our goal then those gradients should be large.

39
00:02:47,690 --> 00:02:51,280
But when we get close to the goal the gradient is going to shrink.

40
00:02:51,560 --> 00:02:57,320
In fact, by definition, the minimum of a function necessitates a gradient of zero.

41
00:02:57,320 --> 00:03:00,200
That's how you solve for a minimum of a function in calculus,

42
00:03:00,200 --> 00:03:05,630
if you recall: you find the derivative, then you set it to zero, and then you solve for the parameter

43
00:03:05,630 --> 00:03:06,950
in question.

44
00:03:07,130 --> 00:03:11,060
So why might we want to slow down when we get close to the minimum?

45
00:03:11,060 --> 00:03:15,990
Well, when you're close to the minimum and you take too big of a step, you're going to overshoot.

46
00:03:16,040 --> 00:03:20,620
And so what ends up happening is you just end up bouncing back and forth.

47
00:03:20,650 --> 00:03:25,960
In fact, if your learning rate is too large, you'll just bounce right out of the valley, and your loss

48
00:03:25,960 --> 00:03:27,780
might actually increase.

49
00:03:27,790 --> 00:03:37,100
So in order to reduce all this bouncing around we would like to take small steps.
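To make the bouncing concrete, here is a minimal sketch using plain gradient descent on the toy function f(w) = w squared, whose minimum is at zero. The function name and the three learning rates are assumptions picked only to illustrate the behavior.

```python
def gradient_descent_on_square(w0, lr, steps):
    """Run plain gradient descent on f(w) = w**2 and return every visited w."""
    w = w0
    path = [w]
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - lr * grad   # standard gradient descent update
        path.append(w)
    return path

small = gradient_descent_on_square(1.0, lr=0.1, steps=5)      # approaches 0 from one side
bouncy = gradient_descent_on_square(1.0, lr=0.9, steps=5)     # overshoots, flipping sign each step
diverging = gradient_descent_on_square(1.0, lr=1.1, steps=5)  # bounces out: |w| and the loss grow

print(small)
print(bouncy)
print(diverging)
```

On this function each update multiplies `w` by `1 - 2 * lr`, so a negative factor means the step overshoots the minimum, and a magnitude above one means the loss increases.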

50
00:03:37,290 --> 00:03:43,000
Here is another technique that people use which is related to these variable learning rate techniques.

51
00:03:43,000 --> 00:03:49,150
Sometimes researchers and machine learning practitioners will actually sit there and babysit the neural network

52
00:03:49,150 --> 00:03:55,870
training process. They'll do a couple of epochs and see how it goes. If learning is happening too slowly,

53
00:03:55,870 --> 00:04:00,940
they might increase the learning rate. If learning is starting to even out, they might decrease the learning

54
00:04:00,940 --> 00:04:07,330
rate. But you have to be careful, because you're not guaranteed to have a nice monotonically decreasing

55
00:04:07,330 --> 00:04:08,150
curve.

56
00:04:08,470 --> 00:04:14,110
Sometimes you might get stuck in a relatively flat portion of the error surface, but this could just be

57
00:04:14,110 --> 00:04:15,230
temporary.

58
00:04:15,430 --> 00:04:18,520
And if you're patient enough you might end up getting a steep drop.

59
00:04:19,120 --> 00:04:24,460
But at the same time, this is not guaranteed. As with most things in machine learning,

60
00:04:24,590 --> 00:04:30,140
the answer is that the behavior is data-dependent. Waiting longer is desirable

61
00:04:30,140 --> 00:04:34,570
if you have the resources, because then you get a full picture of the learning process.

62
00:04:35,270 --> 00:04:39,230
So manual learning rate scheduling is also an option for you to choose from.

63
00:04:44,460 --> 00:04:44,760
Now,

64
00:04:44,760 --> 00:04:49,780
it's also important to think about what else all these methods have in common.

65
00:04:49,890 --> 00:04:54,930
Well, all of these methods add new hyperparameters to your list of things to optimize.

66
00:04:54,930 --> 00:05:00,880
In fact, choosing between each of these methods can also be seen as a hyperparameter optimization.

67
00:05:00,900 --> 00:05:06,240
So in one way knowing about these techniques can help you but from another perspective it adds more

68
00:05:06,240 --> 00:05:06,990
work to your plate.

69
00:05:12,100 --> 00:05:17,140
The next couple of techniques I want to talk about are what I like to call adaptive learning rate techniques,

70
00:05:17,500 --> 00:05:21,570
because they adapt to the training data that you've seen so far.

71
00:05:21,620 --> 00:05:24,720
The first one I want to discuss is called AdaGrad.

72
00:05:24,860 --> 00:05:31,370
The basic idea is this: we can't expect the dependence of the cost on each of the parameters to be the

73
00:05:31,370 --> 00:05:32,450
same.

74
00:05:32,540 --> 00:05:38,300
In other words in one direction the gradient might be really steep but in another direction the gradient

75
00:05:38,300 --> 00:05:39,200
might be really flat.

76
00:05:39,830 --> 00:05:46,760
So perhaps it may be beneficial to adapt the learning rate for each parameter individually based on

77
00:05:46,760 --> 00:05:48,560
how much it has changed in the past.

78
00:05:53,550 --> 00:06:00,210
So in AdaGrad, what we do is introduce a variable called the cache. Each parameter of the neural

79
00:06:00,210 --> 00:06:01,710
network has its own cache.

80
00:06:02,040 --> 00:06:08,490
So for example, if you have one weight matrix of size three by four, then you'll also have a cache matrix

81
00:06:08,490 --> 00:06:13,650
of size three by four and the same thing goes for your bias vectors.

82
00:06:13,670 --> 00:06:19,280
The idea behind the cache is that it's going to accumulate the squared gradients. Because we're squaring

83
00:06:19,280 --> 00:06:19,920
the gradients,

84
00:06:19,940 --> 00:06:26,660
the cache is always going to be non-negative. And because each parameter has its own cache, if one parameter

85
00:06:26,660 --> 00:06:32,540
has had a lot of large gradients in the past, then its cache will be very large and its effective learning

86
00:06:32,540 --> 00:06:37,190
rate will be very small so it will change more slowly in the future.

87
00:06:38,120 --> 00:06:43,850
On the other hand, if a parameter has had a lot of small gradients in the past, then its cache will be

88
00:06:43,850 --> 00:06:49,910
small, so its effective learning rate will remain large and it will have more opportunity to change in

89
00:06:49,910 --> 00:06:57,440
the future. One minor detail is that we usually add a small number, epsilon, to the denominator to avoid

90
00:06:57,440 --> 00:06:59,090
dividing by zero.

91
00:06:59,090 --> 00:07:02,960
Typically this is set to 10 to the minus eight or 10 to the minus 10
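Here is a minimal sketch of the AdaGrad update as described so far, written per scalar parameter. The function name and the example numbers are assumptions, and adding epsilon outside the square root (rather than inside it) is one common convention, not something the lecture pins down.

```python
import math

def adagrad_update(params, grads, cache, lr=0.01, eps=1e-8):
    """One AdaGrad step over a flat list of scalar parameters.

    Every parameter has its own cache entry that accumulates squared gradients.
    """
    new_params, new_cache = [], []
    for w, g, c in zip(params, grads, cache):
        c = c + g * g                          # accumulate the squared gradient
        w = w - lr * g / (math.sqrt(c) + eps)  # per-parameter scaled step
        new_cache.append(c)
        new_params.append(w)
    return new_params, new_cache

# Two parameters: one keeps seeing large gradients, the other small ones.
params = [1.0, 1.0]
cache = [0.0, 0.0]
for _ in range(3):
    params, cache = adagrad_update(params, [10.0, 0.1], cache, lr=0.1)

# Effective learning rate of each parameter after three steps.
effective_lr = [0.1 / (math.sqrt(c) + 1e-8) for c in cache]
print(cache, effective_lr)
```

The parameter with the large gradients ends up with the large cache and the small effective learning rate, and the other one keeps a comparatively large effective learning rate.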

92
00:07:08,270 --> 00:07:14,030
One important point to stress about AdaGrad is that everything we're doing is an element-wise operation,

93
00:07:14,660 --> 00:07:21,320
so each scalar parameter is effectively updated independently of the others which makes sense.

94
00:07:21,320 --> 00:07:27,920
So you can look at the formulas we presented as scalar updates which apply to all the parameters or

95
00:07:27,920 --> 00:07:32,620
you can think of one huge parameter vector that contains all the neural network parameters,

96
00:07:32,750 --> 00:07:36,140
and then each of the operations is an element-wise operation.

97
00:07:41,210 --> 00:07:47,540
This next technique builds on the fact that researchers observed that AdaGrad decreases the learning

98
00:07:47,540 --> 00:07:49,170
rate too aggressively.

99
00:07:49,490 --> 00:07:54,350
So the learning rate would approach zero too quickly when in fact there was still more learning to be

100
00:07:54,350 --> 00:07:55,000
done.

101
00:07:55,740 --> 00:08:01,740
This technique is called RMSprop, and it was invented by Geoff Hinton and his team.

102
00:08:01,740 --> 00:08:08,550
So the way it works is this: the reason AdaGrad decreases the learning rate too quickly is because

103
00:08:08,550 --> 00:08:10,830
the cache is growing too fast.

104
00:08:10,830 --> 00:08:15,020
So in order to make the cache grow less fast, we actually decrease it

105
00:08:15,360 --> 00:08:23,170
each time we update it. We do this by taking a weighted average of the old cache and the new squared gradient.

106
00:08:23,280 --> 00:08:29,020
We call this weight the decay rate, and you can see that the two weights add up to 1.

107
00:08:29,130 --> 00:08:32,090
We'll be looking at this form again and again throughout the course.

108
00:08:32,100 --> 00:08:35,730
So don't worry about knowing all there is to know right now.

109
00:08:35,730 --> 00:08:40,680
Just know that we'll be taking part of the old cache and part of the new squared gradient and adding

110
00:08:40,680 --> 00:08:45,610
them together to get the new cache. Typical values for the decay rate

111
00:08:45,660 --> 00:08:52,560
are 0.9, 0.99, 0.999, and so on. Again, the intuition for these

112
00:08:52,560 --> 00:08:59,880
choices will be discussed in later lectures. So by doing this, we say that we're making the cache leaky.

113
00:09:00,140 --> 00:09:05,810
And the reason it's leaky is because you can imagine that if we had zero gradient for a long time,

114
00:09:06,230 --> 00:09:11,660
eventually the cache would shrink back down to zero, because it would be decreased by the decay rate on

115
00:09:11,660 --> 00:09:12,200
each round
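A minimal sketch of the RMSprop update for a single scalar parameter, under the same assumptions as before: the function name and the numbers are illustrative, and epsilon is added outside the square root.

```python
import math

def rmsprop_update(w, g, cache, lr=0.001, decay=0.99, eps=1e-8):
    """One RMSprop step for a single scalar parameter."""
    # Weighted average of the old cache and the new squared gradient;
    # the two weights, decay and (1 - decay), add up to 1.
    cache = decay * cache + (1.0 - decay) * g * g
    w = w - lr * g / (math.sqrt(cache) + eps)
    return w, cache

# Leaky cache: with a zero gradient on every round, the cache decays toward zero.
w, cache = 1.0, 4.0
for _ in range(1000):
    w, cache = rmsprop_update(w, 0.0, cache)
print(w, cache)
```

Because the gradient is zero on every round, the parameter itself does not move while the cache leaks away.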

116
00:09:17,550 --> 00:09:18,510
One obstacle

117
00:09:18,600 --> 00:09:24,150
I realized some time after this course was originally released is that there is actually some ambiguity

118
00:09:24,390 --> 00:09:27,240
in the RMSprop and AdaGrad algorithms.

119
00:09:27,300 --> 00:09:34,020
Specifically, I discovered this in the context of RMSprop. So one thing that's not specified in the

120
00:09:34,020 --> 00:09:34,650
RMSprop

121
00:09:34,650 --> 00:09:38,170
algorithm is the initial value of the cache.

122
00:09:38,190 --> 00:09:41,970
Now one might automatically just assume that zero is a good value.

123
00:09:42,000 --> 00:09:45,840
And I did this as well but this actually has a very strange effect.

124
00:09:47,380 --> 00:09:51,400
Let's suppose you choose a decay rate of 0.999.

125
00:09:51,430 --> 00:09:57,230
Then your initial update for the cache will be 0.001 times g squared.

126
00:09:57,280 --> 00:10:05,570
So you're taking one one-thousandth of the squared gradient. So disregarding epsilon for the time being,

127
00:10:06,040 --> 00:10:11,810
your effective learning rate actually becomes very large due to the fact that the denominator is very

128
00:10:11,810 --> 00:10:12,740
small.

129
00:10:12,920 --> 00:10:18,320
What you'd have to do is compensate by making your initial learning rate smaller than usual

130
00:10:23,500 --> 00:10:26,920
One solution to this is to set the cache to one instead.

131
00:10:27,160 --> 00:10:32,860
By manipulating the RMSprop update a little bit, we can show that this gives us approximately what

132
00:10:32,860 --> 00:10:36,310
the update would be if we weren't doing RMSprop at all.

133
00:10:36,310 --> 00:10:42,190
Or in other words the effective learning rate is approximately equal to the initial learning rate divided

134
00:10:42,190 --> 00:10:48,000
by 1.
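The effect of the two initial cache values can be checked numerically. This is a sketch under assumed values: the function name, the gradient of 0.5, and the learning rate of 0.001 are hypothetical, and only the decay rate of 0.999 comes from the example above.

```python
import math

def first_step_size(g, cache0, lr=0.001, decay=0.999, eps=1e-8):
    """Size of the very first RMSprop step for a given initial cache value."""
    cache = decay * cache0 + (1.0 - decay) * g * g
    return lr * abs(g) / (math.sqrt(cache) + eps)

g = 0.5
from_zero = first_step_size(g, cache0=0.0)  # denominator is sqrt(0.001) * |g|
from_one = first_step_size(g, cache0=1.0)   # denominator is close to 1
plain = 0.001 * abs(g)                      # step of plain gradient descent

print(from_zero, from_one, plain)
```

With the cache starting at zero, the first step is roughly `lr / sqrt(0.001)`, about 31.6 times the learning rate regardless of the gradient's magnitude. With the cache starting at one, the first step is close to the plain gradient descent step.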

135
00:10:48,010 --> 00:10:50,490
Now you might ask, well, which one is the correct way?

136
00:10:51,170 --> 00:10:56,770
And we can't really say, because it was never specified. What I can tell you is that there are some very

137
00:10:56,770 --> 00:10:59,920
major packages that have implemented them both ways.

138
00:11:00,070 --> 00:11:03,310
So it's not as if one is more correct than the other.

139
00:11:03,490 --> 00:11:11,250
For example, in TensorFlow the cache is initialized to 1; in Keras the cache is initialized to 0.

140
00:11:11,280 --> 00:11:16,460
In fact it's a great exercise to try and implement these adaptive learning rate techniques.

141
00:11:16,530 --> 00:11:21,900
Then compare your implementation to these official implementations to check if your implementation is

142
00:11:21,900 --> 00:11:22,810
correct.

143
00:11:23,220 --> 00:11:27,900
If it is correct, then your loss per iteration should match what the library gives you.

144
00:11:29,180 --> 00:11:33,730
And so I've done this in the file rmsprop_test.py, in case you want to check it out.