1
00:00:11,040 --> 00:00:17,040
In this lecture, we'll be looking at another way to modify gradient descent using a technique called

2
00:00:17,040 --> 00:00:20,050
adaptive moment estimation or Adam for short.

3
00:00:20,820 --> 00:00:26,160
Adam is a technique that deserves its own lecture because it is the go to method for optimizing neural

4
00:00:26,160 --> 00:00:27,040
networks today.

5
00:00:27,750 --> 00:00:33,480
That is, people most often use Adam as the default choice without really considering other options.

6
00:00:34,020 --> 00:00:35,570
The reason for this is simple.

7
00:00:36,060 --> 00:00:41,170
Adam tends to work well and it tends to work well with the default settings for the learning rate,

8
00:00:41,190 --> 00:00:43,140
momentum parameter and so on.

9
00:00:43,710 --> 00:00:45,510
In other words, Adam is robust.

10
00:00:46,170 --> 00:00:51,750
You can use it and not have to worry about tweaking the knobs to find values which are exactly right,

11
00:00:52,080 --> 00:00:54,780
since what you have by default is probably OK.

12
00:00:55,620 --> 00:00:58,980
In addition, Adam is somewhat of a successor to arms.

13
00:01:00,060 --> 00:01:06,510
Both were developed at the University of Toronto Arms prop was developed by Geoffrey Hinton while Adam

14
00:01:06,510 --> 00:01:10,800
was developed by Jimmy Ba, who is a PhD student of Hinson's at the time.

15
00:01:11,790 --> 00:01:17,760
In fact, Jimmy Ba had three PhDs supervisors, all of whom are prominent machine learning researchers,

16
00:01:18,060 --> 00:01:19,910
and he is now a professor himself.

17
00:01:24,480 --> 00:01:28,560
Now, one way to think of Adam is that its arms prop with momentum.

18
00:01:29,220 --> 00:01:35,070
This is because it takes the same idea of having an adaptive learning rate, but it also adds momentum.

19
00:01:35,640 --> 00:01:41,010
This might be a little confusing since libraries like Tenzer Flow also provide you with an option to

20
00:01:41,010 --> 00:01:43,390
add momentum to arms prop.

21
00:01:43,830 --> 00:01:45,790
However, these are slightly different.

22
00:01:46,290 --> 00:01:51,930
So the key point is Adam is like arms with momentum, but it's not exactly the same.

23
00:01:56,470 --> 00:01:58,940
Let's start by reviewing what we have seen so far.

24
00:01:59,590 --> 00:02:05,950
Specifically, let's consider two helpful ways that we can improve vanilla gradient descent method.

25
00:02:05,950 --> 00:02:07,860
Number one is to use momentum.

26
00:02:08,260 --> 00:02:10,300
An object in motion stays in motion.

27
00:02:11,050 --> 00:02:15,580
Method number two is to use adaptive learning rate methods like arms prop.

28
00:02:16,630 --> 00:02:20,210
Note that I'm using slightly different symbols than what you are probably used to.

29
00:02:20,950 --> 00:02:25,050
This is just to set us up to learn Atem, which uses different symbols by convention.

30
00:02:25,840 --> 00:02:29,390
The parameter update for momentum proceeds in two steps.

31
00:02:30,190 --> 00:02:34,690
First, we update the momentum value, which I'm going to call em in this lecture.

32
00:02:35,620 --> 00:02:42,050
It's equal to some proportion of the previous momentum, minus the learning rate times the current gradient.

33
00:02:42,820 --> 00:02:46,490
Then we update the parameters themselves, which we call theta.

34
00:02:47,230 --> 00:02:51,280
The update is just the old value of theta, plus the new momentum.

35
00:02:53,030 --> 00:02:57,050
For this lecture, it's helpful to present momentum and a slightly different way.

36
00:02:57,740 --> 00:03:03,020
Specifically, we now treat the momentum value as kind of the negative of what it was before.

37
00:03:03,710 --> 00:03:10,400
The update for MFT is now the weighted some of the previous M of T minus one and the current gradient

38
00:03:10,400 --> 00:03:12,870
G of T to update theta.

39
00:03:13,100 --> 00:03:15,860
We now subtract them of T times the learning rate.

40
00:03:15,860 --> 00:03:16,450
Ayda.

41
00:03:17,180 --> 00:03:21,740
Now, since all these new symbols are just constants, you can show that the two methods of presenting

42
00:03:21,740 --> 00:03:23,570
momentum are equivalent.

43
00:03:28,240 --> 00:03:33,220
Let's also review arms prop arms prop also proceeds in two steps.

44
00:03:33,940 --> 00:03:36,950
First, we update our cash, which is now renamed V.

45
00:03:37,900 --> 00:03:41,370
Notice how similar this is to our modified version of momentum.

46
00:03:42,190 --> 00:03:47,500
But instead of being the weighted some of the gradients, it's the weight, some of the gradient squared.

47
00:03:48,370 --> 00:03:54,130
The second step is to update the parameter theta as usual, except instead of having the same learning

48
00:03:54,130 --> 00:03:57,550
rate for all time, we divide by the square root of the cash.

49
00:04:02,180 --> 00:04:07,130
OK, so for the next phase of this lecture, we're going to take a small digression and talk about moving

50
00:04:07,130 --> 00:04:07,880
averages.

51
00:04:08,360 --> 00:04:13,940
By doing this, we'll begin to make sense of why momentum in arms Propp take the forms that they do

52
00:04:14,900 --> 00:04:16,340
to understand the moving average.

53
00:04:16,350 --> 00:04:19,290
A nice place to begin is just the regular average.

54
00:04:19,910 --> 00:04:26,810
Suppose I have a set of data points X one X two all the way up to the average of these data points is

55
00:04:26,810 --> 00:04:28,160
exactly what you expect.

56
00:04:28,670 --> 00:04:33,040
Sum up all the X's and divide by the total number of data points, which is T.

57
00:04:33,770 --> 00:04:35,660
OK, so pretty simple so far.

58
00:04:40,460 --> 00:04:46,150
The next step is to consider what if I have so much data that I cannot fit it all into memory at once?

59
00:04:46,730 --> 00:04:50,350
For example, the data lives in a file that's one terabyte in size.

60
00:04:50,940 --> 00:04:55,430
Unfortunately, your computer is not capable of storing one terabyte of memory.

61
00:04:56,510 --> 00:05:01,130
Or consider another scenario where you have a robot interacting with the world.

62
00:05:02,000 --> 00:05:06,880
In this case, each data point arrives at the robot's sensors at each instant in time.

63
00:05:07,610 --> 00:05:13,670
I can't see any of the future data points, so my average can only consist of the past data points and

64
00:05:13,670 --> 00:05:14,780
the current data point.

65
00:05:15,290 --> 00:05:20,060
If I want the average of all the data points I have seen so far, let's assume that the current time

66
00:05:20,060 --> 00:05:25,600
is t then sure I can add them all up and divide by T, but that seems wasteful.

67
00:05:26,240 --> 00:05:28,460
You can see that as T increases.

68
00:05:28,460 --> 00:05:32,300
The number of things I have to add together also increases.

69
00:05:33,230 --> 00:05:38,900
Therefore, on each step my robot will have to do more and more computation in order to compute this

70
00:05:38,900 --> 00:05:39,530
average.

71
00:05:40,250 --> 00:05:46,910
We say that this computation is of T because the number of steps to perform the computation is proportional

72
00:05:46,910 --> 00:05:47,520
to T.

73
00:05:48,530 --> 00:05:54,710
My claim is that if we have a robot streaming in data one point at a time, we can actually make this

74
00:05:54,710 --> 00:05:56,160
computation of one.

75
00:05:56,960 --> 00:06:03,620
In other words, our computation is constant no matter how much data we've collected specifically that

76
00:06:03,620 --> 00:06:09,230
we can compute the current average by somehow making use of the previous computations we've done, and

77
00:06:09,230 --> 00:06:11,780
even more specifically, the previous average.

78
00:06:16,470 --> 00:06:22,590
To understand how we can do this, let's manipulate the equation for the sample mean, firstly, notice

79
00:06:22,590 --> 00:06:25,800
that our notation for the mean time T is exposed.

80
00:06:25,800 --> 00:06:32,880
Subscript T, let's begin by splitting our sum by taking out only the final data point X of T.

81
00:06:34,020 --> 00:06:37,110
The next step is to multiply out the one over T term.

82
00:06:38,870 --> 00:06:45,980
The next step is to replace the sum from little T equals one up to Big T minus one specifically, we

83
00:06:45,980 --> 00:06:49,220
can replace this with the sample mean at time T minus one.

84
00:06:50,360 --> 00:06:56,480
As you recall, the sample mean a time at T minus one is just the sum divided by T minus one.

85
00:06:57,500 --> 00:06:59,120
OK, so now what do we have?

86
00:06:59,600 --> 00:07:01,900
We have exactly what we discussed previously.

87
00:07:02,720 --> 00:07:08,750
We have an equation that allows us to compute the current average X party in constant time.

88
00:07:09,560 --> 00:07:16,160
As you can see, it only involves something to terms the previous average X bar T minus one and the

89
00:07:16,160 --> 00:07:17,570
current sample X of T.

90
00:07:22,420 --> 00:07:28,510
The next step is to manipulate our expression even more specifically, let's take the phrase constant

91
00:07:28,510 --> 00:07:34,800
T minus one over T and divide everything by T, then we get one minus one over T.

92
00:07:35,440 --> 00:07:38,830
This is helpful because now that one over T term appears twice.

93
00:07:43,440 --> 00:07:48,960
The next step is to ask the question, what if we replace one of our teeth with a constant let's call

94
00:07:48,960 --> 00:07:55,080
it Alpha in this case, we are changing the answer, but we are doing so in a way that might be helpful

95
00:07:55,080 --> 00:08:01,470
later on when this was equal to one over T, we saw that this gave us the regular sample mean.

96
00:08:02,460 --> 00:08:07,170
As you can see, the function one of our T decreases as T increases.

97
00:08:07,860 --> 00:08:14,190
However, it turns out that this will weigh all of the X's just right so that every X we have considered

98
00:08:14,190 --> 00:08:21,420
so far is weighted by one over T note that we have just proven this and as you know, it leaves exactly

99
00:08:21,420 --> 00:08:22,440
to the sample mean.

100
00:08:23,340 --> 00:08:29,070
Importantly, this means that if we replace one over T with a constant, we will no longer have the

101
00:08:29,070 --> 00:08:29,900
sample mean.

102
00:08:30,660 --> 00:08:38,280
In fact, when we have a constant, this gives us the exponentially weighted moving average as an exercise.

103
00:08:38,430 --> 00:08:43,980
You might want to try manipulating the equation for constant alpha and put it back into a summation

104
00:08:43,980 --> 00:08:48,290
so that it depends only on the X's and not any previous averages.

105
00:08:48,840 --> 00:08:54,350
What you should see is that the weights decay exponentially as you go further and further back in time.

106
00:08:55,290 --> 00:08:58,520
This is opposed to the regular average where all the weights are equal.

107
00:08:59,400 --> 00:09:04,710
Basically, this means that for the exponentially weighted moving average data points, which are more

108
00:09:04,710 --> 00:09:06,690
recent or more highly weighted.

109
00:09:11,420 --> 00:09:17,540
Now, just by convention, we can replace Alpha with the new symbol, specifically, let's introduce

110
00:09:17,540 --> 00:09:19,820
Beta, which is equal to one minus Alpha.

111
00:09:20,510 --> 00:09:21,860
We'll call this the decay.

112
00:09:22,580 --> 00:09:24,220
Note that this doesn't change anything.

113
00:09:24,620 --> 00:09:29,310
Only the symbols are different, but the functionality of the equation is still exactly the same.

114
00:09:30,170 --> 00:09:35,840
At this point, you should recognize that this is exactly the same as the cash update for arms probe.

115
00:09:36,200 --> 00:09:39,140
It is also the same as our modified momentum update.

116
00:09:43,960 --> 00:09:48,170
So an arms prop, where do we use this exponentially weighted moving average?

117
00:09:48,730 --> 00:09:52,860
In fact, we use it on the squared radians, so it's worth considering.

118
00:09:53,050 --> 00:09:57,730
What is this the average of what are we estimating now trivially?

119
00:09:57,760 --> 00:10:00,110
We are estimating the mean of the gradient squared.

120
00:10:00,700 --> 00:10:05,170
So an arms probe, because we eventually end up using the square root of the cache.

121
00:10:05,470 --> 00:10:07,780
Now, the name RMX prop makes sense.

122
00:10:08,350 --> 00:10:10,920
Our math stands for Root Mean Square.

123
00:10:11,560 --> 00:10:16,700
Specifically, we are estimating the square root of the mean of the square of the gradient.

124
00:10:17,230 --> 00:10:19,180
However, we can go further than this.

125
00:10:23,870 --> 00:10:25,170
What is the sample mean?

126
00:10:26,180 --> 00:10:31,490
You may recall from statistics that the sample mean is used to estimate expected values.

127
00:10:32,150 --> 00:10:35,780
Now, what is the expected value of the square of a random variable?

128
00:10:36,500 --> 00:10:38,570
Well, it is related to the variance.

129
00:10:39,200 --> 00:10:45,440
Specifically, the variance is equal to the expected value of the random variable squared, minus the

130
00:10:45,440 --> 00:10:46,550
true means squared.

131
00:10:47,330 --> 00:10:52,780
However, when we just have the expected value of the square, we often call this the second moment

132
00:10:53,960 --> 00:10:58,760
in general, the expected value of X to the power N is called the enth moment.

133
00:11:03,480 --> 00:11:09,210
In statistics, it's often the case that what we care about are the first and second moments, there

134
00:11:09,210 --> 00:11:13,860
are some practical and theoretical reasons behind this, but they are outside the scope of this lecture.

135
00:11:14,490 --> 00:11:20,400
Intuitively, we know that what we often care about when it comes to something random is the mean in

136
00:11:20,400 --> 00:11:21,250
the variance.

137
00:11:21,750 --> 00:11:24,900
We know that the second moment is something like a variance.

138
00:11:25,560 --> 00:11:28,380
Similarly, the first moment is simply the mean.

139
00:11:29,040 --> 00:11:34,890
That is the expected value of X to the one or simply the expected value of X is the mean.

140
00:11:39,690 --> 00:11:46,110
At this point, we can recognize how this relates to momentum and arms, prop momentum essentially estimates

141
00:11:46,110 --> 00:11:50,200
the first moment of the gradient using an exponential moving average.

142
00:11:51,000 --> 00:11:57,180
Similarly, arms prop essentially estimates the second moment of the gradient using an exponential moving

143
00:11:57,180 --> 00:11:57,820
average.

144
00:11:59,040 --> 00:12:03,870
Now, since we might want to turn these two value separately, we will give each of them a possibly

145
00:12:03,870 --> 00:12:05,180
different value of data.

146
00:12:05,880 --> 00:12:10,360
We'll call the beta for the first moment beta one and we'll call the beta for the second moment.

147
00:12:10,380 --> 00:12:15,150
Beta to the last step is to essentially put these two ideas together.

148
00:12:15,300 --> 00:12:16,550
And that gives us Adam.

149
00:12:21,240 --> 00:12:26,310
In order to make sense of how to put these two ideas together, let's remind ourselves what momentum

150
00:12:26,310 --> 00:12:31,040
in arms prop look like when they are apart from momentum.

151
00:12:31,050 --> 00:12:35,850
We subtract the learning rate and multiply by four arms prob.

152
00:12:35,850 --> 00:12:41,820
We subtract the learning rate multiplied by the gradient and divide by the square root of V plus some

153
00:12:41,820 --> 00:12:42,630
tiny value.

154
00:12:43,590 --> 00:12:46,730
Thus for Atom we simply combine these two together.

155
00:12:47,280 --> 00:12:53,730
That is to say, we subtract the learning rate, multiply it by M and divide by the square root of V,

156
00:12:54,000 --> 00:12:57,150
plus the usual tiny value to avoid dividing by zero.

157
00:12:58,980 --> 00:13:04,020
Now, this is the essence of the Adam optimizer, but note that there is still one more step before

158
00:13:04,020 --> 00:13:09,780
we are done, since this lecture has been quite long so far, and this is a different topic, we will

159
00:13:09,780 --> 00:13:11,150
save it for the next lecture.