1
00:00:11,110 --> 00:00:17,560
In this lecture, we'll learn about one more step to finalize our Adam optimizer equations. To understand

2
00:00:17,560 --> 00:00:22,660
why we need to do this, we have to understand what the exponentially weighted moving average can be

3
00:00:22,660 --> 00:00:23,320
used for.

4
00:00:24,160 --> 00:00:27,970
One way to look at it is that it's just another way of computing the average.

5
00:00:28,510 --> 00:00:31,840
Specifically, it's useful when our data is non-stationary.

6
00:00:32,380 --> 00:00:35,680
That is, the true expected value might be changing over time.

7
00:00:36,430 --> 00:00:40,190
Therefore, more recent values are more useful than older values.

8
00:00:41,050 --> 00:00:44,370
Another way to look at the exponential moving average is this.

9
00:00:45,130 --> 00:00:48,550
Suppose that you have some time series signal, such as a stock price.

10
00:00:49,090 --> 00:00:54,350
Clearly, this is a non-stationary signal, so it's a good candidate for using this kind of average.

11
00:00:55,060 --> 00:00:59,260
Well, what happens if we apply our moving average to this time series signal?

12
00:00:59,800 --> 00:01:04,510
Let's call the input time series X of T, and let's call the output time series Y of T.

13
00:01:09,190 --> 00:01:15,160
Well, what you get is a smoothed version of the input signal. That is, it generally follows

14
00:01:15,160 --> 00:01:20,110
the same trend, but in a much smoother manner without lots of random jumps up and down.

15
00:01:20,830 --> 00:01:25,780
This is useful in finance, for example, when you want to know the trend of a stock price, but you

16
00:01:25,780 --> 00:01:27,850
don't care about minor fluctuations.

17
00:01:28,870 --> 00:01:32,870
If you have experience in signal processing, you might know this by another name.

18
00:01:33,580 --> 00:01:35,450
This is called a low-pass filter.

19
00:01:36,190 --> 00:01:42,820
The reasoning behind that is this: slow fluctuations have low frequencies, whereas fast fluctuations

20
00:01:42,820 --> 00:01:44,090
have high frequencies.

21
00:01:44,770 --> 00:01:50,080
Generally speaking, the low frequency movements are what we actually care about because they're usually

22
00:01:50,080 --> 00:01:51,340
larger in magnitude.

23
00:01:51,880 --> 00:01:56,470
The high frequency movements are usually very tiny and tend not to be that significant.

24
00:01:57,160 --> 00:02:03,730
So by using a low-pass filter, we are essentially passing our input signal into an algorithm that removes

25
00:02:03,730 --> 00:02:06,130
the high frequencies we do not care about.

26
00:02:10,980 --> 00:02:13,600
There is one small problem with our low-pass filter.

27
00:02:14,460 --> 00:02:19,850
Note that our output Y of T always depends on the previous value, Y of T minus one.

28
00:02:20,580 --> 00:02:25,530
But if we are trying to generate the first output Y of one, then what is Y of zero?

29
00:02:26,160 --> 00:02:29,170
A convention is to simply set this value to zero.

30
00:02:29,790 --> 00:02:31,410
But what happens when we do this?

31
00:02:36,130 --> 00:02:41,260
The answer is that at the beginning of our output, all the values are biased toward zero.

32
00:02:41,860 --> 00:02:47,140
Now, if you've never implemented this before, or you've never taken one of my courses where you got an

33
00:02:47,140 --> 00:02:52,480
opportunity to implement this, it would be very useful to try and implement it yourself, to see what

34
00:02:52,480 --> 00:02:54,400
I'm talking about firsthand.

35
00:02:55,060 --> 00:03:00,010
It should be a pretty simple exercise and you can try this on any time series you choose, such as the

36
00:03:00,010 --> 00:03:01,420
price of your favorite stock.
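
As a starting point for that exercise, here is a minimal sketch of the filter in plain Python. The function name `ema`, the synthetic `prices` series, and the choice of beta equal to 0.99 are assumptions made for illustration; they are not taken from the lecture's own code.

```python
import random


def ema(xs, beta=0.99, y0=0.0):
    """Exponentially weighted moving average: y_t = beta * y_{t-1} + (1 - beta) * x_t."""
    ys = []
    y = y0  # the convention from the lecture: start the recursion at zero
    for x in xs:
        y = beta * y + (1.0 - beta) * x
        ys.append(y)
    return ys


# Hypothetical "price" series: a level of 100 plus small random noise.
random.seed(0)
prices = [100.0 + random.uniform(-1.0, 1.0) for _ in range(500)]
smoothed = ema(prices, beta=0.99)

print(smoothed[0])   # about 1% of the first price: heavily biased toward zero
print(smoothed[-1])  # much closer to 100 once the filter has caught up
```

Plotting `prices` and `smoothed` together should show both effects discussed here: the output is smoother than the input, and its first values sit far below the true level.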

37
00:03:02,290 --> 00:03:06,010
Now, what does it mean that the beginning values are biased toward zero?

38
00:03:06,820 --> 00:03:12,280
Well, this is clearly not good because it takes our filter some time to catch up and to start producing

39
00:03:12,280 --> 00:03:15,880
useful outputs. As a side note, in other fields,

40
00:03:16,030 --> 00:03:21,490
there are simple solutions to this, such as simply setting Y of one to be equal to X of one instead

41
00:03:21,490 --> 00:03:22,070
of zero.

42
00:03:22,810 --> 00:03:26,650
However, in deep learning we employ a method known as bias correction.

43
00:03:31,370 --> 00:03:37,970
So what is bias correction? Bias correction means that instead of using what we would usually use, Y

44
00:03:37,970 --> 00:03:46,160
of T, we adjust this value by some factor to get Y hat of T. Specifically, we divide by one minus beta

45
00:03:46,160 --> 00:03:47,150
to the power T.

46
00:03:48,080 --> 00:03:54,350
One way to see that this works is by noting that since beta is a number less than one, beta to the power

47
00:03:54,350 --> 00:03:57,470
T will approach zero as T gets large.

48
00:03:58,130 --> 00:04:04,730
Therefore, when T is large, we essentially divide by a number close to one, and Y hat of T is very close

49
00:04:04,730 --> 00:04:06,530
to Y of T, which makes sense.
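
Written out in symbols, the filter and its correction look like this. The notation is an assumption chosen to match the spoken names: x_t is the input, y_t the filter output, beta the decay rate, and y_0 = 0 is the convention mentioned earlier.

```latex
\begin{align*}
y_t &= \beta\, y_{t-1} + (1 - \beta)\, x_t, \qquad y_0 = 0 \\
\hat{y}_t &= \frac{y_t}{1 - \beta^t}
\end{align*}
```

Since beta is between zero and one, beta to the power t shrinks toward zero, so the denominator tends to one and the corrected value tends to the uncorrected one.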

50
00:04:07,190 --> 00:04:11,120
This is because bias correction is only really needed when T is small.

51
00:04:11,810 --> 00:04:13,700
So what happens when T is small?

52
00:04:18,630 --> 00:04:23,790
The simplest way to see this is to plug some numbers in. Let's suppose that beta is equal to zero

53
00:04:23,790 --> 00:04:24,790
point nine nine.

54
00:04:25,500 --> 00:04:27,750
Then Y of one is equal to beta times

55
00:04:27,750 --> 00:04:35,490
Y of zero plus one minus beta times X of one. Since Y of zero is zero, this is just zero point zero

56
00:04:35,490 --> 00:04:37,080
one times X of one.

57
00:04:37,920 --> 00:04:43,440
Clearly this is not good since we are only taking one percent of the true input X of one.

58
00:04:44,430 --> 00:04:46,830
However, what do we get for Y hat of one?

59
00:04:47,550 --> 00:04:51,930
Well, we take Y of one and divide it by one minus beta to the power one.

60
00:04:52,620 --> 00:04:59,010
Of course that's just one minus beta, but one minus beta is just zero point zero one as we have just

61
00:04:59,010 --> 00:04:59,490
seen.

62
00:05:00,600 --> 00:05:06,840
Therefore, Y hat of one is just zero point zero one times X of one divided by zero point zero one,

63
00:05:07,110 --> 00:05:09,000
which is just X of one itself.

64
00:05:09,750 --> 00:05:12,540
Thus, Y hat of one is no longer biased.

65
00:05:12,810 --> 00:05:14,850
It's exactly equal to X of one.

66
00:05:16,020 --> 00:05:17,760
How about when T is equal to two?

67
00:05:18,450 --> 00:05:21,390
In this case, Y of two is equal to beta times

68
00:05:21,390 --> 00:05:28,110
Y of one plus one minus beta times X of two, which is equal to zero point nine nine times Y of one

69
00:05:28,350 --> 00:05:30,750
plus zero point zero one times X of two.

70
00:05:32,420 --> 00:05:37,670
However, we know that Y of one is equal to zero point zero one times X of one, so we can

71
00:05:37,670 --> 00:05:38,420
plug that in.

72
00:05:39,440 --> 00:05:45,500
We then get Y of two is equal to zero point zero zero nine nine times X of one plus zero point zero

73
00:05:45,500 --> 00:05:47,010
one times X of two.

74
00:05:47,900 --> 00:05:53,080
So now we basically have about one percent of X of one and about one percent of X of two.

75
00:05:53,660 --> 00:05:55,640
Still not a good estimate for Y of two.

76
00:05:56,600 --> 00:05:58,790
But what happens when we do bias correction?

77
00:06:00,750 --> 00:06:04,380
We take Y of two and divide that by one minus beta squared.

78
00:06:05,160 --> 00:06:09,390
Now the numbers get a little messy, so you'll have to do the calculations on your own at home.

79
00:06:09,930 --> 00:06:11,850
But what you should end up with is this.

80
00:06:12,390 --> 00:06:18,720
You should get Y hat of two is equal to zero point four nine seven times X of one plus zero point five zero

81
00:06:18,720 --> 00:06:20,410
three times X of two.

82
00:06:21,060 --> 00:06:22,380
So does this make sense?

83
00:06:22,830 --> 00:06:24,370
This makes perfect sense.

84
00:06:24,720 --> 00:06:28,250
We take about half of X of one and about half of X of two.

85
00:06:28,830 --> 00:06:33,690
We take a little more from X of two since X of two is more recent than X of one.

86
00:06:35,010 --> 00:06:40,230
Furthermore, notice how these adjusted weights always add up to one, which makes a lot of sense.
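
The numbers quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the variable names are made up for illustration, and beta is fixed at 0.99 as in the worked example.

```python
beta = 0.99

# Weights that y_1 and y_2 place on the inputs, starting from y_0 = 0.
w1_t1 = 1 - beta           # y_1 = 0.01 * x_1
w1_t2 = beta * (1 - beta)  # y_2 = 0.0099 * x_1 + ...
w2_t2 = 1 - beta           #       ... + 0.01 * x_2

# Bias correction divides by 1 - beta**t.
c1 = 1 - beta ** 1
c2 = 1 - beta ** 2

print(w1_t1 / c1)                                  # weight on x_1 in y_hat_1
print(round(w1_t2 / c2, 3), round(w2_t2 / c2, 3))  # weights on x_1 and x_2 in y_hat_2
print(w1_t2 / c2 + w2_t2 / c2)                     # the corrected weights sum to about one
```

The corrected weights at T equal to two come out near 0.497 and 0.503, and their sum is one up to floating-point rounding.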

87
00:06:44,830 --> 00:06:48,390
OK, so how can we incorporate this into our Adam optimization?

88
00:06:49,120 --> 00:06:54,130
Well, we know that we are doing two exponential moving averages, one for M and one for V.

89
00:06:54,820 --> 00:07:00,580
Therefore, we simply replace what we had before with these bias-corrected versions of M and V.

90
00:07:01,600 --> 00:07:06,880
So now, instead of subtracting the learning rate, multiplying by M and dividing by the square root

91
00:07:06,880 --> 00:07:11,380
of V, we simply replace M and V with their bias-corrected versions,

92
00:07:11,410 --> 00:07:12,700
M hat and V hat.

93
00:07:17,530 --> 00:07:24,160
In total, our final algorithm is this. Before we start the training loop, we initialize M zero to zero,

94
00:07:24,670 --> 00:07:27,910
V zero to zero, and the time index T to zero.

95
00:07:28,960 --> 00:07:32,110
Then we enter our training loop. Inside the loop,

96
00:07:32,110 --> 00:07:34,080
we increment the time index by one.

97
00:07:35,050 --> 00:07:41,910
Next, we find M of T using the usual recursive equation with the hyperparameter decay rate beta one.

98
00:07:42,760 --> 00:07:46,050
Next, we do the same thing for V of T using beta two.

99
00:07:47,260 --> 00:07:51,850
Next, we perform bias correction to get M hat of T and V hat of T.

100
00:07:52,810 --> 00:07:56,710
Finally, we perform our gradient update using the usual equation.

101
00:07:58,180 --> 00:08:01,510
Note again that this is for some generic parameter vector theta.

102
00:08:02,710 --> 00:08:06,570
In practice this represents a collection of all the parameters of your model.

103
00:08:07,630 --> 00:08:14,020
So you will need M's and V's of the same size as all of your parameter vectors, matrices, and tensors.
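
The steps just listed can be summarized in symbols as follows. The symbols g_t for the gradient, eta for the learning rate, and epsilon for the small stabilizing constant are assumptions here, and placing epsilon outside the square root follows the convention of the original Adam paper rather than anything stated in this lecture.

```latex
\begin{align*}
& m_0 = 0, \quad v_0 = 0, \quad t = 0 \\
& \text{repeat:} \\
& \quad t \leftarrow t + 1 \\
& \quad m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
& \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
& \quad \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
& \quad \theta_t = \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{align*}
```

All operations are elementwise, which is why m and v must have the same shape as theta.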

104
00:08:18,510 --> 00:08:24,130
The final thing I want to discuss in this lecture is what values do we choose for these hyperparameters?

105
00:08:24,750 --> 00:08:28,510
Previously, I told you that Adam optimization is very robust.

106
00:08:29,040 --> 00:08:34,370
That is, the same values work for a wide range of problems. With practice and experience,

107
00:08:34,380 --> 00:08:38,570
you may find that this is the case. With TensorFlow and PyTorch,

108
00:08:38,610 --> 00:08:43,830
note that these default values are built into the optimizers, so you rarely have to think about them.

109
00:08:44,460 --> 00:08:50,280
However, if you do ever end up implementing Adam from scratch, here are the suggested default values.

110
00:08:51,540 --> 00:08:52,410
For the learning rate,

111
00:08:52,560 --> 00:08:55,380
a typical default value is ten to the minus three.

112
00:08:56,130 --> 00:08:58,610
For beta one, we use zero point nine.

113
00:08:59,250 --> 00:09:01,620
As you recall, this is a typical value

114
00:09:01,620 --> 00:09:06,890
when we are using momentum. For beta two, we use zero point nine nine nine.

115
00:09:07,440 --> 00:09:09,690
As you recall, this is a typical value

116
00:09:09,690 --> 00:09:15,900
when we are using RMSprop. For epsilon, a typical default value is ten to the minus eight.
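
For anyone who does implement Adam from scratch, here is a minimal sketch in plain Python that uses the default values just listed. The function name `adam_minimize`, the `grad_fn` callback, and the toy quadratic problem are all hypothetical, and epsilon is added outside the square root as an assumption. It is an illustration only, not the TensorFlow or PyTorch implementation.

```python
import math


def adam_minimize(grad_fn, theta, steps, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimal Adam loop over a flat list of parameters.

    grad_fn(theta) must return a list of gradients with the same length as theta.
    """
    theta = list(theta)
    m = [0.0] * len(theta)  # first-moment estimate, initialized to zero
    v = [0.0] * len(theta)  # second-moment estimate, initialized to zero
    t = 0
    for _ in range(steps):
        t += 1
        g = grad_fn(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction for m
            v_hat = v[i] / (1 - beta2 ** t)  # bias correction for v
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Hypothetical toy problem: minimize f(theta) = sum((theta_i - 3)^2).
def grad(theta):
    return [2.0 * (p - 3.0) for p in theta]


start = [0.0, 5.0]
result = adam_minimize(grad, start, steps=1000)
print(result)  # each entry should have moved toward 3.0
```

With the small default learning rate, a thousand steps only move each parameter by roughly one unit, so the result is closer to 3 than the starting point but has not converged yet. Raising `lr` or `steps` is an easy experiment to run.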

117
00:09:20,620 --> 00:09:26,110
As a final thought to conclude this lecture, I want to mention that while Adam is a common default

118
00:09:26,110 --> 00:09:30,500
choice, you will still encounter scenarios where it is not the best choice.

119
00:09:30,970 --> 00:09:37,180
In fact, sometime after many people had adopted Adam as their go-to default choice of optimizer, a

120
00:09:37,180 --> 00:09:42,790
paper was written that showed that regular stochastic gradient descent with momentum can perform better than

121
00:09:42,790 --> 00:09:43,180
Adam.

122
00:09:44,260 --> 00:09:49,090
This paper is called The Marginal Value of Adaptive Gradient Methods in Machine Learning.

123
00:09:50,260 --> 00:09:55,960
Personally, before the invention of Adam and even after the invention of Adam, SGD with momentum

124
00:09:56,140 --> 00:09:58,060
was often my own default choice.

125
00:09:58,660 --> 00:10:01,670
This reminds me of a rule that I often repeat to my students.

126
00:10:02,080 --> 00:10:06,420
This rule is that machine learning is experimentation, not philosophy.

127
00:10:07,090 --> 00:10:12,280
Oftentimes, students just want to be told one way of doing things so that they have fewer choices to

128
00:10:12,280 --> 00:10:15,120
make by themselves, which I suppose is more comfortable.

129
00:10:15,820 --> 00:10:22,330
But the reality is, unless you try it yourself, you will never know what the real answer is. To put

130
00:10:22,330 --> 00:10:23,260
this in a different way,

131
00:10:23,440 --> 00:10:24,820
ask yourself this question.

132
00:10:25,360 --> 00:10:30,910
What should you do when you want to know the output of a computer program, specifically a computer

133
00:10:30,910 --> 00:10:35,390
program where you train your model on some data set using a variety of optimizers?

134
00:10:36,310 --> 00:10:42,070
Well, it would be silly and inefficient to ask me or anyone else what the output of the program will

135
00:10:42,070 --> 00:10:42,520
be.

136
00:10:43,530 --> 00:10:50,220
Similarly, it would be incorrect to try and guess what the output of the program will be. Instead,

137
00:10:50,220 --> 00:10:52,690
if you want to know the output of a computer program,

138
00:10:53,010 --> 00:10:59,730
let me tell you an ancient secret that only true gurus know. The answer is to simply run the computer

139
00:10:59,730 --> 00:11:01,850
program and look at the output yourself.

140
00:11:02,640 --> 00:11:06,790
As always, I am glad to be the source of many surprising and unusual facts.

141
00:11:07,170 --> 00:11:09,210
Thanks for listening and I'll see you in the next lecture.