1
00:00:11,050 --> 00:00:16,480
In this lecture, we are going to discuss another kind of moving average, the exponentially weighted

2
00:00:16,480 --> 00:00:17,390
moving average.

3
00:00:17,950 --> 00:00:22,490
Note that some other names for this are exponential smoothing and the low pass filter.

4
00:00:22,900 --> 00:00:27,970
So if you've taken some of my other courses before and you've heard me use those terms, recognize that

5
00:00:27,970 --> 00:00:33,850
this is the same thing. In fact, this kind of moving average is very applicable in many areas

6
00:00:33,850 --> 00:00:37,590
of machine learning, statistics, finance, and signal processing.

7
00:00:37,840 --> 00:00:39,880
So you will generally see it pretty often.

8
00:00:40,510 --> 00:00:43,360
So what is the exponentially weighted moving average?

9
00:00:48,060 --> 00:00:52,500
I want to break this lecture up into two parts. The first part is the short summary.

10
00:00:52,920 --> 00:00:56,590
If you want to only watch this part and then skip to the code, that's fine.

11
00:00:57,120 --> 00:01:02,490
The second part of this lecture will be an optional in-depth discussion about why the exponentially

12
00:01:02,490 --> 00:01:04,470
weighted moving average has its name.

13
00:01:04,890 --> 00:01:10,010
You can opt to watch this if you want to get a better understanding of why and how this works.

14
00:01:14,500 --> 00:01:20,830
OK, so what's the short summary? As you know, the arithmetic mean can be calculated by taking all

15
00:01:20,830 --> 00:01:26,620
of your samples, summing them together, and then dividing by the number of samples. The exponentially

16
00:01:26,620 --> 00:01:28,930
weighted moving average is calculated differently.

17
00:01:29,410 --> 00:01:33,480
In fact, it's calculated kind of on the fly or in an online manner.

18
00:01:34,000 --> 00:01:39,760
It says that the moving average at time T is equal to some constant alpha times

19
00:01:39,760 --> 00:01:46,090
the sample at time T, plus one minus alpha times the previous moving average at time T minus one.

20
00:01:46,750 --> 00:01:52,540
In other words, at each step, the new moving average is the weighted sum of the new sample and the

21
00:01:52,540 --> 00:01:53,800
old moving average.
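As a rough sketch of this update in plain Python: the function name `ewma` and the choice to seed the average with the first sample are assumptions made for illustration, not something stated in the lecture.

```python
def ewma(samples, alpha):
    """Return the exponentially weighted moving average after each sample."""
    averages = []
    avg = None
    for x in samples:
        if avg is None:
            # Assumption: start the average at the first sample.
            avg = x
        else:
            # New average = alpha * new sample + (1 - alpha) * old average.
            avg = alpha * x + (1 - alpha) * avg
        averages.append(avg)
    return averages


result = ewma([1.0, 2.0, 3.0], alpha=0.5)
print(result)  # [1.0, 1.5, 2.25]
```

Each step only needs the previous average and the new sample, which is what makes the calculation online.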

22
00:01:54,400 --> 00:01:54,750
All right.

23
00:01:54,760 --> 00:01:55,740
So that's pretty much it.

24
00:01:55,840 --> 00:01:58,210
It's not a terribly complicated calculation.

25
00:01:58,750 --> 00:02:03,940
Of course, without further analysis, it's not clear why this is an average and it's not clear why

26
00:02:03,940 --> 00:02:05,260
it's exponentially weighted.

27
00:02:10,100 --> 00:02:16,430
The next part of our short summary is this: how do we do it in code? Similar to the simple moving average,

28
00:02:16,430 --> 00:02:20,540
we call a function on our series or our data frame called ewm.

29
00:02:20,990 --> 00:02:26,220
This returns an ewm object, which is similar to the rolling objects we saw previously.

30
00:02:26,780 --> 00:02:31,450
It has a similar set of functions, such as mean, variance, covariance, and so forth.
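A minimal sketch of that call, assuming pandas is installed. The series values and variable names are made up for illustration. Passing `adjust=False` asks pandas for the plain recursive update from the summary, seeded with the first sample. The pandas default, `adjust=True`, uses a normalized weighting instead.

```python
import pandas as pd

# Hypothetical data, only for illustration.
series = pd.Series([1.0, 2.0, 3.0, 4.0])

# ewm() returns a windowing object, much like rolling() does.
ewm_obj = series.ewm(alpha=0.5, adjust=False)

smoothed = ewm_obj.mean()   # exponentially weighted moving average
spread = ewm_obj.var()      # exponentially weighted variance

print(smoothed.tolist())
```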

31
00:02:36,200 --> 00:02:42,440
To discuss a practical issue: what value of Alpha should we choose? Alpha is something like a decay

32
00:02:42,440 --> 00:02:43,040
factor.

33
00:02:43,490 --> 00:02:48,800
Typically, Alpha is chosen to be a small value between zero and one, like zero point one or zero point

34
00:02:48,800 --> 00:02:52,290
two. It might help to look at some extreme cases.

35
00:02:52,580 --> 00:02:54,590
So let's say we choose Alpha equals one.

36
00:02:55,190 --> 00:02:58,880
That means set the average to be just the latest value of X.

37
00:02:59,300 --> 00:03:04,150
In this case, all we're doing is copying X and therefore it's not really an average at all.

38
00:03:04,820 --> 00:03:09,950
On the other hand, let's say we set Alpha equal to zero, then all we're doing is copying the previous

39
00:03:09,950 --> 00:03:14,720
average and we're not taking into account any new samples. Intuitively,

40
00:03:14,720 --> 00:03:21,410
then, if we set Alpha very close to one, that says new samples matter much more and the old average matters

41
00:03:21,410 --> 00:03:27,020
much less. You can imagine this will lead to a much noisier time series, which will more closely

42
00:03:27,020 --> 00:03:33,950
match the original. If we set Alpha very close to zero, that says new samples matter much less and the

43
00:03:33,950 --> 00:03:35,600
old average carries much more weight.

44
00:03:36,170 --> 00:03:41,450
In this situation, you'll get a much smoother time series and it will take a much more drastic change

45
00:03:41,450 --> 00:03:43,640
in X to affect the moving average.
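To make the effect of Alpha concrete, here is a small sketch that runs the same update on a made-up step signal with several values of Alpha. The helper `ewma`, the signal, and seeding with the first sample are all assumptions for illustration.

```python
def ewma(samples, alpha):
    """Recursive update, seeded with the first sample (an assumption)."""
    averages = []
    avg = samples[0]
    for x in samples:
        avg = alpha * x + (1 - alpha) * avg
        averages.append(avg)
    return averages


# Hypothetical signal: jumps from 0 to 10 at index 3.
step = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]

copy_x = ewma(step, alpha=1.0)    # Alpha = 1: just copies X
frozen = ewma(step, alpha=0.0)    # Alpha = 0: never moves from the start
fast = ewma(step, alpha=0.9)      # close to one: tracks X closely
slow = ewma(step, alpha=0.1)      # close to zero: smooth, slow to react

print(fast[3], slow[3])  # average right after the jump
```

Right after the jump, the Alpha = 0.9 average has already moved most of the way to 10, while the Alpha = 0.1 average has barely moved.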

46
00:03:48,390 --> 00:03:53,550
OK, so now that the short summary is complete, if you want to know the details behind the exponentially

47
00:03:53,550 --> 00:03:55,620
weighted moving average, keep listening.

48
00:03:56,490 --> 00:04:02,250
Let's suppose we want to calculate the usual arithmetic sample mean. Using the formula for the sample

49
00:04:02,250 --> 00:04:02,610
mean,

50
00:04:02,640 --> 00:04:04,710
you might suggest that this is quite obvious.

51
00:04:05,090 --> 00:04:09,960
Just take all the values of X that you've collected, add them all together and divide by the total

52
00:04:09,960 --> 00:04:11,610
number of X's that you have.

53
00:04:12,060 --> 00:04:14,140
The question is what's wrong with this?

54
00:04:14,760 --> 00:04:16,270
I'll give you a minute to think about it.

55
00:04:16,290 --> 00:04:19,500
So please pause the video until you think you have the answer.

56
00:04:24,530 --> 00:04:29,810
All right, so hopefully you thought about why calculating the sample mean naively might not be such

57
00:04:29,810 --> 00:04:30,710
a good idea.

58
00:04:31,370 --> 00:04:34,620
What if we have a lot or even an infinite amount of data?

59
00:04:35,300 --> 00:04:39,530
Obviously, our computers or our servers don't have an infinite amount of space.

60
00:04:39,980 --> 00:04:43,640
And even if they did, calculating a summation is O(T).

61
00:04:43,850 --> 00:04:49,430
So the more data you have, the longer it will take and that will increase linearly with how much data

62
00:04:49,430 --> 00:04:50,250
you've collected.

63
00:04:51,170 --> 00:04:52,130
Here's my claim.

64
00:04:52,670 --> 00:04:59,060
I claim that you can make the calculation of the sample mean O(1) on each step in both space and time

65
00:04:59,060 --> 00:05:06,090
complexity, no matter how much data you collect. Again, as an exercise, before moving on to the next slide,

66
00:05:06,290 --> 00:05:08,930
I want you to think about how this might be the case.

67
00:05:09,470 --> 00:05:12,380
Please pause the video if you want to take a moment and think.

68
00:05:17,150 --> 00:05:22,790
OK, so hopefully you thought about how you might calculate a sample mean using constant space and time.

69
00:05:23,570 --> 00:05:29,600
The key is that you can calculate a sample mean using the previous sample mean. Let's call the sample

70
00:05:29,600 --> 00:05:36,380
mean after collecting T samples X bar subscript T. This means that the sample mean after collecting T

71
00:05:36,380 --> 00:05:40,210
minus one samples is X bar subscript T minus one.

72
00:05:40,790 --> 00:05:44,840
We can write down the definition of both of these, which I hope is pretty obvious.

73
00:05:45,740 --> 00:05:49,280
Now that you know the trick, let's again make this an exercise.

74
00:05:49,670 --> 00:05:56,810
Can you express X bar T in terms of X bar T minus one? Please pause the video until you've tried this

75
00:05:56,810 --> 00:05:57,470
on your own.

76
00:06:02,460 --> 00:06:04,060
OK, so here's what you can do.

77
00:06:04,680 --> 00:06:10,450
First, you take X bar T and split up the summation so that you only sum up to T minus one.

78
00:06:10,890 --> 00:06:13,680
Then you leave X subscript T by itself.

79
00:06:14,010 --> 00:06:16,170
This is just the last sample you've collected.

80
00:06:17,580 --> 00:06:23,340
The next step is to realize that the sum of the X tau's from one up to T minus one can be expressed

81
00:06:23,340 --> 00:06:25,940
in terms of X bar subscript T minus one.

82
00:06:26,550 --> 00:06:28,890
We just have to rearrange the equation from earlier.

83
00:06:29,670 --> 00:06:34,380
It's clear that this sum is just T minus one times X bar T minus one.

84
00:06:35,930 --> 00:06:42,350
We can substitute this into our expression for X bar T to get the sample mean at time T in terms of the

85
00:06:42,350 --> 00:06:44,060
sample mean at time T minus one.

86
00:06:49,080 --> 00:06:53,400
One interesting thing you can do, although it's not totally clear why you'd want to do this at this

87
00:06:53,400 --> 00:06:56,220
time, is split up the formula as follows.

88
00:06:56,850 --> 00:07:04,080
The first step is to multiply out the one over T term. That gives us T minus one over T as the first coefficient

89
00:07:04,230 --> 00:07:06,560
and one over T as the second coefficient.

90
00:07:07,140 --> 00:07:12,840
The second step is to simplify T minus one over T to one minus one over T.

91
00:07:13,740 --> 00:07:17,290
At this point we can just leave this as is. This is the form that we want.

92
00:07:17,700 --> 00:07:22,700
We have one term with the previous sample mean and we have one term with the latest sample.
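Written out, the steps described above look roughly like this. The index symbol tau follows the spoken description, and the layout is a reconstruction, not a copy of the slide.

```latex
\begin{align*}
\bar{x}_t &= \frac{1}{t}\sum_{\tau=1}^{t} x_\tau
           = \frac{1}{t}\left(\sum_{\tau=1}^{t-1} x_\tau + x_t\right) \\
          &= \frac{1}{t}\Big((t-1)\,\bar{x}_{t-1} + x_t\Big)
           \qquad \text{since } \sum_{\tau=1}^{t-1} x_\tau = (t-1)\,\bar{x}_{t-1} \\
          &= \frac{t-1}{t}\,\bar{x}_{t-1} + \frac{1}{t}\,x_t \\
          &= \left(1 - \frac{1}{t}\right)\bar{x}_{t-1} + \frac{1}{t}\,x_t
\end{align*}
```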

93
00:07:23,490 --> 00:07:28,470
What's important to recognize about this equation is that we have discovered a way to calculate the

94
00:07:28,470 --> 00:07:33,600
sample mean that does not depend on carrying around all of the samples you've ever collected.

95
00:07:34,050 --> 00:07:39,450
All you need to have is the previous sample mean, the latest sample, and the number of samples you've

96
00:07:39,450 --> 00:07:40,200
seen in total.
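Here is a minimal sketch of that constant-space, constant-time update. The class name `RunningMean` and its attributes are hypothetical, chosen only for illustration.

```python
class RunningMean:
    """Keeps the sample mean using only the previous mean and a count."""

    def __init__(self):
        self.count = 0    # number of samples seen so far (T)
        self.mean = 0.0   # current sample mean (X bar T)

    def update(self, x):
        self.count += 1
        weight = 1.0 / self.count
        # X bar T = (1 - 1/T) * X bar (T-1) + (1/T) * X T
        self.mean = (1.0 - weight) * self.mean + weight * x
        return self.mean


tracker = RunningMean()
for value in [2.0, 4.0, 6.0, 8.0]:
    tracker.update(value)

print(tracker.mean)  # close to 5.0, the ordinary mean of the four values
```

No list of past samples is stored, so the memory used and the work per step stay the same however much data arrives.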

97
00:07:45,280 --> 00:07:51,250
The next question to consider is, what if we believe that recent data matters more than past data?

98
00:07:51,910 --> 00:07:55,590
If we look at our equation carefully, we see an interesting characteristic.

99
00:07:56,140 --> 00:08:00,490
Remember that as we collect more and more samples, the value of T is increasing.

100
00:08:01,120 --> 00:08:06,550
That means as we collect more and more samples, the weight that we give to the latest sample decreases.

101
00:08:07,120 --> 00:08:11,330
We can see that the weight that we give to the sample is exactly one over T.

102
00:08:12,250 --> 00:08:16,930
Now, although this might make you think that the influence of each sample somehow decays over time,

103
00:08:17,200 --> 00:08:21,790
remember that this is not true because this is still just a regular arithmetic mean.

104
00:08:26,670 --> 00:08:32,250
But what if we want recent data to matter more? What would happen if, instead of making the weight one

105
00:08:32,250 --> 00:08:35,250
over T, we simply make it a constant alpha?

106
00:08:35,820 --> 00:08:39,340
Well, then this is exactly the exponentially weighted moving average.

107
00:08:39,930 --> 00:08:45,810
The basic idea is instead of giving less and less weight to each new sample, we now give a constant

108
00:08:45,810 --> 00:08:47,150
weight to each new sample.

109
00:08:47,850 --> 00:08:51,000
Let's see how this affects the influence of each sample overall.

110
00:08:56,020 --> 00:09:01,450
The next question we want to answer is, how does this update actually implement an exponentially weighted

111
00:09:01,450 --> 00:09:02,260
moving average?

112
00:09:02,680 --> 00:09:04,120
Can we show that this is true?

113
00:09:04,780 --> 00:09:07,960
And in fact, it's not too difficult. At this point,

114
00:09:07,960 --> 00:09:10,810
what we can do is just keep recursively plugging in

115
00:09:10,810 --> 00:09:17,890
older and older values of the sample mean. So we can replace X bar T minus one with its representation

116
00:09:17,890 --> 00:09:20,080
in terms of X bar at T minus two.

117
00:09:21,760 --> 00:09:27,280
Then we can multiply out the one minus alpha term so that we get X bar at T minus two by itself.

118
00:09:27,760 --> 00:09:33,970
Now we have three terms X bar T minus two, the sample at T minus one and the sample at time T.

119
00:09:34,960 --> 00:09:40,990
The next step is, of course, to replace X bar T minus two with its representation in terms of X

120
00:09:40,990 --> 00:09:42,110
bar T minus three.

121
00:09:42,820 --> 00:09:48,310
From there we can do the same thing, multiply out the one minus alpha and get each of the terms by

122
00:09:48,310 --> 00:09:49,120
themselves.

123
00:09:49,690 --> 00:09:51,520
At this point you should see a pattern.

124
00:09:52,850 --> 00:09:58,370
The number of individual samples keeps growing and the power on the one minus alpha term also keeps

125
00:09:58,370 --> 00:09:58,890
growing.

126
00:09:59,570 --> 00:10:05,660
If we keep repeating this pattern T times, we end up with this expression involving a summation over

127
00:10:05,660 --> 00:10:08,060
all the past samples from one up to T.

128
00:10:08,750 --> 00:10:14,840
And of course, these weights are exactly exponentially decaying. Since Alpha is a number between zero

129
00:10:14,840 --> 00:10:18,970
and one, one minus Alpha is also a number between zero and one.

130
00:10:19,370 --> 00:10:24,560
And when you raise a number between zero and one to a power K, it gets smaller and smaller exponentially

131
00:10:24,740 --> 00:10:26,390
as K gets larger and larger.
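As a quick check of this expansion, the sketch below compares the recursive update with the fully expanded sum. Here the sample K steps back gets weight Alpha times one minus Alpha to the power K, and the starting average gets one minus Alpha to the power T. The function names, the starting value, and the data are assumptions for illustration.

```python
def ewma_recursive(samples, alpha, initial):
    """Apply the update one sample at a time."""
    avg = initial
    for x in samples:
        avg = alpha * x + (1 - alpha) * avg
    return avg


def ewma_expanded(samples, alpha, initial):
    """Same quantity written as an explicit weighted sum."""
    t = len(samples)
    total = (1 - alpha) ** t * initial
    for k in range(t):
        # k = 0 is the latest sample, k = 1 the one before it, and so on.
        total += alpha * (1 - alpha) ** k * samples[t - 1 - k]
    return total


data = [3.0, 1.0, 4.0, 1.0, 5.0]  # hypothetical samples
a = ewma_recursive(data, alpha=0.2, initial=0.0)
b = ewma_expanded(data, alpha=0.2, initial=0.0)

# Weights on the four most recent samples for Alpha = 0.5.
weights = [0.5 * (1 - 0.5) ** k for k in range(4)]
print(weights)  # [0.5, 0.25, 0.125, 0.0625]
```

The two functions agree up to floating-point rounding, and the printed weights halve at each step back in time.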

132
00:10:31,510 --> 00:10:37,120
So how can we summarize what we've learned in this lecture? We've extended the concept of the mean

133
00:10:37,330 --> 00:10:39,460
to include the exponentially weighted mean.

134
00:10:39,940 --> 00:10:45,070
We can picture this by assigning weights to each of our samples. With the arithmetic average,

135
00:10:45,250 --> 00:10:48,850
each of the weights is just constant, with equal weight for each sample.

136
00:10:49,420 --> 00:10:54,460
With the exponentially weighted average, the weights decay exponentially, going backwards in time.

137
00:10:55,030 --> 00:10:57,440
This means that the latest sample matters the most.

138
00:10:57,670 --> 00:10:59,470
The second latest sample matters less.

139
00:10:59,650 --> 00:11:02,500
The third latest sample matters even less and so forth.
