1
00:00:11,090 --> 00:00:15,980
In this next part of the course, we are going to begin discussing a different kind of time series model,

2
00:00:16,280 --> 00:00:20,380
which is more like the machine learning type of model that you might be familiar with.

3
00:00:20,930 --> 00:00:26,480
In particular, we'll be looking at autoregressive models, moving average models, and combinations

4
00:00:26,480 --> 00:00:27,160
thereof.

5
00:00:27,740 --> 00:00:30,170
Generally, these are known as ARIMA models.

6
00:00:30,610 --> 00:00:36,260
AR stands for autoregressive, MA stands for moving average, and I stands for integrated.

7
00:00:36,620 --> 00:00:39,630
You will learn what all of these terms mean in the coming lectures.

8
00:00:40,160 --> 00:00:45,230
Note that in this context, moving average does not mean what it meant in the previous lectures, so

9
00:00:45,230 --> 00:00:46,540
don't get them confused.

10
00:00:47,000 --> 00:00:50,960
The names are similar, but as you'll see, the actual models are completely different.

11
00:00:51,620 --> 00:00:54,950
This lecture will focus on autoregressive models specifically.

12
00:00:59,940 --> 00:01:05,310
But first, what is the difference between ARIMA models and the exponential smoothing models we previously

13
00:01:05,310 --> 00:01:06,040
discussed?

14
00:01:06,540 --> 00:01:11,950
As you recall, the exponential smoothing models are built for very specific kinds of data.

15
00:01:12,390 --> 00:01:15,870
They model certain aspects of time series very explicitly.

16
00:01:16,380 --> 00:01:20,940
In particular, they model linear trends and they model seasonality.

17
00:01:21,360 --> 00:01:23,240
These are built into these models.

18
00:01:23,700 --> 00:01:29,190
On the other hand, you'll find that ARIMA models impose no such structure. In that way,

19
00:01:29,400 --> 00:01:34,020
they are more in the spirit of modern machine learning, where you take a model and you try to fit it

20
00:01:34,020 --> 00:01:36,960
to your data, whatever structure your data may have.

21
00:01:41,690 --> 00:01:48,020
So what are autoregressive models? Autoregressive models are basically linear regression models where

22
00:01:48,020 --> 00:01:53,320
the inputs, also known as the predictors, are past data points in the time series.

23
00:01:53,990 --> 00:01:59,390
Let's first review how linear regression works since we may not have gone into enough detail earlier

24
00:01:59,390 --> 00:02:00,200
in the course.

25
00:02:00,860 --> 00:02:04,790
As you recall, the simplest linear regression model looks like this.

26
00:02:05,120 --> 00:02:13,490
We say y hat equals m times x plus b. In this case, x is the input, y hat is the prediction, and m and b are

27
00:02:13,490 --> 00:02:17,500
parameters that are found through minimizing the error of the predictions.

28
00:02:18,050 --> 00:02:23,660
One simple example of linear regression would be trying to predict salary from years of experience.

29
00:02:23,990 --> 00:02:28,190
So X would represent years of experience and Y would represent salary.

30
00:02:33,190 --> 00:02:38,320
But what if we have more than one input? You might think years of experience might not be good enough

31
00:02:38,320 --> 00:02:41,280
on its own to make a good prediction about salary.

32
00:02:42,010 --> 00:02:44,650
Let's say we want to take into account age as well.

33
00:02:45,070 --> 00:02:53,350
Now our model will look like this: y hat equals w1 times x1 plus w2 times x2 plus b. In this

34
00:02:53,350 --> 00:03:00,030
case, x1 could represent years of experience and x2 could represent age. As before,

35
00:03:00,250 --> 00:03:07,000
w1, w2, and b are found by minimizing the error of the predictions compared to their true values in

36
00:03:07,000 --> 00:03:07,630
the data.

37
00:03:09,260 --> 00:03:17,210
Let's think about what the interpretation of w1, w2, and b could be. As before, b represents the y

38
00:03:17,210 --> 00:03:21,630
intercept, that is, the value of y hat when all the x's are zero.

39
00:03:22,040 --> 00:03:26,950
If you plug in x1 equals zero and x2 equals zero, you can see that this is true.

40
00:03:27,680 --> 00:03:35,450
The w's, on the other hand, tell us how each of the x's affects y hat. If x1 increases by one and everything

41
00:03:35,450 --> 00:03:41,090
else, that is, x2, remains constant, then y hat will increase by w1.

42
00:03:41,600 --> 00:03:46,640
If you don't see this right away, I would recommend plugging in some numbers until you are convinced.

43
00:03:48,070 --> 00:03:55,960
So w1 is the amount that y hat increases by when x1 increases by one. To give you a concrete example,

44
00:03:56,110 --> 00:04:04,090
if x1 represents years of experience and w1 equals 5000, then that means in our model, every additional

45
00:04:04,090 --> 00:04:08,230
year of experience should lead to a five thousand dollar increase in salary,

46
00:04:08,380 --> 00:04:09,330
on average.
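
To make the interpretation of the intercept and the weights concrete, here is a minimal sketch. The weight for age and the intercept are made-up numbers for illustration only; just the 5000 per year of experience comes from the example above.

```python
# Hypothetical parameters for illustration; they are not fitted to any data.
W1 = 5000.0   # change in predicted salary per extra year of experience
W2 = 300.0    # change in predicted salary per extra year of age (assumed)
B = 30000.0   # intercept: the prediction when both inputs are zero (assumed)


def predict_salary(years_experience, age):
    """Multiple linear regression prediction: y_hat = w1*x1 + w2*x2 + b."""
    return W1 * years_experience + W2 * age + B


# Plugging in zeros for both inputs recovers the intercept b.
intercept = predict_salary(0, 0)

# Raising x1 by one while holding x2 fixed changes y_hat by exactly w1.
delta = predict_salary(6, 40) - predict_salary(5, 40)

print(intercept, delta)
```

Changing the age argument instead, while holding experience fixed, would shift the prediction by W2 in the same way.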

47
00:04:10,120 --> 00:04:15,550
Note that we sometimes call this model multiple linear regression due to the fact that there are multiple

48
00:04:15,550 --> 00:04:19,590
inputs, whereas before we would call that simple linear regression.

49
00:04:20,080 --> 00:04:23,680
Usually I don't make this distinction and I just call it linear regression.

50
00:04:28,620 --> 00:04:33,300
OK, so what does this have to do with time series, and autoregressive models in particular?

51
00:04:33,900 --> 00:04:40,050
Well, an autoregressive model is nothing but a multiple linear regression model where the inputs are

52
00:04:40,050 --> 00:04:41,190
the past values

53
00:04:41,190 --> 00:04:49,320
in the time series. For example, y hat at time t is a linear function of y at time t minus 1, y at

54
00:04:49,320 --> 00:04:53,400
time t minus 2, and so on, up to y at time t minus p.

55
00:04:54,150 --> 00:05:00,720
Note that when we use p past data points in the time series, that is, y at time t minus 1 back to

56
00:05:00,720 --> 00:05:02,150
y at time t minus p,

57
00:05:02,520 --> 00:05:04,810
we call this an AR(p) model.

58
00:05:05,230 --> 00:05:10,070
More verbosely, we say that this is an autoregressive model of order p.

59
00:05:15,020 --> 00:05:20,780
Note that an equivalent way of expressing a linear regression model is this: instead of having y hat

60
00:05:20,780 --> 00:05:28,520
on the left, we just have y at time t. On the right side, we have added a noise term for time t, epsilon

61
00:05:28,520 --> 00:05:29,180
sub t.

62
00:05:29,810 --> 00:05:36,050
What this says is that what we actually measure, that is, the true y, is a linear model of the inputs

63
00:05:36,260 --> 00:05:37,560
plus some noise.

64
00:05:38,060 --> 00:05:43,370
This must be the case since as you've seen, usually when we do a linear regression, the data points

65
00:05:43,370 --> 00:05:45,230
do not fall exactly on the line.

66
00:05:46,460 --> 00:05:53,120
Another way of thinking about this is that y hat at time t is just the expected value of y at time t, assuming

67
00:05:53,120 --> 00:05:55,580
that the noise term's expected value is zero.
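
Written out, the two equivalent forms of the AR(p) model described above might look as follows. The symbol names (w_i for the weights and b for the intercept) are carried over from the linear regression discussion and are an assumption about the notation on the slides.

```latex
% Prediction form: y hat on the left, no noise term
\hat{y}_t = b + \sum_{i=1}^{p} w_i \, y_{t-i}

% Measurement form: the true y on the left, plus a noise term
y_t = b + \sum_{i=1}^{p} w_i \, y_{t-i} + \varepsilon_t

% If E[\varepsilon_t] = 0, taking the expectation given the past values yields
\mathbb{E}\left[ y_t \mid y_{t-1}, \dots, y_{t-p} \right]
  = b + \sum_{i=1}^{p} w_i \, y_{t-i} = \hat{y}_t
```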

68
00:06:00,550 --> 00:06:06,670
Now, just as a theoretical exercise, we are going to discuss how you might use scikit-learn to implement

69
00:06:06,670 --> 00:06:08,150
an autoregressive model.

70
00:06:08,680 --> 00:06:13,930
In actuality, we are going to use statsmodels, but I find it very instructive to think about how

71
00:06:13,930 --> 00:06:18,010
you would use scikit-learn, which forces you to think about the structure of the data.

72
00:06:18,790 --> 00:06:24,070
Let's think about what a data set might look like for a normal linear regression model where I'm trying

73
00:06:24,070 --> 00:06:27,550
to predict salary from years of experience and age.

74
00:06:28,000 --> 00:06:34,150
In order to do this, I might do a survey where I ask everyone in the office to anonymously fill in

75
00:06:34,150 --> 00:06:35,710
a spreadsheet that I've created.

76
00:06:36,160 --> 00:06:40,930
After all of my coworkers have filled in the spreadsheet, I will have something that looks like what

77
00:06:40,930 --> 00:06:41,790
you see here.

78
00:06:42,430 --> 00:06:45,020
Each row corresponds to a different person.

79
00:06:45,490 --> 00:06:49,460
Each column represents an input feature. In our case,

80
00:06:49,480 --> 00:06:51,680
that's years of experience and age.

81
00:06:52,870 --> 00:06:54,730
We call this big table X.

82
00:06:54,880 --> 00:06:58,780
In fact, it is indeed a table, but it is also a matrix.

83
00:06:59,530 --> 00:07:03,330
Furthermore, we have Y, which is just a single column of salaries.

84
00:07:03,700 --> 00:07:08,760
Obviously each of these salaries corresponds to the same row in the X matrix.

85
00:07:09,070 --> 00:07:14,350
So if the first row of X belongs to Alice, then the first row of Y is Alice's salary.

86
00:07:19,260 --> 00:07:24,390
Using this knowledge, can we build a linear regression data set for an AR(p) model?

87
00:07:24,720 --> 00:07:25,820
The answer is yes.

88
00:07:26,220 --> 00:07:30,880
Well, I'm just going to show you the answer, and your job will be to check that this makes sense.

89
00:07:31,320 --> 00:07:36,030
In fact, you could even implement this yourself using scikit-learn as an exercise.

90
00:07:36,570 --> 00:07:37,920
So the idea is this.

91
00:07:38,400 --> 00:07:41,100
Suppose that we start with 10 data points,

92
00:07:41,100 --> 00:07:42,680
y1 up to y10.

93
00:07:43,200 --> 00:07:48,210
We know that, with p equal to 3, the equation to predict y4 is a linear function of y1, y2,

94
00:07:48,210 --> 00:07:49,020
and y3.

95
00:07:49,560 --> 00:07:55,050
Therefore, if y4 is the target, then y1, y2, and y3 are the inputs.

96
00:07:55,500 --> 00:08:01,160
Similarly, if y5 is the target, then y2, y3, and y4 are the inputs.

97
00:08:01,560 --> 00:08:04,590
If we keep doing this, we get the table that you see here.
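
The table construction described above can be sketched in plain Python. The helper name `make_lag_table` is hypothetical, and the string labels stand in for the ten data points so that each row is easy to check by eye.

```python
def make_lag_table(series, p):
    """Return (X, Y): each row of X holds p consecutive values and the
    matching entry of Y is the value that immediately follows them."""
    X, Y = [], []
    for t in range(p, len(series)):
        X.append(series[t - p:t])  # inputs: the p values before position t
        Y.append(series[t])        # target: the value at position t
    return X, Y


# Stand-in labels for y1..y10 so the rows are easy to read.
series = [f"y{i}" for i in range(1, 11)]
X, Y = make_lag_table(series, p=3)

for row, target in zip(X, Y):
    print(row, "->", target)
```

With 10 points and p equal to 3, this gives 7 rows, since the first three points have no complete set of inputs and can only serve as predictors.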

98
00:08:06,710 --> 00:08:12,620
Note that in machine learning, we typically say that the size of X is N by D, where N is the number

99
00:08:12,620 --> 00:08:17,580
of samples and D is the number of input features. For ARIMA models,

100
00:08:17,750 --> 00:08:21,360
we will stick with the convention that the number of predictors is p.

101
00:08:21,770 --> 00:08:27,590
So in case you are confused about this, just remember that in the context of ARIMA, D is equal to

102
00:08:27,590 --> 00:08:28,100
p.

103
00:08:33,070 --> 00:08:38,440
So at this point, you know how to convert your time series into a tabular data format that can be passed

104
00:08:38,440 --> 00:08:41,980
into scikit-learn. Once you have this, the rest is easy.

105
00:08:42,370 --> 00:08:48,610
As usual, you can create a model of type LinearRegression. You call model.fit to fit your model.

106
00:08:49,000 --> 00:08:54,280
After that, you can call model.predict to make predictions or model.score to check how good

107
00:08:54,280 --> 00:08:55,180
your model is.
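
As a rough sketch of those calls, assuming NumPy and scikit-learn are installed: the series below is synthetic and noise-free, generated from assumed weights 0.5 and 0.3, so the fitted coefficients should simply recover those numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic series following y_t = 0.5*y_{t-1} + 0.3*y_{t-2} exactly (no noise).
# The weights 0.5 and 0.3 are assumptions chosen for this illustration.
p = 2
series = [1.0, 2.0]
for _ in range(30):
    series.append(0.5 * series[-1] + 0.3 * series[-2])

# Each row of X is [y_{t-2}, y_{t-1}] (oldest first); Y holds the matching y_t.
X = np.array([series[t - p:t] for t in range(p, len(series))])
Y = np.array(series[p:])

model = LinearRegression()
model.fit(X, Y)

print(model.coef_, model.intercept_)  # weights in the same column order as X
print(model.score(X, Y))              # R^2 on the training data

# One-step-ahead prediction from the last p observed values.
next_value = model.predict(np.array([series[-p:]]))[0]
print(next_value)
```

Because the columns run oldest first, the first coefficient belongs to the value two steps back and the second to the most recent value.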

108
00:08:55,600 --> 00:08:57,580
This is just another instance of my rule:

109
00:08:57,790 --> 00:08:59,260
all data is the same.

110
00:08:59,500 --> 00:09:01,300
It doesn't matter what your data is.

111
00:09:01,420 --> 00:09:07,000
The basic API never changes whether you're predicting salary or you're predicting a time series.

112
00:09:11,740 --> 00:09:18,190
Another interesting topic to consider is this: we know, based on our knowledge of machine learning, that

113
00:09:18,190 --> 00:09:20,340
linear models are not that powerful.

114
00:09:20,650 --> 00:09:23,710
In fact, they can only represent lines or planes.

115
00:09:24,250 --> 00:09:26,990
In addition to my rule, all data is the same,

116
00:09:27,040 --> 00:09:28,060
I have another rule:

117
00:09:28,300 --> 00:09:30,610
all machine learning interfaces are the same.

118
00:09:31,210 --> 00:09:32,110
What does this mean?

119
00:09:32,680 --> 00:09:38,170
It means that most of the time when you are doing machine learning, you can use different models, say,

120
00:09:38,230 --> 00:09:42,080
in scikit-learn, without any change to the rest of your code.

121
00:09:42,760 --> 00:09:47,850
Previously, we used linear regression, and we noted that this is an autoregressive model.

122
00:09:48,280 --> 00:09:51,490
But why should we be limited to just linear regression?

123
00:09:52,000 --> 00:09:57,040
What's stopping us from, say, using a neural network or random forest, which we know are much more

124
00:09:57,040 --> 00:09:58,210
powerful models?

125
00:09:59,110 --> 00:10:04,150
If we have our X and we have our Y, there is nearly no change to the code required.

126
00:10:04,420 --> 00:10:08,170
Simply replace the LinearRegression class with some other class.
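
A minimal sketch of that swap, assuming NumPy and scikit-learn are installed. The helper `fit_autoregressor` and the toy series are hypothetical, and `RandomForestRegressor` is just one possible stand-in for "some other class".

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression


def fit_autoregressor(model, series, p):
    """Build the lagged (X, Y) table from the series and fit any estimator
    that follows the scikit-learn fit/predict interface."""
    X = np.array([series[t - p:t] for t in range(p, len(series))])
    Y = np.array(series[p:])
    model.fit(X, Y)
    return model, X, Y


# Toy series for illustration only: a sine wave sampled at 60 points.
series = [float(np.sin(0.3 * t)) for t in range(60)]

# Only the model object changes; the data preparation and calls stay the same.
linear, X, Y = fit_autoregressor(LinearRegression(), series, p=3)
forest, _, _ = fit_autoregressor(
    RandomForestRegressor(n_estimators=10, random_state=0), series, p=3
)

linear_pred = linear.predict(X)
forest_pred = forest.predict(X)
print(linear_pred[:3], forest_pred[:3])
```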

127
00:10:08,740 --> 00:10:14,380
On the other hand, it's also important to remember that in this class, which is focused around financial

128
00:10:14,380 --> 00:10:20,160
data, there is a strong emphasis on the modeling aspect rather than predictive accuracy.

129
00:10:20,680 --> 00:10:26,230
What you will learn in the following lectures is that ARIMA models allow us to understand our data in

130
00:10:26,230 --> 00:10:29,950
a much more in-depth way than simply plugging in a neural network.

131
00:10:34,710 --> 00:10:41,400
Yet another way to understand autoregressive models, and ARIMA in general, is this. As you know, one

132
00:10:41,400 --> 00:10:44,670
of the major themes of this course is price simulation.

133
00:10:45,150 --> 00:10:46,560
What we are actually studying in

134
00:10:46,560 --> 00:10:53,040
this course is random processes, and naturally, a random process can be simulated using a computer.

135
00:10:53,640 --> 00:10:59,160
We looked at a few examples in previous lectures, such as simulating random walks and simulating a

136
00:10:59,160 --> 00:11:04,240
sequence of coin flips to decide whether you would take a step to the left or take a step to the right.

137
00:11:05,220 --> 00:11:10,770
So it's useful to ask the question: if we are given an AR(p) model, that is, all of the weights of the model,

138
00:11:11,010 --> 00:11:12,950
how do we generate a time series?

139
00:11:13,590 --> 00:11:18,720
Note that this is kind of the opposite of the question that we are used to considering in machine learning. In

140
00:11:18,720 --> 00:11:19,490
machine learning,

141
00:11:19,500 --> 00:11:23,400
the question we usually ask is: given the time series, tell me the weights.

142
00:11:23,760 --> 00:11:25,350
Now we are going in reverse order.

143
00:11:25,590 --> 00:11:28,800
Given the weights, tell me what the time series might look like.

144
00:11:30,780 --> 00:11:36,390
So how can we do this? Let's suppose we have an AR(3) model, so the next value in our time series

145
00:11:36,630 --> 00:11:38,990
always depends on the last three values.

146
00:11:39,600 --> 00:11:44,280
Let's suppose also that the first three values are given so that the first value we are responsible

147
00:11:44,280 --> 00:11:51,690
for generating is y4. In this case, the first step will be to generate epsilon 4, the error at time

148
00:11:51,690 --> 00:11:56,070
4, which is normal with mean zero and some variance sigma squared.

149
00:11:56,670 --> 00:12:00,840
Then we can use our formula to calculate y4. Next,

150
00:12:00,840 --> 00:12:02,610
we have to calculate y5.

151
00:12:03,090 --> 00:12:08,970
Before we can do that, we first generate epsilon 5, again from the same normal with mean zero and

152
00:12:08,970 --> 00:12:10,350
variance sigma squared.

153
00:12:11,070 --> 00:12:18,300
Then we add epsilon 5 to the y2, y3, and newly generated y4 terms, and that gives us y5.

154
00:12:19,500 --> 00:12:25,390
Then we generate epsilon 6 and y6, epsilon 7 and y7, and so on and so forth.
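
The generation procedure just described can be sketched with only the Python standard library. The function name `simulate_ar`, the weights, the starting values, and sigma are all assumptions picked for illustration, not values from the lecture.

```python
import random


def simulate_ar(weights, bias, initial_values, n_steps, sigma, seed=0):
    """Generate n_steps new values from an AR(p) model.

    weights[0] multiplies the most recent value, weights[1] the one before
    it, and so on; p is taken to be len(weights).
    """
    p = len(weights)
    if len(initial_values) < p:
        raise ValueError("need at least p initial values")
    rng = random.Random(seed)
    series = list(initial_values)
    for _ in range(n_steps):
        eps = rng.gauss(0.0, sigma)   # noise term: normal, mean 0, std sigma
        recent = series[-p:][::-1]    # y_{t-1}, y_{t-2}, ..., y_{t-p}
        y_t = bias + sum(w * y for w, y in zip(weights, recent)) + eps
        series.append(y_t)
    return series


# AR(3): the first three values are given, then y4, y5, ... are generated.
path = simulate_ar([0.5, 0.2, 0.1], bias=0.0,
                   initial_values=[1.0, 1.0, 1.0], n_steps=7, sigma=0.1)
print(path)

# A single previous value with weight one and no intercept is a random walk.
walk = simulate_ar([1.0], bias=0.0, initial_values=[0.0], n_steps=10, sigma=1.0)
print(walk)
```

Fixing the seed makes a run repeatable; changing it gives a different sample path from the same model.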

155
00:12:25,950 --> 00:12:31,560
So you see that by doing this we understand that this model is more expressive than a simple random

156
00:12:31,560 --> 00:12:32,060
walk.

157
00:12:32,640 --> 00:12:37,580
This would be a random walk if we only had one previous value and a weight of one.

158
00:12:38,160 --> 00:12:41,970
Instead, we have a bunch more terms and arbitrary weights.

159
00:12:42,420 --> 00:12:46,080
In the next few lectures, we will make this model even more expressive.