1
00:00:11,060 --> 00:00:16,220
In this lecture, we will be looking at the theory behind vector autoregressive moving average

2
00:00:16,220 --> 00:00:16,940
models.

3
00:00:17,750 --> 00:00:21,460
So let's begin with the model itself, since I think it's pretty straightforward.

4
00:00:22,040 --> 00:00:25,940
Basically, in the previous ARIMA model, the time series was a scalar.

5
00:00:26,300 --> 00:00:27,950
Now the time series is a vector.

6
00:00:28,670 --> 00:00:32,910
This is because the time series now contains multiple observations at once.

7
00:00:33,440 --> 00:00:38,550
For example, the temperature in multiple cities or the voltage at multiple locations in your brain.

8
00:00:39,530 --> 00:00:45,050
So in the scalar case, all the parameters are scalars as well, since a scalar times a scalar gives

9
00:00:45,050 --> 00:00:45,790
you a scalar.

10
00:00:46,280 --> 00:00:49,760
But in the vector case, the multiplicative parameters are matrices.

11
00:00:50,240 --> 00:00:51,740
This is because a matrix times

12
00:00:51,740 --> 00:00:53,540
a vector gives you a vector.

13
00:00:54,170 --> 00:00:56,390
The bias term, however, is a vector.

14
00:00:57,590 --> 00:01:02,090
Now, note that normally I will not use arrows above variables to denote vectors.

15
00:01:02,420 --> 00:01:04,700
I'm just doing that here to be extra explicit.

16
00:01:05,270 --> 00:01:09,020
But most of the time it should be clear whether or not this is the case.

17
00:01:10,340 --> 00:01:13,160
Also, notice that I use capital letters for matrices.

18
00:01:13,520 --> 00:01:18,400
So in both cases we use phi for the autoregressive part and theta for the moving average part.

19
00:01:19,040 --> 00:01:22,430
But in the ARMA case we have lowercase phi and lowercase theta.

20
00:01:22,850 --> 00:01:26,060
For VARMA, we have uppercase Phi and uppercase Theta.
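Written out, the model described so far might look like the following. This is a sketch: the symbols b for the bias vector and epsilon for the error vector are notation assumed here for illustration, and the lecture's slide may use different letters.

```latex
\vec{y}_t = \vec{b}
          + \sum_{i=1}^{p} \Phi_i \, \vec{y}_{t-i}
          + \sum_{j=1}^{q} \Theta_j \, \vec{\varepsilon}_{t-j}
          + \vec{\varepsilon}_t
```

Here the time series value, the bias, and the errors are vectors, while each Phi and each Theta is a matrix.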

21
00:01:30,810 --> 00:01:36,780
OK, so here are some quick things to notice. Firstly, note that these matrices are square matrices.

22
00:01:37,290 --> 00:01:43,350
This is because if your time series values are vectors of size D, then you need a D by D matrix to

23
00:01:43,350 --> 00:01:47,250
get back a vector of size D after multiplying by the matrix.

24
00:01:48,000 --> 00:01:52,170
Also note, this model still only expresses linear relationships.

25
00:01:52,920 --> 00:01:54,510
As with scalar ARMA models,

26
00:01:54,520 --> 00:01:58,850
it's possible to have purely autoregressive, purely moving average, or both.

27
00:01:59,460 --> 00:02:04,020
The most common model is the vector autoregression, for reasons we will learn about later.

28
00:02:05,920 --> 00:02:10,150
Notice that we still use the parameters p and q the same way we did before.

29
00:02:10,720 --> 00:02:15,630
So p is the number of past values in the time series and q is the number of past errors.

30
00:02:16,240 --> 00:02:20,720
Note that these errors are also vectors of size D, the same as the time series itself.

31
00:02:25,400 --> 00:02:31,040
OK, so I hope you agree that the VARMA model is a pretty straightforward extension of what we saw before.

32
00:02:31,550 --> 00:02:36,470
In the rest of this lecture, we'll look at a few extra details and we'll try to gain a better understanding

33
00:02:36,620 --> 00:02:37,940
of what the model does.

34
00:02:38,690 --> 00:02:44,900
So the easiest way to understand what the model is doing is to consider VAR(p) in two dimensions.

35
00:02:45,500 --> 00:02:50,480
When we do this, it's easy to write down in scalar form. To make it simple,

36
00:02:50,480 --> 00:02:51,890
let's consider VAR(1).

37
00:02:53,040 --> 00:02:59,810
It's easy to see that our new model is just like our old model, except with extra terms. In particular,

38
00:02:59,990 --> 00:03:05,780
the current output depends on its own past lags, as it normally would for an AR process.

39
00:03:06,230 --> 00:03:10,350
But it also depends on past lags of the other time series as well.

40
00:03:11,090 --> 00:03:16,370
And note that this relationship is completely linear in terms of all the lags, whether they come from

41
00:03:16,370 --> 00:03:18,080
the same or different components.

42
00:03:19,940 --> 00:03:26,000
Now, suppose that we simply had two separate AR(1) models. In this case, we see that the equations

43
00:03:26,000 --> 00:03:29,460
would almost be the same, except that there are no cross terms.

44
00:03:29,900 --> 00:03:32,430
So each time series cannot affect the others.
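As a sketch of the comparison just described, here is VAR(1) in two dimensions written in scalar form. The subscripted symbols (b for the intercepts, phi for the matrix entries, epsilon for the errors) are notation assumed here for illustration.

```latex
\begin{aligned}
y_{1,t} &= b_1 + \phi_{11}\, y_{1,t-1} + \phi_{12}\, y_{2,t-1} + \varepsilon_{1,t} \\
y_{2,t} &= b_2 + \phi_{21}\, y_{1,t-1} + \phi_{22}\, y_{2,t-1} + \varepsilon_{2,t}
\end{aligned}
```

Two separate AR(1) models correspond to the special case where the cross terms vanish, that is, where phi_12 = phi_21 = 0:

```latex
\begin{aligned}
y_{1,t} &= b_1 + \phi_{11}\, y_{1,t-1} + \varepsilon_{1,t} \\
y_{2,t} &= b_2 + \phi_{22}\, y_{2,t-1} + \varepsilon_{2,t}
\end{aligned}
```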

45
00:03:33,200 --> 00:03:35,030
So that's the premise of this model.

46
00:03:35,540 --> 00:03:40,970
By using vector autoregression, you're assuming that there's some predictive capacity across multiple

47
00:03:40,970 --> 00:03:42,170
scalar time series.

48
00:03:46,810 --> 00:03:52,390
One interesting exercise to consider is how many parameters are needed for a vector autoregression

49
00:03:52,600 --> 00:03:55,270
compared to just multiple scalar autoregressions.

50
00:03:55,900 --> 00:04:00,070
Let's consider only VAR, just for simplicity. As an exercise,

51
00:04:00,220 --> 00:04:03,040
you might want to try to derive these for VARMA as well.

52
00:04:04,090 --> 00:04:10,630
So in the scalar case, if we have D independent time series and corresponding AR(p) models, then in

53
00:04:10,630 --> 00:04:14,260
total we will have D times (p + 1) parameters.

54
00:04:14,740 --> 00:04:21,040
This is because each individual model will have p + 1 parameters, including the intercept, and we

55
00:04:21,040 --> 00:04:22,390
will have D of these.

56
00:04:23,380 --> 00:04:26,840
Now let's consider the vector case. In the vector case,

57
00:04:27,070 --> 00:04:29,590
we will have p matrices, each of size D by D.

58
00:04:30,700 --> 00:04:34,110
We'll also have a bias term of size D, which is now a vector.

59
00:04:34,930 --> 00:04:39,070
So in total we will have p times D squared plus D parameters.

60
00:04:40,120 --> 00:04:45,700
You should convince yourself that as D grows, the number of parameters in the vector model grows faster.

61
00:04:46,300 --> 00:04:50,460
In fact, we would say it grows quite dramatically with the number of components.

62
00:04:50,920 --> 00:04:55,760
You can imagine why this might be detrimental in the case where your dimensionality is large.

63
00:04:56,560 --> 00:05:00,850
So if you've studied machine learning, then you've seen that having too many parameters can lead to

64
00:05:00,850 --> 00:05:02,710
overfitting, which is not good.

65
00:05:03,160 --> 00:05:08,110
This means that your model performs well on in-sample data but underperforms on new data.
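The parameter counts above can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and it counts only the coefficients and intercepts discussed in the lecture, not any noise covariance terms.

```python
def scalar_ar_param_count(d: int, p: int) -> int:
    """D independent AR(p) models: each has p coefficients plus one intercept."""
    return d * (p + 1)


def var_param_count(d: int, p: int) -> int:
    """VAR(p): p matrices of size D x D plus one bias vector of size D."""
    return p * d * d + d


if __name__ == "__main__":
    p = 2
    # Print D, the scalar-model count, and the vector-model count side by side.
    for d in (1, 2, 10, 100):
        print(d, scalar_ar_param_count(d, p), var_param_count(d, p))
```

For D = 1 the two counts agree, and the gap widens quickly as D grows because of the D squared term.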

66
00:05:12,710 --> 00:05:18,560
So one question to consider is, why are purely autoregressive models more popular than VARMA, where

67
00:05:18,560 --> 00:05:20,690
we combine it with the vector moving average?

68
00:05:21,290 --> 00:05:26,570
This seems counterintuitive since we've seen that moving average terms can make ARIMA more powerful.

69
00:05:27,710 --> 00:05:31,100
So the reason has to do with model identifiability.

70
00:05:31,610 --> 00:05:37,820
Now, this can get pretty mathematical, but the basic idea is that VARMA models are not unique, meaning

71
00:05:37,820 --> 00:05:40,100
you cannot uniquely determine p and q.

72
00:05:41,360 --> 00:05:46,040
From a practical standpoint, this just means you'll get a warning from statsmodels whenever you try

73
00:05:46,040 --> 00:05:46,850
to fit VARMA.

74
00:05:47,240 --> 00:05:50,030
Now, whether or not that matters to you is up to you.

75
00:05:54,840 --> 00:06:00,960
Something else to consider is: why is there no such thing as VARIMA? Well, just like with ARMA, we

76
00:06:00,960 --> 00:06:04,820
only want to fit a model if the time series is weak-sense stationary.

77
00:06:05,430 --> 00:06:09,220
But if this is the case, then why do we use VARMA and not VARIMA?

78
00:06:09,900 --> 00:06:11,260
Well, to be more specific,

79
00:06:11,280 --> 00:06:15,640
there seems to be some discussion of it online, but you won't find it in statsmodels.

80
00:06:16,650 --> 00:06:21,490
One reason this might be the case is that some parts of your time series might need to be differenced

81
00:06:21,510 --> 00:06:23,190
a different number of times.

82
00:06:23,520 --> 00:06:25,620
So there won't be only one value for d.

83
00:06:26,460 --> 00:06:30,960
For example, if you're looking at GDP, this might grow over time with a trend.

84
00:06:31,410 --> 00:06:35,790
But if you're looking at unemployment at the same time, this would likely not have a trend.

85
00:06:36,390 --> 00:06:41,610
Since this is the case, it will be your responsibility to make sure each component of your time series

86
00:06:41,610 --> 00:06:43,800
is stationary before using VARMA.

87
00:06:48,510 --> 00:06:54,180
OK, so now that you understand the basics of VARMA, let's think about how to use it in Python. You'll

88
00:06:54,180 --> 00:06:55,930
find that it's pretty similar to ARIMA.

89
00:06:56,910 --> 00:07:01,710
We'll look at both VARMAX and VAR, which have slightly different APIs for some reason.

90
00:07:02,640 --> 00:07:04,330
OK, so let's start with VARMA.

91
00:07:05,070 --> 00:07:10,060
In this case, we'll import a model called VARMAX, which allows you to add exogenous data.

92
00:07:10,740 --> 00:07:15,810
This takes in a multivariate time series dataset of shape T by D and an order tuple.

93
00:07:16,560 --> 00:07:19,320
So this tuple should contain your values for p and q.

94
00:07:20,250 --> 00:07:24,470
But note that it does not specify any differencing, since you'll have to do that yourself.

95
00:07:25,440 --> 00:07:29,350
The next step is to call model.fit, which returns a results object.

96
00:07:29,880 --> 00:07:34,070
This takes in some optional parameters, like the maximum number of iterations.

97
00:07:35,010 --> 00:07:38,270
So this isn't super important unless the defaults aren't working for you.

98
00:07:38,490 --> 00:07:41,580
So I encourage you to check the documentation if that's the case.

99
00:07:42,590 --> 00:07:48,390
OK, so as usual, once we have the results object, we can access the fittedvalues attribute, which

100
00:07:48,390 --> 00:07:50,430
will return the in-sample predictions.

101
00:07:51,000 --> 00:07:55,320
We can also call get_forecast, passing in the number of steps to forecast.

102
00:07:56,250 --> 00:08:03,030
As with ARIMA, we can also get confidence intervals by using the function conf_int. As before,

103
00:08:03,060 --> 00:08:07,680
this returns a DataFrame with each column name prepended with either upper or lower.

104
00:08:12,420 --> 00:08:16,390
The next step is to look at the API for VAR, which is a bit different.

105
00:08:17,130 --> 00:08:23,640
So in this case, we start the same way as before by creating a VAR object and passing in a T by D multivariate

106
00:08:23,640 --> 00:08:24,500
time series.

107
00:08:25,110 --> 00:08:27,750
At this point, you do not pass in the model order.

108
00:08:29,240 --> 00:08:34,070
So one interesting feature of this model is that it actually has a function called select_order.

109
00:08:34,640 --> 00:08:39,470
When you call this function, you pass in the maximum number of lags you want your model to have.

110
00:08:40,700 --> 00:08:45,980
It then goes through every possibility and computes the AIC along with other criteria, which have a

111
00:08:45,980 --> 00:08:47,090
similar purpose.

112
00:08:47,700 --> 00:08:53,300
As mentioned before, statisticians love having multiple options, but normally we just use the AIC.

113
00:08:54,620 --> 00:08:59,330
So from here you'll get an object of type LagOrderResults. From this object,

114
00:08:59,510 --> 00:09:02,160
you can access an attribute called selected_orders.

115
00:09:02,570 --> 00:09:07,610
This will return a dictionary where the key is the criterion and the value is the selected order.

116
00:09:08,150 --> 00:09:13,610
So you'll find the AIC, BIC, and so forth, each with a number that tells you which order leads to the

117
00:09:13,610 --> 00:09:14,330
best value.

118
00:09:15,530 --> 00:09:20,120
However, note that all this work is not really necessary because you have to pass in the same thing

119
00:09:20,120 --> 00:09:26,720
anyway when you call model.fit. In this case, when you call model.fit, you pass in the maximum

120
00:09:26,720 --> 00:09:30,460
number of lags and the information criterion you want to use.

121
00:09:30,920 --> 00:09:33,690
So the fit function is going to do all the same work again.

122
00:09:34,700 --> 00:09:37,420
Note that this also means you cannot choose your own p.

123
00:09:38,510 --> 00:09:43,490
The real use of calling select_order manually is that you get to see if there's any discrepancy

124
00:09:43,700 --> 00:09:45,620
if you choose a different criterion.

125
00:09:46,910 --> 00:09:52,850
OK, so as usual, after you call model.fit, this is going to give you back a VARResults object.

126
00:09:54,340 --> 00:09:59,380
From this, you can get the order selected by the fit function by accessing the attribute k_ar.

127
00:09:59,380 --> 00:10:06,220
Now, this might seem just like an interesting value to look at, but in fact, it's required in order

128
00:10:06,220 --> 00:10:07,980
to complete the following steps.

129
00:10:08,680 --> 00:10:11,170
So the next step is to compute the forecast.

130
00:10:11,560 --> 00:10:17,440
But for some reason, this forecast function is unlike the others we've seen. In this case,

131
00:10:17,470 --> 00:10:23,770
this function also takes in the prior values in the time series. As you recall, other models we've

132
00:10:23,770 --> 00:10:27,730
looked at, like ARMA, ARIMA, and VARMA, do not work like this.

133
00:10:28,420 --> 00:10:33,410
So in order to get the forecast, you'll have to pass in the past values of the time series.

134
00:10:33,970 --> 00:10:39,220
This is where the lag order comes into play, since that's how many past values you need to pass in.

135
00:10:40,120 --> 00:10:41,050
In statsmodels,

136
00:10:41,050 --> 00:10:45,510
they call this the prior value, not to be confused with the prior in Bayesian machine learning.

137
00:10:46,780 --> 00:10:50,830
Now, you might think that this is pretty inconvenient, but in fact it's pretty useful.

138
00:10:51,430 --> 00:10:56,650
Recall that for other models like Arima, if you wanted to incorporate new data, we would just have

139
00:10:56,650 --> 00:10:59,920
to train the whole model again and get forecast again.

140
00:11:00,610 --> 00:11:03,490
Otherwise the model would have to use predicted values.

141
00:11:04,540 --> 00:11:09,970
So, for example, if you want to forecast y(T+3), you would first need to forecast y(T+1)

142
00:11:09,970 --> 00:11:15,890
and y(T+2), and use those predictions to compute the prediction for y(T+3).

143
00:11:16,750 --> 00:11:21,270
But what if today is T plus two and you just want to make a forecast for one day ahead?

144
00:11:21,670 --> 00:11:25,810
In other words, you want to use the true values for y(T+1) and y(T+2).

145
00:11:26,680 --> 00:11:32,170
Well, with ARIMA, we would need to train a new model that uses all the data up to T plus two.

146
00:11:32,740 --> 00:11:38,000
But with VAR we can just pass in the true lagged values, assuming that the model hasn't changed.

147
00:11:38,500 --> 00:11:39,730
So this is useful,

148
00:11:39,730 --> 00:11:42,360
assuming that you don't want to train a new model every day.

149
00:11:42,970 --> 00:11:47,050
You'll recall that this is also the same approach used in regular machine learning.

150
00:11:48,700 --> 00:11:54,540
One final thing to notice here is that both the prior input and the forecast are NumPy arrays.

151
00:11:55,060 --> 00:12:00,730
This is different from the previous examples which accept data frames as input and give back forecast

152
00:12:00,730 --> 00:12:01,920
objects in return.

153
00:12:06,490 --> 00:12:10,030
OK, so let's quickly summarize this lecture since it's been quite long.

154
00:12:10,870 --> 00:12:14,260
We first looked at the equation for the full VARMA(p, q) model.

155
00:12:15,010 --> 00:12:19,850
This helps us recognize that it's just a straightforward extension of ARMA to vectors.

156
00:12:20,560 --> 00:12:26,020
We noted that this is still a linear model, except that now each time series depends linearly on its

157
00:12:26,020 --> 00:12:29,450
own lagged values and the lagged values of the other time series.

158
00:12:30,730 --> 00:12:36,050
We learned that one problem with full VARMA is that generally these models are not identifiable.

159
00:12:36,640 --> 00:12:38,650
You'll also see that they take longer to train.

160
00:12:39,850 --> 00:12:45,070
We learned that the number of parameters in vector models grows quadratically with the number of dimensions

161
00:12:45,070 --> 00:12:46,150
in the time series.

162
00:12:46,720 --> 00:12:51,580
This could lead to overfitting and it's due to the fact that our parameters are matrices which are square.

163
00:12:52,780 --> 00:12:57,790
We then looked at the statsmodels API for both VARMA and VAR, which are slightly different.

164
00:12:58,540 --> 00:13:03,400
We saw that VAR models have the option for automatic order selection, although you could do the same

165
00:13:03,400 --> 00:13:05,990
thing manually with VARMAX if you wanted.