1
00:00:11,080 --> 00:00:17,680
In this video, we are going to look at Arima in code for the first few Arima lectures will continue

2
00:00:17,680 --> 00:00:22,300
looking at the airline passengers data to build your intuition on how a remote works.

3
00:00:22,600 --> 00:00:25,990
And then to cap off this section, we will return to looking at stocks.

4
00:00:26,770 --> 00:00:30,250
So in this lecture, we will focus solely on airline passengers.

5
00:00:30,550 --> 00:00:35,890
Since using this data set, it will be much easier to understand the behavior of a rhema under different

6
00:00:35,890 --> 00:00:38,470
party and new values with stocks.

7
00:00:38,620 --> 00:00:43,270
This is kind of difficult because you can't really tell if there's a problem due to bad settings or

8
00:00:43,270 --> 00:00:47,020
if it's simply due to the inherent randomness in stock prices themselves.

9
00:00:47,710 --> 00:00:49,430
So hopefully that makes sense for you.

10
00:00:51,730 --> 00:00:56,950
So we'll start by updating stats models, as you recall, at the current time, the version of stat's

11
00:00:56,950 --> 00:00:59,870
models on CoLab is different from the latest version.

12
00:01:00,880 --> 00:01:05,740
This notebook happens to contain code that will work differently depending on which version you have.

13
00:01:06,460 --> 00:01:11,350
Note that if you've seen a similar lecture from me before, there are a few details that are different

14
00:01:11,470 --> 00:01:12,950
in this version of the script.

15
00:01:13,420 --> 00:01:17,050
Of course, this sort of thing changes on pretty much a week by week basis.

16
00:01:17,290 --> 00:01:22,180
But I know that you, as a good data scientist, are not intimidated by these little details.

17
00:01:30,230 --> 00:01:34,700
OK, so the next thing we're going to do is download the airline passengers CSFI.

18
00:01:41,190 --> 00:01:44,280
Next, we import Pendas nonpaying matplotlib.

19
00:01:48,190 --> 00:01:52,570
Next, we're going to load in our CSFI using predatorial CSFI.

20
00:01:57,010 --> 00:02:00,790
Next, we do a different head just to remind you of what's in our data frame.

21
00:02:07,170 --> 00:02:11,190
Next, we do a dot plot to remind you what our data set looks like.

22
00:02:17,760 --> 00:02:20,910
Next, we're going to calculate the first difference in our data.

23
00:02:23,390 --> 00:02:28,670
Next, we do the plot so you can see what the first difference of the Time series looks like.

24
00:02:34,160 --> 00:02:40,220
As you can see, it's not quite stationary because there is some seasonality and the variation is increasing

25
00:02:40,220 --> 00:02:40,930
over time.

26
00:02:44,820 --> 00:02:50,580
Next, we're going to compute the log of the passengers column by calling airport log on the passengers

27
00:02:50,580 --> 00:02:51,060
column.

28
00:02:53,900 --> 00:02:56,240
Next, we plot the log passengers column.

29
00:03:02,720 --> 00:03:07,760
Notice something interesting about this, which is that the values don't seem to be growing as rapidly

30
00:03:07,970 --> 00:03:10,520
as they were in the non logged time series.

31
00:03:11,520 --> 00:03:15,690
Furthermore, the amplitude of the cycles looks more constant than it was before.

32
00:03:19,000 --> 00:03:22,870
Next, we're going to import the Arima class from stat's models.

33
00:03:27,410 --> 00:03:31,100
Next, we're going to set the frequency of our data from index to months.

34
00:03:33,090 --> 00:03:37,910
And we're also going to split our data into train and test using the last 12 months of our data set,

35
00:03:38,070 --> 00:03:39,090
as the test said.

36
00:03:42,850 --> 00:03:48,520
The next step is to create a trainer and test reacts in order to index our data frame either in the

37
00:03:48,520 --> 00:03:49,730
train or test rose.

38
00:03:50,770 --> 00:03:55,390
Recall that this is necessary for when we want to store our predictions in the data frame.

39
00:03:59,450 --> 00:04:01,680
Next, we're going to test in air one model.

40
00:04:02,330 --> 00:04:08,150
This is a model in which the current value in the Time series depends linearly on just the previous

41
00:04:08,150 --> 00:04:10,230
value, as you recall.

42
00:04:10,430 --> 00:04:13,070
This is equivalent to a rhema one zero zero.

43
00:04:16,640 --> 00:04:23,330
Next, we call a treatment outfit, as with hot winters, this returns and a remar result object.

44
00:04:23,690 --> 00:04:26,570
Unlike psychic learn, this does not return a model.

45
00:04:28,060 --> 00:04:34,540
Next, we call a rhema result predict to obtain the training predictions will place this in a new column

46
00:04:34,540 --> 00:04:36,820
in our data frame called R-1.

47
00:04:40,330 --> 00:04:46,210
Next will plot the passengers column along with the air one column to see how well our model fits with

48
00:04:46,210 --> 00:04:46,780
the data.

49
00:04:52,490 --> 00:04:57,680
So as you can see, we get this kind of delayed behavior, which should remind you of what happened

50
00:04:57,680 --> 00:05:05,000
with whole winters, as you recall, this is how hot winters also behaved when the model was specified.

51
00:05:05,540 --> 00:05:11,090
In these cases, it seems like the best your model can do is to just copy the previous value or close

52
00:05:11,090 --> 00:05:11,530
to it.

53
00:05:15,280 --> 00:05:21,670
Next, we're going to grab our out of sample predictions by calling a rhema result, get forecasts for

54
00:05:21,670 --> 00:05:22,750
NASA steps.

55
00:05:23,410 --> 00:05:29,090
OK, so unlike other prediction functions you may have seen, this does not return numerical data directly.

56
00:05:29,660 --> 00:05:33,160
Instead, this returns an object of type prediction results.

57
00:05:33,610 --> 00:05:36,210
You're encouraged to check the stats, models, documentation.

58
00:05:36,400 --> 00:05:42,100
If you want to learn exactly what attributes and methods are available in this course, you'll be shown

59
00:05:42,100 --> 00:05:43,190
the relevant parts.

60
00:05:43,810 --> 00:05:49,180
So if you want to access the actual data corresponding to the prediction, we simply call the attribute

61
00:05:49,180 --> 00:05:50,230
predicted mean.

62
00:05:50,950 --> 00:05:55,390
Note that we can also obtain data like confidence intervals, but we won't look at that just yet.

63
00:05:56,200 --> 00:06:00,430
For now, let's just plot the air one column again against the passengers data.

64
00:06:08,460 --> 00:06:14,130
As you can see, this model is not very good, in fact, it doesn't even capture the fact that the data

65
00:06:14,130 --> 00:06:15,300
is trending upwards.

66
00:06:15,720 --> 00:06:19,570
Instead, the forecast is going in completely the wrong direction.

67
00:06:20,010 --> 00:06:22,890
So even a naive forecast would be better than this.

68
00:06:26,330 --> 00:06:29,720
The next step is to explore the prediction results a little bit further.

69
00:06:30,230 --> 00:06:33,040
Let's first check to see what kind of object it really is.

70
00:06:33,440 --> 00:06:37,550
As you recall, we can accomplish this by using the tape function in Python.

71
00:06:40,930 --> 00:06:46,930
OK, so we see that this is a prediction results rapper, so this is a common paradigm in object oriented

72
00:06:46,930 --> 00:06:52,270
programming, but basically since this is a rapper, you can assume that any attribute or method that

73
00:06:52,270 --> 00:06:58,180
belongs to a prediction results object probably also belongs to this rapper object, at least if you

74
00:06:58,180 --> 00:06:59,040
want to be lazy.

75
00:06:59,380 --> 00:07:03,490
Otherwise, check the documentation in the case where you encounter any discrepancies.

76
00:07:06,320 --> 00:07:11,270
The next step is to learn how to obtain our confidence intervals in order to do this, we'll call the

77
00:07:11,270 --> 00:07:12,710
function and confident.

78
00:07:16,190 --> 00:07:21,800
OK, so as you can see what we get as a data frame, which contains two columns, the lower bound in

79
00:07:21,800 --> 00:07:23,900
the upper bounds for each point in time.

80
00:07:24,950 --> 00:07:27,130
Notice the convention for the column names.

81
00:07:27,410 --> 00:07:31,820
It's basically the string, lower or upper, prepend it to the original column name.

82
00:07:37,640 --> 00:07:42,980
Next, we're going to write a function that will plot the fitted values of an arena model along with

83
00:07:42,980 --> 00:07:45,470
the forecast and confidence intervals.

84
00:07:46,040 --> 00:07:50,390
This will be helpful for the rest of this script so that we don't have to keep writing this code over

85
00:07:50,390 --> 00:07:51,220
and over again.

86
00:07:52,040 --> 00:07:57,920
So this function, it takes in one argument, which is NRMA result object inside the function.

87
00:07:57,980 --> 00:08:03,300
The first thing we do is set the size of the player, which will be 15 columns by five rows.

88
00:08:04,280 --> 00:08:08,910
Next, we call applied function passing in the passenger's column of the data frame.

89
00:08:09,650 --> 00:08:11,210
This will plot the true data.

90
00:08:12,260 --> 00:08:17,150
Next, we're going to obtain the train predictions by calling a result that fitted values.

91
00:08:17,810 --> 00:08:23,240
Then we will plot the train predictions by calling the plot function again, passing in the train index

92
00:08:23,450 --> 00:08:25,960
along with the train predictions we just obtained.

93
00:08:26,450 --> 00:08:29,450
We'll make this plot green and we'll give it the label fitted.

94
00:08:30,590 --> 00:08:37,580
Next, we will plot the forecast to start, we need to call ResultSet forecast for NTSA steps.

95
00:08:38,000 --> 00:08:41,600
This gives us backup prediction results objects from there.

96
00:08:41,600 --> 00:08:46,730
We need to call the method in order to get a data frame containing the confidence intervals.

97
00:08:48,170 --> 00:08:53,110
The next step is to assign these confidence interval columns to variables called lower and upper.

98
00:08:54,170 --> 00:08:58,040
The next step is to grab our forecast to obtain the actual predicted values.

99
00:08:58,400 --> 00:09:00,790
Again, this uses the attribute predicted mean.

100
00:09:02,330 --> 00:09:08,150
Next, we will call the plot function again, this time passing in the test index along with the forecast

101
00:09:08,150 --> 00:09:08,930
values.

102
00:09:09,470 --> 00:09:13,100
We'll give this the label forecast next.

103
00:09:13,100 --> 00:09:18,980
In order to plot the confidence balance, we'll use the function at fill between the arguments to this

104
00:09:18,980 --> 00:09:20,120
will be the test index.

105
00:09:20,120 --> 00:09:25,850
Once again, the lower and upper values, the color red and alpha equals zero point three to set the

106
00:09:25,850 --> 00:09:26,720
transparency.

107
00:09:27,770 --> 00:09:31,850
Lastly, we call a legend so that the legend will show up in our PLI.

108
00:09:35,080 --> 00:09:40,000
All right, so in this next block, we're going to test the function we just created passing in the

109
00:09:40,000 --> 00:09:42,400
Arima result objects we got back earlier.

110
00:09:46,790 --> 00:09:53,720
OK, so this is our Awana forecast again, not very good, but we have many options to try next.

111
00:09:57,940 --> 00:10:02,740
Next, we're going to try something which I think should be pretty natural for most machine learning

112
00:10:02,740 --> 00:10:07,470
people, if our model is bad, why not simply add more inputs?

113
00:10:07,900 --> 00:10:11,290
And so in this next block, we're going to use an Artan model.

114
00:10:12,010 --> 00:10:15,220
Luckily, there's not much more work to do beyond what we've already done.

115
00:10:15,850 --> 00:10:20,090
We start by creating another Arima object with the order at 10 00.

116
00:10:20,770 --> 00:10:26,120
Next, we call a remote outfit and then we pass the Arima results into our plotting function.

117
00:10:26,800 --> 00:10:27,910
So let's run this.

118
00:10:34,910 --> 00:10:38,480
OK, and we see that this does, in fact, do a much better job.

119
00:10:39,110 --> 00:10:44,750
The first thing we notice is that it does not have the Lagan characteristic that the previous R-1 model

120
00:10:44,750 --> 00:10:45,190
had.

121
00:10:45,830 --> 00:10:51,140
This suggests that the model is actually learning to anticipate the pattern rather than just copying

122
00:10:51,140 --> 00:10:52,130
the last value.

123
00:10:52,730 --> 00:10:58,010
If we look at the forecast, we see that the model really does learn that the signal is periodic.

124
00:10:58,520 --> 00:11:02,870
Unfortunately, it seems to underestimate the true value significantly.

125
00:11:06,390 --> 00:11:12,780
Next, we're going to test and may one model, as you recall, this is equivalent to a rhema zero zero

126
00:11:12,780 --> 00:11:14,910
one as before.

127
00:11:14,910 --> 00:11:17,160
We can just repeat the same code from earlier.

128
00:11:17,700 --> 00:11:18,810
So let's run this.

129
00:11:23,320 --> 00:11:25,520
OK, so this looks pretty bad.

130
00:11:25,930 --> 00:11:28,570
It's clear that the May one model does not work.

131
00:11:28,600 --> 00:11:30,790
In fact, it's even worse than R1.

132
00:11:31,450 --> 00:11:36,770
As you recall, the moving average models expected forecast is always a constant value.

133
00:11:37,240 --> 00:11:40,630
So intuitively, we know that this is probably not the right choice.

134
00:11:45,810 --> 00:11:50,100
All right, so next, we're going to investigate using the log passengers.

135
00:11:50,730 --> 00:11:54,110
Let's begin by taking the first difference of the log passengers.

136
00:11:54,600 --> 00:11:56,430
Again, we use the function.

137
00:11:58,970 --> 00:12:02,360
Next, we plot the first difference of the log passengers.

138
00:12:06,230 --> 00:12:09,730
As you can see, the behavior is indeed what we expected.

139
00:12:10,340 --> 00:12:15,290
Unlike the first difference of the non logged passengers, these values do not seem to grow that much

140
00:12:15,290 --> 00:12:16,090
over time.

141
00:12:21,980 --> 00:12:27,590
All right, so next, before we do anything with the log passengers, I want to fit a new model on the

142
00:12:27,590 --> 00:12:29,200
non logs passengers.

143
00:12:29,720 --> 00:12:33,810
We know that defensing is good because it at least seems to remove the trend.

144
00:12:34,370 --> 00:12:40,850
So for this model, I'm going to use a rhema eight one one I've chosen P equals eight since it seems

145
00:12:40,850 --> 00:12:43,520
like regression on more pass values helps.

146
00:12:44,030 --> 00:12:50,660
I've chosen D equals one since as we saw, this removes the trend from the Time series and I've chosen

147
00:12:50,660 --> 00:12:55,680
Q equals one for good measure just so that we have a full Arima model with all the components.

148
00:12:56,390 --> 00:13:01,400
Now in order to plot the results from this Arima, we're going to need to create a new function.

149
00:13:01,670 --> 00:13:09,390
Y the reason why is because we've differenced our data, as you recall, when we difference our data.

150
00:13:09,530 --> 00:13:13,490
This removes one row from our data set, specifically the first row.

151
00:13:14,300 --> 00:13:19,220
Each point in the different time series is the current value minus the last value.

152
00:13:19,760 --> 00:13:24,520
But for the first row there is no last value and therefore that value can't exist.

153
00:13:25,280 --> 00:13:30,080
You'll see that if you try to use the previous function that we wrote for plotting the results, you

154
00:13:30,080 --> 00:13:30,970
will get an error.

155
00:13:31,520 --> 00:13:35,930
You were strongly encouraged to try it yourself and to see what the error actually is.

156
00:13:36,650 --> 00:13:40,700
OK, so this function is called plot fit and forecast int.

157
00:13:43,580 --> 00:13:46,520
It now takes in a three arguments instead of just one.

158
00:13:47,970 --> 00:13:51,660
First, it takes in the Arima results object, same as before.

159
00:13:52,350 --> 00:13:54,480
Second, it takes in a value for D.

160
00:13:55,530 --> 00:14:00,180
D will represent the first row in the Difference Time series that actually exists.

161
00:14:00,990 --> 00:14:06,000
Third, we have an argument called Call, which will tell us which column of the data frame we fit,

162
00:14:06,000 --> 00:14:09,960
the model line that will either be passengers or log passengers.

163
00:14:10,500 --> 00:14:14,760
Of course, we have to choose the appropriate one in order to plot the correct data.

164
00:14:14,760 --> 00:14:19,940
With the model predictions inside the function, we first plot the data.

165
00:14:20,670 --> 00:14:26,460
This looks the same as before, except now we index the data frame using the call argument rather than

166
00:14:26,460 --> 00:14:27,600
a fixed column name.

167
00:14:28,860 --> 00:14:32,270
Next, we grab our training predictions using the predict function.

168
00:14:32,820 --> 00:14:38,330
I've used the predict function so that we can be explicit about our indices, but also because it's

169
00:14:38,350 --> 00:14:44,460
necessary in order to get the values back in terms of the original time series rather than the difference

170
00:14:44,460 --> 00:14:45,540
the Time series.

171
00:14:46,530 --> 00:14:53,880
Notice that for the start argument I've passed in train that index index at D before this value would

172
00:14:53,880 --> 00:14:55,830
have just been zero instead of D.

173
00:14:58,040 --> 00:15:04,250
Next, we call the plot function for the predictions again, notice that for the index, I'm only passing

174
00:15:04,250 --> 00:15:06,680
in the values from the death row onward.

175
00:15:09,540 --> 00:15:16,110
Next, we plot the forecast, this part is the same as before we call the forecast function for n test

176
00:15:16,110 --> 00:15:20,340
timestamps, then we call the plot function to plot the forecast.

177
00:15:21,030 --> 00:15:24,300
Then we call the fill between function to plot the confidence interval.

178
00:15:26,040 --> 00:15:29,230
Lastly, we call the legend function to show the legend.

179
00:15:30,180 --> 00:15:33,780
Next, we're going to call our function, which will plot the results.

180
00:15:40,640 --> 00:15:47,300
OK, so as you can see, this does pretty well better than a purely auto regressive model, the train

181
00:15:47,300 --> 00:15:52,970
fit is pretty good and the forecast captures the seasonality, although it still seems to underestimate

182
00:15:52,970 --> 00:15:53,660
the peak.

183
00:15:57,560 --> 00:16:04,190
Next, we fit in a Rhema model with the same speed and values, except this time we use the logged data

184
00:16:04,190 --> 00:16:04,640
set.

185
00:16:05,390 --> 00:16:08,930
As usual, we call fit and then we plot the results.

186
00:16:14,280 --> 00:16:19,770
All right, so as you can see, our model still does decently well, it captures the seasonality, but

187
00:16:19,770 --> 00:16:21,360
it still underestimates the peak.

188
00:16:21,750 --> 00:16:24,440
It seems to do a bit worse than the non log data.

189
00:16:28,320 --> 00:16:34,050
So for the next step, I wondered what would happen if I simply added more auto regressive terms, this

190
00:16:34,050 --> 00:16:36,250
time with equals 12 and two equals zero.

191
00:16:36,810 --> 00:16:40,130
The reason I chose 12 here is because that's the length of the period.

192
00:16:40,710 --> 00:16:45,870
So it might make sense to use data from all 12 months over the past year in order to make a prediction.

193
00:16:47,100 --> 00:16:51,780
OK, so this is the same code as before, except now the Arima order is 12 one zero.

194
00:16:52,320 --> 00:16:53,700
OK, so let's run this.

195
00:16:59,830 --> 00:17:01,150
OK, so what do we see?

196
00:17:01,900 --> 00:17:04,320
Well, it seems like this is our best model so far.

197
00:17:08,080 --> 00:17:13,420
Now, just out of curiosity, I wanted to fit another model on the log passenger's data set with the

198
00:17:13,420 --> 00:17:14,980
same Arima parameters.

199
00:17:21,690 --> 00:17:25,520
OK, so, again, it looks pretty good, but it's hard to tell which one is better.

200
00:17:28,880 --> 00:17:33,680
Now, in order to really know how well these models have performed, it would be useful to look at a

201
00:17:33,680 --> 00:17:35,910
metric such as the root mean squared error.

202
00:17:36,650 --> 00:17:39,170
So next, we're going to define such a function.

203
00:17:39,860 --> 00:17:41,680
This will take in two arguments.

204
00:17:41,690 --> 00:17:47,630
The Arima results object and a flag to tell us whether the model was fitted on the log passengers or

205
00:17:47,630 --> 00:17:54,020
the non log passengers inside the function we call results forecast to get the forecast.

206
00:17:54,650 --> 00:17:56,420
Next, we check is logged.

207
00:17:56,870 --> 00:18:03,560
If is logged is true, then we exponentiation forecast to put the data back into the original scale.

208
00:18:04,310 --> 00:18:09,460
Finally, we assign the passengers column of the test data frame to a variable called T.

209
00:18:10,400 --> 00:18:13,550
Then we assign the forecast to a variable called Y.

210
00:18:14,930 --> 00:18:20,210
Then we apply our usual formula for the root mean squared error and then we return this value.

211
00:18:26,350 --> 00:18:30,730
All right, so in the next block, we compare the root mean squared error for the last four models we

212
00:18:30,730 --> 00:18:31,420
created.

213
00:18:32,320 --> 00:18:38,950
These are Remate one one on the raw data set a eight one one on the log data set, RMI 12 one zero on

214
00:18:38,950 --> 00:18:42,340
the raw data set in agreement 12, one zero on the log data set.

215
00:18:42,970 --> 00:18:44,350
OK, so let's run this.

216
00:18:47,420 --> 00:18:48,630
So what do we see?

217
00:18:49,610 --> 00:18:55,510
Firstly, our suspicions were correct, logging the data was a bit worse, at least for me, maybe one

218
00:18:55,520 --> 00:18:55,980
one.

219
00:18:56,880 --> 00:19:02,720
Next, we see that adding more lagged values to the auto regressive part of the model does really help.

220
00:19:03,110 --> 00:19:05,860
A rhema 12 one zero achieves the best accuracy.

221
00:19:06,260 --> 00:19:10,250
The best combination appears to be a rhema 12 one zero with logging.