1
00:00:11,090 --> 00:00:17,050
In this lecture, we are going to take everything we learned about ARIMA and apply it to stock prices.

2
00:00:17,750 --> 00:00:22,220
So this goes back to the fundamental question we asked at the beginning of this section.

3
00:00:22,880 --> 00:00:27,470
If we make use of time series analysis, can we predict stock prices?

4
00:00:27,950 --> 00:00:33,800
After all, stock prices are a time series, just like airline passengers, just like sales data and

5
00:00:33,800 --> 00:00:34,520
so forth.

6
00:00:35,150 --> 00:00:39,020
I think by the end of this section, we've come to learn a few basic facts.

7
00:00:39,770 --> 00:00:44,940
Firstly, ARIMA is a pretty well-accepted staple of time series forecasting.

8
00:00:45,680 --> 00:00:51,500
Secondly, a more modern approach to finding the best ARIMA orders is to use auto_arima

9
00:00:51,680 --> 00:00:53,760
rather than trying to pick the orders manually.

10
00:00:54,470 --> 00:00:59,390
So this lecture is all about taking everything that we learned and applying that to stock prices.

11
00:01:00,020 --> 00:01:05,700
As a side note, I want to mention that this lecture contains no new code that you haven't seen before.

12
00:01:06,230 --> 00:01:11,660
So if you feel like exercising what you've learned, you already know what we're going to do. If you

13
00:01:11,660 --> 00:01:13,190
want to try to do it all on your own,

14
00:01:13,340 --> 00:01:14,930
that would be highly encouraged.

15
00:01:16,240 --> 00:01:20,050
OK, so let's start by downloading the S&P 500 CSV.

16
00:01:22,870 --> 00:01:29,530
Next, let's install pmdarima. Unfortunately, because this library doesn't come with Google Colab,

17
00:01:29,650 --> 00:01:32,230
you're going to need to install it manually each time.

18
00:01:35,860 --> 00:01:41,110
Next, we're going to import pmdarima as well as pandas, NumPy, and Matplotlib.

19
00:01:47,150 --> 00:01:52,190
Next, we're going to use pandas' read_csv to load in our data as a data frame.

20
00:01:57,220 --> 00:02:01,570
Next, we're going to check the head to remind ourselves what the data looks like.

21
00:02:07,730 --> 00:02:12,860
Next, we're going to filter out Google's close prices and assign this to a variable named GOOG.

22
00:02:14,430 --> 00:02:18,840
Note that you also have the option of modeling the log prices instead. For this lecture,

23
00:02:18,870 --> 00:02:23,070
we're going to work with the non-logged prices, but you're encouraged to try both on your own.

24
00:02:25,420 --> 00:02:30,570
Next, we're going to plot Google's close prices so that we know what auto_arima is trying to model.

25
00:02:36,340 --> 00:02:43,010
Next, we're going to split our data into train and test sets. We'll use 30 days as the forecast horizon.
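
As a rough sketch of this split, assuming the prices live in a pandas Series indexed by date: the series below is synthetic, and the names `prices` and `Ntest` are placeholders rather than the names used in the lecture notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a close-price series (not real market data).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=200)
prices = pd.Series(100 + rng.normal(0, 1, size=200).cumsum(), index=dates)

Ntest = 30  # forecast horizon, in trading days

# Keep time order: everything before the last 30 rows is train, the rest is test.
train = prices.iloc[:-Ntest]
test = prices.iloc[-Ntest:]
```

The split is by position rather than at random, since shuffling a time series would leak future values into the training set.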

26
00:02:46,180 --> 00:02:52,390
Next, let's call the function auto_arima. Note that we pass in seasonal=False, since

27
00:02:52,390 --> 00:02:56,680
if we pass in seasonal=True, then we have to pass in the seasonal period.

28
00:02:56,830 --> 00:02:58,960
And of course, it's not clear what that is.

29
00:02:59,620 --> 00:03:01,300
OK, so let's run this.

30
00:03:16,010 --> 00:03:17,870
Next, let's run a model summary.

31
00:03:22,680 --> 00:03:26,820
OK, so we see that the best model found is an ARIMA(3, 1, 0).

32
00:03:27,330 --> 00:03:28,680
So does that make sense?

33
00:03:29,280 --> 00:03:34,230
The differencing order makes a lot of sense, because as you've seen, returns are pretty stationary.

34
00:03:34,890 --> 00:03:40,020
Although the first difference of the raw price is not exactly the return, we can imagine that they

35
00:03:40,020 --> 00:03:41,760
are related. To

36
00:03:41,760 --> 00:03:45,130
be sure, you might want to plot the first difference and check it yourself.
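
To make the relationship concrete, here is a small sketch with made-up prices. It assumes pandas and uses `diff` for the first difference and `pct_change` for the simple return. The return is just the first difference divided by the previous price.

```python
import pandas as pd

prices = pd.Series([100.0, 102.0, 101.0, 105.0])  # made-up values

first_diff = prices.diff()     # p_t - p_{t-1}; the first row is NaN
returns = prices.pct_change()  # (p_t - p_{t-1}) / p_{t-1}; the first row is NaN

# The return equals the first difference scaled by the previous price.
scaled = first_diff / prices.shift(1)
```

Calling `first_diff.plot()` on a real price series is one way to check by eye whether the differenced series looks stationary.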

37
00:03:46,500 --> 00:03:50,460
The next thing we see is that the autoregressive component has order three.

38
00:03:50,910 --> 00:03:52,220
That kind of makes sense.

39
00:03:52,620 --> 00:03:57,380
It says that the last three days of data might be useful in predicting the next return.

40
00:04:00,760 --> 00:04:03,700
OK, so let's look at model.get_params.

41
00:04:05,150 --> 00:04:09,960
This provides some interesting information about the model, but not much we didn't already know.

42
00:04:10,610 --> 00:04:14,950
The important part of this is that it gives us a nice way to retrieve the order of the model.

43
00:04:15,530 --> 00:04:20,350
As you recall, we need to know this, since it affects which days we can make predictions for.

44
00:04:20,870 --> 00:04:26,420
For example, if this is equal to one, then we can't make a prediction on day one, because differencing

45
00:04:26,420 --> 00:04:28,790
always makes the first row a NaN value.

46
00:04:32,720 --> 00:04:35,920
OK, so next, we're going to write a function called plot_result.

47
00:04:36,530 --> 00:04:39,990
This function is very similar to the ones we wrote for the previous scripts.

48
00:04:40,700 --> 00:04:47,060
Basically, the point is to plot the data itself, the fitted values for the in-sample data, the forecast

49
00:04:47,060 --> 00:04:49,990
for the out-of-sample data, and the confidence bounds.

50
00:04:50,630 --> 00:04:57,320
So as input into this function, we take in a trained model, the full data, the train data, and the test

51
00:04:57,320 --> 00:04:57,800
data.

52
00:04:58,610 --> 00:05:00,080
The full data is just the data

53
00:05:00,080 --> 00:05:02,510
before we split it into train and test.

54
00:05:03,800 --> 00:05:08,730
The first thing we do in this function is to make use of the get_params function we just looked at.

55
00:05:09,290 --> 00:05:10,450
This is a dictionary.

56
00:05:10,700 --> 00:05:15,600
So first we look at the key 'order', which gives us the order of the fitted ARIMA.

57
00:05:16,400 --> 00:05:20,540
Next, we grab the second component, which gives us back the differencing order.
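
A minimal sketch of this lookup, using a plain dictionary as a hypothetical stand-in for what model.get_params() returns (a real fitted model returns more keys than shown, and `n_train` is an assumed length):

```python
# Hypothetical stand-in for the dictionary returned by model.get_params().
params = {"order": (3, 1, 0)}

# The order is a (p, d, q) tuple; the second component is the differencing order.
p, d, q = params["order"]

# With d = 1 the first training row has no fitted value,
# so in-sample predictions only line up with rows d, d+1, ...
n_train = 170  # assumed number of training rows
valid_positions = range(d, n_train)
```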

58
00:05:20,640 --> 00:05:26,670
Next, we call model.predict_in_sample to get the train predictions.

59
00:05:27,260 --> 00:05:34,760
Note that the starting value is d. Next, we call model.predict to get the out-of-sample predictions

60
00:05:34,790 --> 00:05:36,680
along with the confidence bounds.

61
00:05:37,850 --> 00:05:39,880
Next, we call our plotting functions.

62
00:05:40,430 --> 00:05:42,030
First we plot the full data.

63
00:05:42,260 --> 00:05:44,000
This will be given the label data.

64
00:05:44,900 --> 00:05:46,760
Next, we plot the train predictions.

65
00:05:47,000 --> 00:05:48,710
This will be given the label fitted.

66
00:05:49,370 --> 00:05:51,290
Next, we plot the test predictions.

67
00:05:51,710 --> 00:05:53,810
This will be given the label forecast.

68
00:05:54,410 --> 00:05:58,400
Next, we call the function fill_between to draw the confidence bounds.

69
00:05:59,090 --> 00:06:02,210
Lastly, we call legend to show the legend.

70
00:06:04,140 --> 00:06:05,220
So let's run this.

71
00:06:07,790 --> 00:06:10,370
And let's test out our plot_result function.

72
00:06:14,390 --> 00:06:17,790
So as you can see, the fitted model looks pretty good.

73
00:06:18,410 --> 00:06:21,590
Of course, you shouldn't trust a plot like this because it's very small.

74
00:06:22,330 --> 00:06:27,800
Unfortunately, it's too small to see whether or not our forecast is good, which is the part we actually

75
00:06:27,800 --> 00:06:28,410
care about.

76
00:06:33,710 --> 00:06:39,170
So next, we're going to write a function called plot_test. The point of this function is only to

77
00:06:39,170 --> 00:06:40,460
plot the test period.

78
00:06:41,030 --> 00:06:46,010
By seeing it up close, we'll get a better idea of how well our model actually performs.

79
00:06:46,610 --> 00:06:52,430
So in this function, we only take in two arguments: a fitted model and the test data frame.

80
00:06:54,060 --> 00:06:59,280
Inside the function, we call model.predict to get the predictions and the confidence bounds.

81
00:07:00,030 --> 00:07:05,310
Next, we plot the true test data along with the forecast and then the confidence bounds.

82
00:07:05,850 --> 00:07:10,190
Actually, this is the same code from earlier, so hopefully you're comfortable with it by now.

83
00:07:16,660 --> 00:07:19,390
Next, we test out our new function, plot_test.

84
00:07:23,060 --> 00:07:24,080
So what do we see?

85
00:07:24,920 --> 00:07:28,150
Well, we see that our forecast is not actually that good.

86
00:07:28,820 --> 00:07:31,260
Remember, this is over 30 trading days.

87
00:07:32,150 --> 00:07:36,390
One nice thing about this is that it does seem to capture the average quite well.

88
00:07:37,070 --> 00:07:42,550
In addition, the true price seems to stay within the confidence bounds for the most part.

89
00:07:44,060 --> 00:07:48,770
Of course, it's worth asking whether a model like this could actually predict what will happen over

90
00:07:48,770 --> 00:07:50,510
the next 30 trading days.

91
00:07:51,410 --> 00:07:54,470
If Elon Musk goes on Twitter and starts making crazy tweets,

92
00:07:54,680 --> 00:07:57,830
I don't think models like this will be able to predict that.

93
00:08:05,210 --> 00:08:10,940
OK, so as you know, one important question we have to ask at this point is, do these models perform

94
00:08:10,940 --> 00:08:12,980
better than the naive forecast?

95
00:08:13,460 --> 00:08:15,090
Is using these models worth it?

96
00:08:15,860 --> 00:08:19,160
So let's define the RMSE function, which you've seen before.

97
00:08:22,080 --> 00:08:28,950
Next, let's print out the RMSE for our ARIMA model and for the naive forecast. Note that for the

98
00:08:28,950 --> 00:08:34,590
naive forecast, the train prediction can simply be a single value, which is the final value of the

99
00:08:34,590 --> 00:08:35,530
train series.

100
00:08:36,030 --> 00:08:42,420
This is because pandas Series and NumPy arrays broadcast during subtraction, so you can subtract a scalar from an

101
00:08:42,420 --> 00:08:43,680
array or vice versa.
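
Here is a small sketch of that comparison with made-up numbers. The `rmse` signature below is an assumption and may not match the one defined in the lecture notebook. The point is that the naive prediction can stay a scalar and NumPy broadcasts it against the test array.

```python
import numpy as np

def rmse(y, t):
    """Root mean squared error between predictions y and targets t."""
    return np.sqrt(np.mean((np.asarray(t) - np.asarray(y)) ** 2))

train = np.array([10.0, 11.0, 12.0])  # made-up values
test = np.array([13.0, 11.0, 12.0])

# Naive forecast: repeat the last training value over the whole horizon.
naive_prediction = train[-1]               # a scalar
naive_rmse = rmse(naive_prediction, test)  # broadcasting handles the shapes

# Same result if we build the full array explicitly.
explicit_rmse = rmse(np.full(test.shape, train[-1]), test)
```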

102
00:08:45,240 --> 00:08:45,600
All right.

103
00:08:45,600 --> 00:08:46,740
So what do we see?

104
00:08:48,090 --> 00:08:53,250
Well, we see that although we've done a lot of hard work to train this model, it is, in fact the

105
00:08:53,250 --> 00:08:55,350
naive forecast that wins.

106
00:08:59,600 --> 00:09:03,260
OK, so next, we're going to do the same thing for Apple stocks.

107
00:09:08,450 --> 00:09:14,720
OK, so note that for Apple, there was a big dip in the price over the last period. It's worth asking

108
00:09:14,720 --> 00:09:19,430
yourself, is such a drastic event predictable by an ARIMA model?

109
00:09:20,480 --> 00:09:25,730
Remember that these models tend to follow patterns. If there's a trend, it follows the trend.

110
00:09:25,910 --> 00:09:28,600
If there's seasonality, it follows the seasonality.

111
00:09:29,240 --> 00:09:34,850
But I think intuition tells us that models like these do not predict anomalous events.

112
00:09:35,180 --> 00:09:40,600
For example, a stock like this rising upwards very quickly and then dropping off all of a sudden.

113
00:09:41,060 --> 00:09:42,350
That's not really a pattern.

114
00:09:42,380 --> 00:09:44,210
That's more like the opposite of a pattern.

115
00:09:45,260 --> 00:09:49,250
OK, so next, we're going to split our data into train and test.

116
00:09:53,060 --> 00:09:57,630
Next, we're going to call auto_arima again. Note that seasonal is False.

117
00:09:58,160 --> 00:09:59,240
So let's run this.

118
00:10:13,070 --> 00:10:15,050
OK, so here's the model summary.

119
00:10:18,980 --> 00:10:24,710
Again, the fact that the differencing order is one is not surprising. What is kind of surprising

120
00:10:24,710 --> 00:10:27,580
is that this model is very different from the previous model.

121
00:10:28,070 --> 00:10:31,100
Here we have p equals two and q equals two.

122
00:10:31,520 --> 00:10:34,520
Earlier we had p equals three and q equals zero.

123
00:10:40,620 --> 00:10:42,900
Next, we're going to call the function plot_result.

124
00:10:46,870 --> 00:10:52,780
So as you can see, even with this tiny plot, it's obvious that the true data goes way outside the

125
00:10:52,780 --> 00:10:53,850
confidence bounds.

126
00:10:54,440 --> 00:10:57,820
However, I think this makes sense in terms of the model we're using.

127
00:10:58,540 --> 00:11:01,730
The model seems to just want to carry on the existing trend.

128
00:11:02,320 --> 00:11:08,080
Philosophically, I think that's what most models would do. In the context of pattern recognition and

129
00:11:08,080 --> 00:11:08,920
machine learning,

130
00:11:09,070 --> 00:11:10,330
that makes total sense.

131
00:11:16,730 --> 00:11:22,250
OK, so next, let's call the function plot_test so we can see the forecast period up close.

132
00:11:27,550 --> 00:11:31,100
All right, and we see that, in fact, the prediction is pretty far off.

133
00:11:31,810 --> 00:11:35,380
However, note that in the short term, the prediction is not that bad.

134
00:11:35,950 --> 00:11:40,530
Over the first few days, the true values stay within the confidence bounds for the most part.

135
00:11:45,270 --> 00:11:50,220
Now, of course, we still have to check the root mean squared error against the naive forecast.

136
00:11:53,450 --> 00:11:57,170
Again, the naive forecast performs better than the ARIMA.

137
00:12:01,520 --> 00:12:07,520
Next, we're going to do the same exercise, but for the IBM stock prices. Again, let's start by plotting

138
00:12:07,520 --> 00:12:08,140
the data.

139
00:12:14,460 --> 00:12:17,040
Next, let's split the data into train and test.

140
00:12:20,970 --> 00:12:22,950
Next, let's call auto_arima.

141
00:12:29,530 --> 00:12:31,490
Next, let's check model summary.

142
00:12:35,390 --> 00:12:41,220
OK, so what do we see? In fact, for IBM, the best fitting model is a random walk.

143
00:12:41,660 --> 00:12:42,590
What does this mean?

144
00:12:43,280 --> 00:12:49,970
It means that IBM stock returns are completely unpredictable, given lagged values or lagged errors.

145
00:12:54,280 --> 00:12:55,930
Next, let's call plot_result.

146
00:12:58,610 --> 00:13:00,750
OK, so nothing out of the ordinary.

147
00:13:01,220 --> 00:13:05,210
It does look like the confidence bounds enclose the true values in the forecast.

148
00:13:09,870 --> 00:13:11,610
Next, let's call plot_test.

149
00:13:14,160 --> 00:13:19,770
Again, we see that the confidence bounds enclose the true values. In fact, this prediction doesn't

150
00:13:19,770 --> 00:13:25,110
look that bad. The actual price crosses through the prediction line multiple times.

151
00:13:28,600 --> 00:13:33,130
So what happens when we check the root mean square error against the naive forecast?

152
00:13:35,890 --> 00:13:40,900
Well, we see that they are the same. Since this is a random walk model with no drift,

153
00:13:41,230 --> 00:13:46,090
the forecast is just the last known value, and it's the same as the naive forecast.
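
To see why the two coincide, here is a short sketch in standard notation. The symbols are ours rather than anything from the lecture code: T is the last training index, h is the number of steps ahead, and the noise term is assumed to have zero mean.

```latex
y_t = y_{t-1} + \varepsilon_t, \qquad \mathbb{E}[\varepsilon_t] = 0
\quad\Longrightarrow\quad
\hat{y}_{T+h \mid T} = y_T + \sum_{i=1}^{h} \mathbb{E}[\varepsilon_{T+i}] = y_T
\quad \text{for all } h \ge 1
```

Every future noise term has expectation zero, so the forecast at any horizon collapses to the last observed value, which is exactly the naive forecast.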

154
00:13:49,220 --> 00:13:53,930
OK, so next, let's try Starbucks. Again, let's start by looking at the plot.

155
00:13:59,190 --> 00:14:02,010
Next, let's split the data into train and test.

156
00:14:05,760 --> 00:14:07,950
Next, let's call auto_arima.

157
00:14:15,770 --> 00:14:17,810
Next, let's check the model summary.

158
00:14:22,420 --> 00:14:29,650
OK, so here is another random walk, ARIMA(0, 1, 0). Note, however, that this is a random walk

159
00:14:29,650 --> 00:14:33,930
with drift, since the model fits with a non-zero intercept term.
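
As a rough illustration of what the drift does to the forecast, the sketch below uses made-up prices and estimates the drift as the average one-step change in the training data. That is a simplifying assumption, and the intercept a fitted model reports may differ slightly.

```python
import numpy as np

train = np.array([50.0, 51.0, 53.0, 52.0, 56.0])  # made-up prices
h = np.arange(1, 4)                               # forecast steps 1, 2, 3

# Simple drift estimate: the mean of the first differences.
drift = np.diff(train).mean()

# Random walk with drift: last value plus h times the drift (a straight line).
drift_forecast = train[-1] + h * drift

# Random walk with zero drift: the naive forecast, flat at the last value.
naive_forecast = np.full(h.shape, train[-1])
```

With a positive drift the forecast keeps climbing, so if the price falls during the test period, the flat naive forecast ends up closer.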

160
00:14:37,520 --> 00:14:39,320
Next, let's check plot_result.

161
00:14:42,250 --> 00:14:45,550
OK, so it's a bit hard to see what's going on during the forecast.

162
00:14:49,860 --> 00:14:51,960
Next, let's check plot_test.

163
00:14:56,070 --> 00:15:01,770
All right, so this doesn't look too bad. It's decently accurate for the first few values, and then

164
00:15:01,770 --> 00:15:07,440
it overestimates the price when the price drops, although most of the time the price does stay within

165
00:15:07,440 --> 00:15:08,670
the confidence bounds.

166
00:15:09,330 --> 00:15:13,410
So the lesson is, don't ignore the negative side of your confidence bounds.

167
00:15:13,770 --> 00:15:17,700
Those values are perfectly achievable and they result in a loss of money.

168
00:15:22,720 --> 00:15:25,270
So what happens when we check the root mean square error?

169
00:15:29,240 --> 00:15:35,240
Well, although we detected that this was a random walk with a non-zero drift, it turns out that assuming

170
00:15:35,240 --> 00:15:40,450
a random walk with zero drift would have been better. The naive forecast wins again.