1
00:00:11,050 --> 00:00:16,480
So it's really inevitable when you hear of a new time series model to ask the question, how will this

2
00:00:16,480 --> 00:00:20,090
model perform on stock predictions? In this lecture,

3
00:00:20,110 --> 00:00:21,570
we will answer this question.

4
00:00:22,450 --> 00:00:28,000
I'll also point out some very bad mistakes I've come across online, which are the top hits on Google

5
00:00:28,000 --> 00:00:31,230
when you search for how to do stock predictions with Prophet.

6
00:00:31,780 --> 00:00:35,520
As I recall, one of these authors even claims to have a Ph.D.

7
00:00:36,250 --> 00:00:37,720
So take this as a lesson.

8
00:00:37,810 --> 00:00:43,660
Just because someone says they have a Ph.D. or works at such and such company does not imply that they

9
00:00:43,660 --> 00:00:44,560
know what they are doing.

10
00:00:45,520 --> 00:00:51,040
Moreover, this is yet another example of why you should not trust the top hits on a search engine,

11
00:00:51,580 --> 00:00:56,740
especially those that appear on the most popular blogging platforms, which I will not name but are

12
00:00:56,860 --> 00:00:57,640
pretty obvious if you look.

13
00:00:57,640 --> 00:01:02,620
So we'll begin by installing Prophet once again.

14
00:01:14,490 --> 00:01:16,320
The next step is to download our data.

15
00:01:24,610 --> 00:01:27,250
The next step is to import the usual libraries.

16
00:01:31,730 --> 00:01:35,330
The next step is to load in our data using pd.read_csv.

17
00:01:39,980 --> 00:01:42,460
The next step is to grab the close prices for GOOG.

18
00:01:46,840 --> 00:01:51,400
The next step is to call goog.head() to remind ourselves what this data frame looks like.

19
00:01:57,620 --> 00:02:03,980
The next step is to rename the close column to y, as required by Prophet. We'll also create a new column

20
00:02:03,980 --> 00:02:07,520
called ds from the index, also required by Prophet.
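
As a minimal sketch of this renaming step, assuming the close prices sit in a pandas data frame with a datetime index (the variable name `goog` and all values here are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the close-price data frame; the values are made up.
goog = pd.DataFrame(
    {"close": [100.0, 101.5, 99.8]},
    index=pd.to_datetime(["2018-01-02", "2018-01-03", "2018-01-04"]),
)

# Prophet expects the target in a column named "y" ...
goog = goog.rename(columns={"close": "y"})
# ... and the timestamps in a regular column named "ds", not in the index.
goog["ds"] = goog.index

print(goog.head())
```

After this, the frame has exactly the two columns Prophet looks for when fitting.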

21
00:02:11,930 --> 00:02:17,480
The next step is to run through the same lines of code we normally do. Note that I haven't used the entire

22
00:02:17,480 --> 00:02:20,870
time series to build this model, only the last few years.

23
00:02:21,830 --> 00:02:27,440
Also note that for the forecast I've specified 365 days, which is more than one

24
00:02:27,440 --> 00:02:28,640
year in trading time.

25
00:02:28,910 --> 00:02:33,510
But recall that the forecast gives you all the dates without skipping non-trading days.

26
00:02:33,950 --> 00:02:36,710
So basically this will be a one year forecast.
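
To make the calendar-day versus trading-day point concrete, here is a small sketch that counts weekdays inside a 365-day window. The start date and the helper name `count_weekdays` are hypothetical, and market holidays are ignored, so the true number of trading days would be somewhat lower.

```python
from datetime import date, timedelta

def count_weekdays(start: date, n_days: int) -> int:
    """Count Monday-Friday dates among n_days consecutive calendar days."""
    days = (start + timedelta(days=k) for k in range(n_days))
    return sum(d.weekday() < 5 for d in days)

# Hypothetical forecast start date, chosen only for illustration.
start = date(2018, 2, 8)
weekdays = count_weekdays(start, 365)
print(weekdays, "weekdays out of 365 calendar days")
```

Because the forecast frame includes weekends, 365 rows cover about one calendar year rather than 365 trading sessions.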

27
00:02:45,840 --> 00:02:49,470
OK, so looking at this plot, we see that it's not really a great fit.

28
00:02:50,040 --> 00:02:55,590
Moreover, notice that there seems to be some high-frequency seasonal component in the forecast, which

29
00:02:55,590 --> 00:02:58,820
is weird because we know that stocks shouldn't look like this.

30
00:02:59,640 --> 00:03:05,370
If we look at the info printout, we see that daily seasonality has been disabled, which makes sense,

31
00:03:05,370 --> 00:03:08,040
but weekly seasonality has remained intact.

32
00:03:08,670 --> 00:03:13,890
Thus, we can assume that these weird little peaks are due to a false weekly seasonal component.

33
00:03:15,420 --> 00:03:20,190
Also, remember that Prophet is only capable of producing linear and logistic trends.

34
00:03:20,700 --> 00:03:26,520
Thus, this forecast simply assumes that the stock price will continue to go down in the future, which

35
00:03:26,520 --> 00:03:28,200
of course may not be sensible.

36
00:03:32,130 --> 00:03:34,080
The next step is to plot the components.

37
00:03:39,150 --> 00:03:43,950
So one interesting thing that should catch your eye is that the prediction interval for the trend

38
00:03:43,950 --> 00:03:49,110
gets big very rapidly. This suggests that our model is not really sure of this trend.

39
00:03:52,800 --> 00:03:58,380
The next thing to notice is that there is, in fact, a weekly seasonal component, which makes no sense.

40
00:03:58,920 --> 00:04:04,590
We see that the values increase on the weekends, which cannot be the case since the market is not open

41
00:04:04,590 --> 00:04:05,610
on the weekends.

42
00:04:09,300 --> 00:04:12,240
The final thing to look at is the yearly seasonal component.

43
00:04:12,900 --> 00:04:15,220
Now, this might be real or it might not be.

44
00:04:15,480 --> 00:04:20,290
However, note that the magnitude of this component is much smaller compared to the trend.

45
00:04:20,850 --> 00:04:25,770
So although there seems to be an increase from April to August, this would still be overtaken by the

46
00:04:25,770 --> 00:04:26,460
trend.

47
00:04:30,190 --> 00:04:35,660
OK, so recall that for the previous model, we simply used the default settings for instantiating Prophet.

48
00:04:36,340 --> 00:04:41,710
What I want to show you in this next example is what not to do, which can be found on essentially every

49
00:04:41,710 --> 00:04:46,360
blog article on this topic. As with blog articles on LSTMs,

50
00:04:46,360 --> 00:04:50,260
my suspicion is that these are people who have just all copied from each other.

51
00:04:50,920 --> 00:04:54,700
So the difference here is that we are setting daily_seasonality to True.

52
00:04:55,420 --> 00:04:59,400
Of course, this makes absolutely zero sense because our data is daily.

53
00:05:00,070 --> 00:05:05,890
Remember that weekly seasonality is for detecting repeating patterns from one week to the next, which

54
00:05:05,890 --> 00:05:08,080
requires a sub-weekly time series.

55
00:05:08,530 --> 00:05:14,170
Yearly seasonality is for detecting patterns that repeat from one year to the next, which requires

56
00:05:14,170 --> 00:05:15,870
a sub-yearly time series.

57
00:05:16,930 --> 00:05:22,930
Thus, in order to have daily seasonality, we must have data which is sub-daily, which we do not.
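
The rule just stated can be sketched as a tiny helper. The function name `seasonality_flags` is hypothetical; the dictionary keys mirror the names of Prophet's constructor arguments, so the result could in principle be passed on as keyword arguments.

```python
def seasonality_flags(spacing_days: float) -> dict:
    """Allow a seasonal component only when samples are finer than its period.

    spacing_days is the typical gap between consecutive observations, in days.
    """
    return {
        "daily_seasonality": spacing_days < 1,     # needs sub-daily data
        "weekly_seasonality": spacing_days < 7,    # needs sub-weekly data
        "yearly_seasonality": spacing_days < 365,  # needs sub-yearly data
    }

daily_data = seasonality_flags(1.0)      # one observation per day
hourly_data = seasonality_flags(1 / 24)  # one observation per hour
print(daily_data)
print(hourly_data)
```

This is only a necessary condition. A flag that comes out True may still be worth disabling for other reasons, such as markets being closed on weekends.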

58
00:05:30,130 --> 00:05:34,760
However, note that when we run this, we do, incidentally, end up getting a smoother plot.

59
00:05:35,230 --> 00:05:39,790
Of course, this is still not satisfactory because it's based on false presumptions.

60
00:05:43,260 --> 00:05:45,270
The next step is to plot the components.

61
00:05:50,640 --> 00:05:55,680
OK, so most of this we've seen, but the interesting part should be the daily seasonal component.

62
00:05:56,370 --> 00:06:00,960
Now, you should be very worried when you see this because it means your model is doing something which

63
00:06:00,960 --> 00:06:01,830
is impossible.

64
00:06:02,490 --> 00:06:07,980
It's showing us that there are periodic changes within the day, which obviously our model has no chance

65
00:06:07,980 --> 00:06:11,280
of knowing because our data only has daily granularity.

66
00:06:14,870 --> 00:06:19,310
The next step is to create another model, but this time we're going to do what we should have done,

67
00:06:19,550 --> 00:06:22,280
which is to set weekly_seasonality to False.

68
00:06:22,820 --> 00:06:25,760
This will help us avoid any false weekly patterns.

69
00:06:33,000 --> 00:06:38,370
OK, so notice how this ends up giving us a plot without those very small weekly jumps that we saw the

70
00:06:38,370 --> 00:06:39,120
first time.

71
00:06:43,320 --> 00:06:45,420
The next step is to plot the components.

72
00:06:48,570 --> 00:06:54,180
So essentially, this is very similar to the previous plots, except that now we don't have any false

73
00:06:54,180 --> 00:06:57,390
weekly patterns nor any false daily patterns.

74
00:07:01,470 --> 00:07:08,040
In this next step, we're going to perform cross-validation. The goal of this is not to do cross-validation

75
00:07:08,040 --> 00:07:11,790
per se, but to compare the model to the baseline forecast.

76
00:07:12,330 --> 00:07:15,750
As you recall, I've stated multiple times that this is essential

77
00:07:15,930 --> 00:07:21,660
if you want to claim that your model can predict stock prices. Clearly, it's not very good if it cannot

78
00:07:21,660 --> 00:07:23,850
beat simply predicting the last value.

79
00:07:26,530 --> 00:07:29,830
OK, so we're going to start by doing all the necessary imports.

80
00:07:34,600 --> 00:07:38,710
The next step is to create a model, setting weekly_seasonality to False.

81
00:07:42,850 --> 00:07:44,440
The next step is to fit our model.

82
00:07:50,960 --> 00:07:53,160
The next step is to call cross_validation.

83
00:07:54,020 --> 00:07:59,870
Now you'll notice that I've set both period and horizon to five. You can feel free to try other values,

84
00:07:59,870 --> 00:08:02,450
but you'll probably see that you get the same results.

85
00:08:02,960 --> 00:08:06,480
Note that I've also tried this with 15 and 30, and with 30 and 60.

86
00:08:06,920 --> 00:08:09,070
Either way, the result has not changed.
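
To illustrate what period and horizon control, here is a simplified rolling-origin sketch. It is not Prophet's actual implementation: the helper `rolling_cutoffs`, the dates, the `initial_days` training window, and the choice of days as the unit are all assumptions made for illustration.

```python
from datetime import date, timedelta

def rolling_cutoffs(first: date, last: date, initial_days: int,
                    period_days: int, horizon_days: int) -> list:
    """Simplified rolling-origin cutoffs: step back from the end of the data."""
    cutoffs = []
    # The latest cutoff must leave a full horizon of known data after it.
    cutoff = last - timedelta(days=horizon_days)
    # Keep at least initial_days of training history before each cutoff.
    while cutoff >= first + timedelta(days=initial_days):
        cutoffs.append(cutoff)
        cutoff -= timedelta(days=period_days)
    return sorted(cutoffs)

cutoffs = rolling_cutoffs(
    first=date(2018, 1, 1), last=date(2018, 1, 31),
    initial_days=10, period_days=5, horizon_days=5,
)
for c in cutoffs:
    print("train up to", c, "then forecast through", c + timedelta(days=5))
```

Each cutoff marks where training data ends; the period is the spacing between cutoffs and the horizon is how far ahead each forecast reaches. Note that a cutoff computed this way can land on a weekend.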

87
00:08:23,030 --> 00:08:28,120
The next step is to call the head function to remind ourselves what the df_cv data frame looks like.

88
00:08:28,730 --> 00:08:34,090
We're going to end up doing a little hack in order to evaluate the naive forecast based on this format.

89
00:08:44,540 --> 00:08:50,120
OK, so the next step is to create a data frame called naive, which we're going to initialize as a copy

90
00:08:50,120 --> 00:08:55,070
of df_cv. Note that I've only taken four out of the six columns.

91
00:08:55,970 --> 00:09:01,370
The basic idea is we're going to use this data frame with the same format, but we are going to replace

92
00:09:01,370 --> 00:09:04,130
the yhat column with the naive forecast.

93
00:09:04,730 --> 00:09:10,380
Now, because this data frame has the same format as the previous one, we can use all the same functions.

94
00:09:10,850 --> 00:09:16,010
So we can use Prophet's functionality to get the rolling sMAPE, which is what we'll use to compare the

95
00:09:16,010 --> 00:09:17,030
two approaches.

96
00:09:22,410 --> 00:09:28,200
So the next step is a tiny bit complex, but basically we're going to implement the naive forecast.

97
00:09:29,130 --> 00:09:34,350
As you recall, the df_cv data frame contains two columns which store a timestamp.

98
00:09:34,890 --> 00:09:41,160
The cutoff column is where the forecast starts and the ds column is the timestamp we are making a forecast

99
00:09:41,160 --> 00:09:41,700
for.

100
00:09:42,300 --> 00:09:45,990
Thus, every forecast should come from its cutoff date.

101
00:09:47,370 --> 00:09:53,250
Now, unfortunately, not every cutoff date exists in the time series since, as you recall, we only

102
00:09:53,250 --> 00:09:55,010
have prices for trading days.

103
00:09:55,710 --> 00:10:01,530
So the basic strategy is we're going to look backwards to get the last known price and use that as the

104
00:10:01,530 --> 00:10:02,190
prediction.

105
00:10:03,750 --> 00:10:08,320
OK, so to start, we're going to create an empty array of zeros called naive storage.

106
00:10:08,700 --> 00:10:10,850
This is where we will store the predictions.

107
00:10:11,880 --> 00:10:17,060
The next step is to loop through every row of the naive data frame. Inside the loop,

108
00:10:17,070 --> 00:10:18,770
we're going to grab the cutoff date.

109
00:10:19,620 --> 00:10:24,250
The next step is to check whether or not this cutoff date exists in the data frame.

110
00:10:24,900 --> 00:10:29,640
If it does not, then we will enter this while loop where we decrement cutoff by one day.

111
00:10:30,240 --> 00:10:34,840
So this loop will only quit when we have found a date which actually has a price.

112
00:10:35,430 --> 00:10:41,190
Once we've found this date, we assign the value in the y column to our naive storage array at index i.

113
00:10:42,120 --> 00:10:46,510
Once we are outside the loop, we can then assign naive storage back to yhat.

114
00:10:46,950 --> 00:10:49,680
And at this point we will have our naive forecast.
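
The loop described above can be sketched as follows, using toy data in place of the real price history and cross-validation output. The variable names `prices` and `naive_storage`, the three-column frame, and all values are assumptions for illustration; the real frame has more columns.

```python
import numpy as np
import pandas as pd

# Toy price history indexed by trading day (made-up values; weekend missing).
prices = pd.DataFrame(
    {"y": [10.0, 11.0, 12.0, 13.0]},
    index=pd.to_datetime(["2018-01-04", "2018-01-05", "2018-01-08", "2018-01-09"]),
)

# Toy stand-in for the cross-validation output: forecast date, cutoff, actual.
naive = pd.DataFrame({
    "ds": pd.to_datetime(["2018-01-08", "2018-01-09", "2018-01-09"]),
    "cutoff": pd.to_datetime(["2018-01-07", "2018-01-07", "2018-01-08"]),
    "y": [12.0, 13.0, 13.0],
})

naive_storage = np.zeros(len(naive))
for i, cutoff in enumerate(naive["cutoff"]):
    # Step back one day at a time until we land on a date that has a price.
    # (This assumes every cutoff is on or after the first date in prices.)
    while cutoff not in prices.index:
        cutoff = cutoff - pd.Timedelta(days=1)
    naive_storage[i] = prices.loc[cutoff, "y"]

# Overwrite the model's predictions with the last known price at each cutoff.
naive["yhat"] = naive_storage
print(naive)
```

Here the Sunday cutoff has no price, so the loop walks back to the preceding Friday and uses that day's value.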

115
00:10:55,400 --> 00:11:01,700
OK, so the next step is to call performance_metrics on df_cv, which contains our model predictions.

116
00:11:02,210 --> 00:11:07,010
Since we'd like to summarize this in a single number, we're going to take the mean of the sMAPE.

117
00:11:11,080 --> 00:11:16,390
Now, of course, this number isn't useful by itself. It's only useful relative to the benchmark.

118
00:11:18,460 --> 00:11:22,180
So the next step is to call the same function on the naive data frame.

119
00:11:26,350 --> 00:11:30,570
And as you can see, it turns out that the naive forecast wins yet again.
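
For reference, here is a hand-rolled version of the comparison, assuming the metric is the symmetric mean absolute percentage error with the definition given in the docstring. The numbers are made up purely to show the mechanics and are not results from the lecture; the notebook itself relies on Prophet's rolling metric rather than this helper.

```python
def smape(actual, predicted):
    """Symmetric MAPE: mean of 2*|y - yhat| / (|y| + |yhat|)."""
    terms = [
        2 * abs(y - yhat) / (abs(y) + abs(yhat))
        for y, yhat in zip(actual, predicted)
    ]
    return sum(terms) / len(terms)

# Made-up numbers, only to show how the two scores are compared.
actual = [100.0, 102.0, 101.0]
model_forecast = [90.0, 95.0, 110.0]
naive_forecast = [100.0, 100.0, 100.0]

print("model:", smape(actual, model_forecast))
print("naive:", smape(actual, naive_forecast))
```

Lower is better, and a model is only interesting if its score is below that of the naive baseline on the same rows.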

120
00:11:34,180 --> 00:11:37,620
The next step is to make a plot of the cross-validation results.

121
00:11:43,550 --> 00:11:47,530
As expected, we see that the forecast error tends to grow over time.

122
00:11:51,910 --> 00:11:56,920
Now, you might wonder whether or not this will work if we use Prophet on the log price instead of the

123
00:11:56,920 --> 00:11:57,530
price.

124
00:11:58,120 --> 00:12:03,160
So in this next portion of the notebook, we're going to repeat the same steps, but using the log price

125
00:12:03,160 --> 00:12:03,850
instead.

126
00:12:06,150 --> 00:12:10,650
So the first step is to make a copy of the GOOG data frame and then compute the log of the price.
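
A minimal sketch of this copy-and-transform step, assuming the data frame already has the ds and y columns (the names `goog` and `goog_log` and the values are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the prepared data frame; values are made up.
goog = pd.DataFrame({
    "ds": pd.to_datetime(["2018-01-02", "2018-01-03", "2018-01-04"]),
    "y": [100.0, 110.0, 121.0],
})

# Work on a copy so the original prices stay untouched.
goog_log = goog.copy()
goog_log["y"] = np.log(goog_log["y"])

print(goog_log)
```

On the log scale, equal percentage moves become equal additive steps: the two 10% increases in this toy series produce identical differences in the y column.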

127
00:12:15,390 --> 00:12:20,810
The next step is to create a model, fit the model, perform cross-validation, and compute the sMAPE.

128
00:12:27,980 --> 00:12:30,530
OK, so this is the number we get for Prophet.

129
00:12:34,430 --> 00:12:39,170
The next step is to compute the naive forecast. Note that this is the same code as before.

130
00:12:44,470 --> 00:12:48,130
OK, so again, we find that the naive forecast wins.

131
00:12:52,540 --> 00:12:56,260
The final step is to make a plot of the cross-validation results.

132
00:13:01,170 --> 00:13:05,220
So, again, we see that over time, the forecast error increases.