1
00:00:11,090 --> 00:00:16,940
In this lecture, we are going to discuss how to forecast with a time series model. Now, of course,

2
00:00:16,940 --> 00:00:19,070
you might say I already know how to do this.

3
00:00:19,070 --> 00:00:20,780
I just call model.predict.

4
00:00:21,290 --> 00:00:24,640
But that is not the answer because you don't actually know what's going on

5
00:00:24,640 --> 00:00:30,440
when you call that function. The purpose of this lecture is to actually expose a common mistake,

6
00:00:30,710 --> 00:00:35,930
although obviously, if you ever want to implement your own forecasting model from scratch, these ideas

7
00:00:35,930 --> 00:00:36,640
will be helpful.

8
00:00:41,720 --> 00:00:47,340
OK, so recall that one way to think of an autoregressive model is that it's just linear regression.

9
00:00:47,960 --> 00:00:53,360
You can implement an autoregressive model yourself provided that you organize your data in the correct

10
00:00:53,360 --> 00:00:53,800
way.

11
00:00:54,650 --> 00:00:56,080
Let's recall what that is.

12
00:00:56,690 --> 00:00:59,740
Suppose that we have a time series y1 up to y10.

13
00:01:00,620 --> 00:01:03,500
Also, suppose that we want to build an AR(3) model.

14
00:01:03,500 --> 00:01:08,100
So p equals three, which means the number of columns in our dataset is three.

15
00:01:08,720 --> 00:01:12,170
Therefore, we organize our data into tables as follows.

16
00:01:12,620 --> 00:01:19,790
The first row is y1, y2 and y3. The corresponding target is y4. The second row is y2,

17
00:01:19,790 --> 00:01:23,490
y3 and y4. The corresponding target is y5.

18
00:01:23,990 --> 00:01:24,370
All right.

19
00:01:24,380 --> 00:01:25,430
So you get the idea.
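
To make that table concrete, here is a minimal sketch in plain Python. The series values are made up for illustration (each value simply equals its own index), and the names `X` and `targets` are just placeholders for this example.

```python
# Toy series standing in for y1..y10 (made-up values: y_k = k).
y = [float(k) for k in range(1, 11)]
p = 3  # AR(3): three lagged values per row

# Row i holds three consecutive values; its target is the value that follows them.
X = [y[i:i + p] for i in range(len(y) - p)]
targets = y[p:]

for row, target in zip(X, targets):
    print(row, "->", target)
```

With ten data points and p equal to three, this gives seven rows in total.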

20
00:01:30,400 --> 00:01:34,610
Now, at this point in the course, we're a little more sophisticated than we were before.

21
00:01:35,290 --> 00:01:40,360
We know that in addition to the training data, we want to have test data so that we can be sure that

22
00:01:40,360 --> 00:01:42,550
our model actually predicts the future.

23
00:01:43,270 --> 00:01:46,610
So let's say we would like the last three samples to be the test data.

24
00:01:47,530 --> 00:01:47,880
All right.

25
00:01:47,890 --> 00:01:51,820
So we call the first four rows the training data, X_train and y_train.

26
00:01:52,390 --> 00:01:57,670
Then we call the last three rows the test data, X_test and y_test. What's next?

27
00:02:02,390 --> 00:02:09,380
Then we fit our model. We instantiate a LinearRegression object, then we call model.fit(X_train,

28
00:02:09,380 --> 00:02:15,140
y_train). Then we can calculate the out-of-sample score of our model by calling

29
00:02:15,140 --> 00:02:17,210
model.score(X_test, y_test).

30
00:02:18,080 --> 00:02:22,400
We can also make future predictions using model.predict(X_test).

31
00:02:23,150 --> 00:02:25,190
Unfortunately, this is all wrong.

32
00:02:25,520 --> 00:02:27,620
I repeat, do not do this.

33
00:02:32,400 --> 00:02:38,880
So this is something I alluded to earlier in my lecture on the naive forecast. In that lecture, I said

34
00:02:38,880 --> 00:02:44,500
that one of the bad things marketers do is they write these articles about predicting stock prices with

35
00:02:44,780 --> 00:02:46,860
LSTMs that make this huge mistake.

36
00:02:47,460 --> 00:02:50,640
Well, this lecture is all about what that huge mistake is.

37
00:02:51,600 --> 00:02:57,240
As you've seen for yourself after going through this section, it's very easy to get a plot where the model

38
00:02:57,240 --> 00:03:00,300
predictions almost exactly track the true values.

39
00:03:00,990 --> 00:03:04,080
How can this be when stock prices are so hard to predict?

40
00:03:04,710 --> 00:03:10,620
Well, the reality is these models are doing nothing but making a prediction very close to the naive

41
00:03:10,620 --> 00:03:11,500
forecast.

42
00:03:12,000 --> 00:03:15,510
In other words, these models are basically just copying the last value.

43
00:03:15,810 --> 00:03:17,780
They're not learning the underlying pattern.

44
00:03:18,820 --> 00:03:24,100
The reason they look so close is because these plots typically contain so many data points that you

45
00:03:24,100 --> 00:03:25,610
don't get to see them up close.

46
00:03:26,320 --> 00:03:31,120
But how do these marketers get their plots to look so good on the train set and the test set?

47
00:03:31,870 --> 00:03:37,690
As you know, the model should never see data from the test set and therefore should have no opportunity

48
00:03:37,690 --> 00:03:39,160
to copy those values.

49
00:03:43,860 --> 00:03:50,130
Well, in fact, that's exactly where the mistake is. If we split up our data more carefully, what we

50
00:03:50,130 --> 00:03:51,480
would want to do is this.

51
00:03:52,140 --> 00:03:57,180
Let's suppose again that the last three data points are for the test set. These are the data points

52
00:03:57,180 --> 00:03:58,500
that we want to forecast.

53
00:03:58,740 --> 00:03:59,700
So that's y8,

54
00:03:59,700 --> 00:04:01,020
y9 and y10.

55
00:04:01,590 --> 00:04:07,610
y1 up to y7 belong to the train set. Well then, what's the problem with defining our X_train

56
00:04:07,620 --> 00:04:10,500
and X_test matrices as we did before?

57
00:04:11,370 --> 00:04:18,090
Well, as you can see, we are making a mistake by including true test values as inputs into the model.

58
00:04:18,750 --> 00:04:20,280
This is not allowed.
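
Here is a small sketch of that leak, again in plain Python with made-up values where each value equals its own index. It rebuilds the lagged table, takes the last three rows as the test inputs (the flawed split), and lists which of y8, y9 and y10 show up as inputs. The name `leaked` is only for this illustration.

```python
# Toy series standing in for y1..y10 (made-up values: y_k = k).
y = [float(k) for k in range(1, 11)]
p = 3

X = [y[i:i + p] for i in range(len(y) - p)]

X_test = X[-3:]            # last three rows, as in the flawed split
test_values = set(y[-3:])  # y8, y9, y10 are supposed to be unseen

# Collect every test value that appears as a model input.
leaked = sorted({v for row in X_test for v in row if v in test_values})
print(leaked)  # [8.0, 9.0]
```

So the inputs for predicting y9 and y10 already contain the true y8 and y9.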

59
00:04:25,190 --> 00:04:31,130
To see why more clearly, let's look at how we would actually compute a forecast. Suppose that

60
00:04:31,130 --> 00:04:32,180
today is day seven.

61
00:04:32,720 --> 00:04:38,310
We are trying to make a forecast for day eight, day nine and day ten. Since today is day seven,

62
00:04:38,600 --> 00:04:42,270
we can only use y1 up to y7 to make predictions.

63
00:04:42,740 --> 00:04:48,620
So, for example, if today is July seven, then obviously we can't use the values on July eight,

64
00:04:48,620 --> 00:04:52,130
July nine or July 10 because those days haven't happened yet.

65
00:04:52,970 --> 00:04:54,400
OK, so that should be fine.

66
00:04:54,830 --> 00:05:03,200
We can calculate y-hat 8 in the usual way: y-hat 8 is equal to b plus phi 1 times y7 plus phi 2 times

67
00:05:03,200 --> 00:05:05,790
y6 plus phi 3 times y5.

68
00:05:06,170 --> 00:05:07,160
That makes sense.

69
00:05:08,480 --> 00:05:09,980
Next we have to calculate

70
00:05:09,980 --> 00:05:10,710
y-hat 9.

71
00:05:11,300 --> 00:05:12,650
Now you just follow the pattern.

72
00:05:12,860 --> 00:05:19,220
So you might say y-hat 9 is equal to b plus phi 1 times y8 plus phi 2 times y7 plus

73
00:05:19,220 --> 00:05:20,720
phi 3 times y6.

74
00:05:21,290 --> 00:05:23,300
Unfortunately, this is all wrong.

75
00:05:23,990 --> 00:05:25,700
Remember, we do not know

76
00:05:25,700 --> 00:05:28,770
y8. The best thing we can do is plug in

77
00:05:28,800 --> 00:05:31,370
y-hat 8, our prediction for day eight.

78
00:05:32,180 --> 00:05:40,220
So in actuality, y-hat 9 is equal to b plus phi 1 times y-hat 8 plus phi 2 times y7 plus

79
00:05:40,220 --> 00:05:41,630
phi 3 times y6.

80
00:05:42,230 --> 00:05:47,750
We can repeat the same pattern for y-hat 10, which will depend on y-hat 9 and y-hat 8 and

81
00:05:47,750 --> 00:05:50,150
the true value on day seven, y7.
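
The recursion can be sketched in a few lines of plain Python. This assumes the intercept b and the coefficients phi have already been fitted; the function name `recursive_forecast` and the parameter values below are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def recursive_forecast(history, b, phi, steps):
    """Forecast `steps` values ahead, feeding each prediction back in as an input."""
    window = list(history[-len(phi):])  # the p most recent true values, oldest first
    forecasts = []
    for _ in range(steps):
        # phi[0] multiplies the most recent value, phi[1] the one before it, and so on.
        y_hat = b + sum(c * v for c, v in zip(phi, reversed(window)))
        forecasts.append(y_hat)
        window = window[1:] + [y_hat]  # drop the oldest value, append the prediction
    return forecasts


# Training series y1..y7 and hypothetical fitted parameters (made-up numbers).
y_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
b, phi = 0.0, [2.0, -1.0, 0.0]

print(recursive_forecast(y_train, b, phi, steps=3))  # [8.0, 9.0, 10.0]
```

Notice that the function only ever receives the training series. After the first step, the window is filled with predictions, never with true future values.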

82
00:05:55,090 --> 00:06:00,490
So just in case this isn't clear, let's summarize what's happening in pretty much all of these blog

83
00:06:00,490 --> 00:06:06,070
posts and even some paid courses that I've seen that claim to show you how to predict stock prices with

84
00:06:06,070 --> 00:06:07,180
LSTMs.

85
00:06:07,930 --> 00:06:11,460
First, they make the mistake of not splitting up the data properly.

86
00:06:12,010 --> 00:06:16,360
That is, they include test data as inputs into the forecast period.

87
00:06:16,930 --> 00:06:22,750
This isn't necessarily wrong if you make it clear that you are only trying to predict one day ahead.

88
00:06:23,320 --> 00:06:27,940
I've never seen anyone make this admission. Two, by doing this,

89
00:06:28,210 --> 00:06:33,460
all they've done is create a model that copies the previous value in a time series or very close to

90
00:06:33,460 --> 00:06:33,670
it.

91
00:06:34,420 --> 00:06:39,520
You've already seen in this course that this basically leads to the predictions being very close to

92
00:06:39,520 --> 00:06:40,810
the true time series.

93
00:06:41,440 --> 00:06:44,890
As long as you're zoomed out far enough, they're basically indistinguishable.

94
00:06:46,200 --> 00:06:53,100
And three, to bring this back to the naive forecast: generally, this would be OK if you make it clear

95
00:06:53,100 --> 00:06:55,560
that you are only trying to predict one day ahead

96
00:06:55,950 --> 00:06:59,550
and you check that your forecast is better than the naive forecast.

97
00:07:00,270 --> 00:07:03,870
Obviously, this must be done on the out-of-sample data, that is, the test set.

98
00:07:04,500 --> 00:07:07,020
These models will not beat the naive forecast.
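
As a rough sketch of that check, here is one way to compare a one-step-ahead model against the naive forecast on the test set. All of the numbers are made up, `model_preds` is a hypothetical stand-in for whatever your model outputs, and mean absolute error is just one reasonable choice of metric.

```python
def mean_abs_error(actual, predicted):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


series = [100.0, 102.0, 101.0, 104.0, 103.0, 105.0]  # made-up prices
n_test = 3

actual = series[-n_test:]
naive = series[-n_test - 1:-1]       # previous day's true value for each test day
model_preds = [100.8, 104.3, 102.9]  # hypothetical one-step-ahead model output

naive_mae = mean_abs_error(actual, naive)
model_mae = mean_abs_error(actual, model_preds)

print("naive MAE:", naive_mae)
print("model MAE:", model_mae)
print("model beats naive:", model_mae < naive_mae)
```

If the last line prints False on your own out-of-sample data, the model is not adding anything over copying the last value.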

99
00:07:11,700 --> 00:07:17,670
Note that this is obvious when you consider the API for statsmodels. As you recall, it works something

100
00:07:17,670 --> 00:07:18,310
like this.

101
00:07:18,900 --> 00:07:20,670
First, you create an ARIMA model.

102
00:07:21,090 --> 00:07:26,790
When you call the ARIMA constructor, you pass in the training data. Then you call model.fit,

103
00:07:27,030 --> 00:07:28,950
which takes in no arguments.

104
00:07:29,820 --> 00:07:34,440
Later, when you want to get your in-sample predictions, you call results.predict.

105
00:07:34,920 --> 00:07:39,450
The arguments to this are only the start and end indices, not any data.

106
00:07:40,140 --> 00:07:43,320
If you want to forecast, you call results.forecast.

107
00:07:44,040 --> 00:07:48,280
Note that the only argument to this function is the number of forecasting steps.

108
00:07:48,990 --> 00:07:54,270
Therefore, there is no point at which your model ever sees any data points from the test data.

109
00:07:54,810 --> 00:07:59,610
Your model only sees the training data which is passed in when you instantiate the model.

110
00:08:04,610 --> 00:08:05,900
So what is the lesson here?

111
00:08:06,590 --> 00:08:11,900
The lesson is that whenever you see plots like these, pay special attention to the actual code.

112
00:08:12,470 --> 00:08:18,470
Are these models actually predicting stock prices or are they just finding very computationally expensive

113
00:08:18,470 --> 00:08:24,980
ways to copy the last value in a time series, most likely performing worse than the naive forecast?

114
00:08:25,970 --> 00:08:31,210
Furthermore, we know that it's not stock prices that we care to predict, but rather stock returns.

115
00:08:32,060 --> 00:08:36,890
If you've had experience with machine learning, then you know it's probably not a good idea to

116
00:08:36,890 --> 00:08:39,830
use prices as inputs or outputs to your model.

117
00:08:40,460 --> 00:08:43,730
Machine learning models are generally not good at extrapolation.

118
00:08:44,180 --> 00:08:49,160
So, for example, if the prices that you trained on were between one hundred dollars and two hundred

119
00:08:49,160 --> 00:08:53,020
dollars, then maybe it'll learn something for that range of prices.

120
00:08:53,570 --> 00:08:57,150
But what about in the future when your price goes beyond 200?

121
00:08:57,620 --> 00:09:02,010
Your model has never seen this before and therefore it probably won't know what to do.

122
00:09:02,720 --> 00:09:08,930
So be suspicious when you see marketing material like this and don't use test data in your forecast.
