1
00:00:11,050 --> 00:00:16,610
OK, so in this video, we are going to look at how to apply auto ARIMA to champagne sales.

2
00:00:17,110 --> 00:00:22,210
Now, before we begin this lecture, I want to give you an opportunity to stop this video and to try

3
00:00:22,210 --> 00:00:23,320
to do this on your own.

4
00:00:23,860 --> 00:00:28,630
Since you've already learned all the code you need to complete this exercise, you technically do not

5
00:00:28,630 --> 00:00:29,300
need my help.

6
00:00:29,740 --> 00:00:35,350
So if you want to treat this as an exercise, please download the data set from the URL given in this

7
00:00:35,350 --> 00:00:35,950
notebook.

8
00:00:36,280 --> 00:00:40,360
But after you get the URL, close the notebook and build a model on your own.

9
00:00:41,140 --> 00:00:44,940
OK, so if you want to try this as an exercise, please do so now.

10
00:00:45,070 --> 00:00:46,590
Otherwise, let's continue.

11
00:00:47,810 --> 00:00:50,810
OK, so let's skip the imports, since you've seen this all before.

12
00:00:52,610 --> 00:00:54,690
The next step is to update statsmodels.

13
00:00:55,010 --> 00:00:58,820
Again, this is because the newest version differs from the default on Colab.

14
00:01:02,650 --> 00:01:04,930
The next step is to install pmdarima.

15
00:01:08,760 --> 00:01:10,890
The next step is to download our CSV.

16
00:01:15,210 --> 00:01:19,050
The next step is to read in our CSV using pd.read_csv.

17
00:01:22,630 --> 00:01:26,620
The next step is to call .head() so we remember what our data looks like.

18
00:01:33,510 --> 00:01:38,910
Since the column name in our data frame is kind of user-unfriendly, we're going to rename it to sales.

19
00:01:42,240 --> 00:01:45,570
The next step is to plot our data, so you remember what it looks like.

20
00:01:49,550 --> 00:01:54,470
So, again, there's a pretty strong seasonal component, but the amplitude of the cycles seems to grow

21
00:01:54,470 --> 00:01:55,180
over time.

22
00:01:58,420 --> 00:02:00,850
The next step is to compute the log transform.

23
00:02:01,300 --> 00:02:05,980
Now, we're not going to go through every possible option in this script, but please try anything you're

24
00:02:05,980 --> 00:02:06,760
curious about.

25
00:02:14,050 --> 00:02:17,110
The next step is to set our index frequency to months.

26
00:02:21,460 --> 00:02:24,520
The next step is to split our data into train and test.

27
00:02:28,550 --> 00:02:33,050
The next step is to create indexes to our data frame for both the train and test sets.

28
00:02:36,880 --> 00:02:39,370
The next step is to import pmdarima.

29
00:02:43,700 --> 00:02:50,270
The next step is to run auto ARIMA on the log sales. We'll set seasonal equals true to start, using stepwise

30
00:02:50,270 --> 00:02:50,870
search.

31
00:02:57,760 --> 00:03:01,570
OK, so this gives us ARIMA(1, 0, 1)(0, 1, 1).

32
00:03:06,060 --> 00:03:11,610
The next step is to plot our in-sample and out-of-sample predictions. Since I've shown you this code

33
00:03:11,610 --> 00:03:12,120
before,

34
00:03:12,210 --> 00:03:13,800
I'm not going to explain it again.

35
00:03:19,680 --> 00:03:24,150
OK, so we see that our forecast very closely matches the true time series.

36
00:03:27,940 --> 00:03:30,610
The next step is to compute the R-squared of our prediction.

37
00:03:34,060 --> 00:03:37,260
OK, so we get about zero point nine five, which makes sense.
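For reference, here is a minimal sketch of how R-squared can be computed by hand. This is not the notebook's code: the function name r_squared is made up for illustration, and the lecture does not say whether the score is computed on the log scale or the original scale.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Sum of squared errors of the forecast.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Sum of squared deviations from the mean (a constant-mean baseline).
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy numbers, purely illustrative (not champagne sales data).
actual = [1.0, 2.0, 3.0]
print(r_squared(actual, [1.0, 2.0, 3.0]))  # perfect forecast
print(r_squared(actual, [2.0, 2.0, 2.0]))  # forecast equal to the mean
```

A score of one means the forecast matches exactly, and a score of zero means it does no better than predicting the mean of the test data.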

38
00:03:40,900 --> 00:03:46,560
Now, we know that it's not entirely necessary to use a seasonal model when you have seasonal data.

39
00:03:47,230 --> 00:03:53,110
So the next thing I want to try is to simply do a grid search over non-seasonal ARIMA models to see

40
00:03:53,110 --> 00:03:54,130
what we can find.

41
00:03:55,240 --> 00:03:57,550
Note that I've set seasonal to false and

42
00:03:57,550 --> 00:03:58,570
stepwise to false.
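As a rough sketch, the two searches described so far might be configured like this. The helper names are hypothetical, and the values m=12, max_p=12, max_q=2 and max_order are assumptions for illustration; the lecture does not state the exact arguments used in the notebook. pmdarima is imported lazily so the configuration helpers can be inspected without it installed.

```python
def seasonal_stepwise_kwargs(m=12):
    """Arguments for the first run: seasonal model, stepwise search.

    m=12 is an assumption based on the monthly data.
    """
    return dict(seasonal=True, m=m, stepwise=True,
                trace=True, suppress_warnings=True)


def nonseasonal_grid_kwargs(max_p=12, max_q=2):
    """Arguments for the second run: non-seasonal, full grid search.

    max_p and max_q are illustrative upper bounds. max_order caps
    p + q in a non-stepwise search, so it is raised to allow large p.
    """
    return dict(seasonal=False, stepwise=False,
                max_p=max_p, max_q=max_q, max_order=max_p + max_q,
                trace=True, suppress_warnings=True)


def run_auto_arima(y, **kwargs):
    """Fit with pmdarima.auto_arima (requires pmdarima to be installed)."""
    import pmdarima as pm
    return pm.auto_arima(y, **kwargs)


print(seasonal_stepwise_kwargs())
print(nonseasonal_grid_kwargs())
```

With a training series in hand, a call would look like run_auto_arima(train_log_sales, **nonseasonal_grid_kwargs()), where train_log_sales is a placeholder name for the training portion of the log sales.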

43
00:04:05,650 --> 00:04:08,650
OK, so our best model is ARIMA(12, 1, 1).

44
00:04:09,520 --> 00:04:13,700
This makes sense since the seasonal period of the time series is 12 months.

45
00:04:14,650 --> 00:04:20,830
Now, just note that in this case, using pmdarima is kind of unnecessary. If you want to do a grid

46
00:04:20,830 --> 00:04:23,950
search and your goal is to get the model with the best test predictions,

47
00:04:24,160 --> 00:04:28,900
it's probably better to just use the method we learned earlier, which was walk-forward validation.

48
00:04:30,590 --> 00:04:34,100
So why are we doing grid search instead of the default stepwise search?

49
00:04:34,880 --> 00:04:37,490
Well, I encourage you to try this code with stepwise equals

50
00:04:37,520 --> 00:04:38,030
true,

51
00:04:38,270 --> 00:04:41,180
so you can see that you can't just trust this method blindly.

52
00:04:44,560 --> 00:04:48,250
The next step is to plot our predictions using the same code as before.

53
00:04:53,560 --> 00:04:56,670
OK, so our predictions are pretty close, as you might expect.

54
00:05:00,020 --> 00:05:02,120
The next step is to compute the R-squared.

55
00:05:05,970 --> 00:05:11,490
So the R-squared of our non-seasonal model is zero point nine seven, which is a bit better than before.

56
00:05:14,650 --> 00:05:19,600
The next step in the script is to choose our ARIMA model manually by using the ACF and the PACF.

57
00:05:24,040 --> 00:05:26,110
So let's start by plotting the ACF.

58
00:05:30,330 --> 00:05:35,730
So what we can see from this is that there is a highly significant point around lag 12, which makes

59
00:05:35,730 --> 00:05:40,470
sense. But note that it's not very common to see models with high q values.
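To see why a seasonal series produces a large ACF value at the seasonal lag, here is a small sketch that computes the sample autocorrelation by hand on a synthetic sine wave with a period of 12. The function name sample_acf and the synthetic series are assumptions for illustration; the notebook itself plots the ACF with a library function.

```python
import math


def sample_acf(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    # Lag-0 autocovariance (times n), used to normalize.
    denom = sum((v - mean) ** 2 for v in x)
    # Autocovariance at the requested lag (times n).
    num = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return num / denom


# Ten years of synthetic monthly data with a 12-month cycle.
series = [math.sin(2 * math.pi * t / 12) for t in range(120)]

for lag in (1, 6, 12):
    print(lag, round(sample_acf(series, lag), 3))
```

The value at lag 12 is strongly positive because each point lines up with the same month one year earlier, while lag 6 is strongly negative because it compares opposite halves of the cycle.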

60
00:05:44,470 --> 00:05:50,680
The next step is to plot the PACF. Now, you might be wondering, what is this strange argument, method?

61
00:05:51,250 --> 00:05:56,030
The answer is that, as you recall, finding the PACF involves doing linear regression.

62
00:05:56,830 --> 00:05:59,400
It turns out that there are multiple ways to do this.

63
00:05:59,800 --> 00:06:04,840
Unfortunately, with the default method, you're going to get an invalid answer, which is not useful.

64
00:06:05,350 --> 00:06:08,060
Using ordinary least squares seems to fix the problem.

65
00:06:08,230 --> 00:06:09,610
So that's what we'll use.

66
00:06:13,950 --> 00:06:18,600
OK, so we see that we get statistically significant values up to lag 12.

67
00:06:19,140 --> 00:06:22,020
This makes sense and it gives us a reason to set p to 12.

68
00:06:25,190 --> 00:06:29,660
Another interesting result from our grid search was that it chose d equals one, which means that it

69
00:06:29,660 --> 00:06:30,920
took the first difference.

70
00:06:31,430 --> 00:06:34,810
That, to me is a bit strange because there's no obvious trend in the data.

71
00:06:39,370 --> 00:06:44,800
If we plot the first difference of the log sales, we see that it doesn't really accomplish much. The

72
00:06:44,800 --> 00:06:47,460
time series is still seasonal, just with a different pattern.
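Here is a minimal sketch of the first difference of the log, using a made-up periodic series rather than the real sales data. The function name log_diff and the synthetic values are assumptions for illustration. It shows that differencing a purely seasonal series leaves a series that still repeats every 12 steps.

```python
import math


def log_diff(values):
    """First difference of the natural log: log(x[t]) - log(x[t-1])."""
    logs = [math.log(v) for v in values]
    return [b - a for a, b in zip(logs, logs[1:])]


# Synthetic monthly series with a 12-month cycle and no trend.
values = [100.0 * (1.5 + math.sin(2 * math.pi * t / 12)) for t in range(48)]
diffed = log_diff(values)

# The differenced series has one fewer point and still repeats every 12 steps.
print(len(values), len(diffed))
print(round(diffed[0], 4), round(diffed[12], 4), round(diffed[24], 4))
```

Removing the seasonal pattern itself would take a seasonal difference, meaning a difference at lag 12, rather than a first difference.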

73
00:06:51,800 --> 00:06:55,280
The next step is to plot the ACF of the differenced log sales.

74
00:06:59,380 --> 00:07:04,180
So interestingly, this looks very similar to the ACF of the original time series.

75
00:07:07,440 --> 00:07:11,250
The next step is to plot the PACF of the differenced log sales.

76
00:07:14,980 --> 00:07:20,530
Again, it essentially tells us the exact same thing as the PACF of the original time series.

77
00:07:23,760 --> 00:07:28,830
So one way to justify taking the difference of our time series is to use the ADF test.

78
00:07:32,610 --> 00:07:35,220
So let's run the ADF test on the log sales.

79
00:07:39,360 --> 00:07:45,810
OK, so as you recall, the p-value is the second element in this tuple. Using a significance threshold

80
00:07:45,810 --> 00:07:48,900
of five percent, we would not reject the null hypothesis.

81
00:07:49,410 --> 00:07:52,560
Therefore, we would not say that this time series is stationary.

82
00:07:55,750 --> 00:07:59,680
The next step is to try the ADF test on the differenced log sales.

83
00:08:02,910 --> 00:08:08,880
So this time, our p-value is far below the significance threshold of five percent. In this case, we

84
00:08:08,880 --> 00:08:12,760
do reject the null hypothesis and we say that the time series is stationary.

85
00:08:13,350 --> 00:08:16,680
So this gives us some justification for setting d equals one.
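The decision rule just described can be sketched as follows. The helper names are hypothetical and the example tuples contain made-up numbers, not the notebook's output. The statsmodels import is done lazily inside run_adf, so the decision helper can be tried without statsmodels installed.

```python
def is_stationary(adf_result, alpha=0.05):
    """Apply the ADF decision rule to a result tuple.

    The p-value is the second element of the tuple. The null hypothesis
    is that the series is non-stationary, so a p-value below alpha means
    we reject it and treat the series as stationary.
    """
    p_value = adf_result[1]
    return p_value < alpha


def run_adf(series):
    """Run the augmented Dickey-Fuller test (requires statsmodels)."""
    from statsmodels.tsa.stattools import adfuller
    return adfuller(series)


# Made-up result tuples, for illustration only.
fake_result_raw = (-1.8, 0.38)       # large p-value: do not reject
fake_result_diffed = (-7.6, 2e-11)   # tiny p-value: reject

print(is_stationary(fake_result_raw))
print(is_stationary(fake_result_diffed))
```

Note that failing to reject the null hypothesis is not proof of non-stationarity; it only means the test found no evidence of stationarity at the chosen threshold.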

86
00:08:21,140 --> 00:08:26,330
So in this next block of code, I've copied a function written earlier which plots an ARIMA forecast,

87
00:08:26,660 --> 00:08:30,560
so please check the previous lectures if you want to learn what this code does.

88
00:08:35,910 --> 00:08:41,040
The next step is to train our ARIMA model using the parameters suggested by the plots we just saw.

89
00:08:41,580 --> 00:08:46,670
So I've chosen p equals 12 since that's the maximum significant lag in the PACF.

90
00:08:47,070 --> 00:08:50,670
I've chosen d equals one, based on the results of our ADF test.

91
00:08:51,160 --> 00:08:53,820
And I've also chosen q equals two for good measure.

92
00:09:01,840 --> 00:09:03,730
OK, so this seems like a pretty good fit.

93
00:09:07,410 --> 00:09:09,330
The next step is to check the R-squared.

94
00:09:12,910 --> 00:09:17,170
So the R-squared is about zero point nine eight seven, which is better than before.

95
00:09:18,160 --> 00:09:23,710
OK, so I hope this served as an educational example of how to apply the tools we learned about to new

96
00:09:23,710 --> 00:09:24,830
time series data.

97
00:09:25,510 --> 00:09:30,550
Now, in practice, you'd want to use all the tools you learned previously, such as walk-forward validation

98
00:09:30,550 --> 00:09:31,390
and so forth.

99
00:09:32,020 --> 00:09:37,090
So although we got a better test R-squared on this particular split, it's not necessarily true that

100
00:09:37,090 --> 00:09:40,030
this model is better when you consider other splits.