1
00:00:11,110 --> 00:00:16,210
OK, so in this lecture, we are going to look at applying the Holt-Winters model to a new data set.

2
00:00:16,990 --> 00:00:22,150
Now, before we begin this lecture, I want to give you an opportunity to stop this video and to try

3
00:00:22,150 --> 00:00:23,200
to do this on your own.

4
00:00:23,710 --> 00:00:28,690
Since you've already learned all the code you need to complete this exercise, you technically do not

5
00:00:28,690 --> 00:00:29,380
need my help.

6
00:00:29,860 --> 00:00:35,110
So if you want to treat this as an exercise, please download the data set from the URL given in this

7
00:00:35,110 --> 00:00:35,740
notebook.

8
00:00:36,040 --> 00:00:40,270
But after you get the URL, close the notebook and build the model on your own.

9
00:00:40,810 --> 00:00:44,530
OK, so if you want to try this as an exercise, please do so now.

10
00:00:44,680 --> 00:00:46,240
Otherwise, let's continue.

11
00:00:47,200 --> 00:00:53,410
So we'll start by importing the usual libraries: pandas, NumPy, Matplotlib, and the R-squared metric.

12
00:00:53,860 --> 00:00:58,060
Now, there's no special reason I'm using this metric, so please feel free to choose your own.

13
00:01:03,290 --> 00:01:06,860
Next, we're going to update statsmodels so that we have the latest version.

14
00:01:11,230 --> 00:01:17,740
Next, we're going to download the data set. This is a data set of champagne sales, a typical sort

15
00:01:17,740 --> 00:01:19,570
of time series for this kind of model.

16
00:01:24,920 --> 00:01:28,490
We'll now run the head command in order to check what's inside our CSV.

17
00:01:32,520 --> 00:01:37,380
So as you can see, the header looks kind of funky, so that's something we'll rename in pandas.

18
00:01:41,120 --> 00:01:45,630
Next, we're going to call read_csv to load in our data frame.

19
00:01:46,310 --> 00:01:49,910
Note that there are two junk lines at the bottom of this CSV file.

20
00:01:49,920 --> 00:01:52,150
So we pass in skipfooter equals two.

21
00:01:52,610 --> 00:01:55,970
You might want to look at the file yourself to verify what I'm saying.
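Here is a minimal sketch of that loading step. The inline text below is a made-up stand-in for the real file (its header and the two junk lines are assumptions, not the actual contents), and the short column name Sales is also just an illustrative choice. Note that skipfooter is only supported by the Python parsing engine, so the sketch requests it explicitly.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV: a clumsy header,
# three monthly rows, and two junk lines at the bottom.
raw = (
    "Month,Monthly champagne sales (hypothetical header)\n"
    "1964-01,2815\n"
    "1964-02,2672\n"
    "1964-03,2755\n"
    "Footer junk line one\n"
    "Footer junk line two\n"
)

# skipfooter=2 drops the last two lines; it needs engine="python".
df = pd.read_csv(
    io.StringIO(raw),
    index_col="Month",
    parse_dates=True,
    skipfooter=2,
    engine="python",
)

# Rename the one and only column to something nicer to deal with.
df.columns = ["Sales"]
print(df.head())
```

With a real file you would pass its path or URL in place of the StringIO object.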

22
00:02:00,970 --> 00:02:04,410
The next step is to do a df.head() to look at our data frame.

23
00:02:09,060 --> 00:02:15,150
So it's pretty much what we expect: the month is the index, and the champagne sales are the one and only column.

24
00:02:18,310 --> 00:02:21,910
The next step is to rename the column to something nicer to deal with.

25
00:02:25,840 --> 00:02:27,750
The next step is to plot our data set.

26
00:02:32,580 --> 00:02:34,630
So here's what we notice about this data.

27
00:02:35,520 --> 00:02:38,970
First, we recognize that there is seasonality in this data.

28
00:02:39,750 --> 00:02:45,510
Second, we see that there might be a slight trend, but the seasonal pattern is definitely more obvious.

29
00:02:46,260 --> 00:02:51,750
Third, note that the seasonal pattern is not constant, but seems to increase over time, at least

30
00:02:51,750 --> 00:02:52,440
to a point.

31
00:02:56,040 --> 00:03:00,540
OK, so the next step is to set the frequency of our index, which is in months.

32
00:03:04,860 --> 00:03:10,410
The next step is to split our data into train and test. We'll choose N_test equals 12, which makes

33
00:03:10,410 --> 00:03:12,960
sense since the seasonal cycle is one year.

34
00:03:17,450 --> 00:03:23,050
The next step is to create a boolean series in order to index the train and test rows of our data frame.
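The last few steps (setting the index frequency, holding out the final 12 months, and building boolean masks) could look roughly like this. The synthetic data and the names N_test, train_idx, and test_idx are assumptions for illustration, and "MS" (month start) is assumed as the frequency because the dates here fall on the first of each month.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data standing in for the champagne sales (assumption).
dates = pd.date_range("2000-01-01", periods=36, freq="MS")
# Building the index from raw values leaves freq unset, as after read_csv.
df = pd.DataFrame({"Sales": np.arange(36.0)}, index=pd.DatetimeIndex(dates.values))

# Tell pandas the index is monthly (month start).
df.index.freq = "MS"

# Hold out the last 12 months, one full seasonal cycle.
N_test = 12
train = df.iloc[:-N_test]
test = df.iloc[-N_test:]

# Boolean masks over the full data frame for train and test rows.
train_idx = df.index <= train.index[-1]
test_idx = df.index > train.index[-1]
print(train_idx.sum(), test_idx.sum())
```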

35
00:03:29,970 --> 00:03:32,800
The next step is to instantiate our Holt-Winters model.

36
00:03:33,420 --> 00:03:38,190
Now, there's no particular reason I've chosen these options except for obviously the seasonal period,

37
00:03:38,190 --> 00:03:39,040
which is 12.

38
00:03:39,540 --> 00:03:43,010
Feel free to try other options and use what you've learned in the course.

39
00:03:43,920 --> 00:03:47,160
Note that we're also going to fit the model in the same block.

40
00:03:51,230 --> 00:03:56,660
The next step is to assign the predictions to our data frame, so we'll start with the in-sample data,

41
00:03:56,660 --> 00:03:59,120
which we'll set to the Holt-Winters train column.

42
00:04:02,760 --> 00:04:04,790
We'll set the out-of-sample data to the column

43
00:04:04,980 --> 00:04:09,860
Holt-Winters test. This is so that the train and test predictions appear in different colors.
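To illustrate why two columns are used, here is a small sketch with placeholder predictions. The column names HoltWintersTrain and HoltWintersTest and the fake prediction arrays are assumptions. Each column is only filled on its own rows and is NaN elsewhere, so plotting the data frame draws them as separate lines.

```python
import numpy as np
import pandas as pd

# Synthetic data frame and masks (assumptions, for illustration only).
idx = pd.date_range("2000-01-01", periods=36, freq="MS")
df = pd.DataFrame({"Sales": np.arange(36.0)}, index=idx)
N_test = 12
train_idx = df.index <= df.index[-N_test - 1]
test_idx = ~train_idx

# Placeholder predictions; in practice these come from the fitted model.
train_pred = df.loc[train_idx, "Sales"].to_numpy() + 1.0
test_pred = df.loc[test_idx, "Sales"].to_numpy() - 1.0

# Each column is filled only on its own rows; the rest stays NaN.
df.loc[train_idx, "HoltWintersTrain"] = train_pred
df.loc[test_idx, "HoltWintersTest"] = test_pred

# df[["Sales", "HoltWintersTrain", "HoltWintersTest"]].plot() would then
# draw the two prediction segments as separate lines.
```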

44
00:04:13,740 --> 00:04:16,740
The next step is to plot our predictions against the true data.

45
00:04:24,090 --> 00:04:26,580
As you can see, it's a pretty reasonable fit.

46
00:04:29,820 --> 00:04:34,140
OK, so the next step is to check the R-squared for both the train and test sets.
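A quick sketch of that check, assuming the R-squared metric is scikit-learn's r2_score. The numbers are toy values, not the champagne data. The true values go first and the predictions second, and the order matters because the metric is not symmetric.

```python
from sklearn.metrics import r2_score

# Toy values only (assumption); replace with the real columns.
train_true = [10.0, 12.0, 14.0, 16.0]
train_pred = [10.0, 12.0, 14.0, 18.0]
test_true = [20.0, 22.0, 24.0, 26.0]
test_pred = [21.0, 22.0, 24.0, 26.0]

# Signature is r2_score(y_true, y_pred).
train_r2 = r2_score(train_true, train_pred)
test_r2 = r2_score(test_true, test_pred)
print("Train R^2:", train_r2)
print("Test R^2:", test_r2)
```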

47
00:04:39,650 --> 00:04:44,120
As you can see, we get a pretty high R-squared for the test set, but not as high for the train set.

48
00:04:44,120 --> 00:04:49,760
This makes sense since, as you saw on the plot, the model has a hard time predicting the very

49
00:04:49,760 --> 00:04:50,850
high peaks in the data.

50
00:04:51,380 --> 00:04:55,280
So you might want to try other options and see if that improves the performance.