1
00:00:11,110 --> 00:00:15,400
In this lecture, we are going to discuss a concept called walk-forward validation.

2
00:00:16,120 --> 00:00:21,400
So to give you some motivation for this lecture, we need to understand why traditional methods won't

3
00:00:21,400 --> 00:00:22,610
work with time series.

4
00:00:23,140 --> 00:00:24,710
Let's begin with the train test split.

5
00:00:25,720 --> 00:00:29,890
We know that in order to build a model, we should split our data into train and test.

6
00:00:30,320 --> 00:00:32,140
This is how we can avoid overfitting.

7
00:00:32,680 --> 00:00:37,900
Overfitting is when our model appears to perform really well, but only because it has overfitted to

8
00:00:37,900 --> 00:00:40,740
the noise in the training set. In the real world,

9
00:00:40,750 --> 00:00:44,980
what really matters is how our model performs on data it has not seen before.

10
00:00:45,360 --> 00:00:50,770
For example, the forecast for next week or whether a new customer is willing to purchase an item.

11
00:00:51,310 --> 00:00:56,320
If a model is too flexible, it might fit very well to the training data, but not that well to the

12
00:00:56,320 --> 00:00:57,040
new data.

13
00:01:01,710 --> 00:01:06,030
Now, there's one downside to the single train-test split as we've been doing.

14
00:01:06,780 --> 00:01:11,490
Imagine that in order to build your model, you try out a bunch of different parameters on the data.

15
00:01:11,850 --> 00:01:15,570
You train each model on the train set and test each model on the test set.

16
00:01:16,140 --> 00:01:20,850
Now, unfortunately, what you've really done is kind of like a manual parameter fitting.

17
00:01:21,600 --> 00:01:27,780
Instead of letting an algorithm optimize a set of parameters, you've been doing the work yourself.

18
00:01:27,780 --> 00:01:30,620
Effectively, your test set has become in-sample data.

19
00:01:30,930 --> 00:01:34,200
You've been using it to optimize some set of parameters.

20
00:01:38,730 --> 00:01:44,320
So one way to mitigate this problem is to evaluate your model using multiple validation sets.

21
00:01:44,910 --> 00:01:50,610
Note that I'm conflating the concepts of validation and test sets for the purposes of this lecture. We'll

22
00:01:50,610 --> 00:01:55,590
assume that in the real world, your true test set is the data that lies in the future that your model

23
00:01:55,590 --> 00:01:57,050
really has not seen before.

24
00:01:57,510 --> 00:02:02,140
For example, non-public data from a Kaggle contest or new customers to your website.

25
00:02:02,610 --> 00:02:08,670
So, data for which you truly do not know the answer. Anything else we'll consider to be validation data,

26
00:02:08,790 --> 00:02:11,210
and will conflate it with test data for simplicity.

27
00:02:12,650 --> 00:02:17,950
OK, so with that out of the way, here's how cross-validation works in a non-time-series setting,

28
00:02:18,380 --> 00:02:20,270
that is for regular machine learning.

29
00:02:21,290 --> 00:02:27,800
The basic idea is you're going to split up your data into K random parts. Then you'll repeat this process

30
00:02:27,800 --> 00:02:28,850
K times.

31
00:02:29,240 --> 00:02:34,790
Each time, you're going to pick one of the K parts to be your validation set and the other K minus

32
00:02:34,790 --> 00:02:36,650
one parts to be your train set.

33
00:02:37,250 --> 00:02:41,450
Each time you do this, you'll train a model and evaluate it on the validation set.

34
00:02:42,050 --> 00:02:46,130
Your final score for the model is just the average of the K individual scores.

35
00:02:46,720 --> 00:02:48,490
OK, so hopefully that's pretty simple.

36
00:02:48,740 --> 00:02:51,200
We call this K-fold cross-validation.
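
[Editor's sketch, not part of the lecture] The procedure just described can be illustrated roughly as follows. The function names `k_fold_splits` and `mean_model_mse`, the toy data, and the "predict the training mean" model are all assumptions made for illustration; this is a minimal sketch, not the course's implementation.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment to the K parts
    folds = [indices[i::k] for i in range(k)]  # K parts of near-equal size
    for i in range(k):
        val_idx = folds[i]
        # The other K - 1 parts together form the train set.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, val_idx


def mean_model_mse(train_y, val_y):
    """Toy model (assumption): predict the training mean, score with MSE."""
    pred = sum(train_y) / len(train_y)
    return sum((y - pred) ** 2 for y in val_y) / len(val_y)


y = [float(v) for v in range(10)]  # hypothetical non-time-series targets
scores = []
for train_idx, val_idx in k_fold_splits(len(y), k=5):
    scores.append(mean_model_mse([y[i] for i in train_idx],
                                 [y[i] for i in val_idx]))

# The final score is the average of the K individual scores.
final_score = sum(scores) / len(scores)
```

Because the indices are shuffled, each validation part mixes early and late observations, which is harmless here but is exactly what goes wrong once the data has a time order.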

37
00:02:55,980 --> 00:02:59,860
So for time series, K-fold cross-validation does not apply.

38
00:03:00,570 --> 00:03:05,490
This is because, for time series data, there is a time dependence among the data points.

39
00:03:05,850 --> 00:03:10,540
So if you split the data randomly, you're going to mix future data with past data.

40
00:03:10,890 --> 00:03:15,550
And of course, this is a big no-no. If you need justification for this,

41
00:03:15,780 --> 00:03:21,360
think about the fact that it does not reflect how our model will operate in the real world. In the

42
00:03:21,360 --> 00:03:24,210
real world, we'll collect data and train our model.

43
00:03:24,630 --> 00:03:27,000
The data we collect must be from the past,

44
00:03:27,180 --> 00:03:28,640
unless you have a time machine.

45
00:03:29,250 --> 00:03:34,140
Since nobody I know has a time machine, it's not possible to train a model with future data.

46
00:03:35,100 --> 00:03:38,820
Therefore, doing so during cross-validation is not realistic.

47
00:03:39,720 --> 00:03:46,290
Instead, what is realistic is that we will have past data and we can use all of that past data to predict

48
00:03:46,290 --> 00:03:46,950
the future.

49
00:03:51,620 --> 00:03:57,640
So our new validation method, which is realistic for time series, is called walk-forward validation.

50
00:03:58,190 --> 00:04:03,800
The way it works is this: we're going to pick some minimum amount of data that we will use to train our

51
00:04:03,800 --> 00:04:04,260
model.

52
00:04:04,580 --> 00:04:05,880
This will be our starting point.

53
00:04:06,890 --> 00:04:12,720
Then suppose that we have a forecast horizon H, which can be one or any integer bigger than one.

54
00:04:13,790 --> 00:04:19,090
Then we're going to train our model on the existing training data and validate it on the next H

55
00:04:19,100 --> 00:04:19,970
data points.

56
00:04:21,490 --> 00:04:27,730
The next step is to then walk forward one step and append the next true data point to our training

57
00:04:27,730 --> 00:04:34,270
data and repeat the process. That is, our train set has now become bigger by one true data point,

58
00:04:34,480 --> 00:04:39,400
and we train a new model on this data and validate it again on the next H data points.

59
00:04:40,600 --> 00:04:45,140
After this, we continue to repeat this process until we've reached the end of our data set.
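
[Editor's sketch, not part of the lecture] One possible way to generate the splits described above, with an expanding training window and a step of one. The function name `walk_forward_splits` and its parameters `min_train` and `horizon` are hypothetical names chosen for illustration, not identifiers from the course code.

```python
def walk_forward_splits(n_samples, min_train, horizon):
    """Yield (train_idx, val_idx) pairs for walk-forward validation.

    The train set starts with `min_train` points and grows by one true
    data point per step; each validation set is the next `horizon` points.
    """
    train_end = min_train
    # Stop once fewer than `horizon` points remain after the train set.
    while train_end + horizon <= n_samples:
        train_idx = list(range(train_end))
        val_idx = list(range(train_end, train_end + horizon))
        yield train_idx, val_idx
        train_end += 1  # walk forward one step


# Hypothetical series of 10 points, minimum train size 6, horizon H = 2.
splits = list(walk_forward_splits(n_samples=10, min_train=6, horizon=2))
for train_idx, val_idx in splits:
    print(len(train_idx), val_idx)
```

Every training index precedes every validation index, so no model is ever fitted on data from the future of the points it is validated on.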

60
00:04:46,330 --> 00:04:48,080
OK, so I think that's pretty simple.

61
00:04:48,670 --> 00:04:54,110
Note that unlike cross-validation, this does reflect how our model will operate in the real world.

62
00:04:54,520 --> 00:05:00,340
We will train our model on all available data and then use it to make a forecast for however many steps

63
00:05:00,340 --> 00:05:01,600
we wish to forecast.

64
00:05:06,150 --> 00:05:12,390
And as a side note, it's not necessary to use all available data in the past. Perhaps the dependencies

65
00:05:12,390 --> 00:05:17,570
of your time series change over time, such that recent data is more useful than older data.

66
00:05:18,510 --> 00:05:24,000
In that case, you might want to make the training set a constant-size window and throw away old data.

67
00:05:24,720 --> 00:05:27,320
This still reflects how your model would be trained in reality,

68
00:05:27,390 --> 00:05:28,410
so it's acceptable.
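
[Editor's sketch, not part of the lecture] The constant-size training window mentioned here could look roughly like this. The function name `sliding_window_splits` and the parameter `window` are assumptions made for illustration.

```python
def sliding_window_splits(n_samples, window, horizon):
    """Yield (train_idx, val_idx) pairs with a fixed-size training window.

    At each step the oldest training point is dropped and the next true
    data point is added, so the train set always has `window` points.
    """
    start = 0
    while start + window + horizon <= n_samples:
        train_end = start + window
        train_idx = list(range(start, train_end))
        val_idx = list(range(train_end, train_end + horizon))
        yield train_idx, val_idx
        start += 1  # slide the whole window forward one step


# Hypothetical series of 8 points, window of 4, horizon H = 1.
sliding = list(sliding_window_splits(n_samples=8, window=4, horizon=1))
for train_idx, val_idx in sliding:
    print(train_idx, val_idx)
```

The only difference from the expanding version is that the start of the training range moves forward together with its end.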

69
00:05:32,960 --> 00:05:37,770
Another option you might want to consider is that you don't have to walk forward one step at a time.

70
00:05:38,570 --> 00:05:44,270
It's also common to walk forward H steps at a time so that no validation set overlaps with any other.

71
00:05:45,410 --> 00:05:50,420
But note that if you choose this method, then H has to evenly divide the total number

72
00:05:50,420 --> 00:05:53,030
of points that you plan to use for validation.
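
[Editor's sketch, not part of the lecture] Walking forward H steps at a time, with the divisibility condition made explicit. The function name `blocked_walk_forward` and the choice to raise a `ValueError` when the condition fails are assumptions for illustration; dropping the leftover points would be another reasonable choice.

```python
def blocked_walk_forward(n_samples, min_train, horizon):
    """Yield (train_idx, val_idx) pairs with non-overlapping validation sets.

    The step size equals `horizon`, so each validation block starts where
    the previous one ended.
    """
    n_val = n_samples - min_train  # points reserved for validation
    if n_val % horizon != 0:
        raise ValueError(
            f"horizon={horizon} does not evenly divide the "
            f"{n_val} validation points"
        )
    for train_end in range(min_train, n_samples, horizon):
        train_idx = list(range(train_end))
        val_idx = list(range(train_end, train_end + horizon))
        yield train_idx, val_idx


# Hypothetical series of 12 points, minimum train size 6, horizon H = 3.
blocks = list(blocked_walk_forward(n_samples=12, min_train=6, horizon=3))
for train_idx, val_idx in blocks:
    print(len(train_idx), val_idx)
```

With 6 validation points and H = 3 this gives exactly two validation blocks that together cover the validation range once, with no point scored twice.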

73
00:05:57,720 --> 00:06:03,180
So one question that pops up from time to time is, what if I want to use TimeSeriesSplit, which

74
00:06:03,180 --> 00:06:04,490
is part of scikit-learn?

75
00:06:05,070 --> 00:06:11,100
So my opinion on this is that it might save you from doing a bit of work under the right circumstances.

76
00:06:11,820 --> 00:06:16,000
At this point, it doesn't help us at all, since we are not training a scikit-learn model.

77
00:06:16,830 --> 00:06:22,500
So one limitation of these methods is that the model you're cross-validating must conform to the scikit-learn

78
00:06:22,500 --> 00:06:25,500
interface, which is not the case for statsmodels.

79
00:06:26,220 --> 00:06:31,140
But if you're using scikit-learn, as we will later in this course, then it is a neat option that

80
00:06:31,140 --> 00:06:38,610
might save you from writing a bit of code. At the same time, recognize that this method is limited, and

81
00:06:38,610 --> 00:06:42,260
this is aside from the fact that it will only work with scikit-learn models.

82
00:06:43,050 --> 00:06:48,210
So one way it's limited is that, as mentioned previously, it's possible for us to choose the step

83
00:06:48,210 --> 00:06:53,880
size as we walk forward. With scikit-learn, you won't have this option and you'll be forced to use

84
00:06:53,880 --> 00:06:55,110
non-overlapping blocks.

85
00:06:56,610 --> 00:07:01,740
The second way it's limited is that you don't get to choose the initial size of your first block. All

86
00:07:01,740 --> 00:07:03,320
blocks will be of equal size.

87
00:07:04,320 --> 00:07:08,820
Neither of these necessarily reflects how you'd work with a time series in the real world.

88
00:07:09,270 --> 00:07:14,570
In the real world, you would probably have the resources to update your model after every time step.

89
00:07:14,940 --> 00:07:18,000
So a step size of one is probably more realistic.

90
00:07:18,390 --> 00:07:21,060
However, this also depends on how much data you have.

91
00:07:21,660 --> 00:07:25,200
If you have lots of data, then bigger step sizes might be acceptable.

92
00:07:26,250 --> 00:07:31,380
And furthermore, you would probably begin model training with a train set that's substantially larger

93
00:07:31,380 --> 00:07:35,090
than your test set, so you wouldn't be using evenly sized blocks.

94
00:07:36,180 --> 00:07:41,310
And the last kind of disadvantage of using this method is that it really takes away from your learning

95
00:07:41,310 --> 00:07:42,140
experience.

96
00:07:43,140 --> 00:07:48,630
So when you use this method, you're basically not thinking about all the important details, and not thinking

97
00:07:48,630 --> 00:07:50,220
about all the important details

98
00:07:50,220 --> 00:07:55,380
when you're taking a course specifically to learn about those important details is pretty suboptimal,

99
00:07:55,380 --> 00:07:56,180
in my opinion.

100
00:07:57,210 --> 00:08:01,980
In any case, if you are interested in learning how to use TimeSeriesSplit, please let me know on

101
00:08:01,980 --> 00:08:02,700
the Q&A.

102
00:08:07,330 --> 00:08:13,120
Now, just as a final note for this lecture, we will look at how to do walk-forward validation in Python,

103
00:08:13,300 --> 00:08:14,470
which is coming next.

104
00:08:15,200 --> 00:08:18,960
However, we will not use this method for every example in this course.

105
00:08:19,420 --> 00:08:24,850
The reason for this is it just makes the code messy and doesn't provide any conceptual benefit once

106
00:08:24,850 --> 00:08:26,020
you already understand it.

107
00:08:26,860 --> 00:08:31,790
So you'll notice that for pretty much all the scripts in this course, we will only do a train-test split

108
00:08:31,790 --> 00:08:32,470
once.

109
00:08:33,240 --> 00:08:37,960
That's not to say you shouldn't use walk-forward validation in the real world, when it really matters.

110
00:08:38,560 --> 00:08:42,880
This goes back to one of the major themes in this course, which is that there are so many options to

111
00:08:42,880 --> 00:08:43,570
choose from.

112
00:08:43,900 --> 00:08:48,950
If we used every option in every script, there would be a combinatorial explosion of things to try.

113
00:08:49,720 --> 00:08:55,300
So as mentioned, one of your major exercises in this course is to mix and match any of the concepts

114
00:08:55,300 --> 00:08:55,870
you learn.

115
00:08:56,410 --> 00:09:01,410
Anything you're curious about, please try it yourself and report your results to the rest of the class.
