1
00:00:11,050 --> 00:00:16,030
In this lecture, we are going to circle back to the earlier part of the previous lecture, which is

2
00:00:16,030 --> 00:00:22,570
how do we define the best model? We know that Auto ARIMA will find the best model out of some set of

3
00:00:22,570 --> 00:00:23,100
models.

4
00:00:23,110 --> 00:00:28,390
But of course, in order to make any sense of that, we need to know what it means for a model to be

5
00:00:28,390 --> 00:00:32,730
the best. Because these models come from the statistics literature,

6
00:00:32,980 --> 00:00:37,900
if you come from a machine learning background, you will recognize that machine learning people might

7
00:00:37,900 --> 00:00:39,290
approach this topic differently.

8
00:00:40,000 --> 00:00:42,660
However, ARIMA is a statistical method.

9
00:00:42,670 --> 00:00:47,980
And so generally, when we look at how these libraries work and what they do, they will be using techniques

10
00:00:47,980 --> 00:00:51,300
founded on traditional statistics rather than machine learning.

11
00:00:52,600 --> 00:00:56,830
What is somewhat interesting about Auto ARIMA is that we end up coming full circle.

12
00:00:57,520 --> 00:01:03,190
As you may recall, the reason that we use methods such as the ACF and the PACF is because we want

13
00:01:03,190 --> 00:01:08,980
a more direct way of choosing hyperparameters than simple trial and error. In comparison,

14
00:01:09,040 --> 00:01:12,270
trial and error seems like a very naive approach.

15
00:01:12,580 --> 00:01:16,270
And yet with Auto ARIMA, that is exactly what we do.

16
00:01:16,990 --> 00:01:22,720
It turns out that the method of manually looking at the ACF and the PACF does not always lead to the

17
00:01:22,720 --> 00:01:23,620
optimal answer.

18
00:01:24,580 --> 00:01:30,070
Auto ARIMA does a more exhaustive search and therefore has a better opportunity of finding the best

19
00:01:30,070 --> 00:01:30,610
answer.

20
00:01:31,450 --> 00:01:37,360
Now, one method you may have thought of is grid search, that is, searching through every possible set

21
00:01:37,360 --> 00:01:39,060
of hyperparameters in a grid.

22
00:01:39,760 --> 00:01:44,890
For example, if you had two hyperparameters to search for, this would be a two-dimensional grid.

23
00:01:45,490 --> 00:01:49,660
For example, you might want to try different combinations of P and Q from 1 to 10.

24
00:01:49,660 --> 00:01:52,210
So it's (1, 1), (1, 2), (1, 3), and so on.

25
00:01:52,960 --> 00:01:57,280
If you had three hyperparameters to search for, this would be a three-dimensional grid.

26
00:01:57,880 --> 00:02:04,060
A grid search would be simply to look through every possible position on the grid to find the best combination

27
00:02:04,060 --> 00:02:05,200
of hyperparameters.
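
As a minimal sketch of the grid search just described: the code below tries every (P, Q) pair from 1 to 10 and keeps the pair with the lowest score. The function `toy_score` is a hypothetical stand-in for "fit the model with this order and return its evaluation criterion"; it is not part of any library, and its values are made up.

```python
from itertools import product


def grid_search(score, p_values, q_values):
    """Return the (p, q) pair with the lowest score, trying every grid point."""
    best_order, best_score = None, float("inf")
    for p, q in product(p_values, q_values):
        s = score(p, q)
        if s < best_score:
            best_order, best_score = (p, q), s
    return best_order, best_score


def toy_score(p, q):
    # Hypothetical stand-in for fitting a model and returning its criterion.
    # Lower is better; the minimum is placed at (3, 2) purely for illustration.
    return (p - 3) ** 2 + (q - 2) ** 2


best_order, best_score = grid_search(toy_score, range(1, 11), range(1, 11))
print(best_order, best_score)
```

With two hyperparameters and ten values each, this evaluates 100 candidates; a third hyperparameter with ten values would make it 1,000.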

28
00:02:05,890 --> 00:02:11,020
Now, although Auto ARIMA gives you the option to do this, this is not the default behavior.

29
00:02:11,680 --> 00:02:18,670
Instead, Auto ARIMA uses a stepwise algorithm to more intelligently find the best set of hyperparameters.

30
00:02:19,570 --> 00:02:25,370
It's true that computers are so fast nowadays that they allow a function like Auto ARIMA to be possible.

31
00:02:25,810 --> 00:02:28,910
However, a full grid search can still be quite slow.

32
00:02:29,560 --> 00:02:33,040
Therefore, we typically use the stepwise algorithm as the default.
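
To give a rough feel for why a stepwise search is cheaper than a full grid, here is a generic greedy neighbor search. This is only an illustration of the idea and is not the actual stepwise procedure used by Auto ARIMA, which has its own rules. The names `stepwise_search` and `toy_score`, the starting point, and the score values are all assumptions made for this sketch.

```python
def toy_score(p, q):
    # Hypothetical stand-in for fitting a model and returning its criterion.
    return (p - 3) ** 2 + (q - 2) ** 2


def stepwise_search(score, start=(2, 2), max_order=10):
    """Greedy search over (p, q): move to a better neighbor until none exists."""
    current = start
    current_score = score(*current)
    evaluated = {current: current_score}  # cache so no order is scored twice
    improved = True
    while improved:
        improved = False
        p, q = current
        # Neighbors differ from the current order by one step in p or q.
        for cand in [(p + 1, q), (p - 1, q), (p, q + 1), (p, q - 1)]:
            cp, cq = cand
            if not (0 <= cp <= max_order and 0 <= cq <= max_order):
                continue
            if cand not in evaluated:
                evaluated[cand] = score(cp, cq)
            if evaluated[cand] < current_score:
                current, current_score = cand, evaluated[cand]
                improved = True
    return current, current_score, len(evaluated)


order, value, n_evaluated = stepwise_search(toy_score)
print(order, value, n_evaluated)
```

The trade-off is that a greedy search scores far fewer candidates than the full grid, but it can stop at a local optimum when the score surface is not as well behaved as this toy one.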

33
00:02:37,910 --> 00:02:43,650
All right, so when we run Auto ARIMA, it's going to use some criterion to evaluate each model.

34
00:02:44,360 --> 00:02:51,440
The model that gives us the best value will be the model chosen. Two roughly equally good evaluation criteria

35
00:02:51,590 --> 00:03:00,020
are the AIC and the BIC. AIC stands for Akaike information criterion and BIC stands for Bayesian information

36
00:03:00,020 --> 00:03:00,700
criterion.

37
00:03:01,550 --> 00:03:03,980
The intuition behind both of these is the same.

38
00:03:04,790 --> 00:03:08,540
When we are building machine learning models, we often have to make a trade-off.

39
00:03:09,020 --> 00:03:13,220
This trade-off happens between model complexity and model accuracy.

40
00:03:13,910 --> 00:03:18,500
For ARIMA, model complexity means increasing the values of P and Q.

41
00:03:19,100 --> 00:03:23,930
Recall that P is the number of past data points to include in the model, and Q is the number of

42
00:03:23,930 --> 00:03:26,170
past errors to include in the model.

43
00:03:26,870 --> 00:03:32,040
You can imagine that as we add more and more terms, the model will get more and more accurate.

44
00:03:32,600 --> 00:03:38,510
In fact, if you studied linear regression with me in the past, then you know that even adding completely

45
00:03:38,510 --> 00:03:41,600
random noise as an input will increase the in-sample accuracy of your model.

46
00:03:42,080 --> 00:03:43,030
This is not good.

47
00:03:43,490 --> 00:03:45,480
How do we know when enough is enough?

48
00:03:45,770 --> 00:03:47,390
How do we know when we've overdone it?

49
00:03:52,430 --> 00:03:58,430
In statistics, the answer is to penalize the model complexity. Again, if you've studied with me in

50
00:03:58,430 --> 00:04:00,980
the past, then you know what I'm about to discuss.

51
00:04:01,490 --> 00:04:02,210
If you haven't,

52
00:04:02,240 --> 00:04:02,950
that's OK,

53
00:04:03,140 --> 00:04:06,680
but feel free to ask me about this on the Q&A if you want to learn more.

54
00:04:07,550 --> 00:04:13,510
It turns out that the loss function when we optimize these ARIMA models is the negative log-likelihood.

55
00:04:13,970 --> 00:04:20,720
It also turns out that, for the most part, minimizing the negative log-likelihood is equivalent to minimizing

56
00:04:20,720 --> 00:04:21,560
the squared error.

57
00:04:22,860 --> 00:04:28,440
So this doesn't contradict anything I said earlier about minimizing the squared error of the predictions.

58
00:04:28,890 --> 00:04:32,790
The log-likelihood is more general, however, since it can account for variance.
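
To make the link between the two losses concrete, here is a short worked equation. It assumes the one-step errors e_t are independent and Gaussian with variance sigma^2, which is an assumption added for this note rather than something stated in the lecture; n is the number of data points.

```latex
-\log L
  = \frac{n}{2}\,\log\!\left(2\pi\sigma^{2}\right)
  + \frac{1}{2\sigma^{2}} \sum_{t=1}^{n} e_t^{2}
```

For a fixed variance, the first term is a constant, so minimizing the negative log-likelihood over the model coefficients is the same as minimizing the sum of squared errors. The variance appears explicitly, which is the sense in which the log-likelihood is more general.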

59
00:04:33,450 --> 00:04:37,540
Now, if we only look at the log likelihood, we might end up overfitting.

60
00:04:38,190 --> 00:04:42,090
So what we do is we add a penalty term to the negative log-likelihood.

61
00:04:42,870 --> 00:04:48,270
The main difference between the AIC and the BIC is that this penalty term is computed differently.

62
00:04:53,260 --> 00:04:59,350
So just in case you're curious, the AIC and the BIC are defined as follows. For both of these, you

63
00:04:59,350 --> 00:05:01,910
will have two times the negative log-likelihood.

64
00:05:02,710 --> 00:05:05,430
So in these equations, L represents the likelihood.

65
00:05:05,440 --> 00:05:11,050
And so each of these contains the term minus 2 log L. For the AIC,

66
00:05:11,050 --> 00:05:17,620
the log-likelihood term is penalized by adding two times the number of parameters in the model. For the BIC,

67
00:05:17,620 --> 00:05:22,270
it's penalized by adding the number of parameters in the model times the log of the number of data

68
00:05:22,270 --> 00:05:22,940
points.
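
Written out, the two definitions described above are as follows. The symbols k (number of parameters in the model) and n (number of data points) are names chosen for this note; L is the likelihood, as in the lecture.

```latex
\mathrm{AIC} = -2\log L + 2k
\qquad
\mathrm{BIC} = -2\log L + k\log n
```

The BIC penalty per parameter, log n, exceeds the AIC penalty of 2 as soon as n is larger than about 7 (e^2 is roughly 7.39), so on any realistic series the BIC penalizes extra parameters more heavily.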

69
00:05:23,410 --> 00:05:26,010
So they both do the same thing just slightly differently.
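
As a small sketch of how the two criteria are computed, the code below evaluates both for two candidate models. The log-likelihood values, parameter counts, and sample size are made up for illustration and were chosen deliberately so that the two criteria disagree; in practice they often agree.

```python
import math


def aic(log_likelihood, k):
    """AIC = -2 log L + 2k, where k is the number of parameters."""
    return -2.0 * log_likelihood + 2.0 * k


def bic(log_likelihood, k, n):
    """BIC = -2 log L + k log n, where n is the number of data points."""
    return -2.0 * log_likelihood + k * math.log(n)


n = 200  # hypothetical number of data points

# Hypothetical fits: the larger model has a slightly better log-likelihood.
small = {"log_likelihood": -310.0, "k": 3}
large = {"log_likelihood": -306.0, "k": 6}

aic_small = aic(small["log_likelihood"], small["k"])
aic_large = aic(large["log_likelihood"], large["k"])
bic_small = bic(small["log_likelihood"], small["k"], n)
bic_large = bic(large["log_likelihood"], large["k"], n)

print("AIC:", aic_small, aic_large)  # lower is better
print("BIC:", bic_small, bic_large)  # lower is better
```

Here the AIC prefers the larger model, while the BIC, with its heavier penalty, prefers the smaller one.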

70
00:05:26,890 --> 00:05:33,430
Auto ARIMA happens to use the AIC by default, although it's often said that both of these usually lead

71
00:05:33,430 --> 00:05:34,810
to the same answer anyway.

72
00:05:35,620 --> 00:05:40,420
For some reason, statisticians always discuss both of these simultaneously, even though you always

73
00:05:40,420 --> 00:05:41,920
end up having to choose just one.

74
00:05:46,880 --> 00:05:51,560
All right, so now that we've discussed the statistics way of doing model selection, I want to briefly

75
00:05:51,560 --> 00:05:55,030
discuss how a machine learning person might go about this task.

76
00:05:55,610 --> 00:05:58,180
Note that this is entirely theoretical at this point.

77
00:05:58,550 --> 00:06:02,000
We are not going to use this method, and I only mention this out of interest.

78
00:06:02,990 --> 00:06:07,340
In machine learning, we often don't care that much about the number of parameters in a model.

79
00:06:08,000 --> 00:06:12,040
One reason for this is that you might be comparing different kinds of models.

80
00:06:12,500 --> 00:06:17,630
If you're comparing a decision tree to a neural network, for example, they are not really comparable

81
00:06:17,630 --> 00:06:18,410
in that way.

82
00:06:19,160 --> 00:06:22,370
Another reason is that, for modern methods such as deep learning,

83
00:06:22,580 --> 00:06:26,410
it often doesn't matter. In the pre-deep-learning era,

84
00:06:26,450 --> 00:06:29,660
that is, before people even came up with the phrase deep learning,

85
00:06:29,960 --> 00:06:35,540
you will see a lot of papers on neural networks talking about this idea of comparing the number of parameters

86
00:06:35,540 --> 00:06:36,800
to the number of samples.

87
00:06:37,340 --> 00:06:43,070
This makes a lot of sense when you look at linear regression because the actual data matrix X has a

88
00:06:43,080 --> 00:06:48,500
number of rows equal to the number of samples and a number of columns equal to the number of parameters.

89
00:06:49,400 --> 00:06:54,440
So this makes sense when you consider linear algebra and the solution for linear regression and so on.

90
00:06:55,560 --> 00:07:00,750
People used to extend this thinking to neural networks, but today we have found that neural networks

91
00:07:00,750 --> 00:07:06,930
don't actually behave badly when you have many more parameters than samples. Today, you can have neural

92
00:07:06,930 --> 00:07:11,910
networks with billions of parameters that cost the equivalent of millions of dollars to train.

93
00:07:12,420 --> 00:07:16,020
So it's not really model complexity that we care about in machine learning.

94
00:07:20,790 --> 00:07:26,760
In machine learning, what we really care about is the ability to generalize. That is to say, we don't

95
00:07:26,760 --> 00:07:30,380
want our model to be accurate only on the data that it was trained on.

96
00:07:30,780 --> 00:07:33,640
We want it to be accurate for data that it hasn't seen yet.

97
00:07:34,200 --> 00:07:37,280
This is important for pretty much everyone that does machine learning.

98
00:07:38,580 --> 00:07:43,830
For example, if you're building a recommender system, those recommendations will be going to people

99
00:07:43,980 --> 00:07:48,520
who have not yet seen the movies or purchased the products that you were going to recommend.

100
00:07:49,110 --> 00:07:54,690
If you're building a fraud detection system, you want to detect future fraud, not past fraud, which

101
00:07:54,690 --> 00:07:55,950
you already know is fraud.

102
00:07:56,820 --> 00:08:01,740
If you're building a time series forecasting model, you want the forecast to be accurate, not just

103
00:08:01,740 --> 00:08:03,630
on the data from before the forecast.

104
00:08:04,710 --> 00:08:10,380
For many models, this is highly correlated to model complexity, which explains why model complexity

105
00:08:10,650 --> 00:08:12,030
was something people cared about

106
00:08:12,030 --> 00:08:12,900
in the past.

107
00:08:13,320 --> 00:08:18,630
You would see plots such as the following, where test performance starts to degrade when the model

108
00:08:18,630 --> 00:08:20,190
becomes too complex.

109
00:08:24,790 --> 00:08:30,940
So given this information, what might we want to do instead, if our task is to choose the best model?

110
00:08:31,660 --> 00:08:36,400
Well, why not simply check the out-of-sample accuracy or the test accuracy directly?

111
00:08:37,630 --> 00:08:44,080
In other words, why bother checking the AIC or the BIC when we can evaluate each model according to

112
00:08:44,080 --> 00:08:45,670
its out-of-sample accuracy?

113
00:08:46,300 --> 00:08:51,780
This seems like it will more closely do what we want to do rather than adding some penalty term to the

114
00:08:51,970 --> 00:08:58,720
in-sample accuracy, which we already know is less relevant. In the scenario where we are using out-of-sample

115
00:08:58,720 --> 00:09:00,520
data to choose hyperparameters,

116
00:09:00,790 --> 00:09:04,150
we would call this the validation set rather than the test set.

117
00:09:04,840 --> 00:09:08,620
You might use methods such as cross-validation to choose your hyperparameters.
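
As a minimal sketch of choosing a hyperparameter by validation accuracy instead of by AIC or BIC: the code below scores each candidate by its one-step-ahead squared error on the last part of the series. To stay self-contained it uses a simple moving-average forecaster as a stand-in for ARIMA, with the window length playing the role of the hyperparameter. The series, the candidate list, and all function names are assumptions made for this example.

```python
def moving_average_forecast(history, window):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def validation_mse(series, window, n_val):
    """One-step-ahead MSE on the last n_val points.

    Each forecast only uses data from before the point being predicted,
    so the validation points are never seen in advance.
    """
    split = len(series) - n_val
    errors = []
    for t in range(split, len(series)):
        pred = moving_average_forecast(series[:t], window)
        errors.append((series[t] - pred) ** 2)
    return sum(errors) / len(errors)


# Toy series with a repeating pattern 0, 1, 2, 3 (made up for illustration).
series = [float(t % 4) for t in range(40)]

candidates = [1, 2, 3, 4, 5, 6]
scores = {w: validation_mse(series, w, n_val=12) for w in candidates}
best_window = min(scores, key=scores.get)
print(best_window, scores[best_window])
```

Note that the split respects time order: the validation points always come after the data used to make each forecast, which is what distinguishes time series validation from shuffled cross-validation.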

118
00:09:09,310 --> 00:09:12,520
So that makes sense when what you care about is good accuracy.

119
00:09:13,750 --> 00:09:20,320
However, one reason to use the AIC and the BIC is that we really are trying to achieve the simplest

120
00:09:20,320 --> 00:09:21,210
model possible.

121
00:09:22,000 --> 00:09:28,420
As you recall, one of our main motivations is modeling the data and explaining how it arose rather

122
00:09:28,420 --> 00:09:29,800
than making predictions.

123
00:09:30,250 --> 00:09:36,160
If this is our motivation, then it makes sense to want the simplest model possible that can adequately

124
00:09:36,160 --> 00:09:39,780
explain the data that we saw. In statistics,

125
00:09:39,970 --> 00:09:41,740
sometimes we call this parsimony.

126
00:09:42,460 --> 00:09:45,490
We say that we want to find the most parsimonious model.
