1
00:00:11,090 --> 00:00:16,610
So in this lecture, we will be looking at GARCH more in depth, and we will describe conceptually how

2
00:00:16,610 --> 00:00:19,470
you might extend this model to deep neural networks.

3
00:00:20,150 --> 00:00:25,130
So as we've gone through this course, you may have noticed one omission, which is that we didn't spend

4
00:00:25,130 --> 00:00:28,310
much time on discussing how these models are actually trained.

5
00:00:28,970 --> 00:00:34,130
This is because this is an application-focused course rather than a math and programming type

6
00:00:34,130 --> 00:00:39,890
of course. In this lecture, we will dive into some of these details, which I believe will help you

7
00:00:39,890 --> 00:00:41,170
understand further.

8
00:00:42,620 --> 00:00:47,420
So with any machine learning or statistical model, there are really two steps of learning how they

9
00:00:47,420 --> 00:00:47,940
work.

10
00:00:48,470 --> 00:00:53,660
The first step is necessary, and it's the one we've covered in this course and in any other course I teach.

11
00:00:54,230 --> 00:00:56,840
That is, how does the model make predictions?

12
00:00:57,380 --> 00:01:00,410
Or in other words, how do we get from the input to the output?

13
00:01:01,770 --> 00:01:05,880
Often this simply means learning the equation that tells us how to do this.

14
00:01:06,570 --> 00:01:12,500
As usual, this involves your typical mathematical operations like adding, multiplying and so forth.

15
00:01:13,230 --> 00:01:18,690
As you've seen for GARCH, this involves the model parameters: omega, the alphas, and the betas.

16
00:01:19,260 --> 00:01:23,560
What we have not yet discussed is how these parameters are actually found.

17
00:01:24,090 --> 00:01:27,500
This is the second step. Now, for deep learning,

18
00:01:27,510 --> 00:01:29,770
we've kind of discussed this, but not really.

19
00:01:30,360 --> 00:01:35,490
We know that we call some function called fit and we know that this does something like gradient descent,

20
00:01:35,790 --> 00:01:37,760
which looks like you're traveling down a hill.

21
00:01:38,520 --> 00:01:42,120
That's nice, but that's only about 10 percent of what you really need to know

22
00:01:42,330 --> 00:01:44,640
if you want to make something like this yourself.

23
00:01:45,300 --> 00:01:50,610
That's not to say understanding those things isn't important, but it does say that the journey to mastery

24
00:01:50,610 --> 00:01:52,110
may be longer than expected.

25
00:01:52,590 --> 00:01:55,490
Of course, this lecture will take you closer to that point.

26
00:02:00,330 --> 00:02:06,840
OK, so the central item of interest when we want to fit or train a model is the objective. For deep

27
00:02:06,840 --> 00:02:09,980
learning, ARIMA, and many other machine learning models,

28
00:02:10,260 --> 00:02:12,060
this is based on the log-likelihood.

29
00:02:12,870 --> 00:02:17,820
Basically, our goal is to maximize the likelihood with respect to the model parameters.

30
00:02:18,330 --> 00:02:24,000
When we find the model parameters that maximize or minimize this objective, we consider the model to

31
00:02:24,000 --> 00:02:24,780
be trained.

32
00:02:25,590 --> 00:02:30,480
Now, it's not my goal in this lecture to explain maximum likelihood from scratch, but you can check

33
00:02:30,480 --> 00:02:34,470
the resources in extra_reading.txt if you would like to learn more.

34
00:02:35,310 --> 00:02:40,500
This lecture is more geared towards those who already understand maximum likelihood estimation.

35
00:02:42,400 --> 00:02:46,910
So normally when we want to perform regression, we use the squared error objective.

36
00:02:47,350 --> 00:02:51,130
I know you've all seen this before, so I won't bother to explain this in detail.

37
00:02:52,330 --> 00:02:57,700
Now, if you've taken my in-depth courses before, then you also know that minimizing the sum of squared

38
00:02:57,700 --> 00:03:03,500
errors is the same as maximizing the log likelihood when the distribution of errors is normal.

39
00:03:04,300 --> 00:03:09,610
That is, when you use the squared error objective, you assume that your errors come from a normal

40
00:03:09,610 --> 00:03:10,470
distribution.

41
00:03:11,350 --> 00:03:15,610
What you may not have known is that this also requires you to make another assumption.

42
00:03:16,300 --> 00:03:19,840
This assumption is that the variance of those errors is constant.

43
00:03:20,350 --> 00:03:26,200
In order to understand why this is the case, we're going to review the steps to prove that minimizing

44
00:03:26,200 --> 00:03:30,610
the squared error is equivalent to maximizing the likelihood of the normal.

45
00:03:35,410 --> 00:03:40,720
OK, so in order to maximize the likelihood, we start by writing out the likelihood function, which

46
00:03:40,720 --> 00:03:47,290
makes use of the normal PDF. As you recall, we are assuming that our targets, which form the true

47
00:03:47,290 --> 00:03:52,780
time series, are samples from the normal, where the mean is the model prediction and the variance

48
00:03:52,780 --> 00:03:54,970
is some constant sigma squared.

49
00:03:55,750 --> 00:04:02,110
Also recall that this is equivalent to saying that the target y of t is equal to the mean y hat of t

50
00:04:02,380 --> 00:04:10,300
plus epsilon of t, where in this case epsilon of t is an i.i.d. sample from the normal with mean zero and variance

51
00:04:10,300 --> 00:04:14,060
sigma squared. Also, for completeness' sake,

52
00:04:14,080 --> 00:04:20,200
we'll suppose that y hat of t is some function of the previous lags and some model parameters,

53
00:04:20,200 --> 00:04:21,500
which we will just call theta.

54
00:04:22,210 --> 00:04:27,370
This part isn't necessary, but it helps to frame this in the same way as our autoregressive models

55
00:04:27,370 --> 00:04:28,360
from this course.

56
00:04:29,740 --> 00:04:34,200
OK, so from here we can write our likelihood function as big L of theta.

57
00:04:34,810 --> 00:04:39,420
You should recognize the formula inside the product, which is just the normal PDF.

58
00:04:40,360 --> 00:04:45,240
As you know, the next step is to take the log, which simplifies the optimization.

59
00:04:46,210 --> 00:04:52,240
So as an exercise, you may want to go through these steps slowly by yourself in case you forgot exactly

60
00:04:52,240 --> 00:04:53,200
how this works.

61
00:04:57,800 --> 00:05:02,010
Now, from this log-likelihood, we know that this expression can be simplified.

62
00:05:02,750 --> 00:05:08,720
Firstly, let's recognize that our objective is to find the theta which maximizes this objective.

63
00:05:09,650 --> 00:05:15,300
In other words, the theta we want, call it theta star, is the argmax of the log-likelihood.

64
00:05:16,040 --> 00:05:19,760
Now you'll recognize that Theta doesn't actually appear on the right side.

65
00:05:20,150 --> 00:05:22,770
Just remember that Y hat is a function of theta.

66
00:05:22,910 --> 00:05:25,610
I'm just not showing it here. Anyway,

67
00:05:25,610 --> 00:05:29,150
once we have this, it's easy to see how this can be simplified.

68
00:05:30,230 --> 00:05:33,440
Notice that the log term does not depend on theta at all.

69
00:05:33,950 --> 00:05:37,520
Two pi is just a number, as is the variance sigma squared.

70
00:05:38,000 --> 00:05:42,650
So if we simply ignored this term, the optimal theta star would still be the same.

71
00:05:44,440 --> 00:05:49,780
The next thing to notice is that the one half and the sigma squared are just multiplicative constants.

72
00:05:50,200 --> 00:05:55,900
For example, the extreme point of X squared is the same as the extreme point of two X squared.

73
00:05:56,260 --> 00:05:58,900
Multiplying by a constant makes no difference.

74
00:05:59,260 --> 00:06:01,060
So we can remove those as well.

75
00:06:03,100 --> 00:06:07,550
The final thing to recognize is that the minus sign can also be removed.

76
00:06:08,110 --> 00:06:13,670
Basically, we know that maximizing minus X squared is the same as minimizing X squared.

77
00:06:14,020 --> 00:06:15,910
We get the same X either way.

78
00:06:16,480 --> 00:06:22,000
So to simplify this, we can remove the minus sign and change the argmax to an argmin.

79
00:06:22,780 --> 00:06:26,500
Of course, this is just the sum of squared errors as promised.
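The simplification steps described above can be written out as follows. This is only a sketch: the number of time steps $T$ and the exact notation are assumptions, since the slides are not part of the transcript.

```latex
\begin{align*}
L(\theta) &= \prod_{t=1}^{T} \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left( -\frac{(y_t - \hat{y}_t)^2}{2\sigma^2} \right) \\
\log L(\theta) &= \sum_{t=1}^{T} \left[ -\frac{1}{2}\log(2\pi\sigma^2)
  - \frac{(y_t - \hat{y}_t)^2}{2\sigma^2} \right] \\
\theta^* &= \arg\max_{\theta} \sum_{t=1}^{T} -\frac{(y_t - \hat{y}_t)^2}{2\sigma^2}
  && \text{(the log term does not depend on } \theta\text{)} \\
&= \arg\max_{\theta} \; -\sum_{t=1}^{T} (y_t - \hat{y}_t)^2
  && \text{(drop the constant } 1/(2\sigma^2)\text{)} \\
&= \arg\min_{\theta} \sum_{t=1}^{T} (y_t - \hat{y}_t)^2
  && \text{(flip the sign)}
\end{align*}
```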

80
00:06:31,000 --> 00:06:36,610
OK, so at this point, we recognize that in the previous derivation of the squared error from the

81
00:06:36,610 --> 00:06:42,820
likelihood, we made one critical assumption: we assumed that the variance of the error is constant.

82
00:06:43,480 --> 00:06:48,610
Of course, this whole section is about exploring what happens when the variance of the error is not

83
00:06:48,610 --> 00:06:49,340
constant.

84
00:06:49,870 --> 00:06:54,720
So in order to understand how GARCH is trained, we simply have to go back a few steps.

85
00:06:55,480 --> 00:06:58,970
This is our log likelihood again, but with one small difference.

86
00:06:59,500 --> 00:07:06,280
Now I've written sigma with an argument t. This is to signify that sigma is no longer a constant.

87
00:07:07,600 --> 00:07:12,760
On the left side, we can think of Theta as a collection of model parameters, whether they are for

88
00:07:12,760 --> 00:07:14,720
Y hat or sigma or both.

89
00:07:15,670 --> 00:07:21,190
So this time, because sigma is not constant, it can no longer be dropped from the argmax.

90
00:07:22,300 --> 00:07:27,740
One small simplification we can make is to drop the two pi since that part is still constant.

91
00:07:28,390 --> 00:07:33,430
We can also drop the one half and the negative signs and turn the objective into something we want to

92
00:07:33,430 --> 00:07:34,330
minimize.

93
00:07:36,370 --> 00:07:42,130
The final step we can take to make this look more like GARCH is to recall that our mean model is normally

94
00:07:42,130 --> 00:07:49,190
just zero, and hence in the numerator we simply have epsilon of t squared, where epsilon of t is the error time series.

95
00:07:49,690 --> 00:07:54,190
So this is the objective function you would use if you wanted to train a GARCH.
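Written out, the non-constant-variance objective described above looks roughly like this. Again a sketch, with $T$ and the notation assumed rather than taken from the slides.

```latex
\begin{align*}
\log L(\theta) &= \sum_{t=1}^{T} \left[ -\frac{1}{2}\log\!\left(2\pi\sigma_t^2\right)
  - \frac{(y_t - \hat{y}_t)^2}{2\sigma_t^2} \right] \\
\theta^* &= \arg\min_{\theta} \sum_{t=1}^{T} \left[ \log \sigma_t^2
  + \frac{(y_t - \hat{y}_t)^2}{\sigma_t^2} \right]
  && \text{(drop } 2\pi \text{ and } \tfrac{1}{2}\text{, flip the sign)} \\
&= \arg\min_{\theta} \sum_{t=1}^{T} \left[ \log \sigma_t^2
  + \frac{\epsilon_t^2}{\sigma_t^2} \right]
  && \text{(mean model } \hat{y}_t = 0\text{)}
\end{align*}
```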

96
00:07:58,770 --> 00:08:04,850
So essentially, this is all that you need to know if you want to train your own GARCH. At this point,

97
00:08:04,860 --> 00:08:09,650
you can plug this objective into any stock optimizer such as L-BFGS.

98
00:08:10,110 --> 00:08:14,220
I encourage you to check the SciPy documentation if you want to learn more.

99
00:08:15,210 --> 00:08:20,330
Now, as you recall, the objective is not the only thing you need to do the optimization.

100
00:08:20,850 --> 00:08:25,140
You also need to include the constraints to make sure that the variance is always positive.

101
00:08:25,770 --> 00:08:30,070
And of course, many of these stock optimizers allow you to include these constraints.

102
00:08:30,330 --> 00:08:31,760
So this is not a concern.
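As a rough illustration of plugging this objective into a stock optimizer, here is a minimal sketch that fits a GARCH(1,1) with SciPy's L-BFGS-B method and simple bounds. The function names, the simulated data, the starting point, and the choice to initialize the variance at the sample variance are all assumptions made for this example, not something taken from the course code. The bounds keep the variance positive but do not enforce alpha + beta < 1.

```python
import numpy as np
from scipy.optimize import minimize


def garch11_variance(params, eps):
    """Recursion sigma2[t] = omega + alpha * eps[t-1]**2 + beta * sigma2[t-1]."""
    omega, alpha, beta = params
    sigma2 = np.empty_like(eps)
    sigma2[0] = np.var(eps)  # assumed initialization: sample variance
    for t in range(1, len(eps)):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2


def garch11_objective(params, eps):
    """Sum over t of log(sigma2[t]) + eps[t]**2 / sigma2[t], to be minimized."""
    sigma2 = garch11_variance(params, eps)
    return np.sum(np.log(sigma2) + eps ** 2 / sigma2)


def simulate_garch11(omega, alpha, beta, n, seed=0):
    """Generate a synthetic error series so the example is self-contained."""
    rng = np.random.default_rng(seed)
    eps = np.empty(n)
    sigma2 = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    for t in range(n):
        eps[t] = np.sqrt(sigma2) * rng.standard_normal()
        sigma2 = omega + alpha * eps[t] ** 2 + beta * sigma2
    return eps


eps = simulate_garch11(omega=0.1, alpha=0.1, beta=0.8, n=2000)

# Bounds act as the constraints: omega > 0 and alpha, beta >= 0
# keep every sigma2[t] strictly positive.
bounds = [(1e-6, None), (0.0, 1.0), (0.0, 1.0)]
result = minimize(
    garch11_objective,
    x0=[0.05, 0.05, 0.9],
    args=(eps,),
    method="L-BFGS-B",
    bounds=bounds,
)
omega_hat, alpha_hat, beta_hat = result.x
print(omega_hat, alpha_hat, beta_hat)
```

Here the gradient is approximated numerically by SciPy, which is fine for three parameters. In practice a library such as arch handles initialization and constraints more carefully.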

103
00:08:32,760 --> 00:08:37,950
Also, recall that we derived this objective, assuming that the error is normally distributed.

104
00:08:38,370 --> 00:08:44,040
You can, if you want to, replace that with any other distribution you like, such as the Student's t or

105
00:08:44,040 --> 00:08:45,990
the skewed t or anything else.

106
00:08:50,610 --> 00:08:56,100
Now, if you've studied deep learning and neural networks, then your immediate question should be why

107
00:08:56,100 --> 00:08:57,900
do we need to stick with GARCH?

108
00:08:58,380 --> 00:09:03,290
GARCH, as you know, is a linear model, and we've seen that linear models can be limited.

109
00:09:03,900 --> 00:09:10,020
So a natural question to ask is why not simply parameterize sigma of t using a neural network?

110
00:09:10,590 --> 00:09:12,910
In fact, this is entirely possible.

111
00:09:13,620 --> 00:09:21,090
So this is an easy but also powerful way to extend GARCH without much extra work. That is, provided

112
00:09:21,090 --> 00:09:24,960
you know how to build custom functions using TensorFlow or PyTorch.

113
00:09:26,040 --> 00:09:31,140
One thing that's not immediately clear is how you can constrain the output of your neural network to

114
00:09:31,140 --> 00:09:32,050
be positive.

115
00:09:32,730 --> 00:09:37,980
As you recall, the final layer of a neural network is typically just linear regression, which can

116
00:09:37,980 --> 00:09:39,420
output any real number.

117
00:09:40,290 --> 00:09:45,530
So the way to handle this is to use an activation function that enforces positivity.

118
00:09:45,900 --> 00:09:51,510
For example, the softplus. By using the softplus, you can ensure that the output of your neural network

119
00:09:51,720 --> 00:09:53,040
is a positive value.
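To make the softplus idea concrete, here is a small NumPy sketch of a forward pass only. The network shape, the weight names, and the use of lagged squared errors as inputs are hypothetical choices for illustration. In practice you would write the same thing in TensorFlow or PyTorch and train the weights by minimizing the loss shown.

```python
import numpy as np


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x))."""
    return np.logaddexp(0.0, x)


def variance_network(features, W1, b1, w2, b2):
    """One-hidden-layer network that maps features to a variance."""
    hidden = np.tanh(features @ W1 + b1)
    raw = hidden @ w2 + b2   # final linear layer: can be any real number
    return softplus(raw)     # softplus maps it to a positive value


def negative_log_likelihood(eps, sigma2):
    """Same objective as before: sum of log(sigma2) + eps**2 / sigma2."""
    return np.sum(np.log(sigma2) + eps ** 2 / sigma2)


rng = np.random.default_rng(0)
eps = rng.standard_normal(200)  # stand-in error series for the example

# Inputs: the p previous squared errors for each target time step.
p = 3
features = np.column_stack(
    [eps[p - k - 1 : len(eps) - k - 1] ** 2 for k in range(p)]
)

# Untrained, randomly initialized weights (hypothetical sizes).
W1 = rng.standard_normal((p, 8)) * 0.1
b1 = np.zeros(8)
w2 = rng.standard_normal(8) * 0.1
b2 = 0.0

sigma2 = variance_network(features, W1, b1, w2, b2)
loss = negative_log_likelihood(eps[p:], sigma2)
print(sigma2.min(), loss)
```

One caveat: for extremely negative inputs the softplus can underflow to zero in floating point, so some implementations add a small constant to the output before taking the log.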

120
00:09:57,720 --> 00:10:02,670
One final topic to discuss in this lecture, which follows directly from what we've just discussed,

121
00:10:02,940 --> 00:10:05,980
is this idea of combining ARIMA with GARCH.

122
00:10:06,750 --> 00:10:10,790
So earlier we wrote our model down as the mean plus the error.

123
00:10:11,280 --> 00:10:17,130
So it may seem as if we can simply train an ARIMA to learn the mean, subtract it from the time series

124
00:10:17,130 --> 00:10:20,040
to get the error, and then train a GARCH on the error.

125
00:10:20,550 --> 00:10:23,860
In fact, this approach appears in a few Python blogs.

126
00:10:24,300 --> 00:10:26,730
Unfortunately, this approach is suboptimal.

127
00:10:27,660 --> 00:10:33,420
As you saw earlier, the loss function actually incorporates both the mean model and the variance model

128
00:10:33,600 --> 00:10:34,810
at the same time.

129
00:10:35,580 --> 00:10:38,370
What this means is that to find the optimal answer,

130
00:10:38,460 --> 00:10:40,420
both have to be optimized jointly.

131
00:10:41,040 --> 00:10:47,490
As mentioned, there is no ARIMA in the Python arch library, but there is one in rugarch, which is

132
00:10:47,490 --> 00:10:49,150
a library for the R language.

133
00:10:49,500 --> 00:10:52,890
So if you really need to use ARIMA-GARCH, that would be an option.

134
00:10:54,330 --> 00:11:00,480
One fact to recognise, however, is that it is often the case that an I(1) is the most parsimonious

135
00:11:00,480 --> 00:11:03,460
ARIMA for stock prices. As you recall,

136
00:11:03,690 --> 00:11:05,610
this corresponds to a random walk.

137
00:11:06,510 --> 00:11:12,360
When this is the case, there are no parameters to optimize for the ARIMA part, since I(1) does not

138
00:11:12,360 --> 00:11:13,580
have any parameters.

139
00:11:14,010 --> 00:11:19,710
So if it is the case that I(1) is the best model for the price, then there is no difference whether

140
00:11:19,710 --> 00:11:22,740
you choose to fit the model jointly or separately.