Hello, everyone. In this lecture, we'll start fitting SARIMA processes to real-world datasets, and the first dataset we're going to look at is something familiar: the quarterly earnings per Johnson & Johnson share. Our objectives are to fit SARIMA models to the quarterly earnings of Johnson & Johnson shares, and to forecast future values of the examined time series, which in this case means those earnings. When we do the modeling, we're going to follow a few steps, some of which we've already talked about before. The first thing we do, as always, is look at the time plot of the dataset and try to see if there is an outlier, a change in the trend, a change in the variance, and so forth. If we need to, we transform the dataset; a transformation will help us, for example, stabilize the variance. Then, if we need to remove a trend or a seasonal trend, we difference: we can do non-seasonal differencing, seasonal differencing, or both at the same time. As we go, we'll also use the Ljung-Box test to check whether there is autocorrelation with previous lags. We'll use the ACF as one of our tools: spikes at the low, non-seasonal lags suggest the MA order, and spikes around the seasonal lags suggest the SMA order, in other words the seasonal moving average order. We'll look at the PACF, the partial autocorrelation function: spikes at the low lags suggest the autoregressive order, and spikes around the seasonal lags suggest the seasonal autoregressive order. Then we'll fit a few different models, compare their information criteria, and choose the model with the minimum AIC. As we do all of this, we'll keep in mind the parsimony principle, which I highlighted in green here.
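In R, the exploratory steps just described can be sketched roughly as follows. This is a minimal sketch, not the lecture's slide code verbatim; it assumes the `jj` series from the `astsa` package, and the variable names `y` and `d` are mine:

```r
# Exploratory workflow sketch for the Johnson & Johnson series
library(astsa)           # provides the quarterly earnings series jj

plot(jj)                 # time plot: look for outliers, trend, changing variance
y <- log(jj)             # transformation: log stabilizes the variance
d <- diff(y)             # non-seasonal differencing removes the trend
acf(d)                   # close spikes -> MA order; seasonal-lag spikes -> SMA order
pacf(d)                  # close spikes -> AR order; seasonal-lag spikes -> SAR order
```

The seasonal differencing and Ljung-Box steps follow the same pattern and come up later in the lecture.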
I'll talk more about this on the next slide; we've already touched on it a little. We are trying to find the simplest model that fits the data, and in this lecture I'm going to quantify what I mean by the parsimony principle. Once we have our model, chosen using the parsimony principle and by comparing AICs and taking the minimum, we're going to look at the residuals: the time plot of the residuals, the ACF and PACF of the residuals, and the Ljung-Box test on the residuals. We expect to see white noise; these last two steps of residual analysis tell us whether or not the residuals look like white noise. For the parsimony principle, we'll use the following. A SARIMA model is written (p, d, q) x (P, D, Q)_s, where s is the span of the seasonality, p is the order of the autoregressive process, d is the order of non-seasonal differencing, q is the order of the moving average process, P is the order of the seasonal autoregressive process, D is the order of seasonal differencing, and Q is the order of the seasonal moving average process. If I add these up, I do not want an overly complicated model; we do not want to overfit the time series. So we're going to use the parsimony principle that these parameters should add up to something less than or equal to six: p + d + q + P + D + Q <= 6. So let's start looking at the Johnson & Johnson dataset. Again, this is the quarterly earnings per Johnson & Johnson share, from 1960 until 1980. The source is the astsa package, which accompanies the book by Shumway and Stoffer, Time Series Analysis and Its Applications. And let's remember that the Johnson & Johnson earnings data is a quarterly time series. As before, we look at the time plot. As you can see from the time plot, there is definitely an upward trend, and as the trend goes up, the variance is changing.
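The parsimony rule is easy to apply mechanically when enumerating candidate models. A small illustrative helper, not from the lecture itself (the names `candidates` and `parsimonious` are mine):

```r
# Parsimony filter: keep only candidate orders with p + d + q + P + D + Q <= 6.
# Here d = D = 1 and the other orders are 0 or 1, as in this lecture.
candidates   <- expand.grid(p = 0:1, d = 1, q = 0:1, P = 0:1, D = 1, Q = 0:1)
parsimonious <- subset(candidates, p + d + q + P + D + Q <= 6)
nrow(parsimonious)   # all 16 combinations pass, since the sum is at most 6
```

With larger candidate orders the filter would start discarding over-complicated models.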
We have a higher variance here and a lower variance here. That tells me I have what is called heteroscedasticity: the variance is changing, in this case increasing. We also have seasonality, by the nature of the data: it's quarterly earnings, and every quarter we might see some cyclic behavior. So the first thing we do is transform, as we talked about before. The transformation is basically the logarithm: we take the logarithm of the data to stabilize the variance. And to remove the trend, we take the difference, so the difference of the logarithm of the dataset. This is also called the log return, specifically in financial time series: r_t is the difference of the logarithms, in other words the logarithm of the ratio. As a side note, we are not modelling r_t directly; we are modelling the logarithm of the data, with the differencing built into the SARIMA model. This is the time series of the log returns, and we hope to see a stationary time series here. We can see that the variance in the middle part of the data differs from that near the end points, but we're going to ignore that, or say, okay, maybe re-stabilize it by taking a lower [INAUDIBLE]. If I look at the ACF and PACF, as you can see we have strong autocorrelation at lag four and lag eight, and that is because of the seasonality. So we would like to take a seasonal difference as well; in this case the capital D is going to be 1. In R, we take the transformed and differenced data and difference it again with lag 4: this is the seasonal differencing. If we plot the dataset, the lower panel now shows jj differenced both seasonally and non-seasonally, and we have a roughly stationary time series here. So the next thing we do, as we said, is the Ljung-Box test. The Ljung-Box test is basically Box.test in R, and we're going to take the lag to be the logarithm of the length of the data.
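The transformation and the two rounds of differencing described above look like this in R (a sketch; the intermediate names `y`, `dy`, `ddy` are my own):

```r
# Transform and difference the series
library(astsa)
y   <- log(jj)            # log transform: stabilizes the variance
dy  <- diff(y)            # non-seasonal differencing (d = 1): the log returns
ddy <- diff(dy, lag = 4)  # seasonal differencing with span 4 (D = 1)

par(mfrow = c(2, 1))
plot(dy,  main = "diff(log(jj))")            # trend removed
plot(ddy, main = "diff(diff(log(jj)), 4)")   # seasonal pattern removed too
```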
This is a common convention. Then we look at the p-value, and the p-value is very, very small. If the p-value is small, we reject the null hypothesis that there is no autocorrelation at the previous lags; so there is some autocorrelation, and we're going to find it using the ACF and PACF. So let's look at them: this is the ACF of the resulting data, and this is the PACF of the resulting data. In the ACF, if I look at the low lags, I have a spike at lag 1 and then it dies off, so this suggests an MA(1) model: the order of the moving average term would be one. If I look at the seasonal lags, the first one is at lag four (seasonal period one, so lag four). It is almost significant: not truly significant, because it stays below and does not cross the dashed line, but it's close, so we'll entertain it. We might have some seasonal correlation, and this suggests maybe an order-1 seasonal moving average term. If I look at the PACF, there's a significant spike at lag 1, and then it dies off; this suggests that maybe the order of the autoregressive term is one. I also see a significant autocorrelation at lag four, which suggests that maybe the order of the seasonal autoregressive term is one. After that, the autocorrelation dies off. Okay, so the ACF told us that q is either 0 or 1, and we'll look at both, and likewise capital Q is 0 or 1. The partial autocorrelation told us that p is maybe 0 or 1, and capital P is 0 or 1. So we'll look at SARIMA(p, 1, q) x (P, 1, Q)_4 models, where 4 is the span of the seasonality. These are models for the logarithm of the data, and we've just determined that p, q, capital P, and capital Q will each be either zero or one. We're going to use the arima routine in R: we give it the order for the non-seasonal part, and then the seasonal part including the period.
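The Ljung-Box test and a single candidate fit can be written as follows. This is a sketch under my naming (`ddy`, `fit`); the lag choice follows the log-of-sample-size convention mentioned above:

```r
library(astsa)
ddy <- diff(diff(log(jj)), lag = 4)

# Ljung-Box test: small p-value => autocorrelation remains at previous lags
Box.test(ddy, lag = round(log(length(ddy))), type = "Ljung-Box")

# One candidate model: SARIMA(0,1,1)x(0,1,1)_4 fitted to the log data
fit <- arima(log(jj), order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 4))
```

The `order` argument is the non-seasonal (p, d, q) and the `seasonal` list carries (P, D, Q) together with the period, exactly the two parts described in the lecture.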
We carry this out for all the possible values of p, q, P, and Q, and we print the results. The first six numbers in each row are the orders p, d, q, P, D, Q; next comes s, which is 4 for all of them. Then we look at the AIC values, the Akaike Information Criterion. We also look at the sum of squared errors, the SSE, and at the p-value of the Ljung-Box test on the residuals, because we do not want autocorrelation left in the residuals. So we want a high p-value, and we want the smallest AIC and the smallest SSE. Our principle is to choose the smallest AIC, and I've highlighted it here: the AIC is -150.9134, even though the smallest SSE in this output is here, which corresponds to a different model. We're going to go with the smallest AIC; because it is negative, this is the smallest AIC. So the model we'll agree on is basically (0, 1, 1) x (1, 1, 0)_4. And you can see that its Ljung-Box p-value is high, so we cannot reject the null hypothesis that there is no autocorrelation left in the residuals. So our model is SARIMA(0,1,1)x(1,1,0)_4. Remember, Xt is our earnings, but the model we found is for Yt: we transformed Xt, and the logarithm of Xt is called Yt. We fit the SARIMA model using the arima routine we discussed and obtain the following result. Here ma1 labels the coefficient of the order-1 moving average term, and the other entry is the coefficient of the seasonal autoregressive term; below them are the standard errors, and the p-values are so small that both coefficients are very significant. We could also use the sarima routine from the astsa package: instead of calling arima ourselves, sarima takes the logarithm of the data (this is the transformation), then the parameters of our model, and at the end the period. That gives us the results we obtained, along with this residual analysis.
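The comparison table described above can be produced with a loop like the following. This is my reconstruction of the procedure, not the lecture's exact code; the lag choice and variable names are assumptions:

```r
# Grid search over p, q, P, Q in {0, 1}, with d = D = 1 and s = 4
library(astsa)
y <- log(jj)
for (p in 0:1) for (q in 0:1) for (P in 0:1) for (Q in 0:1) {
  fit <- arima(y, order = c(p, 1, q),
               seasonal = list(order = c(P, 1, Q), period = 4))
  sse <- sum(fit$residuals^2)                       # sum of squared errors
  lb  <- Box.test(fit$residuals,
                  lag = round(log(length(y))),
                  type = "Ljung-Box")$p.value       # want this to be high
  cat(p, 1, q, P, 1, Q, 4,
      "AIC:", fit$aic, "SSE:", sse, "p-value:", lb, "\n")
}
```

Each printed row matches the lecture's output format: the seven orders, then AIC, SSE, and the Ljung-Box p-value of the residuals.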
This is the time plot of the residuals: no evidence that this is not white noise. In the Q-Q plot we see almost a straight line, so the residuals are approximately normal. We have no significant autocorrelation coefficients, and all the p-values of the Ljung-Box statistics tell me there is no autocorrelation left in the residuals. Okay, so if I want to write the model out, remember that Xt is our earnings, Yt is the logarithm of Xt, and Yt follows the SARIMA model. We have (1 - B), the non-seasonal differencing, and (1 - B^4), the seasonal differencing, where four is the span of the seasonality. We also have (1 - Phi B^4), because I have a seasonal autoregressive term; there is no non-seasonal autoregressive term. On the right-hand side I have my noise multiplied by the polynomial coming from the non-seasonal moving average. If I expand these, Yt becomes the following. In the fitted output, the first number is my theta, the moving average coefficient, and the second number is my Phi, the seasonal autoregressive coefficient. If I plug them in, I obtain the following model. This is my fitted model for the logarithm of the earnings, where Yt is the logarithm of Xt, and Zt, my noise, is normal with expectation 0 and variance 0.0079. We can also forecast at this point. We can use the forecast function from the forecast package: if I write plot(forecast(model)) for the model I've specified, we obtain the forecast for the next two cycles, basically the next two years, 1981 and 1982. If I type forecast(model) into R, it prints the point estimates together with the 80% and 95% prediction intervals for my forecast over those next two years, 1981 and 1982. So what have we learned? We have learned how to fit SARIMA models to the quarterly earnings of Johnson & Johnson shares, and we have also learned how to forecast future values of a time series.
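The forecasting step can be sketched in R as follows (the variable name `model` is mine; `h = 8` requests eight quarters, i.e., the two years 1981-1982):

```r
# Fit the chosen SARIMA(0,1,1)x(1,1,0)_4 model and forecast two years ahead
library(astsa)
library(forecast)

model <- arima(log(jj), order = c(0, 1, 1),
               seasonal = list(order = c(1, 1, 0), period = 4))

plot(forecast(model, h = 8))   # point forecasts with 80% and 95% intervals
forecast(model, h = 8)         # printed point estimates and intervals
```

Note that these forecasts are on the log scale, since the model was fitted to log(jj); exponentiating the point forecasts maps them back toward the original earnings scale.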