Hello, everyone. In this lecture, we'll start fitting SARIMA processes to real-world datasets, and the first dataset we're going to look at is something familiar: the quarterly earnings per Johnson & Johnson share. Our objectives are to fit SARIMA models to the quarterly earnings of Johnson & Johnson shares, and to forecast future values of the examined time series, which in this case means those earnings. When we do the modeling, we're going to follow a few steps, some of which we've already talked about before. The first thing we do, as always, is look at the time plot of the dataset and try to see if there is an outlier, a change in the trend, a change in the variance, and so forth. If we need to, we transform the dataset; a transformation will help us, for example, stabilize the variance. Then, if we need to remove a trend or a seasonal trend, we difference: we can do non-seasonal differencing, seasonal differencing, or both at the same time. As we go, we'll also use the Ljung-Box test to check whether there is autocorrelation with previous lags. We'll use the ACF as one of our tools: spikes at the low, non-seasonal lags suggest the MA order, and spikes around the seasonal lags suggest the SMA order, in other words the seasonal moving average order. We'll look at the PACF, the partial autocorrelation function: spikes at the low lags suggest the autoregressive order, and spikes around the seasonal lags suggest the seasonal autoregressive order. Then we'll fit a few different models, compare their information criteria, and choose the model with the minimum AIC. As we do all of this, we'll keep in mind the parsimony principle, which I highlighted in green here.
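In R, the exploratory steps just described can be sketched roughly as follows. This is a minimal sketch, not the lecture's slide code verbatim; it assumes the `jj` series from the `astsa` package, and the variable names `y` and `d` are mine:

```r
# Exploratory workflow sketch for the Johnson & Johnson series
library(astsa)           # provides the quarterly earnings series jj

plot(jj)                 # time plot: look for outliers, trend, changing variance
y <- log(jj)             # transformation: log stabilizes the variance
d <- diff(y)             # non-seasonal differencing removes the trend
acf(d)                   # close spikes -> MA order; seasonal-lag spikes -> SMA order
pacf(d)                  # close spikes -> AR order; seasonal-lag spikes -> SAR order
```

The seasonal differencing and Ljung-Box steps follow the same pattern and come up later in the lecture.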
I'll talk more about this on the next slide; we've already touched on it a little. We are trying to find the simplest model that fits the data, and in this lecture I'm going to quantify what I mean by the parsimony principle. Once we have our model, chosen using the parsimony principle and by comparing AICs and taking the minimum, we're going to look at the residuals: the time plot of the residuals, the ACF and PACF of the residuals, and the Ljung-Box test on the residuals. We expect to see white noise; these last two steps of residual analysis tell us whether or not the residuals look like white noise. For the parsimony principle, we'll use the following. A SARIMA model is written (p, d, q) x (P, D, Q)_s, where s is the span of the seasonality, p is the order of the autoregressive process, d is the order of non-seasonal differencing, q is the order of the moving average process, P is the order of the seasonal autoregressive process, D is the order of seasonal differencing, and Q is the order of the seasonal moving average process. If I add these up, I do not want an overly complicated model; we do not want to overfit the time series. So we're going to use the parsimony principle that these parameters should add up to something less than or equal to six: p + d + q + P + D + Q <= 6. So let's start looking at the Johnson & Johnson dataset. Again, this is the quarterly earnings per Johnson & Johnson share, from 1960 until 1980. The source is the astsa package, which accompanies the book by Shumway and Stoffer, Time Series Analysis and Its Applications. And let's remember that the Johnson & Johnson earnings data is a quarterly time series. As before, we look at the time plot. As you can see from the time plot, there is definitely an upward trend, and as the trend goes up, the variance is changing.
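The parsimony rule is easy to apply mechanically when enumerating candidate models. A small illustrative helper, not from the lecture itself (the names `candidates` and `parsimonious` are mine):

```r
# Parsimony filter: keep only candidate orders with p + d + q + P + D + Q <= 6.
# Here d = D = 1 and the other orders are 0 or 1, as in this lecture.
candidates   <- expand.grid(p = 0:1, d = 1, q = 0:1, P = 0:1, D = 1, Q = 0:1)
parsimonious <- subset(candidates, p + d + q + P + D + Q <= 6)
nrow(parsimonious)   # all 16 combinations pass, since the sum is at most 6
```

With larger candidate orders the filter would start discarding over-complicated models.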
We have a higher variance here and a lower variance here. That tells me I have what is called heteroscedasticity: the variance is changing, in this case increasing. We also have seasonality, by the nature of the data: it's quarterly earnings, and every quarter we might see some cyclic behavior. So the first thing we do is transform, as we talked about before. The transformation is basically the logarithm: we take the logarithm of the data to stabilize the variance. And to remove the trend, we take the difference, so the difference of the logarithm of the dataset. This is also called the log return, specifically in financial time series: r_t is the difference of the logarithms, in other words the logarithm of the ratio. As a side note, we are not modelling r_t directly; we are modelling the logarithm of the data, with the differencing built into the SARIMA model. This is the time series of the log returns, and we hope to see a stationary time series here. We can see that the variance in the middle part of the data differs from that near the end points, but we're going to ignore that, or say, okay, maybe re-stabilize it by taking a lower [INAUDIBLE]. If I look at the ACF and PACF, as you can see we have strong autocorrelation at lag four and lag eight, and that is because of the seasonality. So we would like to take a seasonal difference as well; in this case the capital D is going to be 1. In R, we take the transformed and differenced data and difference it again with lag 4: this is the seasonal differencing. If we plot the dataset, the lower panel now shows jj differenced both seasonally and non-seasonally, and we have a roughly stationary time series here. So the next thing we do, as we said, is the Ljung-Box test. The Ljung-Box test is basically Box.test in R, and we're going to take the lag to be the logarithm of the length of the data.
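The transformation and the two rounds of differencing described above look like this in R (a sketch; the intermediate names `y`, `dy`, `ddy` are my own):

```r
# Transform and difference the series
library(astsa)
y   <- log(jj)            # log transform: stabilizes the variance
dy  <- diff(y)            # non-seasonal differencing (d = 1): the log returns
ddy <- diff(dy, lag = 4)  # seasonal differencing with span 4 (D = 1)

par(mfrow = c(2, 1))
plot(dy,  main = "diff(log(jj))")            # trend removed
plot(ddy, main = "diff(diff(log(jj)), 4)")   # seasonal pattern removed too
```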
This is a common convention. Then we look at the p-value, and the p-value is very, very small. If the p-value is small, we reject the null hypothesis that there is no autocorrelation at the previous lags; so there is some autocorrelation, and we're going to find it using the ACF and PACF. So let's look at them: this is the ACF of the resulting data, and this is the PACF of the resulting data. In the ACF, if I look at the low lags, I have a spike at lag 1 and then it dies off, so this suggests an MA(1) model: the order of the moving average term would be one. If I look at the seasonal lags, the first one is at lag four (seasonal period one, so lag four). It is almost significant: not truly significant, because it stays below and does not cross the dashed line, but it's close, so we'll entertain it. We might have some seasonal correlation, and this suggests maybe an order-1 seasonal moving average term. If I look at the PACF, there's a significant spike at lag 1, and then it dies off; this suggests that maybe the order of the autoregressive term is one. I also see a significant autocorrelation at lag four, which suggests that maybe the order of the seasonal autoregressive term is one. After that, the autocorrelation dies off. Okay, so the ACF told us that q is either 0 or 1, and we'll look at both, and likewise capital Q is 0 or 1. The partial autocorrelation told us that p is maybe 0 or 1, and capital P is 0 or 1. So we'll look at SARIMA(p, 1, q) x (P, 1, Q)_4 models, where 4 is the span of the seasonality. These are models for the logarithm of the data, and we've just determined that p, q, capital P, and capital Q will each be either zero or one. We're going to use the arima routine in R: we give it the order for the non-seasonal part, and then the seasonal part including the period.
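The Ljung-Box test and a single candidate fit can be written as follows. This is a sketch under my naming (`ddy`, `fit`); the lag choice follows the log-of-sample-size convention mentioned above:

```r
library(astsa)
ddy <- diff(diff(log(jj)), lag = 4)

# Ljung-Box test: small p-value => autocorrelation remains at previous lags
Box.test(ddy, lag = round(log(length(ddy))), type = "Ljung-Box")

# One candidate model: SARIMA(0,1,1)x(0,1,1)_4 fitted to the log data
fit <- arima(log(jj), order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 4))
```

The `order` argument is the non-seasonal (p, d, q) and the `seasonal` list carries (P, D, Q) together with the period, exactly the two parts described in the lecture.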
We carry this out for all the possible values of p, q, P, and Q, and we print the results. The first six numbers in each row are the orders p, d, q, P, D, Q; next comes s, which is 4 for all of them. Then we look at the AIC values, the Akaike Information Criterion. We also look at the sum of squared errors, the SSE, and at the p-value of the Ljung-Box test on the residuals, because we do not want autocorrelation left in the residuals. So we want a high p-value, and we want the smallest AIC and the smallest SSE. Our principle is to choose the smallest AIC, and I've highlighted it here: the AIC is -150.9134, even though the smallest SSE in this output is here, which corresponds to a different model. We're going to go with the smallest AIC; because it is negative, this is the smallest AIC. So the model we'll agree on is basically (0, 1, 1) x (1, 1, 0)_4. And you can see that its Ljung-Box p-value is high, so we cannot reject the null hypothesis that there is no autocorrelation left in the residuals. So our model is SARIMA(0,1,1)x(1,1,0)_4. Remember, Xt is our earnings, but the model we found is for Yt: we transformed Xt, and the logarithm of Xt is called Yt. We fit the SARIMA model using the arima routine we discussed and obtain the following result. Here ma1 labels the coefficient of the order-1 moving average term, and the other entry is the coefficient of the seasonal autoregressive term; below them are the standard errors, and the p-values are so small that both coefficients are very significant. We could also use the sarima routine from the astsa package: instead of calling arima ourselves, sarima takes the logarithm of the data (this is the transformation), then the parameters of our model, and at the end the period. That gives us the results we obtained, along with this residual analysis.
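The comparison table described above can be produced with a loop like the following. This is my reconstruction of the procedure, not the lecture's exact code; the lag choice and variable names are assumptions:

```r
# Grid search over p, q, P, Q in {0, 1}, with d = D = 1 and s = 4
library(astsa)
y <- log(jj)
for (p in 0:1) for (q in 0:1) for (P in 0:1) for (Q in 0:1) {
  fit <- arima(y, order = c(p, 1, q),
               seasonal = list(order = c(P, 1, Q), period = 4))
  sse <- sum(fit$residuals^2)                       # sum of squared errors
  lb  <- Box.test(fit$residuals,
                  lag = round(log(length(y))),
                  type = "Ljung-Box")$p.value       # want this to be high
  cat(p, 1, q, P, 1, Q, 4,
      "AIC:", fit$aic, "SSE:", sse, "p-value:", lb, "\n")
}
```

Each printed row matches the lecture's output format: the seven orders, then AIC, SSE, and the Ljung-Box p-value of the residuals.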
This is the time plot of the residuals: no evidence that this is not white noise. In the Q-Q plot we see almost a straight line, so the residuals are approximately normal. We have no significant autocorrelation coefficients, and all the p-values of the Ljung-Box statistics tell me there is no autocorrelation left in the residuals. Okay, so if I want to write the model out, remember that Xt is our earnings, Yt is the logarithm of Xt, and Yt follows the SARIMA model. We have (1 - B), the non-seasonal differencing, and (1 - B^4), the seasonal differencing, where four is the span of the seasonality. We also have (1 - Phi B^4), because I have a seasonal autoregressive term; there is no non-seasonal autoregressive term. On the right-hand side I have my noise multiplied by the polynomial coming from the non-seasonal moving average. If I expand these, Yt becomes the following. In the fitted output, the first number is my theta, the moving average coefficient, and the second number is my Phi, the seasonal autoregressive coefficient. If I plug them in, I obtain the following model. This is my fitted model for the logarithm of the earnings, where Yt is the logarithm of Xt, and Zt, my noise, is normal with expectation 0 and variance 0.0079. We can also forecast at this point. We can use the forecast function from the forecast package: if I write plot(forecast(model)) for the model I've specified, we obtain the forecast for the next two cycles, basically the next two years, 1981 and 1982. If I type forecast(model) into R, it prints the point estimates together with the 80% and 95% prediction intervals for my forecast over those next two years, 1981 and 1982. So what have we learned? We have learned how to fit SARIMA models to the quarterly earnings of Johnson & Johnson shares, and we have also learned how to forecast future values of a time series.
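The forecasting step can be sketched in R as follows (the variable name `model` is mine; `h = 8` requests eight quarters, i.e., the two years 1981-1982):

```r
# Fit the chosen SARIMA(0,1,1)x(1,1,0)_4 model and forecast two years ahead
library(astsa)
library(forecast)

model <- arima(log(jj), order = c(0, 1, 1),
               seasonal = list(order = c(1, 1, 0), period = 4))

plot(forecast(model, h = 8))   # point forecasts with 80% and 95% intervals
forecast(model, h = 8)         # printed point estimates and intervals
```

Note that these forecasts are on the log scale, since the model was fitted to log(jj); exponentiating the point forecasts maps them back toward the original earnings scale.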