R语言时间序列分析小例

复习心烦，偶遇大作业，故摸鱼

作业题目

自由选取一组数据(可以是R 自带的数据集、或者其它来源，鼓励选取一些有趣的课题进行数据分析)，利用我们这学期所学知识建立恰当模型（ARIMA、GARCH 等），小组成员（不超过三人一组, 自由组队）中的一人安排 5 至 10 分钟左右的PPT 课堂展示（教学周第16 周、无需展示代码，仅汇报初步结果即可，我给出点评后再完善后提交）。内容基本要求如下:

• 说明数据来源背景, 需要分析解决的问题;

• 包含完整的建模过程(拟合、估计、诊断检验、预测)；

• 结论或者建议(依赖于你拟解决的问题)。

注意！！！数据量的选取需要适中, 留取部分原始数据作为评估模型预测能力使用。

此部分对最后成绩的贡献比例为20%。

Background of Data Set

A multivariate yearly time series from 1960 to 1979 with variables (Included in dataset TSA)

Test set and fit set

Select the first 60 data for fitting , while the last 12 (one year) as the test data to test the final model we struct (cut the data frame by using R function window( ))

Data preprocessing

Observe the image directly and there is an obvious logarithmic upward trend. Carry out differential processing on dataset——wages to eliminate the upward trend,

test the unit root(DK test), the result are as followings:

> adf.test(model)Augmented Dickey-Fuller Testdata: modelDickey-Fuller = -3.6536, Lag order = 3, p-value = 0.03639alternative hypothesis: stationary

We are not satisfied with the ppp value. In order to avoid excessive difference, we try to use Box-Cox transformation. From the results, it can be seen that the value of λ\lambdaλ is 1. Although the value of ppp value for significance 0.01 is not good, it can only reduce the significance requirement (α\alphaα = 0.05), rather than searching complex alternatives.

We finally preprocess the dataset by difference : model = diff (log(waves)). From the figure below, we can see that the peaks and troughs are distributed regularly. Combined with the actual background and common sense of life of the data set, workers’ salary will increase at the end of the year. After all, everyone wants to run. Workers’ salary should be fluctuated in an annual cycle. We can see from the image of data after difference. The salary at the end of each year is relatively high, but there are three hidden dangers:

① the periodicity is not very obvious. Although it has practical significance, for the size of sample is so small.

② Although it is expected that the salary at the end of middle age should be high, the data in 1986 conflicts with this statement, which is likely to be related to the social background at that time, and this information is unknown to us. The prediction work we do later may not be so good

③ It can also be seen from the figure that the growth rate of wages has slowed down (because we have just done differential processing, in fact, wages have been rising all the time)

Choose Model

Calculate the ACF and PACF of the model. In fact, the diagrams of ACF and PACF can’t put forward feasible suggestions for us to choose ppp and qqq, hence it’s not necessary to calculate eacf. The results given must not be so good. We try to make a seasonal difference to the model.

After seasonal difference, it is also difficult to select appropriate models according to ACF and PACF images,

This model is a little troublesome, for model ARIMA(p,d,q)×(P,D,Q)s\rm ARIMA(p,d,q)\times(P,D,Q)_sARIMA(p,d,q)×(P,D,Q)s，now we juat know the value of ddd and DDD ARIMA(p,1,q)×(P,1,Q)11ARIMA(p,1,q)\times(P,1,Q)_{11}ARIMA(p,1,q)×(P,1,Q)11, Next we select the optimal subset based on the BIC criterion. The following two figures respectively select the optimal subset from fit_set and model2.

We consider the models：ARIMA(0,1,0)×(0,1,1)12ARIMA(0,1,0)\times(0,1,1)_{12}ARIMA(0,1,0)×(0,1,1)12,ARIMA(0,1,0)×(1,1,1)12ARIMA(0,1,0)\times(1,1,1)_{12}ARIMA(0,1,0)×(1,1,1)12,ARIMA(0,1,1)×(0,1,0)12ARIMA(0,1,1)\times(0,1,0)_{12}ARIMA(0,1,1)×(0,1,0)12,

ARIMA(0,1,0)×(1,1,0)12ARIMA(0,1,0)\times(1,1,0)_{12}ARIMA(0,1,0)×(1,1,0)12,ARIMA(1,1,0)×(0,1,0)12ARIMA(1,1,0)\times(0,1,0)_{12}ARIMA(1,1,0)×(0,1,0)12,ARIMA(1,1,0)×(1,1,1)12ARIMA(1,1,0)\times(1,1,1)_{12}ARIMA(1,1,0)×(1,1,1)12

Parameter estimation

a1=arima(fit_set,order=c(0,1,0),seasonal = list(order=c(0,1,1),period = 12))a2=arima(fit_set,order=c(0,1,0),seasonal = list(order=c(1,1,1),period = 12))a3=arima(fit_set,order=c(0,1,1),seasonal = list(order=c(0,1,0),period = 12))a4=arima(fit_set,order=c(0,1,0),seasonal = list(order=c(1,1,0),period = 12))a5=arima(fit_set,order=c(1,1,0),seasonal = list(order=c(0,1,0),period = 12))a6=arima(fit_set,order=c(1,1,0),seasonal = list(order=c(1,1,1),period = 12))

Model diagnosis

Diagnose the six models by R function tsdiag( ). During the diagnosing, it was found that the Ljung box test of model a1 was not good, one point was below the critical line and one point was very close to the critical line; The residual autocorrelation diagram of model a3 and model a5 is also not good, there’re some multiple outliers, so we exclude model a1, model a3 and model a5 first. There are still model a2, model a4 and model a6 models left. From the results of parameter estimation and model diagnosis, these models have no obvious advantages or disadvantages. However, based on the modeling principle of “simple but not simpler”, we exclude model a6 and leave model a2 and model a4 models for final prediction. The following are the results of testing a2 and a4 model.

Model prediction

a2=stats::arima(fit_set,order=c(0,1,0),seasonal = list(order=c(1,1,1),period = 12))a2_fore=forecast(a2,h=11,level = c(95))plot(a2_fore)lines(a2_fore$fitted,col="black")lines(wages,col="red")

a4=stats::arima(fit_set,order=c(0,1,0),seasonal = list(order=c(1,1,0),period = 12))a4_fore=forecast(a4,h=11,level = c(95))plot(a4_fore)lines(a4_fore$fitted,col="black")lines(wages,col="red")

Conclusion

We can see the prediction is not successful, for each predictive value is a little higher than the true value. It may be caused by the logarithmic upward trend of the data. Differential processing is suitable for linear trends, but not for logarithmic trend. A possible solution is that we can add a approxiate penalization on t, in order to slow down the upward trend.

Code

library(forecast)library(TSA)library(xts)library(tseries)data(wages)help(wages)plot(wages)# 分组fit_set=window(wages,start=c(1981,7),end=c(1986,6))test_set=window(wages,start=c(1986,7),end=c(1987,6))# # 确定趋势# vec_wage=as.vector(wages)# newwage=data.frame(log_wage=log(vec_wage),day=c(1:length(vec_wage)))# fit=lm(newwage$log_wage~newwage$day,data=newwage)# summary(fit)# rm(fit)# 一次差分model=diff(fit_set)plot(model,ylab="model")abline(h=0)abline(v=c(1982:1986),lty=2)adf.test(model)# 尝试Box-Cox变换b=BoxCox.lambda(model)lambda=which.max(b)b=BoxCox.ar(model-min(model)*1.1)# 模型识别par(mfrow=c(1,2))acf(as.vector(model),lag.max=60,main="model")pacf(as.vector(model),lag.max = 60,main="model")# 考虑季节模型model2=diff(model,lag = 12)acf(as.vector(model2),lag.max=60,main="model2")pacf(as.vector(model2),lag.max = 60,main="model2")par(mfrow=c(1,1))# 模型识别par(mfrow=c(1,2))dis=armasubsets(y=fit_set,nar=13,nma=13,y.name="test",ar.method="ols")plot(dis)dis=armasubsets(y=model2,nar=4,nma=4,y.name="diff(1)",ar.method="ols")plot(dis)# 模型拟合a1=arima(fit_set,order=c(0,1,0),seasonal = list(order=c(0,1,1),period = 12))a2=arima(fit_set,order=c(0,1,0),seasonal = list(order=c(1,1,1),period = 12))a3=arima(fit_set,order=c(0,1,1),seasonal = list(order=c(0,1,0),period = 12))a4=arima(fit_set,order=c(0,1,0),seasonal = list(order=c(1,1,0),period = 12))a5=arima(fit_set,order=c(1,1,0),seasonal = list(order=c(0,1,0),period = 12))a6=arima(fit_set,order=c(1,1,0),seasonal = list(order=c(1,1,1),period = 12))# 模型诊断tsdiag(a1)tsdiag(a2)tsdiag(a3)tsdiag(a4)tsdiag(a5)tsdiag(a6)# 模型预测a2=stats::arima(fit_set,order=c(0,1,0),seasonal = list(order=c(1,1,1),period = 12))a2_fore=forecast(a2,h=11,level = c(95))plot(a2_fore)lines(a2_fore$fitted,col="black")lines(wages,col="red")a4=stats::arima(fit_set,order=c(0,1,0),seasonal = list(order=c(1,1,0),period = 12))a4_fore=forecast(a4,h=11,level = c(95))plot(a4_fore)lines(a4_fore$fitted,col="black")lines(wages,col="red")