Summary: ARIMA, which stands for Autoregressive Integrated Moving Average, is a class of models for forecasting both seasonal and nonseasonal time series.
STEP 1: TIME SERIES DECOMPOSITION PLOT A time series decomposition plot allows you to observe the seasonality, trend, and error/remainder terms of a time series.
Useful Alteryx tool: TS Plot
STEP 2: DETERMINE THE ARIMA TERMS Nonseasonal ARIMA models are written with the terms (p,d,q), where p is the number of autoregressive lags, d is the number of differences needed to make the data stationary, and q is the number of lagged error (moving average) terms.
Stationary - mean and variance are constant over time. Non-Stationary - mean and variance change over time.
Differencing - take the value in the current period and subtract the value from the previous period. You may have to difference more than once to make the data stationary. This is the Integrated component, the d in the model terms.
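As a minimal sketch of differencing in R (using base R's built-in AirPassengers series as stand-in data, since the example data appears later):

```r
# First difference: x[t] - x[t-1]; removes a linear trend
d1 <- diff(AirPassengers)

# Second difference (d = 2): difference the differenced series again
d2 <- diff(AirPassengers, differences = 2)
```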
Autocorrelation - how correlated a time series is with its past values. A positive autocorrelation at Lag-1 suggests an AR term; a negative one suggests an MA term.
Partial Autocorrelation - the correlation between two variables, controlling for the values of another set of variables. If the partial autocorrelation drops off quickly, use AR terms; if it decays slowly, use MA terms. Seasonal ARIMA models are denoted (p,d,q)(P,D,Q)m, where m is the number of periods per season.
These models may require seasonal differencing in addition to non-seasonal differencing. Seasonal differencing subtracts the value from one season earlier (e.g., the same month a year ago) from the current value.
Useful Alteryx tool: TS Plot
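A seasonal difference is simply a difference at the seasonal lag. A sketch in R, again using the monthly AirPassengers series (m = 12):

```r
# Seasonal difference for monthly data: x[t] - x[t-12]
seas <- diff(AirPassengers, lag = 12)

# If a trend remains, add a regular first difference on top
seas_plus_first <- diff(diff(AirPassengers, lag = 12))
```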
STEP 3: BUILD AND VALIDATE THE ARIMA MODEL Build the ARIMA model using the terms determined in step 2. You can use internal and external validation to validate the quality of the model.
Internal validation: Look at in-sample error measures, particularly RMSE (Root-Mean-Square Error) and MASE (Mean Absolute Scaled Error).
External validation: Determine the accuracy measures by comparing the forecasted values with the holdout sample. This is especially important for comparing ARIMA models to other types of models, such as ETS.
Pick the ARIMA model with the lowest AIC value. If the AIC values are comparable, use the calculated error measures to pick the one that minimizes error the most. Many software tools automate model selection by minimizing AIC.
Useful Alteryx tools: ARIMA, TS Compare
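In R, candidate models can be compared by AIC directly. A sketch using the forecast package, where ts_data is a placeholder for your series:

```r
library(forecast)

# Fit two candidate models (ts_data is a hypothetical series name)
fit1 <- Arima(ts_data, order = c(1, 1, 0))
fit2 <- Arima(ts_data, order = c(0, 1, 1))

# Lower AIC is preferred; AICc and BIC are also reported by summary()
AIC(fit1)
AIC(fit2)
```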
STEP 4: FORECAST! Use the best ARIMA model to forecast for the desired time period. Make sure to add the holdout sample back into the model. Plot the results along with the 80% and 95% prediction intervals.
Useful Alteryx tool: TS Forecast
For this example, I’ll load a .csv file into a dataframe. The dataframe will need to be converted to a time series. Then we can plot the time series to get a first look, which gives an immediate and intuitive view of potential seasonality and trend.
library(xts)       # time series objects
library(dygraphs)  # interactive plots

# Read the raw data, then index it by the date column
step_data <- read.csv("fb_steps.csv", header = TRUE)
step_ts <- xts(step_data[, -1], order.by = as.Date(step_data[, 1], "%m/%d/%Y"), frequency = 7)

dygraph(step_ts, main = "Daily Steps") %>%
  dyRangeSelector()
Ensure a constant mean, variance, and covariance of the nth term:
- Detrending / differencing
- Dickey-Fuller test of stationarity
- This data has a weekly trend, so we will difference vs. the 7-day lag
library(tseries)   # adf.test
library(forecast)  # Acf, Pacf, auto.arima

# Log-transform to stabilize variance, then difference at the weekly lag
single_diff <- diff(log(step_ts), lag = 7)

dygraph(single_diff, main = "Daily Steps - differenced") %>%
  dyRangeSelector()
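The Dickey-Fuller test mentioned above can be run with tseries::adf.test. A sketch on the differenced series; the null hypothesis is non-stationarity, so a small p-value supports stationarity:

```r
library(tseries)

# Augmented Dickey-Fuller test; adf.test() cannot handle NA values,
# so drop the NAs introduced by differencing first
adf.test(na.omit(as.numeric(single_diff)))
```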
Determine the optimal parameters: is this an AR or MA series?
ACF and PACF plots of the non-stationary series:
Acf(step_ts)
Pacf(step_ts)
Plots of the single-differenced series:
ggAcf(single_diff)
ggPacf(single_diff)
Positive correlation at lag 1 suggests an AR term; one difference was required (d = 1); the gradual decay in the PACF suggests an MA component.
# Fit a seasonal ARIMA(1,1,0)(1,1,0)[7] using the terms identified above
fit <- arima(step_ts, order = c(1, 1, 0), seasonal = list(order = c(1, 1, 0), period = 7))
pred <- predict(fit, n.ahead = 30)
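predict() returns point forecasts and standard errors; the 80% and 95% intervals mentioned in Step 4 can be approximated from them. A sketch using the normal quantiles:

```r
# Normal-approximation prediction intervals from predict()'s standard errors
lo_95 <- pred$pred - 1.96 * pred$se
hi_95 <- pred$pred + 1.96 * pred$se
lo_80 <- pred$pred - 1.28 * pred$se
hi_80 <- pred$pred + 1.28 * pred$se
```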
auto.arima() returns the best ARIMA model according to either the AIC, AICc, or BIC value. The function conducts a search over possible models within the order constraints provided.
auto_pred <- auto.arima(step_ts)
# pred <- predict(auto_pred, n.ahead = 14)
# auto_pred
# plot(forecast(auto_pred))
# Use summary() to get the model fit by the forecast package
pred_df <- summary(forecast(auto_pred, h=10))
##
## Forecast method: ARIMA(1,0,2) with non-zero mean
##
## Model Information:
## Series: step_ts
## ARIMA(1,0,2) with non-zero mean
##
## Coefficients:
## ar1 ma1 ma2 mean
## 0.9471 -0.7398 -0.0584 8502.4318
## s.e. 0.0141 0.0294 0.0261 258.5981
##
## sigma^2 estimated as 7037288: log likelihood=-13933.07
## AIC=27876.14 AICc=27876.18 BIC=27902.7
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -2.307862 2649.245 1956.252 -9.294898 25.11213 0.8266592
## ACF1
## Training set 0.001190209
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1499 7671.147 4271.462 11070.83 2471.777 12870.52
## 1500 7909.556 4437.645 11381.47 2599.727 13219.38
## 1501 7940.941 4437.522 11444.36 2582.925 13298.96
## 1502 7970.665 4439.225 11502.11 2569.793 13371.54
## 1503 7998.816 4442.430 11555.20 2559.793 13437.84
## 1504 8025.477 4446.864 11604.09 2552.462 13498.49
## 1505 8050.726 4452.295 11649.16 2547.401 13554.05
## 1506 8074.638 4458.524 11690.75 2544.269 13605.01
## 1507 8097.285 4465.383 11729.19 2542.771 13651.80
## 1508 8118.732 4472.729 11764.74 2542.651 13694.81
# The forecast has h = 10 rows, so pair it with 10 dates
pred_df <- data.frame(cbind(c('2016-06-25','2016-06-26','2016-06-27','2016-06-28',
                              '2016-06-29','2016-06-30','2016-07-01','2016-07-02',
                              '2016-07-03','2016-07-04'), pred_df$`Point Forecast`))
# Convert into a time-series class; coerce the forecasts back to numeric
pred_dfTs <- xts(as.numeric(pred_df[,-1]), order.by = as.Date(pred_df[,1]), frequency = 7)
#plot
dygraph(pred_dfTs, main = "Daily Steps - AutoArima") %>%
dyRangeSelector()
combined <- cbind(forecast = pred_dfTs, actual = step_ts)
dygraph(combined, main = "Daily Steps - AutoArima") %>%
dyRangeSelector()