9  Time Series Analysis and Forecasting of Air Passengers

Author

Donnie

9.1 Abstract

This report presents a comprehensive and mathematically detailed time series analysis…

9.2 1. Introduction

Forecasting monthly airline passenger volumes is essential for planning, scheduling, and resource allocation. This report demonstrates a complete, mathematically rigorous workflow following the SORS6102 project brief.

The objectives include:

  1. Data description and preparation
  2. Exploratory analysis (trend, seasonality, variance)
  3. Stationarity testing
  4. Model identification (ACF/PACF, differencing)
  5. ARIMA/SARIMA estimation
  6. Residual diagnostics
  7. Forecasting (with back-transformation)
  8. Evaluation of accuracy
  9. Interpretation and recommendations

9.3 2. Data and Setup

library(forecast)
library(tseries)
library(ggplot2)
library(lmtest)
library(kableExtra)
library(dplyr)

data("AirPassengers")
ts_data <- AirPassengers
# Required libraries
library(ggplot2)
library(viridis)       # colorblind-friendly palette
library(extrafont)     # for Times New Roman fonts

# If running for the first time (uncomment)
# extrafont::font_import()
# extrafont::loadfonts(device="pdf")

# Load data
data("AirPassengers")
ts_data <- AirPassengers

The dataset consists of 144 monthly observations from Jan 1949 – Dec 1960, with frequency 12.

A raw plot shows increasing variance and multiplicative seasonality:

autoplot(ts_data) +
  ggtitle("Monthly AirPassengers") +
  ylab("Passengers")

library(ggplot2)
library(viridis)

data("AirPassengers")
ts_data <- AirPassengers

ggplot(
    data = NULL,
    aes(x = as.numeric(time(ts_data)),
        y = as.numeric(ts_data))
  ) +
  geom_line(color = viridis(1, option = "C"), size = 1.1) +
  theme_minimal(base_family = "serif") +
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    plot.background  = element_rect(fill = "white", color = NA),
    axis.title = element_text(face = "bold", size = 14),
    axis.text  = element_text(size = 12),
    axis.line  = element_line(color = "black", linewidth = 0.4),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray85"),
    legend.position = "none",
    plot.margin = margin(10,10,10,10)
  ) +
  labs(
    x = "Year",
    y = "Number of Passengers"
  )


9.4 3. Transformation

Variance increases with level, so we apply a log transform:

\[ Y_t = \log(X_t) \]

log_ts <- log(ts_data)
autoplot(log_ts)

Or

log_ts <- log(ts_data)

autoplot(log_ts) +
  geom_line(color = viridis(1, option = "C"), linewidth = 1.1) +
  theme_minimal(base_family = "serif") +
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    plot.background  = element_rect(fill = "white", color = NA),
    axis.title = element_text(face = "bold", size = 14),
    axis.text  = element_text(size = 12),
    axis.line  = element_line(color = "black", linewidth = 0.4),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray85"),
    legend.position = "none",
    plot.margin = margin(10,10,10,10)
  ) +
  labs(
    x = "Year",
    y = "Log(Passengers)"
  )


9.5 4. STL Decomposition

stl_fit <- stl(log_ts, s.window="periodic")
autoplot(stl_fit)

Or

stl_fit <- stl(log_ts, s.window = "periodic")

autoplot(stl_fit) +
  scale_color_viridis_d(option = "C") +
  scale_fill_viridis_d(option = "C") +
  theme_minimal(base_family = "serif") +
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    plot.background  = element_rect(fill = "white", color = NA),
    strip.background = element_rect(fill = "white"),
    strip.text       = element_text(face = "bold", size = 13, family = "serif"),
    axis.title       = element_text(face = "bold", size = 14),
    axis.text        = element_text(size = 12),
    axis.line        = element_line(color = "black", linewidth = 0.4),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray85"),
    legend.position  = "none",
    plot.margin      = margin(10,10,10,10)
  ) +
  labs(
    x = "Year",
    y = "Value"
  )

Trend and seasonal components are clearly visible.


9.6 5. Stationarity and Differencing

ADF test on the logged data:

adf.test(log_ts)

    Augmented Dickey-Fuller Test

data:  log_ts
Dickey-Fuller = -6.4215, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary

Not stationary → apply first and seasonal differencing:

\[ \nabla (1 - B^{12}) Y_t = (1 - B)(1 - B^{12}) Y_t \]

d1s <- diff(diff(log_ts), lag=12)
adf.test(d1s)

    Augmented Dickey-Fuller Test

data:  d1s
Dickey-Fuller = -5.1993, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary

9.7 6. ACF and PACF Analysis

To identify ARIMA orders, we examine ACF and PACF of the differenced series:

par(mfrow=c(1,2))
Acf(d1s, main="ACF of Differenced Series")
Pacf(d1s, main="PACF of Differenced Series")

par(mfrow=c(1,1))

Or

# ACF plot
p_acf <- ggAcf(d1s) +
  scale_color_viridis_d(option = "C") +
  theme_minimal(base_family = "serif") +
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    plot.background  = element_rect(fill = "white", color = NA),
    axis.title = element_text(face = "bold", size = 14),
    axis.text  = element_text(size = 12),
    plot.title = element_text(face = "bold", size = 14),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray85")
  ) +
  labs(title = "ACF of Differenced Series")

# PACF plot
p_pacf <- ggPacf(d1s) +
  scale_color_viridis_d(option = "C") +
  theme_minimal(base_family = "serif") +
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    plot.background  = element_rect(fill = "white", color = NA),
    axis.title = element_text(face = "bold", size = 14),
    axis.text  = element_text(size = 12),
    plot.title = element_text(face = "bold", size = 14),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray85")
  ) +
  labs(title = "PACF of Differenced Series")

# Display side-by-side
library(gridExtra)
grid.arrange(p_acf, p_pacf, ncol = 2)

Interpretation:

  • Seasonal spikes at lag 12 suggest seasonal differencing was needed.
  • ACF tailing + PACF significant at lag 1 suggests an MA(1) component.
  • Seasonal MA(1) at lag 12 is also likely.

These patterns support the Airline Model:

\[ \text{ARIMA}(0,1,1)(0,1,1)_{12} \]


9.8 7. Model Estimation

9.8.1 7.1 Fit the canonical Airline Model

fit_airline <- Arima(log_ts, order=c(0,1,1), seasonal=c(0,1,1))
summary(fit_airline)
Series: log_ts 
ARIMA(0,1,1)(0,1,1)[12] 

Coefficients:
          ma1     sma1
      -0.4018  -0.5569
s.e.   0.0896   0.0731

sigma^2 = 0.001371:  log likelihood = 244.7
AIC=-483.4   AICc=-483.21   BIC=-474.77

Training set error measures:
                       ME       RMSE        MAE        MPE      MAPE      MASE
Training set 0.0005730622 0.03504883 0.02626034 0.01098898 0.4752815 0.2169522
                   ACF1
Training set 0.01443892

Model equations:

\[ (1 - B)(1 - B^{12})Y_t = (1 + \theta_1 B)(1 + \Theta_1 B^{12}) \varepsilon_t \]

Where:

  • \(\theta_1\) = non-seasonal MA(1)
  • \(\Theta_1\) = seasonal MA(1)

9.9 7.2 Fit alternative SARIMA models

fit_alt1 <- Arima(log_ts, order=c(2,1,1), seasonal=c(0,1,1))
fit_alt2 <- Arima(log_ts, order=c(0,1,1), seasonal=c(1,1,1))

9.10 7.3 Model Comparison

models <- list(
  Airline = fit_airline,
  Alt1 = fit_alt1,
  Alt2 = fit_alt2
)

model_table <- data.frame(
  Model = names(models),
  AIC  = sapply(models, AIC),
  # AICc = sapply(models, forecast::AICc),
  BIC  = sapply(models, BIC)
)

kable(model_table, caption="Model Selection Criteria") %>% kable_styling(full_width=FALSE)
Model Selection Criteria
Model AIC BIC
Airline Airline -483.3991 -474.7735
Alt1 Alt1 -482.2723 -467.8963
Alt2 Alt2 -481.9131 -470.4123
library(forecast)

fit_alt1 <- Arima(log_ts, order=c(2,1,1), seasonal=c(0,1,1))
fit_alt2 <- Arima(log_ts, order=c(0,1,1), seasonal=c(1,1,1))

models <- list(
  Airline = fit_airline,
  Alt1 = fit_alt1,
  Alt2 = fit_alt2
)

model_table <- data.frame(
  Model = names(models),
  AIC  = sapply(models, AIC),
  # AICc = sapply(models, forecast::AICc),
  BIC  = sapply(models, BIC)
)

model_table
          Model       AIC       BIC
Airline Airline -483.3991 -474.7735
Alt1       Alt1 -482.2723 -467.8963
Alt2       Alt2 -481.9131 -470.4123

Conclusion:
The Airline Model often achieves the lowest AICc and is preferred for this dataset.


9.11 8. Residual Diagnostics

# Residuals
res <- residuals(fit_airline)

# --- 1. Residual Time Series Plot ---
p1 <- autoplot(res) +
  geom_line(color = viridis(1, option = "C"), linewidth = 1) +
  theme_minimal(base_family = "serif") +
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    plot.background = element_rect(fill = "white", color = NA),
    axis.title = element_text(face = "bold", size = 14),
    axis.text = element_text(size = 12),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray85")
  ) +
  labs(x = "Time", y = "Residuals")

# --- 2. ACF Plot ---
p2 <- ggAcf(res) +
  theme_minimal(base_family = "serif") +
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    plot.background = element_rect(fill = "white", color = NA),
    axis.title = element_text(face = "bold", size = 14),
    axis.text = element_text(size = 12),
    plot.title = element_text(face = "bold", size = 14),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray85")
  ) +
  labs(title = "ACF of Residuals")

# --- 3. Histogram with Density ---
p3 <- ggplot(data.frame(res), aes(x = res)) +
  geom_histogram(aes(y = ..density..),
                 bins = 20,
                 fill = viridis(1, option="C"), 
                 color = "white",
                 alpha = 0.9) +
  geom_density(color = "black", linewidth = 1) +
  theme_minimal(base_family = "serif") +
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    plot.background  = element_rect(fill = "white", color = NA),
    axis.title = element_text(face = "bold", size = 14),
    axis.text  = element_text(size = 12),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray85")
  ) +
  labs(x = "Residuals", y = "Density")

# Arrange all three
library(gridExtra)
grid.arrange(p1, p2, p3, ncol = 3)

# Print Ljung-Box Test:
Box.test(res, lag = 24, type = "Ljung-Box", fitdf = 2)

    Box-Ljung test

data:  res
X-squared = 26.446, df = 22, p-value = 0.233

9.11.1 Requirements for a valid model:

  1. Residuals ≈ white noise
  2. Ljung–Box p-value > 0.05
  3. No seasonal structure in residuals
  4. Approximate normality
Box.test(residuals(fit_airline), lag=24, type="Ljung-Box", fitdf=2)

    Box-Ljung test

data:  residuals(fit_airline)
X-squared = 26.446, df = 22, p-value = 0.233

If p-value > 0.05, no remaining autocorrelation.


9.12 9. Forecasting

We forecast in log scale, then back-transform.

# Forecast (already in log scale)
fcast <- forecast(fit_airline, h = 24, level = c(80, 95))

# Convert autoplot output into ggplot and restyle it
autoplot(fcast) +
  scale_color_viridis_d(option = "C") +
  scale_fill_viridis_d(option = "C", alpha = 0.6) +
  theme_minimal(base_family = "serif") +
  theme(
    panel.background  = element_rect(fill = "white", color = NA),
    plot.background   = element_rect(fill = "white", color = NA),
    axis.title        = element_text(face = "bold", size = 14),
    axis.text         = element_text(size = 12),
    axis.line         = element_line(color = "black", linewidth = 0.4),
    panel.grid.minor  = element_blank(),
    panel.grid.major  = element_line(color = "gray85"),
    legend.position   = "none",
    plot.margin       = margin(10,10,10,10)
  ) +
  labs(
    x = "Year",
    y = "Log(Passengers)"
  )


9.13 10. Back-Transformation to Original Scale

Given:

\[ \hat{X}_t = \exp(\hat{Y}_t) \]

Bias-adjusted mean forecast:

\[ E[X_t] = \exp\left(\hat{Y}_t + \frac{\sigma^2}{2}\right) \]


9.14 11. Train-Test Evaluation

Take last 24 months as the holdout set:

h <- 24
train_ts <- window(log_ts, end=c(1958,12))
test_ts  <- window(log_ts, start=c(1959,1))

fit_train <- Arima(train_ts, order=c(0,1,1), seasonal=c(0,1,1))
fcast_train <- forecast(fit_train, h=h)

Back-transform:

pred <- exp(fcast_train$mean)
actual <- exp(test_ts)

Compute error metrics:

MAE <- mean(abs(pred - actual))
RMSE <- sqrt(mean((pred - actual)^2))
MAPE <- mean(abs((pred - actual)/actual)) * 100

data.frame(MAE, RMSE, MAPE)
       MAE     RMSE     MAPE
1 39.44726 43.18367 8.516316

9.15 12. Forecast Plots on Original Scale

We now compare forecasts with actual observations (last 24 months):

# Reconstruct training and test data on original scale
train_orig <- ts(exp(train_ts), start = start(train_ts), frequency = 12)
test_orig  <- ts(exp(test_ts),  start = start(test_ts),  frequency = 12)
pred_orig  <- ts(pred,          start = start(test_ts),  frequency = 12)

# Plot
ggplot() +
  geom_line(aes(x = time(train_orig), y = train_orig, 
                color = "Training"), linewidth = 1.1) +
  geom_line(aes(x = time(test_orig), y = test_orig,  
                color = "Actual Test"), linewidth = 1.1) +
  geom_line(aes(x = time(pred_orig), y = pred_orig,  
                color = "Forecast"), linewidth = 1.1) +

  scale_color_viridis_d(option = "C") +

  theme_minimal(base_family = "serif") +
  theme(
    panel.background  = element_rect(fill = "white", color = NA),
    plot.background   = element_rect(fill = "white", color = NA),
    axis.title        = element_text(face = "bold", size = 14),
    axis.text         = element_text(size = 12),
    axis.line         = element_line(color = "black", linewidth = 0.4),
    legend.title      = element_blank(),
    legend.position   = "bottom",
    panel.grid.minor  = element_blank(),
    panel.grid.major  = element_line(color = "gray85"),
    plot.margin       = margin(10,10,10,10)
  ) +
  labs(
    x = "Year",
    y = "Passengers"
  )

The model tracks the upward trend and seasonal pattern well.


9.16 13. Mathematical Appendix

9.16.1 13.1 SARIMA Model Structure

A general multiplicative SARIMA model is written as:

\[ \Phi_P(B^s)\phi(B) \nabla^d \nabla_s^D X_t = \Theta_Q(B^s)\theta(B)\varepsilon_t, \]

where:

  • \(B\) is the backshift operator
  • \(\nabla = 1-B\) is the differencing operator
  • \(\nabla_s = 1-B^s\) is seasonal differencing
  • \(\phi(B)\) and \(\theta(B)\) are nonseasonal AR/MA polynomials
  • \(\Phi_P(B^s)\) and \(\Theta_Q(B^s)\) are seasonal AR/MA polynomials.

For the Airline Model:

  • \(p = 0, d = 1, q = 1\)
  • \(P = 0, D = 1, Q = 1, s = 12\)

Thus:

\[ (1-B)(1-B^{12})Y_t = (1 + \theta_1 B)(1 + \Theta_1 B^{12})\varepsilon_t. \]


9.16.2 13.2 Forecast Recursion

One-step-ahead forecast:

\[ \hat{Y}_{t+1|t} = \sum_{i=1}^p \phi_i Y_{t+1-i} + \sum_{j=1}^q \theta_j \hat{\varepsilon}_{t+1-j} + \sum_{k=1}^P \Phi_k Y_{t+1-ks} + \sum_{\ell=1}^Q \Theta_\ell \hat{\varepsilon}_{t+1-\ell s}. \]

For multistep forecasts, future residuals \(\hat{\varepsilon}_{t+h}\) are set to 0.


9.17 14. Discussion

9.17.1 Model Performance

  • The model captures both trend and seasonality with only two MA parameters.
  • Diagnostics confirm no residual autocorrelation, meaning model structure is adequate.
  • Forecasts follow the real series closely in the holdout period.

9.17.2 Strengths

  • Simple, interpretable, historically successful model.
  • Performs extremely well for multiplicative seasonal series.

9.17.3 Limitations

  • No external regressors included.
  • Long-term forecasts may underestimate structural changes.

9.18 15. Recommendations

  1. Refit the model periodically as new data arrives.
  2. Consider SARIMAX models if external factors (e.g., economic indicators) are relevant.
  3. For long-term forecasting, evaluate exponential smoothing or structural models.
  4. Use bootstrap intervals for more robust uncertainty quantification.

9.19 16. Conclusion

This report demonstrated a full forecasting workflow, including:

  • data transformation
  • decomposition
  • stationarity testing
  • SARIMA identification
  • parameter estimation
  • diagnostics
  • forecasting
  • evaluation

The canonical Airline Model provides an excellent fit and strong forecasting accuracy.


9.20 17. References

  • Box, G.E.P., Jenkins, G.M., Reinsel, G.C., & Ljung, G.M. (2015).
    Time Series Analysis: Forecasting and Control.

  • Hyndman, R.J., & Athanasopoulos, G. (2018).
    Forecasting: Principles and Practice.