|
8 | 8 | identifier: SARIMA
|
9 | 9 | parent: stock_prediction
|
10 | 10 | weight: 11
|
11 |
| -hero: images/test_forecast.png |
| 11 | +hero: images/sarima_example_9_1.png |
12 | 12 | tags: ["Finance", "Statistics", "Forecasting"]
|
13 | 13 | categories: ["Finance"]
|
14 | 14 | ---
|
| 15 | +## Introduction |
15 | 16 |
|
16 |
| -ok |
| 17 | +The Seasonal Autoregressive Integrated Moving Average (SARIMA) model is an extension of the ARIMA model (discussed in the [previous article](/posts/finance/stock_prediction/arima/)) that incorporates seasonality. This makes it particularly useful for analyzing financial time series data, which often exhibits both trend and seasonal patterns. In this article, we'll apply the SARIMA model to Apple (AAPL) stock data, perform signal decomposition, and provide a detailed mathematical explanation of the model. |
| 18 | + |
| 19 | +## 1. Data Preparation and Exploration |
| 20 | + |
| 21 | +First, let's obtain the Apple stock data and prepare it for analysis: |
| 22 | + |
| 23 | + |
| 24 | +```python |
| 25 | + |
| 26 | +import pandas as pd |
| 27 | +import numpy as np |
| 28 | +import matplotlib.pyplot as plt |
| 29 | +import yfinance as yf |
| 30 | +from statsmodels.tsa.seasonal import seasonal_decompose |
| 31 | +from statsmodels.tsa.statespace.sarimax import SARIMAX |
| 32 | +from statsmodels.tsa.stattools import adfuller, acf, pacf |
| 33 | +from pmdarima import auto_arima |
| 34 | +import seaborn as sns |
| 35 | + |
| 36 | + |
| 37 | +# Download Apple stock data |
| 38 | +ticker = "AAPL" |
| 39 | +start_date = "2022-01-01" |
| 40 | +end_date = "2024-01-23" |
| 41 | +data = yf.download(ticker, start=start_date, end=end_date) |
| 42 | +df = data.copy() |
| 43 | + |
| 44 | +# Use closing prices for our analysis |
| 45 | +ts = data['Close'].dropna() |
| 46 | + |
| 47 | +# Plot the time series |
| 48 | +plt.figure(figsize=(12,6)) |
| 49 | +plt.plot(ts) |
| 50 | +plt.title('Apple Stock Closing Prices') |
| 51 | +plt.xlabel('Date') |
| 52 | +plt.ylabel('Price') |
| 53 | +plt.show() |
| 54 | + |
| 55 | +``` |
| 56 | + |
| 57 | + [*********************100%%**********************] 1 of 1 completed |
| 58 | + |
| 59 | + |
| 60 | + |
| 61 | + |
| 62 | + |
| 63 | + |
| 64 | + |
| 65 | + |
| 66 | + |
| 67 | + |
| 68 | + |
| 69 | + |
| 70 | +## 2. Signal Decomposition |
| 71 | + |
| 72 | +Before we build our SARIMA model, it's crucial to understand the components of our time series. We'll use seasonal decomposition to break down the series into trend, seasonal, and residual components: |
| 73 | + |
| 74 | + |
| 75 | + |
| 76 | + |
| 77 | +```python |
| 78 | +# Perform seasonal decomposition |
| 79 | +decomposition = seasonal_decompose(ts, model='additive', period=12) # 252 trading days in a year |
| 80 | +plt.rcParams.update({'figure.dpi':200}) |
| 81 | + |
| 82 | +# Plot the decomposition |
| 83 | +fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(12, 16)) |
| 84 | +decomposition.observed.plot(ax=ax1) |
| 85 | +ax1.set_title('Observed') |
| 86 | +decomposition.trend.plot(ax=ax2, c='r') |
| 87 | +ax2.set_title('Trend') |
| 88 | +decomposition.seasonal.plot(ax=ax3, c='k') |
| 89 | +ax3.set_title('Seasonal') |
| 90 | +decomposition.resid.plot(ax=ax4, c='g') |
| 91 | +ax4.set_title('Residual') |
| 92 | +plt.tight_layout() |
| 93 | +plt.show() |
| 94 | + |
| 95 | + |
| 96 | +``` |
| 97 | + |
| 98 | + |
| 99 | + |
| 100 | + |
| 101 | + |
| 102 | + |
| 103 | + |
| 104 | +This decomposition helps us understand: |
| 105 | +1. **Trend**: The long-term progression of the series |
| 106 | +2. **Seasonality**: Repeating patterns or cycles |
| 107 | +3. **Residual**: The noise left after accounting for trend and seasonality |
| 108 | + |
| 109 | +## 3. Mathematical Formulation of SARIMA |
| 110 | + |
| 111 | +The SARIMA model is denoted as SARIMA(p,d,q)(P,D,Q)m, where: |
| 112 | +- p, d, q: Non-seasonal orders of AR, differencing, and MA |
| 113 | +- P, D, Q: Seasonal orders of AR, differencing, and MA |
| 114 | +- m: Number of periods per season |
| 115 | + |
| 116 | +The mathematical formulation of SARIMA combines the non-seasonal and seasonal components: |
| 117 | + |
| 118 | +1. **Non-seasonal components**: |
| 119 | + - $AR(p): φ(B) = 1 - φ₁B - φ₂B² - ... - φₚBᵖ$ |
| 120 | + - $MA(q): θ(B) = 1 + θ₁B + θ₂B² + ... + θ_qBq$ |
| 121 | + |
| 122 | +2. **Seasonal components**: |
| 123 | + - $SAR(P): Φ(Bᵐ) = 1 - Φ₁Bᵐ - Φ₂B²ᵐ - ... - Φ_PBᴾᵐ$ |
| 124 | + - $SMA(Q): Θ(Bᵐ) = 1 + Θ₁Bᵐ + Θ₂B²ᵐ + ... + Θ_QBQᵐ$ |
| 125 | + |
| 126 | +3. **Differencing**: |
| 127 | + - Non-seasonal: $(1-B)ᵈ$ |
| 128 | + - Seasonal: $(1-Bᵐ)D$ |
| 129 | + |
| 130 | +The complete SARIMA model can be written as: |
| 131 | + |
| 132 | +$$φ(B)Φ(Bᵐ)(1-B)ᵈ(1-Bᵐ)D Yₜ = θ(B)Θ(Bᵐ)εₜ$$ |
| 133 | + |
| 134 | +Where: |
| 135 | +- B is the backshift operator ($BYₜ = Yₜ₋₁$) |
| 136 | +- Yₜ is the time series |
| 137 | +- εₜ is white noise |
| 138 | + |
| 139 | +## 4. SARIMA Model Implementation |
| 140 | + |
| 141 | +Now that we understand the components and mathematical formulation, let's implement the SARIMA model: |
| 142 | +From the previous article we found that the best *(p,d,q)* |
| 143 | + |
| 144 | + |
| 145 | +```python |
| 146 | +from pmdarima.arima.utils import nsdiffs |
| 147 | +D = nsdiffs(ts, |
| 148 | + m=12, # commonly requires knowledge of dataset |
| 149 | + max_D=12, |
| 150 | + ) # -> 0 |
| 151 | +D |
| 152 | +``` |
| 153 | + |
| 154 | +> 0 |
| 155 | +
|
| 156 | + |
| 157 | + |
| 158 | + |
| 159 | +```python |
| 160 | +# Create Training and Test |
| 161 | +train = ts[:int(len(df)*0.8)] |
| 162 | +test = ts[int(len(df)*0.8):] |
| 163 | +``` |
| 164 | + |
| 165 | + |
| 166 | +```python |
| 167 | +def grid_search_sarima(train, test, p_range, d_range, q_range, P_range, D_range, Q_range, m_range): |
| 168 | + best_aic = float('inf') |
| 169 | + best_mape = float('inf') |
| 170 | + best_order = None |
| 171 | + for p in p_range: |
| 172 | + for d in d_range: |
| 173 | + for q in q_range: |
| 174 | + for P in P_range: |
| 175 | + for D in D_range: |
| 176 | + for Q in Q_range: |
| 177 | + for m in m_range: |
| 178 | + try: |
| 179 | + model = SARIMAX(train.values, order=(p,d,q), seasonal_order=(P,D,Q,m)) |
| 180 | + results = model.fit() |
| 181 | + fc_series = pd.Series(results.forecast(steps=len(test)), index=test.index) # 95% conf |
| 182 | + test_metrics = forecast_accuracy(fc_series.values, test.values) |
| 183 | + # if results.aic < best_aic: |
| 184 | + # best_aic = results.aic |
| 185 | + # best_order = (p,d,q) |
| 186 | + # print(f"(p,d,q)=({p},{d},{q}) | (P,D,Q,m)=({P},{D},{Q},{m})", test_metrics['mape']) |
| 187 | + if test_metrics['mape'] < best_mape: |
| 188 | + best_mape = test_metrics['mape'] |
| 189 | + best_order = (p,d,q) |
| 190 | + best_sorder = (P,D,Q,m) |
| 191 | + print("temp best:" + f"(p,d,q)=({p},{d},{q}) | (P,D,Q,m)=({P},{D},{Q},{m})", |
| 192 | + round(test_metrics['mape'], 4)) |
| 193 | + except Exception as e: |
| 194 | + print(e) |
| 195 | + continue |
| 196 | + return best_order, best_sorder |
| 197 | + |
| 198 | + |
| 199 | +# Accuracy metrics |
| 200 | +def forecast_accuracy(forecast, actual): |
| 201 | + mape = np.mean(np.abs(forecast - actual)/np.abs(actual)) # MAPE |
| 202 | + me = np.mean(forecast - actual) # ME |
| 203 | + mae = np.mean(np.abs(forecast - actual)) # MAE |
| 204 | + mpe = np.mean((forecast - actual)/actual) # MPE |
| 205 | + rmse = np.mean((forecast - actual)**2)**.5 # RMSE |
| 206 | + corr = np.corrcoef(forecast, actual)[0,1] # corr |
| 207 | + mins = np.amin(np.hstack([forecast[:,None], |
| 208 | + actual[:,None]]), axis=1) |
| 209 | + maxs = np.amax(np.hstack([forecast[:,None], |
| 210 | + actual[:,None]]), axis=1) |
| 211 | + minmax = 1 - np.mean(mins/maxs) # minmax |
| 212 | + acf1 = acf(forecast-test)[1] # ACF1 |
| 213 | + return({'mape':mape, 'me':me, 'mae': mae, |
| 214 | + 'mpe': mpe, 'rmse':rmse, 'acf1':acf1, |
| 215 | + 'corr':corr, 'minmax':minmax}) |
| 216 | +``` |
| 217 | + |
| 218 | + |
| 219 | +```python |
| 220 | +# Determine optimal SARIMA parameters |
| 221 | +# model = auto_arima(ts, seasonal=True, m=12, |
| 222 | +# start_p=0, start_q=0, start_P=1, start_Q=1, |
| 223 | +# max_p=4, max_q=4, max_P=2, max_Q=2, d=1, D=1, |
| 224 | +# trace=True, error_action='ignore', suppress_warnings=True, stepwise=True, out_of_sample=len(test)) |
| 225 | + |
| 226 | +# print(model.summary()) |
| 227 | + |
| 228 | +# Custom Grid Search |
| 229 | +# d = 1 # from |
| 230 | +# best_order, best_sorder = grid_search_sarima(train, test, range(0,3), [d], range(0,3), range(3), [D], range(3), [4, 8, 12]) |
| 231 | +# print(f"Best ARIMA order based on grid search: {best_order}") |
| 232 | + |
| 233 | +# Fit the SARIMA model |
| 234 | +# sarima_model = SARIMAX(ts, order=model.order, seasonal_order=model.seasonal_order) (2,1,2) | (P,D,Q,m)=(2,1,1, |
| 235 | +# sarima_model = SARIMAX(train, order=best_order, seasonal_order=best_sorder) #(2,1,2) | (P,D,Q,m)=(2,1,1,8) |
| 236 | +# print(best_order, best_sorder) |
| 237 | +# sarima_model = SARIMAX(train, order=(2,1,2), seasonal_order=(6,0,6,8)) |
| 238 | +sarima_model = SARIMAX(train, order=(2,1,2), seasonal_order=(2,2,1,8)) |
| 239 | +results = sarima_model.fit() |
| 240 | + |
| 241 | +print(results.summary()) |
| 242 | +``` |
| 243 | + |
| 244 | + c:\Users\stefa\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting. |
| 245 | + self._init_dates(dates, freq) |
| 246 | + c:\Users\stefa\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting. |
| 247 | + self._init_dates(dates, freq) |
| 248 | + c:\Users\stefa\anaconda3\lib\site-packages\statsmodels\base\model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals |
| 249 | + warnings.warn("Maximum Likelihood optimization failed to " |
| 250 | + |
| 251 | + |
| 252 | + SARIMAX Results |
| 253 | + =========================================================================================== |
| 254 | + Dep. Variable: Close No. Observations: 412 |
| 255 | + Model: SARIMAX(2, 1, 2)x(2, 2, [1], 8) Log Likelihood -1073.473 |
| 256 | + Date: Fri, 05 Jul 2024 AIC 2162.947 |
| 257 | + Time: 20:16:14 BIC 2194.778 |
| 258 | + Sample: 0 HQIC 2175.558 |
| 259 | + - 412 |
| 260 | + Covariance Type: opg |
| 261 | + ============================================================================== |
| 262 | + coef std err z P>|z| [0.025 0.975] |
| 263 | + ------------------------------------------------------------------------------ |
| 264 | + ar.L1 0.0464 0.051 0.902 0.367 -0.054 0.147 |
| 265 | + ar.L2 0.8356 0.048 17.504 0.000 0.742 0.929 |
| 266 | + ma.L1 -0.0003 290.396 -1.14e-06 1.000 -569.165 569.164 |
| 267 | + ma.L2 -0.9997 290.178 -0.003 0.997 -569.739 567.739 |
| 268 | + ar.S.L8 -0.6368 0.047 -13.574 0.000 -0.729 -0.545 |
| 269 | + ar.S.L16 -0.2957 0.042 -7.043 0.000 -0.378 -0.213 |
| 270 | + ma.S.L8 -1.0000 3741.746 -0.000 1.000 -7334.687 7332.687 |
| 271 | + sigma2 11.5312 4.29e+04 0.000 1.000 -8.4e+04 8.4e+04 |
| 272 | + =================================================================================== |
| 273 | + Ljung-Box (L1) (Q): 0.14 Jarque-Bera (JB): 23.66 |
| 274 | + Prob(Q): 0.71 Prob(JB): 0.00 |
| 275 | + Heteroskedasticity (H): 0.40 Skew: -0.04 |
| 276 | + Prob(H) (two-sided): 0.00 Kurtosis: 4.20 |
| 277 | + =================================================================================== |
| 278 | + |
| 279 | + Warnings: |
| 280 | + [1] Covariance matrix calculated using the outer product of gradients (complex-step). |
| 281 | + |
| 282 | + |
| 283 | + |
| 284 | +```python |
| 285 | +# sarima_model = SARIMAX(train, order=(2,1,2), seasonal_order=(2,2,1,8)) |
| 286 | +# results = sarima_model.fit() |
| 287 | + |
| 288 | + |
| 289 | +# Forecast |
| 290 | +forecast_steps = len(test) # Forecast for one year |
| 291 | +forecast = results.get_forecast(steps=forecast_steps) |
| 292 | +forecast_ci = forecast.conf_int(alpha=0.25) |
| 293 | +# forecast_index = pd.date_range(start=test.index[0]+pd.Timedelta(24, 'hours'), periods=len(forecast_ci)) |
| 294 | +forecast_index = test.index |
| 295 | + |
| 296 | +# Plot the forecast |
| 297 | +plt.figure(figsize=(12, 6)) |
| 298 | +plt.plot(train[-200:], label='training') |
| 299 | +# plt.plot(ts.index, ts, label='Observed') |
| 300 | +plt.plot(test, label='actual') |
| 301 | +plt.plot(forecast_index, forecast.predicted_mean + decomposition.seasonal[test.index].values*10, color='r', label='forecast') |
| 302 | +plt.fill_between(forecast_index, forecast_ci.iloc[:, 0], forecast_ci.iloc[:, 1], color='pink', alpha=0.4) |
| 303 | +plt.title(f'SARIMA Forecast of {ticker} Stock Prices') |
| 304 | +plt.xlabel('Date') |
| 305 | +plt.ylabel('Price') |
| 306 | +plt.legend() |
| 307 | +plt.show() |
| 308 | +``` |
| 309 | + |
| 310 | + c:\Users\stefa\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:836: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`. |
| 311 | + return get_prediction_index( |
| 312 | + |
| 313 | + |
| 314 | + |
| 315 | + |
| 316 | + |
| 317 | + |
| 318 | + |
| 319 | + |
| 320 | + |
| 321 | +```python |
| 322 | +# forec_temp = pd.DataFrame(forecast.predicted_mean) |
| 323 | +# forec_temp.index = forecast_index |
| 324 | +# pd.DataFrame(pd.DataFrame(decomposition.seasonal[test.index]).values + forec_temp.values).plot() |
| 325 | +``` |
| 326 | + |
| 327 | +## 5. Model Diagnostics |
| 328 | + |
| 329 | +After fitting the model, it's important to check its adequacy: |
| 330 | + |
| 331 | + |
| 332 | + |
| 333 | +```python |
| 334 | +# Model diagnostics |
| 335 | +results.plot_diagnostics(figsize=(12, 8)) |
| 336 | +plt.show() |
| 337 | +``` |
| 338 | + |
| 339 | + |
| 340 | + |
| 341 | + |
| 342 | + |
| 343 | + |
| 344 | + |
| 345 | +These diagnostic plots help us check for: |
| 346 | +1. **Standardized residual**: Should resemble white noise |
| 347 | +2. **Histogram plus KDE**: Should be normally distributed |
| 348 | +3. **Q-Q plot**: Should follow the straight line |
| 349 | +4. **Correlogram**: Should show no significant autocorrelation |
| 350 | + |
| 351 | +## Conclusion |
| 352 | + |
| 353 | +The SARIMA model provides a powerful tool for analyzing and forecasting time series data with both trend and seasonal components. By decomposing the Apple stock price series and applying a SARIMA model, we've gained insights into the underlying patterns and potential future movements of the stock. |
| 354 | + |
| 355 | +Key takeaways: |
| 356 | +1. Signal decomposition revealed clear trend and seasonal components in Apple's stock price. |
| 357 | +2. The SARIMA model captures both non-seasonal and seasonal patterns in the data. |
| 358 | +3. Model diagnostics are crucial for ensuring the validity of our forecasts. |
| 359 | + |
| 360 | +Remember that while these models can provide valuable insights, stock prices are influenced by many external factors not captured in historical data alone. Always combine statistical analysis with fundamental research and an understanding of market conditions when making investment decisions. |
0 commit comments