Skip to content

Commit b9726eb

Browse files
committed
sarima model article completed
1 parent 0b1ef49 commit b9726eb

33 files changed

+1299
-714
lines changed
Loading
Loading
Loading
Loading

content/posts/finance/stock_prediction/SARIMA/index.md

+346-2
Original file line numberDiff line numberDiff line change
@@ -8,9 +8,353 @@ menu:
88
identifier: SARIMA
99
parent: stock_prediction
1010
weight: 11
11-
hero: images/test_forecast.png
11+
hero: images/sarima_example_9_1.png
1212
tags: ["Finance", "Statistics", "Forecasting"]
1313
categories: ["Finance"]
1414
---
15+
## Introduction
1516

16-
ok
17+
The Seasonal Autoregressive Integrated Moving Average (SARIMA) model is an extension of the ARIMA model (discussed in the [previous article](/posts/finance/stock_prediction/arima/)) that incorporates seasonality. This makes it particularly useful for analyzing financial time series data, which often exhibits both trend and seasonal patterns. In this article, we'll apply the SARIMA model to Apple (AAPL) stock data, perform signal decomposition, and provide a detailed mathematical explanation of the model.
18+
19+
## 1. Data Preparation and Exploration
20+
21+
First, let's obtain the Apple stock data and prepare it for analysis:
22+
23+
24+
```python
25+
26+
import pandas as pd
27+
import numpy as np
28+
import matplotlib.pyplot as plt
29+
import yfinance as yf
30+
from statsmodels.tsa.seasonal import seasonal_decompose
31+
from statsmodels.tsa.statespace.sarimax import SARIMAX
32+
from statsmodels.tsa.stattools import adfuller, acf, pacf
33+
from pmdarima import auto_arima
34+
import seaborn as sns
35+
36+
37+
# Download Apple stock data
38+
ticker = "AAPL"
39+
start_date = "2022-01-01"
40+
end_date = "2024-01-23"
41+
data = yf.download(ticker, start=start_date, end=end_date)
42+
df = data.copy()
43+
44+
# Use closing prices for our analysis
45+
ts = data['Close'].dropna()
46+
47+
# Plot the time series
48+
plt.figure(figsize=(12,6))
49+
plt.plot(ts)
50+
plt.title('Apple Stock Closing Prices')
51+
plt.xlabel('Date')
52+
plt.ylabel('Price')
53+
plt.show()
54+
55+
```
56+
57+
[*********************100%%**********************] 1 of 1 completed
58+
59+
60+
61+
62+
![png](images/sarima_example_1_1.png)
63+
64+
65+
66+
67+
68+
69+
70+
## 2. Signal Decomposition
71+
72+
Before we build our SARIMA model, it's crucial to understand the components of our time series. We'll use seasonal decomposition to break down the series into trend, seasonal, and residual components:
73+
74+
75+
76+
77+
```python
78+
# Perform seasonal decomposition
79+
decomposition = seasonal_decompose(ts, model='additive', period=12) # 252 trading days in a year
80+
plt.rcParams.update({'figure.dpi':200})
81+
82+
# Plot the decomposition
83+
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(12, 16))
84+
decomposition.observed.plot(ax=ax1)
85+
ax1.set_title('Observed')
86+
decomposition.trend.plot(ax=ax2, c='r')
87+
ax2.set_title('Trend')
88+
decomposition.seasonal.plot(ax=ax3, c='k')
89+
ax3.set_title('Seasonal')
90+
decomposition.resid.plot(ax=ax4, c='g')
91+
ax4.set_title('Residual')
92+
plt.tight_layout()
93+
plt.show()
94+
95+
96+
```
97+
98+
99+
100+
![png](images/sarima_example_3_0.png)
101+
102+
103+
104+
This decomposition helps us understand:
105+
1. **Trend**: The long-term progression of the series
106+
2. **Seasonality**: Repeating patterns or cycles
107+
3. **Residual**: The noise left after accounting for trend and seasonality
108+
109+
## 3. Mathematical Formulation of SARIMA
110+
111+
The SARIMA model is denoted as SARIMA(p,d,q)(P,D,Q)m, where:
112+
- p, d, q: Non-seasonal orders of AR, differencing, and MA
113+
- P, D, Q: Seasonal orders of AR, differencing, and MA
114+
- m: Number of periods per season
115+
116+
The mathematical formulation of SARIMA combines the non-seasonal and seasonal components:
117+
118+
1. **Non-seasonal components**:
119+
- $AR(p): φ(B) = 1 - φ₁B - φ₂B² - ... - φₚBᵖ$
120+
- $MA(q): θ(B) = 1 + θ₁B + θ₂B² + ... + θ_qBq$
121+
122+
2. **Seasonal components**:
123+
- $SAR(P): Φ(Bᵐ) = 1 - Φ₁Bᵐ - Φ₂B²ᵐ - ... - Φ_PBᴾᵐ$
124+
- $SMA(Q): Θ(Bᵐ) = 1 + Θ₁Bᵐ + Θ₂B²ᵐ + ... + Θ_QBQᵐ$
125+
126+
3. **Differencing**:
127+
- Non-seasonal: $(1-B)ᵈ$
128+
- Seasonal: $(1-Bᵐ)D$
129+
130+
The complete SARIMA model can be written as:
131+
132+
$$φ(B)Φ(Bᵐ)(1-B)ᵈ(1-Bᵐ)D Yₜ = θ(B)Θ(Bᵐ)εₜ$$
133+
134+
Where:
135+
- B is the backshift operator ($BYₜ = Yₜ₋₁$)
136+
- Yₜ is the time series
137+
- εₜ is white noise
138+
139+
## 4. SARIMA Model Implementation
140+
141+
Now that we understand the components and mathematical formulation, let's implement the SARIMA model:
142+
From the previous article we found that the best *(p,d,q)*
143+
144+
145+
```python
146+
from pmdarima.arima.utils import nsdiffs
147+
D = nsdiffs(ts,
148+
m=12, # commonly requires knowledge of dataset
149+
max_D=12,
150+
) # -> 0
151+
D
152+
```
153+
154+
> 0
155+
156+
157+
158+
159+
```python
160+
# Create Training and Test
161+
train = ts[:int(len(df)*0.8)]
162+
test = ts[int(len(df)*0.8):]
163+
```
164+
165+
166+
```python
167+
def grid_search_sarima(train, test, p_range, d_range, q_range, P_range, D_range, Q_range, m_range):
168+
best_aic = float('inf')
169+
best_mape = float('inf')
170+
best_order = None
171+
for p in p_range:
172+
for d in d_range:
173+
for q in q_range:
174+
for P in P_range:
175+
for D in D_range:
176+
for Q in Q_range:
177+
for m in m_range:
178+
try:
179+
model = SARIMAX(train.values, order=(p,d,q), seasonal_order=(P,D,Q,m))
180+
results = model.fit()
181+
fc_series = pd.Series(results.forecast(steps=len(test)), index=test.index) # 95% conf
182+
test_metrics = forecast_accuracy(fc_series.values, test.values)
183+
# if results.aic < best_aic:
184+
# best_aic = results.aic
185+
# best_order = (p,d,q)
186+
# print(f"(p,d,q)=({p},{d},{q}) | (P,D,Q,m)=({P},{D},{Q},{m})", test_metrics['mape'])
187+
if test_metrics['mape'] < best_mape:
188+
best_mape = test_metrics['mape']
189+
best_order = (p,d,q)
190+
best_sorder = (P,D,Q,m)
191+
print("temp best:" + f"(p,d,q)=({p},{d},{q}) | (P,D,Q,m)=({P},{D},{Q},{m})",
192+
round(test_metrics['mape'], 4))
193+
except Exception as e:
194+
print(e)
195+
continue
196+
return best_order, best_sorder
197+
198+
199+
# Accuracy metrics
200+
def forecast_accuracy(forecast, actual):
201+
mape = np.mean(np.abs(forecast - actual)/np.abs(actual)) # MAPE
202+
me = np.mean(forecast - actual) # ME
203+
mae = np.mean(np.abs(forecast - actual)) # MAE
204+
mpe = np.mean((forecast - actual)/actual) # MPE
205+
rmse = np.mean((forecast - actual)**2)**.5 # RMSE
206+
corr = np.corrcoef(forecast, actual)[0,1] # corr
207+
mins = np.amin(np.hstack([forecast[:,None],
208+
actual[:,None]]), axis=1)
209+
maxs = np.amax(np.hstack([forecast[:,None],
210+
actual[:,None]]), axis=1)
211+
minmax = 1 - np.mean(mins/maxs) # minmax
212+
acf1 = acf(forecast-test)[1] # ACF1
213+
return({'mape':mape, 'me':me, 'mae': mae,
214+
'mpe': mpe, 'rmse':rmse, 'acf1':acf1,
215+
'corr':corr, 'minmax':minmax})
216+
```
217+
218+
219+
```python
220+
# Determine optimal SARIMA parameters
221+
# model = auto_arima(ts, seasonal=True, m=12,
222+
# start_p=0, start_q=0, start_P=1, start_Q=1,
223+
# max_p=4, max_q=4, max_P=2, max_Q=2, d=1, D=1,
224+
# trace=True, error_action='ignore', suppress_warnings=True, stepwise=True, out_of_sample=len(test))
225+
226+
# print(model.summary())
227+
228+
# Custom Grid Search
229+
# d = 1 # from
230+
# best_order, best_sorder = grid_search_sarima(train, test, range(0,3), [d], range(0,3), range(3), [D], range(3), [4, 8, 12])
231+
# print(f"Best ARIMA order based on grid search: {best_order}")
232+
233+
# Fit the SARIMA model
234+
# sarima_model = SARIMAX(ts, order=model.order, seasonal_order=model.seasonal_order) (2,1,2) | (P,D,Q,m)=(2,1,1,
235+
# sarima_model = SARIMAX(train, order=best_order, seasonal_order=best_sorder) #(2,1,2) | (P,D,Q,m)=(2,1,1,8)
236+
# print(best_order, best_sorder)
237+
# sarima_model = SARIMAX(train, order=(2,1,2), seasonal_order=(6,0,6,8))
238+
sarima_model = SARIMAX(train, order=(2,1,2), seasonal_order=(2,2,1,8))
239+
results = sarima_model.fit()
240+
241+
print(results.summary())
242+
```
243+
244+
c:\Users\stefa\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
245+
self._init_dates(dates, freq)
246+
c:\Users\stefa\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
247+
self._init_dates(dates, freq)
248+
c:\Users\stefa\anaconda3\lib\site-packages\statsmodels\base\model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
249+
warnings.warn("Maximum Likelihood optimization failed to "
250+
251+
252+
SARIMAX Results
253+
===========================================================================================
254+
Dep. Variable: Close No. Observations: 412
255+
Model: SARIMAX(2, 1, 2)x(2, 2, [1], 8) Log Likelihood -1073.473
256+
Date: Fri, 05 Jul 2024 AIC 2162.947
257+
Time: 20:16:14 BIC 2194.778
258+
Sample: 0 HQIC 2175.558
259+
- 412
260+
Covariance Type: opg
261+
==============================================================================
262+
coef std err z P>|z| [0.025 0.975]
263+
------------------------------------------------------------------------------
264+
ar.L1 0.0464 0.051 0.902 0.367 -0.054 0.147
265+
ar.L2 0.8356 0.048 17.504 0.000 0.742 0.929
266+
ma.L1 -0.0003 290.396 -1.14e-06 1.000 -569.165 569.164
267+
ma.L2 -0.9997 290.178 -0.003 0.997 -569.739 567.739
268+
ar.S.L8 -0.6368 0.047 -13.574 0.000 -0.729 -0.545
269+
ar.S.L16 -0.2957 0.042 -7.043 0.000 -0.378 -0.213
270+
ma.S.L8 -1.0000 3741.746 -0.000 1.000 -7334.687 7332.687
271+
sigma2 11.5312 4.29e+04 0.000 1.000 -8.4e+04 8.4e+04
272+
===================================================================================
273+
Ljung-Box (L1) (Q): 0.14 Jarque-Bera (JB): 23.66
274+
Prob(Q): 0.71 Prob(JB): 0.00
275+
Heteroskedasticity (H): 0.40 Skew: -0.04
276+
Prob(H) (two-sided): 0.00 Kurtosis: 4.20
277+
===================================================================================
278+
279+
Warnings:
280+
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
281+
282+
283+
284+
```python
285+
# sarima_model = SARIMAX(train, order=(2,1,2), seasonal_order=(2,2,1,8))
286+
# results = sarima_model.fit()
287+
288+
289+
# Forecast
290+
forecast_steps = len(test) # Forecast for one year
291+
forecast = results.get_forecast(steps=forecast_steps)
292+
forecast_ci = forecast.conf_int(alpha=0.25)
293+
# forecast_index = pd.date_range(start=test.index[0]+pd.Timedelta(24, 'hours'), periods=len(forecast_ci))
294+
forecast_index = test.index
295+
296+
# Plot the forecast
297+
plt.figure(figsize=(12, 6))
298+
plt.plot(train[-200:], label='training')
299+
# plt.plot(ts.index, ts, label='Observed')
300+
plt.plot(test, label='actual')
301+
plt.plot(forecast_index, forecast.predicted_mean + decomposition.seasonal[test.index].values*10, color='r', label='forecast')
302+
plt.fill_between(forecast_index, forecast_ci.iloc[:, 0], forecast_ci.iloc[:, 1], color='pink', alpha=0.4)
303+
plt.title(f'SARIMA Forecast of {ticker} Stock Prices')
304+
plt.xlabel('Date')
305+
plt.ylabel('Price')
306+
plt.legend()
307+
plt.show()
308+
```
309+
310+
c:\Users\stefa\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:836: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
311+
return get_prediction_index(
312+
313+
314+
315+
316+
![png](images/sarima_example_9_1.png)
317+
318+
319+
320+
321+
```python
322+
# forec_temp = pd.DataFrame(forecast.predicted_mean)
323+
# forec_temp.index = forecast_index
324+
# pd.DataFrame(pd.DataFrame(decomposition.seasonal[test.index]).values + forec_temp.values).plot()
325+
```
326+
327+
## 5. Model Diagnostics
328+
329+
After fitting the model, it's important to check its adequacy:
330+
331+
332+
333+
```python
334+
# Model diagnostics
335+
results.plot_diagnostics(figsize=(12, 8))
336+
plt.show()
337+
```
338+
339+
340+
341+
![png](images/sarima_example_12_0.png)
342+
343+
344+
345+
These diagnostic plots help us check for:
346+
1. **Standardized residual**: Should resemble white noise
347+
2. **Histogram plus KDE**: Should be normally distributed
348+
3. **Q-Q plot**: Should follow the straight line
349+
4. **Correlogram**: Should show no significant autocorrelation
350+
351+
## Conclusion
352+
353+
The SARIMA model provides a powerful tool for analyzing and forecasting time series data with both trend and seasonal components. By decomposing the Apple stock price series and applying a SARIMA model, we've gained insights into the underlying patterns and potential future movements of the stock.
354+
355+
Key takeaways:
356+
1. Signal decomposition revealed clear trend and seasonal components in Apple's stock price.
357+
2. The SARIMA model captures both non-seasonal and seasonal patterns in the data.
358+
3. Model diagnostics are crucial for ensuring the validity of our forecasts.
359+
360+
Remember that while these models can provide valuable insights, stock prices are influenced by many external factors not captured in historical data alone. Always combine statistical analysis with fundamental research and an understanding of market conditions when making investment decisions.

content/posts/finance/stock_prediction/SARIMA/sarima_example.ipynb

+111-306
Large diffs are not rendered by default.

0 commit comments

Comments
 (0)