Introduction
Autocorrelation analysis is an important step in the Exploratory Data Analysis (EDA) of time series. It helps detect hidden patterns and seasonality and check whether a series is random. It is especially important when you intend to use an ARIMA model for forecasting, because autocorrelation analysis helps identify the AR and MA orders of the ARIMA model.
Overview
- Fundamentals
- Auto-Regressive and Moving Average Models
- Stationarity
- Autocorrelation Function and Partial Autocorrelation Function
- Order of AR, MA, and ARMA Model
- Examples
- AR(1) Process
- AR(2) Process
- MA(1) Process
- MA(2) Process
- Periodical
- Trend
- White Noise
- Random-Walk
- Constant
- 🚀 Cheat Sheet
- Case Study
- Bitcoin
- Ethereum
- Discussion on Random-Walk
import numpy as np # linear algebra
from numpy.random import seed
import math
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from datetime import datetime, date
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 14})
import seaborn as sns
import warnings # Suppress warnings
warnings.filterwarnings('ignore')
import statsmodels as sm
import statsmodels.tsa.arima_process # ensure sm.tsa.arima_process is importable for arma_generate_sample
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.arima_model import ARMA # note: deprecated and later removed from statsmodels (see the sidenote in the MA(1) section)
# Fix seed for reproducible results
SEED = 42
np.random.seed(SEED)
# Visualizations
lag_acf = 15
lag_pacf = 15
height = 4
width = 12
Fundamentals
Auto-Regressive and Moving Average Models
Auto-Regressive (AR) Model
$\hat{y}_t = \alpha_1 y_{t-1} + \dots + \alpha_p y_{t-p}$
The AR model assumes that the current value ($y_t$) is dependent on previous values ($y_{t-1}, y_{t-2}, y_{t-3}, \dots$). Because of this assumption, we can build a linear regression model.
To figure out the order of an AR model, you would use the PACF.
Moving Average (MA) Model
$\hat{y}_t = \epsilon_t + \beta_1 \epsilon_{t-1} + \dots + \beta_q \epsilon_{t-q}$
The MA model assumes that the current value ($y_t$) is dependent on the error terms including the current error ($\epsilon_t, \epsilon_{t-1}, \epsilon_{t-2}, \epsilon_{t-3}, \dots$). Because error terms are random, there is no linear relationship between the current value and the error terms.
To figure out the order of an MA model, you would use the ACF.
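To make these two definitions concrete, here is a minimal simulation sketch (not part of the original notebook; alpha, beta, and the sample size are illustrative values) that generates an AR(1) and an MA(1) series directly from their defining recursions:
import numpy as np
rng = np.random.default_rng(42)
n, alpha, beta = 200, 0.5, 0.5
eps = rng.normal(size=n)                 # the random error terms epsilon_t
ar1 = np.zeros(n)
ma1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = alpha * ar1[t-1] + eps[t]   # AR(1): current value depends on the previous value
    ma1[t] = eps[t] + beta * eps[t-1]    # MA(1): current value depends on the previous error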
Stationarity
ACF and PACF assume stationarity of the underlying time series. Stationarity can be checked by performing an Augmented Dickey-Fuller (ADF) test:
- p-value > 0.05: Fail to reject the null hypothesis (H0), the data has a unit root and is non-stationary.
- p-value <= 0.05: Reject the null hypothesis (H0), the data does not have a unit root and is stationary.
If the time series is stationary, continue to the next steps. If the time series is not stationary, try differencing the time series and check its stationarity again.
from statsmodels.tsa.stattools import adfuller
def check_stationarity(series):
    # Copied from https://machinelearningmastery.com/time-series-data-stationary-python/
    result = adfuller(series.values)
    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
    print('Critical Values:')
    for key, value in result[4].items():
        print('\t%s: %.3f' % (key, value))
    if (result[1] <= 0.05) & (result[4]['5%'] > result[0]):
        print("\u001b[32mStationary\u001b[0m")
    else:
        print("\x1b[31mNon-stationary\x1b[0m")
Autocorrelation Function and Partial Autocorrelation Function
Autocorrelation Function (ACF)
Correlation of the time series with a lagged version of itself, i.e. the correlation between the observation at the current time spot and the observations at previous time spots. The autocorrelation function starts at lag 0, which is the correlation of the time series with itself and therefore results in a correlation of 1.
We will be using the plot_acf function from the statsmodels.graphics.tsaplots library. (See statsmodels.tsa.stattools.acf)
The ACF plot can provide answers to the following questions:
- Is the observed time series white noise / random?
- Is an observation related to an adjacent observation, an observation twice-removed, and so on?
- Can the observed time series be modeled with an MA model? If yes, what is the order?
Partial Autocorrelation Function (PACF)
Additional correlation explained by each successive lagged term: the correlation between observations at two time spots, given that both observations are also correlated to observations at other time spots.
The partial autocorrelation at lag $k$ is the autocorrelation between $X_t$ and $X_{t-k}$ that is not accounted for by lags 1 through $k-1$.
We will be using the plot_pacf function from the statsmodels.graphics.tsaplots library with the parameter method="ols" (regression of the time series on its lags and on a constant). (See statsmodels.tsa.stattools.pacf)
Sidenote: The default value of the method parameter is "yw" (Yule-Walker with sample-size adjustment in the denominator for acovf). However, this default causes some implausible autocorrelations higher than 1 on the sample data. Therefore, we change the method parameter to one that does not cause this issue. "ywmle" would also work fine, as suggested in this StackExchange post.
The PACF plot can provide answers to the following questions:
- Can the observed time series be modeled with an AR model? If yes, what is the order?
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
Both the ACF and PACF start with a lag of 0, which is the correlation of the time series with itself and therefore results in a correlation of 1.
The difference between ACF and PACF is the inclusion or exclusion of indirect correlations in the calculation.
Furthermore, you will see a blue area in the ACF and PACF plots, which depicts the 95% confidence interval and is an indicator of the significance threshold. That means anything within the blue area is statistically close to zero and anything outside the blue area is statistically non-zero.
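As a sanity check of what the ACF actually computes, the lag-k autocorrelation can be calculated by hand and compared with statsmodels.tsa.stattools.acf (a minimal sketch on an assumed example series, not part of the original notebook):
import numpy as np
from statsmodels.tsa.stattools import acf
rng = np.random.default_rng(42)
x = rng.normal(size=200).cumsum()                     # an arbitrary example series
k = 3
x_c = x - x.mean()                                    # center the series
r_k = (x_c[:-k] * x_c[k:]).sum() / (x_c ** 2).sum()   # sample autocorrelation at lag k
print(r_k, acf(x, nlags=k)[k])                        # both values should agree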
Order of AR, MA, and ARMA Model
To determine the order of the model, you can use the following table:
| | AR(p) | MA(q) | ARMA(p, q) |
|---|---|---|---|
| ACF | Tails off (Geometric decay) | Significant at lag q / Cuts off after lag q | Tails off (Geometric decay) |
| PACF | Significant at lags 1 to p / Cuts off after lag p | Tails off (Geometric decay) | Tails off (Geometric decay) |
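The examples below cover the AR and MA columns of this table. For the ARMA column, a quick sketch (assumed parameters, not part of the original notebook) would generate an ARMA(1, 1) sample and show that both the ACF and the PACF tail off:
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import arma_generate_sample
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
np.random.seed(42)
ar = np.r_[1, -0.5]   # AR part with alpha_1 = 0.5 (zero-lag added, coefficient negated)
ma = np.r_[1, 0.5]    # MA part with beta_1 = 0.5 (zero-lag added)
arma_sample = arma_generate_sample(ar, ma, 128)
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(12, 8))
plot_acf(arma_sample, lags=15, ax=ax[0])                  # expected: tails off
plot_pacf(arma_sample, lags=15, ax=ax[1], method='ols')   # expected: tails off
plt.tight_layout()
plt.show()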
Examples
AR(1) Process
The following time series is an AR(1) process with 128 timesteps and the following parameters:
alpha_1 = 0.5
num_samples = 128
np.random.seed(SEED)
ar = np.r_[1, -np.array([alpha_1])] # add zero-lag and negate
ma = np.r_[1] # add zero-lag
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=num_samples, freq='MS'),
't' : sm.tsa.arima_process.arma_generate_sample(ar, ma, num_samples)
})
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()

1. Check Stationarity
The sample data is stationary. Therefore, we do not need to difference the time series.
check_stationarity(sample['t'])
ADF Statistic: -6.918528 p-value: 0.000000 Critical Values: 1%: -3.483 5%: -2.885 10%: -2.579 Stationary
2. Check ACF and PACF
We can make the following observations:
- There are several autocorrelations that are significantly non-zero. Therefore, the time series is non-random.
- High degree of autocorrelation between adjacent observations (lag = 1)
| | AR(p) | MA(q) | ARMA(p, q) |
|---|---|---|---|
| ACF | Tails off (Geometric decay) | Significant at lag q / Cuts off after lag q | Tails off (Geometric decay) |
| PACF | Significant at lags 1 to p / Cuts off after lag p | Tails off (Geometric decay) | Tails off (Geometric decay) |
-> We can use an AR(1) model to model this process.
So that for AR(1), the general AR(p) formula $\hat{y}_t = \alpha_1 y_{t-1} + \dots + \alpha_p y_{t-p}$ reduces to:
$\hat{y}_t = \alpha_1 y_{t-1}$
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[1], method='ols')
ax[1].annotate('Strong correlation at lag = 1', xy=(1, 0.6), xycoords='data',
xytext=(0.17, 0.75), textcoords='axes fraction',
arrowprops=dict(color='red', shrink=0.05, width=1))
plt.tight_layout()
plt.show()

3. Modelling
train_len = int(0.8* num_samples)
train = sample['t'][:train_len]
ar_model = AutoReg(train, lags=1).fit()
print(ar_model.summary())
pred = ar_model.predict(start=train_len, end=num_samples, dynamic=False)
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp[train_len:num_samples], y=sample.t[train_len:num_samples], marker='o', label='test', color='grey')
sns.lineplot(x=sample.timestamp[:train_len], y=train, marker='o', label='train')
sns.lineplot(x=sample.timestamp[train_len:num_samples], y=pred, marker='o', label='pred')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()
AutoReg Model Results ============================================================================== Dep. Variable: t No. Observations: 102 Model: AutoReg(1) Log Likelihood -133.385 Method: Conditional MLE S.D. of innovations 0.906 Date: Mon, 29 Nov 2021 AIC -0.137 Time: 13:32:54 BIC -0.060 Sample: 1 HQIC -0.106 102 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ intercept -0.1322 0.092 -1.434 0.152 -0.313 0.048 t.L1 0.4710 0.088 5.353 0.000 0.299 0.643 Roots ============================================================================= Real Imaginary Modulus Frequency ----------------------------------------------------------------------------- AR.1 2.1231 +0.0000j 2.1231 0.0000 -----------------------------------------------------------------------------

As you can see, the AR(1) model fits $\alpha_1 = 0.4710$, which is quite close to the alpha_1 = 0.5 we have set. However, the predicted values seem to be quite off in this case.
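To put a number on "quite off", one could compute an error metric on the hold-out part, e.g. the RMSE (a small sketch reusing the variables from the cells above; note that predict() includes end=num_samples and therefore returns one value more than the test slice):
test = sample['t'][train_len:num_samples]
rmse = np.sqrt(np.mean((test.values - pred.values[:len(test)]) ** 2))
print(f'Hold-out RMSE: {rmse:.3f}')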
AR(2) Process
The following time series is an AR(2) process with 128 timesteps and the following parameters:
alpha_1 = 0.5
alpha_2 = -0.5
np.random.seed(SEED)
ar = np.r_[1, -np.array([alpha_1, alpha_2])] # add zero-lag and negate
ma = np.r_[1] # add zero-lag
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=num_samples, freq='MS'),
't' : sm.tsa.arima_process.arma_generate_sample(ar, ma, num_samples)
})
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()

1. Check Stationarity
The sample data is stationary. Therefore, we do not need to difference the time series.
check_stationarity(sample['t'])
ADF Statistic: -9.171375 p-value: 0.000000 Critical Values: 1%: -3.484 5%: -2.885 10%: -2.579 Stationary
2. Check ACF and PACF
We can make the following observations:
- There are several autocorrelations that are significantly non-zero. Therefore, the time series is non-random.
- High degree of autocorrelation between adjacent (lag = 1) and near-adjacent (lag = 2) observations
| | AR(p) | MA(q) | ARMA(p, q) |
|---|---|---|---|
| ACF | Tails off (Geometric decay) | Significant at lag q / Cuts off after lag q | Tails off (Geometric decay) |
| PACF | Significant at lags 1 to p / Cuts off after lag p | Tails off (Geometric decay) | Tails off (Geometric decay) |
-> We can use an AR(2) model to model this process.
So that for AR(2), the general AR(p) formula $\hat{y}_t = \alpha_1 y_{t-1} + \dots + \alpha_p y_{t-p}$ reduces to:
$\hat{y}_t = \alpha_1 y_{t-1} + \alpha_2 y_{t-2}$
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[1], method='ols')
ax[1].annotate('Strong correlation at lag = 1', xy=(1, 0.36), xycoords='data',
xytext=(0.15, 0.7), textcoords='axes fraction',
arrowprops=dict(color='red', shrink=0.05, width=1))
ax[1].annotate('Strong correlation at lag = 2', xy=(2.1, -0.5), xycoords='data',
xytext=(0.25, 0.1), textcoords='axes fraction',
arrowprops=dict(color='red', shrink=0.05, width=1))
plt.tight_layout()
plt.show()

3. Modelling
train = sample['t'][:train_len]
ar_model = AutoReg(train, lags=2).fit()
print(ar_model.summary())
pred = ar_model.predict(start=train_len, end=num_samples, dynamic=False)
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp[train_len:num_samples], y=sample.t[train_len:num_samples], marker='o', label='test', color='grey')
sns.lineplot(x=sample.timestamp[:train_len], y=train, marker='o', label='train')
sns.lineplot(x=sample.timestamp[train_len:num_samples], y=pred, marker='o', label='pred')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()
AutoReg Model Results ============================================================================== Dep. Variable: t No. Observations: 102 Model: AutoReg(2) Log Likelihood -132.075 Method: Conditional MLE S.D. of innovations 0.906 Date: Mon, 29 Nov 2021 AIC -0.116 Time: 13:32:57 BIC -0.012 Sample: 2 HQIC -0.074 102 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ intercept -0.1321 0.091 -1.446 0.148 -0.311 0.047 t.L1 0.5191 0.082 6.363 0.000 0.359 0.679 t.L2 -0.5855 0.082 -7.103 0.000 -0.747 -0.424 Roots ============================================================================= Real Imaginary Modulus Frequency ----------------------------------------------------------------------------- AR.1 0.4434 -1.2294j 1.3069 -0.1949 AR.2 0.4434 +1.2294j 1.3069 0.1949 -----------------------------------------------------------------------------

As you can see, the AR(2) model fits $\alpha_1 = 0.5191$ and $\alpha_2 = -0.5855$, which is quite close to the alpha_1 = 0.5 and alpha_2 = -0.5 we have set. However, the predicted values seem to be quite off in this case as well, similarly to the AR(1) case.
MA(1) Process
The following time series is an MA(1) process with 128 timesteps and the following parameters:
beta_1 = 0.5
np.random.seed(SEED)
ar = np.r_[1] # add zero-lag and negate
ma = np.r_[1, np.array([beta_1])] # add zero-lag
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=num_samples, freq='MS'),
't' : sm.tsa.arima_process.arma_generate_sample(ar, ma, num_samples)
})
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()

1. Check Stationarity
The sample data is stationary. Therefore, we do not need to difference the time series.
check_stationarity(sample['t'])
ADF Statistic: -7.933570 p-value: 0.000000 Critical Values: 1%: -3.483 5%: -2.885 10%: -2.579 Stationary
2. Check ACF and PACF
We can make the following observations:
- There are several autocorrelations that are significantly non-zero. Therefore, the time series is non-random.
- High degree of autocorrelation between adjacent observations (lag = 1)
| | AR(p) | MA(q) | ARMA(p, q) |
|---|---|---|---|
| ACF | Tails off (Geometric decay) | Significant at lag q / Cuts off after lag q | Tails off (Geometric decay) |
| PACF | Significant at lags 1 to p / Cuts off after lag p | Tails off (Geometric decay) | Tails off (Geometric decay) |
-> We can use an MA(1) model to model this process.
So that for MA(1), the general MA(q) formula $\hat{y}_t = \epsilon_t + \beta_1 \epsilon_{t-1} + \dots + \beta_q \epsilon_{t-q}$ reduces to:
$\hat{y}_t = \epsilon_t + \beta_1 \epsilon_{t-1}$
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[1], method='ols')
ax[0].annotate('Strong correlation at lag = 1', xy=(1, 0.5), xycoords='data',
xytext=(0.15, 0.7), textcoords='axes fraction',
arrowprops=dict(color='red', shrink=0.05, width=1))
plt.tight_layout()
plt.show()

3. Modelling
train = sample['t'][:train_len]
ma_model = ARMA(train, order=(0,1)).fit()
print(ma_model.summary())
pred = ma_model.predict(start=train_len, end=num_samples, dynamic=False)
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp[train_len:num_samples], y=sample.t[train_len:num_samples], marker='o', label='test', color='grey')
sns.lineplot(x=sample.timestamp[:train_len], y=train, marker='o', label='train')
sns.lineplot(x=sample.timestamp[train_len:num_samples], y=pred, marker='o', label='pred')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()
RUNNING THE L-BFGS-B CODE * * * Machine precision = 2.220D-16 N = 2 M = 12 At X0 0 variables are exactly at the bounds At iterate 0 f= 1.31975D+00 |proj g|= 1.89335D-03 At iterate 5 f= 1.31974D+00 |proj g|= 0.00000D+00 * * * Tit = total number of iterations Tnf = total number of function evaluations Tnint = total number of segments explored during Cauchy searches Skip = number of BFGS updates skipped Nact = number of active bounds at final generalized Cauchy point Projg = norm of the final projected gradient F = final function value * * * N Tit Tnf Tnint Skip Nact Projg F 2 5 7 1 0 0 0.000D+00 1.320D+00 F = 1.3197385904095194 CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL ARMA Model Results ============================================================================== Dep. Variable: t No. Observations: 102 Model: ARMA(0, 1) Log Likelihood -134.613 Method: css-mle S.D. of innovations 0.904 Date: Mon, 29 Nov 2021 AIC 275.227 Time: 13:33:00 BIC 283.102 Sample: 0 HQIC 278.415 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ const -0.1765 0.135 -1.303 0.192 -0.442 0.089 ma.L1.t 0.5172 0.100 5.173 0.000 0.321 0.713 Roots ============================================================================= Real Imaginary Modulus Frequency ----------------------------------------------------------------------------- MA.1 -1.9336 +0.0000j 1.9336 0.5000 -----------------------------------------------------------------------------
This problem is unconstrained.

As you can see, the MA(1) model fits $\beta_1 = 0.5172$, which is quite close to the beta_1 = 0.5 we have set. However, the predicted values seem to be quite off in this case as well, similarly to the AR(p) cases.
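Sidenote: the ARMA class from statsmodels.tsa.arima_model used above was deprecated and later removed (statsmodels 0.13 and newer). Assuming such a version, an equivalent MA(1) fit can be obtained with the ARIMA class, where order=(p, d, q) = (0, 0, 1); a sketch consistent with the code above:
from statsmodels.tsa.arima.model import ARIMA
ma_model = ARIMA(train, order=(0, 0, 1)).fit()   # (p, d, q) = (0, 0, 1) corresponds to MA(1)
print(ma_model.summary())
pred = ma_model.predict(start=train_len, end=num_samples, dynamic=False)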
MA(2) Process
The following time series is an MA(2) process with 128 timesteps and the following parameters:
beta_1 = 0.5
beta_2 = 0.5
np.random.seed(SEED)
ar = np.r_[1] # add zero-lag and negate
ma = np.r_[1, np.array([beta_1, beta_2])] # add zero-lag
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=num_samples, freq='MS'),
't' : sm.tsa.arima_process.arma_generate_sample(ar, ma, num_samples)
})
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()

1. Check Stationarity
The sample data is stationary. Therefore, we do not need to difference the time series.
check_stationarity(sample['t'])
ADF Statistic: -3.882572 p-value: 0.002167 Critical Values: 1%: -3.485 5%: -2.886 10%: -2.580 Stationary
2. Check ACF and PACF
We can make the following observations:
- There are several autocorrelations that are significantly non-zero. Therefore, the time series is non-random.
- High degree of autocorrelation between adjacent (lag = 1) and near-adjacent (lag = 2) observations
| | AR(p) | MA(q) | ARMA(p, q) |
|---|---|---|---|
| ACF | Tails off (Geometric decay) | Significant at lag q / Cuts off after lag q | Tails off (Geometric decay) |
| PACF | Significant at lags 1 to p / Cuts off after lag p | Tails off (Geometric decay) | Tails off (Geometric decay) |
-> We can use an MA(2) model to model this process.
So that for MA(2), the general MA(q) formula $\hat{y}_t = \epsilon_t + \beta_1 \epsilon_{t-1} + \dots + \beta_q \epsilon_{t-q}$ reduces to:
$\hat{y}_t = \epsilon_t + \beta_1 \epsilon_{t-1} + \beta_2 \epsilon_{t-2}$
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[1], method='ols')
ax[0].annotate('Strong correlation at lag = 1', xy=(1, 0.65), xycoords='data',
xytext=(0.15, 0.8), textcoords='axes fraction',
arrowprops=dict(color='red', shrink=0.05, width=1))
ax[0].annotate('Strong correlation at lag = 2', xy=(2, 0.5), xycoords='data',
xytext=(0.25, 0.7), textcoords='axes fraction',
arrowprops=dict(color='red', shrink=0.05, width=1))
plt.tight_layout()
plt.show()

3. Modelling
train = sample['t'][:train_len]
ma_model = ARMA(train, order=(0,2)).fit()
print(ma_model.summary())
pred = ma_model.predict(start=train_len, end=num_samples, dynamic=False)
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp[train_len:num_samples], y=sample.t[train_len:num_samples], marker='o', label='test', color='grey')
sns.lineplot(x=sample.timestamp[:train_len], y=train, marker='o', label='train')
sns.lineplot(x=sample.timestamp[train_len:num_samples], y=pred, marker='o', label='pred')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()
RUNNING THE L-BFGS-B CODE * * * Machine precision = 2.220D-16 N = 3 M = 12 At X0 0 variables are exactly at the bounds At iterate 0 f= 1.31825D+00 |proj g|= 4.34901D-03 At iterate 5 f= 1.31811D+00 |proj g|= 3.27072D-05 * * * Tit = total number of iterations Tnf = total number of function evaluations Tnint = total number of segments explored during Cauchy searches Skip = number of BFGS updates skipped Nact = number of active bounds at final generalized Cauchy point Projg = norm of the final projected gradient F = final function value * * * N Tit Tnf Tnint Skip Nact Projg F 3 9 11 1 0 0 2.220D-08 1.318D+00 F = 1.3181092649688013 CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH ARMA Model Results ============================================================================== Dep. Variable: t No. Observations: 102 Model: ARMA(0, 2) Log Likelihood -134.447 Method: css-mle S.D. of innovations 0.900 Date: Mon, 29 Nov 2021 AIC 276.894 Time: 13:33:03 BIC 287.394 Sample: 0 HQIC 281.146 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ const -0.2292 0.186 -1.230 0.219 -0.594 0.136 ma.L1.t 0.5226 0.084 6.235 0.000 0.358 0.687 ma.L2.t 0.5843 0.110 5.318 0.000 0.369 0.800 Roots ============================================================================= Real Imaginary Modulus Frequency ----------------------------------------------------------------------------- MA.1 -0.4472 -1.2294j 1.3083 -0.3055 MA.2 -0.4472 +1.2294j 1.3083 0.3055 -----------------------------------------------------------------------------
This problem is unconstrained.

As you can see, the MA(2) model fits $\beta_1 = 0.5226$ and $\beta_2 = 0.5843$, which is quite close to the beta_1 = 0.5 and beta_2 = 0.5 we have set. However, the predicted values seem to be quite off in this case as well, similarly to the MA(1) case.
Periodical
The following time series is periodical with T=12. It consists of 48 timesteps.
T = 12
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
't' : [1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2, 1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2, 1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2, 1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2]
})
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()

1. Check Stationarity
The sample data is stationary. Therefore, we do not need to difference the time series.
check_stationarity(sample['t'])
ADF Statistic: -31248310026371.390625 p-value: 0.000000 Critical Values: 1%: -3.621 5%: -2.944 10%: -2.610 Stationary
2. Check ACF and PACF
We can make the following observations:
- There are several autocorrelations that are significantly non-zero. Therefore, the time series is non-random.
- High degree of autocorrelation between adjacent (lag = 1) and near-adjacent observations
- From both the ACF and PACF plots, we can see a strong correlation with the adjacent observation (lag = 1) and also at a lag of 12, which equals the period T = 12.
| | AR(p) | MA(q) | ARMA(p, q) |
|---|---|---|---|
| ACF | Tails off (Geometric decay) | Significant at lag q / Cuts off after lag q | Tails off (Geometric decay) |
| PACF | Significant at lags 1 to p / Cuts off after lag p = 12 | Tails off (Geometric decay) | Tails off (Geometric decay) |
-> We can use an AR(12) model to model this process.
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[1], method='ols')
for i in range(2):
    ax[i].axvline(x=T, color='r', linestyle='--')
    ax[i].set_xlim([-0.5, lag_acf+0.5])
    ax[i].set_ylim([-1.1, 1.1])
plt.tight_layout()
plt.show()

3. Modelling
train = sample['t'][:26]
ar_model = AutoReg(train, lags=12).fit()
print(ar_model.summary())
pred = ar_model.predict(start=26, end=48, dynamic=False)
AutoReg Model Results ============================================================================== Dep. Variable: t No. Observations: 26 Model: AutoReg(12) Log Likelihood 432.082 Method: Conditional MLE S.D. of innovations 0.000 Date: Mon, 29 Nov 2021 AIC -62.564 Time: 13:33:06 BIC -61.925 Sample: 12 HQIC -62.623 26 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ intercept 0.0205 9.48e-18 2.16e+15 0.000 0.021 0.021 t.L1 -0.0004 5.9e-15 -7.17e+10 0.000 -0.000 -0.000 t.L2 -0.0004 6.05e-15 -7e+10 0.000 -0.000 -0.000 t.L3 -0.0004 6.03e-15 -7.01e+10 0.000 -0.000 -0.000 t.L4 -0.0004 5.97e-15 -7.09e+10 0.000 -0.000 -0.000 t.L5 -0.0004 5.56e-15 -7.61e+10 0.000 -0.000 -0.000 t.L6 -0.0004 5.64e-15 -7.5e+10 0.000 -0.000 -0.000 t.L7 -0.0004 5.89e-15 -7.18e+10 0.000 -0.000 -0.000 t.L8 -0.0004 5.67e-15 -7.46e+10 0.000 -0.000 -0.000 t.L9 -0.0004 5.8e-15 -7.3e+10 0.000 -0.000 -0.000 t.L10 -0.0004 6e-15 -7.05e+10 0.000 -0.000 -0.000 t.L11 -0.0004 5.75e-15 -7.35e+10 0.000 -0.000 -0.000 t.L12 0.9996 5.61e-15 1.78e+14 0.000 1.000 1.000 Roots ============================================================================== Real Imaginary Modulus Frequency ------------------------------------------------------------------------------ AR.1 -1.0000 -0.0000j 1.0000 -0.5000 AR.2 -0.8660 -0.5000j 1.0000 -0.4167 AR.3 -0.8660 +0.5000j 1.0000 0.4167 AR.4 -0.5000 -0.8660j 1.0000 -0.3333 AR.5 -0.5000 +0.8660j 1.0000 0.3333 AR.6 0.0000 -1.0000j 1.0000 -0.2500 AR.7 0.0000 +1.0000j 1.0000 0.2500 AR.8 0.5000 -0.8660j 1.0000 -0.1667 AR.9 0.5000 +0.8660j 1.0000 0.1667 AR.10 1.0004 -0.0000j 1.0004 -0.0000 AR.11 0.8660 -0.5000j 1.0000 -0.0833 AR.12 0.8660 +0.5000j 1.0000 0.0833 ------------------------------------------------------------------------------
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp[26:48], y=sample.t[26:48], marker='o', label='test', color='grey')
sns.lineplot(x=sample.timestamp[:26], y=train, marker='o', label='train')
sns.lineplot(x=sample.timestamp[26:48], y=pred, marker='o', label='pred')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()

As you can see, the AR(12) model fits the periodical time series perfectly with $\alpha_{1 \dots 11} = -0.0004$ and $\alpha_{12} = 0.9996$.
The model $\hat{y}_t = \alpha_1 y_{t-1} + \dots + \alpha_p y_{t-p}$ with $p = 12$ results in: $\hat{y}_t = \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \alpha_3 y_{t-3} + \alpha_4 y_{t-4} + \alpha_5 y_{t-5} + \alpha_6 y_{t-6} + \alpha_7 y_{t-7} + \alpha_8 y_{t-8} + \alpha_9 y_{t-9} + \alpha_{10} y_{t-10} + \alpha_{11} y_{t-11} + \alpha_{12} y_{t-12}$
And with the coefficients $\alpha_{1 \dots 11} = -0.0004 \approx 0$ and $\alpha_{12} = 0.9996 \approx 1$, we get:
$\hat{y}_t = y_{t-12}$
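In other words, the fitted AR(12) is essentially a seasonal-naive forecast that repeats the value observed one period earlier. A tiny sketch (reusing sample and T from above) verifies this equivalence on the periodic series:
seasonal_naive = sample['t'].shift(T)[26:48]  # value observed T = 12 steps earlier
print(np.allclose(seasonal_naive.values, sample['t'][26:48].values))  # True for this exactly periodic series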
Trend
The following time series is the same as Periodical (periodical with T=12) with added trend. It consists of 48 timesteps.
time = np.arange(0, 48)
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
't' : [1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2, 1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2, 1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2, 1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2] + ((0.05*time)+20)
})
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()

1. Check Stationarity
The sample data is non-stationary.
check_stationarity(sample['t'])
ADF Statistic: -0.117684 p-value: 0.947648 Critical Values: 1%: -3.621 5%: -2.944 10%: -2.610 Non-stationary
Therefore, we need to difference the time series.
sample['t_diff'] = sample['t'].diff().fillna(0)
display(sample.head(5).style.set_caption('Sample Time Series'))
check_stationarity(sample['t_diff'])

ADF Statistic: -21.691240 p-value: 0.000000 Critical Values: 1%: -3.621 5%: -2.944 10%: -2.610 Stationary
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(12, 8))
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[0])
ax[0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[0].set_title('Sample Time Series')
sns.lineplot(x=sample.timestamp, y=sample['t_diff'], marker='o', ax=ax[1])
ax[1].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[1].set_title('Differenced Sample Time Series')
plt.tight_layout()
plt.show()

2. Check ACF and PACF
We can make the following observations:
- There are several autocorrelations that are significantly non-zero. Therefore, the time series is non-random.
- High degree of autocorrelation between adjacent (lag = 1) and near-adjacent observations
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t_diff'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t_diff'],lags=lag_pacf, ax=ax[1], method='ols')
for i in range(2):
    ax[i].axvline(x=T, color='r', linestyle='--')
    ax[i].set_xlim([-0.5, lag_acf+0.5])
    ax[i].set_ylim([-1.1, 1.1])
plt.tight_layout()
plt.show()

3. Modelling
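One possible way to continue here (a sketch based on an assumption, analogous to the Periodical example, not a definitive recipe) is to fit an AR(12) on the differenced series t_diff and integrate the forecasts back to the original scale:
train_diff = sample['t_diff'][:26]
ar_model = AutoReg(train_diff, lags=12).fit()
pred_diff = ar_model.predict(start=26, end=47, dynamic=False)
pred = sample['t'].iloc[25] + pred_diff.cumsum()  # undo the differencing with a cumulative sum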
White Noise
The following time series is random. It consists of 48 timesteps.
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
't' : np.random.randint(1,101,len(time))
})
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()

1. Check Stationarity
The sample data is stationary. Therefore, we do not need to difference the time series.
check_stationarity(sample['t'])
ADF Statistic: -7.978782 p-value: 0.000000 Critical Values: 1%: -3.578 5%: -2.925 10%: -2.601 Stationary
2. Check ACF and PACF
We can make the following observations:
- There is only one autocorrelation that is significantly non-zero at a lag of 0. Therefore, the time series is random.
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[1], method='ols')
plt.tight_layout()
plt.show()

3. Modelling
Modelling white noise is difficult because we cannot retrieve any parameters from the ACF and PACF plots.
Random-Walk
The following time series is random like White Noise. However, the current value depends on the previous one. It consists of 48 timesteps.
# Copied from https://towardsdatascience.com/time-series-from-scratch-white-noise-and-random-walk-5c96270514d3
# Start with a random number - let's say 0
np.random.seed(42)
random_walk = [0]
for i in range(1, 48):
    # Movement direction based on a random number
    num = -1 if np.random.random() < 0.5 else 1
    random_walk.append(random_walk[-1] + num)
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
't' : random_walk
})
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()

1. Check Stationarity
The sample data is non-stationary. Therefore, we need to difference the time series.
check_stationarity(sample['t'])
ADF Statistic: -0.731503 p-value: 0.838420 Critical Values: 1%: -3.578 5%: -2.925 10%: -2.601 Non-stationary
sample['t_diff'] = sample['t'].diff().fillna(0)
check_stationarity(sample['t_diff'])
ADF Statistic: -3.047661 p-value: 0.030677 Critical Values: 1%: -3.597 5%: -2.933 10%: -2.605 Stationary
2. Check ACF and PACF
We can make the following observations:
- In contrast to the ACF and PACF of the White Noise process, we can see a strong correlation at lag = 1, because the current value depends on the previous one.
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[1], method='ols')
ax[1].annotate('Correlation to adjacent observation', xy=(1, 0.9), xycoords='data',
xytext=(0.15, 0.7), textcoords='axes fraction',
arrowprops=dict(color='red', shrink=0.05, width=1))
plt.tight_layout()
plt.show()

f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t_diff'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t_diff'],lags=lag_pacf, ax=ax[1], method='ols')
plt.tight_layout()
plt.show()

3. Modelling
Similarly to white noise, modelling a random-walk is difficult because we cannot retrieve any parameters from the ACF and PACF plots.
Constant
The following time series is constant. It consists of 48 timesteps.
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
't' : 5
})
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()

1. Check Stationarity
The stationarity check reports the sample data as non-stationary (the ADF statistic cannot be computed for a constant series and is NaN). Therefore, we difference the time series.
check_stationarity(sample['t'])
ADF Statistic: nan p-value: nan Critical Values: 1%: -3.578 5%: -2.925 10%: -2.601 Non-stationary
sample['t_diff'] = sample['t'].diff().fillna(0)
check_stationarity(sample['t_diff'])
ADF Statistic: nan p-value: nan Critical Values: 1%: -3.578 5%: -2.925 10%: -2.601 Non-stationary
2. Check ACF and PACF
- ACF/PACF was applied to non-stationary time series
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[1], method='ols')
plt.tight_layout()
plt.show()

3. Modelling
Modelling a constant as an AR or MA process is difficult because we cannot retrieve any parameters from the ACF and PACF plots. On the other hand, if you can tell that a time series is constant, it should not be too difficult to forecast it, right?
🚀 Cheat Sheet
- Check stationarity. If stationary, continue to step 2. Else use differencing until time series is stationary and then continue to step 2.
- Check ACF and PACF with the following table:

| | AR(p) | MA(q) | ARMA(p, q) |
|---|---|---|---|
| ACF | Tails off (Geometric decay) | Significant at lag q / Cuts off after lag q | Tails off (Geometric decay) |
| PACF | Significant at lags 1 to p / Cuts off after lag p | Tails off (Geometric decay) | Tails off (Geometric decay) |
- Modelling
f, ax = plt.subplots(nrows=10, ncols=3, figsize=(2*width, 10*height))
### AR(1) ###
np.random.seed(SEED)
ar = np.r_[1, -np.array([alpha_1])] # add zero-lag and negate
ma = np.r_[1] # add zero-lag
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=num_samples, freq='MS'),
't' : sm.tsa.arima_process.arma_generate_sample(ar, ma, num_samples)
})
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[0,0])
ax[0,0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[0,0].set_title('Time Series for AR(1)')
plot_acf(sample['t'],lags=lag_acf, ax=ax[0, 1], title='ACF for AR(1)')
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[0, 2], method='ols', title='PACF for AR(1)')
ax[0,2].annotate('Strong correlation at lag = 1', xy=(1, 0.6), xycoords='data',
xytext=(0.17, 0.75), textcoords='axes fraction',
arrowprops=dict(color='red', shrink=0.05, width=1))
### AR(2) ###
np.random.seed(SEED)
ar = np.r_[1, -np.array([alpha_1, alpha_2])] # add zero-lag and negate
ma = np.r_[1] # add zero-lag
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=num_samples, freq='MS'),
't' : sm.tsa.arima_process.arma_generate_sample(ar, ma, num_samples)
})
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[1,0])
ax[1,0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[1,0].set_title('Time Series for AR(2)')
plot_acf(sample['t'],lags=lag_acf, ax=ax[1, 1], title='ACF for AR(2)')
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[1, 2], method='ols', title='PACF for AR(2)')
ax[1, 2].annotate('Strong correlation at lag = 1', xy=(1, 0.36), xycoords='data',
xytext=(0.15, 0.7), textcoords='axes fraction',
arrowprops=dict(color='red', shrink=0.05, width=1))
ax[1, 2].annotate('Strong correlation at lag = 2', xy=(2.1, -0.5), xycoords='data',
xytext=(0.25, 0.1), textcoords='axes fraction',
arrowprops=dict(color='red', shrink=0.05, width=1))
### MA(1) ###
np.random.seed(SEED)
ar = np.r_[1] # add zero-lag and negate
ma = np.r_[1, np.array([beta_1])] # add zero-lag
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=num_samples, freq='MS'),
't' : sm.tsa.arima_process.arma_generate_sample(ar, ma, num_samples)
})
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[2,0])
ax[2,0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[2,0].set_title('Time Series for MA(1)')
plot_acf(sample['t'],lags=lag_acf, ax=ax[2, 1], title='ACF for MA(1)')
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[2, 2], method='ols', title='PACF for MA(1)')
ax[2,1].annotate('Strong correlation at lag = 1', xy=(1, 0.5), xycoords='data',
xytext=(0.15, 0.7), textcoords='axes fraction',
arrowprops=dict(color='red', shrink=0.05, width=1))
### MA(2) ###
np.random.seed(SEED)
ar = np.r_[1] # add zero-lag and negate
ma = np.r_[1, np.array([beta_1, beta_2])] # add zero-lag
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=num_samples, freq='MS'),
't' : sm.tsa.arima_process.arma_generate_sample(ar, ma, num_samples)
})
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[3,0])
ax[3,0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[3,0].set_title('Time Series for MA(2)')
plot_acf(sample['t'],lags=lag_acf, ax=ax[3, 1], title='ACF for MA(2)')
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[3, 2], method='ols', title='PACF for MA(2)')
ax[3, 1].annotate('Strong correlation at lag = 1', xy=(1, 0.65), xycoords='data',
xytext=(0.15, 0.8), textcoords='axes fraction',
arrowprops=dict(color='red', shrink=0.05, width=1))
ax[3, 1].annotate('Strong correlation at lag = 2', xy=(2, 0.5), xycoords='data',
xytext=(0.25, 0.7), textcoords='axes fraction',
arrowprops=dict(color='red', shrink=0.05, width=1))
### Periodical ###
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
't' : [1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2, 1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2, 1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2, 1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2]
})
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[4,0])
ax[4,0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[4,0].set_title('Time Series for Periodical')
plot_acf(sample['t'],lags=lag_acf, ax=ax[4, 1], title='ACF for Periodical')
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[4, 2], method='ols', title='PACF for Periodical')
ax[4,2].axvline(x=T, color='r', linestyle='--')
### Trend ###
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
't' : ((0.05*time)+20)
})
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[5,0])
ax[5,0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[5,0].set_title('Time Series for Trend (NON-STATIONARY!)')
plot_acf(sample['t'],lags=lag_acf, ax=ax[5, 1], title='ACF for Trend (applied to non-stationary)')
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[5, 2], method='ols', title='PACF for Trend (applied to non-stationary)')
### White Noise ###
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
't' : np.random.randint(1,101,len(time))
})
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[6,0])
ax[6,0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[6,0].set_title('Time Series for White Noise')
plot_acf(sample['t'],lags=lag_acf, ax=ax[6, 1], title='ACF for White Noise')
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[6, 2], method='ols', title='PACF for White Noise')
### Random-Walk ###
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
't' : random_walk
})
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[7,0])
ax[7,0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[7,0].set_title('Time Series for Random-Walk (NON-STATIONARY!)')
plot_acf(sample['t'],lags=lag_acf, ax=ax[7, 1], title='ACF for Random-Walk (applied to non-stationary)')
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[7, 2], method='ols', title='PACF for Random-Walk (applied to non-stationary)')
sample['t_diff'] = sample['t'].diff().fillna(0)
plot_acf(sample['t_diff'],lags=lag_acf, ax=ax[8, 1], title='ACF for Random-Walk (applied to differenced/stationary)')
plot_pacf(sample['t_diff'],lags=lag_pacf, ax=ax[8, 2], method='ols', title='PACF for Random-Walk (applied to differenced/stationary)')
### Constant ###
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
't' : 5
})
sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[9,0])
ax[9,0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[9,0].set_title('Time Series for Constant (NON-STATIONARY!)')
plot_acf(sample['t'],lags=lag_acf, ax=ax[9, 1], title='ACF for Constant (applied to non-stationary)')
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[9, 2], method='ols', title='PACF for Constant (applied to non-stationary)')
for i in range(9):
    ax[i, 1].set_ylim([-1.1, 1.1])
    ax[i, 2].set_ylim([-1.1, 1.1])
f.delaxes(ax[8, 0])
plt.tight_layout()
plt.show()

Case Study
Next, we will interpret the ACF and PACF plots in the G-Research Crypto Forecasting Competition. To simplify this case study, we will resample the minute-wise data into daily samples and only have a look at the Close price.
data_folder = "../input/g-research-crypto-forecasting/"
train = pd.read_csv(data_folder + 'train.csv', low_memory=False)
train['timestamp'] = train['timestamp'].astype('datetime64[s]')
Bitcoin
train_mini = train[train.Asset_ID == 1].reset_index()
train_mini = train_mini[['timestamp','Close']].resample('D', on='timestamp').last()['Close']
train_mini = train_mini.to_frame().reset_index(drop=False)
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12,5))
sns.lineplot(data=train_mini, x='timestamp', y='Close')
plt.show()

1. Check Stationarity
The sample data is non-stationary. Therefore, we need to difference the time series.
check_stationarity(train_mini['Close'])
train_mini['Close_diff'] = train_mini['Close'].diff().fillna(0)
check_stationarity(train_mini['Close_diff'])
ADF Statistic: -0.422392 p-value: 0.906293 Critical Values: 1%: -3.435 5%: -2.864 10%: -2.568 Non-stationary ADF Statistic: -7.377094 p-value: 0.000000 Critical Values: 1%: -3.435 5%: -2.864 10%: -2.568 Stationary
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12,5))
sns.lineplot(data=train_mini, x='timestamp', y='Close_diff')
plt.show()
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12,5))
sns.histplot(train_mini['Close_diff'] )
plt.show()


2. Check ACF and PACF
f, ax = plt.subplots(nrows=2, ncols=2, figsize=(2*width, 2*height))
plot_acf(train_mini['Close'], lags=20, ax=ax[0, 0], title='ACF on non-stationary')
plot_pacf(train_mini['Close'], lags=20, ax=ax[0, 1], method='ols', title='PACF on non-stationary')
plot_acf(train_mini['Close_diff'], lags=20, ax=ax[1, 0], title='ACF on differenced/stationary')
plot_pacf(train_mini['Close_diff'], lags=20, ax=ax[1, 1], method='ols', title='PACF on differenced/stationary')
plt.tight_layout()
plt.show()

Ethereum
train_mini = train[train.Asset_ID == 6].reset_index()
train_mini = train_mini[['timestamp','Close']].resample('D', on='timestamp').last()['Close']
train_mini = train_mini.to_frame().reset_index(drop=False)
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12,5))
sns.lineplot(data=train_mini, x='timestamp', y='Close')
plt.show()

1. Check Stationarity
The sample data is non-stationary. Therefore, we need to difference the time series.
check_stationarity(train_mini['Close'])
train_mini['Close_diff'] = train_mini['Close'].diff().fillna(0)
check_stationarity(train_mini['Close_diff'])
ADF Statistic: 0.149049 p-value: 0.969255 Critical Values: 1%: -3.435 5%: -2.864 10%: -2.568 Non-stationary ADF Statistic: -10.075336 p-value: 0.000000 Critical Values: 1%: -3.435 5%: -2.864 10%: -2.568 Stationary
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12,5))
sns.lineplot(data=train_mini, x='timestamp', y='Close_diff')
plt.show()
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12,5))
sns.histplot(train_mini['Close_diff'] )
plt.show()


2. Check ACF and PACF
f, ax = plt.subplots(nrows=2, ncols=2, figsize=(2*width, 2*height))
plot_acf(train_mini['Close'], lags=20, ax=ax[0, 0], title='ACF on non-stationary')
plot_pacf(train_mini['Close'], lags=20, ax=ax[0, 1], method='ols', title='PACF on non-stationary')
plot_acf(train_mini['Close_diff'], lags=20, ax=ax[1, 0], title='ACF on differenced/stationary')
plot_pacf(train_mini['Close_diff'], lags=20, ax=ax[1, 1], method='ols', title='PACF on differenced/stationary')
plt.tight_layout()
plt.show()

Discussion on Random-Walk
For both Bitcoin and Ethereum, we can observe random-walk behavior (see 🚀 Cheat Sheet). This is fairly common in stock prices (see Random Walk Theory).
A random walk is unpredictable; it cannot reasonably be predicted.
Given the way that the random walk is constructed, we can expect that the best prediction we could make would be to use the observation at the previous time step as what will happen in the next time step.
Simply because we know that the next time step will be a function of the prior time step.
This is often called the naive forecast, or a persistence model. – A Gentle Introduction to the Random Walk for Times Series Forecasting with Python
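A minimal sketch of such a persistence (naive) forecast on the daily Close prices prepared above (train_mini holds the Ethereum data at this point; the same idea applies to Bitcoin):
naive_pred = train_mini['Close'].shift(1)                    # predict tomorrow's Close with today's Close
naive_mae = (train_mini['Close'] - naive_pred).abs().mean()  # mean absolute error of the persistence model
print(f'Persistence forecast MAE: {naive_mae:.2f}')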