Interpreting ACF and PACF | Time Series

Crypto

Introduction

Autocorrelation analysis is an important step in the Exploratory Data Analysis (EDA) of time series. The autocorrelation analysis helps in detecting hidden patterns and seasonality and in checking for randomness. It is especially important when you intend to use an ARIMA model for forecasting because the autocorrelation analysis helps to identify the AR and MA parameters for the ARIMA model.

Overview

  • Fundamentals
    • Auto-Regressive and Moving Average Models
    • Stationarity
    • Autocorrelation Function and Partial Autocorrelation Function
    • Order of AR, MA, and ARMA Model
  • Examples
    • AR(1) Process
    • AR(2) Process
    • MA(1) Process
    • MA(2) Process
    • Periodical
    • Trend
    • White Noise
    • Random-Walk
    • Constant
  • 🚀 Cheat Sheet
  • Case Study
    • Bitcoin
    • Ethereum
    • Discussion on Random-Walk
import numpy as np # linear algebra
from numpy.random import seed 
import math 

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from datetime import datetime, date 

import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 14})
import seaborn as sns

import warnings # Supress warnings 
warnings.filterwarnings('ignore')

import statsmodels as sm
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.arima_model import ARMA

# Fix seed for reproducible results
SEED = 42
np.random.seed(SEED)

# Visualizations
lag_acf = 15
lag_pacf = 15
height = 4
width = 12

Fundamentals

Auto-Regressive and Moving Average Models
Auto-Regressive (AR) Model

y^t=α1yt−1+⋯+αpyt−py^t=α1yt−1+⋯+αpyt−p

The AR model assumes that the current value (ytyt) is dependent on previous values (yt−1,yt−2,yt−3,…yt−1,yt−2,yt−3,…). Because of this assumption, we can build a linear regression model.

To figure out the order of an AR model, you would use the PACF.

Moving Average (MA) Model

y^t=ϵt+β1ϵt−1+⋯+βqϵt−qy^t=ϵt+β1ϵt−1+⋯+βqϵt−q

The MA model assumes that the current value (ytyt) is dependent on the error terms including the current error (ϵt,ϵt−1,ϵt−2,ϵt−3,…ϵt,ϵt−1,ϵt−2,ϵt−3,…). Because error terms are random, there is no linear relationship between the current value and the error terms.

To figure out the order of an MA model, you would use the ACF.

Stationarity

ACF and PACF assume stationarity of the underlying time series. Staionarity can be checked by performing an Augmented Dickey-Fuller (ADF) test:

  • p-value > 0.05: Fail to reject the null hypothesis (H0), the data has a unit root and is non-stationary.
  • p-value <= 0.05: Reject the null hypothesis (H0), the data does not have a unit root and is stationary.
[…] We can see that our [ADF] statistic value […] is less than the value […] at 1%. This suggests that we can reject the null hypothesis with a significance level of less than 1% (i.e. a low probability that the result is a statistical fluke). Rejecting the null hypothesis means that the process has no unit root, and in turn that the time series is stationary or does not have time-dependent structure. – Machine Learning Mastery: How to Check if Time Series Data is Stationary with Python

If the time series is stationary, continue to the next steps. If the time series is not stationary, try differencing the time series and check its stationarity again.

from statsmodels.tsa.stattools import adfuller

def check_stationarity(series):
    # Copied from https://machinelearningmastery.com/time-series-data-stationary-python/

    result = adfuller(series.values)

    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
    print('Critical Values:')
    for key, value in result[4].items():
        print('\t%s: %.3f' % (key, value))

    if (result[1] <= 0.05) & (result[4]['5%'] > result[0]):
        print("\u001b[32mStationary\u001b[0m")
    else:
        print("\x1b[31mNon-stationary\x1b[0m")
Autocorrelation Function and Partial Autocorrelation Function
Autocorrelation Function (ACF)

Correlation between time series with a lagged version of itself. The correlation between the observation at the current time spot and the observations at previous time spots.The autocorrelation function starts a lag 0, which is the correlation of the time series with itself and therefore results in a correlation of 1.

We will be using the plot_acf function from the statsmodels.graphics.tsaplots library. (See statsmodels.tsa.stattools.acf)

The ACF plot can provide answers to the following questions:

  • Is the observed time series white noise / random?
  • Is an observation related to an adjacent observation, an observation twice-removed, and so on?
  • Can the observed time series be modeled with an MA model? If yes, what is the order?
Partial Autocorrelation Function (PACF)

Additional correlation explained by each successive lagged term. The correlation between pbservations at two time spots given that we consider both observations are correlated to observations at other time spots.

The partial autocorrelation at lag k is the autocorrelation between XtXt and Xt−kXt−k that is not accounted for by lags 1 through k−1k−1.

We will be using the plot_pacf function from the statsmodels.graphics.tsaplots library with the parameter method = "ols" (regression of time series on lags of it and on constant). (See statsmodels.tsa.stattools.pacf)

Sidenote: The default parameter for method is yw (Yule-Walker with sample-size adjustment in denominator for acovf). However, this default value is causing some implausible autocorrelations higher than 1 on the sample data. Therefore, we change the method parameter to one that is not causing this issue. ywmle would also work fine as suggested in this StackExchange post

The PACF plot can provide answers to the following questions:

  • Can the observed time series be modeled with an AR model? If yes, what is the order?
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf

Both the ACF and PACF start with a lag of 0, which is the correlation of the time series with itself and therefore results in a correlation of 1.

The difference between ACF and PACF is the inclusion or exclusion of indirect correlations in the calculation.

Furthermore, you will see a blue area in the ACF and PACF plots, which depicts the 95% confidence interval and is in indicator for the significance threshold. That means, anything within the blue area is statistically close to zero and anything outside the blue area is statistically non-zero.

Order of AR, MA, and ARMA Model

To determine the order of the model, you can use the following table:

AR(pp)MA(qq)ARMA(pp, qq)
ACFTails off (Geometric decay)Significant at lag qq / Cuts off after lag qqTails off (Geometric decay)
PACFSignificant at each lag pp / Cuts off after lag ppTails off (Geometric decay)Tails off (Geometric decay)

Examples

AR(1) Process

The following time series is an AR(1) process with 128 timesteps and the following parameters:

alpha_1 = 0.5
num_samples =  128

np.random.seed(SEED)
ar = np.r_[1, -np.array([alpha_1])] # add zero-lag and negate
ma = np.r_[1] # add zero-lag

sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=num_samples, freq='MS'),
                       't' : sm.tsa.arima_process.arma_generate_sample(ar, ma, num_samples)
                      })

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()
1. Check Stationarity

The sample data is stationary. Therefore, we do not need to difference the time series.

check_stationarity(sample['t'])
ADF Statistic: -6.918528
p-value: 0.000000
Critical Values:
	1%: -3.483
	5%: -2.885
	10%: -2.579
Stationary
2. Check ACF and PACF

We can make the following observations:

  • There are several autocorrelations that are significantly non-zero. Therefore, the time series is non-random.
  • High degree of autocorrelation between adjacent (lag = 1)
AR(pp)MA(qq)ARMA(pp, qq)
ACFTails off (Geometric decay)Significant at lag qq / Cuts off after lag qqTails off (Geometric decay)
PACFSignificant at each lag pp / Cuts off after lag ppTails off (Geometric decay)Tails off (Geometric decay)

-> We can use an AR(1) model to model this process.

So that for AR(1), we would model the AR(p) formula y^t=α1yt−1+⋯+αpyt−py^t=α1yt−1+⋯+αpyt−p to the following:

y^t=α1yt−1

f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[1], method='ols')

ax[1].annotate('Strong correlation at lag = 1', xy=(1, 0.6),  xycoords='data',
            xytext=(0.17, 0.75), textcoords='axes fraction',
            arrowprops=dict(color='red', shrink=0.05, width=1))

plt.tight_layout()
plt.show()
3. Modelling
train_len = int(0.8* num_samples)

train = sample['t'][:train_len]
ar_model = AutoReg(train, lags=1).fit()

print(ar_model.summary())
pred = ar_model.predict(start=train_len, end=num_samples, dynamic=False)

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp[train_len:num_samples], y=sample.t[train_len:num_samples], marker='o', label='test', color='grey')
sns.lineplot(x=sample.timestamp[:train_len], y=train, marker='o', label='train')
sns.lineplot(x=sample.timestamp[train_len:num_samples], y=pred, marker='o', label='pred')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()
                            AutoReg Model Results                             
==============================================================================
Dep. Variable:                      t   No. Observations:                  102
Model:                     AutoReg(1)   Log Likelihood                -133.385
Method:               Conditional MLE   S.D. of innovations              0.906
Date:                Mon, 29 Nov 2021   AIC                             -0.137
Time:                        13:32:54   BIC                             -0.060
Sample:                             1   HQIC                            -0.106
                                  102                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept     -0.1322      0.092     -1.434      0.152      -0.313       0.048
t.L1           0.4710      0.088      5.353      0.000       0.299       0.643
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            2.1231           +0.0000j            2.1231            0.0000
-----------------------------------------------------------------------------

As you can see, the AR(1) model fits an α1=0.4710α1=0.4710, which is quite close to the alpha_1 = 0.5 which we have set. However, the predicted values seem to be quite off in this case.linkcode

AR(2) Process

The following time series is an AR(2) process with 128 timesteps and the following parameters:

alpha_1 = 0.5
alpha_2 = -0.5
np.random.seed(SEED)
ar = np.r_[1, -np.array([alpha_1, alpha_2])] # add zero-lag and negate
ma = np.r_[1] # add zero-lag

sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=num_samples, freq='MS'),
                       't' : sm.tsa.arima_process.arma_generate_sample(ar, ma, num_samples)
                      })

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()
1. Check Stationarity

The sample data is stationary. Therefore, we do not need to difference the time series

check_stationarity(sample['t'])
ADF Statistic: -9.171375
p-value: 0.000000
Critical Values:
	1%: -3.484
	5%: -2.885
	10%: -2.579
Stationary
2. Check ACF and PACF

We can make the following observations:

  • There are several autocorrelations that are significantly non-zero. Therefore, the time series is non-random.
  • High degree of autocorrelation between adjacent (lag = 1) and near-adjacent (lag = 2) observations
AR(pp)MA(qq)ARMA(pp, qq)
ACFTails off (Geometric decay)Significant at lag qq / Cuts off after lag qqTails off (Geometric decay)
PACFSignificant at each lag pp / Cuts off after lag ppTails off (Geometric decay)Tails off (Geometric decay)

-> We can use an AR(2) model to model this process.

So that for AR(2), we would model the AR(p) formula y^t=α1yt−1+⋯+αpyt−py^t=α1yt−1+⋯+αpyt−p to the following:

y^t=α1yt−1+α2yt−2

f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[1], method='ols')

ax[1].annotate('Strong correlation at lag = 1', xy=(1, 0.36),  xycoords='data',
            xytext=(0.15, 0.7), textcoords='axes fraction',
            arrowprops=dict(color='red', shrink=0.05, width=1))

ax[1].annotate('Strong correlation at lag = 2', xy=(2.1, -0.5),  xycoords='data',
            xytext=(0.25, 0.1), textcoords='axes fraction',
            arrowprops=dict(color='red', shrink=0.05, width=1))

plt.tight_layout()
plt.show()
3. Modelling
train = sample['t'][:train_len]
ar_model = AutoReg(train, lags=2).fit()

print(ar_model.summary())
pred = ar_model.predict(start=train_len, end=num_samples, dynamic=False)

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp[train_len:num_samples], y=sample.t[train_len:num_samples], marker='o', label='test', color='grey')
sns.lineplot(x=sample.timestamp[:train_len], y=train, marker='o', label='train')
sns.lineplot(x=sample.timestamp[train_len:num_samples], y=pred, marker='o', label='pred')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()
                            AutoReg Model Results                             
==============================================================================
Dep. Variable:                      t   No. Observations:                  102
Model:                     AutoReg(2)   Log Likelihood                -132.075
Method:               Conditional MLE   S.D. of innovations              0.906
Date:                Mon, 29 Nov 2021   AIC                             -0.116
Time:                        13:32:57   BIC                             -0.012
Sample:                             2   HQIC                            -0.074
                                  102                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept     -0.1321      0.091     -1.446      0.148      -0.311       0.047
t.L1           0.5191      0.082      6.363      0.000       0.359       0.679
t.L2          -0.5855      0.082     -7.103      0.000      -0.747      -0.424
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            0.4434           -1.2294j            1.3069           -0.1949
AR.2            0.4434           +1.2294j            1.3069            0.1949
-----------------------------------------------------------------------------

As you can see, the AR(2) model fits α1=0.5191α1=0.5191 and α2=−0.5855α2=−0.5855, which is quite close to the alpha_1 = 0.5 and alpha_2 = -0.5 which we have set. However, the predicted values seem to be quite off as well in this case – similarly to the AR(1) case.linkcode

MA(1) Process

The following time series is an MA(1) process with 128 timesteps and the following parameters:

beta_1 = 0.5
np.random.seed(SEED)
ar = np.r_[1] # add zero-lag and negate
ma = np.r_[1, np.array([beta_1])] # add zero-lag

sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=num_samples, freq='MS'),
                       't' : sm.tsa.arima_process.arma_generate_sample(ar, ma, num_samples)
                      })

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()
1. Check Stationarity

The sample data is stationary. Therefore, we do not need to difference the time series.

check_stationarity(sample['t'])
ADF Statistic: -7.933570
p-value: 0.000000
Critical Values:
	1%: -3.483
	5%: -2.885
	10%: -2.579
Stationary
2. Check ACF and PACF

We can make the following observations:

  • There are several autocorrelations that are significantly non-zero. Therefore, the time series is non-random.
  • High degree of autocorrelation between adjacent (lag = 1)
AR(pp)MA(qq)ARMA(pp, qq)
ACFTails off (Geometric decay)Significant at lag qq / Cuts off after lag qqTails off (Geometric decay)
PACFSignificant at each lag pp / Cuts off after lag ppTails off (Geometric decay)Tails off (Geometric decay)

-> We can use an MA(1) model to model this process.

So that for MA(1), we would model the MA(q) formula y^t=ϵt+β1ϵt−1+⋯+βqϵt−qy^t=ϵt+β1ϵt−1+⋯+βqϵt−q to the following:

y^t=ϵt+β1ϵt−1

f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[1], method='ols')

ax[0].annotate('Strong correlation at lag = 1', xy=(1, 0.5),  xycoords='data',
            xytext=(0.15, 0.7), textcoords='axes fraction',
            arrowprops=dict(color='red', shrink=0.05, width=1))

plt.tight_layout()
plt.show()
3. Modelling
train = sample['t'][:train_len]
ma_model = ARMA(train, order=(0,1)).fit()

print(ma_model.summary())
pred = ma_model.predict(start=train_len, end=num_samples, dynamic=False)

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp[train_len:num_samples], y=sample.t[train_len:num_samples], marker='o', label='test', color='grey')
sns.lineplot(x=sample.timestamp[:train_len], y=train, marker='o', label='train')
sns.lineplot(x=sample.timestamp[train_len:num_samples], y=pred, marker='o', label='pred')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            2     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  1.31975D+00    |proj g|=  1.89335D-03

At iterate    5    f=  1.31974D+00    |proj g|=  0.00000D+00

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    2      5      7      1     0     0   0.000D+00   1.320D+00
  F =   1.3197385904095194     

CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL            
                              ARMA Model Results                              
==============================================================================
Dep. Variable:                      t   No. Observations:                  102
Model:                     ARMA(0, 1)   Log Likelihood                -134.613
Method:                       css-mle   S.D. of innovations              0.904
Date:                Mon, 29 Nov 2021   AIC                            275.227
Time:                        13:33:00   BIC                            283.102
Sample:                             0   HQIC                           278.415
                                                                              
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.1765      0.135     -1.303      0.192      -0.442       0.089
ma.L1.t        0.5172      0.100      5.173      0.000       0.321       0.713
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
MA.1           -1.9336           +0.0000j            1.9336            0.5000
-----------------------------------------------------------------------------
 This problem is unconstrained.

As you can see, the MA(1) model fits β1=0.5172β1=0.5172, which is quite close to the beta_1 = 0.5. However, the predicted values seem to be quite off as well in this case – similarly to the AR(p) cases.

MA(2) Process

The following time series is an MA(2) process with 128 timesteps and the following parameters:

beta_1 = 0.5
beta_2 = 0.5
np.random.seed(SEED)
ar = np.r_[1] # add zero-lag and negate
ma = np.r_[1, np.array([beta_1, beta_2])] # add zero-lag

sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=num_samples, freq='MS'),
                       't' : sm.tsa.arima_process.arma_generate_sample(ar, ma, num_samples)
                      })

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()

1. Check Stationarity

The sample data is stationary. Therefore, we do not need to difference the time series.

check_stationarity(sample['t'])
ADF Statistic: -3.882572
p-value: 0.002167
Critical Values:
	1%: -3.485
	5%: -2.886
	10%: -2.580
Stationary
2. Check ACF and PACF

We can make the following observations:

  • There are several autocorrelations that are significantly non-zero. Therefore, the time series is non-random.
  • High degree of autocorrelation between adjacent (lag = 1) and near-adjacent (lag = 2) observations
AR(pp)MA(qq)ARMA(pp, qq)
ACFTails off (Geometric decay)Significant at lag qq / Cuts off after lag qqTails off (Geometric decay)
PACFSignificant at each lag pp / Cuts off after lag ppTails off (Geometric decay)Tails off (Geometric decay)

-> We can use an MA(2) model to model this process.

So that for MA(2), we would model the MA(q) formula y^t=ϵt+β1ϵt−1+⋯+βqϵt−qy^t=ϵt+β1ϵt−1+⋯+βqϵt−q to the following:

y^t=ϵt+β1ϵt−1+β2ϵt−2

f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[1], method='ols')

ax[0].annotate('Strong correlation at lag = 1', xy=(1, 0.65),  xycoords='data',
            xytext=(0.15, 0.8), textcoords='axes fraction',
            arrowprops=dict(color='red', shrink=0.05, width=1))

ax[0].annotate('Strong correlation at lag = 2', xy=(2, 0.5),  xycoords='data',
            xytext=(0.25, 0.7), textcoords='axes fraction',
            arrowprops=dict(color='red', shrink=0.05, width=1))

plt.tight_layout()
plt.show()
3. Modelling
train = sample['t'][:train_len]
ma_model = ARMA(train, order=(0,2)).fit()

print(ma_model.summary())
pred = ma_model.predict(start=train_len, end=num_samples, dynamic=False)

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp[train_len:num_samples], y=sample.t[train_len:num_samples], marker='o', label='test', color='grey')
sns.lineplot(x=sample.timestamp[:train_len], y=train, marker='o', label='train')
sns.lineplot(x=sample.timestamp[train_len:num_samples], y=pred, marker='o', label='pred')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            3     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  1.31825D+00    |proj g|=  4.34901D-03

At iterate    5    f=  1.31811D+00    |proj g|=  3.27072D-05

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    3      9     11      1     0     0   2.220D-08   1.318D+00
  F =   1.3181092649688013     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             
                              ARMA Model Results                              
==============================================================================
Dep. Variable:                      t   No. Observations:                  102
Model:                     ARMA(0, 2)   Log Likelihood                -134.447
Method:                       css-mle   S.D. of innovations              0.900
Date:                Mon, 29 Nov 2021   AIC                            276.894
Time:                        13:33:03   BIC                            287.394
Sample:                             0   HQIC                           281.146
                                                                              
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.2292      0.186     -1.230      0.219      -0.594       0.136
ma.L1.t        0.5226      0.084      6.235      0.000       0.358       0.687
ma.L2.t        0.5843      0.110      5.318      0.000       0.369       0.800
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
MA.1           -0.4472           -1.2294j            1.3083           -0.3055
MA.2           -0.4472           +1.2294j            1.3083            0.3055
-----------------------------------------------------------------------------
 This problem is unconstrained.

As you can see, the MA(2) model fits β1=0.5226β1=0.5226 and β2=−0.5843β2=−0.5843, which is quite close to the beta_1 = 0.5 and beta_2 = 0.5 which we have set. However, the predicted values seem to be quite off as well in this case – similarly to the MA(1) case.linkcode

Periodical

The following time series is periodical with T=12. It consists of 48 timesteps.

T = 12

sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
                       't' : [1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2, 1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2, 1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2, 1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2]
                      })

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()
1. Check Stationarity

The sample data is stationary. Therefore, we do not need to difference the time series.

check_stationarity(sample['t'])
ADF Statistic: -31248310026371.390625
p-value: 0.000000
Critical Values:
	1%: -3.621
	5%: -2.944
	10%: -2.610
Stationary
2. Check ACF and PACF

We can make the following observations:

  • There are several autocorrelations that are significantly non-zero. Therefore, the time series is non-random.
  • High degree of autocorrelation between adjacent (lag = 1) and near-adjacent observations
  • From both the ACF and PACF plot, we can see a strong correlation with the adjacent observation (lag = 1) and also at a lag of 12, which is the amount of T.
AR(pp)MA(qq)ARMA(pp, qq)
ACFTails off (Geometric decay)Significant at lag qq / Cuts off after lag qqTails off (Geometric decay)
PACFSignificant at each lag pp / Cuts off after lag p=12p=12Tails off (Geometric decay)Tails off (Geometric decay)

-> We can use an AR(12) model to model this process.


f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[1], method='ols')

for i in range(2):
    ax[i].axvline(x=T, color='r', linestyle='--')
    ax[i].set_xlim([-0.5, lag_acf+0.5])
    ax[i].set_ylim([-1.1, 1.1])
    
plt.tight_layout()
plt.show()
3. Modelling
train = sample['t'][:26]
ar_model = AutoReg(train, lags=12).fit()

print(ar_model.summary())
pred = ar_model.predict(start=26, end=48, dynamic=False)
                            AutoReg Model Results                             
==============================================================================
Dep. Variable:                      t   No. Observations:                   26
Model:                    AutoReg(12)   Log Likelihood                 432.082
Method:               Conditional MLE   S.D. of innovations              0.000
Date:                Mon, 29 Nov 2021   AIC                            -62.564
Time:                        13:33:06   BIC                            -61.925
Sample:                            12   HQIC                           -62.623
                                   26                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept      0.0205   9.48e-18   2.16e+15      0.000       0.021       0.021
t.L1          -0.0004    5.9e-15  -7.17e+10      0.000      -0.000      -0.000
t.L2          -0.0004   6.05e-15     -7e+10      0.000      -0.000      -0.000
t.L3          -0.0004   6.03e-15  -7.01e+10      0.000      -0.000      -0.000
t.L4          -0.0004   5.97e-15  -7.09e+10      0.000      -0.000      -0.000
t.L5          -0.0004   5.56e-15  -7.61e+10      0.000      -0.000      -0.000
t.L6          -0.0004   5.64e-15   -7.5e+10      0.000      -0.000      -0.000
t.L7          -0.0004   5.89e-15  -7.18e+10      0.000      -0.000      -0.000
t.L8          -0.0004   5.67e-15  -7.46e+10      0.000      -0.000      -0.000
t.L9          -0.0004    5.8e-15   -7.3e+10      0.000      -0.000      -0.000
t.L10         -0.0004      6e-15  -7.05e+10      0.000      -0.000      -0.000
t.L11         -0.0004   5.75e-15  -7.35e+10      0.000      -0.000      -0.000
t.L12          0.9996   5.61e-15   1.78e+14      0.000       1.000       1.000
                                    Roots                                     
==============================================================================
                   Real          Imaginary           Modulus         Frequency
------------------------------------------------------------------------------
AR.1            -1.0000           -0.0000j            1.0000           -0.5000
AR.2            -0.8660           -0.5000j            1.0000           -0.4167
AR.3            -0.8660           +0.5000j            1.0000            0.4167
AR.4            -0.5000           -0.8660j            1.0000           -0.3333
AR.5            -0.5000           +0.8660j            1.0000            0.3333
AR.6             0.0000           -1.0000j            1.0000           -0.2500
AR.7             0.0000           +1.0000j            1.0000            0.2500
AR.8             0.5000           -0.8660j            1.0000           -0.1667
AR.9             0.5000           +0.8660j            1.0000            0.1667
AR.10            1.0004           -0.0000j            1.0004           -0.0000
AR.11            0.8660           -0.5000j            1.0000           -0.0833
AR.12            0.8660           +0.5000j            1.0000            0.0833
------------------------------------------------------------------------------
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))
sns.lineplot(x=sample.timestamp[26:48], y=sample.t[26:48], marker='o', label='test', color='grey')
sns.lineplot(x=sample.timestamp[:26], y=train, marker='o', label='train')
sns.lineplot(x=sample.timestamp[26:48], y=pred, marker='o', label='pred')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()

As you can see, the AR(12) model fits the periodical time series perfectly with α1…11=−0.0004α1…11=−0.0004 and α12=0.9996α12=0.9996.

The model y^t=α1yt−1+⋯+αpyt−py^t=α1yt−1+⋯+αpyt−p with p=12p=12 results in: y^t=α1yt−1+α2yt−2+α3yt−3+α4yt−4+α5yt−5+α6yt−6+α7yt−7+α8yt−8+α9yt−9+α10yt−10+α11yt−11+α12yt−12y^t=α1yt−1+α2yt−2+α3yt−3+α4yt−4+α5yt−5+α6yt−6+α7yt−7+α8yt−8+α9yt−9+α10yt−10+α11yt−11+α12yt−12

And with the coefficients α1…11=−0.0004≈0α1…11=−0.0004≈0 and α12=0.9996≈1α12=0.9996≈1, we get:

y^t=yt−12y^t=yt−12linkcode

Trend

The following time series is the same as Periodical (periodical with T=12) with added trend. It consists of 48 timesteps.

time = np.arange(0, 48)
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
                       't' : [1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2, 1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2, 1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2, 1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2] + ((0.05*time)+20)
                      })

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()
1. Check Stationarity

The sample data is non-stationary.

check_stationarity(sample['t'])
ADF Statistic: -0.117684
p-value: 0.947648
Critical Values:
	1%: -3.621
	5%: -2.944
	10%: -2.610
Non-stationary

Therefore, we need to difference the time series.

sample['t_diff'] = sample['t'].diff().fillna(0)

display(sample.head(5).style.set_caption('Sample Time Series'))

check_stationarity(sample['t_diff'])
ADF Statistic: -21.691240
p-value: 0.000000
Critical Values:
	1%: -3.621
	5%: -2.944
	10%: -2.610
Stationary
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(12, 8))

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[0])
ax[0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])

ax[0].set_title('Sample Time Series')

sns.lineplot(x=sample.timestamp, y=sample['t_diff'], marker='o', ax=ax[1])
ax[1].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[1].set_title('Differenced Sample Time Series')
plt.tight_layout()
plt.show()

2. Check ACF and PACF

We can make the following observations:

  • There are several autocorrelations that are significantly non-zero. Therefore, the time series is non-random.
  • High degree of autocorrelation between adjacent (lag = 1) and near-adjacent observations
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t_diff'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t_diff'],lags=lag_pacf, ax=ax[1], method='ols')

for i in range(2):
    ax[i].axvline(x=T, color='r', linestyle='--')
    ax[i].set_xlim([-0.5, lag_acf+0.5])
    ax[i].set_ylim([-1.1, 1.1])
    
plt.tight_layout()
plt.show()
3. Modelling

White Noise

The following time series is random. It consists of 48 timesteps.

sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
                       't' : np.random.randint(1,101,len(time))
                      })

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()
1. Check Stationarity

The sample data is stationary. Therefore, we do not need to difference the time series.

check_stationarity(sample['t'])
ADF Statistic: -7.978782
p-value: 0.000000
Critical Values:
	1%: -3.578
	5%: -2.925
	10%: -2.601
Stationary
2. Check ACF and PACF

We can make the following observations:

  • There is only one autocorrelation that is significantly non-zero at a lag of 0. Therefore, the time series is random.
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[1], method='ols')
    
plt.tight_layout()
plt.show()
3. Modelling

Modelling white noise is difficult because we cannot retrieve any parameters from the ACF and PACF plots.linkcode

Random-Walk

The following time series is random like White Noise. However, the current value depends on the previous one. It consists of 48 timesteps.

# Copied from https://towardsdatascience.com/time-series-from-scratch-white-noise-and-random-walk-5c96270514d3
# Start with a random number - let's say 0
np.random.seed(42)

random_walk = [0]

for i in range(1, 48):
    # Movement direction based on a random number
    num = -1 if np.random.random() < 0.5 else 1
    random_walk.append(random_walk[-1] + num)
    
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
                       't' : random_walk
                      })

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()
1. Check Stationarity

The sample data is non-stationary. Therefore, we need to difference the time series.

check_stationarity(sample['t'])
ADF Statistic: -0.731503
p-value: 0.838420
Critical Values:
	1%: -3.578
	5%: -2.925
	10%: -2.601
Non-stationary
sample['t_diff'] = sample['t'].diff().fillna(0)

check_stationarity(sample['t_diff'])
ADF Statistic: -3.047661
p-value: 0.030677
Critical Values:
	1%: -3.597
	5%: -2.933
	10%: -2.605
Stationary
2. Check ACF and PACF

We can make the following observations:

  • In contrast to the ACF and PACF of the White Noise process, we can see that for a lag of 1, we have some light correlation because the current value depends on the previous one.
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[1], method='ols')

ax[1].annotate('Correlation to adjacent observation', xy=(1, 0.9),  xycoords='data',
            xytext=(0.15, 0.7), textcoords='axes fraction',
            arrowprops=dict(color='red', shrink=0.05, width=1))
plt.tight_layout()
plt.show()
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t_diff'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t_diff'],lags=lag_pacf, ax=ax[1], method='ols')
plt.tight_layout()
plt.show()
3. Modelling

Similarly to white noise, modelling random-walk is difficult because we cannot retrieve any parameters from the ACF and PACF plots.linkcode

Constant

The following time series is constant. It consists of 48 timesteps.

sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
                       't' : 5
                      })

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4))

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o')
ax.set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax.set_title('Sample Time Series')
plt.tight_layout()
plt.show()
1. Check Stationarity

The sample data is non-stationary. Therefore, we need to difference the time series.

check_stationarity(sample['t'])
ADF Statistic: nan
p-value: nan
Critical Values:
	1%: -3.578
	5%: -2.925
	10%: -2.601
Non-stationary
sample['t_diff'] = sample['t'].diff().fillna(0)

check_stationarity(sample['t_diff'])
ADF Statistic: nan
p-value: nan
Critical Values:
	1%: -3.578
	5%: -2.925
	10%: -2.601
Non-stationary
2. Check ACF and PACF
  • ACF/PACF was applied to non-stationary time series
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(width, 2*height))
plot_acf(sample['t'],lags=lag_acf, ax=ax[0])
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[1], method='ols')
    
plt.tight_layout()
plt.show()

3. Modelling

Modelling a constant as an AR or MA process is difficult because we cannot retrieve any parameters from the ACF and PACF plots. But on the other hand, if you can retrieve that a time series is constant, it should not be too difficult to forecast it, right?

Cheat Sheet

  1. Check stationarity. If stationary, continue to step 2. Else use differencing until time series is stationary and then continue to step 2.
  2. Check ACF and PACF with following table | | AR(pp) | MA(qq) | ARMA(pp, qq) | |-|-|-|-| |ACF|Tails off (Geometric decay) | Significant at lag qq / Cuts off after lag qq|Tails off (Geometric decay) | |PACF| Significant at each lag pp / Cuts off after lag pp|Tails off (Geometric decay) |Tails off (Geometric decay) |
  3. Modelling
f, ax = plt.subplots(nrows=10, ncols=3, figsize=(2*width, 10*height))

### AR(1) ###
np.random.seed(SEED)
ar = np.r_[1, -np.array([alpha_1])] # add zero-lag and negate
ma = np.r_[1] # add zero-lag

sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=num_samples, freq='MS'),
                       't' : sm.tsa.arima_process.arma_generate_sample(ar, ma, num_samples)
                      })

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[0,0])
ax[0,0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[0,0].set_title('Time Series for AR(1)')

plot_acf(sample['t'],lags=lag_acf, ax=ax[0, 1], title='ACF for AR(1)')
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[0, 2], method='ols', title='PACF for AR(1)')
ax[0,2].annotate('Strong correlation at lag = 1', xy=(1, 0.6),  xycoords='data',
            xytext=(0.17, 0.75), textcoords='axes fraction',
            arrowprops=dict(color='red', shrink=0.05, width=1))

### AR(2) ###
np.random.seed(SEED)
ar = np.r_[1, -np.array([alpha_1, alpha_2])] # add zero-lag and negate
ma = np.r_[1] # add zero-lag

sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=num_samples, freq='MS'),
                       't' : sm.tsa.arima_process.arma_generate_sample(ar, ma, num_samples)
                      })

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[1,0])
ax[1,0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[1,0].set_title('Time Series for AR(2)')

plot_acf(sample['t'],lags=lag_acf, ax=ax[1, 1], title='ACF for AR(2)')
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[1, 2], method='ols', title='PACF for AR(2)')

ax[1, 2].annotate('Strong correlation at lag = 1', xy=(1, 0.36),  xycoords='data',
            xytext=(0.15, 0.7), textcoords='axes fraction',
            arrowprops=dict(color='red', shrink=0.05, width=1))

ax[1, 2].annotate('Strong correlation at lag = 2', xy=(2.1, -0.5),  xycoords='data',
            xytext=(0.25, 0.1), textcoords='axes fraction',
            arrowprops=dict(color='red', shrink=0.05, width=1))

### MA(1) ###
np.random.seed(SEED)
ar = np.r_[1] # add zero-lag and negate
ma = np.r_[1, np.array([beta_1])] # add zero-lag

sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=num_samples, freq='MS'),
                       't' : sm.tsa.arima_process.arma_generate_sample(ar, ma, num_samples)
                      })    

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[2,0])
ax[2,0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[2,0].set_title('Time Series for MA(1)')

plot_acf(sample['t'],lags=lag_acf, ax=ax[2, 1], title='ACF for MA(1)')
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[2, 2], method='ols', title='PACF for MA(1)')

ax[2,1].annotate('Strong correlation at lag = 1', xy=(1, 0.5),  xycoords='data',
            xytext=(0.15, 0.7), textcoords='axes fraction',
            arrowprops=dict(color='red', shrink=0.05, width=1))

### MA(2) ###
np.random.seed(SEED)
ar = np.r_[1] # add zero-lag and negate
ma = np.r_[1, np.array([beta_1, beta_2])] # add zero-lag

sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=num_samples, freq='MS'),
                       't' : sm.tsa.arima_process.arma_generate_sample(ar, ma, num_samples)
                      })    

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[3,0])
ax[3,0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[3,0].set_title('Time Series for MA(2)')

plot_acf(sample['t'],lags=lag_acf, ax=ax[3, 1], title='ACF for MA(2)')
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[3, 2], method='ols', title='PACF for MA(2)')

ax[3, 1].annotate('Strong correlation at lag = 1', xy=(1, 0.65),  xycoords='data',
            xytext=(0.15, 0.8), textcoords='axes fraction',
            arrowprops=dict(color='red', shrink=0.05, width=1))

ax[3, 1].annotate('Strong correlation at lag = 2', xy=(2, 0.5),  xycoords='data',
            xytext=(0.25, 0.7), textcoords='axes fraction',
            arrowprops=dict(color='red', shrink=0.05, width=1))

### Periodical ###
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
                       't' : [1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2, 1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2, 1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2, 1, 2, 3, 4, 4.5, 5, 7, 8, 6, 4, 2, 2]
                      })

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[4,0])
ax[4,0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[4,0].set_title('Time Series for Periodical')

plot_acf(sample['t'],lags=lag_acf, ax=ax[4, 1], title='ACF for Periodical')
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[4, 2], method='ols', title='PACF for Periodical')

ax[4,2].axvline(x=T, color='r', linestyle='--')

### Trend ###
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
                       't' : ((0.05*time)+20)
                      })

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[5,0])
ax[5,0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[5,0].set_title('Time Series for Trend (NON-STATIONARY!)')

plot_acf(sample['t'],lags=lag_acf, ax=ax[5, 1], title='ACF for Trend (applied to non-stationary)')
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[5, 2], method='ols', title='PACF for Trend (applied to non-stationary)')

### White Noise ###
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
                       't' : np.random.randint(1,101,len(time))
                      })

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[6,0])
ax[6,0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[6,0].set_title('Time Series for White Noise')

plot_acf(sample['t'],lags=lag_acf, ax=ax[6, 1], title='ACF for White Noise')
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[6, 2], method='ols', title='PACF for White Noise')

### Random-Walk ###
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
                       't' : random_walk
                      })

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[7,0])
ax[7,0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[7,0].set_title('Time Series for Random-Walk (NON-STATIONARY!)')

plot_acf(sample['t'],lags=lag_acf, ax=ax[7, 1], title='ACF for Random-Walk (applied to non-stationary)')
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[7, 2], method='ols', title='PACF for Random-Walk (applied to non-stationary)')

sample['t_diff'] = sample['t'].diff().fillna(0)

plot_acf(sample['t_diff'],lags=lag_acf, ax=ax[8, 1], title='ACF for Random-Walk (applied to differenced/stationary)')
plot_pacf(sample['t_diff'],lags=lag_pacf, ax=ax[8, 2], method='ols', title='PACF for Random-Walk (applied to differenced/stationary)')


### Constant ###
sample = pd.DataFrame({'timestamp' : pd.date_range('2021-01-01', periods=48, freq='MS'),
                       't' : 5
                      })

sns.lineplot(x=sample.timestamp, y=sample['t'], marker='o', ax=ax[9,0])
ax[9,0].set_xlim([sample.timestamp.iloc[0], sample.timestamp.iloc[-1]])
ax[9,0].set_title('Time Series for Constant (NON-STATIONARY!)')

plot_acf(sample['t'],lags=lag_acf, ax=ax[9, 1], title='ACF for Constant (applied to non-stationary)')
plot_pacf(sample['t'],lags=lag_pacf, ax=ax[9, 2], method='ols', title='PACF for Constant (applied to non-stationary)')

for i in range(9):
    ax[i, 1].set_ylim([-1.1, 1.1])
    ax[i, 2].set_ylim([-1.1, 1.1])

    
f.delaxes(ax[8, 0])
plt.tight_layout()
plt.show()
Case Study
Next, we will interpret the ACF and PACF plots in the G-Research Crypto Forecasting Competition. To simplify this case study, we will resample the minute-wise data into daily samples and only have a look at the Close price.
data_folder = "../input/g-research-crypto-forecasting/"
train = pd.read_csv(data_folder + 'train.csv', low_memory=False)
train['timestamp'] = train['timestamp'].astype('datetime64[s]')
Bitcoin
train_mini = train[train.Asset_ID == 1].reset_index()

train_mini = train_mini[['timestamp','Close']].resample('D', on='timestamp').last()['Close']
train_mini = train_mini.to_frame().reset_index(drop=False)

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12,5))

sns.lineplot(data=train_mini, x='timestamp', y='Close')
plt.show()
1. Check Stationarity

The sample data is non-stationary. Therefore, we need to difference the time series.

check_stationarity(train_mini['Close'])
train_mini['Close_diff'] = train_mini['Close'].diff().fillna(0)
check_stationarity(train_mini['Close_diff'])
ADF Statistic: -0.422392
p-value: 0.906293
Critical Values:
	1%: -3.435
	5%: -2.864
	10%: -2.568
Non-stationary
ADF Statistic: -7.377094
p-value: 0.000000
Critical Values:
	1%: -3.435
	5%: -2.864
	10%: -2.568
Stationary
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12,5))
sns.lineplot(data=train_mini, x='timestamp', y='Close_diff')
plt.show()
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12,5))
sns.histplot(train_mini['Close_diff'] )
plt.show()
2. Check ACF and PACF
f, ax = plt.subplots(nrows=2, ncols=2, figsize=(2*width, 2*height))

plot_acf(train_mini['Close'], lags=20, ax=ax[0, 0], title='ACF on non-stationary')
plot_pacf(train_mini['Close'], lags=20, ax=ax[0, 1], method='ols', title='PACF on non-stationary')

plot_acf(train_mini['Close_diff'], lags=20, ax=ax[1, 0], title='ACF on differenced/stationary')
plot_pacf(train_mini['Close_diff'], lags=20, ax=ax[1, 1], method='ols', title='PACF on differenced/stationary')

plt.tight_layout()
plt.show()

Ethereum
train_mini = train[train.Asset_ID == 6].reset_index()

train_mini = train_mini[['timestamp','Close']].resample('D', on='timestamp').last()['Close']
train_mini = train_mini.to_frame().reset_index(drop=False)

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12,5))

sns.lineplot(data=train_mini, x='timestamp', y='Close')
plt.show()
1. Check Stationarity

The sample data is non-stationary. Therefore, we need to difference the time series.

check_stationarity(train_mini['Close'])

train_mini['Close_diff'] = train_mini['Close'].diff().fillna(0)

check_stationarity(train_mini['Close_diff'])
ADF Statistic: 0.149049
p-value: 0.969255
Critical Values:
	1%: -3.435
	5%: -2.864
	10%: -2.568
Non-stationary
ADF Statistic: -10.075336
p-value: 0.000000
Critical Values:
	1%: -3.435
	5%: -2.864
	10%: -2.568
Stationary
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12,5))

sns.lineplot(data=train_mini, x='timestamp', y='Close_diff')
plt.show()

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12,5))

sns.histplot(train_mini['Close_diff'] )
plt.show()
2. Check ACF and PACF
f, ax = plt.subplots(nrows=2, ncols=2, figsize=(2*width, 2*height))

plot_acf(train_mini['Close'], lags=20, ax=ax[0, 0], title='ACF on non-stationary')
plot_pacf(train_mini['Close'], lags=20, ax=ax[0, 1], method='ols', title='PACF on non-stationary')

plot_acf(train_mini['Close_diff'], lags=20, ax=ax[1, 0], title='ACF on differenced/stationary')
plot_pacf(train_mini['Close_diff'], lags=20, ax=ax[1, 1], method='ols', title='PACF on differenced/stationary')

plt.tight_layout()
plt.show()
Discussion on Random-Walk

For both Bitcoin and Ethereum, we can observe a random-walk behavior (see 🚀 Cheat Sheet). This is fairly common in stock prices (see Random Walk Theory)

random walk is unpredictable; it cannot reasonably be predicted.

Given the way that the random walk is constructed, we can expect that the best prediction we could make would be to use the observation at the previous time step as what will happen in the next time step.

Simply because we know that the next time step will be a function of the prior time step.

This is often called the naive forecast, or a persistence model. – A Gentle Introduction to the Random Walk for Times Series Forecasting with Python

Important Notice for college students

If you’re a college student and have skills in programming languages, Want to earn through blogging? Mail us at geekycomail@gmail.com

For more Programming related blogs Visit Us Geekycodes . Follow us on Instagram.

By geekycodesco

Leave a Reply

%d bloggers like this: