
Stationarity Analysis in Time Series Data

Hey Geeks!!! In this blog, we’ll dive into the concept of stationarity in time series data. We’ll first understand what time-series data is, what stationarity is, and why and when data should be stationary.

We’ll use a dataset I created specifically for this blog to analyze whether the data is stationary or not. We’ll also see how to convert non-stationary data to stationary.

Index

  1. Introduction
  2. Import Libraries and Dependencies
  3. Define TimeSeriesData Class
  4. Import Dataset
  5. Accumulating Number of Sales by month
  6. Create object
  7. Stationarity Tests
    1. Graphical Test
    2. Rolling-Statistics Test
    3. Augmented Dickey-Fuller Test (ADF)
    4. Kwiatkowski-Phillips-Schmidt-Shin Test (KPSS)
    5. Zivot-Andrews Test
  8. Conclusion
  9. Convert data to Stationary
    1. Derivatives
    2. Transformation using Logarithmic Function
      1. ADF Test
      2. KPSS Test
      3. Zivot-Andrews Test
      4. Rolling-Statistics Test
  10. Conclusion

1. Introduction

1.1 What is time-series data?
A time-series data is a dataset that tracks the movement of data points over a period of time, recorded at regular intervals.

1.2 What is data stationarity?
Time series data are said to be stationary if they do not have any seasonal effects or any trends.

A stationary data has the property that the mean, variance, and autocorrelation remain almost the same over various time intervals.

1.3 Why is stationary data necessary for forecasting?
When forecasting the future, most time series models assume that the statistical properties of the series (its mean, variance, and autocorrelation) do not change over time.

Therefore, stationary time series data is necessary for forecasting in order to obtain acceptable results.

1.4 What will happen if data is not stationary?
In non-stationary data, summary statistics like the mean and variance change over time, introducing drift into the patterns a model tries to capture.

Therefore, the data cannot be forecasted using traditional time series models if the data is not stationary.

2. Import Libraries and Dependencies

Let’s import the packages required and install the dependencies needed in our code.

!pip install statsmodels --upgrade
!pip install openpyxl==3.0.0

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.tsa.stattools import zivot_andrews

3. Define TimeSeriesData Class

Let’s define a class TimeSeriesData that contains the methods described below.

We’ll go through the function definitions and explanations of the code in this section;
the implementations are demonstrated in the upcoming sections.

1. Constructor

First of all, a constructor is defined that takes a dataset and the target column specific to that dataset as arguments.

def __init__(self, dataset, target_column): #constructor
    #dataset - time series dataset
    #target_column - name of target column in the dataset
    self.dataset = dataset
    self.target_column = target_column
  • self.dataset – Contains the dataset
  • self.target_column – Contains the target column specific to each dataset
2. Display Column Names

A method to display column names of the dataset.

def display_columns(self): #display column names present in the dataset
    print(self.dataset.columns)
3. Display random samples

A method to display random samples using random_state = 42 as default.

def display_samples(self, random_state=42): #display random samples from the dataset
    display(self.dataset.sample(10, random_state=random_state))
4. Drop Columns

Method to drop the columns given as arguments

def drop_columns(self, columns): # drop columns inplace from the dataset
    # columns - list of columns to drop from the dataset
    self.dataset.drop(columns, axis=1, inplace=True)
5. Graphical Analysis

Now, let’s see some mathematical/statistical functions needed to analyze stationarity in the dataset.

def graphical_analysis(self): #analyse the stationarity by histogram
    self.dataset[[self.target_column]].hist()

The above method just plots the histogram of the target variable of the dataset. The interpretation from the graph will be explained in later sections.

6. Distribution Plot

Next, the distribution plot of the target variable with respect to time can be plotted by using the below method.

def distribution_plot(self, column_name): #plot graph
    plt.figure(figsize=(22,8))
    plt.title(self.target_column)
    plt.xlabel('Date')
    plt.ylabel(self.target_column)
    plt.plot(self.dataset[column_name]);
    plt.show()
7. Mean Variance Stationarity Analysis

As we already know, stationary data has the property that the mean and variance remain almost the same over various time intervals.

So let’s define a method that splits the data into two contiguous sequences and computes the mean and variance of each split. The two means are then compared to check whether they are close to each other, and the same check is done for the variances.

From the comparisons, we can conclude that if the corresponding measures for the different time intervals are close to each other (at some chosen significance level), the data is stationary; otherwise it is non-stationary.

def mean_variance_stationary_analysis(self, column_name):
    # Splitting the time series data into two contiguous sequence and calculating mean and variance to compare the means and variances of the two sequence.
    #column_name - name of the column to be analysed

    X = self.dataset[[column_name]].values
    split = round(len(X) / 2)
    X1, X2 = X[0:split], X[split:]

    mean1, mean2 = X1.mean(), X2.mean()
    var1, var2 = X1.var(), X2.var()

    print("Mean :",mean1, mean2)
    print("Variance :", var1, var2)
8. Rolling Statistics Test

Next, we’ll define a method, which I call the Rolling Statistics Test, that plots the rolling mean and standard deviation of the target variable over a window of x time steps (here, x = 12).

From the graph, we can then infer whether the mean and standard deviation stay roughly constant throughout the data.

def rolling_statistics_test(self, column_name):
    # Function to give a visual representation of the data to define its stationarity.
    #column_name - name of the column to be tested

    X = self.dataset[column_name]
    rolling_mean = X.rolling(window=12).mean()
    rolling_std = X.rolling(window=12).std()

    plt.figure(figsize=(20,8))
    original_data = plt.plot(X, color='black', label='Original')  #original data
    roll_mean_plot = plt.plot(rolling_mean , color='red', label='Rolling Mean')  #rolling mean
    roll_std_plot = plt.plot(rolling_std, color='blue', label = 'Rolling Standard Deviation')  #rolling SD
    plt.legend(loc='best')
    plt.title("Rolling mean and Standard Deviation")
    plt.show(block=False)
9. ADF Test

Next comes the Augmented Dickey-Fuller test, which analyzes stationarity using hypothesis testing.

The Augmented Dickey-Fuller test is one of the more widely used statistical tests (a so-called unit root test). It determines how strongly a time series is defined by a trend.

  • The null hypothesis of the test is that the series is not stationary (it has some time-dependent structure).
  • The alternate hypothesis (rejecting the null hypothesis) is that the time series is stationary.

We interpret this result using the p-value from the test.

  • p-value > 0.05: Fail to reject the null hypothesis (H0), the data is non-stationary.
  • p-value <= 0.05: Reject the null hypothesis (H0), the data is stationary.
def augmented_dickey_fuller_test(self, column_name):
    #The Augmented Dickey-Fuller test is a widely used statistical test (a unit root test)
    #that determines how strongly a time series is defined by a trend.
    #column_name - name of the column to be tested

    X = self.dataset[column_name].dropna()
    adf_test_result = adfuller(X)

    print(f'ADF Statistic: {adf_test_result[0]}')
    print(f'p-value: {adf_test_result[1]}')
    print('Critical Values:')

    for key, value in adf_test_result[4].items():
        print(f'   {key}, {value}')

    # strict criterion: compare the statistic against the 1% critical value
    if(adf_test_result[0] < adf_test_result[4]['1%']):
      print("\nThe Data is Stationary")
    else:
      print("\nThe Data is Non-Stationary")
10. KPSS Test

Next comes the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test.

Note: the hypotheses are reversed in the KPSS test compared to the ADF test.

  • Null Hypothesis (H0): the series is trend stationary.
  • Alternate Hypothesis (HA): the series is non-stationary.

So here, if the p-value < 0.05, the series is non-stationary. If we fail to reject the null hypothesis, the test provides evidence that the series is trend stationary.

def kwiatkowski_phillips_schmidt_shin_test(self, column_name):
    # The Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test figures out if a time series is stationary around
    # a mean or linear trend, or is non-stationary due to a unit root.
    #column_name - name of the column to be tested

    X = self.dataset[column_name].dropna()
    print('Results of KPSS Test:')
    kpss_test = kpss(X, regression='c')
    kpss_test_output = pd.Series(kpss_test[0:3], index=['Test Statistic','p-value','#Lags Used'])

    for key, value in kpss_test[3].items():
        kpss_test_output['Critical Value (%s)'%key] = value
    print(kpss_test_output)

    # KPSS null hypothesis is stationarity, so a large p-value supports it
    if(kpss_test[1] > 0.05):
      print("\nThe Data is Stationary\n\n")
    else:
      print("\nThe Data is Non-Stationary\n\n")
11. Zivot-Andrews Test

The final test we will use in our analysis is the Zivot-Andrews test.

In this test, if the p-value is <= 0.05, the data is stationary; otherwise it is non-stationary.

def zivot_andrews_test(self, column_name):
    #column_name - name of the column to be tested

    X = self.dataset[column_name].dropna()
    t_stat, p_value, critical_values, _, _ = zivot_andrews(X)
    print(f'Zivot-Andrews Statistic: {t_stat:.2f}')

    print('Critical Values:')
    for key, value in critical_values.items():
        print(f'   {key}, {value:.2f}')
    print(f'\np-value: {p_value:.6f}')

    if(p_value <= 0.05):
      print("Stationary")
    else:
      print("Non-Stationary")

The methods for stationarity analysis end here. Using the tests above, we can analyze the stationarity of our data.

Now, as we have seen earlier, most time series models used for forecasting assume that the statistical properties of the series do not change over time, which means the data should be stationary.

So the tests above tell us whether our data is stationary. If it is, the data can be used straight away for predictive modeling with time series models (if necessary).

Contrary to that, if the data is not stationary, then it is our task to transform it to stationary.

So, let’s discuss some methods of our class TimeSeriesData that transform non-stationary data to stationary.

We will be following two methodologies :

  • Calculating three orders of differences
  • Transformation using a logarithmic function, then calculating the first difference
12. Calculating Derivatives

The method to calculate the three orders of differences is defined below.

def calculating_derivatives(self): #calculating three lag-k differences
    # note: diff(periods=k) computes x[t] - x[t-k] (a lag-k difference),
    # not a repeated k-th order difference
    self.dataset['diff_1'] = self.dataset[self.target_column].diff(periods=1)
    self.dataset['diff_2'] = self.dataset[self.target_column].diff(periods=2)
    self.dataset['diff_3'] = self.dataset[self.target_column].diff(periods=3)
13. Logarithmic Transformation

The method to perform a logarithmic transformation followed by the first difference is defined below.

def log_transform_derivative_1(self): #transform the column by logarithmic and then calculating the first order derivative
    self.dataset['log_diff_1'] = np.log(self.dataset[self.target_column]).diff().dropna()
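As an aside on why this transformation helps: the log difference is approximately the period-over-period growth rate, so it tends to stabilize a series whose variance grows with its level. A tiny sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

s = pd.Series([100.0, 110.0, 121.0])  # grows by 10% each period
log_diff = np.log(s).diff()
# each step equals log(1.1) ~= 0.0953, i.e. roughly the 10% growth rate
print(log_diff.tolist())
```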
Class TimeSeriesData

So that’s all the methods of the class. The complete TimeSeriesData class is defined below.

class TimeSeriesData:
  def __init__(self, dataset, target_column): #constructor
    #dataset - time series dataset
    #target_column - name of target column in the dataset
    self.dataset = dataset
    self.target_column = target_column

  def display_columns(self): #display column names present in the dataset
    print(self.dataset.columns)

  def display_samples(self, random_state=42): #display random samples from the dataset
    display(self.dataset.sample(10, random_state=random_state))

  def drop_columns(self, columns): # drop columns inplace from the dataset
    # columns - list of columns to drop from the dataset
    self.dataset.drop(columns, axis=1, inplace=True)

  def graphical_analysis(self): #analyse the stationarity by histogram
    self.dataset[[self.target_column]].hist()

  def distribution_plot(self, column_name): #plot graph
    plt.figure(figsize=(22,8))
    plt.title(self.target_column)
    plt.xlabel('Date')
    plt.ylabel(self.target_column)
    plt.plot(self.dataset[column_name]);
    plt.show()

  def mean_variance_stationary_analysis(self, column_name):
    # Splitting the time series data into two contiguous sequence and calculating mean and variance to compare the means and variances of the two sequence.
    #column_name - name of the column to be analysed

    X = self.dataset[[column_name]].values
    split = round(len(X) / 2)
    X1, X2 = X[0:split], X[split:]

    mean1, mean2 = X1.mean(), X2.mean()
    var1, var2 = X1.var(), X2.var()

    print("Mean :",mean1, mean2)
    print("Variance :", var1, var2)

  def rolling_statistics_test(self, column_name):
    # Function to give a visual representation of the data to define its stationarity.
    #column_name - name of the column to be tested

    X = self.dataset[column_name]
    rolling_mean = X.rolling(window=12).mean()
    rolling_std = X.rolling(window=12).std()

    plt.figure(figsize=(20,8))
    original_data = plt.plot(X, color='black', label='Original')  #original data
    roll_mean_plot = plt.plot(rolling_mean , color='red', label='Rolling Mean')  #rolling mean
    roll_std_plot = plt.plot(rolling_std, color='blue', label = 'Rolling Standard Deviation')  #rolling SD
    plt.legend(loc='best')
    plt.title("Rolling mean and Standard Deviation")
    plt.show(block=False)

  def augmented_dickey_fuller_test(self, column_name):
    #The Augmented Dickey-Fuller test is a widely used statistical test (a unit root test)
    #that determines how strongly a time series is defined by a trend.
    #column_name - name of the column to be tested

    X = self.dataset[column_name].dropna()
    adf_test_result = adfuller(X)

    print(f'ADF Statistic: {adf_test_result[0]}')
    print(f'p-value: {adf_test_result[1]}')
    print('Critical Values:')

    for key, value in adf_test_result[4].items():
        print(f'   {key}, {value}')

    # strict criterion: compare the statistic against the 1% critical value
    if(adf_test_result[0] < adf_test_result[4]['1%']):
      print("\nThe Data is Stationary")
    else:
      print("\nThe Data is Non-Stationary")

  def kwiatkowski_phillips_schmidt_shin_test(self, column_name):
    # The Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test figures out if a time series is stationary around
    # a mean or linear trend, or is non-stationary due to a unit root.
    #column_name - name of the column to be tested

    X = self.dataset[column_name].dropna()
    print('Results of KPSS Test:')
    kpss_test = kpss(X, regression='c')
    kpss_test_output = pd.Series(kpss_test[0:3], index=['Test Statistic','p-value','#Lags Used'])

    for key, value in kpss_test[3].items():
        kpss_test_output['Critical Value (%s)'%key] = value
    print(kpss_test_output)

    # KPSS null hypothesis is stationarity, so a large p-value supports it
    if(kpss_test[1] > 0.05):
      print("\nThe Data is Stationary\n\n")
    else:
      print("\nThe Data is Non-Stationary\n\n")
    
  def zivot_andrews_test(self, column_name):
    #column_name - name of the column to be tested

    X = self.dataset[column_name].dropna()
    t_stat, p_value, critical_values, _, _ = zivot_andrews(X)
    print(f'Zivot-Andrews Statistic: {t_stat:.2f}')

    print('Critical Values:')
    for key, value in critical_values.items():
        print(f'   {key}, {value:.2f}')
    print(f'\np-value: {p_value:.6f}')

    if(p_value <= 0.05):
      print("Stationary")
    else:
      print("Non-Stationary")

  def calculating_derivatives(self): #calculating three lag-k differences
    # note: diff(periods=k) computes x[t] - x[t-k] (a lag-k difference),
    # not a repeated k-th order difference
    self.dataset['diff_1'] = self.dataset[self.target_column].diff(periods=1)
    self.dataset['diff_2'] = self.dataset[self.target_column].diff(periods=2)
    self.dataset['diff_3'] = self.dataset[self.target_column].diff(periods=3)

  def log_transform_derivative_1(self): #transform the column by logarithmic and then calculating the first order derivative
    self.dataset['log_diff_1'] = np.log(self.dataset[self.target_column]).diff().dropna()

4. Import Dataset

The dataset we will be using contains the sales of Maruti Suzuki cars for every month from 2018 to 2021 in a particular location.
Note that the dataset is not from an original source; since the aim of this blog is only to perform analysis and transformation, the provenance of the data is not important.

Dataset Link : https://docs.google.com/spreadsheets/d/1LuMrs8IONus2wT_JgvbdhrlJ7SPNveR-/edit?usp=sharing&ouid=101071717239207047296&rtpof=true&sd=true

The dataset contains 3 columns,

  • Cars – Type of car
  • Date – Date in yyyy-mm-dd
  • Number of Sales – Number of car sales in the location. (Location is not specified)

Let’s import our dataset

cars_data = pd.read_excel("/content/drive/MyDrive/Datasets/Car_Sales.xlsx")
cars_data
	Cars	        Date	        Number of Sales 
0	Swift Dzire	2018-01-01	101.0 
1	Swift Dzire	2018-02-01	99.0 
2	Swift Dzire	2018-03-01	101.0 
3	Swift Dzire	2018-04-01	89.0 
4	Swift Dzire	2018-05-01	99.0 
...	...	...	... 
283	Celerio	        2021-08-01	118.0 
284	Celerio	        2021-09-01	124.0 
285	Celerio	        2021-10-01	108.0 
286	Celerio	        2021-11-01	106.0 
287	Celerio	        2021-12-01	67.0 
288 rows × 3 columns
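If you can’t access the sheet, a synthetic stand-in with the same shape can be built instead; all values and most of the model names below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Car_Sales.xlsx: 6 car models x 48 months (2018-2021)
rng = np.random.default_rng(7)
dates = pd.date_range('2018-01-01', '2021-12-01', freq='MS')  # month starts
cars = ['Swift Dzire', 'Celerio', 'Alto', 'Baleno', 'Ertiga', 'WagonR']
rows = [(car, d, float(rng.integers(60, 140)))
        for car in cars for d in dates]
cars_data = pd.DataFrame(rows, columns=['Cars', 'Date', 'Number of Sales'])
print(cars_data.shape)  # (288, 3), the same shape as the original dataset
```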

5. Accumulating Number of Sales by month

Let’s accumulate the number of sales for all cars with respect to date

cars_data = cars_data.groupby('Date').sum()
cars_data.sample(10)
	
Date	        Number of Sales 
2020-04-01	786.0 
2021-11-01	610.0 
2018-11-01	518.0 
2019-10-01	602.0 
2018-12-01	593.0 
2020-03-01	798.0 
2018-06-01	557.0 
2021-02-01	663.0 
2020-05-01	814.0 
2019-04-01	630.0

6. Create Object

cars_data = TimeSeriesData(cars_data, target_column='Number of Sales')
cars_data.display_columns()
Index(['Number of Sales'], dtype='object')
cars_data.display_samples()
Date	        Number of Sales
2020-04-01	786.0 
2021-05-01	640.0 
2020-03-01	798.0 
2021-08-01	589.0 
2020-01-01	791.0 
2021-02-01	663.0 
2019-01-01	566.0 
2019-08-01	647.0 
2018-05-01	676.0 
2020-02-01	789.0

7. Stationarity Tests

Below are the techniques we’ll follow in this blog to analyze the stationarity of our dataset.

  • Graphical
  • Rolling-Statistics Test
  • ADF test
  • KPSS test
  • Zivot-Andrews Test

1. Graphical

The plot below depicts the number of sales from January 2018 to December 2021.

From the plot, it can be seen that there is a trend in the data from January 2020 to December 2020, which suggests the data is non-stationary.

But still, let us perform some tests to establish mathematically/statistically whether the data is stationary or non-stationary.

cars_data.distribution_plot('Number of Sales')
cars_data.graphical_analysis()

Let’s split the time series data into two contiguous sequences and calculate the mean and variance for each split. Then we’ll compare the corresponding means and variances of the two splits.

If the means are close to each other and the variances are close to each other, then the data can be said to be stationary.

cars_data.mean_variance_stationary_analysis('Number of Sales')
Mean : 608.625 701.4583333333334
Variance : 2126.0677083333335 12394.164930555555

From the output displayed above, the mean varies significantly (and so does the variance). Therefore, once again there is evidence of a trend in the data, i.e. the data we have taken is non-stationary.

2. Rolling-Statistics Test

Now, let’s do some mathematical calculations and interpret the result visually, building on section 7.1 above, where we split the data into two halves and compared their means and variances.

To generalize across the entire series, we’ll use rolling_statistics_test(), which plots the rolling mean and standard deviation throughout the data so that we can visually check for any trend.

cars_data.rolling_statistics_test('Number of Sales')

The rolling mean and rolling standard deviation are not constant across time intervals, which shows that the dataset might be non-stationary.

3. Augmented Dickey-Fuller (ADF) Test

Enough tests by visualization; now let’s perform some statistical tests on our data to analyze stationarity, starting with the ADF test.

cars_data.augmented_dickey_fuller_test('Number of Sales')
ADF Statistic: -2.160524410739529
p-value: 0.2208833697150417
Critical Values:
   1%, -3.596635636000432
   5%, -2.933297331821618
   10%, -2.6049909750566895

The Data is Non-Stationary

We can see that our statistic value of -2.16 is greater than the critical value of -3.60 at the 1% level (and the p-value of 0.22 is well above 0.05). This suggests that we cannot reject the null hypothesis.

Failing to reject the null hypothesis means that the time series is non-stationary, i.e. it has a time-dependent structure.

4. Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test

cars_data.kwiatkowski_phillips_schmidt_shin_test('Number of Sales')
Results of KPSS Test:
Test Statistic           0.210379
p-value                  0.100000
#Lags Used               4.000000
Critical Value (10%)     0.347000
Critical Value (5%)      0.463000
Critical Value (2.5%)    0.574000
Critical Value (1%)      0.739000
dtype: float64

The Data is Stationary

The p-value of 0.1 is > 0.05, so we fail to reject the null hypothesis. Therefore, the data is trend stationary according to the KPSS test.

5. Zivot-Andrews Test

cars_data.zivot_andrews_test('Number of Sales')
Zivot-Andrews Statistic: -4.21
Critical Values:
   1%, -5.28
   5%, -4.81
   10%, -4.57

p-value: 0.235844
Non-Stationary

Since the p-value, 0.235844, is > 0.05, the data is non-stationary.

8. Conclusion

  1. The mean and variance of the split data indicate the data is non-stationary.
  2. The Rolling-Statistics test indicates the data is non-stationary.
  3. The ADF test indicates the data is non-stationary.
  4. The KPSS test indicates the data is stationary.
  5. The Zivot-Andrews test indicates the data is non-stationary.

Therefore, the majority of the tests indicate that the data is non-stationary.

9. Convert Data to Stationary

Now, we have seen that non-stationary time series data is not suitable for predictive modeling. Hence it is a mandatory step to convert the data to stationary before performing any predictive modeling.

Note: It is not necessary to follow this section if the data is already stationary. But since the scope of this blog also covers the conversion, the data was generated to be non-stationary.

The Conversion techniques we’ll be following are,

  1. Derivatives
  2. Logarithmic transformation

1. Derivatives

Let us perform three orders of differences on the ‘Number of Sales’ column.

cars_data.calculating_derivatives()
cars_data.display_samples()
Date		Number of Sales	diff_1	diff_2	diff_3
2020-04-01	786.0   	-12.0	-3.0	-5.0
2021-05-01	640.0	         43.0	62.0	-23.0
2020-03-01	798.0          	 9.0	7.0	145.0
2021-08-01	589.0   	53.0	90.0	-51.0
2020-01-01	791.0   	138.0	272.0	189.0
2021-02-01	663.0   	54.0	-122.0	-185.0
2019-01-01	566.0   	-27.0	48.0	-61.0
2019-08-01	647.0   	1.0	18.0	-26.0
2018-05-01	676.0   	116.0	64.0	57.0
2020-02-01	789.0   	-2.0	136.0	270.0

Let’s perform an ADF test on each of the three differenced series to check for stationarity.

1. First Derivative

cars_data.augmented_dickey_fuller_test('diff_1')
cars_data.distribution_plot('diff_1')
ADF Statistic: -2.607621497021281
p-value: 0.09143226802145005
Critical Values:
   1%, -3.596635636000432
   5%, -2.933297331821618
   10%, -2.6049909750566895

The Data is Non-Stationary

We can see that our statistic value of -2.607 is greater than the critical value of -3.60 at the 1% level (and the p-value of 0.091 is above 0.05). This suggests that we cannot reject the null hypothesis.

Failing to reject the null hypothesis means that the time series is non-stationary, i.e. it has a time-dependent structure.

2. Second Derivative

cars_data.augmented_dickey_fuller_test('diff_2')
cars_data.distribution_plot('diff_2')
ADF Statistic: -2.737569268981533
p-value: 0.0677596974413701
Critical Values:
   1%, -3.6055648906249997
   5%, -2.937069375
   10%, -2.606985625

The Data is Non-Stationary

We can see that our statistic value of -2.74 is greater than the critical value of -3.61 at the 1% level (and the p-value of 0.068 is above 0.05). This suggests that we cannot reject the null hypothesis.

Failing to reject the null hypothesis means that the time series is non-stationary, i.e. it has a time-dependent structure.

3. Third Derivative

cars_data.augmented_dickey_fuller_test('diff_3')
cars_data.distribution_plot('diff_3')
ADF Statistic: -2.9231663307597175
p-value: 0.04271963058833072
Critical Values:
   1%, -3.610399601308181
   5%, -2.939108945868946
   10%, -2.6080629651545038

The Data is Non-Stationary

We can see that our statistic value of -2.92 is greater than the critical value of -3.61 at the 1% level, so under our strict 1% criterion we cannot reject the null hypothesis and the series is treated as non-stationary.

(Note that the p-value of 0.0427 is below 0.05, so at the more common 5% significance level the null hypothesis would be rejected.)

2. Transformation using Logarithmic Function

Now, we will transform the data with a logarithmic function and then take the first-order difference to check for stationarity.

cars_data.log_transform_derivative_1()
cars_data.display_samples()
Date		Number of Sales	diff_1	diff_2	diff_3	log_diff_1
2020-04-01	786.0   	-12.0	-3.0	-5.0	-0.015152
2021-05-01	640.0   	43.0	62.0	-23.0	0.069551
2020-03-01	798.0   	9.0	7.0	145.0	0.011342
2021-08-01	589.0   	53.0	90.0	-51.0	0.094292
2020-01-01	791.0   	138.0	272.0	189.0	0.191721
2021-02-01	663.0   	54.0	-122.0	-185.0	0.084957
2019-01-01	566.0   	-27.0	48.0	-61.0	-0.046600
2019-08-01	647.0   	1.0	18.0	-26.0	0.001547
2018-05-01	676.0   	116.0	64.0	57.0	0.188256
2020-02-01	789.0   	-2.0	136.0	270.0	-0.002532

1. ADF Test

cars_data.augmented_dickey_fuller_test('log_diff_1')
cars_data.distribution_plot('log_diff_1')
ADF Statistic: -9.510294279497844
p-value: 3.263135273864606e-16
Critical Values:
   1%, -3.5812576580093696
   5%, -2.9267849124681518
   10%, -2.6015409829867675

The Data is Stationary

We can see that our statistic value of -9.51 is less than the critical value of -3.58 at the 1% level. This suggests that we can reject the null hypothesis at a significance level of less than 1%.

Rejecting the null hypothesis means that the time series is stationary, i.e. it does not have a time-dependent structure.

2. KPSS Test

cars_data.kwiatkowski_phillips_schmidt_shin_test('log_diff_1')
Results of KPSS Test:
Test Statistic           0.077049
p-value                  0.100000
#Lags Used               3.000000
Critical Value (10%)     0.347000
Critical Value (5%)      0.463000
Critical Value (2.5%)    0.574000
Critical Value (1%)      0.739000
dtype: float64

The Data is Stationary

The p-value of 0.1 is > 0.05, so we fail to reject the null hypothesis. Therefore, the data is stationary according to the KPSS test.

3. Zivot-Andrews Test

cars_data.zivot_andrews_test('log_diff_1')
Zivot-Andrews Statistic: -9.82
Critical Values:
   1%, -5.28
   5%, -4.81
   10%, -4.57

p-value: 0.000010
Stationary

Since the p-value is <= 0.05, the data is stationary.

4. Rolling-Statistics Test

cars_data.rolling_statistics_test('log_diff_1')

The rolling mean and rolling standard deviation are almost constant across time intervals; therefore the dataset is likely stationary.

10. Conclusion

  • The Rolling-Statistics test suggests the data is stationary.
  • The ADF test indicates the data is stationary.
  • The KPSS test indicates the data is stationary.
  • The Zivot-Andrews test indicates the data is stationary.

Therefore, all of the tests indicate that the transformed data is stationary.

Hence, the non-stationary data has been transformed into stationary data.


About the Author :

Hi, I’m Avinash, pursuing a Bachelor of Engineering in Computer Science and Engineering at Mepco Schlenk Engineering College, Sivakasi.
I’m currently working as a Data Science Intern at @DeepSphere.AI.
I’m an AI enthusiast and Open-Source contributor.


Feel free to correct me !! 🙂
Thank you folks for reading. Happy Learning !!! 😊
