Python

Introduction

Working with date data often involves extracting relevant information, such as the financial year. In this blog post, we’ll explore how to extract the financial year from a date column in both Pandas and PySpark DataFrames.

Extracting Financial Year in Pandas DataFrame

Sample Data

Let’s start by creating a sample Pandas DataFrame with a date column:

import pandas as pd

data = {"date": ["2021-04-15", "2022-09-30", "2020-11-25"]}
df_pandas = pd.DataFrame(data)
df_pandas["date"] = pd.to_datetime(df_pandas["date"])

Extracting Financial Year

Now, let’s define a function and apply it to the Pandas DataFrame to extract the financial year:

def calculate_financial_year_pandas(date):
    if date.month > 3:
        return f"{date.year}-{date.year + 1}"
    elif date.month < 4:
        return f"{date.year - 1}-{date.year}"

# Apply the function to create a new column "financial_year"
df_pandas["financial_year"] = df_pandas["date"].apply(calculate_financial_year_pandas)

# Display the Pandas DataFrame
print("Pandas DataFrame:")
print(df_pandas)

Extracting Financial Year in PySpark DataFrame

Sample Data

Let’s create a sample PySpark DataFrame with a date column:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("FinancialYearExtraction").getOrCreate()

data = [("2021-04-15",), ("2022-09-30",), ("2020-11-25",)]
df_pyspark = spark.createDataFrame(data, ["date"])
df_pyspark = df_pyspark.withColumn("date", col("date").cast("date"))

Extracting Financial Year

Now, let’s define a UDF (User-Defined Function) and apply it to the PySpark DataFrame:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def calculate_financial_year_pyspark(date):
    if date.month > 3:
        return f"{date.year}-{date.year + 1}"
    elif date.month < 4:
        return f"{date.year - 1}-{date.year}"

# Convert the Python function to a PySpark UDF
calculate_financial_year_udf = udf(calculate_financial_year_pyspark, StringType())

# Apply the UDF to create a new column "financial_year"
df_pyspark = df_pyspark.withColumn("financial_year", calculate_financial_year_udf("date"))

# Display the PySpark DataFrame
print("PySpark DataFrame:")
df_pyspark.show(truncate=False)

Conclusion

In this blog post, we explored how to extract the financial year from a date column in both Pandas and PySpark DataFrames. The provided solutions use Python functions and UDFs to perform the extraction, demonstrating the flexibility and power of these data manipulation tools. Adjust the code as needed to fit your specific date formats and requirements. Happy coding!

By

Leave a Reply

Discover more from Geeky Codes

Subscribe now to keep reading and get access to the full archive.

Continue reading