Introduction
Working with date data often involves extracting relevant information, such as the financial year. In this blog post, we’ll explore how to extract the financial year from a date column in both Pandas and PySpark DataFrames.
Extracting Financial Year in Pandas DataFrame
Sample Data
Let’s start by creating a sample Pandas DataFrame with a date column:
import pandas as pd
data = {"date": ["2021-04-15", "2022-09-30", "2020-11-25"]}
df_pandas = pd.DataFrame(data)
df_pandas["date"] = pd.to_datetime(df_pandas["date"])
Extracting Financial Year
Now, let’s define a function and apply it to the Pandas DataFrame to extract the financial year:
def calculate_financial_year_pandas(date):
if date.month > 3:
return f"{date.year}-{date.year + 1}"
elif date.month < 4:
return f"{date.year - 1}-{date.year}"
# Apply the function to create a new column "financial_year"
df_pandas["financial_year"] = df_pandas["date"].apply(calculate_financial_year_pandas)
# Display the Pandas DataFrame
print("Pandas DataFrame:")
print(df_pandas)
Extracting Financial Year in PySpark DataFrame
Sample Data
Let’s create a sample PySpark DataFrame with a date column:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("FinancialYearExtraction").getOrCreate()
data = [("2021-04-15",), ("2022-09-30",), ("2020-11-25",)]
df_pyspark = spark.createDataFrame(data, ["date"])
df_pyspark = df_pyspark.withColumn("date", col("date").cast("date"))
Extracting Financial Year
Now, let’s define a UDF (User-Defined Function) and apply it to the PySpark DataFrame:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def calculate_financial_year_pyspark(date):
if date.month > 3:
return f"{date.year}-{date.year + 1}"
elif date.month < 4:
return f"{date.year - 1}-{date.year}"
# Convert the Python function to a PySpark UDF
calculate_financial_year_udf = udf(calculate_financial_year_pyspark, StringType())
# Apply the UDF to create a new column "financial_year"
df_pyspark = df_pyspark.withColumn("financial_year", calculate_financial_year_udf("date"))
# Display the PySpark DataFrame
print("PySpark DataFrame:")
df_pyspark.show(truncate=False)
Conclusion
In this blog post, we explored how to extract the financial year from a date column in both Pandas and PySpark DataFrames. The provided solutions use Python functions and UDFs to perform the extraction, demonstrating the flexibility and power of these data manipulation tools. Adjust the code as needed to fit your specific date formats and requirements. Happy coding!