
How to handle categorical data in machine learning

Understanding Categorical Data and its Importance in Machine Learning

Categorical data is a type of data that can be divided into distinct groups or categories. In machine learning, it is common to encounter categorical data in the form of labels, such as a classification problem where the output is a set of predefined categories. Handling categorical data is an important step in preprocessing your data for machine learning, as the algorithms used in machine learning often require numerical input.

One of the most common ways to handle categorical data is through encoding. Encoding involves converting categorical data into a numerical format that can be understood by the machine learning algorithm. The most popular encoding methods are:

  1. One-hot encoding: This method creates a new binary column for each category and assigns a 1 or 0 to indicate the presence or absence of that category.
  2. Ordinal encoding: This method assigns a numerical value to each category, based on the order of the categories.
  3. Count encoding: This method replaces the categorical value with the number of occurrences of that value in the dataset.
  4. Label encoding: This method assigns an arbitrary integer value to each category, without implying any order among them.

What is One-hot Encoding and How Does it Work?

One-hot encoding is a method of encoding categorical variables as numerical values. It creates a new binary column for each category in the data and assigns a 1 or 0 to indicate the presence or absence of that category in the original data. This method is useful for converting categorical data into a numerical format that can be understood by machine learning algorithms.

For example, consider a dataset with a categorical column “Color” with three categories “Red”, “Green”, “Blue”. Using one-hot encoding, three new binary columns “is_red”, “is_green”, “is_blue” will be created, and each row will be assigned a 1 or 0 depending on which color is present in the original column.

One-hot encoding can be implemented in various programming languages such as Python, R, and SAS. In Python, libraries such as pandas and scikit-learn provide built-in functions for one-hot encoding.

One-hot encoding has clear advantages, such as making no assumptions about any ordering of the categories, but it also has limitations: it creates high-dimensional, sparse data and may not be suitable for variables with a large number of categories.

Below is a simple implementation of one-hot encoding in Python:

# Program for demonstration of one hot encoding

# import libraries
import numpy as np
import pandas as pd

# import the data required
# (employee_data.csv is assumed to contain categorical 'Remarks' and 'Gender' columns)
data = pd.read_csv("employee_data.csv")
print(data.head())

# one-hot encode the two categorical columns
one_hot_encoded_data = pd.get_dummies(data, columns=['Remarks', 'Gender'])
print(one_hot_encoded_data)
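
pandas.get_dummies is convenient for quick exploration, but in a train/test workflow it is often better to fit an encoder on the training data and reuse it on new data. Below is a minimal sketch of the same idea using scikit-learn's OneHotEncoder on the Color example from earlier (assuming a reasonably recent scikit-learn version; the data and column names are illustrative):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample data matching the Color example above
train = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red']})

# Fit the encoder on the training data; handle_unknown='ignore' means categories
# seen only at prediction time are encoded as all zeros instead of raising an error
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(train[['Color']])

print(encoder.get_feature_names_out())  # e.g. ['Color_Blue' 'Color_Green' 'Color_Red']
print(encoded.toarray())                # one binary column per category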

What is Count/Frequency Encoding?

Count encoding, sometimes called frequency encoding, is a method of encoding categorical variables by replacing each categorical value with the count (or relative frequency) of its occurrence in the dataset. This method is useful for converting categorical data into a numerical format that can be understood by machine learning algorithms.

For example, consider a dataset with a categorical column “Color” with three categories “Red”, “Green”, “Blue”. Using count encoding, the “Color” column will be replaced with a numerical column indicating the number of times each color appears in the dataset.

Count encoding can be implemented in Python using the pandas library. The following code shows an example of how to implement count encoding:

import pandas as pd

# Create a sample dataset
data = {'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']}
df = pd.DataFrame(data)

# Implement count encoding
df['Color_Count'] = df.groupby('Color')['Color'].transform('count')

print(df)

The output of the code will be a new column named “Color_Count” with the count of each color in the dataset.

Like one-hot encoding, count encoding has its own advantages and disadvantages. One advantage is that it handles categorical variables with high cardinality (a large number of categories) much better than one-hot encoding, since it adds only a single numerical column. However, two different categories that happen to occur the same number of times become indistinguishable, and the counts can introduce bias if the data is not randomly sampled.
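
When count encoding is used for modelling, the counts are normally computed from the training data only and then applied to the validation or test data, so that information from the evaluation set does not leak into the features. A minimal sketch, assuming an illustrative train/test split and a fill value of 0 for categories unseen during training:

import pandas as pd

# Assumed example: separate training and test frames
train = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})
test = pd.DataFrame({'Color': ['Red', 'Blue', 'Purple']})

# Build the count mapping from the training data only
counts = train['Color'].value_counts()

# Apply the same mapping to both splits; categories unseen in training get 0
train['Color_Count'] = train['Color'].map(counts)
test['Color_Count'] = test['Color'].map(counts).fillna(0)

print(train)
print(test)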

What is Ordinal Encoding?

Ordinal encoding is a method of encoding categorical variables by assigning an integer value to each category according to their ordinal position. This method is useful for converting categorical data into a numerical format that can be understood by machine learning algorithms.

For example, consider a dataset with a categorical column “Size” with three categories “Small”, “Medium”, “Large”. Using ordinal encoding, the “Size” column will be replaced with a numerical column indicating the ordinal position of each size: Small = 1, Medium = 2, Large = 3.

Ordinal encoding can be implemented in Python using the pandas library. The following code shows an example of how to implement ordinal encoding:

import pandas as pd

# Create a sample dataset
data = {'Size': ['Small', 'Medium', 'Large', 'Small', 'Medium']}
df = pd.DataFrame(data)

# Create a dictionary to map the categorical values to integers
size_mapping = {'Small': 1, 'Medium': 2, 'Large': 3}

# Implement ordinal encoding
df['Size_Ordinal'] = df['Size'].map(size_mapping)

print(df)

The output of the code will be a new column named “Size_Ordinal” with the ordinal position of each size in the dataset.

Ordinal encoding is simple and easy to implement, but it is important to note that it assumes an inherent order or ranking in the categories which may not always be the case. It is also sensitive to changes in the order of categories, so if you’re going to use ordinal encoding, you should be sure that the order of the categories is meaningful.
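
If you prefer not to maintain the mapping dictionary by hand, scikit-learn's OrdinalEncoder can be given the category order explicitly. A minimal sketch on the same Size example (note that OrdinalEncoder starts its codes at 0 rather than 1):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Small', 'Medium']})

# Pass the categories in their meaningful order
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_Ordinal'] = encoder.fit_transform(df[['Size']]).ravel()

print(df)  # Small -> 0.0, Medium -> 1.0, Large -> 2.0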

What is Label Encoding?

Label encoding is a method of encoding categorical variables by assigning an integer value to each category without any specific order. It is similar to ordinal encoding, but it doesn’t assume any inherent order or ranking in the categories.

Label encoding can be implemented in Python using the pandas library. The following code shows an example of how to implement label encoding:

import pandas as pd

# Create a sample dataset
data = {'Animals': ['Dog', 'Cat', 'Horse', 'Dog', 'Cat']}
df = pd.DataFrame(data)

# Implement label encoding
df['Animals_Label'] = df['Animals'].astype('category').cat.codes

print(df)
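
Note that cat.codes assigns the integers according to pandas' default category order, which for strings is alphabetical, so here Cat = 0, Dog = 1 and Horse = 2. scikit-learn provides an equivalent utility, LabelEncoder, which is primarily intended for encoding the target variable; a minimal sketch on the same data:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Animals': ['Dog', 'Cat', 'Horse', 'Dog', 'Cat']})

# Fit the encoder and replace each animal with an integer code
le = LabelEncoder()
df['Animals_Label'] = le.fit_transform(df['Animals'])

print(df)
print(le.classes_)  # the category corresponding to each integer code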

What is Binary Encoding?

Binary encoding is a method of encoding categorical variables by converting them into a binary code. It is similar to one-hot encoding, but it uses a smaller number of bits to represent each category. This method is useful for datasets with a large number of categories.

For example, consider a dataset with a categorical column “Colors” with eight categories “Red”, “Orange”, “Yellow”, “Green”, “Blue”, “Indigo”, “Violet”, “Purple”. Using binary encoding, the “Colors” column will be replaced with a binary code column indicating the binary representation of each color.

For example, Red = 0001, Orange = 0010, Yellow = 0011, Green = 0100, Blue = 0101, Indigo = 0110, Violet = 0111, Purple = 1000. Eight categories can therefore be represented with just four binary columns, instead of the eight columns one-hot encoding would require.

Binary encoding can be implemented in Python using the category_encoders library. The following code shows an example of how to implement binary encoding:

import pandas as pd
import category_encoders as ce

# Create a sample dataset
data = {'Colors': ['Red', 'Orange', 'Yellow', 'Green', 'Blue']}
df = pd.DataFrame(data)

# Implement binary encoding
encoder = ce.BinaryEncoder(cols=['Colors'])
df = encoder.fit_transform(df)
print(df)

Binary encoding is useful when dealing with datasets with large numbers of categories, because it produces far fewer columns than one-hot encoding. The trade-off is that the resulting columns are harder to interpret, since each category is spread across several bits. Like one-hot encoding, it doesn’t assume any order or ranking in the categories.

What is Target Mean Encoding?

Target Mean Encoding is a technique for encoding categorical variables by replacing the categorical values with the mean value of the target variable for that category. It is used to capture the relationship between the categorical variable and the target variable.

For example, consider a dataset with a categorical column “Gender” and a target variable “Salary”. Using target mean encoding, the “Gender” column will be replaced with the mean salary of the respective gender.

For example, Male = $50,000 and Female = $40,000

Target mean encoding can be implemented in Python using the pandas library. The following code shows an example of how to implement target mean encoding:

import pandas as pd

# Create a sample dataset
data = {'Gender': ['Male', 'Female', 'Female', 'Male', 'Female'],
        'Salary': [60000, 50000, 40000, 55000, 45000]}
df = pd.DataFrame(data)

# Implement target mean encoding
df['Gender_enc'] = df.groupby('Gender')['Salary'].transform('mean')

print(df)

The output of the code will be a new column named “Gender_enc” with the mean salary of each gender in the dataset.

Target mean encoding can be a powerful technique for handling categorical data, but it has some drawbacks. Because the encoding is computed from the target variable, it can leak target information into the features and lead to overfitting if the means are calculated on the same rows the model is trained on. It should also be used with caution when the dataset has many rare categories, whose means are estimated from only a few rows. Common mitigations are smoothing the category means towards the global mean and computing the encoding out-of-fold.
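
As an illustration of the smoothing idea, the sketch below blends each category's mean salary with the overall mean; the smoothing weight m is an assumed hyperparameter chosen here for illustration, not a fixed rule:

import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male', 'Female'],
                   'Salary': [60000, 50000, 40000, 55000, 45000]})

m = 5  # smoothing weight (assumed; tune for your data)
global_mean = df['Salary'].mean()

# Per-category mean and count
stats = df.groupby('Gender')['Salary'].agg(['mean', 'count'])

# Blend the category mean with the global mean: rare categories are pulled
# towards the global mean, frequent categories keep close to their own mean
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

df['Gender_enc_smooth'] = df['Gender'].map(smoothed)
print(df)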

Another way to handle categorical data is through feature engineering. This involves creating new features from the categorical data, such as by combining multiple categories into a single feature or by creating an interaction feature between two categories.
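
As a simple sketch of an interaction feature, assuming two illustrative columns Color and Size, the two values can be concatenated into a single combined category, which can then be encoded with any of the methods above:

import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Red', 'Blue'],
                   'Size': ['Small', 'Small', 'Large', 'Medium']})

# Combine two categorical columns into one interaction feature
df['Color_Size'] = df['Color'] + '_' + df['Size']

print(df)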

Another approach is to use models such as tree-based methods or neural networks that can handle categorical variables natively. We will discuss this in another post.

It’s important to note that the method of encoding and feature engineering you choose will depend on the specific characteristics of your dataset and the machine learning algorithm you are using. It’s also worth testing different encoding methods to see which one works best for your problem.

Conclusion

In conclusion, handling categorical data is an important step in preprocessing your data for machine learning. Encoding and feature engineering are two common methods for handling categorical data, but the best approach will depend on the specific characteristics of your dataset and the machine learning algorithm you are using. It’s always good to test different techniques and see which one works best for your problem.

