Data Science Interview Questions

What is Skewness in Data?

Skewness

Skewness is a statistical measure that quantifies the asymmetry of the probability distribution of a dataset. In simpler terms, it tells you how much a dataset’s distribution deviates from being symmetrical (a normal distribution, for example, is perfectly symmetric and has zero skewness). Skewness is an essential concept in statistics and data analysis because it provides insight into the shape of your data distribution.

There are three main types of skewness:

  1. Positive Skewness (Right Skewed): In a positively skewed distribution, the tail on the right-hand side (the larger values) is longer or fatter than the left-hand side (the smaller values). This means that most of the data points are concentrated on the left side of the distribution, and there are relatively few very large values on the right side.
  • Example: Income distribution in a population, where most people have low to moderate incomes, but a few individuals have very high incomes.
  2. Negative Skewness (Left Skewed): In a negatively skewed distribution, the tail on the left-hand side (the smaller values) is longer or fatter than the right-hand side (the larger values). This indicates that most of the data points are concentrated on the right side of the distribution, and there are relatively few very small values on the left side.
  • Example: Exam scores of a relatively easy test, where most students score high, but a few students score very low.
  3. Zero Skewness (Symmetrical): A distribution is considered to have zero skewness when it is perfectly symmetrical, with equal tail lengths on both sides. The mean, median, and mode of a symmetrical distribution are all the same.
  • Example: A standard normal distribution (bell-shaped curve) has zero skewness.

You can calculate skewness using statistical functions or libraries in Python, such as scipy.stats.skew() or the .skew() method in pandas. Skewness is useful in data analysis because it can influence the choice of data preprocessing techniques and the selection of appropriate statistical models. It’s often important to address skewness, especially in machine learning, as some algorithms assume normally distributed data or perform better with less skewed data.
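
As a quick, minimal sketch (the sample values below are made up purely for illustration), skewness can be computed with either library:

import pandas as pd
from scipy.stats import skew

# A small right-skewed sample (illustrative values only)
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 9, 20])

print(skew(values))    # scipy.stats.skew: biased (population) skewness by default
print(values.skew())   # pandas Series.skew: bias-corrected sample skewness

The two numbers differ slightly because pandas applies a bias correction by default while scipy does not.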

How does skewness affect the results of a model?

Skewed data can introduce several issues when building and interpreting statistical and machine learning models. Understanding these issues is essential for data analysts and data scientists:

  1. Biased Estimates: Skewed data can lead to biased estimates of statistical parameters. For example, the mean (average) is pulled toward the long tail of the distribution, making it unrepresentative of the central tendency of the data. In contrast, the median (middle value) is less affected by skewness and often provides a better measure of central tendency for skewed distributions (see the short example after this list).
  2. Model Assumptions: Many statistical models assume that the data follows a normal (bell-shaped) distribution. Skewed data violate this assumption, potentially leading to incorrect inferences and predictions. Linear regression, for example, assumes normally distributed residuals, and skewness in the data can result in non-normally distributed residuals.
  3. Inferior Model Performance: Skewed data can negatively impact the performance of machine learning models. Models like linear regression, logistic regression, and k-means clustering may not work optimally with skewed data because they rely on underlying assumptions that are not met. Skewed data can lead to inaccurate predictions and suboptimal model fits.
  4. Loss of Information: Extreme values in skewed data can be underrepresented, leading to a loss of information. These extreme values may contain valuable insights or outliers that could be important for decision-making or anomaly detection.
  5. Inaccurate Significance Testing: In hypothesis testing and statistical significance analysis, skewness can lead to incorrect conclusions. P-values and confidence intervals may be biased, potentially leading to Type I or Type II errors.
  6. Challenges in Visualization: Skewed data can make it challenging to visualize the data effectively. Traditional histograms may not reveal the true underlying patterns, and it might be necessary to transform the data or use alternative visualization techniques.
  7. Difficulty in Interpreting Coefficients: In linear models, the coefficients of features can be challenging to interpret when the data is skewed. This can make it harder to understand the impact of individual features on the target variable.
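
To make the first point concrete, here is a tiny sketch (with made-up numbers) showing how a single value in the long tail pulls the mean while barely moving the median:

import numpy as np

symmetric = np.array([10, 20, 30, 40, 50])
skewed = np.array([10, 20, 30, 40, 500])   # one extreme value in the right tail

print(np.mean(symmetric), np.median(symmetric))  # 30.0 30.0
print(np.mean(skewed), np.median(skewed))        # 120.0 30.0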

To address these issues, it’s often necessary to preprocess skewed data. Common approaches include:

  • Transformation: Applying mathematical transformations like log, square root, cube root, or Box-Cox transformations to make the data more symmetric.
  • Binning or Categorization: Converting continuous data into categorical bins can help reduce the impact of extreme values.
  • Outlier Handling: Identifying and handling outliers can mitigate their influence on skewed data.
  • Choosing Appropriate Models: When dealing with skewed data, consider using models that are less sensitive to distribution assumptions, such as decision trees, random forests, or robust regression methods.

Ultimately, the choice of approach depends on the specific characteristics of your data and the goals of your analysis or modeling task. It’s essential to evaluate the impact of skewness on your results and choose the most appropriate techniques accordingly.

How to Handle Skewed Data?

Handling data with skewness is an important step in data preprocessing, as skewed data can negatively impact the performance of many machine learning algorithms. Skewness refers to the asymmetry in the distribution of data points. There are several methods to handle skewed data, depending on the nature of the skewness and the specific problem you’re working on. Here are some common techniques:

  1. Log Transformation:
  • Use the logarithm function to transform positively skewed data (right-skewed) into a more normal distribution. This is particularly effective for data with exponential or multiplicative skewness.
  • For data with zero or negative values, consider adding a constant to the data before applying the log transformation to avoid issues with undefined values.
import numpy as np
import pandas as pd

# Assuming 'data' is your skewed dataset
log_transformed_data = np.log(data + 1)  # Adding 1 to avoid log(0)

  2. Square Root Transformation:
  • The square root transformation can be used for moderately right-skewed data to reduce the skewness and make the data more symmetric.
sqrt_transformed_data = np.sqrt(data)
  3. Box-Cox Transformation:
  • The Box-Cox transformation is a more general power transformation that can handle both right- and left-skewed data. It searches for the power (lambda) that makes the transformed data as close to a normal distribution as possible. Note that Box-Cox requires strictly positive input values.
from scipy import stats

boxcox_transformed_data, _ = stats.boxcox(data)
  4. Yeo-Johnson Transformation:
  • Similar to the Box-Cox transformation, the Yeo-Johnson transformation handles both positive and negative skewness, and unlike Box-Cox it also accepts zero and negative values. It can be implemented using the yeojohnson function from scipy.stats.
from scipy.stats import yeojohnson

yeojohnson_transformed_data, _ = yeojohnson(data)
  5. Winsorizing:
  • Winsorizing involves capping or truncating extreme values in the data to reduce the impact of outliers on the skewness. You can replace values above or below a certain threshold with the nearest acceptable value.
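  • A minimal sketch using scipy's winsorize (assuming 'data' is a 1-D numeric array; the 5% limits below are an arbitrary choice):
from scipy.stats.mstats import winsorize

# Cap the lowest 5% and highest 5% of values at the corresponding percentiles
winsorized_data = winsorize(data, limits=[0.05, 0.05])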
  6. Data Binning:
  • Binning data into intervals or categories can sometimes help reduce skewness. This is more applicable when dealing with discrete data.
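  • A minimal sketch with pandas (assuming 'data' is a numeric Series; the number of bins is an arbitrary choice):
# Equal-width bins
binned_equal_width = pd.cut(data, bins=5)

# Equal-frequency (quantile) bins, which often behave better on skewed data
binned_equal_freq = pd.qcut(data, q=5)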
  7. Feature Scaling:
  • Standardizing or normalizing the data can help algorithms that are sensitive to feature scale, but keep in mind that a linear rescaling does not change the shape of the distribution, so the skewness itself remains.
  8. Choose Appropriate Models:
  • Sometimes, instead of transforming the data, choosing models that are robust to skewness can be a valid strategy. For example, decision trees and random forests can handle skewed data well.
  9. Data Removal:
  • In extreme cases, you might consider removing extreme outliers or data points that are causing the skewness. However, this should be done carefully and with a good understanding of the data and domain.
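  • If you do go this route, a common (but not universal) rule of thumb is the 1.5 * IQR rule, sketched here assuming 'data' is a pandas Series:
q1, q3 = data.quantile(0.25), data.quantile(0.75)
iqr = q3 - q1
filtered_data = data[(data >= q1 - 1.5 * iqr) & (data <= q3 + 1.5 * iqr)]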

When choosing a transformation method, it’s essential to monitor the impact on your data and evaluate whether the transformed data works better for your specific machine learning problem. You can use visualization techniques and statistical tests to assess the effectiveness of the transformation.

Applying several of the above transformations to one dataset

  1. Square Root Transformation
  2. Cube Root Transformation
  3. Box-Cox Transformation
  4. Yeo-Johnson Transformation

We’ll apply these transformations to an example dataset to demonstrate their effects. Make sure you have the scipy and scikit-learn libraries installed; the example below uses scikit-learn’s PowerTransformer for the Box-Cox and Yeo-Johnson transformations.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.preprocessing import PowerTransformer

# Example dataset with positively skewed feature
data = {'Positively_Skewed_Feature': [2, 3, 5, 8, 12, 18, 30, 50, 80, 130]}
df = pd.DataFrame(data)

# Visualize the original data
plt.figure(figsize=(14, 4))
plt.subplot(151)
plt.hist(df['Positively_Skewed_Feature'], bins=10)
plt.title('Original Data')
plt.xlabel('Values')
plt.ylabel('Frequency')

# Square Root Transformation
df['Square_Root_Transformed'] = np.sqrt(df['Positively_Skewed_Feature'])
plt.subplot(152)
plt.hist(df['Square_Root_Transformed'], bins=10)
plt.title('Square Root Transformation')
plt.xlabel('Values')
plt.ylabel('Frequency')

# Cube Root Transformation
df['Cube_Root_Transformed'] = np.cbrt(df['Positively_Skewed_Feature'])
plt.subplot(153)
plt.hist(df['Cube_Root_Transformed'], bins=10)
plt.title('Cube Root Transformation')
plt.xlabel('Values')
plt.ylabel('Frequency')

# Box-Cox Transformation
boxcox_transformer = PowerTransformer(method='box-cox', standardize=False)
df['Box_Cox_Transformed'] = boxcox_transformer.fit_transform(df[['Positively_Skewed_Feature']])
plt.subplot(154)
plt.hist(df['Box_Cox_Transformed'], bins=10)
plt.title('Box-Cox Transformation')
plt.xlabel('Values')
plt.ylabel('Frequency')

# Yeo-Johnson Transformation
yeojohnson_transformer = PowerTransformer(method='yeo-johnson', standardize=False)
df['Yeo_Johnson_Transformed'] = yeojohnson_transformer.fit_transform(df[['Positively_Skewed_Feature']])
plt.subplot(155)
plt.hist(df['Yeo_Johnson_Transformed'], bins=10)
plt.title('Yeo-Johnson Transformation')
plt.xlabel('Values')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

Conclusion

Each transformation aims to reduce the skewness of the data. The choice of transformation depends on the specific characteristics of your data and the assumptions of your modeling technique. After applying these transformations, you can check the skewness of the transformed data to assess how well each method worked in reducing skewness.
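
For example, a short follow-up sketch (continuing from the DataFrame built in the code above) that reports the skewness of the original and transformed columns, where values closer to 0 indicate a more symmetric distribution:

print(df.skew())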

Check out our latest blog here: How haversine distance is being used in machine learning. Follow us on Instagram.
