Introduction
Multicollinearity might be a mouthful to pronounce, but it’s a topic you should be aware of in the machine learning field. I am familiar with it because of my statistics background, but I’ve seen many professionals who are unaware that multicollinearity exists.
This is especially common among machine learning folks who come from a non-mathematical background. And while multicollinearity might not be the most crucial topic to grasp on your journey, it’s still important enough to learn, especially if you’re sitting for data scientist interviews!
So in this article, we will understand what multicollinearity is, why it’s a problem, what causes it, and how to detect and fix it.
Multicollinearity
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated with one another.
This means that one independent variable can be predicted from another independent variable in the model. For example: height and weight, household income and water consumption, mileage and price of a car, study time and leisure time, etc.
Let me take an example from everyday life to make this simple. John loves to play music while driving: the more music he plays, the more he drives, and the more excited he gets. Now suppose we could quantify and measure John’s excitement while he’s busy doing his favorite activities. Which do you think would have a greater impact on his excitement: driving or playing music?
That’s difficult to determine, because the moment we try to measure John’s excitement from playing music, he starts driving. And the moment we try to measure his excitement from driving, he starts playing music.
In John’s case, playing music and driving are highly correlated, and we cannot separately determine the impact of each activity on his excitement. This is the multicollinearity problem!
Implementation
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import statsmodels.api as sm # statistical models, including OLS regression
What kind of problem do we have to solve?
Let’s say you have two independent features, Age and Experience, and you have to predict the salary based on those two values.
Linear Regression Best-fit line formula => Salary = B0 + B1 (Age) + B2 (Experience)
here, B0 = intercept, B1 and B2 are coefficients / slopes
Now, there is a possibility that the age and experience variables themselves have a high correlation value (>90%), i.e. age and experience are internally correlated with each other. In that case the features ‘age’ and ‘experience’ carry almost the same information, which means we are feeding the same information twice into the model for the output feature ‘salary’ that we want to compute.
This is the problem that we have to resolve.
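To see why this is a problem, here is a minimal sketch (with synthetic, made-up numbers, using the numpy and statsmodels imports from above) that simulates an age feature tracking experience almost perfectly and then fits an OLS model on both:
# Synthetic data: age is essentially experience + 22 plus a little noise,
# so the two features are almost perfectly correlated
rng = np.random.default_rng(42)
n = 100
experience = rng.uniform(0, 20, n)
age = experience + 22 + rng.normal(0, 0.5, n)
salary = 30000 + 5000 * experience + rng.normal(0, 5000, n) # salary truly depends on experience only
X_demo = sm.add_constant(np.column_stack([age, experience]))
fit = sm.OLS(salary, X_demo).fit()
print(fit.params) # the true 5000 effect gets split unpredictably between the two columns
print(fit.bse)    # and the standard errors of both coefficients are inflated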
Example Dataset 1 => Advertising Dataset
df = pd.read_csv("/kaggle/input/tvradionewspaperadvertising/Advertising.csv")
df.head()

- TV => expenditure on TV advertisements
- Radio => expenditure on radio advertisements
- Newspaper => expenditure on newspaper advertisements
- Sales => the final sales achieved with the help of these expenditures
Looking at our current dataset, TV, Radio and Newspaper are the independent features, and Sales is the output feature which we have to predict.
Splitting the data into independent and dependent features
X = df[['TV', 'Radio', 'Newspaper']] # independent variables => predict sales value based on these features
y = df[['Sales']] # dependent variable
X.head()

y.head()

In this case, we will use the multiple linear regression technique ‘Ordinary Least Squares’ (OLS).
- Equation of the linear regression best-fit line for this dataset: Sales = B0 · 1 + B1 (TV) + B2 (Radio) + B3 (Newspaper)
- Whenever computing Ordinary Least Squares (OLS), we also need to compute B0 (i.e. the intercept).
- But we don’t have a column for B0 here, so we will add one in which every value equals 1.
- To add a constant column for B0 with all values = 1, we will use the statsmodels library.
X = sm.add_constant(X) # prepends a 'const' column of 1s for the intercept B0
X.head()

Fit an Ordinary Least Squares model with intercept on TV, Radio and Newspaper. We will again use the statsmodels library for this, as it has the OLS function for creating the model. The OLS method takes endog (the output feature) and exog (the input features) as parameters.
model = sm.OLS(y, X).fit()
model.summary()
OLS Regression Results
Dep. Variable:     Sales             R-squared:           0.903
Model:             OLS               Adj. R-squared:      0.901
Method:            Least Squares     F-statistic:         605.4
Date:              Mon, 15 Mar 2021  Prob (F-statistic):  8.13e-99
Time:              13:09:37          Log-Likelihood:      -383.34
No. Observations:  200               AIC:                 774.7
Df Residuals:      196               BIC:                 787.9
Df Model:          3                 Covariance Type:     nonrobust

              coef     std err   t         P>|t|    [0.025    0.975]
const         4.6251   0.308     15.041    0.000    4.019     5.232
TV            0.0544   0.001     39.592    0.000    0.052     0.057
Radio         0.1070   0.008     12.604    0.000    0.090     0.124
Newspaper     0.0003   0.006     0.058     0.954    -0.011    0.012

Omnibus:        16.081   Durbin-Watson:     2.251
Prob(Omnibus):  0.000    Jarque-Bera (JB):  27.655
Skew:           -0.431   Prob(JB):          9.88e-07
Kurtosis:       4.605    Cond. No.:         454.

Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
According to the summary,
- B0 = coeff of const = 4.6251
- B1 = coeff of TV = 0.0544
- [this coeff value means that if we change the TV expenditure (i.e. the input feature) by 1 unit, the change in Sales (i.e. the output) will be 0.0544, holding the other features fixed]
- B2 = coeff of Radio = 0.1070
- B3 = coeff of Newspaper = 0.0003
- B3 coeff => much smaller than the others (≈ 0.0003) => this shows that we are making an unnecessary expenditure on Newspaper. We can reduce that unnecessary expenditure, and while creating the model we can simply drop this feature.
- R-squared value = 0.903 => very close to 1 => the model has fitted very well
- P value of const = 0
- P value of TV = 0
- P value of Radio = 0
- P value of Newspaper = 0.954
=> Except for the feature ‘Newspaper’ (P-value = 0.954), all the P-values are less than 0.05
- std error of const = 0.308
- std error of TV = 0.001
- std error of Radio = 0.008
- std error of Newspaper = 0.006
std error => tends to be high when there is multicollinearity among the independent variables. But here the std errors are all small numbers, indicating there is no multicollinearity among the independent variables.
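Rather than reading these numbers off the printed summary, they can also be pulled from the fitted model programmatically (a small sketch using the model object created above):
print(model.params)  # B0..B3 coefficient estimates
print(model.bse)     # standard error of each coefficient
print(model.pvalues) # P-value of each coefficient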
Plot the correlation between the independent features
import matplotlib.pyplot as plt
X.iloc[:, 1:].corr() # correlation matrix, skipping the 'const' column

Through this table, we can see the correlation values among the various independent features:
- Between TV and Radio => 0.054809
- Between Radio and Newspaper => 0.354104
- Between TV and Newspaper => 0.056648
None of the correlation values are >0.5, indicating that there is not much correlation between the independent features and thus no multicollinearity issue among them.
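Since matplotlib is already imported, the same matrix can also be drawn as a heatmap (a quick sketch, assuming the X defined above):
corr = X.iloc[:, 1:].corr() # again skipping the 'const' column
fig, ax = plt.subplots()
im = ax.matshow(corr, vmin=-1, vmax=1, cmap="coolwarm") # cells near +/-1 indicate strong correlation
ax.set_xticks(range(len(corr.columns)))
ax.set_yticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
plt.show()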
Example Dataset 2 => Salary Dataset with age and YOE
df_salary = pd.read_csv("/kaggle/input/salary-data-with-age-and-experience/Salary_Data.csv")
df_salary.head()

In this case the independent features are ‘YearsExperience’ and ‘Age’, and we have to predict the dependent variable ‘Salary’ based on these two independent features.
X = df_salary[['YearsExperience', 'Age']]
y = df_salary[['Salary']]
X.head()

y.head()

Fitting the OLS (Ordinary Least Squares) model, similar to the previous dataset
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
model.summary()
OLS Regression Results
Dep. Variable:     Salary            R-squared:           0.960
Model:             OLS               Adj. R-squared:      0.957
Method:            Least Squares     F-statistic:         323.9
Date:              Mon, 15 Mar 2021  Prob (F-statistic):  1.35e-19
Time:              13:18:56          Log-Likelihood:      -300.35
No. Observations:  30                AIC:                 606.7
Df Residuals:      27                BIC:                 610.9
Df Model:          2                 Covariance Type:     nonrobust

                  coef         std err    t        P>|t|    [0.025     0.975]
const             -6661.9872   2.28e+04   -0.292   0.773    -5.35e+04  4.02e+04
YearsExperience   6153.3533    2337.092   2.633    0.014    1358.037   1.09e+04
Age               1836.0136    1285.034   1.429    0.165    -800.659   4472.686

Omnibus:        2.695    Durbin-Watson:     1.711
Prob(Omnibus):  0.260    Jarque-Bera (JB):  1.975
Skew:           0.456    Prob(JB):          0.372
Kurtosis:       2.135    Cond. No.:         626.

Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In this scenario, according to the summary,
- B0 = coeff of const = -6661.9872
- B1 = coeff of YearsExperience = 6153.3533
- [this coeff value means that if we change YearsExperience (i.e. the input feature) by 1 unit, the change in Salary (i.e. the output) will be 6153.3533]
- B2 = coeff of Age = 1836.0136
- [thus if we change Age (i.e. the input feature) by 1 unit (1 year), the change in Salary (i.e. the output) will be 1836.0136]
- R-squared value = 0.960 => very close to 1 => the model has fitted very well
- P value of const = 0.773
- P value of YearsExperience = 0.014
- P value of Age = 0.165
=> for Age => the P-value is >0.05 => Age and YearsExperience may have some kind of correlation
- std error of const = 2.28e+04
- std error of YearsExperience = 2337.092
- std error of Age = 1285.034
Here we can see that the std errors of both YearsExperience and Age are very high relative to their coefficients, indicating strong multicollinearity between them.
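As with the first dataset, this can be checked programmatically; the ratio of standard error to coefficient makes the contrast with the advertising model obvious (a sketch, using the fitted model above):
print(model.bse / model.params.abs()) # large ratios signal unstable, unreliable estimates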
Confirming the multicollinearity between Age and YearsOfExperience by plotting the correlation table
X.iloc[:, 1:].corr()

With the help of this correlation table / matrix we can see that Age and YearsExperience have a correlation of about 0.98 (very highly correlated). This implies that keeping just one of these features will be more than enough to predict the salary.
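Another standard diagnostic for this (not used in the original notebook, but available in the same statsmodels library) is the variance inflation factor (VIF); values above ~10 are a common rule of thumb for problematic multicollinearity:
from statsmodels.stats.outliers_influence import variance_inflation_factor
for i, name in enumerate(X.columns):
    if name != 'const': # the constant column is not a real feature
        print(name, variance_inflation_factor(X.values, i))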
Now the question is: which of the input features (YearsExperience and Age) to keep, and which one to drop, for the final prediction of salary?
Remedy for this Multicollinearity problem:
- Solution 1 : Don’t do anything; keep things as they are, ignore the multicollinearity, and use all the input features to create the model.
- Solution 2 : Check the P-values for Age and YearsExperience. The P-value of Age > the P-value of YearsExperience, so drop the ‘Age’ feature. This will not have much effect on the model, since the correlation between the two features is about 0.98; the whole model can be trained just by considering the feature ‘YearsExperience’ (see the sketch after this list).
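A minimal sketch of Solution 2, assuming the df_salary and y defined above: drop Age and refit with YearsExperience alone.
X_reduced = sm.add_constant(df_salary[['YearsExperience']]) # one feature + intercept
model_reduced = sm.OLS(y, X_reduced).fit()
model_reduced.summary() # R-squared stays high even without Age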
Thank you for reading. Please let me know if you have any feedback, and download this Python notebook here.
This Post was contributed by a user on Kaggle.com. Follow him here