Least Absolute Shrinkage and Selection Operator Regression (usually simply called Lasso Regression) is another regularized version of Linear Regression: just like Ridge Regression, it adds a regularization term to the cost function, but it uses the ℓ1 norm of the weight vector instead of half the square of the ℓ2 norm.

Equation 4-10. Lasso Regression cost function

$$J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{i=1}^{n} \lvert\theta_i\rvert$$

The figure below shows the same comparison as the earlier Ridge Regression figure, but it replaces the Ridge models with Lasso models and uses smaller α values.

Figure: A comparison of Lasso Regression models with different α values. The left plot shows plain Lasso models fit to linearly trending data; the right plot shows Lasso models with high-degree polynomial features, where the fitted curve changes as α varies.

An important characteristic of Lasso Regression is that it tends to completely eliminate the weights of the least important features (i.e., set them to zero). For example, the dashed line in the right plot of the figure above (with α = 10⁻⁷) looks quadratic, almost linear: all the weights for the high-degree polynomial features are equal to zero. In other words, Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., with few nonzero feature weights).
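To see this sparsity in action, here is a minimal sketch (the variable names, data, and hyperparameters are illustrative assumptions, not those used for the figure): it fits a Lasso model on degree-10 polynomial features of noisy quadratic data and prints the learned weights, most of which come out exactly zero.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(42)
X_demo = 6 * np.random.rand(100, 1) - 3                  # illustrative quadratic data
y_demo = 0.5 * X_demo**2 + X_demo + 2 + np.random.randn(100, 1)

# Degree-10 polynomial features, regularized with the ℓ1 penalty
model = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                      Lasso(alpha=0.1, max_iter=100_000))
model.fit(X_demo, y_demo.ravel())
print(model.named_steps["lasso"].coef_)                  # most high-degree weights are exactly 0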

You can get a sense of why this is the case by looking at Figure 4-19: on the top-left plot, the background contours (ellipses) represent an unregularized MSE cost function (α = 0), and the white circles show the Batch Gradient Descent path with that cost function. The foreground contours (diamonds) represent the ℓ1 penalty, and the triangles show the BGD path for this penalty only (α → ∞). Notice how the path first reaches θ1 = 0, then rolls down a gutter until it reaches θ2 = 0. On the top-right plot, the contours represent the same cost function plus an ℓ1 penalty with α = 0.5. The global minimum is on the θ2 = 0 axis. BGD first reaches θ2 = 0, then rolls down the gutter until it reaches the global minimum. The two bottom plots show the same thing but use an ℓ2 penalty instead. The regularized minimum is closer to θ = 0 than the unregularized minimum, but the weights do not get fully eliminated.

Figure 4-19: Lasso versus Ridge regularization, comparing Gradient Descent paths under ℓ1 and ℓ2 penalties in a two-dimensional parameter space.
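The practical consequence shown in these plots is easy to verify numerically. In the minimal sketch below (the toy data is an assumption for illustration), a feature with a weak true coefficient is typically driven to exactly zero by the ℓ1 penalty, while the ℓ2 penalty merely shrinks it:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 2))
y_toy = 3 * X_toy[:, 0] + 0.05 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)  # feature 2 barely matters

print(Lasso(alpha=0.1).fit(X_toy, y_toy).coef_)   # the weak weight is typically exactly 0.0
print(Ridge(alpha=0.1).fit(X_toy, y_toy).coef_)   # the weak weight is shrunk but stays nonzero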

The Lasso cost function is not differentiable at θi = 0 (for i = 1, 2, ⋯, n), but Gradient Descent still works fine if you use a subgradient vector g instead when any θi = 0. Equation 4-11 shows a subgradient vector equation you can use for Gradient Descent with the Lasso cost function.

Equation 4-11. Lasso Regression subgradient vector

$$g(\theta, J) = \nabla_{\theta}\,\mathrm{MSE}(\theta) + \alpha \begin{pmatrix} \operatorname{sign}(\theta_1) \\ \operatorname{sign}(\theta_2) \\ \vdots \\ \operatorname{sign}(\theta_n) \end{pmatrix} \quad \text{where } \operatorname{sign}(\theta_i) = \begin{cases} -1 & \text{if } \theta_i < 0 \\ 0 & \text{if } \theta_i = 0 \\ +1 & \text{if } \theta_i > 0 \end{cases}$$
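As a concrete illustration, here is a minimal hand-rolled sketch of Batch Gradient Descent using this subgradient (the function name, toy data, and hyperparameters are assumptions for illustration; Scikit-Learn's Lasso class actually uses coordinate descent rather than this approach):

import numpy as np

def lasso_subgradient_step(theta, X, y, alpha, eta):
    m = len(X)
    mse_grad = (2 / m) * X.T @ (X @ theta - y)    # gradient of the MSE term
    subgrad = mse_grad + alpha * np.sign(theta)   # np.sign(0) = 0, matching Equation 4-11
    return theta - eta * subgrad                  # one Gradient Descent step

# Illustrative usage on toy data where only the first feature matters
rng = np.random.default_rng(0)
X_gd = rng.normal(size=(100, 2))
y_gd = X_gd @ np.array([3.0, 0.0]) + rng.normal(scale=0.1, size=100)
theta = np.zeros(2)
for _ in range(1_000):
    theta = lasso_subgradient_step(theta, X_gd, y_gd, alpha=0.1, eta=0.05)
print(theta)   # first weight near 3, second weight hovering near 0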

Here is a small Scikit-Learn example using the Lasso class. Note that you could instead use SGDRegressor(penalty="l1").

from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1)   # alpha controls the regularization strength
lasso_reg.fit(X, y)            # X, y: the training data prepared earlier
lasso_reg.predict([[1.5]])
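For completeness, the SGDRegressor alternative mentioned above looks like this (the alpha value here is illustrative and is not guaranteed to be numerically interchangeable with the Lasso class's alpha):

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(penalty="l1", alpha=0.1)   # ℓ1 penalty instead of the default ℓ2
sgd_reg.fit(X, y.ravel())                         # SGDRegressor expects a 1-D target array
sgd_reg.predict([[1.5]])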

Conclusion

In conclusion, Lasso Regression is a valuable addition to the data scientist's toolkit: it handles linear regression on high-dimensional data robustly, and its built-in feature selection yields the sparse, interpretable models that machine learning practitioners often need.
