type
Post
Created date
Jun 16, 2022 01:21 PM
category
Data Science
tags
Machine Learning
status
Published

Definition


  • When a statistical model overfits, you can use a regularization technique to address it. Techniques like LASSO penalize some model coefficients (e.g., in lm) that are likely to lead to overfitting.
    • Overfitting can happen for many reasons,
      • e.g., arising from collinearity of the covariates or from high-dimensionality.
  • Regularization is the process of adding a tuning parameter (penalty term) to a model to induce smoothness in order to prevent overfitting.
  • Both L1 (Lasso) and L2 (Ridge) regularization prevent overfitting by shrinking (imposing a penalty on) the coefficients.
  • LASSO can shrink a beta (coefficient) to exactly 0, but Ridge cannot.

Theory


Difference between L1 (Lasso) and L2 (Ridge)
Coefficient zero or not?
  • L2 (Ridge) shrinks all the coefficients to be small but non-zero,
  • whereas L1 (Lasso) can shrink some coefficients to exactly zero while shrinking the remaining coefficients comparatively little, thereby performing variable selection.
  • So, L1 can remove variables by shrinking their coefficients to 0; L2 cannot.
Computational efficiency
  • L2 regularization is computationally efficient (ridge has a closed-form solution), while L1 regularization is less efficient (lasso requires iterative optimization such as coordinate descent).
Model sparsity
  • L2 regularization produces dense models while L1 regularization produces sparse models.
What is a sparse model?
A sparse model is a great property to have when dealing with high-dimensional data, for at least two reasons (see the sketch after this list):
  • Model compression: increasingly important due to the growth of mobile devices.
  • Feature selection: it helps to know which features are important and which are unimportant or redundant.
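A minimal R sketch of this dense-vs-sparse difference, assuming the glmnet package; the simulated data and variable names are illustrative, not from the original post.

```r
# Hedged sketch: simulate data where only 3 of 20 features matter, then compare
# how many coefficients each penalty drives to exactly zero.
library(glmnet)

set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- drop(x[, 1:3] %*% c(2, -1, 1.5)) + rnorm(n)  # only features 1-3 are informative

ridge <- glmnet(x, y, alpha = 0)  # alpha = 0 -> L2 (ridge) penalty
lasso <- glmnet(x, y, alpha = 1)  # alpha = 1 -> L1 (lasso) penalty

b_ridge <- as.vector(coef(ridge, s = 0.1))[-1]  # coefficients at lambda = 0.1, intercept dropped
b_lasso <- as.vector(coef(lasso, s = 0.1))[-1]

sum(b_ridge == 0)  # dense model: typically no exact zeros
sum(b_lasso == 0)  # sparse model: typically most of the 20 are exactly zero
```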
Intuition [vidhya]
  1. Ridge Regression:
      • Performs L2 regularization, i.e. adds a penalty equivalent to the square of the magnitude of the coefficients
      • Minimization objective = LS Obj + α * (sum of squares of coefficients)
  2. Lasso Regression:
      • Performs L1 regularization, i.e. adds a penalty equivalent to the absolute value of the magnitude of the coefficients
      • Minimization objective = LS Obj + α * (sum of absolute values of coefficients); both objectives are written out below
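Written out as formulas (a sketch of the two objectives above, using α as the penalty weight and β as the coefficient vector):

```latex
% Ridge (L2): penalize the sum of squared coefficients
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \alpha \sum_{j=1}^{p} \beta_j^2

% Lasso (L1): penalize the sum of absolute coefficients
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
```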
Which to use
  • If all the features are correlated with the label, ridge outperforms lasso, as the coefficients are never zero in ridge.
  • If only a subset of features is correlated with the label, lasso outperforms ridge, as in the lasso model some coefficients can be shrunk to zero.
λ is a tuning parameter that represents the amount of regularization; that is, it controls the strength of the penalty term.
  • λ can be between 0 and ∞.
  • When λ = 0, ridge regression equals least squares regression. If λ = ∞, all coefficients are shrunk to zero.
  • A large λ means a greater penalty on the estimates, meaning more shrinkage of these estimates toward 0.
  • λ is not estimated by the model but rather chosen before fitting, typically through cross-validation (see the sketch below).
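A minimal sketch of choosing λ by cross-validation, assuming the glmnet package (cv.glmnet runs k-fold CV over a grid of λ values; the simulated data is only illustrative):

```r
# Hedged sketch: pick lambda by 10-fold cross-validation instead of guessing it.
library(glmnet)

set.seed(2)
x <- matrix(rnorm(100 * 20), 100, 20)
y <- drop(x[, 1:3] %*% c(2, -1, 1.5)) + rnorm(100)

cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)  # lasso; use alpha = 0 for ridge

cv_fit$lambda.min               # lambda with the lowest cross-validated error
cv_fit$lambda.1se               # largest lambda within 1 SE of the minimum (more shrinkage)
coef(cv_fit, s = "lambda.min")  # coefficients at the chosen lambda
```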

Assumption


The bias increases as λ (amount of shrinkage) increases
The variance decreases as λ (amount of shrinkage) increases
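One way to see this tradeoff is the coefficient path: as λ grows, every estimate is pulled toward 0 (more bias, less variance). A minimal sketch, assuming glmnet and illustrative simulated data:

```r
# Hedged sketch: plot how the lasso coefficients shrink toward zero as lambda grows.
library(glmnet)

set.seed(3)
x <- matrix(rnorm(100 * 20), 100, 20)
y <- drop(x[, 1:3] %*% c(2, -1, 1.5)) + rnorm(100)

fit <- glmnet(x, y, alpha = 1)
plot(fit, xvar = "lambda", label = TRUE)  # each line is one coefficient; all shrink to 0 as log(lambda) increases
```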

Shortcoming


For Lasso L1
  1. Lasso does not work that well in a high-dimensional case, i.e. where the number of samples is lower than the number of dimensions (when p > n, lasso can select at most n variables).
  2. Secondly, the main benefit of L1 regularization - i.e., that it results in sparse models - can be a disadvantage as well. For example, when you don't need variables to drop out - e.g., because you already performed variable selection - L1 might induce too much sparsity in your model (Kochede, n.d.). The same is true if the relevant information is "smeared out" over many variables in a correlative way (cbeleites, 2013; Tripathi, n.d.). In this case, dropping variables removes essential information. On the contrary, when your information is primarily present in only a few variables, it makes total sense to induce sparsity and hence use L1.
  3. Even when you do want variables to drop out, it is reported that L1 regularization does not work as well as, for example, L2 regularization and Elastic Net regularization (Tripathi, n.d.); see the sketch at the end of this section.
For Ridge L2
The main disadvantage is model interpretability, due to the nature of the regularizer (Gupta, 2017): because L2 regularization does not promote sparsity, you may end up with an uninterpretable model if your dataset is high-dimensional.
This may not always be avoidable (e.g. in the case where you have a correlative dataset), but once again, take a look at your data first before you choose whether to use L1 or L2 regularization.
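As a pointer on the Elastic Net mentioned above: in glmnet, the alpha argument mixes the two penalties, so a value strictly between 0 and 1 gives Elastic Net. A minimal sketch (the 0.5 mix and the simulated data are arbitrary choices for illustration):

```r
# Hedged sketch: elastic net = a mix of the L1 and L2 penalties via alpha.
library(glmnet)

set.seed(4)
x <- matrix(rnorm(100 * 20), 100, 20)
y <- drop(x[, 1:3] %*% c(2, -1, 1.5)) + rnorm(100)

enet <- glmnet(x, y, alpha = 0.5)  # alpha = 1 is pure lasso, alpha = 0 is pure ridge
coef(enet, s = 0.1)                # still sparse, but less aggressive than pure lasso
```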

Example


R code

R interpretation

Math


Reference



 