Linear models and overfitting
Regularization is an effective way to reduce the overfitting of an ML model. Regularizing a model means constraining its learned parameters θ, thereby reducing the degrees of freedom the model has to fit the training data.
In linear regression, we can regularize the parameters by minimizing, during training, a general cost function of the form:
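$$J(\theta) = \mathrm{MSE}(\theta) + \alpha\,\Omega(\theta)$$

where MSE(θ) is the usual Mean Squared Error, Ω(θ) is the penalty term, and the hyperparameter α ≥ 0 controls the strength of the regularization.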
▶ Important: The penalty term should only be added to the cost function during training.
Ridge Regression
Ridge Regression is a shrinkage regularization technique that aims to keep the model's parameter values as small as possible. The penalty function is of the form:
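$$\Omega(\theta) = \frac{1}{2}\sum_{i=1}^{n} \theta_i^2$$

(The 1/2 factor is a common convention that simplifies the gradient, and the sum starts at i = 1 so that the bias term θ0 is not regularized.)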
The hyperparameter α determines how much we wish to regularize the model. A very high α shrinks all weights toward zero, so the model ends up fitting the data with a flat line through the mean of the labels y. Using Gradient Descent with Ridge Regression, the parameter update is given by:
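$$\theta \leftarrow \theta - \eta\left(\nabla_{\theta}\,\mathrm{MSE}(\theta) + \alpha\,\theta\right)$$

where η is the learning rate. As a minimal sketch of this update (assuming NumPy and a design matrix X whose first column is all ones for the bias term; the function name and defaults here are illustrative, not from the text):

```python
import numpy as np

def ridge_gd_step(theta, X, y, eta=0.1, alpha=1.0):
    """One batch Gradient Descent step for linear regression with a
    Ridge (l2) penalty. X is assumed to have a leading bias column."""
    m = len(y)
    grad_mse = (2 / m) * X.T @ (X @ theta - y)  # gradient of the MSE term
    grad_penalty = alpha * theta.copy()         # gradient of (alpha/2) * sum(theta_i^2)
    grad_penalty[0] = 0.0                       # the bias theta_0 is not penalized
    return theta - eta * (grad_mse + grad_penalty)
```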
Lasso Regression
Lasso (Least Absolute Shrinkage and Selection Operator) Regression tends to set exactly to zero the parameters associated with the least important features. That is, Lasso Regression performs an automatic selection of the features that are useful for prediction. The regularization term is of the form:
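$$\Omega(\theta) = \sum_{i=1}^{n} \lvert\theta_i\rvert$$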
The Lasso penalty term is the l1-norm of the parameter vector, while the Ridge version uses (half) the squared l2-norm.
The Lasso term is not differentiable at θi = 0, so the Gradient Descent algorithm has to be adapted using a subgradient method:
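$$\theta \leftarrow \theta - \eta\left(\nabla_{\theta}\,\mathrm{MSE}(\theta) + \alpha\,\mathrm{sign}(\theta)\right)$$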
where sign(θ) is applied element-wise, taking the value -1 where a parameter is negative, +1 where it is positive, and 0 where it is exactly equal to 0.
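Continuing the sketch above (np.sign implements exactly this element-wise sign):

```python
import numpy as np

def lasso_subgradient_step(theta, X, y, eta=0.1, alpha=1.0):
    """One batch subgradient step for linear regression with a
    Lasso (l1) penalty; np.sign returns -1, 0, or +1 element-wise."""
    m = len(y)
    grad_mse = (2 / m) * X.T @ (X @ theta - y)
    subgrad = alpha * np.sign(theta)  # subgradient of alpha * sum(|theta_i|)
    subgrad[0] = 0.0                  # the bias theta_0 is not penalized
    return theta - eta * (grad_mse + subgrad)
```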
Elastic Net Regression
The Elastic Net combines Ridge and Lasso regularization, where a mix-ratio hyperparameter r ∈ [0, 1] determines the balance between shrinkage and feature selection. The Elastic Net penalty is given by:
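$$\Omega(\theta) = r\sum_{i=1}^{n} \lvert\theta_i\rvert + \frac{1-r}{2}\sum_{i=1}^{n} \theta_i^2$$

so that r = 0 reduces to Ridge Regression and r = 1 reduces to Lasso Regression.

In practice, all three penalties are available off the shelf. As an illustration (scikit-learn is an assumption here, not mentioned above, and its internal objective scaling differs slightly from the formulas in this section), its l1_ratio parameter plays the role of the mix ratio r:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

# l1_ratio is the mix ratio r: 0 -> pure Ridge, 1 -> pure Lasso
for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, model.coef_.round(2))
```

Notice that Lasso and the Elastic Net typically drive the coefficients of the two uninformative features toward zero, while Ridge only shrinks them.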