MODULE 1 | LESSON 2 - Introduction to Machine Learning

Linear models and overfitting

Regularization is an effective way to reduce the overfitting of an ML model. Regularizing means constraining the learned parameters θ, which reduces the model's degrees of freedom to fit the training data.

In linear regression, we can regularize the parameters by minimizing, during training, a cost function of the general form:

J(θ) = MSE(θ) + α · R(θ)

where MSE(θ) is the usual mean squared error, R(θ) is the penalty term, and α ≥ 0 is a regularization hyperparameter.

▶ Important: The penalty term should only be added to the cost function during training. Once the model is trained, its performance is evaluated with the unregularized error (e.g., the plain MSE).
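
As a minimal NumPy sketch of this point (the function names and data below are illustrative, not from the lesson): the penalty enters the objective minimized during training, while evaluation uses the plain MSE.

```python
import numpy as np

def mse(theta, X, y):
    """Mean squared error of a linear model y ≈ X @ theta."""
    residuals = X @ theta - y
    return np.mean(residuals ** 2)

def training_cost(theta, X, y, alpha):
    """Cost minimized during training: MSE plus a Ridge penalty."""
    penalty = np.sum(theta[1:] ** 2)   # the bias theta[0] is not penalized
    return mse(theta, X, y) + alpha * penalty

rng = np.random.default_rng(0)
X = np.c_[np.ones(20), rng.normal(size=(20, 2))]  # bias column + 2 features
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=20)
theta = rng.normal(size=3)

print("training objective:", training_cost(theta, X, y, alpha=0.1))
print("evaluation MSE:    ", mse(theta, X, y))
```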

Ridge Regression

Ridge Regression is a shrinkage regularization technique that keeps the parameter values of the model as small as possible. The penalty term is of the form:

R(θ) = Σ θi²   (sum over i = 1, …, n)

Note that the bias term θ0 is left out of the sum, so only the feature weights are shrunk.

Ridge Regression “forces” the model toward one that simply predicts the average of the training labels.

The hyperparameter α determines how much we wish to regularize the model: a very high α will fit the data with a flat line through the average of the labels y. Using Gradient Descent with Ridge Regression, the parameter update becomes:

θ ← θ − η · ( ∇θ MSE(θ) + 2α·θ )

where η is the learning rate and the penalty gradient 2α·θ is zeroed for the bias θ0.
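
A minimal NumPy sketch of this update rule follows; ridge_gd_step is a hypothetical helper, and zeroing the penalty gradient for the bias θ0 follows the convention above.

```python
import numpy as np

def ridge_gd_step(theta, X, y, alpha, eta):
    """One Gradient Descent step on MSE(θ) + α·Σθi²."""
    m = len(y)
    grad_mse = (2.0 / m) * X.T @ (X @ theta - y)  # gradient of the MSE term
    grad_penalty = 2.0 * alpha * theta            # gradient of the Ridge penalty
    grad_penalty[0] = 0.0                         # the bias θ0 is not shrunk
    return theta - eta * (grad_mse + grad_penalty)

rng = np.random.default_rng(1)
X = np.c_[np.ones(30), rng.normal(size=(30, 2))]
y = X @ np.array([0.5, 2.0, -1.0]) + rng.normal(scale=0.1, size=30)

theta = np.zeros(3)
for _ in range(500):
    theta = ridge_gd_step(theta, X, y, alpha=0.1, eta=0.05)
print(theta)   # the feature weights are shrunk toward zero
```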

Lasso Regression

Lasso (Least Absolute Shrinkage and Selection Operator) Regression tends to set to zero the parameters associated with the least important features. That is, Lasso Regression performs an automatic selection of the features that are useful for prediction. The penalty term is of the form:

R(θ) = Σ |θi|   (sum over i = 1, …, n)

The Lasso penalty term is the l1-norm of the parameter vector, while the Ridge penalty is the squared l2-norm.

The Lasso term is not differentiable at θi = 0, so the Gradient Descent algorithm needs to be adjusted using a subgradient method:

θ ← θ − η · ( ∇θ MSE(θ) + α · sign(θ) )

where sign(θ) is applied element-wise: it takes the value −1 where a parameter is negative, +1 where it is positive, and 0 where it is exactly equal to 0.
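
A minimal NumPy sketch of the subgradient step (lasso_subgradient_step is a hypothetical helper; as above, the bias θ0 is assumed to be unpenalized):

```python
import numpy as np

def lasso_subgradient_step(theta, X, y, alpha, eta):
    """One subgradient step on MSE(θ) + α·Σ|θi|."""
    m = len(y)
    grad_mse = (2.0 / m) * X.T @ (X @ theta - y)
    subgrad = alpha * np.sign(theta)   # np.sign is -1, 0, or +1 element-wise
    subgrad[0] = 0.0                   # the bias θ0 is not penalized
    return theta - eta * (grad_mse + subgrad)
```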

Elastic Net Regression

The Elastic Net combines Ridge and Lasso regularization: a mixing hyperparameter r determines the balance between shrinkage (r = 0 gives pure Ridge) and feature selection (r = 1 gives pure Lasso). The Elastic Net penalty is given by:

R(θ) = r · Σ |θi| + ((1 − r)/2) · Σ θi²
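
As a usage sketch, this mixture is available in scikit-learn's ElasticNet, whose l1_ratio parameter plays the role of r (the data below is synthetic and illustrative, and scikit-learn scales the penalty terms slightly differently from the formula above):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                # only the first 2 features matter
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=100)

# l1_ratio=0 is (close to) Ridge, l1_ratio=1 is Lasso.
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)
print(model.coef_)   # weights of the least important features may be exactly 0
```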