Module 7: Lesson 3 - Multilayer Perceptrons (MLP)

  • Multilayer perceptron (MLP): the simplest type of deep network.

  • One of the defining features of any deep network is its hidden layers. These layers, as we will see shortly, allow us to capture non-linear relationships in the data:

  • How does data move from the input layer to the output layer O? (A shape check is sketched in code after the list below.)


  - X is an n × d matrix (a minibatch of n examples with d inputs each)
  - W(1) is a d × h matrix (d inputs to h units in the hidden layer); here h = 5
  - b(1) is a 1 × h vector (one bias term per hidden-layer unit)
  - H = XW(1) + b(1) is a matrix of (n × d)(d × h) + (1 × h) = n × h (a minibatch of n examples for the h hidden-layer units)
  - W(2) is an h × q matrix (h hidden-layer units to q units in the output layer); here q = 3
  - b(2) is a 1 × q vector (one bias term per output-layer unit)
  - O = HW(2) + b(2) is a matrix of (n × h)(h × q) + (1 × q) = n × q
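A minimal NumPy sketch of these shapes (h = 5 and q = 3 match the figure; n = 4, d = 10, and the random values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, q = 4, 10, 5, 3           # h = 5 and q = 3 as above; n and d are arbitrary here

X  = rng.standard_normal((n, d))   # minibatch: n examples, d inputs each
W1 = rng.standard_normal((d, h))   # hidden-layer weights
b1 = np.zeros((1, h))              # one bias per hidden unit
W2 = rng.standard_normal((h, q))   # output-layer weights
b2 = np.zeros((1, q))              # one bias per output unit

H = X @ W1 + b1                    # (n × d)(d × h) + (1 × h) -> n × h
O = H @ W2 + b2                    # (n × h)(h × q) + (1 × q) -> n × q
print(H.shape, O.shape)            # (4, 5) (4, 3)
```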

Activation Functions: The secret ingredient is a non-linear activation function. These functions transform their inputs in a non-linear way to decide whether the neurons of a layer are activated (or not).

Thus, for an activation function σ(·), our previous set of equations becomes:

H = σ(XW(1) + b(1))
O = HW(2) + b(2)
The activation function is what ultimately brings non-linearity to our deep networks; without it, a stack of layers would collapse into a single linear transformation (see the sketch below).
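To see why this matters, here is a small illustration (not from the lesson): without an activation, two stacked linear layers collapse into a single linear map, whereas inserting a ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
X  = rng.standard_normal((4, 10))
W1 = rng.standard_normal((10, 5))
W2 = rng.standard_normal((5, 3))

# No activation: two "layers" are equivalent to ONE linear layer with weights W1 @ W2.
O_linear = (X @ W1) @ W2
print(np.allclose(O_linear, X @ (W1 @ W2)))   # True -> the extra layer bought us nothing

# With a non-linear activation (ReLU here), the composition is no longer linear,
# so no single weight matrix can reproduce it for all inputs.
relu = lambda z: np.maximum(z, 0)
O_nonlinear = relu(X @ W1) @ W2
```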




Forward and Backward propagation:

There are 2 key algorithms involved in the training of our neural networks: forward and backward 
propagation.

Forward propagation (or forward pass):

- Refers to the calculation (and storage) of the intermediate variables (and outputs) of the network, in layer-to-layer order, from the input layer to the output layer.

For a network with one hidden layer and bias terms set to b = 0, the forward pass computes, in order:

z = W(1)x → h = φ(z) → o = W(2)h → L = l(o, y) → J = L + s,

where φ is the activation function, l is the loss, s is an l2-norm penalty on the weights W(1) and W(2), and J is the resulting regularized cost function.
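A sketch of this forward pass in NumPy; the squared-error loss, ReLU activation, and regularization strength lam are assumptions standing in for the lesson's unspecified choices, and the intermediate variables z, h, o are stored for later use by backprop.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, q = 10, 5, 3
x  = rng.standard_normal(d)          # one input example
y  = rng.standard_normal(q)          # its target (placeholder)
W1 = rng.standard_normal((h, d))     # z = W(1)x convention -> W1 is h × d here
W2 = rng.standard_normal((q, h))
lam = 1e-2                           # l2-regularization strength (assumed)

phi = lambda z: np.maximum(z, 0)     # ReLU, one possible choice for φ (assumed)

# Forward pass: compute AND store the intermediate variables z, h, o.
z  = W1 @ x                          # pre-activation of the hidden layer
h_ = phi(z)                          # hidden-layer activation
o  = W2 @ h_                         # output of the network
L  = 0.5 * np.sum((o - y) ** 2)      # loss (squared error, assumed)
s  = lam / 2 * (np.sum(W1**2) + np.sum(W2**2))  # l2 penalty on the weights
J  = L + s                           # regularized cost
```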

Forward prop is highly interconnected with backward propagation:

Backward propagation (backpropagation):

- Refers to the method of calculating the gradients of the network parameters and updating them.
- The method essentially consists of calculating the partial derivatives of the cost function with
respect to the different parameters (i.e., the W(l)).
- To achieve this, it relies on the chain rule: for intermediate variables Y = f(X) and Z = g(Y),

∂Z/∂X = (∂Z/∂Y) (∂Y/∂X)
- The backprop algorithm starts from the gradient of the loss function (at the output layer) and ends by
obtaining the gradient of the model weights closest to the input layer (i.e., W(1)). For the one-hidden-layer network above, the gradients are computed in the order:

∂J/∂o = ∂L/∂o
∂J/∂W(2) = (∂J/∂o) hᵀ + ∂s/∂W(2)
∂J/∂h = W(2)ᵀ (∂J/∂o)
∂J/∂z = (∂J/∂h) ⊙ φ′(z)
∂J/∂W(1) = (∂J/∂z) xᵀ + ∂s/∂W(1)
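A sketch of this backward pass, reusing the forward pass above (squared-error loss, ReLU, and lam remain assumptions); the analytical gradient of W(1) is compared against a finite-difference estimate as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, q, lam = 10, 5, 3, 1e-2
x, y = rng.standard_normal(d), rng.standard_normal(q)
W1, W2 = rng.standard_normal((h, d)), rng.standard_normal((q, h))
phi  = lambda z: np.maximum(z, 0)          # ReLU (assumed)
dphi = lambda z: (z > 0).astype(float)     # its derivative

def cost(W1, W2):
    """Forward pass returning the regularized cost J and cached intermediates."""
    z  = W1 @ x
    h_ = phi(z)
    o  = W2 @ h_
    L  = 0.5 * np.sum((o - y) ** 2)
    s  = lam / 2 * (np.sum(W1**2) + np.sum(W2**2))
    return L + s, (z, h_, o)

J, (z, h_, o) = cost(W1, W2)

# Backward pass: apply the chain rule from the output back to W(1).
dJ_do  = o - y                              # ∂J/∂o for the squared-error loss
dJ_dW2 = np.outer(dJ_do, h_) + lam * W2     # ∂J/∂W(2)
dJ_dh  = W2.T @ dJ_do                       # ∂J/∂h
dJ_dz  = dJ_dh * dphi(z)                    # ∂J/∂z (elementwise product)
dJ_dW1 = np.outer(dJ_dz, x) + lam * W1      # ∂J/∂W(1)

# Finite-difference check on one entry of W(1).
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (cost(W1p, W2)[0] - J) / eps
print(num, dJ_dW1[0, 0])                    # the two estimates should closely agree
```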
Derivative of Activation Functions
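The lesson's own derivative plots are not reproduced here; assuming the common choices of ReLU, sigmoid, and tanh, their derivatives (as needed in the ∂J/∂z step above) can be written as:

```python
import numpy as np

# Common activation functions and their derivatives (ReLU, sigmoid, and tanh are
# assumed here; the specific activations shown in the lesson may differ).
def relu(z):        return np.maximum(z, 0)
def relu_grad(z):   return (z > 0).astype(float)      # 1 for z > 0, else 0

def sigmoid(z):     return 1.0 / (1.0 + np.exp(-z))
def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                              # σ(z)(1 − σ(z))

def tanh_grad(z):   return 1.0 - np.tanh(z) ** 2      # 1 − tanh²(z)

z = np.linspace(-3, 3, 7)
print(relu_grad(z), sigmoid_grad(z), tanh_grad(z), sep="\n")
```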




Early stopping
The forward-prop and backprop algorithms effectively train the model through an iterative process:

▶ When do iterations/training stop?
The easy answer is: when the stopping criterion is met. But this raises another question:

▶ What is the stopping criterion?
There are many possible choices. For now, we are going to look at one of them: Early Stopping.

Early stopping serves as both a stopping criterion and a regularization technique.

▶ Essentially, it consists of constraining the number of training epochs. How?
- During training, monitor the error on the validation set (usually checking after each epoch).
- Stopping criterion: stop when the validation error has not decreased by more than some amount ε
for some number of epochs (see the sketch after this list).
- How many epochs is "some"? → the patience.
- Importantly, the validation set must NOT be part of the training or test sets!
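A sketch of this rule; train_one_epoch, validation_error, max_epochs, patience, and eps are hypothetical placeholders and illustrative values, not names from the lesson.

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=200, patience=5, eps=1e-4):
    """Early stopping: halt when the validation error has not improved by more
    than eps for `patience` consecutive epochs, then return the best model seen.

    train_one_epoch(model) and validation_error(model) are caller-supplied
    callables (hypothetical names); the validation set they use must be
    disjoint from both the training and test sets.
    """
    best_val = float("inf")
    best_model = copy.deepcopy(model)       # snapshot of the best weights so far
    epochs_since_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)              # one pass of forward + backward prop
        val_err = validation_error(model)   # monitor error on the validation set

        if val_err < best_val - eps:        # improved by more than eps
            best_val = val_err
            best_model = copy.deepcopy(model)
            epochs_since_improvement = 0
        else:                               # no (sufficient) improvement this epoch
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                break                       # stopping criterion met

    return best_model
```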