Module 7: Lesson 3 - Multilayer Perceptrons (MLP)

  • Multilayer perceptron (MLP): the simplest type of deep network.

  • One of the defining features of any deep network is its hidden layers. These layers, as we will see shortly, allow us to capture non-linear relationships in the data:

  • How does data move from the input layer to the output layer O? (A shape check is sketched in code after the list below.)


  - X is an n × d matrix (a minibatch of n examples with d inputs each)
  - W(1) is a d × h matrix (d inputs to h units in the hidden layer); here h = 5
  - b(1) is a 1 × h vector (one bias term per hidden-layer unit)
  - H = XW(1) + b(1) is a matrix of (n × d)(d × h) + (1 × h) = n × h (a minibatch of n examples for the h hidden-layer units)
  - W(2) is an h × q matrix (h hidden-layer units to q units in the output layer); here q = 3
  - b(2) is a 1 × q vector (one bias term per output-layer unit)
  - O = HW(2) + b(2) is a matrix of (n × h)(h × q) + (1 × q) = n × q
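A minimal NumPy sketch of these shapes (h = 5 and q = 3 match the figure; n = 4, d = 10, and the random values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, q = 4, 10, 5, 3           # h = 5 and q = 3 as above; n and d are arbitrary here

X  = rng.standard_normal((n, d))   # minibatch: n examples, d inputs each
W1 = rng.standard_normal((d, h))   # hidden-layer weights
b1 = np.zeros((1, h))              # one bias per hidden unit
W2 = rng.standard_normal((h, q))   # output-layer weights
b2 = np.zeros((1, q))              # one bias per output unit

H = X @ W1 + b1                    # (n × d)(d × h) + (1 × h) -> n × h
O = H @ W2 + b2                    # (n × h)(h × q) + (1 × q) -> n × q
print(H.shape, O.shape)            # (4, 5) (4, 3)
```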

Activation Functions: The secret ingredient is a non-linear activation function. These functions transform their inputs in a non-linear way to decide whether the neurons of a layer are activated (or not).

Thus, for an activation function σ(·), our previous set of equations becomes:

H = σ(XW(1) + b(1))
O = HW(2) + b(2)
The activation function is what ultimately brings non-linearity to our deep networks; without it, a stack of layers would collapse into a single linear transformation (see the sketch below).
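To see why this matters, here is a small illustration (not from the lesson): without an activation, two stacked linear layers collapse into a single linear map, whereas inserting a ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
X  = rng.standard_normal((4, 10))
W1 = rng.standard_normal((10, 5))
W2 = rng.standard_normal((5, 3))

# No activation: two "layers" are equivalent to ONE linear layer with weights W1 @ W2.
O_linear = (X @ W1) @ W2
print(np.allclose(O_linear, X @ (W1 @ W2)))   # True -> the extra layer bought us nothing

# With a non-linear activation (ReLU here), the composition is no longer linear,
# so no single weight matrix can reproduce it for all inputs.
relu = lambda z: np.maximum(z, 0)
O_nonlinear = relu(X @ W1) @ W2
```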




Forward and Backward propagation:

There are 2 key algorithms involved in the training of our neural networks: forward and backward 
propagation.

Forward propagation (or forward pass):

- Refers to the calculation (and storage) of the intermediate variables (and outputs) of the network, in layer-to-layer order, from the input layer to the output layer.

For a network with one hidden layer and bias terms set to b = 0, the forward pass computes, in order:

z = W(1)x → h = φ(z) → o = W(2)h → L = l(o, y) → J = L + s,

where φ is the activation function, l is the loss, s is an l2-norm penalty on the weights W(1) and W(2), and J is the resulting regularized cost function.
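A sketch of this forward pass in NumPy; the squared-error loss, ReLU activation, and regularization strength lam are assumptions standing in for the lesson's unspecified choices, and the intermediate variables z, h, o are stored for later use by backprop.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, q = 10, 5, 3
x  = rng.standard_normal(d)          # one input example
y  = rng.standard_normal(q)          # its target (placeholder)
W1 = rng.standard_normal((h, d))     # z = W(1)x convention -> W1 is h × d here
W2 = rng.standard_normal((q, h))
lam = 1e-2                           # l2-regularization strength (assumed)

phi = lambda z: np.maximum(z, 0)     # ReLU, one possible choice for φ (assumed)

# Forward pass: compute AND store the intermediate variables z, h, o.
z  = W1 @ x                          # pre-activation of the hidden layer
h_ = phi(z)                          # hidden-layer activation
o  = W2 @ h_                         # output of the network
L  = 0.5 * np.sum((o - y) ** 2)      # loss (squared error, assumed)
s  = lam / 2 * (np.sum(W1**2) + np.sum(W2**2))  # l2 penalty on the weights
J  = L + s                           # regularized cost
```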

Forward prop is highly interconnected with backward propagation:

Backward propagation (backpropagation):

- Refers to the method of calculating the gradients of the network parameters and updating them.
- The method essentially consists of calculating the partial derivatives of the cost function with
respect to the different parameters (i.e., the W(l)).
- To achieve this, it relies on the chain rule: for intermediate variables Y = f(X) and Z = g(Y),

∂Z/∂X = (∂Z/∂Y) (∂Y/∂X)
- The backprop algorithm starts from the gradient of the loss function (at the output layer) and ends by
obtaining the gradient of the model weights closest to the input layer (i.e., W(1)). For the one-hidden-layer network above, the gradients are computed in the order:

∂J/∂o = ∂L/∂o
∂J/∂W(2) = (∂J/∂o) hᵀ + ∂s/∂W(2)
∂J/∂h = W(2)ᵀ (∂J/∂o)
∂J/∂z = (∂J/∂h) ⊙ φ′(z)
∂J/∂W(1) = (∂J/∂z) xᵀ + ∂s/∂W(1)
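A sketch of this backward pass, reusing the forward pass above (squared-error loss, ReLU, and lam remain assumptions); the analytical gradient of W(1) is compared against a finite-difference estimate as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, q, lam = 10, 5, 3, 1e-2
x, y = rng.standard_normal(d), rng.standard_normal(q)
W1, W2 = rng.standard_normal((h, d)), rng.standard_normal((q, h))
phi  = lambda z: np.maximum(z, 0)          # ReLU (assumed)
dphi = lambda z: (z > 0).astype(float)     # its derivative

def cost(W1, W2):
    """Forward pass returning the regularized cost J and cached intermediates."""
    z  = W1 @ x
    h_ = phi(z)
    o  = W2 @ h_
    L  = 0.5 * np.sum((o - y) ** 2)
    s  = lam / 2 * (np.sum(W1**2) + np.sum(W2**2))
    return L + s, (z, h_, o)

J, (z, h_, o) = cost(W1, W2)

# Backward pass: apply the chain rule from the output back to W(1).
dJ_do  = o - y                              # ∂J/∂o for the squared-error loss
dJ_dW2 = np.outer(dJ_do, h_) + lam * W2     # ∂J/∂W(2)
dJ_dh  = W2.T @ dJ_do                       # ∂J/∂h
dJ_dz  = dJ_dh * dphi(z)                    # ∂J/∂z (elementwise product)
dJ_dW1 = np.outer(dJ_dz, x) + lam * W1      # ∂J/∂W(1)

# Finite-difference check on one entry of W(1).
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (cost(W1p, W2)[0] - J) / eps
print(num, dJ_dW1[0, 0])                    # the two estimates should closely agree
```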
Derivative of Activation Functions
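The lesson's own derivative plots are not reproduced here; assuming the common choices of ReLU, sigmoid, and tanh, their derivatives (as needed in the ∂J/∂z step above) can be written as:

```python
import numpy as np

# Common activation functions and their derivatives (ReLU, sigmoid, and tanh are
# assumed here; the specific activations shown in the lesson may differ).
def relu(z):        return np.maximum(z, 0)
def relu_grad(z):   return (z > 0).astype(float)      # 1 for z > 0, else 0

def sigmoid(z):     return 1.0 / (1.0 + np.exp(-z))
def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                              # σ(z)(1 − σ(z))

def tanh_grad(z):   return 1.0 - np.tanh(z) ** 2      # 1 − tanh²(z)

z = np.linspace(-3, 3, 7)
print(relu_grad(z), sigmoid_grad(z), tanh_grad(z), sep="\n")
```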




Early stopping
The forward-prop and backprop algorithms effectively train the model through an iterative process:

▶ When do iterations/training stop?
The easy answer is: when the stopping criterion is met. But this raises another question:

▶ What is the stopping criterion?
There are many possible choices. For now, we are going to look at one of them: Early Stopping.

Early stopping serves as both a stopping criterion and a regularization technique.

▶ Essentially, it consists of constraining the number of training epochs. How?
- During training, monitor the error on the validation set (usually checking after each epoch).
- Stopping criterion: stop when the validation error has not decreased by more than some amount ε
for some number of epochs (see the sketch after this list).
- How many epochs is "some"? → the patience.
- Importantly, the validation set must NOT be part of the training or test sets!
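A sketch of this rule; train_one_epoch, validation_error, max_epochs, patience, and eps are hypothetical placeholders and illustrative values, not names from the lesson.

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=200, patience=5, eps=1e-4):
    """Early stopping: halt when the validation error has not improved by more
    than eps for `patience` consecutive epochs, then return the best model seen.

    train_one_epoch(model) and validation_error(model) are caller-supplied
    callables (hypothetical names); the validation set they use must be
    disjoint from both the training and test sets.
    """
    best_val = float("inf")
    best_model = copy.deepcopy(model)       # snapshot of the best weights so far
    epochs_since_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)              # one pass of forward + backward prop
        val_err = validation_error(model)   # monitor error on the validation set

        if val_err < best_val - eps:        # improved by more than eps
            best_val = val_err
            best_model = copy.deepcopy(model)
            epochs_since_improvement = 0
        else:                               # no (sufficient) improvement this epoch
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                break                       # stopping criterion met

    return best_model
```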