- Multilayer perceptron (MLP): the simplest type of deep network.
- One of the main features of any deep network is the presence of one or more hidden layers. These layers, as we will see shortly, will allow us to explore non-linear relationships in the data:
- How does data move from the input layer to the output layer O? (See the dimensions below and the sketch after the list.)
- - X is an n × d matrix (a minibatch of n examples with d inputs each)
- - W(1) is a d × h matrix (d inputs to h units in the hidden layer). Here h = 5
- - b(1) is a 1 × h vector (1 bias term for each unit in the hidden layer)
- - H = XW(1) + b(1) is an (n × d)(d × h) + (1 × h) = n × h matrix (the minibatch of n examples, now represented by h hidden-layer units)
- - W(2) is an h × q matrix (h units in the hidden layer by q units in the output layer). Here q = 3
- - b(2) is a 1 × q vector (1 bias term for each unit in the output layer)
- - O = HW(2) + b(2) is an (n × h)(h × q) + (1 × q) = n × q matrix
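A minimal NumPy sketch of this shape bookkeeping (h = 5 and q = 3 as above; n = 4 and d = 10 are arbitrary values chosen only for the example):

```python
import numpy as np

n, d, h, q = 4, 10, 5, 3      # n and d are arbitrary here; h = 5 and q = 3 as above

X  = np.random.randn(n, d)    # minibatch of n examples with d inputs each
W1 = np.random.randn(d, h)    # W(1): d inputs -> h hidden units
b1 = np.zeros((1, h))         # b(1): one bias per hidden unit
W2 = np.random.randn(h, q)    # W(2): h hidden units -> q output units
b2 = np.zeros((1, q))         # b(2): one bias per output unit

H = X @ W1 + b1               # (n x d)(d x h) + (1 x h) -> n x h
O = H @ W2 + b2               # (n x h)(h x q) + (1 x q) -> n x q

print(H.shape, O.shape)       # (4, 5) (4, 3)
```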
Activation Functions: The secret ingredient is a non-linear activation function. These functions transform inputs in a non-linear way to decide whether the neurons of a layer are activated (or not).
Thus, for an activation function σ(·), our previous set of equations becomes: H = σ(XW(1) + b(1)) and O = HW(2) + b(2).
The activation function is what ultimately brings non-linearity to our deep networks.
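For illustration, a few common choices of σ written out in NumPy:

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x), applied element-wise; a very common default choice
    return np.maximum(0.0, x)

def sigmoid(x):
    # Sigmoid: squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Hyperbolic tangent: squashes inputs into (-1, 1)
    return np.tanh(x)
```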
Forward and Backward propagation:
There are 2 key algorithms involved in the training of our neural networks: forward and backward
propagation.
Forward propagation (or forward pass):
- Refers to the calculation (and storage) of the intermediate variables (and outputs) of the network, layer by layer, from the input layer to the output layer.
where, for a network with one hidden layer and bias terms b = 0:
z = W(1)x → h = φ(z) → o = W(2)h → J, where J is the regularized cost function: the loss on the output o plus an l2-norm penalty on the weights.
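A minimal sketch of this forward pass for a single example x, storing the intermediates z, h, o so that backprop can reuse them. The squared-error loss, the tanh activation, and the regularization weight lam are illustrative assumptions, not choices made by the notes:

```python
import numpy as np

def forward(x, y, W1, W2, lam=1e-3, phi=np.tanh):
    # One-hidden-layer forward pass with b = 0, following the chain above.
    # Note: here W1 is h x d and W2 is q x h, matching z = W(1)x and o = W(2)h.
    # The squared-error loss and the value of lam are illustrative assumptions.
    z = W1 @ x                                            # hidden pre-activation
    h = phi(z)                                            # hidden activation
    o = W2 @ h                                            # output
    loss = 0.5 * np.sum((o - y) ** 2)                     # example loss l(o, y)
    s = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))   # l2 regularization term
    J = loss + s                                          # regularized cost
    return z, h, o, J                                     # intermediates kept for backprop
```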
Forward prop is highly interconnected with backward propagation:
Backward propagation (backpropagation):
- Refers to the method of calculating the gradients of the NN parameters and updating them.
- The method essentially consists of calculating the partial derivatives of the cost function with respect to the different parameters (i.e., the W(l)).
- To achieve this, it relies on the chain rule, e.g., ∂J/∂W(1) = (∂J/∂o)(∂o/∂h)(∂h/∂z)(∂z/∂W(1)).
- The backprop algorithm starts with the gradient of the loss function and ends by obtaining the gradient of the model weights closest to the input layer (i.e., W(1)), as sketched below.
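Continuing the forward sketch above, a minimal backward pass applying the chain rule from J back to W(2) and then W(1). The squared-error loss and tanh activation are the same illustrative assumptions as before:

```python
import numpy as np

def backward(x, y, z, h, o, W1, W2, lam=1e-3):
    # Chain rule, applied from the output back towards the input layer.
    do = o - y                            # dJ/do for the squared-error loss
    dW2 = np.outer(do, h) + lam * W2      # dJ/dW(2), including the l2 term
    dh = W2.T @ do                        # dJ/dh
    dz = dh * (1.0 - h ** 2)              # dJ/dz, using tanh'(z) = 1 - tanh(z)^2
    dW1 = np.outer(dz, x) + lam * W1      # dJ/dW(1), including the l2 term
    return dW1, dW2
```

A gradient-descent step would then update W1 -= lr * dW1 and W2 -= lr * dW2, where lr is a learning rate (another hypothetical hyperparameter in this sketch).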
Derivative of Activation Functions
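For the activations sketched earlier (ReLU, sigmoid, tanh), the derivatives that backprop plugs in for φ'(z) are standard:

```python
import numpy as np

def relu_grad(x):
    # ReLU'(x) = 1 for x > 0, else 0 (undefined at exactly 0; 0 is used here)
    return (x > 0).astype(float)

def sigmoid_grad(x):
    # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    # tanh'(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2
```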
Early stopping
The forward prop and backprop algorithms will effectively train the model through an iterative process:
▶ When do iterations/training stop?
The easy answer to this question is: when the stopping criterion is met. But this begs another question:
▶ What is the stopping criterion?
There are many possible choices. For now, we are going to see one of the options: Early Stopping.
▶ Early stopping serves as both a stopping criterion and a regularization technique.
▶ Essentially, it consists of constraining the number of epochs of training. How?
- During training, monitor the error on the validation set (usually checking after each epoch).
- Stopping criterion: stop when the validation error has not decreased by more than some amount ε for some number of epochs.
- How many epochs is "some"? → patience
- Importantly, the validation set must NOT be part of the training or test sets!
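A minimal sketch of this stopping rule (train_one_epoch and validation_error are hypothetical placeholders for whatever training loop and validation metric are in use):

```python
def train_with_early_stopping(model, max_epochs=200, patience=5, epsilon=1e-4):
    # Stop when the validation error has not decreased by more than epsilon
    # for `patience` consecutive epochs.
    # train_one_epoch and validation_error are hypothetical placeholders.
    best_val = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)               # one pass of forward prop + backprop over the training set
        val_err = validation_error(model)    # error on the held-out validation set (not train, not test)
        if val_err < best_val - epsilon:     # improved by more than epsilon
            best_val = val_err
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                            # early stop
    return model
```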