MODULE 7 | LESSON 1 - ENSEMBLES AND WALK FORWARD

 

1. Deep neural networks are flexible models, but they are prone to problems that damage their generalization capacity (bias and variance):

  • The predictions from deep neural networks are sensitive to initial conditions, both in terms of the initial random weights and in terms of the statistical noise in the training dataset. 
  • The stochastic nature of the learning algorithm means that a slightly (or dramatically) different version of the mapping from inputs to outputs is learned from each initialization, and different performance is obtained on the training and holdout (test) datasets.
2. Ensemble learning represents a useful tool to overcome these problems and improve the performance of predictions:

    •  Ensembles consist of training multiple models instead of a single model and combining the predictions from these models to obtain a single overall prediction.

    • The individual models are “weak learners” that, combined, lead to a stronger learner that performs better than the individual ones.

    3. Ensemble methods

    The most commonly used ensemble approach for neural networks is called model-averaging or “committee of networks.”

    • A collection of networks with the same configuration and different initial random weights is trained on the same dataset. The overall prediction is calculated as the average of the predictions of the trained models.

    The number of models in the ensemble is often kept small (three, five, or 10 trained models) both because of the computational expense in training models and because of the diminishing returns in performance from adding more ensemble members.
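
    As a rough illustration, the sketch below trains a small committee of networks that share the same configuration but use different random seeds, and averages their predictions. It uses scikit-learn's MLPRegressor on synthetic data purely as a stand-in for a deep network; the data, network size, and ensemble size are illustrative assumptions.

```python
# Minimal model-averaging ("committee of networks") sketch.
# MLPRegressor stands in for a deep network; data and settings are toy assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                               # toy features
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=500)     # toy target
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

n_members = 5                    # small ensemble, e.g., three, five, or ten members
members = []
for seed in range(n_members):
    # Same configuration, different initial random weights (random_state).
    model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=seed)
    members.append(model.fit(X_train, y_train))

# The overall prediction is the average of the members' predictions.
ensemble_pred = np.mean([m.predict(X_test) for m in members], axis=0)
```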



    4. Advanced ensemble methods

    Techniques for ensemble learning can be classified according to which element is varied in the process.

    Training data: Each model in the ensemble is trained with different data.

    ▶ K-fold cross-validation: k models are trained on k different subsets of the training data, each holding out one fold as test data. Each observation's out-of-sample prediction comes from the model whose test fold contains it, and the combined prediction for new data averages the k models.
    ▶ Bootstrap aggregation (bagging): The training dataset is resampled with replacement, and a separate model is trained on each resampled dataset (a sketch appears after this list).

    Models: The combined prediction is obtained from models with different configurations.

    ▶ The ensemble may learn heterogeneous mappings whose errors display low correlation.

    Combinations: The contribution of each model to the combined prediction is optimized.

    ▶ Stacking: Training an entirely new model to learn how to best combine the contributions from each member of the ensemble.

    ▶ Boosting: Sequentially train models that learn from the prediction errors of previous models.
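
    The sketch below illustrates the bagging item above: each ensemble member is trained on a bootstrap resample of the training set, i.e., rows drawn with replacement. MLPRegressor and the synthetic data are stand-in assumptions rather than part of the source material.

```python
# Minimal bagging (bootstrap aggregation) sketch with a toy dataset.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 10))
y_train = X_train[:, 0] - 0.5 * X_train[:, 1] + 0.1 * rng.normal(size=400)
X_test = rng.normal(size=(100, 10))

n_members = 5
bagged_preds = []
for seed in range(n_members):
    # Bootstrap resample: draw n rows with replacement from the training set.
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=seed)
    model.fit(X_train[idx], y_train[idx])
    bagged_preds.append(model.predict(X_test))

# Combined prediction: average over the bagged members.
ensemble_pred = np.mean(bagged_preds, axis=0)
```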


    5. Ensemble methods in financial applications

    In financial applications, observations cannot be assumed to be independent and identically distributed.

    ▶ This means that slicing and resampling must be done with care: placing observations that are close in time in different sets will likely lead to some leakage.

    ▶ Information that is present in the training set (either in the inputs or the labels) is also present in the test set due to the serial correlation.

    ▶ Leakage is likely to lead to overfitting that contaminates the test samples and inflates performance in backtests.
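
    As a rough sketch of this idea, a single chronological split can exclude a small buffer of observations between the training and test windows so that serially correlated information does not leak across the split. The sample size and buffer length below are illustrative assumptions.

```python
# Minimal sketch of a chronological split with a gap of excluded observations.
import numpy as np

n_obs = 1_000
split = 800          # end of the training window (exclusive)
gap = 20             # observations excluded between training and test sets

train_idx = np.arange(0, split)             # observations 0 .. 799
test_idx = np.arange(split + gap, n_obs)    # observations 820 .. 999
```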


    6. Walk Forward
    In most applications, the common practice when training deep learning models has been to first split the data by time, training on the first chunk of data and using the second for out-of-sample tests.

    While this choice is not wrong in itself, it does not make proper use of the available information.

    ▶ Only a small share of the information is used to evaluate out-of-sample performance.
    ▶ The tests are biased toward the models' ability to predict out of sample only in the most recent part of the time series.

    Walk Forward method:

    ▶ We generate a sequence of out-of-sample evaluations by periodically re-training the model using a rolling, or expanding, time window.
    ▶  Think of walk forward as an alternative resampling tool that preserves the sequential dimension of the data.

    7. Walk Forward methods

    1. Select the length of the training and test samples, T1 and T2, respectively.
    2. Train the model using a window of information that spans T1 periods starting with the initial date of the sample. Evaluate the model performance on a test sample that includes T2 periods and begins at some date after the end of the training sample.
    3. Save the predictions of the trained model in the test sample as part of a result set.
    4. Shift the training and test windows forward by T2 periods.
    5. Repeat steps 2 to 4, adding the new predictions to the result set until the test set reaches the end of the time series.

    To avoid leakage, we should impose some window of excluded observations between the training and test samples (purging).


    Alternatively, opt for an “anchored” walk forward, in which each training sample uses all the information available since the initial date of the full sample; both variants are illustrated in the sketch below.
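
    A minimal sketch of this procedure, assuming integer time indices and a hypothetical helper walk_forward_splits; the window lengths T1 and T2, the purge gap, and the anchored flag are illustrative assumptions.

```python
# Minimal walk-forward split generator: rolling (or anchored/expanding) training
# windows of length T1, test windows of length T2, and an optional purge gap.
import numpy as np

def walk_forward_splits(n_obs, T1, T2, purge=0, anchored=False):
    """Yield (train_idx, test_idx) pairs, stopping when the next test window
    would extend past the end of the series."""
    start = 0
    while start + T1 + purge + T2 <= n_obs:
        train_start = 0 if anchored else start        # anchored: expanding window
        train_idx = np.arange(train_start, start + T1)
        test_start = start + T1 + purge                # purge: excluded observations
        test_idx = np.arange(test_start, test_start + T2)
        yield train_idx, test_idx
        start += T2                                    # shift both windows by T2 periods

# Example: 1,000 observations, T1 = 500, T2 = 100, 10 purged observations.
results = []   # the "result set" of out-of-sample evaluations
for train_idx, test_idx in walk_forward_splits(1_000, T1=500, T2=100, purge=10):
    # In a real application: re-train the model on train_idx, predict on test_idx,
    # and store those predictions. Here we only record the window boundaries.
    results.append((train_idx[0], train_idx[-1], test_idx[0], test_idx[-1]))
```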