ML General Issues

Some Jargon

Feature (vector): A feature is a characteristic of the data. For those from a stats background, think of it as an explanatory variable.

Feature Engineering: transforming input features to be more useful for the models, e.g. mapping categories to buckets, normalizing between -1 and 1, removing nulls.
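
As a small illustration, a minimal sketch of these steps using pandas (the column names and values are made up):

    import pandas as pd

    df = pd.DataFrame({
        "colour": ["red", "blue", None, "green"],
        "price":  [100.0, 250.0, 175.0, None],
    })

    # Remove nulls by filling them with a sensible default
    df["colour"] = df["colour"].fillna("unknown")
    df["price"] = df["price"].fillna(df["price"].median())

    # Map categories to integer buckets (one-hot encoding is another common choice)
    df["colour_bucket"] = df["colour"].astype("category").cat.codes

    # Normalize a numeric feature into the range [-1, 1]
    lo, hi = df["price"].min(), df["price"].max()
    df["price_scaled"] = 2 * (df["price"] - lo) / (hi - lo) - 1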

Classification/Regression: regression is predicting a number (e.g. a housing price); classification is predicting from a set of categories (e.g. predicting red/blue/green).

Classification: We will often be dividing our data into different classes or categories. In supervised learning this is done a priori. In unsupervised learning it happens as part of the learning process.

Bias/Variance: how much the output is determined by the features. More variance often means overfitting; more bias can mean the model is too simple and underfits.

A/B testing: a statistical way of comparing two or more techniques to determine which performs better, and whether the difference is statistically significant.
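
One common way to do this (not specified in these notes) is a two-proportion z-test. A sketch assuming statsmodels is available, with made-up conversion counts for variants A and B:

    from statsmodels.stats.proportion import proportions_ztest

    conversions = [120, 150]   # conversions for A and B (illustrative numbers)
    samples = [1000, 1000]     # users shown A and B

    z_stat, p_value = proportions_ztest(conversions, samples)
    print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
    # If p_value is below the chosen significance level (e.g. 0.05),
    # the difference between A and B is statistically significant.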

Measuring Distances: We will often want to measure the distance between feature vectors. The shorter the distance between two vectors, the closer in character are the two samples they represent. There are many ways to measure distance, and some of them are useful in machine learning.

Euclidean Distance: This is the classic measurement, using Pythagoras: just square the differences between vector entries, sum, and take the square root. This would be the default distance measure, ‘as the crow flies.’ It’s the L2 norm.

Manhattan Distance: The Manhattan distance is the sum of the absolute values of the differences between entries in the vectors. The name derives from the distance one must travel along the roads in a city laid out in a grid pattern. This measure of distance can be preferable when dealing with data in high dimensions. This is the L1 norm.

Chebyshev Distance: Take the absolute differences between all the entries in the two vectors; the maximum of these numbers is the Chebyshev distance. This can be the preferred distance measure when you have many dimensions and most of them just aren’t important in classification. This is the L∞ norm.
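
A small sketch computing all three distances for a pair of feature vectors (NumPy, made-up values):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([4.0, 0.0, 3.5])

    diff = np.abs(a - b)
    euclidean = np.sqrt(np.sum(diff ** 2))   # L2 norm
    manhattan = np.sum(diff)                 # L1 norm
    chebyshev = np.max(diff)                 # L-infinity norm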

Confusion Matrix: A confusion matrix is a simple way of understanding how well an algorithm is doing at classifying data. It is really just the idea of false positives and false negatives.

• Accuracy rate: (TP + TN) / Total, where Total = TP + TN + FP + FN. This measures the fraction of times the classifier is correct.

• Error rate: 1 − (TP + TN) / Total.

• True positive rate or Recall: TP / (TP + FN)

• False positive rate: FP / (TN + FP)
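
The same rates as a small sketch (the counts are illustrative):

    TP, TN, FP, FN = 40, 45, 5, 10
    total = TP + TN + FP + FN

    accuracy = (TP + TN) / total          # fraction of correct classifications
    error_rate = 1 - accuracy
    recall = TP / (TP + FN)               # true positive rate
    false_positive_rate = FP / (TN + FP)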

Cost Functions: In machine learning a cost function or loss function is used to represent how far away a mathematical model is from the real data.

One adjusts the mathematical model, perhaps by varying parameters within the model, so as to minimize the cost function. This is then interpreted as giving the best model, of its type, that fits the data.

Suppose we fit a straight line to data points (x^(n), y^(n)). Call this linear function y(x; θ0, θ1) = θ0 + θ1 x, to emphasize the dependence on both the variable x and the two parameters, θ0 and θ1.

We want to measure how far away the data, the y^(n)s, are from the function y(x^(n); θ0, θ1). A common choice is the quadratic cost function J(θ0, θ1) = (1/2N) Σn ( y(x^(n); θ0, θ1) − y^(n) )².
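
A minimal sketch of this quadratic cost for the straight-line fit (the data values are made up):

    import numpy as np

    def cost(theta0, theta1, x, y):
        # Quadratic cost: half the mean squared error between the line and the data
        predictions = theta0 + theta1 * x
        return 0.5 * np.mean((predictions - y) ** 2)

    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.1, 2.9, 5.2, 6.8])
    print(cost(1.0, 2.0, x, y))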


Regularization: a variety of approaches to reduce overfitting, including adding a penalty on the weights to the loss function and randomly dropping units during training (dropout).

One simple way to achieve regularization is early stopping: halt training before the model starts to overfit the training data.

Sometimes one adds a regularization term to the cost function.
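
For instance, an L2 (ridge) penalty on the slope can be added to the quadratic cost above; a sketch, where lam is an illustrative penalty weight:

    import numpy as np

    def regularized_cost(theta0, theta1, x, y, lam=0.1):
        # Quadratic cost plus an L2 penalty on the slope parameter
        predictions = theta0 + theta1 * x
        data_term = 0.5 * np.mean((predictions - y) ** 2)
        penalty = lam * theta1 ** 2      # the intercept is usually not penalized
        return data_term + penalty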

Gradient Descent 
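
Gradient descent minimizes the cost function by repeatedly stepping the parameters in the direction of the negative gradient. A minimal sketch for the straight-line fit above (the learning rate and iteration count are arbitrary illustrative choices):

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.1, 2.9, 5.2, 6.8])

    theta0, theta1 = 0.0, 0.0
    learning_rate = 0.05

    for _ in range(2000):
        error = (theta0 + theta1 * x) - y
        # Partial derivatives of the quadratic cost with respect to each parameter
        grad0 = np.mean(error)
        grad1 = np.mean(error * x)
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1

    print(theta0, theta1)   # approaches the least-squares intercept and slope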


Bias vs. Variance

Bias is how far away the trained model is from the correct result on average, where "on average" means over many goes at training the model, using different data. Variance is a measure of how much the trained model varies from one of those training sets to another.

In supervised learning, the prediction error is composed of the bias, the variance and an irreducible part. Bias refers to the simplifying assumptions made so the target function is easier to learn. Variance refers to the sensitivity of the model to changes in the training data.
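
In symbols, the standard decomposition of the expected squared error (notation not used elsewhere in these notes) is:

    \mathbb{E}\big[(y - \hat{f}(x))^2\big]
      = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
      + \underbrace{\operatorname{Var}\big[\hat{f}(x)\big]}_{\text{variance}}
      + \underbrace{\sigma^2}_{\text{irreducible error}}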

The goal of parameterization is to achieve a good tradeoff: low bias (the underlying pattern is not oversimplified) and low variance (the model is not sensitive to the specifics of the training data).

Hyperparameters: the "knobs" that you tweak during successive runs of training a model.

Entropy

Entropy is a measure of uncertainty, but uncertainty linked to information content rather than the uncertainty associated with betting on the outcome of a coin toss.
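
A small sketch of Shannon entropy (in bits) for a discrete distribution; a fair coin carries 1 bit of information per toss, a heavily biased coin much less:

    import numpy as np

    def entropy(probs):
        # Shannon entropy in bits of a discrete probability distribution
        probs = np.asarray(probs, dtype=float)
        probs = probs[probs > 0]             # 0 * log(0) is treated as 0
        return -np.sum(probs * np.log2(probs))

    print(entropy([0.5, 0.5]))   # 1.0 bit    (fair coin)
    print(entropy([0.9, 0.1]))   # ~0.47 bits (biased coin)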

Activation Functions: mathematical functions that introduce non-linearity into a network, e.g. ReLU, tanh.
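
A sketch of two common activations applied elementwise (NumPy):

    import numpy as np

    def relu(z):
        # ReLU: zero for negative inputs, identity for positive inputs
        return np.maximum(0.0, z)

    z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu(z))      # [0.  0.  0.  0.5 2. ]
    print(np.tanh(z))   # squashes values into (-1, 1)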

The maximum likelihood estimate (MLE) is a way to estimate the value of a parameter of interest.
The MLE is the value of the parameter (for example, the probability p of heads in a coin-tossing experiment) that maximizes the likelihood of the observed data.

In maximum likelihood estimation (MLE), you are typically given a probability density function (PDF) or a probability mass function (PMF), depending on whether you are dealing with continuous or discrete data. The goal of MLE is to find the parameters of this distribution that maximize the likelihood of the observed data.
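
As a minimal sketch, take a Bernoulli (coin-toss) model with parameter p; the made-up data below has 7 heads in 10 tosses, so the MLE is simply the observed fraction of heads, which also maximizes the log-likelihood on a grid:

    import numpy as np

    tosses = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])   # 1 = heads (illustrative data)

    def log_likelihood(p, data):
        # Bernoulli log-likelihood of the data for a given parameter p
        return np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

    p_grid = np.linspace(0.01, 0.99, 99)
    p_mle = p_grid[np.argmax([log_likelihood(p, tosses) for p in p_grid])]

    print(p_mle)           # close to 0.7
    print(tosses.mean())   # the analytic MLE: fraction of heads = 0.7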