ML General Issues
Some Jargon
Feature (vector): A feature is a characteristic of the data. For those from a stats background, think of it as an explanatory variable.
Feature Engineering: transforming input features to be more useful for the models, e.g. mapping categories to buckets, normalizing between -1 and 1, removing nulls.
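As an illustration, a minimal sketch of the normalization step, scaling a numeric feature into [-1, 1] (the function name and sample values are invented for illustration):

```python
import numpy as np

def scale_to_unit_interval(x):
    """Linearly rescale a numeric feature to the range [-1, 1]."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return 2 * (x - lo) / (hi - lo) - 1  # assumes hi > lo

ages = [18, 25, 40, 67]                  # hypothetical raw feature
print(scale_to_unit_interval(ages))      # [-1.  -0.714... -0.102...  1.]
```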
Classification/Regression: regression is predicting a number (e.g. a housing price); classification is predicting from a set of categories (e.g. predicting red/blue/green).
Classification: We will often be dividing our data into different classes or categories. In supervised learning this is done a priori. In unsupervised learning it happens as part of the learning process.
Bias/Variance: how much the output is determined by the features. More variance often means overfitting (the model follows the training data too closely); more bias often means underfitting (the model is too simple).
A/B testing: a statistical way of comparing two or more techniques to determine which performs better, and whether the difference is statistically significant. A sketch of one common test follows.
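A hedged sketch of one common approach, a two-proportion z-test on conversion rates (the counts and variant labels are invented):

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical results: conversions / visitors for variants A and B
conv_a, n_a = 200, 5000
conv_b, n_b = 250, 5000

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
se = sqrt(p_pool * (1 - p_pool) * (1/n_a + 1/n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))              # two-sided test

print(f"z = {z:.2f}, p-value = {p_value:.3f}")    # significant if p < 0.05
```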
Measuring Distances: We will want to measure the distances between feature vectors. The shorter the distance between two vectors, the closer in character are the two samples they represent. There are many ways to measure distance, and some of them are useful in machine learning; a small computational sketch follows the three definitions below.
Euclidean Distance: This is the classic measurement, using Pythagoras: just square the differences between vector entries, sum, and take the square root. This would be the default distance measure, ‘as the crow flies.’ It is the L2 norm.
Manhattan Distance: The Manhattan distance is the sum of the absolute values of the differences between entries in the vectors. The name derives from the distance one must travel along the roads in a city laid out in a grid pattern. This measure of distance can be preferable when dealing with data in high dimensions. This is the L1 norm.
Chebyshev Distance: Take the absolute differences between all the entries in the two vectors; the maximum of these numbers is the Chebyshev distance. This can be the preferred distance measure when you have many dimensions and most of them just aren’t important in classification. This is the L∞ norm.
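A minimal sketch computing all three distances for a pair of feature vectors (the vectors are made up):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.5])
diff = np.abs(a - b)

euclidean = np.sqrt(np.sum(diff ** 2))  # L2 norm: square, sum, square root
manhattan = np.sum(diff)                # L1 norm: sum of absolute differences
chebyshev = np.max(diff)                # L-infinity norm: largest absolute difference

print(euclidean, manhattan, chebyshev)  # 3.640..., 5.5, 3.0
```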
Confusion Matrix: A confusion matrix is a simple way of understanding how well an algorithm is doing at classifying data. It is really just the idea of false positives and false negatives, tabulated. The standard summary rates are below, with a small numerical sketch after the list.
• Accuracy rate: (TP + TN)/Total, where Total = TP + TN + FP + FN. This measures the fraction of times the classifier is correct.
• Error rate: 1 − (TP + TN)/Total.
• True positive rate or Recall: TP/(TP + FN).
• False positive rate: FP/(TN + FP).
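A small numerical sketch of these rates (the counts are invented):

```python
# Hypothetical counts from a binary classifier
TP, TN, FP, FN = 40, 30, 10, 20
total = TP + TN + FP + FN

accuracy = (TP + TN) / total        # fraction of correct predictions
error_rate = 1 - accuracy
recall = TP / (TP + FN)             # true positive rate
false_positive_rate = FP / (TN + FP)

print(accuracy, error_rate, recall, false_positive_rate)  # 0.7 0.3 0.666... 0.25
```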
Cost Functions: In machine learning a cost function or loss function is used to represent how far away a mathematical model is from the real data.
One adjusts the mathematical model, perhaps by varying parameters within the model, so as to minimize the cost function. This is then interpreted as giving the best model, of its type, that fits the data.
As a concrete example, fit a straight line y(x; θ₀, θ₁) = θ₀ + θ₁x. Call this the linear function, to emphasize the dependence on both the variable x and the two parameters θ₀ and θ₁.
We want to measure how far away the data, the y^(n)s, are from the function. A natural choice is the quadratic cost function J(θ₀, θ₁) = (1/2N) Σ_n (θ₀ + θ₁x^(n) − y^(n))².
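A sketch of that quadratic cost for a straight-line fit, assuming made-up data roughly following y = 1 + 2x:

```python
import numpy as np

def quadratic_cost(theta0, theta1, x, y):
    """J = (1/2N) * sum of squared residuals for the line theta0 + theta1*x."""
    residuals = theta0 + theta1 * x - y
    return np.mean(residuals ** 2) / 2

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])     # roughly y = 1 + 2x plus noise
print(quadratic_cost(1.0, 2.0, x, y))  # 0.0125: a small cost, the line fits well
```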
Regularization: a variety of approaches to reduce overfitting, including adding a penalty on the size of the weights to the loss function, or randomly dropping units during training (dropout).
One simple way to achieve regularization is early stopping: halt training before the model starts to fit the noise in the training data.
Sometimes one adds a regularization term to the cost function.
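A sketch of one such term, an L2 (ridge) penalty added to the quadratic cost from above; the penalty weight lam is a hyperparameter and its value here is arbitrary:

```python
import numpy as np

def regularized_cost(theta, x, y, lam=0.1):
    """Quadratic cost plus an L2 penalty that discourages large weights."""
    theta0, theta1 = theta
    residuals = theta0 + theta1 * x - y
    data_term = np.mean(residuals ** 2) / 2
    penalty = lam * theta1 ** 2          # penalize the slope, not the intercept
    return data_term + penalty

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(regularized_cost((1.0, 2.0), x, y))  # 0.0125 + 0.1*4 = 0.4125
```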
Gradient Descent
An iterative way to minimize the cost function: compute the gradient of the cost with respect to the parameters, step the parameters a small amount in the opposite (downhill) direction, and repeat until convergence.
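A minimal sketch of gradient descent applied to the quadratic cost above (the learning rate and iteration count are arbitrary illustrative choices):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

theta0, theta1 = 0.0, 0.0   # initial guess
lr = 0.1                    # learning rate (step size)

for _ in range(2000):
    residuals = theta0 + theta1 * x - y
    # Gradients of J = (1/2N) * sum(residuals^2) w.r.t. each parameter
    grad0 = np.mean(residuals)
    grad1 = np.mean(residuals * x)
    theta0 -= lr * grad0    # step downhill along the negative gradient
    theta1 -= lr * grad1

print(theta0, theta1)  # approaches the least-squares fit, roughly 1.09 and 1.94
```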
Bias vs. Variance
Bias is how far away the trained model is from the correct result on average, where "on average" means over many goes at training the model, using different data. Variance is a measure of how much those results vary from one training set to another.
In supervised learning, the prediction error can be decomposed into the bias, the variance and an irreducible part: Error = Bias² + Variance + irreducible error. Bias refers to simplifying assumptions made to learn the target function easily; variance refers to the sensitivity of the model to changes in the training data.
The goal of parameterization is to achieve a good tradeoff: low bias (the underlying pattern is not oversimplified) and low variance (the model is not overly sensitive to the specifics of the training data).
Hyperparameters: the "knobs" that you tweak during successive runs of training a model, e.g. the learning rate in gradient descent or the regularization weight.
Entropy
Entropy is a measure of uncertainty, but uncertainty linked to information content rather than the uncertainty associated with betting on the outcome of a coin toss. For a discrete distribution with probabilities p_i it is H = −Σ_i p_i log₂ p_i, measured in bits.
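A small sketch computing Shannon entropy in bits (the distributions are illustrative):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H = -sum p_i log2 p_i, in bits; zero-probability terms contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin
print(entropy([0.9, 0.1]))   # ~0.469 bits: a biased coin is less uncertain
```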