- EDA
- Set up the target: objective function
- Preprocessing and feature engineering
- Set up the data split: train/val/test or train/train-dev/dev/test
- Simple model:
  - If starting on a new problem: start with a simple model.
  - If it is a well-studied problem, you might want to start from an existing architecture.
  - Transfer learning from task A to task B, if: both tasks have the same input; low-level features learned from task A could transfer; there is a lot more data for task A than for task B. Task B data can be on the order of 100 samples or 1 hour of data.
- Use bias/variance/error analysis and iterate on the model, hyperparameters, and evaluation metrics
- Offline statistical tests
- Online A/B tests => business metrics != model metrics
- Productionize: serving architecture. If online learning, also the data input pipeline, training, and model versioning.
import numpy as np

def random_prediction(train_y):
    # Baseline: predict a label drawn at random from the training labels
    return np.random.choice(train_y)

def zero_rule_prediction(train_y, task="classification"):
    # Baseline: majority class for classification; mean (or moving average) for regression
    if task == "classification":
        values, counts = np.unique(train_y, return_counts=True)
        return values[np.argmax(counts)]
    return np.mean(train_y)
J(parameters), which we minimize. The optimum is the lowest point of the theta versus J(theta) curve.
See more in the Keras documentation.
A scalar metric that evaluates the "goodness"/"fit" of the model: it is computed on training data (for learning) and evaluated on validation data.
- Used for the learning process
- Evaluated on validation data to pick the best model and hyperparameters
- MAE: normalized L1-norm error (LAE)
- MSE: normalized L2-norm error (LSE)
- RMSE: penalizes large errors more than small errors. Taking the root brings the units of measurement back to the original scale, which makes interpretation easier.
- R2: coefficient of determination; also explains the variation of the residuals
  - = regression variation / total variation, where total variation = residual variation + regression variation
- Notes:
  - L1: Robust to outliers? Sparse? Multiple solutions?
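A minimal sketch of computing these regression metrics, assuming scikit-learn is available and using made-up y_true/y_pred arrays:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)   # L1-based error
mse = mean_squared_error(y_true, y_pred)    # L2-based error
rmse = np.sqrt(mse)                         # back in the original units
r2 = r2_score(y_true, y_pred)               # coefficient of determination
print(mae, mse, rmse, r2)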
Hinge loss:
- Absolute number of errors
- No probabilities
- Might lead to better accuracy

Notes: hinge vs log loss
- Log loss gives probabilities, so it optimizes for them. So, if accuracy is more important and probabilities are not, hinge loss might perform better.
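A hedged illustration of the trade-off, assuming scikit-learn's SGDClassifier (the loss name "log_loss" applies to recent scikit-learn versions; older ones use "log"):

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Hinge loss -> linear SVM: no probabilities, optimizes the margin (accuracy-oriented)
svm = SGDClassifier(loss="hinge", random_state=0).fit(X, y)

# Log loss -> logistic regression: gives probability estimates via predict_proba
logreg = SGDClassifier(loss="log_loss", random_state=0).fit(X, y)
print(logreg.predict_proba(X[:3]))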
- Contours of the cost function will be very skewed if features are not scaled
- Gradient descent is slower along dimensions with larger ranges, so training takes longer

Normalization
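A small sketch of two common scalers, assuming scikit-learn and a toy feature matrix:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 1000.0]])

# Standardization: zero mean, unit variance per feature
X_std = StandardScaler().fit_transform(X)

# Min-max normalization: rescale each feature to [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)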
Types of data:
- Continuous
- Discrete
- Categorical
- Ordinal
- Binary
- Unstructured
- Anything else?
- datetime

Questions:
- Do all models accept all types of data? What kind of preprocessing is required?
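As a partial answer, a sketch of per-type preprocessing with scikit-learn's ColumnTransformer (the column names and categories here are made up):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical frame mixing the data types listed above
df = pd.DataFrame({
    "age": [25, 32, 47],          # continuous
    "city": ["NY", "SF", "NY"],   # categorical (nominal)
    "size": ["S", "M", "L"],      # ordinal
})

pre = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ("ord", OrdinalEncoder(categories=[["S", "M", "L"]]), ["size"]),
])
X = pre.fit_transform(df)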
- Training set: to fit the model and understand the expressivity of the model. Compare with human-level performance.
- Validation set: to pick the best model among:
  - various models
  - hyperparameter settings
  - and to check generalization on out-of-training data.
- Test set: unbiased estimate of the model. The test set should be treated as a black box; nothing can be tuned based on test-set results.
- Split based on dataset size:
  - 100 / 1K / 10K / 100K examples => 60/20/20
  - 100K / 1000K examples => 99/1/1
  - The test set should be big enough to give a high-confidence estimate of system performance. TODO: statistical significance?
  - If a confidence estimate is not required, a train/valid split might be enough, but it is not recommended.
- Distributions => valid and test should come from the same distribution, otherwise you are shooting at the wrong target
  - If your test data from the real use case is small and you have another, bigger dataset to train on, you might want train and valid/test to come from different distributions
  - For example: 200K web cat images, 10K mobile cat images
    - Bad split (random shuffle): 205K / 2.5K / 2.5K
    - Better split (different distributions): 205K (200K web + 5K mobile) / 2.5K (mobile) / 2.5K (mobile)
- Have a 4-way split: train, train-dev, dev/valid, test, where train and train-dev come from the same distribution and dev and test come from the same distribution
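A minimal split sketch with scikit-learn, assuming X and y are already-loaded arrays (ratios follow the 60/20/20 guideline above):

from sklearn.model_selection import train_test_split

# 60% train, then split the remaining 40% evenly into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)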
- Hold-out
- K-fold cross-validation (better for small datasets; also gives more robust estimates)
- Bootstrapping
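A k-fold cross-validation sketch, again assuming X and y are numpy arrays and using a logistic regression placeholder model:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))
print(np.mean(scores), np.std(scores))  # more robust estimate than a single hold-out split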
- The derivative is equivalent to the slope of the tangent at J(theta). If it is positive, move left; else move right.
- The slope is zero at a minimum, hence theta remains unchanged.
- Takes smaller steps as the slope gets closer to zero.
- Finds a local minimum.

Variants (see the sketch after this list):
- Batch: gradient computed over the full dataset for each update
- Stochastic: one pass over the full dataset per epoch, updating on each sample
  - Randomize the data in each epoch
  - Calculate the gradient for each sample
- Mini-batch:
  - Faster than stochastic, slower than batch ("regular") gradient descent
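A minimal mini-batch gradient descent sketch for linear regression with MSE loss (plain numpy; the function name and defaults are illustrative):

import numpy as np

def minibatch_gd(X, y, lr=0.01, epochs=100, batch_size=32):
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        idx = np.random.permutation(n)              # randomize the data each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            # Gradient of the MSE loss on the mini-batch
            grad = 2 * X[batch].T @ (X[batch] @ theta - y[batch]) / len(batch)
            theta -= lr * grad                      # step in the downhill direction
    return theta

Setting batch_size equal to n recovers batch gradient descent, and batch_size=1 recovers stochastic gradient descent.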
Gradient descent alternatives:
- Conjugate gradient
- BFGS
- L-BFGS

TODO: How to check for convergence? When to stop training?
- If the loss is not strictly decreasing, decrease the learning rate
- If the loss is decreasing too slowly, increase the learning rate
- High bias: accuracy on the training set is much worse than human-level performance / Bayes error
  - NN: bigger/better architecture, better optimization, train longer
  - Traditional ML: additional features, polynomial features, decrease the regularization parameter
- High variance: accuracy on the validation set is much worse than on the training set
  - Regularization
  - More training data, data augmentation
  - Better architecture / hyperparameter search
  - Traditional ML: smaller set of features, increase the regularization parameter
- Error analysis on validation:
  - Dev set: if the percentage of errors due to incorrect labels is significant enough to make it hard to choose the right model, then it might be better to fix the incorrect labels
    - Apply the same process to the dev and test sets to make sure they come from the same distribution
  - Training set: mislabels in training are not an issue for NNs as long as the training set is big enough and the mislabels are not systematic
  - Are any specific classes consistently wrong? Might have to improve the training data
- Test performance is bad / overfitting to the dev set
  - Use a bigger dev set
- In the case of a 4-way split:
  - dev accuracy is bad: data mismatch
    - Do manual analysis of the train and valid sets to see the differences, and make the data more similar (synthesize data, add noise, etc.). But be careful: you might be simulating only a subset.
- Performance on real use cases: A/B testing
  - Change the validation set, the cost function, or the evaluation metric if A/B tests show unexpected results
    - e.g., add a class-weight term
The objective function without regularization can be used as an evaluation metric. But we might need other metrics, since "fit" alone might not be the best indicator of "goodness". Some possible reasons:
- Error and accuracy tell us how well we are predicting, but high accuracy does not necessarily mean the model performs well, for the following reasons:
  - Class imbalance: accuracy favors the majority class and hence can be high on an imbalanced dataset
  - It treats the importance of predicting all classes equally, which might not be true in the real world
  - It treats all error types equally, but some types of errors might matter more than others
- Interpretability
- Calibration
- Scalability
- Inference time
- Memory footprint
- Simple to explain

Note:
- Have a single optimizing metric and possibly a few satisficing metrics
- A single-number evaluation metric like the F1 score is useful
- RCE (relative cross entropy): build a baseline model (for example: predict 1 at random 5% of the time) and compare the model against that baseline (see the sketch below).
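A hedged sketch of comparing a model's cross entropy against a naive baseline; the percentage formula here is one common way RCE is defined, not necessarily the exact definition intended in these notes:

import numpy as np
from sklearn.metrics import log_loss

# y_true: labels; p_model: model probabilities; the baseline always predicts the base rate
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 0, 0, 1])
p_model = np.array([0.1, 0.2, 0.8, 0.1, 0.6, 0.3, 0.1, 0.2, 0.1, 0.7])
p_baseline = np.full_like(p_model, y_true.mean())

ce_model = log_loss(y_true, p_model)
ce_baseline = log_loss(y_true, p_baseline)
rce = 100 * (ce_baseline - ce_model) / ce_baseline  # > 0 means better than the baseline
print(rce)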
Analyzing types of errors:
1. Confusion matrix: TP, TN, FP, FN
   - Also possible for multi-class classification

| Observations \ Predictions | PP | NP | | |
| --- | --- | --- | --- | --- |
| PO | TP | FN | FNR (Type 2) = FN/PO | TPR (Sensitivity/Recall) = TP/PO |
| NO | FP | TN | FPR (Type 1) = FP/NO | TNR (Specificity) = TN/NO |
| | Precision = TP/PP | | | |
2. Precision-Recall curve: recall vs precision
3. ROC curve: plots FPR vs TPR for various thresholds of the predicted probability. One curve per class weight
   - AUC: area under the ROC curve, a single metric representative of "goodness"
4. F1 score: lies between 0 and 1; 1 = best
5. Other:
   - Gain and lift chart
   - K-S chart
   - MCC
   - Gini coefficient
   - Concordant-discordant ratio
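A sketch of computing the PR curve, ROC curve, and AUC with scikit-learn, assuming y_val and a fitted classifier clf with predict_proba (these names are placeholders):

from sklearn.metrics import precision_recall_curve, roc_curve, roc_auc_score

proba = clf.predict_proba(X_val)[:, 1]          # predicted probability of the positive class

precision, recall, pr_thresholds = precision_recall_curve(y_val, proba)
fpr, tpr, roc_thresholds = roc_curve(y_val, proba)
auc = roc_auc_score(y_val, proba)               # single-number summary of the ROC curve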
Notes:
- PR curve is better for imbalanced data
  - TODO: discuss and derive why
- Use class weights in the loss function
- Set the classification threshold
- Which of these metrics are not optimizable during training? AUC?
- Early stopping is not an orthogonal knob: it can affect fit as well as regularization
- Class weights can also be learned?
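A small sketch of the class-weight and threshold knobs mentioned above, assuming scikit-learn and the X_train/y_train/X_val names from the earlier split sketch:

from sklearn.linear_model import LogisticRegression

# "balanced" reweights classes inversely to their frequency in y_train
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

# Move the classification threshold away from the default 0.5, e.g. to favor recall
proba = clf.predict_proba(X_val)[:, 1]
y_pred = (proba >= 0.3).astype(int)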
TODO: How to do this for linear regression? Predictions on bootstrapped data? To estimate how it would perform if we collected the data again?

- Error rate: F(P+N) / all predictions = (FP + FN) / all predictions
  - What percent of overall predictions are wrong?
- Accuracy: T(P+N) / all predictions = (TP + TN) / all predictions
  - What percent of overall predictions are correct?
- FPR: What percent of negative samples have been predicted as positive?
  - Alpha (Type 1 error)
  - FP / negative observations (or) FP/(TN + FP)
- FNR: What percent of positive samples have been predicted as negative?
  - Beta (Type 2 error)
  - FN / positive observations (or) FN/(TP + FN)
- TNR (Specificity): What percent of negative samples have been correctly labeled negative?
  - 1 - alpha => 1 - FP/(TN + FP) = TN / negative observations
- TPR (Recall/Sensitivity): What percent of positive samples have been correctly labeled positive?
  - 1 - beta = 1 - FN/(TP + FN) = TP / positive observations
  - Use case: expensive to miss positive cases
- Precision: confidence in a positive prediction
  - What percent of positive predictions are actually correct?
  - TP / positive predictions (or) TP/(TP + FP)
  - Use case: expensive to diagnose incorrectly (false positives are costly)
- F1 score:
  - Harmonic mean of precision and recall
  - Weights precision and recall equally
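These rates can be computed directly from the confusion matrix; a small sketch with made-up labels:

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print((fp + fn) / len(y_true))           # error rate
print(fp / (fp + tn))                    # FPR (Type 1)
print(fn / (fn + tp))                    # FNR (Type 2)
print(recall_score(y_true, y_pred))      # TPR = TP/(TP + FN)
print(precision_score(y_true, y_pred))   # TP/(TP + FP)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall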
ROC (Receiver Operating Characteristic):
- True positive rate (Recall or Sensitivity) versus false positive rate (Type 1 error)

Average Precision (AP / mAP):
- Popular in the information retrieval domain; used for ranked results. It is the area under the precision-recall curve.
- This integral is in practice replaced with a finite sum over every position in the ranked sequence of documents:
  - AveP = (sum over k = 1..n of P(k) * rel(k)) / (number of relevant documents)
  - where k is the rank in the sequence of retrieved documents, n is the number of retrieved documents, P(k) is the precision at cut-off k in the list, and rel(k) is an indicator function equaling 1 if the item at rank k is a relevant document, zero otherwise.
- Some authors (PASCAL VOC) choose to interpolate the p(r) function to reduce the impact of "wiggles" in the curve:
  - p_interp(r) = max over r' >= r of p(r'), i.e., take the maximum precision over all recalls greater than r.
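A minimal average-precision sketch over a ranked result list (the function name is illustrative):

import numpy as np

def average_precision(rel):
    # rel: binary relevance of ranked results, e.g. [1, 0, 1, 1, 0]
    rel = np.asarray(rel, dtype=float)
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return (precision_at_k * rel).sum() / max(rel.sum(), 1)

print(average_precision([1, 0, 1, 1, 0]))  # ~0.806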
- Addition / Deletion / Substitution
- Perplexity: lower is better
- Translation:
  - BLEU: precision-based
  - METEOR: leverages stems
- MOT (Multi-Object Tracking)
import numpy as np
def beta(x, y): return np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # how sensitive y is to changes in x
def corr_coefficient(x, y): return np.corrcoef(x, y)[0, 1]     # R: how correlated y is with x; does not capture sensitivity
def corr_determ(x, y): return corr_coefficient(x, y) ** 2      # R^2
http://www.ritchieng.com/machine-learning-evaluate-classification-model/