Truncated gradient descent example
Truncated Gradient Descent Example for VW
VW has an efficient (approximate) implementation of the truncated gradient algorithm for online L1 regularization. This page walks through an example on the RCV1 data set to illustrate its use. (Exact) online L2 regularization in VW works the same way, with the --l1 option below replaced by --l2.
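For orientation, the truncated gradient update as it is usually stated (Langford, Li and Zhang, "Sparse Online Learning via Truncated Gradient") can be written as below. This is a sketch of the textbook form, not a description of VW's internals: the symbols $\eta$ (learning rate), $g$ (gravity, i.e. the L1 strength), $\theta$ (truncation threshold), and $z_t$ (the example seen at step $t$) are notation introduced here, and how `--l1` maps onto $g$ in VW's approximate implementation is an open assumption.

```latex
w_{t+1} = T_1\bigl(w_t - \eta \nabla_w L(w_t, z_t),\ \eta g,\ \theta\bigr),
\qquad
T_1(v, \alpha, \theta) =
\begin{cases}
\max(0,\, v - \alpha) & \text{if } v \in [0, \theta], \\
\min(0,\, v + \alpha) & \text{if } v \in [-\theta, 0], \\
v & \text{otherwise,}
\end{cases}
```

where $T_1$ is applied to each coordinate separately. Small weights are pulled toward zero by $\alpha$ and clipped at exactly zero rather than crossing it, which is what produces sparse models.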
We use the same training and test data prepared as in the RCV1 example; the cache files are cache_train and cache_test. The test label file will be needed for classifier evaluation, and is obtained by
zcat rcv1.test.dat.gz | cut -d ' ' -f 1 | sed -e 's/^-1/0/' > test_labels
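To see what this pipeline does, here is a small sketch on made-up input lines (the two sample examples are hypothetical, not taken from RCV1, and `extract_labels` is a helper name introduced only for illustration). `cut` keeps the first space-separated field, which is the label, and `sed` rewrites `-1` as `0`, giving 0/1 labels.

```shell
# Sketch: the same cut/sed stages as above, wrapped in a function that reads stdin.
# extract_labels is a hypothetical helper name used only for illustration.
extract_labels() {
  cut -d ' ' -f 1 | sed -e 's/^-1/0/'
}

# Two made-up lines in VW input format: "<label> | <features>"
printf '%s\n' '1 | 13:0.5 42:0.1' '-1 | 7:0.3' | extract_labels
```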
The following four commands run (1) training, (2) testing, (3) ROC evaluation, and (4) model-size measurement, respectively:
vw --cache_file cache_train --final_regressor r_temp --passes 3 --readable_model r_temp.txt --l1 lambda1
vw --testonly --initial_regressor r_temp --cache_file cache_test --predictions p_out
perf -ROC -files test_labels p_out
cat r_temp.txt | grep -c ^[0-9]
where
- lambda1 is the L1 regularization level applied during online learning (the value passed to --l1)
- r_temp.txt is the human-readable model file, used to count the number of nonzero weights in the learned regressor
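The four commands can be wrapped into one function so that a single call produces one row of results for a given lambda1. This is a minimal sketch, not an official script: `run_l1_trial` is a hypothetical helper name, the file names are the ones used in the commands above, and the ROC column is simply whatever `perf` prints.

```shell
# Sketch: run train / test / evaluate / count for one lambda1 value.
# run_l1_trial is a hypothetical helper; file names follow the commands above.
run_l1_trial() {
  lambda1=$1
  # (1) train with L1 regularization and write a readable model
  vw --cache_file cache_train --final_regressor r_temp --passes 3 \
     --readable_model r_temp.txt --l1 "$lambda1" || return 1
  # (2) predict on the test cache
  vw --testonly --initial_regressor r_temp --cache_file cache_test \
     --predictions p_out || return 1
  # (3) ROC as printed by perf, (4) count of model lines starting with a digit
  roc=$(perf -ROC -files test_labels p_out)
  size=$(grep -c '^[0-9]' r_temp.txt)
  printf '%s | %s | %s\n' "$lambda1" "$roc" "$size"
}
```

For example, `for l in 0 1e-7 1e-6 1e-5; do run_l1_trial "$l"; done` would print one row per value, assuming `vw` and `perf` are on the PATH and the cache files exist. The grep pattern is quoted here so that the shell cannot expand `^[0-9]` as a filename glob.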
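Varying lambda1 shows the effect of L1 regularization on prediction performance (ROC in particular) and on model size. Up to lambda1 = 1e-6 the ROC is essentially unchanged while the model shrinks from 41409 to 26559 nonzero weights; larger values trade accuracy for sparsity: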
| lambda1 | ROC | Model Size |
|---|---|---|
| 0 | 0.98346 | 41409 |
| 5e-8 | 0.98345 | 39985 |
| 1e-7 | 0.98345 | 38822 |
| 5e-7 | 0.98345 | 31899 |
| 1e-6 | 0.98345 | 26559 |
| 5e-6 | 0.98319 | 12564 |
| 1e-5 | 0.98288 | 7647 |
| 5e-5 | 0.98068 | 1860 |
| 1e-4 | 0.97804 | 921 |
| 1e-3 | 0.92469 | 53 |
Note that L1 and L2 can be used simultaneously in VW, which resembles the elastic net. To make the effect of L2 regularization easier to see, the training data is first subsampled at a 1% rate, yielding a set of roughly 7.8K examples. Let cache_train_small be the cache file for this subsample. The previous commands are modified slightly by adding the --l2 option as follows:
vw --cache_file cache_train_small --final_regressor r_temp --passes 1000 --readable_model r_temp.txt --l1 lambda1 --l2 lambda2
vw --testonly --initial_regressor r_temp --cache_file cache_test --predictions p_out
perf -ROC -files test_labels p_out
cat r_temp.txt | grep -c ^[0-9]
Note that we set the number of passes to 1000 so that we can see the phenomenon of overfitting.
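The page does not say how the 1% subsample was drawn. One possible way, shown here only as a sketch, is to keep each line independently with probability 0.01 using awk. `subsample` is a hypothetical helper name, and the file names `rcv1.train.dat.gz` and `rcv1.train.small.dat` in the usage comment are assumptions.

```shell
# Sketch: keep each input line independently with probability `rate`.
# subsample is a hypothetical helper; the seed makes the draw repeatable.
subsample() {
  rate=$1
  seed=${2:-1}
  awk -v rate="$rate" -v seed="$seed" 'BEGIN { srand(seed) } rand() < rate'
}

# Assumed usage (file names are illustrative, not from this page):
#   zcat rcv1.train.dat.gz | subsample 0.01 > rcv1.train.small.dat
```

Because each line is kept independently, the subsample size is only approximately 1% of the input, which is consistent with the "roughly 7.8K examples" quoted above.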
The table below reports the ROC metric and model size for varying lambda1 and lambda2:
| lambda1 | lambda2 | ROC | Model Size |
|---|---|---|---|
| 0 | 0 | 0.96863 | 20832 |
| 0 | 0.0005 | 0.97364 | 21490 |
| 1e-7 | 0.0005 | 0.97364 | 21470 |
| 1e-6 | 0.0005 | 0.97363 | 21149 |
| 1e-5 | 0.0005 | 0.97348 | 14857 |
| 5e-5 | 0.0005 | 0.97231 | 4185 |
| 1e-4 | 0.0005 | 0.97003 | 2020 |