Streaming mllib [SPARK-2438][MLLIB] #1361
Conversation
- Abstract class to support a variety of streaming regression analyses
- Example concrete class for streaming linear regression
- Example usage: continually train on one data stream and test on another (see the sketch below)
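To make that last example concrete, here is a minimal usage sketch modeled on the `StreamingLinearRegression` example this PR adds. The directory names, the three-feature initial weight vector, and the exact `predictOn` signature are assumptions for illustration, not the PR's verbatim code.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingLinearRegressionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingLinearRegressionSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Two streams of LabeledPoint-formatted text, e.g. "(1.0,[0.5,0.3,0.2])"
    val trainingData = ssc.textFileStream("training/dir").map(LabeledPoint.parse)
    val testData = ssc.textFileStream("test/dir").map(LabeledPoint.parse)

    // The model must be seeded with initial weights of the right dimension
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.dense(0.0, 0.0, 0.0))

    model.trainOn(trainingData)                        // update the model on every training batch
    model.predictOn(testData.map(_.features)).print()  // continually predict on the test stream

    ssc.start()
    ssc.awaitTermination()
  }
}
```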
Can one of the admins verify this patch?
@freeman-lab This is great! Could you create a JIRA and add it to the title?
Jenkins, add to whitelist.
Jenkins, test this please.
Merged build triggered.
Merged build started.
Awesome, time to have some fun :D
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16516/
@mengxr great! Just created a JIRA (https://issues.apache.org/jira/browse/SPARK-2438) and added it to the title.
@freeman-lab Could you add some unit tests? There should be some examples under streaming and mllib.
- Test parameter estimate accuracy after several updates (sketched below)
- Test parameter accuracy improvement after each batch
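For a sense of what those two checks look like in code, here is an illustrative sketch; the true weights of (10.0, 10.0), the 0.1 tolerance, and the `batchErrors` sequence are assumptions for the example, not the test suite's verbatim code.

```scala
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

// Check 1 (sketch): after several batches generated from true weights (10.0, 10.0),
// the streamed estimates should land near the truth.
def checkFinalAccuracy(model: StreamingLinearRegressionWithSGD): Unit = {
  val w = model.latestModel().weights
  assert(math.abs(w(0) - 10.0) < 0.1, "first weight is not within tolerance")
  assert(math.abs(w(1) - 10.0) < 0.1, "second weight is not within tolerance")
}

// Check 2 (sketch, strict form): the parameter error must shrink on every batch.
// `batchErrors` is assumed to hold the error after each batch, oldest first.
def checkMonotoneImprovement(batchErrors: Seq[Double]): Unit = {
  val deltas = batchErrors.zip(batchErrors.tail).map { case (prev, next) => next - prev }
  assert(deltas.forall(_ < 0), "error did not decrease on every batch")
}
```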
Jenkins, add to whitelist.
Jenkins, test this please.
QA tests have started for PR 1361. This patch merges cleanly.
QA results for PR 1361:
Looks like the basic test for correct final params passes, but not the stricter test for improvement on every update. Both pass locally. My guess is that it's running a bit slower on Jenkins, so the updates don't complete fast enough (I can create a failure locally by making the test data rate too high). I'll play with this; it might work to just slow down the data rate.
- Slower simulated data rates and updates
- Softens requirement for strict error reduction, but still ensures error stability, and error reduction on at least a subset of updates (see the sketch below)
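Concretely, the softened check looks roughly like this sketch; the names and thresholds are assumptions, not the suite's exact code. The strict per-batch assertion is replaced by a stability bound plus a majority-of-batches improvement requirement.

```scala
// `batchErrors` is assumed to hold the parameter error after each batch, oldest first.
def checkConvergence(batchErrors: Seq[Double]): Unit = {
  val deltas = batchErrors.zip(batchErrors.tail).map { case (prev, next) => next - prev }
  // Stability: no single batch may make the error dramatically worse...
  assert(deltas.forall(_ <= 0.1), "error increased sharply on some batch")
  // ...and the error must still shrink on at least a subset (here, a majority) of updates.
  assert(deltas.count(_ < 0) > deltas.length / 2, "error did not decrease on most updates")
}
```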
@mengxr mind retesting? I tried to make the convergence test more robust in a couple ways. If we still have issues we might need to rethink that test further. Thanks!
QA tests have started for PR 1361. This patch merges cleanly.
QA results for PR 1361:
- Also deleted companion object
- Renamed file for consistency
- Explained usage in documentation
@mengxr Done! Removed the static methods (and made the class public), and added those usage notes to the documentation.
// Sample a subset (fraction miniBatchFraction) of the total data
// compute and sum up the subgradients on this subset (this is one map-reduce)
val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
  .aggregate((BDV.zeros[Double](weights.size), 0.0))(
`aggregate` -> `treeAggregate`. We use a tree pattern to avoid sending too much data to the driver. Does it hurt streaming update performance?
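For reference, a sketch of what the suggested change looks like, written against a simplified RDD of (loss, subgradient) pairs rather than the PR's actual seqOp; `treeAggregate` is shown on the RDD directly, as in later Spark releases. The tree pattern pre-combines partition results on the executors, so only a few partial sums reach the driver, whereas a plain `aggregate` ships one partial result per partition straight to it.

```scala
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.rdd.RDD

// Sum per-example (loss, subgradient) pairs into a single (gradientSum, lossSum).
// treeAggregate (default depth 2) combines partials on the executors in a tree,
// which matters when the weight vector is wide and there are many partitions.
def sumSubgradients(points: RDD[(Double, BDV[Double])], numFeatures: Int): (BDV[Double], Double) =
  points.treeAggregate((BDV.zeros[Double](numFeatures), 0.0))(
    seqOp = { case ((gradSum, lossSum), (loss, grad)) => (gradSum + grad, lossSum + loss) },
    combOp = { case ((g1, l1), (g2, l2)) => (g1 + g2, l1 + l2) })
```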
It's totally fine; I might have lost it in the merge. Putting it back.
Same for broadcasting, sorry, fixing...
QA tests have started for PR 1361. This patch merges cleanly.
QA tests have started for PR 1361. This patch merges cleanly.
@@ -174,17 +182,18 @@ object GradientDescent extends Logging {
      weights, Vectors.dense(new Array[Double](weights.size)), 0, 1, regParam)._2

    for (i <- 1 to numIterations) {
      val bcWeights = data.context.broadcast(weights)
Broadcasting the weights is actually important for performance. Did you experience any problem with it? It may be an orthogonal issue. Maybe we should keep this code block unchanged.
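For context, a sketch of why the broadcast matters in this loop (a simplified stand-in, not the file's actual code): without the broadcast, the weight vector is serialized into every task's closure on every iteration; with it, each executor fetches a single copy per iteration and tasks read it through `.value`.

```scala
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.rdd.RDD

// Mini-batch SGD on (label, features) pairs, written to show the broadcast pattern only.
def runSGD(data: RDD[(Double, BDV[Double])], numFeatures: Int, numIterations: Int,
           miniBatchFraction: Double, stepSize: Double): BDV[Double] = {
  var weights = BDV.zeros[Double](numFeatures)
  for (i <- 1 to numIterations) {
    // One broadcast per iteration: executors fetch the current weights once,
    // rather than receiving a serialized copy inside every task closure.
    val bcWeights = data.context.broadcast(weights)
    val gradientSum = data
      .sample(false, miniBatchFraction, 42 + i)
      .map { case (label, features) =>
        // subgradient of squared error, read through the broadcast via .value
        features * ((bcWeights.value dot features) - label)
      }
      .reduce(_ + _)
    weights = weights - gradientSum * (stepSize / math.sqrt(i))
    bcWeights.unpersist()  // release executor copies once the iteration is done
  }
  weights
}
```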
This was my mistake, should be fixed now.
QA results for PR 1361:
QA results for PR 1361:
QA tests have started for PR 1361. This patch merges cleanly.
QA results for PR 1361:
LGTM. Merged into master. Thanks a lot for putting Streaming and MLlib together!
Yay!!!
This PR implements a streaming linear regression analysis, in which a linear regression model is trained online as new data arrive. The design is based on discussions with tdas and mengxr, in which we determined how to add this functionality in a general way, with minimal changes to existing libraries.

__Summary of additions:__

_StreamingLinearAlgorithm_ - An abstract class for fitting generalized linear models online to streaming data, including training on (and updating) a model, and making predictions.

_StreamingLinearRegressionWithSGD_ - Class and companion object for running streaming linear regression

_StreamingLinearRegressionTestSuite_ - Unit tests

_StreamingLinearRegression_ - Example use case: fitting a model online to data from one stream, and making predictions on other data

__Notes__

- If this looks good, I can use the StreamingLinearAlgorithm class to easily implement other analyses that follow the same logic (Ridge, Lasso, Logistic, SVM).

Author: Jeremy Freeman <the.freeman.lab@gmail.com>
Author: freeman <the.freeman.lab@gmail.com>

Closes apache#1361 from freeman-lab/streaming-mllib and squashes the following commits:

775ea29 [Jeremy Freeman] Throw error if user doesn't initialize weights
4086fee [Jeremy Freeman] Fixed current weight formatting
8b95b27 [Jeremy Freeman] Restored broadcasting
29f27ec [Jeremy Freeman] Formatting
8711c41 [Jeremy Freeman] Used return to avoid indentation
777b596 [Jeremy Freeman] Restored treeAggregate
74cf440 [Jeremy Freeman] Removed static methods
d28cf9a [Jeremy Freeman] Added usage notes
c3326e7 [Jeremy Freeman] Improved documentation
9541a41 [Jeremy Freeman] Merge remote-tracking branch 'upstream/master' into streaming-mllib
66eba5e [Jeremy Freeman] Fixed line lengths
2fe0720 [Jeremy Freeman] Minor cleanup
7d51378 [Jeremy Freeman] Moved streaming loader to MLUtils
b9b69f6 [Jeremy Freeman] Added setter methods
c3f8b5a [Jeremy Freeman] Modified logging
00aafdc [Jeremy Freeman] Add modifiers
14b801e [Jeremy Freeman] Name changes
c7d38a3 [Jeremy Freeman] Move check for empty data to GradientDescent
4b0a5d3 [Jeremy Freeman] Cleaned up tests
74188d6 [Jeremy Freeman] Eliminate dependency on commons
50dd237 [Jeremy Freeman] Removed experimental tag
6bfe1e6 [Jeremy Freeman] Fixed imports
a2a63ad [freeman] Makes convergence test more robust
86220bc [freeman] Streaming linear regression unit tests
fb4683a [freeman] Minor changes for scalastyle consistency
fd31e03 [freeman] Changed logging behavior
453974e [freeman] Fixed indentation
c4b1143 [freeman] Streaming linear regression
604f4d7 [freeman] Expanded private class to include mllib
d99aa85 [freeman] Helper methods for streaming MLlib apps
0898add [freeman] Added dependency on streaming
This PR implements a streaming linear regression analysis, in which a linear regression model is trained online as new data arrive. The design is based on discussions with @tdas and @mengxr, in which we determined how to add this functionality in a general way, with minimal changes to existing libraries.
Summary of additions:
StreamingLinearAlgorithm - An abstract class for fitting generalized linear models online to streaming data, including training on (and updating) a model, and making predictions (see the sketch after the notes below)
StreamingLinearRegressionWithSGD - Class and companion object for running streaming linear regression
StreamingLinearRegressionTestSuite - Unit tests
StreamingLinearRegression - Example use case: fitting a model online to data from one stream, and making predictions on other data
Notes
- If this looks good, I can use the StreamingLinearAlgorithm class to easily implement other analyses that follow the same logic (Ridge, Lasso, Logistic, SVM).
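As a rough sketch of the design described above (illustrative bodies, not the merged implementation): the abstract class holds a current model and a batch algorithm, and uses `foreachRDD` to refit the model on every incoming batch, warm-starting from the previous weights.

```scala
import org.apache.spark.mllib.regression.{GeneralizedLinearAlgorithm, GeneralizedLinearModel, LabeledPoint}
import org.apache.spark.streaming.dstream.DStream

abstract class StreamingLinearAlgorithmSketch[
    M <: GeneralizedLinearModel,
    A <: GeneralizedLinearAlgorithm[M]] extends Serializable {

  protected var model: M      // the current model, seeded with initial weights
  protected val algorithm: A  // the batch algorithm reused for every update

  def latestModel(): M = model

  // Refit on each incoming RDD, starting from the previous weights so
  // training effectively continues online across batches.
  def trainOn(data: DStream[LabeledPoint]): Unit = {
    data.foreachRDD { rdd =>
      model = algorithm.run(rdd, model.weights)
    }
  }

  // Score a stream of labeled points with whatever model is current at that batch.
  def predictOn(data: DStream[LabeledPoint]): DStream[Double] = {
    data.map(point => model.predict(point.features))
  }
}
```

Under this design, a concrete subclass such as StreamingLinearRegressionWithSGD would mainly supply the model type, a configured batch algorithm, and setters, which is what makes the Ridge/Lasso/Logistic/SVM extensions mentioned in the notes straightforward.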