Streaming mllib [SPARK-2438][MLLIB] #1361

freeman-lab · 2014-07-10T19:53:46Z

This PR implements a streaming linear regression analysis, in which a linear regression model is trained online as new data arrive. The design is based on discussions with @tdas and @mengxr, in which we determined how to add this functionality in a general way, with minimal changes to existing libraries.

Summary of additions:

StreamingLinearAlgorithm

An abstract class for fitting generalized linear models online to streaming data, including training on (and updating) a model, and making predictions.

StreamingLinearRegressionWithSGD

Class and companion object for running streaming linear regression

StreamingLinearRegressionTestSuite

Unit tests

StreamingLinearRegression

Example use case: fitting a model online to data from one stream, and making predictions on other data

Notes

If this looks good, I can use the StreamingLinearAlgorithm class to easily implement other analyses that follow the same logic (Ridge, Lasso, Logistic, SVM).
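
For readers new to the idea, here is a minimal sketch of what "trained online" means: the weights fitted on one batch become the starting point for the next batch. It is plain Scala with no Spark dependency. The object name, `Point`, `update`, `fitSimulatedStream`, the step size, and the iteration counts are all hypothetical choices for illustration; they are not the classes or defaults added in this PR.

```scala
// Sketch only: plain Scala collections stand in for a stream of batches.
// All names and constants here are hypothetical, not the PR's API.
object OnlineLinearRegressionSketch {
  final case class Point(label: Double, features: Array[Double])

  def predict(weights: Array[Double], features: Array[Double]): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum

  // Run a few least-squares gradient steps on one batch,
  // starting from the weights left by the previous batch.
  def update(weights: Array[Double], batch: Seq[Point],
             stepSize: Double, numIterations: Int): Array[Double] =
    if (batch.isEmpty) weights // an empty batch leaves the model unchanged
    else {
      val w = weights.clone()
      for (_ <- 1 to numIterations) {
        val grad = new Array[Double](w.length)
        for (p <- batch) {
          val err = predict(w, p.features) - p.label
          for (j <- w.indices) grad(j) += err * p.features(j)
        }
        for (j <- w.indices) w(j) -= stepSize * grad(j) / batch.size
      }
      w
    }

  // Simulate 20 batches of 50 noisy points and carry the model across them.
  def fitSimulatedStream(seed: Long): Array[Double] = {
    val rng = new scala.util.Random(seed)
    val trueWeights = Array(2.0, -1.0)
    val batches = Seq.fill(20)(Seq.fill(50) {
      val x = Array(rng.nextGaussian(), rng.nextGaussian())
      Point(predict(trueWeights, x) + 0.01 * rng.nextGaussian(), x)
    })
    batches.foldLeft(Array(0.0, 0.0)) { (w, batch) =>
      update(w, batch, stepSize = 0.1, numIterations = 10)
    }
  }

  def main(args: Array[String]): Unit =
    println(fitSimulatedStream(42L).mkString(", "))
}
```

The `foldLeft` over batches is the whole point: no batch is revisited, and the estimate should drift toward the generating weights as more data arrive.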

AmplabJenkins · 2014-07-10T19:56:18Z

Can one of the admins verify this patch?

mengxr · 2014-07-10T20:38:31Z

@freeman-lab This is great! Could you create a JIRA and add [SPARK-####][MLLIB] to the title of this PR? Thanks!

mengxr · 2014-07-10T20:38:43Z

Jenkins, add to whitelist.

mengxr · 2014-07-10T20:38:54Z

Jenkins, test this please.

AmplabJenkins · 2014-07-10T20:41:17Z

Merged build triggered.

AmplabJenkins · 2014-07-10T20:41:23Z

Merged build started.

tdas · 2014-07-10T20:41:32Z

Awesome, time to have some fun :D
Roping in @pwendell

AmplabJenkins · 2014-07-10T20:42:34Z

Merged build finished.

AmplabJenkins · 2014-07-10T20:42:34Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16516/

freeman-lab · 2014-07-10T21:06:57Z

@mengxr great! Just created a JIRA (https://issues.apache.org/jira/browse/SPARK-2438) and added to the title.

mengxr · 2014-07-12T06:31:25Z

@freeman-lab Could you add some unit tests? There should be some examples under streaming and mllib.

freeman-lab · 2014-07-14T18:12:50Z

@mengxr I added two tests; they check that parameter estimates are accurate and that they improve over time. The tests write temporary files and read them back through file streams, which is clunky, but @tdas will help add a dependency on the streaming test suite so we can use its utilities instead.

mengxr · 2014-07-17T08:25:07Z

Jenkins, add to whitelist.

mengxr · 2014-07-17T08:25:13Z

Jenkins, test this please.

SparkQA · 2014-07-17T08:28:00Z

QA tests have started for PR 1361. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16774/consoleFull

SparkQA · 2014-07-17T10:03:15Z

QA results for PR 1361:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16774/consoleFull

freeman-lab · 2014-07-17T18:20:28Z

Looks like the basic test for correct final params passes, but not the stricter test for improvement on every update. Both pass locally. My guess is that it's running a bit slower on Jenkins, so the updates don't complete fast enough (I can create a failure locally by making the test data rate too high). I'll play with this; it might work to just slow down the data rate.

freeman-lab · 2014-07-18T00:52:37Z

@mengxr mind retesting? I tried to make the convergence test more robust in a couple of ways. If we still have issues we might need to rethink that test further. Thanks!
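
To make the difference between the strict and the softened convergence check concrete, here is a rough sketch in plain Scala. It is not the code in the test suite: the object name, both function names, the `tolerance` and `minImprovements` parameters, and the example values are assumptions chosen for illustration.

```scala
// Sketch only: hypothetical helpers, not the assertions used in the PR's tests.
object ConvergenceCheckSketch {
  // errors(i) is the parameter error measured after the i-th update.

  // Strict check: the error must drop on every single update.
  def strictlyImproving(errors: Seq[Double]): Boolean =
    errors.sliding(2).forall {
      case Seq(a, b) => b < a
      case _         => true // fewer than two measurements: nothing to compare
    }

  // Softer check: no update makes the error worse by more than `tolerance`
  // (stability), and at least `minImprovements` updates reduce it.
  def stableAndMostlyImproving(errors: Seq[Double],
                               tolerance: Double,
                               minImprovements: Int): Boolean = {
    val deltas = errors.sliding(2).collect { case Seq(a, b) => b - a }.toList
    deltas.forall(_ <= tolerance) && deltas.count(_ < 0) >= minImprovements
  }

  def main(args: Array[String]): Unit = {
    // One small regression at the third measurement, as a slow CI box might produce.
    val errors = Seq(1.0, 0.6, 0.65, 0.4, 0.3)
    println(strictlyImproving(errors))
    println(stableAndMostlyImproving(errors, tolerance = 0.1, minImprovements = 2))
  }
}
```

A single late or partially processed batch fails the strict check but not the softer one, which is why the softer form tends to be less sensitive to timing on a loaded build machine.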

SparkQA · 2014-07-18T00:58:00Z

QA tests have started for PR 1361. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16799/consoleFull

SparkQA · 2014-07-18T02:35:48Z

QA results for PR 1361:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16799/consoleFull

freeman-lab · 2014-08-01T22:46:13Z

@mengxr done! Removed the static methods (and made the class public), and added those usage notes to StreamingLinearAlgorithm.

mengxr · 2014-08-01T22:56:30Z

mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala

+        // Sample a subset (fraction miniBatchFraction) of the total data
+        // compute and sum up the subgradients on this subset (this is one map-reduce)
+        val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
+          .aggregate((BDV.zeros[Double](weights.size), 0.0))(


.aggregate -> .treeAggregate. We use a tree pattern to avoid sending too much data to the driver. Does it hurt streaming update performance?

freeman-lab · 2014-08-01

It's totally fine; I must have lost it in the merge. I've put it back.

freeman-lab · 2014-08-01

Same for broadcasting, sorry. Fixing...
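
For context on the "tree pattern" mentioned in this thread, here is a toy sketch in plain Scala. It is not how Spark implements `treeAggregate`; `treeReduce`, `fanIn`, and the simulated partials are hypothetical. It only illustrates the idea that partial results are merged in rounds, so the final merge step handles a handful of values rather than one per partition.

```scala
// Sketch only: a toy stand-in for tree-style aggregation, not Spark's implementation.
object TreeAggregateSketch {
  // Merge partial results in rounds, `fanIn` at a time, so the last step
  // (the "driver" in this analogy) combines at most `fanIn` values.
  def treeReduce[A](partials: Seq[A], fanIn: Int)(combine: (A, A) => A): A = {
    require(partials.nonEmpty, "need at least one partial result")
    require(fanIn >= 2, "fanIn must be at least 2")
    var level = partials
    while (level.size > fanIn) {
      // Each group stands for an intermediate merge done away from the driver.
      level = level.grouped(fanIn).map(_.reduce(combine)).toList
    }
    level.reduce(combine)
  }

  def main(args: Array[String]): Unit = {
    // 16 simulated partitions, each contributing a (gradientSum, lossSum)-style pair.
    val partials = (1 to 16).map(i => (i.toDouble, 1.0))
    val (gradientSum, lossSum) =
      treeReduce(partials, fanIn = 4)((a, b) => (a._1 + b._1, a._2 + b._2))
    println(s"gradientSum=$gradientSum lossSum=$lossSum")
  }
}
```

With 16 partials and a fan-in of 4, the last step merges 4 values instead of 16; the result is the same as a flat reduce as long as `combine` is associative.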

SparkQA · 2014-08-01T23:39:18Z

QA tests have started for PR 1361. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17716/consoleFull

SparkQA · 2014-08-01T23:54:07Z

QA tests have started for PR 1361. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17718/consoleFull

mengxr · 2014-08-02T00:00:42Z

mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala

@@ -174,17 +182,18 @@ object GradientDescent extends Logging {
      weights, Vectors.dense(new Array[Double](weights.size)), 0, 1, regParam)._2

    for (i <- 1 to numIterations) {
-      val bcWeights = data.context.broadcast(weights)


Broadcasting the weights is actually important for performance. Did you experience any problem with it? It may be an orthogonal issue. Maybe we should keep this code block unchanged.

freeman-lab · 2014-08-02

This was my mistake; it should be fixed now.

SparkQA · 2014-08-02T00:37:50Z

QA results for PR 1361:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class StreamingLinearRegressionWithSGD (

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17716/consoleFull

SparkQA · 2014-08-02T00:47:45Z

QA results for PR 1361:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class StreamingLinearRegressionWithSGD (

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17718/consoleFull

SparkQA · 2014-08-02T01:54:28Z

QA tests have started for PR 1361. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17732/consoleFull

SparkQA · 2014-08-02T02:59:59Z

QA results for PR 1361:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class StreamingLinearRegressionWithSGD (

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17732/consoleFull

mengxr · 2014-08-02T03:12:16Z

LGTM. Merged into master. Thanks a lot for putting Streaming and MLlib together!

tdas · 2014-08-02T03:13:26Z

Yay!!!


This PR implements a streaming linear regression analysis, in which a linear regression model is trained online as new data arrive. The design is based on discussions with tdas and mengxr, in which we determined how to add this functionality in a general way, with minimal changes to existing libraries.

__Summary of additions:__

_StreamingLinearAlgorithm_
- An abstract class for fitting generalized linear models online to streaming data, including training on (and updating) a model, and making predictions.

_StreamingLinearRegressionWithSGD_
- Class and companion object for running streaming linear regression

_StreamingLinearRegressionTestSuite_
- Unit tests

_StreamingLinearRegression_
- Example use case: fitting a model online to data from one stream, and making predictions on other data

__Notes__
- If this looks good, I can use the StreamingLinearAlgorithm class to easily implement other analyses that follow the same logic (Ridge, Lasso, Logistic, SVM).

Author: Jeremy Freeman <the.freeman.lab@gmail.com>
Author: freeman <the.freeman.lab@gmail.com>

Closes apache#1361 from freeman-lab/streaming-mllib and squashes the following commits:

775ea29 [Jeremy Freeman] Throw error if user doesn't initialize weights
4086fee [Jeremy Freeman] Fixed current weight formatting
8b95b27 [Jeremy Freeman] Restored broadcasting
29f27ec [Jeremy Freeman] Formatting
8711c41 [Jeremy Freeman] Used return to avoid indentation
777b596 [Jeremy Freeman] Restored treeAggregate
74cf440 [Jeremy Freeman] Removed static methods
d28cf9a [Jeremy Freeman] Added usage notes
c3326e7 [Jeremy Freeman] Improved documentation
9541a41 [Jeremy Freeman] Merge remote-tracking branch 'upstream/master' into streaming-mllib
66eba5e [Jeremy Freeman] Fixed line lengths
2fe0720 [Jeremy Freeman] Minor cleanup
7d51378 [Jeremy Freeman] Moved streaming loader to MLUtils
b9b69f6 [Jeremy Freeman] Added setter methods
c3f8b5a [Jeremy Freeman] Modified logging
00aafdc [Jeremy Freeman] Add modifiers
14b801e [Jeremy Freeman] Name changes
c7d38a3 [Jeremy Freeman] Move check for empty data to GradientDescent
4b0a5d3 [Jeremy Freeman] Cleaned up tests
74188d6 [Jeremy Freeman] Eliminate dependency on commons
50dd237 [Jeremy Freeman] Removed experimental tag
6bfe1e6 [Jeremy Freeman] Fixed imports
a2a63ad [freeman] Makes convergence test more robust
86220bc [freeman] Streaming linear regression unit tests
fb4683a [freeman] Minor changes for scalastyle consistency
fd31e03 [freeman] Changed logging behavior
453974e [freeman] Fixed indentation
c4b1143 [freeman] Streaming linear regression
604f4d7 [freeman] Expanded private class to include mllib
d99aa85 [freeman] Helper methods for streaming MLlib apps
0898add [freeman] Added dependency on streaming

freeman-lab added 6 commits July 10, 2014 10:39

Added dependency on streaming

0898add

Helper methods for streaming MLlib apps

d99aa85

Expanded private class to include mllib

604f4d7

Streaming linear regression

c4b1143

- Abstract class to support a variety of streaming regression analyses
- Example concrete class for streaming linear regression
- Example usage: continually train on one data stream and test on another

Fixed indentation

453974e

Changed logging behavior

fd31e03

freeman-lab changed the title ~~Streaming mllib~~ Streaming mllib [SPARK-2438][MLLIB] Jul 10, 2014

freeman-lab added 2 commits July 14, 2014 12:44

Minor changes for scalastyle consistency

fb4683a

Streaming linear regression unit tests

86220bc

- Test parameter estimate accuracy after several updates
- Test parameter accuracy improvement after each batch

Makes convergence test more robust

a2a63ad

- Slower simulated data rates and updates
- Softens requirement for strict error reduction, but still ensures error stability, and error reduction on at least a subset of updates

Removed static methods

74cf440

- Also deleted companion object
- Renamed file for consistency
- Explained usage in documentation

mengxr reviewed Aug 1, 2014

freeman-lab added 3 commits August 1, 2014 19:15

Restored treeAggregate

777b596

Used return to avoid indentation

8711c41

Formatting

29f27ec

Restored broadcasting

8b95b27

mengxr reviewed Aug 2, 2014

freeman-lab added 2 commits August 1, 2014 21:32

Fixed current weight formatting

4086fee

Throw error if user doesn't initialize weights

775ea29

asfgit closed this in f6a1899 Aug 2, 2014
