Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Streaming KMeans [MLLIB][SPARK-3254] #2942

Closed
wants to merge 24 commits into from

Conversation

freeman-lab
Copy link
Contributor

This adds a Streaming KMeans algorithm to MLlib. It uses an update rule that generalizes the mini-batch KMeans update to incorporate a decay factor, which allows past data to be forgotten. The decay factor can be specified explicitly, or via a more intuitive "fractional decay" setting, in units of either data points or batches.

The PR includes:

  • StreamingKMeans algorithm with decay factor settings
  • Usage example
  • Additions to documentation clustering page
  • Unit tests of basic behavior and decay behaviors

@tdas @mengxr @rezazadeh

- Used trainOn and predictOn pattern, similar to
StreamingLinearAlgorithm
- Decay factor can be set explicitly, or via fractional decay
parameters expressed in units of number of batches, or number of points
- Unit tests for basic functionality and decay settings
@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #22209 has started for PR 2942 at commit 2086bdc.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #22209 has finished for PR 2942 at commit 2086bdc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StreamingKMeansModel(
    • class StreamingKMeans(

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22209/
Test PASSed.

@AtlasPilotPuppy
Copy link
Contributor

Should we create another PR for the python bindings/example?


@DeveloperApi
class StreamingKMeans(
var k: Int,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indent of 4 spaces?

@mengxr
Copy link
Contributor

mengxr commented Oct 28, 2014

@anantasty This PR is still in review. If you are interested in Python binding of streaming algorithms. Could you help add one for StreamingLinearRegression? Thanks!

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala

@AtlasPilotPuppy
Copy link
Contributor

I would certainly be interested in doing that. I just wasn't sure if it was
better to do it as a separate PR/ task.
On Oct 28, 2014 11:19 AM, "Xiangrui Meng" notifications@github.com wrote:

@anantasty https://github.com/anantasty This PR is still in review. If
you are interested in Python binding of streaming algorithms. Could you
help add one for StreamingLinearRegression? Thanks!

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala


Reply to this email directly or view it on GitHub
#2942 (comment).

@mengxr
Copy link
Contributor

mengxr commented Oct 28, 2014

It should be in a separate JIRA (and hence a separate PR). Please create a JIRA for StreamingLinearRegression and ping me there. Thanks!

@freeman-lab
Copy link
Contributor Author

@anantasty Agreed, should be separate, but would be very cool to have! Ping me as well, happy to provide feedback.


## Streaming clustering

When data arrive in a stream, we may want to estimate clusters dynamically, updating them as new data arrive. MLlib provides support for streaming KMeans clustering, with parameters to control the decay (or "forgetfulness") of the estimates. The algorithm uses a generalization of the mini-batch KMeans update rule. For each batch of data, we assign all points to their nearest cluster, compute new cluster centers, then update each cluster using:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. line too wide
  2. KMeans -> k-means

@SparkQA
Copy link

SparkQA commented Oct 29, 2014

Test build #22426 has started for PR 2942 at commit 374a706.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 29, 2014

Test build #22426 has finished for PR 2942 at commit 374a706.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StreamingKMeansModel(
    • class StreamingKMeans(

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22426/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 29, 2014

Test build #22428 has started for PR 2942 at commit 9f7aea9.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 29, 2014

Test build #22428 has finished for PR 2942 at commit 9f7aea9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StreamingKMeansModel(
    • class StreamingKMeans(

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22428/
Test PASSed.

- Use a single halfLife parameter that now determines the decay factor
directly
- Allow specification of timeUnit for the halfLife as “batches” or
“points”
- Documentation adjusted accordingly
@freeman-lab
Copy link
Contributor Author

@mengxr I implemented the new parameterization (and tried to make the docs on it more intuitive), see what you think!

@SparkQA
Copy link

SparkQA commented Oct 31, 2014

Test build #22607 has started for PR 2942 at commit 0411bf5.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 31, 2014

Test build #22607 has finished for PR 2942 at commit 0411bf5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StreamingKMeansModel(
    • class StreamingKMeans(

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22607/
Test PASSed.

@mengxr
Copy link
Contributor

mengxr commented Oct 31, 2014

@freeman-lab I made some changes: freeman-lab#1 , which includes the following:

  1. discount on previous counts
  2. detecting dying clusters
  3. use BLAS if possible
  4. use dense vectors in aggregation

If the update looks good to you, could you merge that PR? Thanks!

@SparkQA
Copy link

SparkQA commented Nov 1, 2014

Test build #22673 has started for PR 2942 at commit 078617c.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 1, 2014

Test build #22673 has finished for PR 2942 at commit 078617c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StreamingKMeansModel(
    • class StreamingKMeans(

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22673/
Test PASSed.

@freeman-lab
Copy link
Contributor Author

@mengxr great updates! LGMT. Just need to update the doc/examples in a couple places I think.

@SparkQA
Copy link

SparkQA commented Nov 1, 2014

Test build #22677 has started for PR 2942 at commit b2e5b4a.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 1, 2014

Test build #22677 has finished for PR 2942 at commit b2e5b4a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StreamingKMeansModel(
    • class StreamingKMeans(

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22677/
Test PASSed.

@mengxr
Copy link
Contributor

mengxr commented Nov 1, 2014

LGTM. Merged into master. Thanks for adding streaming k-means!

@asfgit asfgit closed this in 98c556e Nov 1, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants