[SPARK-4259][MLlib]: Add Power Iteration Clustering Algorithm with Gaussian Similarity Function #4254

Closed
wants to merge 31 commits into from


@fjiang6 fjiang6 commented Jan 28, 2015

Add single pseudo-eigenvector PIC.
Includes documentation and an updated pom.xml, along with the following source files:
mllib/src/main/scala/org/apache/spark/mllib/clustering/PIClustering.scala
mllib/src/test/scala/org/apache/spark/mllib/clustering/PIClusteringSuite.scala

fjiang6 and others added 24 commits January 22, 2015 13:52
@SparkQA

SparkQA commented Jan 28, 2015

Test build #26252 has started for PR 4254 at commit 121e4d5.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 28, 2015

Test build #26252 has finished for PR 4254 at commit 121e4d5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26252/
Test FAILed.

@mengxr
Contributor

mengxr commented Jan 28, 2015

Is it possible to do Gaussian similarity in another PR? It should be part of the feature transformation but not within PIC. It would be easier for code review if the PR is minimal.
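
To illustrate the separation suggested here, a minimal sketch of a Gaussian similarity computed as a standalone preprocessing step might look like the following. The object name, the function names, and the `sigma` bandwidth parameter are assumptions for illustration only; none of them come from this PR.

```scala
object GaussianSimilaritySketch {
  /** Squared Euclidean distance between two points of equal dimension. */
  def squaredDistance(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length, "points must have the same dimension")
    x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
  }

  /** Gaussian similarity exp(-||x - y||^2 / (2 * sigma^2)); sigma is a hypothetical bandwidth. */
  def gaussianSimilarity(x: Array[Double], y: Array[Double], sigma: Double): Double = {
    require(sigma > 0.0, "sigma must be positive")
    math.exp(-squaredDistance(x, y) / (2.0 * sigma * sigma))
  }

  /** All pairwise similarities (i, j, s) with i < j, as input for a separate clustering step. */
  def pairwise(points: IndexedSeq[Array[Double]], sigma: Double): Seq[(Int, Int, Double)] =
    for {
      i <- points.indices
      j <- points.indices
      if i < j
    } yield (i, j, gaussianSimilarity(points(i), points(j), sigma))
}
```

Keeping this step outside the clustering code means PIC only ever sees precomputed similarities, so a different similarity function could be swapped in without touching it.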

* @param verticesFile Local filesystem path to the Points input file
* @return Set of Vertices in format appropriate for consumption by the PIC algorithm
*/
def readVerticesfromFile(verticesFile: String): Points = {
Contributor

Let's not handle I/O here. We can put example code under examples/ and load files there.

Contributor

OK

// TODO: avoid local collect and then sc.parallelize.
val localVt = vt.collect.sortBy(_._1)
val vectRdd = sc.parallelize(localVt.map(v => (v._1, Vectors.dense(v._2))))
vectRdd.cache()
Contributor

You don't need to sort vt before calling k-means.

Contributor

True, it was sorted for another purpose; will remove.

@mengxr
Contributor

mengxr commented Jan 29, 2015

@fjiang6 @javadba Please focus on the public APIs first and then the implementation. The best way to check the public APIs is to generate the HTML docs and look at what is exposed to users. We can probably refactor the test code in later PRs.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26354 has started for PR 4254 at commit 24fbf52.

  • This patch merges cleanly.

@@ -0,0 +1,299 @@
1000 0.000000 0.000125 0.000038 0.012684 0.000638 0.051091 0.000151 0.044208 0.004264 0.000617 0.007746 0.036569 0.001813 0.000305 0.003171 0.004114 0.000530 0.016800 0.003396 0.017566 0.034756 0.000018 0.051096 0.000022 0.001749 0.000210 0.006065 0.006969 0.016719 0.006028 0.003378 0.003200 0.025072 0.000291 0.000116 0.001633 0.000028 0.011305 0.000019 0.010359 0.006533 0.047593 0.027411 0.000059 0.017558 0.000518 0.000946 0.044212 0.000094 0.005404 0.026762 0.009941 0.003801 0.000027 0.000161 0.000901 0.000019 0.000518 0.034732 0.000059 0.000126 0.000970 0.011814 0.005997 0.000205 0.001832 0.008792 0.036318 0.000149 0.032781 0.010692 0.000530 0.010557 0.016641 0.008180 0.001606 0.000092 0.007445 0.026718 0.027457 0.000957 0.005901 0.000314 0.000162 0.000856 0.004776 0.008114 0.003693 0.000038 0.024965 0.044256 0.007180 0.000022 0.010297 0.000994 0.044255 0.001725 0.016541 0.003658 0.000288
Contributor

This is too large for unit tests. Unit tests should be as minimal as possible. For this one, we can construct a very small graph, compute its eigenvector, derive the clustering result manually, and then verify the PIC result against it. For example:

a - b - c - g - h
| \     |   | \ |
d - e - f   i - j

Assign each edge distance 1 and run PIC with k = 2. The solution should be clear.
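
A rough sketch of such a check in plain Scala, without Spark, could look like the following. The edge list is one reading of the diagram above (the `\` marks taken as a - e and g - j), and each edge is given similarity 1. The object and method names, the stopping tolerance, and the one-dimensional 2-means step are all assumptions for illustration, not the PR's implementation. Under that reading the only link between {a, ..., f} and {g, ..., j} is c - g, which is presumably the split the comment has in mind.

```scala
object SmallGraphPicSketch {
  // Vertex labels; the position in this vector is the vertex index.
  val names: Vector[String] = Vector("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")

  // Edges as read from the ASCII diagram (an interpretation, not given explicitly in the thread).
  val edges: Seq[(String, String)] = Seq(
    "a" -> "b", "b" -> "c", "c" -> "g", "g" -> "h",
    "a" -> "d", "a" -> "e", "c" -> "f", "g" -> "i", "g" -> "j", "h" -> "j",
    "d" -> "e", "e" -> "f", "i" -> "j")

  /** Symmetric affinity matrix with similarity 1 for every edge. */
  def affinity: Array[Array[Double]] = {
    val idx = names.zipWithIndex.toMap
    val a = Array.fill(names.size, names.size)(0.0)
    for ((s, t) <- edges) {
      a(idx(s))(idx(t)) = 1.0
      a(idx(t))(idx(s)) = 1.0
    }
    a
  }

  /** Truncated power iteration on W = D^-1 A, starting from the normalized degree vector. */
  def powerIteration(a: Array[Array[Double]], maxIter: Int = 100, tol: Double = 1e-6): Array[Double] = {
    val n = a.length
    val degree = a.map(_.sum)
    var v = degree.map(_ / degree.sum)
    var prevDelta = Double.MaxValue
    var iter = 0
    var done = false
    while (iter < maxIter && !done) {
      // One step: multiply by the row-normalized affinity matrix, then rescale to L1 norm 1.
      val wv = Array.tabulate(n)(i => (0 until n).map(j => a(i)(j) * v(j)).sum / degree(i))
      val norm = wv.map(math.abs).sum
      val next = wv.map(_ / norm)
      // Stop early once the per-step change levels off, before v flattens to a constant.
      val delta = next.zip(v).map { case (x, y) => math.abs(x - y) }.sum
      done = math.abs(prevDelta - delta) < tol
      prevDelta = delta
      v = next
      iter += 1
    }
    v
  }

  /** Lloyd's algorithm in one dimension with k = 2, seeded at the minimum and maximum. */
  def twoMeans(v: Array[Double], iters: Int = 20): Array[Int] = {
    var c0 = v.min
    var c1 = v.max
    var labels = Array.fill(v.length)(0)
    for (_ <- 0 until iters) {
      labels = v.map(x => if (math.abs(x - c0) <= math.abs(x - c1)) 0 else 1)
      val g0 = v.zip(labels).collect { case (x, 0) => x }
      val g1 = v.zip(labels).collect { case (x, 1) => x }
      if (g0.nonEmpty) c0 = g0.sum / g0.length
      if (g1.nonEmpty) c1 = g1.sum / g1.length
    }
    labels
  }

  def main(args: Array[String]): Unit = {
    val labels = twoMeans(powerIteration(affinity))
    // Print the vertices of each cluster on its own line.
    names.zip(labels).groupBy(_._2).values.foreach(g => println(g.map(_._1).mkString(" ")))
  }
}
```

A unit test built this way would compare the printed grouping with the partition derived by hand, up to a swap of the two cluster labels.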

Contributor

OK

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26354 has finished for PR 4254 at commit 24fbf52.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class PowerIterationClustering(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26354/
Test PASSed.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26420 has started for PR 4254 at commit f292f31.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26423 has started for PR 4254 at commit 4550850.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26420 has finished for PR 4254 at commit f292f31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class PowerIterationClusteringModel(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26420/
Test PASSed.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26423 has finished for PR 4254 at commit 4550850.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class PowerIterationClusteringModel(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26423/
Test PASSed.


Power iteration clustering is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm:

* accepts a [Graph](https://spark.apache.org/docs/0.9.2/api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points.
Contributor

Should use relative path "api/graphx/...". See examples in this markdown file.

@asfgit asfgit closed this in f377431 Jan 30, 2015
@mengxr
Contributor

mengxr commented Jan 30, 2015

LGTM except for minor user guide issues, which will be addressed in SPARK-5503. I've merged this into master. Thanks for contributing! (Now MLlib depends on GraphX.)


6 participants