[SPARK-4259][MLlib]: Add Power Iteration Clustering Algorithm with Gaussian Similarity Function #4254
…ts; added Markdown documentation
…es and making noncritical methods private
Test build #26252 has started for PR 4254 at commit
Test build #26252 has finished for PR 4254 at commit
Test FAILed.
Is it possible to do Gaussian similarity in another PR? It should be part of feature transformation, not part of PIC. The code review would be easier if this PR were minimal.
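For context, the Gaussian similarity between two points is commonly defined as exp(-||x - y||^2 / (2 * sigma^2)). A minimal standalone sketch of that formula follows. It is not the code from this PR: the object name `GaussianSimilarity`, the method names, and the `sigma` parameter are illustrative assumptions, and plain arrays are used instead of MLlib vectors.

```scala
// Illustrative sketch only; names are hypothetical and not taken from the PR.
object GaussianSimilarity {
  /** Squared Euclidean distance between two equal-length vectors. */
  def squaredDistance(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length, "vectors must have the same dimension")
    x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
  }

  /** Gaussian (RBF) similarity: exp(-||x - y||^2 / (2 * sigma^2)). */
  def similarity(x: Array[Double], y: Array[Double], sigma: Double): Double = {
    require(sigma > 0.0, "sigma must be positive")
    math.exp(-squaredDistance(x, y) / (2.0 * sigma * sigma))
  }
}
```

Keeping this as a separate transformation, as suggested above, would let PIC accept any precomputed affinity rather than being tied to one similarity function.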
 * @param verticesFile Local filesystem path to the Points input file
 * @return Set of Vertices in format appropriate for consumption by the PIC algorithm
 */
def readVerticesfromFile(verticesFile: String): Points = {
Let's not handle I/O here. We can put example code under examples/ and load files there.
OK
// TODO: avoid local collect and then sc.parallelize.
val localVt = vt.collect.sortBy(_._1)
val vectRdd = sc.parallelize(localVt.map(v => (v._1, Vectors.dense(v._2))))
vectRdd.cache()
You don't need to order `vt` before calling `kmeans`.
True, it was sorted for another purpose. I will remove the sort.
…iangrui on the PR
Test build #26354 has started for PR 4254 at commit
@@ -0,0 +1,299 @@
1000 0.000000 0.000125 0.000038 0.012684 0.000638 0.051091 0.000151 0.044208 0.004264 0.000617 0.007746 0.036569 0.001813 0.000305 0.003171 0.004114 0.000530 0.016800 0.003396 0.017566 0.034756 0.000018 0.051096 0.000022 0.001749 0.000210 0.006065 0.006969 0.016719 0.006028 0.003378 0.003200 0.025072 0.000291 0.000116 0.001633 0.000028 0.011305 0.000019 0.010359 0.006533 0.047593 0.027411 0.000059 0.017558 0.000518 0.000946 0.044212 0.000094 0.005404 0.026762 0.009941 0.003801 0.000027 0.000161 0.000901 0.000019 0.000518 0.034732 0.000059 0.000126 0.000970 0.011814 0.005997 0.000205 0.001832 0.008792 0.036318 0.000149 0.032781 0.010692 0.000530 0.010557 0.016641 0.008180 0.001606 0.000092 0.007445 0.026718 0.027457 0.000957 0.005901 0.000314 0.000162 0.000856 0.004776 0.008114 0.003693 0.000038 0.024965 0.044256 0.007180 0.000022 0.010297 0.000994 0.044255 0.001725 0.016541 0.003658 0.000288 |
This is too large for unit tests. Unit tests should be as minimal as possible. For this one, we can construct a very small graph, compute its eigenvector, and derive the clustering result manually, then verify PIC result. For example
a - b - c - g - h
| \ | | \ |
d - e - f i - j
Assign each edge distance 1 and run PIC with k = 2. The solution should be clear.
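A hand-checkable test of this kind can be sketched without Spark. The example below is an assumption-laden illustration, not the PR's implementation and not the exact graph drawn above: it uses a smaller hypothetical graph (two triangles joined by one edge), unit edge weights, a row-normalized affinity matrix W = D^-1 A, a deliberately asymmetric start vector so the run is reproducible, and a largest-gap split in place of k-means. All names (`PowerIterationSketch`, `twoTriangles`, and so on) are made up for illustration.

```scala
// Illustrative sketch only; names and the test graph are hypothetical.
object PowerIterationSketch {
  /** Row-normalizes a symmetric affinity matrix: W = D^-1 * A. */
  def rowNormalize(a: Array[Array[Double]]): Array[Array[Double]] =
    a.map { row =>
      val degree = row.sum
      require(degree > 0.0, "every vertex needs at least one edge")
      row.map(_ / degree)
    }

  /** One power-iteration step: v <- W v, rescaled to unit L1 norm. */
  def step(w: Array[Array[Double]], v: Array[Double]): Array[Double] = {
    val next = w.map(row => row.zip(v).map { case (x, y) => x * y }.sum)
    val norm = next.map(math.abs).sum
    next.map(_ / norm)
  }

  /** Runs a fixed, small number of steps (truncated power iteration). */
  def embed(w: Array[Array[Double]], v0: Array[Double], iterations: Int): Array[Double] =
    (1 to iterations).foldLeft(v0)((v, _) => step(w, v))

  /** Splits a one-dimensional embedding into two clusters at its largest gap. */
  def splitAtLargestGap(v: Array[Double]): Array[Int] = {
    val sorted = v.sorted
    val cut = (1 until sorted.length).maxBy(i => sorted(i) - sorted(i - 1))
    val threshold = (sorted(cut) + sorted(cut - 1)) / 2.0
    v.map(x => if (x > threshold) 1 else 0)
  }

  /** Hypothetical graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3. */
  def twoTriangles: Array[Array[Double]] = {
    val edges = Seq((0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5))
    val a = Array.fill(6, 6)(0.0)
    for ((i, j) <- edges) { a(i)(j) = 1.0; a(j)(i) = 1.0 }
    a
  }

  def main(args: Array[String]): Unit = {
    val w = rowNormalize(twoTriangles)
    // Asymmetric start vector (an assumption made for reproducibility).
    val v0 = Array(1.0, 0.0, 0.0, 0.0, 0.0, 0.0)
    val labels = splitAtLargestGap(embed(w, v0, 10))
    println(labels.mkString(" "))
  }
}
```

With this setup, vertices 0 to 2 should receive one label and vertices 3 to 5 the other. The start vector matters here: a vector that is symmetric under the graph's mirror symmetry (for example, one proportional to vertex degree) would give vertices 2 and 3 identical values, so the two triangles could not be separated.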
OK
Test build #26354 has finished for PR 4254 at commit
Test PASSed.
refactor PIC
Test build #26420 has started for PR 4254 at commit
Test build #26423 has started for PR 4254 at commit
Test build #26420 has finished for PR 4254 at commit
Test PASSed.
Test build #26423 has finished for PR 4254 at commit
Test PASSed.
Power iteration clustering is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm:

* accepts a [Graph](https://spark.apache.org/docs/0.9.2/api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points.
This should use the relative path "api/graphx/...". See the examples in this markdown file.
LGTM except for minor user guide issues, which will be addressed in SPARK-5503. I've merged this into master. Thanks for contributing! (Now MLlib depends on GraphX.)
Add single pseudo-eigenvector PIC
Includes documentation and an updated pom.xml, along with the following source files:
mllib/src/main/scala/org/apache/spark/mllib/clustering/PIClustering.scala
mllib/src/test/scala/org/apache/spark/mllib/clustering/PIClusteringSuite.scala