[SPARK-4259][MLlib]: Add Power Iteration Clustering Algorithm with Gaussian Similarity Function #4254

fjiang6 · 2015-01-28T21:47:04Z

Add single pseudo-eigenvector PIC.
Includes documentation and an updated pom.xml, along with the following source files:
mllib/src/main/scala/org/apache/spark/mllib/clustering/PIClustering.scala
mllib/src/test/scala/org/apache/spark/mllib/clustering/PIClusteringSuite.scala

SparkQA · 2015-01-28T21:52:45Z

Test build #26252 has started for PR 4254 at commit 121e4d5.

This patch merges cleanly.

SparkQA · 2015-01-28T21:53:39Z

Test build #26252 has finished for PR 4254 at commit 121e4d5.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-28T21:53:40Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26252/
Test FAILed.

mengxr · 2015-01-28T21:57:32Z

Is it possible to do Gaussian similarity in another PR? It should be part of feature transformation, not part of PIC. A minimal PR is easier to review.
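
As a rough illustration of keeping the similarity computation outside PIC, here is a minimal plain-Scala sketch of a Gaussian similarity function. The names `GaussianSimilaritySketch`, `gaussianSimilarity`, and `sigma`, as well as the kernel form exp(-||x - y||^2 / (2 * sigma^2)), are assumptions for illustration and are not taken from the code in this PR.

```scala
object GaussianSimilaritySketch {
  /**
   * Gaussian (RBF) similarity between two points of equal dimension.
   * Assumed form: exp(-||x - y||^2 / (2 * sigma^2)); the PR may use a different scaling.
   */
  def gaussianSimilarity(x: Array[Double], y: Array[Double], sigma: Double): Double = {
    require(x.length == y.length, "points must have the same dimension")
    require(sigma > 0.0, "sigma must be positive")
    // Squared Euclidean distance between x and y.
    val sqDist = x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
    math.exp(-sqDist / (2.0 * sigma * sigma))
  }

  /** Pairwise similarities for a small local collection of points (hypothetical helper). */
  def pairwise(points: Seq[Array[Double]], sigma: Double): Seq[Seq[Double]] =
    points.map(p => points.map(q => gaussianSimilarity(p, q, sigma)))
}
```

A separate transformation like this could produce the affinities that PIC then consumes, so PIC itself would not need to know how the similarities were computed.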

mengxr · 2015-01-28T21:59:36Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/PIClustering.scala

+   * @param verticesFile Local filesystem path to the Points input file
+   * @return Set of Vertices in format appropriate for consumption by the PIC algorithm
+   */
+  def readVerticesfromFile(verticesFile: String): Points = {


Let's not handle I/O here. We can put example code under examples/ and load files there.

mengxr · 2015-01-29T09:52:24Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala

+    // TODO: avoid local collect and then sc.parallelize.
+    val localVt = vt.collect.sortBy(_._1)
+    val vectRdd = sc.parallelize(localVt.map(v => (v._1, Vectors.dense(v._2))))
+    vectRdd.cache()


You don't need to sort vt before calling k-means.

True, it was sorted for another purpose; will remove.

mengxr · 2015-01-29T09:55:51Z

@fjiang6 @javadba Please focus on the public APIs first and then the implementation. The best way to check the public APIs is to generate the HTML docs and look at what is exposed to users. We can probably refactor the test code in later PRs.

SparkQA · 2015-01-30T01:02:34Z

Test build #26354 has started for PR 4254 at commit 24fbf52.

This patch merges cleanly.

mengxr · 2015-01-30T01:44:44Z

data/mllib/pic_data.txt

@@ -0,0 +1,299 @@
+1000	0.000000	0.000125	0.000038	0.012684	0.000638	0.051091	0.000151	0.044208	0.004264	0.000617	0.007746	0.036569	0.001813	0.000305	0.003171	0.004114	0.000530	0.016800	0.003396	0.017566	0.034756	0.000018	0.051096	0.000022	0.001749	0.000210	0.006065	0.006969	0.016719	0.006028	0.003378	0.003200	0.025072	0.000291	0.000116	0.001633	0.000028	0.011305	0.000019	0.010359	0.006533	0.047593	0.027411	0.000059	0.017558	0.000518	0.000946	0.044212	0.000094	0.005404	0.026762	0.009941	0.003801	0.000027	0.000161	0.000901	0.000019	0.000518	0.034732	0.000059	0.000126	0.000970	0.011814	0.005997	0.000205	0.001832	0.008792	0.036318	0.000149	0.032781	0.010692	0.000530	0.010557	0.016641	0.008180	0.001606	0.000092	0.007445	0.026718	0.027457	0.000957	0.005901	0.000314	0.000162	0.000856	0.004776	0.008114	0.003693	0.000038	0.024965	0.044256	0.007180	0.000022	0.010297	0.000994	0.044255	0.001725	0.016541	0.003658	0.000288


This is too large for unit tests. Unit tests should be as minimal as possible. For this one, we can construct a very small graph, compute its eigenvector, and derive the clustering result manually, then verify PIC result. For example

a - b - c - g - h
| \ |       | \ |
d - e - f   i - j

Assign each edge distance 1 and run PIC with k = 2. The solution should be clear.
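
To make the idea of a hand-checkable test concrete, here is a minimal plain-Scala sketch (no Spark or GraphX). It makes several assumptions that are not from this PR: it uses an even smaller hypothetical graph than the one drawn above (two triangles joined by a single edge), it starts the iteration from a one-hot vector on node 0, it runs a fixed 30 iterations instead of a convergence test, and it splits the resulting 1-D embedding at its mean as a stand-in for k-means with k = 2. The one-hot start is deliberate: this graph is mirror-symmetric, so a symmetric start such as the degree vector would stay symmetric and never separate the two halves.

```scala
object PicSketch {
  // Hypothetical test graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
  val edges: Seq[(Int, Int)] =
    Seq((0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5))

  /** Symmetric affinity matrix with weight 1.0 on every edge. */
  def affinity(n: Int, es: Seq[(Int, Int)]): Array[Array[Double]] = {
    val a = Array.fill(n, n)(0.0)
    for ((i, j) <- es) { a(i)(j) = 1.0; a(j)(i) = 1.0 }
    a
  }

  /** Row-normalizes the affinity matrix, W = D^-1 A (assumes no isolated nodes). */
  def rowNormalize(a: Array[Array[Double]]): Array[Array[Double]] =
    a.map { row =>
      val degree = row.sum
      row.map(_ / degree)
    }

  /** Applies v <- W v / ||W v||_1 a fixed number of times. */
  def powerIterate(w: Array[Array[Double]], v0: Array[Double], iterations: Int): Array[Double] = {
    var v = v0
    for (_ <- 0 until iterations) {
      val wv = w.map(row => row.zip(v).map { case (x, y) => x * y }.sum)
      val norm = wv.map(x => math.abs(x)).sum
      v = wv.map(_ / norm)
    }
    v
  }

  /** Stand-in for k-means with k = 2 on a 1-D embedding: label 0 above the mean, 1 otherwise. */
  def splitAtMean(v: Array[Double]): Array[Int] = {
    val mean = v.sum / v.length
    v.map(x => if (x > mean) 0 else 1)
  }

  /** Runs the whole sketch on the hypothetical graph and returns one label per node. */
  def run(): Array[Int] = {
    val w = rowNormalize(affinity(6, edges))
    val v0 = Array(1.0, 0.0, 0.0, 0.0, 0.0, 0.0) // one-hot start on node 0 (assumption)
    splitAtMean(powerIterate(w, v0, 30))
  }
}
```

On this graph the expected grouping can be derived by hand from the second eigenvector of the row-normalized matrix, which is positive on one triangle and negative on the other, so a unit test only needs to check that nodes 0 to 2 share one label and nodes 3 to 5 share the other.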

SparkQA · 2015-01-30T02:14:11Z

Test build #26354 has finished for PR 4254 at commit 24fbf52.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class PowerIterationClustering(

AmplabJenkins · 2015-01-30T02:14:15Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26354/
Test PASSed.

refactor PIC

SparkQA · 2015-01-30T20:32:48Z

Test build #26420 has started for PR 4254 at commit f292f31.

This patch merges cleanly.

SparkQA · 2015-01-30T20:42:40Z

Test build #26423 has started for PR 4254 at commit 4550850.

This patch merges cleanly.

SparkQA · 2015-01-30T21:44:04Z

Test build #26420 has finished for PR 4254 at commit f292f31.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class PowerIterationClusteringModel(

AmplabJenkins · 2015-01-30T21:44:08Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26420/
Test PASSed.

SparkQA · 2015-01-30T22:01:03Z

Test build #26423 has finished for PR 4254 at commit 4550850.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class PowerIterationClusteringModel(

AmplabJenkins · 2015-01-30T22:01:06Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26423/
Test PASSed.

mengxr · 2015-01-30T22:09:58Z

docs/mllib-clustering.md

+
+Power iteration clustering is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values.  Internally the algorithm:
+
+* accepts a [Graph](https://spark.apache.org/docs/0.9.2/api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a  normalized pairwise affinity between all input points.


Should use relative path "api/graphx/...". See examples in this markdown file.

mengxr · 2015-01-30T22:11:25Z

LGTM except for minor user guide issues, which will be addressed in SPARK-5503. I've merged this into master. Thanks for contributing! (Now MLlib depends on GraphX.)

fjiang6 and others added 24 commits January 22, 2015 13:52

Adding Power Iteration Clustering

a3c5fbe

Adding Power Iteration Clustering and Suite test

d5aae20

PIClustering is running in new branch (up to the pseudo-eigenvector c…

3fd5bc8

…onvergence step)

Added ConcentricCircles data generation and KMeans clustering

0ef163f

Update circles test data values

32a90dc

First end to end working version: but has bad performance issue

0700335

First end to end working PIC

e5df2b8

Added visualization/plotting of input/output data

9294263

Revert inadvertent update to KMeans

a2b1e57

Added axes and combined into single plot for matplotlib

b7dbcbe

Added iris dataset

f656c34

Added graphx main and test jars as dependencies to mllib/pom.xml

a112f38

Update PIClustering.scala

ace9749

Update PIClustering.scala

b29c0db

Converted custom Linear Algebra datatypes/routines to use Breeze.

bea48ea

Converted from custom Linalg routines to Breeze: added JavaDoc commen…

90e7fa4

…ts; added Markdown documentation

Added mllib specific log4j

be659e3

Added link to PIC doc from the main clustering md doc

060e6bf

fixed incorrect markdown in clustering doc

24f438e

Add assert to testcase on cluster sizes

88aacc8

Change last two println's to log4j logger

43ab10b

Applied Xiangrui's comments - especially removing RDD/PICLinalg class…

218a49d

…es and making noncritical methods private

removed matplot.py and reordered all private methods to bottom of PIC

1c3a62e

Remove unused testing data files

121e4d5

mengxr reviewed Jan 28, 2015

mengxr reviewed Jan 29, 2015

Updated API to be similar to KMeans plus other changes requested by X…

24fbf52

…iangrui on the PR

mengxr reviewed Jan 30, 2015

mengxr and others added 2 commits January 30, 2015 06:44

refactor PIC

4b78aaf

Merge pull request #44 from mengxr/SPARK-4259

f292f31

refactor PIC

Removed pic test data

4550850

mengxr reviewed Jan 30, 2015

asfgit closed this in f377431 Jan 30, 2015
