
[SPARK-3573][MLLIB] Make MLlib's Vector compatible with SQL's SchemaRDD #3070

Closed
wants to merge 2 commits

Conversation

mengxr
Contributor

@mengxr mengxr commented Nov 3, 2014

Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map an RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file (a sketch follows below). Examples in Scala/Python are attached. The Scala code was copied from @jkbradley.

This PR contains the changes from #3068. I will rebase after #3068 is merged.

@marmbrus @jkbradley
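
For concreteness, here is a minimal sketch of the workflow the description refers to, written against the Spark 1.2-era API and assuming an existing SparkContext sc; the toy data, variable names, and output path are illustrative assumptions, not code from this PR:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.{SQLContext, SchemaRDD}

val sqlContext = new SQLContext(sc)
import sqlContext._  // implicit RDD-to-SchemaRDD conversion plus the 'symbol DSL

// Toy data: LabeledPoint carries a Vector, which this PR registers as a UDT.
val points = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.1, 0.2)),
  LabeledPoint(0.0, Vectors.sparse(2, Array(1), Array(0.5)))))

val dataset: SchemaRDD = points                   // implicit createSchemaRDD conversion
dataset.select('features).collect()               // project the vector column
dataset.saveAsParquetFile("/tmp/points.parquet")  // round-trip through Parquet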

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22805 has started for PR 3070 at commit e8a5763.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22805 has finished for PR 3070 at commit e8a5763.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • class VectorUDT(UserDefinedType):
    • class UserDefinedType(DataType):

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22805/
Test FAILed.

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22806 has started for PR 3070 at commit f6827e4.

  • This patch merges cleanly.

@@ -46,6 +46,11 @@
      <version>${project.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.binary.version}</artifactId>
@srowen
Member

This still feels weird to me, MLlib depending on SQL. It seems like they both want to depend on a SchemaRDD that is specific to neither. I'm afraid of making the jar hell in Spark worse by attaching more subprojects together. That said, the SQL module itself doesn't, for instance, bring in Hive. Is this going to add much to the MLlib deps? Or can the commonality not be factored out into Core?

@mengxr
Contributor Author

@srowen Yes, it feels weird if we say ML depends on SQL, the "query language". Spark SQL provides RDDs with schema support and execution plan optimization, both of which are needed by MLlib. We need flexible table-like datasets and I/O support, and operations that "carry over" additional columns during the training phase. It is natural to say that ML depends on RDDs with schema support and execution plan optimization.

I agree that we should factor the common part out or make SchemaRDD a first-class citizen in Core, but that definitely takes time for both design and development. This dependency change has no effect on the content we deliver to users, and UDTs are internal to Spark.

Contributor

I think it would be pretty difficult to have a SchemaRDD that didn't at least depend on catalyst, and even then there would be no way to execute the projections and structured data input/output that MLlib wants. I think really the problem might be in naming. Catalyst / Spark SQL core are really more about manipulating structured data using Spark, and we actually considered not even having SQL in the name (unfortunately Spark Schema doesn't have the same ring to it).

The SQL project has already been carefully factored into pieces to minimize the number of dependencies, and so I believe that the only additional dependency that we are bringing in here is Parquet (which is kind of the point of this example).

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22806 has finished for PR 3070 at commit f6827e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • class VectorUDT(UserDefinedType):
    • class UserDefinedType(DataType):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22806/
Test PASSed.

val meanLabel = labels.fold(0.0)(_ + _) / numLabels
println(s"Selected label column with average value $meanLabel")

val featuresSchemaRDD: SchemaRDD = origData.select('features)
Contributor

What's the right way to select a column within "features"?

@mengxr
Contributor Author

Either of the following is okay: select("features".attr) or select('features)

Contributor

Does this also work for an arbitrary column name? I.e., if I am taking in the features column name as a command-line argument, how would it look?

@mengxr
Contributor Author

select(colName.attr) works if colName is a String. The column name needs to be legal for SQL/Catalyst.

Contributor

When using the DSL like we are in this example, any String column name is legal. The SQL/HiveQL parsers are a little more restrictive about what they consider legal, but with backticks you can access just about anything.
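
To make the exchange above concrete, a short sketch of both routes under the same 1.2-era DSL, reusing dataset and sqlContext from the earlier sketch; colName and the temp-table name "points" are hypothetical stand-ins:

// Hypothetical: the features column name arrives as a command-line argument.
val colName: String = args(0)

// DSL route: .attr turns any String into an attribute reference,
// so no parser restrictions apply.
val features = dataset.select(colName.attr)

// SQL route: the parser is stricter, but backticks quote unusual names.
dataset.registerTempTable("points")
val viaSql = sqlContext.sql(s"SELECT `$colName` FROM points")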

@jkbradley
Member

LGTM, though I'll depend on @davies for feedback on the Python API on the other PR (#3068).

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22858 has started for PR 3070 at commit c44b3ab.

  • This patch merges cleanly.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22857/
Test FAILed.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22859 has started for PR 3070 at commit 236f0a0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22858 has finished for PR 3070 at commit c44b3ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorActor(executorId: String) extends Actor with ActorLogReceive with Logging
    • case class GetActorSystemHostPortForExecutor(executorId: String) extends ToBlockManagerMaster
    • case class Params(
    • class VectorUDT(UserDefinedType):
    • class UserDefinedType(DataType):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22858/
Test PASSed.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22863 has started for PR 3070 at commit 3a0b6e5.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22859 has finished for PR 3070 at commit 236f0a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • class VectorUDT(UserDefinedType):
    • class UserDefinedType(DataType):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22859/
Test PASSed.



def summarize(dataset):
    print "schema: %s" % dataset.schema().json()
Contributor

dataset.print_schema() will be better.

@mengxr
Contributor Author

dataset.printSchema() doesn't output JSON, which contains more information:

{
  "type" : "struct",
  "fields" : [ {
    "name" : "label",
    "type" : "double",
    "nullable" : false,
    "metadata" : { }
  }, {
    "name" : "features",
    "type" : {
      "type" : "udt",
      "class" : "org.apache.spark.mllib.linalg.VectorUDT",
      "pyClass" : "pyspark.mllib.linalg.VectorUDT",
      "sqlType" : {
        "type" : "struct",
        "fields" : [ {
          "name" : "type",
          "type" : "byte",
          "nullable" : false,
          "metadata" : { }
        }, {
          "name" : "size",
          "type" : "integer",
          "nullable" : true,
          "metadata" : { }
        }, {
          "name" : "indices",
          "type" : {
            "type" : "array",
            "elementType" : "integer",
            "containsNull" : false
          },
          "nullable" : true,
          "metadata" : { }
        }, {
          "name" : "values",
          "type" : {
            "type" : "array",
            "elementType" : "double",
            "containsNull" : false
          },
          "nullable" : true,
          "metadata" : { }
        } ]
      }
    },
    "nullable" : true,
    "metadata" : { }
  } ]
}
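
For reference, a hedged Scala sketch of how such a UDT pairs the user-facing class with an underlying SQL type. The field layout mirrors the sqlType struct in the JSON above; the method bodies and the exact 1.2-era import location are assumptions, not code copied from this PR:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.catalyst.types._  // assumed 1.2-era location; moved to org.apache.spark.sql.types in 1.3

class VectorUDT extends UserDefinedType[Vector] {
  // Matches the "sqlType" struct in the JSON schema: a type tag
  // (dense vs. sparse), size/indices used by sparse vectors, and
  // values used by both representations.
  override def sqlType: StructType = StructType(Seq(
    StructField("type", ByteType, nullable = false),
    StructField("size", IntegerType, nullable = true),
    StructField("indices", ArrayType(IntegerType, containsNull = false), nullable = true),
    StructField("values", ArrayType(DoubleType, containsNull = false), nullable = true)))

  override def serialize(obj: Any): Any = ???        // Vector -> struct with the fields above (omitted)
  override def deserialize(datum: Any): Vector = ??? // struct -> dense or sparse Vector (omitted)
  override def userClass: Class[Vector] = classOf[Vector]
}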

@jkbradley
Member

Just checked the updated storage format for dense/sparse & the new test. LGTM

asfgit pushed a commit that referenced this pull request Nov 4, 2014
Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map a RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file. Examples in Scala/Python are attached. The Scala code was copied from jkbradley.

~~This PR contains the changes from #3068 . I will rebase after #3068 is merged.~~

marmbrus jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #3070 from mengxr/SPARK-3573 and squashes the following commits:

3a0b6e5 [Xiangrui Meng] organize imports
236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples

(cherry picked from commit 1a9c6cd)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@asfgit asfgit closed this in 1a9c6cd Nov 4, 2014
@mengxr
Contributor Author

mengxr commented Nov 4, 2014

Thanks all for reviewing the code! I've merged this into master and branch-1.2.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22863 has finished for PR 3070 at commit 3a0b6e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorActor(executorId: String) extends Actor with ActorLogReceive with Logging
    • case class GetActorSystemHostPortForExecutor(executorId: String) extends ToBlockManagerMaster
    • case class Params(
    • class VectorUDT(UserDefinedType):
    • class UserDefinedType(DataType):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22863/
Test PASSed.
