
[SPARK-3573][MLLIB] Make MLlib's Vector compatible with SQL's SchemaRDD #3070

Closed
wants to merge 2 commits

Conversation

mengxr
Contributor

@mengxr mengxr commented Nov 3, 2014

Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map an RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file (a sketch follows below). Examples in Scala/Python are attached. The Scala code was copied from @jkbradley.

This PR contains the changes from #3068. I will rebase after #3068 is merged.

@marmbrus @jkbradley
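
For concreteness, here is a minimal sketch of the workflow the description refers to, written against the Spark 1.2-era API and assuming an existing SparkContext sc; the toy data, variable names, and output path are illustrative assumptions, not code from this PR:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.{SQLContext, SchemaRDD}

val sqlContext = new SQLContext(sc)
import sqlContext._  // implicit RDD-to-SchemaRDD conversion plus the 'symbol DSL

// Toy data: LabeledPoint carries a Vector, which this PR registers as a UDT.
val points = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.1, 0.2)),
  LabeledPoint(0.0, Vectors.sparse(2, Array(1), Array(0.5)))))

val dataset: SchemaRDD = points                   // implicit createSchemaRDD conversion
dataset.select('features).collect()               // project the vector column
dataset.saveAsParquetFile("/tmp/points.parquet")  // round-trip through Parquet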

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22805 has started for PR 3070 at commit e8a5763.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22805 has finished for PR 3070 at commit e8a5763.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • class VectorUDT(UserDefinedType):
    • class UserDefinedType(DataType):

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22805/
Test FAILed.

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22806 has started for PR 3070 at commit f6827e4.

  • This patch merges cleanly.

@@ -46,6 +46,11 @@
      <version>${project.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.binary.version}</artifactId>
@srowen
Member

This still feels weird to me, MLlib depending on SQL. It seems like they both want to depend on a SchemaRDD that is specific to neither. I'm afraid of making the jar hell in Spark worse by attaching more subprojects together. That said, the SQL module itself doesn't, for instance, bring in Hive. Is this going to add much to the MLlib deps? Or can the commonality not be factored out into Core?

@mengxr
Contributor Author

@srowen Yes, it feels weird if we say ML depends on SQL, the "query language". Spark SQL provides RDDs with schema support and execution plan optimization, both of which are needed by MLlib. We need flexible table-like datasets and I/O support, and operations that "carry over" additional columns during the training phase. It is natural to say that ML depends on RDDs with schema support and execution plan optimization.

I agree that we should factor the common part out or make SchemaRDD a first-class citizen in Core, but that definitely takes time for both design and development. This dependency change has no effect on the content we deliver to users, and UDTs are internal to Spark.

Contributor

I think it would be pretty difficult to have a SchemaRDD that didn't at least depend on catalyst, and even then there would be no way to execute the projections and structured data input/output that MLlib wants. I think really the problem might be in naming. Catalyst / Spark SQL core are really more about manipulating structured data using Spark, and we actually considered not even having SQL in the name (unfortunately Spark Schema doesn't have the same ring to it).

The SQL project has already been carefully factored into pieces to minimize the number of dependencies, and so I believe that the only additional dependency that we are bringing in here is Parquet (which is kind of the point of this example).

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22806 has finished for PR 3070 at commit f6827e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • class VectorUDT(UserDefinedType):
    • class UserDefinedType(DataType):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22806/
Test PASSed.

val meanLabel = labels.fold(0.0)(_ + _) / numLabels
println(s"Selected label column with average value $meanLabel")

val featuresSchemaRDD: SchemaRDD = origData.select('features)
Contributor

What's the right way to select a column within "features"?

@mengxr
Contributor Author

Either of the following is okay: select("features".attr) or select('features)

Contributor

Does this also work for an arbitrary column name? I.e., if I am taking in the features column name as a command-line argument, how would it look?

@mengxr
Contributor Author

select(colName.attr) works if colName is a String. The column name needs to be legal for SQL/Catalyst.

Contributor

When using the DSL like we are in this example, any String column name is legal. The SQL/HiveQL parsers are a little more restrictive about what they consider legal, but with backticks you can access just about anything.
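
To make the exchange above concrete, a short sketch of both routes under the same 1.2-era DSL, reusing dataset and sqlContext from the earlier sketch; colName and the temp-table name "points" are hypothetical stand-ins:

// Hypothetical: the features column name arrives as a command-line argument.
val colName: String = args(0)

// DSL route: .attr turns any String into an attribute reference,
// so no parser restrictions apply.
val features = dataset.select(colName.attr)

// SQL route: the parser is stricter, but backticks quote unusual names.
dataset.registerTempTable("points")
val viaSql = sqlContext.sql(s"SELECT `$colName` FROM points")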

@jkbradley
Member

LGTM, though I'll depend on @davies for feedback on the Python API on the other PR (#3068).

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22858 has started for PR 3070 at commit c44b3ab.

  • This patch merges cleanly.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22857/
Test FAILed.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22859 has started for PR 3070 at commit 236f0a0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22858 has finished for PR 3070 at commit c44b3ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorActor(executorId: String) extends Actor with ActorLogReceive with Logging
    • case class GetActorSystemHostPortForExecutor(executorId: String) extends ToBlockManagerMaster
    • case class Params(
    • class VectorUDT(UserDefinedType):
    • class UserDefinedType(DataType):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22858/
Test PASSed.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22863 has started for PR 3070 at commit 3a0b6e5.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22859 has finished for PR 3070 at commit 236f0a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • class VectorUDT(UserDefinedType):
    • class UserDefinedType(DataType):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22859/
Test PASSed.



def summarize(dataset):
    print "schema: %s" % dataset.schema().json()
Contributor

dataset.print_schema() will be better.

@mengxr
Contributor Author

dataset.printSchema() doesn't output JSON, which contains more information:

{
  "type" : "struct",
  "fields" : [ {
    "name" : "label",
    "type" : "double",
    "nullable" : false,
    "metadata" : { }
  }, {
    "name" : "features",
    "type" : {
      "type" : "udt",
      "class" : "org.apache.spark.mllib.linalg.VectorUDT",
      "pyClass" : "pyspark.mllib.linalg.VectorUDT",
      "sqlType" : {
        "type" : "struct",
        "fields" : [ {
          "name" : "type",
          "type" : "byte",
          "nullable" : false,
          "metadata" : { }
        }, {
          "name" : "size",
          "type" : "integer",
          "nullable" : true,
          "metadata" : { }
        }, {
          "name" : "indices",
          "type" : {
            "type" : "array",
            "elementType" : "integer",
            "containsNull" : false
          },
          "nullable" : true,
          "metadata" : { }
        }, {
          "name" : "values",
          "type" : {
            "type" : "array",
            "elementType" : "double",
            "containsNull" : false
          },
          "nullable" : true,
          "metadata" : { }
        } ]
      }
    },
    "nullable" : true,
    "metadata" : { }
  } ]
}
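
For reference, a hedged Scala sketch of how such a UDT pairs the user-facing class with an underlying SQL type. The field layout mirrors the sqlType struct in the JSON above; the method bodies and the exact 1.2-era import location are assumptions, not code copied from this PR:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.catalyst.types._  // assumed 1.2-era location; moved to org.apache.spark.sql.types in 1.3

class VectorUDT extends UserDefinedType[Vector] {
  // Matches the "sqlType" struct in the JSON schema: a type tag
  // (dense vs. sparse), size/indices used by sparse vectors, and
  // values used by both representations.
  override def sqlType: StructType = StructType(Seq(
    StructField("type", ByteType, nullable = false),
    StructField("size", IntegerType, nullable = true),
    StructField("indices", ArrayType(IntegerType, containsNull = false), nullable = true),
    StructField("values", ArrayType(DoubleType, containsNull = false), nullable = true)))

  override def serialize(obj: Any): Any = ???        // Vector -> struct with the fields above (omitted)
  override def deserialize(datum: Any): Vector = ??? // struct -> dense or sparse Vector (omitted)
  override def userClass: Class[Vector] = classOf[Vector]
}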

@jkbradley
Member

Just checked the updated storage format for dense/sparse & the new test. LGTM

asfgit pushed a commit that referenced this pull request Nov 4, 2014
Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map a RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file. Examples in Scala/Python are attached. The Scala code was copied from jkbradley.

~~This PR contains the changes from #3068 . I will rebase after #3068 is merged.~~

marmbrus jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #3070 from mengxr/SPARK-3573 and squashes the following commits:

3a0b6e5 [Xiangrui Meng] organize imports
236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples

(cherry picked from commit 1a9c6cd)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@asfgit asfgit closed this in 1a9c6cd Nov 4, 2014
@mengxr
Contributor Author

mengxr commented Nov 4, 2014

Thanks all for reviewing the code! I've merged this into master and branch-1.2.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22863 has finished for PR 3070 at commit 3a0b6e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorActor(executorId: String) extends Actor with ActorLogReceive with Logging
    • case class GetActorSystemHostPortForExecutor(executorId: String) extends ToBlockManagerMaster
    • case class Params(
    • class VectorUDT(UserDefinedType):
    • class UserDefinedType(DataType):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22863/
Test PASSed.
