
[SPARK-4192][SQL] Internal API for Python UDT #3068

Closed
wants to merge 10 commits

Conversation

mengxr
Contributor

@mengxr mengxr commented Nov 3, 2014

Following #2919, this PR adds Python UDT support (for internal use only) with tests under "pyspark.tests". Before SQLContext.applySchema, we check whether we need to convert user-type instances into SQL-recognizable data. In the current implementation, a Python UDT must be paired with a Scala UDT for serialization on the JVM side. A follow-up PR will add VectorUDT in MLlib for both Scala and Python.

@marmbrus @jkbradley @davies
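The pairing described above can be sketched in plain Python. This is a minimal standalone illustration of the serialize/deserialize contract, not the actual pyspark API; the class names, the `scalaUDT` qualified name, and the list-of-doubles SQL representation are all assumptions for illustration.

```python
class ExamplePoint(object):
    """A user-defined Python type (hypothetical)."""
    def __init__(self, x, y):
        self.x, self.y = x, y


class ExamplePointUDT(object):
    """Sketch of the UDT contract: convert user-type instances to and from
    SQL-recognizable data (here, a list of doubles)."""

    @classmethod
    def scalaUDT(cls):
        # Fully qualified name of the paired Scala UDT (hypothetical).
        return "org.apache.spark.sql.test.ExamplePointUDT"

    def serialize(self, obj):
        # Convert a user-type instance into SQL-recognizable data.
        return [obj.x, obj.y]

    def deserialize(self, datum):
        # Reconstruct a user-type instance from SQL data.
        return ExamplePoint(datum[0], datum[1])


udt = ExamplePointUDT()
point = udt.deserialize(udt.serialize(ExamplePoint(1.0, 2.0)))
```

The round trip through `serialize`/`deserialize` is what happens conceptually on either side of the JVM boundary.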

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22796 has started for PR 3068 at commit 4e84fce.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22796 has finished for PR 3068 at commit 4e84fce.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class UserDefinedType(DataType):

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22796/
Test FAILed.

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22798 has started for PR 3068 at commit e98d9d0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22799 has started for PR 3068 at commit 75223db.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22798 has finished for PR 3068 at commit e98d9d0.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class UserDefinedType(DataType):
    • // in some cases, such as when a class is enclosed in an object (in which case
    • abstract class UserDefinedType[UserType] extends DataType with Serializable
    • public abstract class UserDefinedType<UserType> extends DataType implements Serializable

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22798/
Test FAILed.

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22799 has finished for PR 3068 at commit 75223db.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class UserDefinedType(DataType):

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22799/
Test FAILed.

@@ -39,6 +39,7 @@
from array import array
from operator import itemgetter
from itertools import imap
import importlib
Contributor

importlib is not available in Python 2.6; use __import__ instead.
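A minimal sketch of the suggested replacement: `importlib.import_module` does not exist on Python 2.6, but `__import__` does (on both 2.x and 3.x). The helper name here is illustrative, not the one used in the patch.

```python
def import_module(name):
    # __import__("a.b.c") returns the top-level package "a", so walk
    # attribute by attribute down to the leaf module.
    mod = __import__(name)
    for part in name.split(".")[1:]:
        mod = getattr(mod, part)
    return mod

os_path = import_module("os.path")
```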

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22802 has started for PR 3068 at commit f19eb2b.

  • This patch merges cleanly.

@@ -775,11 +954,22 @@ def _verify_type(obj, dataType):
Traceback (most recent call last):
...
ValueError:...
>>> from pyspark.tests import ExamplePoint, ExamplePointUDT
>>> _verify_type(ExamplePoint(1.0, 2.0), ExamplePointUDT())
Contributor

It's better to remove these tests for ExamplePoint; they should be in tests.py (or are already covered).

Contributor Author

It is in the same group as the other doctests for this private function. I didn't find one for _verify_type in SQLTests.

Contributor

Your tests in tests.py have covered these internal functions, so I think it's fine to not have them here.

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22802 has finished for PR 3068 at commit f19eb2b.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class UserDefinedType(DataType):

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22802/
Test FAILed.

@mengxr
Contributor Author

mengxr commented Nov 3, 2014

test this please

return False


def _python_to_sql_converter(dataType):
Contributor

Can _create_converter do this?

Contributor Author

_create_converter doesn't do this. It is used to drop the names when the user provides Row objects, and is called by _drop_schema. I think we need to refactor the code a little bit during QA.

Contributor

Agreed.

@mengxr mengxr changed the title [SPARK-4192][SQL] Python UDT [SPARK-4192][SQL] Internal API for Python UDT Nov 3, 2014
@SparkQA

SparkQA commented Nov 3, 2014

Test build #22803 has finished for PR 3068 at commit 7c4a6a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class UserDefinedType(DataType):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22803/
Test PASSed.

@mengxr
Contributor Author

mengxr commented Nov 3, 2014

@davies I moved the ExamplePoint import code to the global test setup. Though the code paths are tested in SQLTests, it is still nice to have unit tests for each function in sql.py.

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22822 has started for PR 3068 at commit 2c9d7e4.

  • This patch merges cleanly.

case null => (null, null)
case udt =>
val split = udt.lastIndexOf(".")
(udt.substring(0, split), udt.substring(split + 1))
Contributor

Nit: we could use pyClass without pyModule (similar to class) and do the split in Python.
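The Python-side equivalent of the Scala `lastIndexOf` split above is a one-liner with `rsplit`; a sketch, using a hypothetical fully qualified name:

```python
full_name = "org.apache.spark.sql.test.ExamplePointUDT"  # hypothetical
# Split on the last "." to separate module path from class name.
module_name, class_name = full_name.rsplit(".", 1)
```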

@davies
Contributor

davies commented Nov 3, 2014

LGTM, just some minor comments.

BTW, serialize/deserialize usually means dumping an object into bytes; could we rename UDT.serialize/deserialize to something better? They actually convert a user-defined object into an object of a Spark SQL type.

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22822 has finished for PR 3068 at commit 2c9d7e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class UserDefinedType(DataType):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22822/
Test PASSed.

@davies
Contributor

davies commented Nov 3, 2014

Note: in order for Python UDFs to work over UDTs, we should make UDTs picklable (register them with Pyrolite, similar to Vector).
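"Picklable" here means instances must survive a pickle round trip on the Python side (Pyrolite provides the matching registration on the JVM side). A plain-Python sketch with a hypothetical user type:

```python
import pickle

class ExamplePoint(object):  # hypothetical user-defined type
    def __init__(self, x, y):
        self.x, self.y = x, y

# A plain class with picklable attributes round-trips without extra work.
restored = pickle.loads(pickle.dumps(ExamplePoint(1.0, 2.0)))
```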

output sqlType as well
@SparkQA

SparkQA commented Nov 3, 2014

Test build #22832 has started for PR 3068 at commit dba5ea7.

  • This patch does not merge cleanly.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22833 has started for PR 3068 at commit acff637.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22832 has finished for PR 3068 at commit dba5ea7.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class UserDefinedType(DataType):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22832/
Test PASSed.

@davies
Contributor

davies commented Nov 4, 2014

LGTM, let's ship it.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22833 has finished for PR 3068 at commit acff637.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class UserDefinedType(DataType):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22833/
Test PASSed.

@marmbrus
Contributor

marmbrus commented Nov 4, 2014

LGTM
On Nov 3, 2014 6:27 PM, "UCB AMPLab" notifications@github.com wrote:

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22833/
Test PASSed.


Reply to this email directly or view it on GitHub
#3068 (comment).

@mengxr
Contributor Author

mengxr commented Nov 4, 2014

@marmbrus @davies Thanks! I've merged this into master and branch-1.2.

asfgit pushed a commit that referenced this pull request Nov 4, 2014
Following #2919, this PR adds Python UDT (for internal use only) with tests under "pyspark.tests". Before `SQLContext.applySchema`, we check whether we need to convert user-type instances into SQL recognizable data. In the current implementation, a Python UDT must be paired with a Scala UDT for serialization on the JVM side. A following PR will add VectorUDT in MLlib for both Scala and Python.

marmbrus jkbradley davies

Author: Xiangrui Meng <meng@databricks.com>

Closes #3068 from mengxr/SPARK-4192-sql and squashes the following commits:

acff637 [Xiangrui Meng] merge master
dba5ea7 [Xiangrui Meng] only use pyClass for Python UDT output sqlType as well
2c9d7e4 [Xiangrui Meng] move import to global setup; update needsConversion
7c4a6a9 [Xiangrui Meng] address comments
75223db [Xiangrui Meng] minor update
f740379 [Xiangrui Meng] remove UDT from default imports
e98d9d0 [Xiangrui Meng] fix py style
4e84fce [Xiangrui Meng] remove local hive tests and add more tests
39f19e0 [Xiangrui Meng] add tests
b7f666d [Xiangrui Meng] add Python UDT

(cherry picked from commit 04450d1)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@asfgit asfgit closed this in 04450d1 Nov 4, 2014
asfgit pushed a commit that referenced this pull request Nov 4, 2014
Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map a RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file. Examples in Scala/Python are attached. The Scala code was copied from jkbradley.

~~This PR contains the changes from #3068 . I will rebase after #3068 is merged.~~

marmbrus jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #3070 from mengxr/SPARK-3573 and squashes the following commits:

3a0b6e5 [Xiangrui Meng] organize imports
236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples

(cherry picked from commit 1a9c6cd)
Signed-off-by: Xiangrui Meng <meng@databricks.com>