Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3988][SQL] add public API for date type #2901

Closed
wants to merge 6 commits into from

Conversation

adrian-wang
Copy link
Contributor

Add json and python api for date type.
By using Pickle, java.sql.Date was serialized as calendar, and recognized in python as datetime.datetime.

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have started for PR 2901 at commit 8d7dd22.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have finished for PR 2901 at commit 8d7dd22.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DateType(PrimitiveType):

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22053/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have started for PR 2901 at commit 1d74448.

  • This patch merges cleanly.

@@ -76,6 +79,7 @@ public void constructSimpleRow() {
new Boolean(booleanValue),
stringValue, // StringType
binaryValue, // BinaryType
dateValue, // DateType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: indentation is off

@liancheng
Copy link
Contributor

LGTM, thanks!

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have finished for PR 2901 at commit 1d74448.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DateType(PrimitiveType):

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22055/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have started for PR 2901 at commit 444f100.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have started for PR 2901 at commit f760d8e.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have finished for PR 2901 at commit 444f100.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DateType(PrimitiveType):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22058/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have finished for PR 2901 at commit f760d8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DateType(PrimitiveType):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22059/
Test PASSed.

@@ -1065,7 +1074,9 @@ def applySchema(self, rdd, schema):
[Row(field1=1, field2=u'row1'),..., Row(field1=3, field2=u'row3')]

>>> from datetime import datetime
>>> from datetime import date
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: these two lines can be combined.

@davies
Copy link
Contributor

davies commented Oct 23, 2014

Thanks for fix so many typos!

It will be awesome to recognize all Date/Timestamps values in JsonRDD. If it's not easy to do it in this PR, we could do it in another one.

... lambda x: (x.byte1, x.byte2, x.short1, x.short2, x.int, x.float, x.date,
... x.time, x.map["a"], x.struct.b, x.list, x.null))
>>> results.collect()[0] # doctest: +NORMALIZE_WHITESPACE
(127, -128, -32768, 32767, 2147483647, 1.0, datetime.datetime(2010, 1, 1, 0, 0),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davies because of using pyrolite, java.sql.Date is serialized in the same way as java.sql.Timestamp, since they are all subtype of java.util.Date. And this make the dumps() function to generate datetime instead of date for java.util.Date. I think this is related to your comments in JIRA SPARK-2674

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. After the data was deserialized in Python, we need to some data coversions, so we can convert datetime to date if DataType is DateType.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so should the convert in python side or scala side, which one would you prefer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can only do it in Python side.

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

Test build #22133 has started for PR 2901 at commit 5670626.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

Test build #22133 has finished for PR 2901 at commit 5670626.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DateType(PrimitiveType):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22133/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #440 has started for PR 2901 at commit 5670626.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #440 has finished for PR 2901 at commit 5670626.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DateType(PrimitiveType):

@adrian-wang
Copy link
Contributor Author

retest this please.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #22215 has started for PR 2901 at commit 5670626.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #22215 has finished for PR 2901 at commit 5670626.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DateType(PrimitiveType):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22215/
Test PASSed.

@marmbrus
Copy link
Contributor

@davies does anything else remain to be done here?

@davies
Copy link
Contributor

davies commented Oct 27, 2014

@marmbrus There is a bug: DateType object will be datetime in Python. Also we could improve DataType/TimestampType support in jsonRDD, it could be done separately.

@adrian-wang You could convert datetime object into date object in _create_properties/_create_object

@davies
Copy link
Contributor

davies commented Oct 27, 2014

@adrian-wang You can do it like this:

diff --git a/python/pyspark/sql.py b/python/pyspark/sql.py
index 7daf306..db256a4 100644
--- a/python/pyspark/sql.py
+++ b/python/pyspark/sql.py
@@ -782,6 +782,8 @@ def _restore_object(dataType, obj):

 def _create_object(cls, v):
     """ Create an customized object with class `cls`. """
+    if cls is datetime.date and isinstance(v, datetime.datetime):
+        return v.date()
     return cls(v) if v is not None else v


@@ -796,9 +798,11 @@ def _create_getter(dt, i):


 def _has_struct(dt):
-    """Return whether `dt` is or has StructType in it"""
+    """Return whether `dt` is or has StructType/DateType in it"""
     if isinstance(dt, StructType):
         return True
+    if isinstance(dt, DateType):
+        return True
     elif isinstance(dt, ArrayType):
         return _has_struct(dt.elementType)
     elif isinstance(dt, MapType):
@@ -870,6 +874,9 @@ def _create_cls(dataType):

         return Dict

+    elif isinstance(dataType, DateType):
+        return datetime.date
+
     elif not isinstance(dataType, StructType):
         raise Exception("unexpected data type: %s" % dataType)

@adrian-wang
Copy link
Contributor Author

@davies
Thanks for the tips! I'll follow up very soon.

@SparkQA
Copy link

SparkQA commented Oct 27, 2014

Test build #22274 has started for PR 2901 at commit c51a24d.

  • This patch merges cleanly.

@davies
Copy link
Contributor

davies commented Oct 27, 2014

LGTM, waiting for the tests.

@SparkQA
Copy link

SparkQA commented Oct 27, 2014

Test build #22274 has finished for PR 2901 at commit c51a24d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DateType(PrimitiveType):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22274/
Test PASSed.

@marmbrus
Copy link
Contributor

Thanks! Merged to master.

@asfgit asfgit closed this in 47a40f6 Oct 28, 2014
asfgit pushed a commit that referenced this pull request Nov 11, 2014
This implement the feature davies mentioned in #2901 (diff)

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3012 from adrian-wang/iso8601 and squashes the following commits:

50df6e7 [Daoyuan Wang] json data timestamp ISO8601 support
asfgit pushed a commit that referenced this pull request Nov 11, 2014
This implement the feature davies mentioned in #2901 (diff)

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3012 from adrian-wang/iso8601 and squashes the following commits:

50df6e7 [Daoyuan Wang] json data timestamp ISO8601 support

(cherry picked from commit a1fc059)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants