[SPARK-37360][SQL] Support TimestampNTZ in JSON data source #34638
Conversation
I left some comments in the PR for CSV: #34596
@gengliangwang @MaxGekk could you review the PR? Thank you.
@@ -144,6 +150,9 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
        }
        if (options.prefersDecimal && decimalTry.isDefined) {
          decimalTry.get
        } else if (options.inferTimestamp &&
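The truncated branch above gates timestamp inference on `options.inferTimestamp`. As a rough, self-contained sketch of the idea (the object name and formatter patterns here are illustrative assumptions, not Spark's actual implementation): a string that parses with a zone offset infers as TIMESTAMP_LTZ, one that parses without a zone as TIMESTAMP_NTZ, and anything else stays a string.

```scala
import java.time.{LocalDateTime, ZonedDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical standalone sketch of the inference step in the diff above.
object TimestampInference {
  // Pattern with a mandatory zone offset (XXX) and an optional fraction.
  private val ltzFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss[.SSSSSS]XXX")
  // Same pattern without any zone section.
  private val ntzFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss[.SSSSSS]")

  def infer(field: String): String =
    if (Try(ZonedDateTime.parse(field, ltzFormatter)).isSuccess) "TIMESTAMP_LTZ"
    else if (Try(LocalDateTime.parse(field, ntzFormatter)).isSuccess) "TIMESTAMP_NTZ"
    else "STRING"
}
```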
Could you adjust the comment for inferTimestamp and mention the ntz type:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
Line 141 in 1235bd2
* Enables inferring of TimestampType from strings matched to the timestamp pattern
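For illustration, the adjusted scaladoc might read something like the following (an assumption about the final wording, not necessarily the merged text):

```scala
/**
 * Enables inferring of TimestampType and TimestampNTZType from strings matched
 * to the timestamp pattern; which of the two is chosen depends on whether the
 * value carries a time zone and on spark.sql.timestampType.
 */
```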
Yes, sure. I will update, thanks for pointing it out!
Updated! Could you check the latest PR version? Thanks.
test("SPARK-37360: Write and infer TIMESTAMP_LTZ values with a non-default pattern") {
  withTempPath { path =>
    val exp = spark.sql("select timestamp_ltz'2020-12-12 12:12:12' as col0")
Could you append a fraction part, like timestamp_ltz'2020-12-12 12:12:12.123456'? Though I would guess the JSON data source should write .000000, and if something went wrong, it wouldn't be able to read even such zeros. But just in case.
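The round-trip property this suggestion is probing can be sketched standalone (the object name and pattern are illustrative assumptions mirroring a timestamp format without a time zone, not Spark's actual default): a value formatted with a fractional part must parse back to exactly the same instant.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical sketch of the write-then-read round trip for fractional seconds.
object FractionRoundTrip {
  // Six-digit fraction, no zone section (an assumption for illustration).
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")

  // Format the value to a string and parse it back, as a write + read would.
  def roundTrip(ts: LocalDateTime): LocalDateTime =
    LocalDateTime.parse(ts.format(fmt), fmt)
}
```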
This is a good test case! Updated, thank you.
LGTM except for a minor comment.
LGTM if Max's comments are addressed
JSONOptions has the following comment for
Let me know if you would like any modifications to the javadoc.
+1, LGTM. Merging to master.
Thank you!
What changes were proposed in this pull request?
This PR adds support for TimestampNTZ type in the JSON data source.
Most of the functionality has already been added; this patch verifies that writes and reads work for the TimestampNTZ type and adds schema inference depending on the format of the timestamp values written. The following applies:
- For columns with both TIMESTAMP_NTZ and TIMESTAMP_LTZ values, use TIMESTAMP_LTZ.
- For columns with only TIMESTAMP_NTZ values, resolve using the default timestamp type configured with spark.sql.timestampType.
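The type-merge rule in the bullets above can be sketched as a tiny standalone function (the names here are illustrative, not Spark's internal API): mixing the two timestamp flavors always falls back to the zone-aware type.

```scala
// Hypothetical stand-ins for the two inferred timestamp types.
sealed trait InferredType
case object TimestampNTZ extends InferredType
case object TimestampLTZ extends InferredType

object TypeMerge {
  // If either side is zone-aware, the merged column type is TIMESTAMP_LTZ;
  // only a column of pure NTZ values stays TIMESTAMP_NTZ.
  def merge(a: InferredType, b: InferredType): InferredType =
    if (a == TimestampLTZ || b == TimestampLTZ) TimestampLTZ else TimestampNTZ
}
```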
In addition, I introduced a new JSON option timestampNTZFormat, which is similar to timestampFormat but allows configuring the read/write pattern for TIMESTAMP_NTZ types. It is basically a copy of the timestamp pattern but without the timezone.

Why are the changes needed?
The PR fixes issues when writing TimestampNTZ values to JSON and reading them back.
Does this PR introduce any user-facing change?
Previously, the JSON data source would infer timestamp values as TimestampType when reading a JSON file. Now, the data source infers the timestamp type based on the value format (with or without timezone) and the default timestamp type configured with spark.sql.timestampType.

A new JSON option timestampNTZFormat is added to control the way values are formatted during writes or parsed during reads.

How was this patch tested?
I extended JsonSuite with a few unit tests to verify that the write-read roundtrip works for TIMESTAMP_NTZ and TIMESTAMP_LTZ values.