[SPARK-37360][SQL] Support TimestampNTZ in JSON data source #34638

sadikovi · 2021-11-18T02:07:58Z

What changes were proposed in this pull request?

This PR adds support for TimestampNTZ type in the JSON data source.

Most of the functionality had already been added; this patch verifies that writes and reads work for the TimestampNTZ type and adds schema inference that depends on the format of the timestamp values written. The following rules apply:

If there is a mixture of TIMESTAMP_NTZ and TIMESTAMP_LTZ values, use TIMESTAMP_LTZ.
If there are only TIMESTAMP_NTZ values, resolve using the default timestamp type configured with spark.sql.timestampType.
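
The two rules above can be sketched in plain Scala with java.time. This is only an illustration of the resolution logic, not the code in JsonInferSchema: the Inferred type, the object and method names, and the two patterns are all hypothetical, and defaultType stands in for whatever spark.sql.timestampType is set to.

```scala
import java.time.{LocalDateTime, OffsetDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

object TimestampInferenceSketch {
  // Hypothetical stand-ins for TimestampNTZType, TimestampType and StringType.
  sealed trait Inferred
  case object Ntz extends Inferred
  case object Ltz extends Inferred
  case object Str extends Inferred

  // Illustrative patterns (assumption): the NTZ pattern is the LTZ pattern
  // without the zone offset.
  private val ntzFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")
  private val ltzFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssXXX")

  // Classify a single string by the pattern it matches, if any.
  def classify(value: String): Inferred =
    if (Try(LocalDateTime.parse(value, ntzFormat)).isSuccess) Ntz
    else if (Try(OffsetDateTime.parse(value, ltzFormat)).isSuccess) Ltz
    else Str

  // Resolve one column type from all of its values.
  // defaultType models the type selected by spark.sql.timestampType.
  def inferColumn(values: Seq[String], defaultType: Inferred): Inferred = {
    val kinds = values.map(classify).toSet
    if (kinds.isEmpty || kinds.contains(Str)) Str // simplification: fall back to string
    else if (kinds == Set[Inferred](Ntz)) defaultType // only NTZ values: use the configured default
    else Ltz // any value with a zone offset: TIMESTAMP_LTZ
  }
}
```

For example, inferColumn(Seq("2020-12-12T12:12:12", "2020-12-12T12:12:12+01:00"), Ntz) resolves to Ltz because the column mixes both kinds of values.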

In addition, I introduced a new JSON option, timestampNTZFormat, which is similar to timestampFormat but configures the read/write pattern for TIMESTAMP_NTZ values. It is essentially a copy of the timestamp pattern without the time zone.
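
To make "a copy of the timestamp pattern without the time zone" concrete, here is a small java.time sketch. The two pattern strings are illustrative assumptions, not necessarily the defaults used by the data source, and the object and method names are hypothetical. Only the relationship between the two patterns is the point.

```scala
import java.time.{LocalDateTime, OffsetDateTime}
import java.time.format.DateTimeFormatter

object TimestampPatternSketch {
  // Illustrative patterns (assumption): same fields, with and without the zone offset.
  val timestampPattern = "yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX"
  val timestampNtzPattern = "yyyy-MM-dd'T'HH:mm:ss.SSSSSS"

  private val ltzFormatter = DateTimeFormatter.ofPattern(timestampPattern)
  private val ntzFormatter = DateTimeFormatter.ofPattern(timestampNtzPattern)

  // A value without a time zone is written and read with the NTZ pattern.
  def writeNtz(ts: LocalDateTime): String = ntzFormatter.format(ts)
  def readNtz(text: String): LocalDateTime = LocalDateTime.parse(text, ntzFormatter)

  // A value with a zone offset is written and read with the full pattern.
  def writeLtz(ts: OffsetDateTime): String = ltzFormatter.format(ts)
  def readLtz(text: String): OffsetDateTime = OffsetDateTime.parse(text, ltzFormatter)
}
```

A value written with one pattern cannot be parsed with the other, which is what lets the format of the text distinguish the two timestamp types.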

Why are the changes needed?

The PR fixes issues when writing and reading TimestampNTZ to and from JSON.

Does this PR introduce any user-facing change?

Previously, the JSON data source would infer timestamp values as TimestampType when reading a JSON file. Now, the data source infers the timestamp type from the value format (with or without a time zone) and from the default timestamp type set by spark.sql.timestampType.

A new JSON option timestampNTZFormat is added to control the way values are formatted during writes or parsed during reads.

How was this patch tested?

I extended JsonSuite with a few unit tests to verify that write-read roundtrip works for TIMESTAMP_NTZ and TIMESTAMP_LTZ values.

SparkQA · 2021-11-18T03:34:19Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49831/

gengliangwang · 2021-11-18T03:58:48Z

I left some comments in the PR for CSV: #34596
Let's revisit this one after the CSV one is merged.

SparkQA · 2021-11-18T04:21:09Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49831/

SparkQA · 2021-11-18T07:28:53Z

Test build #145360 has finished for PR 34638 at commit f9d097c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-11-19T01:16:14Z

Test build #145418 has finished for PR 34638 at commit 522f7de.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-11-19T01:34:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49892/

SparkQA · 2021-11-19T02:39:15Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49892/

SparkQA · 2021-11-19T03:30:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49900/

SparkQA · 2021-11-19T04:30:06Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49900/

SparkQA · 2021-11-19T04:39:54Z

Test build #145428 has finished for PR 34638 at commit e867277.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-11-19T05:36:08Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49907/

SparkQA · 2021-11-19T06:20:59Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49907/

SparkQA · 2021-11-19T07:09:45Z

Test build #145435 has finished for PR 34638 at commit 55f9e3f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-12-03T01:40:01Z

Test build #145872 has finished for PR 34638 at commit 412bb61.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class TimedeltaOps(DataTypeOps):
class TimedeltaIndex(Index):
class MissingPandasLikeTimedeltaIndex(MissingPandasLikeIndex):
class SQLStringFormatter(string.Formatter):
class UDFBasicProfiler(BasicProfiler):
class CloudPickleSerializer(FramedSerializer):
class DayTimeIntervalType(AtomicType):
class DayTimeIntervalTypeConverter(object):
class ExecutorPodsPollingSnapshotSource(
class ExecutorPodsWatchSnapshotSource(
class AnsiCombinedTypeCoercionRule(rules: Seq[TypeCoercionRule]) extends
case class RelationTimeTravel(
case class AsOfTimestamp(timestamp: Long) extends TimeTravelSpec
case class AsOfVersion(version: String) extends TimeTravelSpec
class CombinedTypeCoercionRule(rules: Seq[TypeCoercionRule]) extends TypeCoercionRule
case class PrettyPythonUDF(
case class UnclosedCommentProcessor(
case class CreateTable(
case class TableSpec(
case class OptimizeSkewedJoin(ensureRequirements: EnsureRequirements)
// When this is enabled, this class does additional lookup on write operations (put/delete) to

SparkQA · 2021-12-03T02:20:05Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50347/

SparkQA · 2021-12-03T02:28:26Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50349/

SparkQA · 2021-12-03T02:28:45Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50348/

SparkQA · 2021-12-03T03:03:53Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50347/

SparkQA · 2021-12-03T03:12:24Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50349/

SparkQA · 2021-12-03T03:13:19Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50348/

SparkQA · 2021-12-03T03:48:16Z

Test build #145873 has finished for PR 34638 at commit 2cdbb18.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-12-03T03:52:43Z

Test build #145874 has finished for PR 34638 at commit 5fd0dbe.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

sadikovi · 2021-12-03T04:43:49Z

@gengliangwang @MaxGekk could you review the PR? Thank you.

MaxGekk · 2021-12-03T05:38:21Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala

@@ -144,6 +150,9 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
        }
        if (options.prefersDecimal && decimalTry.isDefined) {
          decimalTry.get
+        } else if (options.inferTimestamp &&


Could you adjust the comment for inferTimestamp and mention the ntz type:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala

Line 141 in 1235bd2

* Enables inferring of TimestampType from strings matched to the timestamp pattern

Yes, sure. I will update, thanks for pointing it out!

Updated! Could you check the latest PR version? Thanks.

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala

SparkQA · 2021-12-03T05:46:42Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50358/

SparkQA · 2021-12-03T06:35:00Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50358/

SparkQA · 2021-12-03T09:46:42Z

Test build #145883 has finished for PR 34638 at commit 50690e9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2021-12-03T10:38:46Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala

+
+  test("SPARK-37360: Write and infer TIMESTAMP_LTZ values with a non-default pattern") {
+    withTempPath { path =>
+      val exp = spark.sql("select timestamp_ltz'2020-12-12 12:12:12' as col0")


Could you append a fractional part, like timestamp_ltz'2020-12-12 12:12:12.123456'? I would guess the JSON data source writes .000000 and, if something went wrong, would fail to read even those zeros, but just in case.

This is a good test case! Updated, thank you.

MaxGekk

LGTM except for a minor comment.

gengliangwang

LGTM if Max's comments are addressed

sadikovi · 2021-12-06T06:26:46Z

JSONOptions have the following comment for inferTimestamp:

Enables inferring of TimestampType and TimestampNTZType from strings matched to the corresponding timestamp pattern defined by the timestampFormat and timestampNTZFormat options respectively.

Let me know if you would like any modifications to the javadoc.
I also updated the tests, would appreciate another round of reviews. Thanks!

SparkQA · 2021-12-06T07:40:07Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50416/

MaxGekk · 2021-12-06T08:24:15Z

+1, LGTM. Merging to master.
Thank you, @sadikovi and @gengliangwang for review.

SparkQA · 2021-12-06T08:42:21Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50416/

SparkQA · 2021-12-06T11:24:38Z

Test build #145940 has finished for PR 34638 at commit a626c9d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sadikovi · 2021-12-06T20:56:13Z

Thank you!

sadikovi added 2 commits November 18, 2021 14:49

add TimestampNTZType support in JSON

5a71c29

minor formatting

f9d097c

github-actions bot added DOCS SQL labels Nov 18, 2021

update code

522f7de

fix DateExpressionsSuite

e867277

update timestampNTZ/timestamp.sql.out file

55f9e3f

sadikovi added 3 commits December 3, 2021 14:24

rebase

412bb61

update files

2cdbb18

update files

5fd0dbe

update tests

50690e9

MaxGekk reviewed Dec 3, 2021

MaxGekk approved these changes Dec 3, 2021

gengliangwang approved these changes Dec 3, 2021

address comments

a626c9d

MaxGekk closed this in 4f36978 Dec 6, 2021
