[SPARK-37360][SQL] Support TimestampNTZ in JSON data source #34638

Closed
wants to merge 10 commits into from

sadikovi
Contributor

What changes were proposed in this pull request?

This PR adds support for TimestampNTZ type in the JSON data source.

Most of the functionality has already been added; this patch verifies that writes and reads work for the TimestampNTZ type and adds schema inference that depends on the format of the written timestamp values. The following rules apply:

  • If there is a mixture of TIMESTAMP_NTZ and TIMESTAMP_LTZ values, use TIMESTAMP_LTZ.
  • If there are only TIMESTAMP_NTZ values, resolve using the default timestamp type configured with spark.sql.timestampType.
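
The two rules above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code in JsonInferSchema: the helper names `classify` and `infer_column_type` are hypothetical, `datetime.fromisoformat` stands in for Spark's pattern-based parsing, and treating a column that holds only zoned values as TIMESTAMP_LTZ is an assumption.

```python
from datetime import datetime

TIMESTAMP_NTZ = "TIMESTAMP_NTZ"
TIMESTAMP_LTZ = "TIMESTAMP_LTZ"


def classify(value):
    """Return TIMESTAMP_LTZ if the string carries a UTC offset, else TIMESTAMP_NTZ."""
    parsed = datetime.fromisoformat(value)
    return TIMESTAMP_LTZ if parsed.tzinfo is not None else TIMESTAMP_NTZ


def infer_column_type(values, default_timestamp_type=TIMESTAMP_LTZ):
    """Merge per-value classifications into one column type.

    default_timestamp_type plays the role of spark.sql.timestampType.
    """
    kinds = {classify(v) for v in values}
    if TIMESTAMP_LTZ in kinds:
        # Mixture of NTZ and LTZ values (or, by assumption, LTZ only): use LTZ.
        return TIMESTAMP_LTZ
    # Only NTZ values: fall back to the configured default timestamp type.
    return default_timestamp_type
```

With these definitions, a column mixing `"2020-12-12T12:12:12"` and `"2020-12-12T12:12:12+01:00"` resolves to TIMESTAMP_LTZ, while a column of only zone-less strings resolves to whatever default is passed in.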

In addition, I introduced a new JSON option, timestampNTZFormat, which is similar to timestampFormat but configures the read/write pattern for TIMESTAMP_NTZ values. It is essentially a copy of the timestamp pattern without the time zone.
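
To make the difference between the two patterns concrete, here is a rough sketch using Python's `strptime`. The two constants are stand-ins for the timestampFormat and timestampNTZFormat options, written in Python format codes rather than Spark's datetime pattern syntax, and the `matches` helper is hypothetical.

```python
from datetime import datetime

# Stand-ins for the two options; Python format codes, not Spark patterns.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"    # with a zone offset
TIMESTAMP_NTZ_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"  # same pattern, zone removed


def matches(value, pattern):
    """Return True if the whole string parses under the given pattern."""
    try:
        datetime.strptime(value, pattern)
        return True
    except ValueError:
        return False
```

Under these assumptions a zone-less value such as `"2020-12-12T12:12:12.123456"` matches only the NTZ pattern, and a value with an offset such as `"2020-12-12T12:12:12.123456+01:00"` matches only the zoned pattern, which is what lets the format of a value decide its type.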

Why are the changes needed?

The PR fixes issues when writing and reading TimestampNTZ to and from JSON.

Does this PR introduce any user-facing change?

Previously, the JSON data source inferred timestamp values as TimestampType when reading a JSON file. Now, the data source infers the timestamp type from the value format (with or without a time zone) and from the default timestamp type set by spark.sql.timestampType.

A new JSON option timestampNTZFormat is added to control the way values are formatted during writes or parsed during reads.

How was this patch tested?

I extended JsonSuite with a few unit tests to verify that write-read roundtrip works for TIMESTAMP_NTZ and TIMESTAMP_LTZ values.
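
The shape of such a round-trip check can be illustrated outside Spark. The following is only a sketch of the idea, not the tests added to JsonSuite: `write_row`, `read_row`, and `NTZ_PATTERN` are hypothetical names, and the standard `json` module stands in for the JSON data source.

```python
import json
from datetime import datetime

# Hypothetical zone-less pattern, in Python format codes.
NTZ_PATTERN = "%Y-%m-%dT%H:%M:%S.%f"


def write_row(ts):
    """Serialize a naive datetime into a one-column JSON line."""
    return json.dumps({"col0": ts.strftime(NTZ_PATTERN)})


def read_row(line):
    """Parse the JSON line back into a naive datetime."""
    return datetime.strptime(json.loads(line)["col0"], NTZ_PATTERN)


expected = datetime(2020, 12, 12, 12, 12, 12, 123456)
# The value read back must equal the value written, fraction included.
assert read_row(write_row(expected)) == expected
```

Using a value with a non-zero fractional part makes the check stricter, since a pattern that dropped or misread the fraction would fail the comparison.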

SparkQA commented Nov 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49831/

@gengliangwang
Member

I left some comments in the PR for CSV: #34596
Let's revisit this one after the CSV one is merged.

SparkQA commented Nov 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49831/

SparkQA commented Nov 18, 2021

Test build #145360 has finished for PR 34638 at commit f9d097c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 19, 2021

Test build #145418 has finished for PR 34638 at commit 522f7de.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49892/

SparkQA commented Nov 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49892/

SparkQA commented Nov 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49900/

SparkQA commented Nov 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49900/

SparkQA commented Nov 19, 2021

Test build #145428 has finished for PR 34638 at commit e867277.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49907/

SparkQA commented Nov 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49907/

SparkQA commented Nov 19, 2021

Test build #145435 has finished for PR 34638 at commit 55f9e3f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 3, 2021

Test build #145872 has finished for PR 34638 at commit 412bb61.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class TimedeltaOps(DataTypeOps):
  • class TimedeltaIndex(Index):
  • class MissingPandasLikeTimedeltaIndex(MissingPandasLikeIndex):
  • class SQLStringFormatter(string.Formatter):
  • class UDFBasicProfiler(BasicProfiler):
  • class CloudPickleSerializer(FramedSerializer):
  • class DayTimeIntervalType(AtomicType):
  • class DayTimeIntervalTypeConverter(object):
  • class ExecutorPodsPollingSnapshotSource(
  • class ExecutorPodsWatchSnapshotSource(
  • class AnsiCombinedTypeCoercionRule(rules: Seq[TypeCoercionRule]) extends
  • case class RelationTimeTravel(
  • case class AsOfTimestamp(timestamp: Long) extends TimeTravelSpec
  • case class AsOfVersion(version: String) extends TimeTravelSpec
  • class CombinedTypeCoercionRule(rules: Seq[TypeCoercionRule]) extends TypeCoercionRule
  • case class PrettyPythonUDF(
  • case class UnclosedCommentProcessor(
  • case class CreateTable(
  • case class TableSpec(
  • case class OptimizeSkewedJoin(ensureRequirements: EnsureRequirements)
  • // When this is enabled, this class does additional lookup on write operations (put/delete) to

SparkQA commented Dec 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50347/

SparkQA commented Dec 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50349/

SparkQA commented Dec 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50348/

SparkQA commented Dec 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50347/

SparkQA commented Dec 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50349/

SparkQA commented Dec 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50348/

SparkQA commented Dec 3, 2021

Test build #145873 has finished for PR 34638 at commit 2cdbb18.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 3, 2021

Test build #145874 has finished for PR 34638 at commit 5fd0dbe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sadikovi
Contributor Author

sadikovi commented Dec 3, 2021

@gengliangwang @MaxGekk could you review the PR? Thank you.

@@ -144,6 +150,9 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
    }
    if (options.prefersDecimal && decimalTry.isDefined) {
      decimalTry.get
    } else if (options.inferTimestamp &&
Member

Could you adjust the comment for inferTimestamp and mention the ntz type:

* Enables inferring of TimestampType from strings matched to the timestamp pattern

Contributor Author

Yes, sure. I will update, thanks for pointing it out!

Contributor Author

Updated! Could you check the latest PR version? Thanks.

SparkQA commented Dec 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50358/

SparkQA commented Dec 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50358/

SparkQA commented Dec 3, 2021

Test build #145883 has finished for PR 34638 at commit 50690e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


test("SPARK-37360: Write and infer TIMESTAMP_LTZ values with a non-default pattern") {
  withTempPath { path =>
    val exp = spark.sql("select timestamp_ltz'2020-12-12 12:12:12' as col0")
Member

Could you append a fraction part like timestamp_ltz'2020-12-12 12:12:12.123456'. Though I would guess JSON ds should write .000000, and if something went wrong, it won't be able to read even such zeros. But just in case.

Contributor Author

This is a good test case! Updated, thank you.

Member

@MaxGekk MaxGekk left a comment

LGTM except for a minor comment.

Member

@gengliangwang gengliangwang left a comment

LGTM if Max's comments are addressed

@sadikovi
Contributor Author

sadikovi commented Dec 6, 2021

JSONOptions have the following comment for inferTimestamp:

Enables inferring of TimestampType and TimestampNTZType from strings matched to the corresponding timestamp pattern defined by the timestampFormat and timestampNTZFormat options respectively.

Let me know if you would like any modifications to the javadoc.
I also updated the tests, would appreciate another round of reviews. Thanks!

SparkQA commented Dec 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50416/

@MaxGekk
Member

MaxGekk commented Dec 6, 2021

+1, LGTM. Merging to master.
Thank you, @sadikovi and @gengliangwang for review.

@MaxGekk MaxGekk closed this in 4f36978 Dec 6, 2021

SparkQA commented Dec 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50416/

SparkQA commented Dec 6, 2021

Test build #145940 has finished for PR 34638 at commit a626c9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sadikovi
Contributor Author

sadikovi commented Dec 6, 2021

Thank you!
