[SPARK-26246][SQL] Inferring TimestampType from JSON #23201

MaxGekk · 2018-12-02T20:57:10Z

What changes were proposed in this pull request?

The JsonInferSchema class is extended to infer TimestampType from string fields in JSON input:

If the prefersDecimal option is set to true, it tries to infer decimal type from the string field.
If decimal type inference fails or prefersDecimal is disabled, JsonInferSchema tries to infer TimestampType.
If timestamp type inference fails, StringType is returned as the inferred type.
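The fallback order described above can be sketched as follows. This is a minimal, Spark-free illustration and not the code in JsonInferSchema: the `InferredType` hierarchy, the `inferStringField` name and the fixed timestamp pattern are all assumptions made for the example (in Spark the pattern would come from JSONOptions).

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object InferSketch {
  // Hypothetical stand-ins for Spark's DataType hierarchy.
  sealed trait InferredType
  final case class DecimalT(precision: Int, scale: Int) extends InferredType
  case object TimestampT extends InferredType
  case object StringT extends InferredType

  // Assumed pattern for the example only.
  private val timestampFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")

  def inferStringField(field: String, prefersDecimal: Boolean): InferredType = {
    // Step 1: try decimal, but only when prefersDecimal is enabled.
    def decimalTry: Option[InferredType] =
      if (prefersDecimal) {
        Try(new java.math.BigDecimal(field)).toOption
          .map(d => DecimalT(d.precision, d.scale))
      } else {
        None
      }

    // Step 2: try timestamp; orElse takes its argument by name,
    // so this only runs when step 1 produced nothing.
    def timestampTry: Option[InferredType] =
      Try(LocalDateTime.parse(field, timestampFormat)).toOption.map(_ => TimestampT)

    // Step 3: fall back to string.
    decimalTry.orElse(timestampTry).getOrElse(StringT)
  }
}
```

With these assumptions, `InferSketch.inferStringField("12.5", prefersDecimal = false)` falls through both attempts and yields `StringT`, while a value matching the assumed pattern yields `TimestampT`.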

How was this patch tested?

Added a new test suite, JsonInferSchemaSuite, to check inference of date and timestamp types from JSON using JsonInferSchema directly. A few tests were added to JsonSuite to check type merging and roundtrips. These changes were also tested by JsonSuite, JsonExpressionsSuite and JsonFunctionsSuite.

SparkQA · 2018-12-03T00:35:47Z

Test build #99580 has finished for PR 23201 at commit 9dbdf0a.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class JsonInferSchemaSuite extends SparkFunSuite

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala

SparkQA · 2018-12-03T19:25:44Z

Test build #99613 has finished for PR 23201 at commit 05bbfea.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2018-12-06T16:46:59Z

@cloud-fan May I ask you to look at this PR, please.

cloud-fan · 2018-12-06T17:04:19Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala

@@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
            DecimalType(bigDecimal.precision, bigDecimal.scale)
        }
        decimalTry.getOrElse(StringType)
-      case VALUE_STRING => StringType
+      case VALUE_STRING =>
+        val stringValue = parser.getText


shall we abstract out this logic for all the text sources?

Yes, we can do that. There is some common code that could be shared. Can we do it in a separate PR?

sure. How many text data sources already support it?

DateType is not inferred at all, but there is other type inference code that could be shared between JSON and CSV (and maybe elsewhere).

I checked PartitioningUtils.inferPartitionColumnValue, we try timestamp first and then date. Shall we follow it?

do you mean partition value type inference will have a different result than json value type inference?

I didn't mean type inference for partition values, but you are probably right that we should follow the same logic for schema inference in datasources and for partition value types.

Just wondering how it works for now, this code:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala

Lines 474 to 482 in 5a140b7

    val unescapedRaw = unescapePathName(raw)
    // try and parse the date, if no exception occurs this is a candidate to be resolved as
    // TimestampType
    DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw)
    // SPARK-23436: see comment for date
    val timestampValue = Cast(Literal(unescapedRaw), TimestampType, Some(timeZone.getID)).eval()
    // Disallow TimestampType if the cast returned null
    require(timestampValue != null)
    Literal.create(timestampValue, TimestampType)

and this

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala

Line 163 in f982ca0

if ((allCatch opt timeParser.parse(field)).isDefined) {

can use different timestamp patterns, or it is supposed to work only with default settings?

Maybe inferPartitionColumnValue should ask a datasource for inferring date/timestamp types?

The partition feature is shared between all the file-based sources, so I think it's overkill to make it differ between data sources.

The simplest solution to me is asking all text sources to follow the behavior of partition value type inference.

Yea, I once tried to match it with CSV a long time ago, but I kind of gave up due to behaviour changes IIRC. If that's possible, it would be awesome.

If that's difficult, matching the behaviour within text based datasource (meaning CSV and JSON I guess) should be good enough.

If we switch the order here, we don't need the length check here, right?

@cloud-fan, that works only if we use default date/timestamp patterns. Both should do the exact match with pattern, which unfortunately the current parsing library (SimpleDateFormat) does not allow.

The order here is just to make it look better and both shouldn't be dependent on its order. I think we should support those inferences after completely switching the library to java.time.format.* (which does an exact match, and exists in JDK 8) without a legacy. That should make this change easier without a hole.
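The exact-match point above can be seen with plain JDK classes. The snippet below is only an illustration of the two libraries' parsing behaviour, not Spark code: the pattern, the input string and the `legacyParses` / `strictParses` helper names are chosen for the example. `SimpleDateFormat.parse(String)` is documented as possibly not consuming the whole input, whereas `LocalDate.parse` with a `DateTimeFormatter` rejects unparsed trailing text.

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

object ExactMatchSketch {
  // A date-only pattern applied to a timestamp-looking input.
  val pattern = "yyyy-MM-dd"
  val input = "2018-12-02T21:04:00"

  // Legacy parser: succeeds on a matching prefix and ignores the rest.
  def legacyParses(s: String): Boolean =
    Try(new SimpleDateFormat(pattern).parse(s)).isSuccess

  // java.time parser: fails if any text is left unparsed.
  def strictParses(s: String): Boolean =
    Try(LocalDate.parse(s, DateTimeFormatter.ofPattern(pattern))).isSuccess
}
```

Here `ExactMatchSketch.legacyParses(ExactMatchSketch.input)` succeeds while `ExactMatchSketch.strictParses(ExactMatchSketch.input)` does not, which is why a check such as comparing lengths may be needed with the legacy parser to tell a date from a timestamp.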

SparkQA · 2018-12-16T13:26:05Z

Test build #100202 has finished for PR 23201 at commit 63ebf42.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JsonInferSchemaSuite.scala

cloud-fan · 2018-12-17T00:40:47Z

LGTM

MaxGekk · 2018-12-17T11:15:15Z

@HyukjinKwon @cloud-fan To be consistent with the CSV datasource, should we infer only TimestampType for now?

cloud-fan · 2018-12-17T12:00:12Z

SGTM

SparkQA · 2018-12-18T00:40:10Z

Test build #100261 has finished for PR 23201 at commit e67a2a1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-12-18T00:59:40Z

Test build #100262 has finished for PR 23201 at commit 11daee3.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class JsonInferSchemaSuite extends SparkFunSuite with SQLHelper

HyukjinKwon

LGTM as well

HyukjinKwon · 2018-12-18T05:51:00Z

Merged to master.

rxin · 2019-01-04T17:22:05Z

Is there an option flag for this? This is a breaking change for people, and we need a way to fall back.

MaxGekk · 2019-01-04T17:27:45Z

Is there an option flag for this?

No, I will add it.

MaxGekk · 2019-01-04T17:34:19Z

@rxin Would a JSON-specific option be enough, or do we need a global SQL config for that? I mean a JSON option such as prefersDecimal.

MaxGekk · 2019-01-04T20:39:08Z

Here is the PR: #23455

## What changes were proposed in this pull request?

The `JsonInferSchema` class is extended to infer `TimestampType` from string fields in JSON input:
- If the `prefersDecimal` option is set to `true`, it tries to infer decimal type from the string field.
- If decimal type inference fails or `prefersDecimal` is disabled, `JsonInferSchema` tries to infer `TimestampType`.
- If timestamp type inference fails, `StringType` is returned as the inferred type.

## How was this patch tested?

Added a new test suite, `JsonInferSchemaSuite`, to check inference of date and timestamp types from JSON using `JsonInferSchema` directly. A few tests were added to `JsonSuite` to check type merging and roundtrips. These changes were also tested by `JsonSuite`, `JsonExpressionsSuite` and `JsonFunctionsSuite`.

Closes apache#23201 from MaxGekk/json-infer-time.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

HyukjinKwon · 2019-01-25T10:03:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala

@@ -115,13 +121,19 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
        // record fields' types have been combined.
        NullType

-      case VALUE_STRING if options.prefersDecimal =>
+      case VALUE_STRING =>
+        val field = parser.getText
        val decimalTry = allCatch opt {


Yea, I think this was a mistake. Previously, if prefersDecimal was false (the default), it wouldn't try decimal casting. Now it looks like we always attempt the decimal parse. @bersprockets, can you open a PR to fix it? I think we can just make it lazy.

We shouldn't. The opt method takes its body by name: def opt[U >: T](body: => U): Option[U]. It should not try to infer if options.prefersDecimal is false.

... or prefersDecimal became true by default?

@HyukjinKwon What's the problem here? Could you give more context (JIRA, PR), please?

Here, SPARK-26711. You're cc'ed there as well :).

The problem here is that decimal conversion appears to always be attempted, even when prefersDecimal is false. Previously, prefersDecimal was checked first, so the decimal attempt wasn't made. This appears to cause a performance regression.

I mean

    scala> if (prefersDecimal) allCatch opt { println("im expensive"); true }
    res0: Any = ()

    scala> allCatch opt { println("im expensive"); true }
    im expensive
    res1: Option[Boolean] = Some(true)

and making it lazy will save us by short circuiting. I wanted to open a PR right away but wanted to let him open since this is what I investigated at SPARK-26711.

re: #23201 (comment)
That's being called by name within opt I believe.
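To make the evaluation-order point in this thread concrete, here is a small sketch using only the Scala standard library. It is not the fix that went into Spark; `LazyDecimalSketch`, `parseDecimal`, `eager`, `deferred` and the `attempts` counter are names invented for the illustration. The by-name parameter of `opt` only delays the body until `opt` itself runs, so a plain `val` still triggers the parse, while a `lazy val` combined with `&&` short-circuiting skips it when `prefersDecimal` is false.

```scala
import scala.util.control.Exception.allCatch

object LazyDecimalSketch {
  // Counts how many times the (pretend) expensive parse actually runs.
  var attempts = 0

  private def parseDecimal(field: String): BigDecimal = {
    attempts += 1
    BigDecimal(field)
  }

  // Plain val: opt runs its by-name body immediately, whatever the flag says.
  def eager(field: String, prefersDecimal: Boolean): Boolean = {
    val decimalTry = allCatch.opt(parseDecimal(field))
    prefersDecimal && decimalTry.isDefined
  }

  // lazy val: the parse only runs if decimalTry is actually referenced,
  // which && prevents when prefersDecimal is false.
  def deferred(field: String, prefersDecimal: Boolean): Boolean = {
    lazy val decimalTry = allCatch.opt(parseDecimal(field))
    prefersDecimal && decimalTry.isDefined
  }
}
```

Calling `eager("1.5", prefersDecimal = false)` increments `attempts`, whereas `deferred("1.5", prefersDecimal = false)` leaves it unchanged.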

MaxGekk added 3 commits December 2, 2018 21:06

Added a test for timestamp inferring

2a26e2c

Infer date and timestamp types

bd47207

Test for date type

9dbdf0a

HyukjinKwon reviewed Dec 3, 2018

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala (outdated)

MaxGekk added 2 commits December 3, 2018 13:36

Added a test to check that inferring of the date type is prioritised over timestamps

9376832

Infer date type before timestamp type

05bbfea

cloud-fan reviewed Dec 6, 2018

HyukjinKwon mentioned this pull request Dec 10, 2018

[SPARK-26248][SQL] Infer date type from CSV #23202

Closed

MaxGekk added 2 commits December 16, 2018 10:28

Merge remote-tracking branch 'origin/master' into json-infer-time

53778f9

Fix merges

63ebf42

cloud-fan reviewed Dec 17, 2018

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JsonInferSchemaSuite.scala (outdated)

cloud-fan reviewed Dec 17, 2018

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JsonInferSchemaSuite.scala (outdated)

MaxGekk added 10 commits December 17, 2018 18:04

Merge remote-tracking branch 'origin/master' into json-infer-time

f92ff86

Inferring timestamp only

e6fc432

Test for inferring timestamps and decimals

b27d081

type -> dt

c59e3e8

GMT -> UTC

e7471a7

Test for fallback to string type

82816ed

Fix task is not serializable

b0d1374

Added test for schema inferring

5782de5

Roundtrip test for timestamp inferring

63a6568

Merge branch 'json-infer-time' of github.com:MaxGekk/spark-1 into json-infer-time

e67a2a1

MaxGekk changed the title ~~[SPARK-26246][SQL] Infer date and timestamp types from JSON~~ [SPARK-26246][SQL] Inferring TimestampType from JSON Dec 17, 2018

Testing for legacy and new timestamp parser

11daee3

HyukjinKwon approved these changes Dec 18, 2018

asfgit closed this in d72571e Dec 18, 2018

HyukjinKwon reviewed Jan 25, 2019

MaxGekk deleted the json-infer-time branch August 17, 2019 13:35

MaxGekk mentioned this pull request Nov 24, 2021

[SPARK-37326][SQL] Support TimestampNTZ in CSV data source #34596

Closed
