
[SPARK-26246][SQL] Inferring TimestampType from JSON #23201

Closed

MaxGekk wants to merge 18 commits into apache:master from MaxGekk:json-infer-time

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Dec 2, 2018

What changes were proposed in this pull request?

The JsonInferSchema class is extended to support inferring TimestampType from string fields in JSON input (a sketch of the resulting order follows the list):

  • If the prefersDecimal option is set to true, it first tries to infer a decimal type from the string field.
  • If decimal type inference fails or prefersDecimal is disabled, JsonInferSchema tries to infer TimestampType.
  • If timestamp type inference fails, StringType is returned as the inferred type.
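A minimal sketch of that order, with stand-in parsers (BigDecimal and java.sql.Timestamp.valueOf here replace the PR's options-aware decimalParser and timestamp parser; this is not the actual JsonInferSchema code):

```scala
import scala.util.control.Exception.allCatch
import org.apache.spark.sql.types.{DataType, DecimalType, StringType, TimestampType}

// Sketch of the inference order for a string-valued JSON field.
def inferStringField(field: String, prefersDecimal: Boolean): DataType = {
  val decimalTry: Option[DataType] = if (prefersDecimal) {
    allCatch opt {
      val d = BigDecimal(field) // stand-in for the PR's decimalParser
      DecimalType(d.precision, d.scale)
    }
  } else None
  val timestampTry: Option[DataType] = allCatch opt {
    java.sql.Timestamp.valueOf(field) // stand-in for the options-aware parser
    TimestampType
  }
  // Decimal first (when enabled), then timestamp, then fall back to string.
  decimalTry.orElse(timestampTry).getOrElse(StringType)
}
```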

How was this patch tested?

Added a new test suite, JsonInferSchemaSuite, to check inferring date and timestamp types from JSON using JsonInferSchema directly. A few tests were added to JsonSuite to check type merging and round trips. These changes are also covered by JsonSuite, JsonExpressionsSuite and JsonFunctionsSuite.

@SparkQA

SparkQA commented Dec 3, 2018

Test build #99580 has finished for PR 23201 at commit 9dbdf0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class JsonInferSchemaSuite extends SparkFunSuite

@SparkQA

SparkQA commented Dec 3, 2018

Test build #99613 has finished for PR 23201 at commit 05bbfea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Dec 6, 2018

@cloud-fan May I ask you to look at this PR, please?

```diff
@@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
         DecimalType(bigDecimal.precision, bigDecimal.scale)
       }
       decimalTry.getOrElse(StringType)
-    case VALUE_STRING => StringType
+    case VALUE_STRING =>
+      val stringValue = parser.getText
```
Contributor

shall we abstract out this logic for all the text sources?

Member Author

Yes, we can do that. There is some common code that could be shared. Can we do it in a separate PR?

Contributor

Sure. How many text data sources already support it?

Member Author

DateType is not inferred at all, but there is other type-inference code that could be shared between JSON and CSV (and maybe elsewhere).

Contributor

I checked PartitioningUtils.inferPartitionColumnValue: we try timestamp first and then date. Shall we follow that?

Contributor

Do you mean partition value type inference will have a different result than JSON value type inference?

Member Author

I didn't mean type inference of partition values, but you are probably right: we should follow the same logic when inferring schemas in datasources and in partition value types.

Just wondering how it works now. This code:

```scala
val unescapedRaw = unescapePathName(raw)
// try and parse the date, if no exception occurs this is a candidate to be resolved as
// TimestampType
DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw)
// SPARK-23436: see comment for date
val timestampValue = Cast(Literal(unescapedRaw), TimestampType, Some(timeZone.getID)).eval()
// Disallow TimestampType if the cast returned null
require(timestampValue != null)
Literal.create(timestampValue, TimestampType)
```

and this one:

```scala
if ((allCatch opt timeParser.parse(field)).isDefined) {
```

can use different timestamp patterns (illustrated below). Or is it supposed to work only with default settings?

Maybe inferPartitionColumnValue should ask the datasource to infer date/timestamp types?
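A hedged illustration of how the two code paths could diverge, using made-up patterns (the real parsers are constructed differently, but the point is that one pattern comes from reader options while the other is fixed):

```scala
import java.text.SimpleDateFormat
import scala.util.Try

// JSON side: the pattern would come from the reader's timestampFormat option.
val jsonParser = new SimpleDateFormat("dd/MM/yyyy HH:mm")
// Partition side: a fixed default pattern, independent of reader options.
val partitionParser = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S")

val field = "31/12/2018 23:59"
println(Try(jsonParser.parse(field)).isSuccess)      // true: JSON side would infer a timestamp
println(Try(partitionParser.parse(field)).isSuccess) // false: partition side would not
```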

Contributor

The partition feature is shared by all the file-based sources; I think it's overkill to make it differ across data sources.

The simplest solution, to me, is to ask all text sources to follow the behavior of partition value type inference.

Member

Yea, I tried to match it with CSV a long time ago, but I kind of gave up due to behaviour changes, IIRC. If that's possible, it would be awesome.

If that's difficult, matching the behaviour within the text-based datasources (meaning CSV and JSON, I guess) should be good enough.

Member

@HyukjinKwon HyukjinKwon Dec 10, 2018

> If we switch the order here, we don't need the length check here, right?

@cloud-fan, that works only if we use default date/timestamp patterns. Both should do an exact match against the pattern, which the current parsing library (SimpleDateFormat) unfortunately does not allow.

The order here is just to make it look better; neither branch should depend on the order. I think we should support these inferences after completely switching the library to java.time.format.* (which does an exact match, and exists in JDK 8) without the legacy path. That would make this change easier, without holes.
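For concreteness, a small illustration (not from the PR) of lenient SimpleDateFormat parsing versus the exact matching in java.time:

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// The legacy parser stops at the first unmatched character, so trailing
// text is silently ignored: no exact match.
val legacy = new SimpleDateFormat("yyyy-MM-dd")
println(legacy.parse("2018-12-17garbage")) // parses, ignoring "garbage"

// java.time requires the whole input to match and rejects the same value.
val strict = DateTimeFormatter.ofPattern("yyyy-MM-dd")
LocalDate.parse("2018-12-17garbage", strict) // throws DateTimeParseException
```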

@SparkQA

SparkQA commented Dec 16, 2018

Test build #100202 has finished for PR 23201 at commit 63ebf42.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM

@MaxGekk
Member Author

MaxGekk commented Dec 17, 2018

@HyukjinKwon @cloud-fan To be consistent with the CSV datasource, should we infer only TimestampType for now?

@cloud-fan
Contributor

SGTM

@MaxGekk MaxGekk changed the title [SPARK-26246][SQL] Infer date and timestamp types from JSON [SPARK-26246][SQL] Inferring TimestampType from JSON Dec 17, 2018
@SparkQA

SparkQA commented Dec 18, 2018

Test build #100261 has finished for PR 23201 at commit e67a2a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 18, 2018

Test build #100262 has finished for PR 23201 at commit 11daee3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class JsonInferSchemaSuite extends SparkFunSuite with SQLHelper

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM as well

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in d72571e Dec 18, 2018
@rxin
Contributor

rxin commented Jan 4, 2019

Is there an option flag for this? This is a breaking change for people, and we need a way to fall back.

@MaxGekk
Member Author

MaxGekk commented Jan 4, 2019

> Is there an option flag for this?

No, I will add it.

@MaxGekk
Member Author

MaxGekk commented Jan 4, 2019

@rxin Would a JSON-specific option be enough, or do we need a global SQL config for that? I mean a JSON option, like prefersDecimal.

@MaxGekk
Member Author

MaxGekk commented Jan 4, 2019

Here is the PR: #23455

holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019

Closes apache#23201 from MaxGekk/json-infer-time.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
```diff
@@ -115,13 +121,19 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
       // record fields' types have been combined.
       NullType

-    case VALUE_STRING if options.prefersDecimal =>
+    case VALUE_STRING =>
+      val field = parser.getText
       val decimalTry = allCatch opt {
```
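To ground the discussion below, here is a plausible reconstruction of the full branch (assumed shape with stand-in names such as the SimpleDateFormat-based timestampFormatter and BigDecimal parsing; not the verbatim file). Note that decimalTry is computed before prefersDecimal is checked:

```scala
import java.text.SimpleDateFormat
import scala.util.control.Exception.allCatch
import org.apache.spark.sql.types.{DataType, DecimalType, StringType, TimestampType}

// Reconstruction: decimalTry is evaluated unconditionally,
// even when prefersDecimal is false.
def inferField(field: String, prefersDecimal: Boolean): DataType = {
  val timestampFormatter = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss") // stand-in
  val decimalTry = allCatch opt {
    val bigDecimal = BigDecimal(field) // stand-in for decimalParser
    DecimalType(bigDecimal.precision, bigDecimal.scale)
  }
  if (prefersDecimal && decimalTry.isDefined) {
    decimalTry.get
  } else if ((allCatch opt timestampFormatter.parse(field)).isDefined) {
    TimestampType
  } else {
    StringType
  }
}
```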
Member

Yea, I think this was a mistake. Previously, if prefersDecimal was false (the default), it wouldn't try decimal casting. Now it looks like the decimal try is always attempted. @bersprockets, can you open a PR to fix it? I think we can just make it lazy.

Member Author

@MaxGekk MaxGekk Jan 25, 2019

We shouldn't. The opt method takes its body by name: `def opt[U >: T](body: => U): Option[U]`. It should not try to infer if options.prefersDecimal is false.

... or has prefersDecimal become true by default?

Member Author

@HyukjinKwon What's the problem here? Could you give more context (JIRA, PR), please?

Member

Here, SPARK-26711. You're cc'ed there as well :).

Member

The problem here is that the decimal conversion looks like it's always attempted, even when prefersDecimal is false. Previously, prefersDecimal was checked first, so no decimal try was made. This looks like it causes a performance regression.

I mean:

```scala
scala> if (prefersDecimal) allCatch opt { println("im expensive"); true }
res0: Any = ()

scala> allCatch opt { println("im expensive"); true }
im expensive
res1: Option[Boolean] = Some(true)
```

Member

Making it lazy will save us by short-circuiting. I wanted to open a PR right away, but I'd rather let him open it, since this is what I investigated in SPARK-26711.
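A minimal, self-contained sketch of the suggested fix (assumed shape, not the actual SPARK-26711 patch), echoing the "im expensive" REPL example above:

```scala
import scala.util.control.Exception.allCatch

// `lazy val` defers the body; the `&&` guard short-circuits before it runs.
def infer(field: String, prefersDecimal: Boolean): String = {
  lazy val decimalTry = allCatch opt { println("expensive parse"); BigDecimal(field) }
  if (prefersDecimal && decimalTry.isDefined) "DecimalType" else "StringType"
}

// infer("1.23", prefersDecimal = false) prints nothing: the parse never runs.
// infer("1.23", prefersDecimal = true) prints "expensive parse" once.
```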

Member

Re: #23201 (comment). That's being called by name within opt, I believe.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
@MaxGekk MaxGekk deleted the json-infer-time branch August 17, 2019 13:35