
[SPARK-26503][CORE] Get rid of spark.sql.legacy.timeParser.enabled #23411

Closed

Conversation

@srowen (Member) commented Dec 30, 2018

What changes were proposed in this pull request?

Per discussion in #23391 (comment) this proposes to just remove the old pre-Spark-3 time parsing behavior.

How was this patch tested?

Existing tests.

@HyukjinKwon (Member)

The time parser change is kind of a breaking change, but I'm fine with removing it considering that we are going for 3.0. The legacy parser makes the code difficult to maintain and blocks some features like date type inference.

@SparkQA commented Dec 30, 2018

Test build #100545 has finished for PR 23411 at commit c9a94f4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member) commented Dec 30, 2018

For the failed test, there is a ticket: https://issues.apache.org/jira/browse/SPARK-26374 . I think we need to change the test to use the Java 8 time API for checking results.
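As I understand the suggestion, checking results via the Java 8 time API would mean comparing parsed values as `java.time.Instant` instead of strings produced by the legacy formatter. A minimal sketch; the pattern and input values here are illustrative, not the actual test's:

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;

public class Java8ResultCheck {
    public static void main(String[] args) {
        // Parse with the java.time API and compare Instants directly,
        // rather than comparing strings formatted by SimpleDateFormat.
        DateTimeFormatter fmt =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX");
        Instant actual = Instant.from(fmt.parse("2011-12-03T10:15:30.000+01:00"));
        Instant expected = Instant.parse("2011-12-03T09:15:30Z"); // same moment in UTC
        if (!actual.equals(expected)) {
            throw new AssertionError("parsed instant mismatch: " + actual);
        }
        System.out.println(actual);
    }
}
```

Comparing `Instant` values sidesteps formatting differences between the two APIs entirely.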

@srowen (Member, Author) commented Dec 30, 2018

@HyukjinKwon yeah, this was suggested by @hvanhovell ... it's unclear whether we need a safety-valve flag, given this is a major release, and I don't know whether people would have relied on the previous behavior.

@MaxGekk yeah, I saw that; part of my concern is that the test isn't 'compatible' with the new change yet, which needs to be addressed.

@srowen (Member, Author) commented Dec 30, 2018

Hm, this is a tough one, @MaxGekk. For the test case that failed most recently, the randomly generated timestamp that causes the problem is -61688070376409L. I believe it's Jackson that ultimately serializes the timestamp to a formatted date, and it produces {"index":2,"col":"0015-03-08T09:00:45.591-07:52"}

That's consistent with java.time formatting, sort of:

scala> java.time.Instant.ofEpochMilli(-61688070376409L)
res0: java.time.Instant = 0015-03-08T16:53:43.591Z

After accounting for the time zone difference, there's a sub-minute discrepancy.
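For what it's worth, the odd -07:52 offset looks like java.time falling back to the zone's local mean time (LMT) for instants that predate standardized time zones, which would also explain a sub-minute remainder. A minimal sketch, assuming the run used America/Los_Angeles, whose tzdata LMT offset is -07:52:58:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class LmtOffsetDemo {
    public static void main(String[] args) {
        Instant ts = Instant.ofEpochMilli(-61688070376409L);
        // For instants before the zone's first recorded transition
        // (pre-1883 for US zones), java.time uses the local mean time offset.
        ZoneOffset offset = ZoneId.of("America/Los_Angeles").getRules().getOffset(ts);
        System.out.println(offset); // -07:52:58
    }
}
```

The seconds component of that LMT offset is exactly the kind of remainder that gets lost when a formatter truncates the offset to minutes, as in the -07:52 above.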

That's also the difference in the Spark test failures:

|2    |0015-03-10 08:53:43.591|
...
|2    |0015-03-10 08:52:45.591|

... but there's a nearly 2 day difference with the formatting from java.time!

Is this what you're getting at with #23391, and am I just catching up?

I suppose we can't merge this until this difference is really fixed; until that happens, the new functionality doesn't quite work for old dates, right?
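If I understand the gap correctly, the near-two-day difference matches the legacy hybrid Julian/Gregorian calendar (java.util.Date and SimpleDateFormat switch to the Julian calendar before the 1582 cutover) versus the proleptic Gregorian calendar that java.time uses for all dates. A sketch pushing the same millis through both APIs:

```java
import java.text.SimpleDateFormat;
import java.time.Instant;
import java.util.Date;
import java.util.TimeZone;

public class CalendarGapDemo {
    public static void main(String[] args) {
        long millis = -61688070376409L;

        // java.time: proleptic Gregorian calendar for all instants
        System.out.println(Instant.ofEpochMilli(millis)); // 0015-03-08T16:53:43.591Z

        // Legacy API: Julian calendar for dates before the 1582 cutover,
        // which lands two days later for year 15 AD
        SimpleDateFormat legacy = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
        legacy.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println(legacy.format(new Date(millis))); // 0015-03-10T16:53:43.591
    }
}
```

So the two-day part of the discrepancy is a calendar-system difference, separate from the LMT offset issue above.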

@MaxGekk (Member) commented Jan 1, 2019

The stringToTime method in DateTimeUtils is used only by the legacy parser. It can be removed too.

@MaxGekk (Member) commented Jan 1, 2019

@srowen (Member, Author) commented Jan 3, 2019

Hm, that's not great, but is correctness more important than speed? Probably so. But if the correctness issue only affects old dates, maybe it's best to leave the old parser in. Should I just close this, and/or add a note about why one might want the old parser?

@MaxGekk (Member) commented Jan 3, 2019

Hm that's not great, but is correctness more important than speed?

It would be nice to re-run those JMH benchmarks and confirm the numbers on a recent JDK 8 build.

@srowen (Member, Author) commented Jan 4, 2019

@MaxGekk I ran it locally with Java 8:

# VM version: JDK 1.8.0_192, Java HotSpot(TM) 64-Bit Server VM, 25.192-b12
...
Benchmark                                Mode  Cnt     Score   Error  Units
DateParsingBenchmark.newFormat_noZone    avgt    5    51.168 ± 0.398  ms/op
DateParsingBenchmark.newFormat_withZone  avgt    5  1375.479 ± 8.168  ms/op
DateParsingBenchmark.oldFormat_noZone    avgt    5    76.494 ± 0.662  ms/op
DateParsingBenchmark.oldFormat_withZone  avgt    5    84.934 ± 1.238  ms/op

and Java 11:

# VM version: JDK 11.0.1, OpenJDK 64-Bit Server VM, 11.0.1+13
...
Benchmark                                Mode  Cnt   Score   Error  Units
DateParsingBenchmark.newFormat_noZone    avgt    5  48.196 ± 0.446  ms/op
DateParsingBenchmark.newFormat_withZone  avgt    5  84.605 ± 1.056  ms/op
DateParsingBenchmark.oldFormat_noZone    avgt    5  84.288 ± 0.911  ms/op
DateParsingBenchmark.oldFormat_withZone  avgt    5  98.591 ± 0.602  ms/op

Still a big difference unfortunately. I'd say let's leave in the old parser for the foreseeable future, at least.

@srowen (Member, Author) commented Jan 4, 2019

Given the potential perf issue here, I do think it's wise to keep the flag as a safety valve for 3.0.

@srowen srowen closed this Jan 4, 2019
@srowen srowen deleted the SPARK-26503 branch January 4, 2019 21:39
@MaxGekk (Member) commented Jan 7, 2019

Still a big difference unfortunately.

@srowen This is for the case when a zone is defined via a zone id, right? I would guess ZoneTextPrinterParser should not be used if the pattern contains a zone offset like XXX, as in the JSON/CSV datasources. For our default pattern with zone offsets, yyyy-MM-dd'T'HH:mm:ss.SSSXXX, there should not be a big difference.
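A minimal sketch of the offset-only case: with XXX in the pattern, parsing resolves a numeric offset and never goes through zone-name lookup (the input value here is illustrative):

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

public class OffsetPatternDemo {
    public static void main(String[] args) {
        // An XXX pattern matches numeric offsets like +01:00,
        // unlike a 'z' pattern, which matches zone names like "CET"
        DateTimeFormatter fmt =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX");
        OffsetDateTime odt =
            OffsetDateTime.parse("2011-12-03T10:15:30.000+01:00", fmt);
        System.out.println(odt.toInstant()); // 2011-12-03T09:15:30Z
    }
}
```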

@srowen (Member, Author) commented Jan 7, 2019

I see; I'm just running the test at https://stackoverflow.com/questions/34374464/extremely-slow-parsing-of-time-zone-with-the-new-java-time-api, which uses time zone names like "CET" parsed with the 'z' pattern. Let me respin the benchmark with "XXX" and a suitable time zone.
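For reference, a rough non-JMH stand-in for that benchmark, comparing the legacy and java.time parsers on a zone-name ('z') pattern; the iteration count and input are arbitrary, zone-name parsing is locale-sensitive, and timing a loop like this is far less rigorous than JMH:

```java
import java.text.SimpleDateFormat;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class ZoneNameParseBench {
    public static void main(String[] args) throws Exception {
        String input = "2011-12-03T10:15:30.000 CET";
        SimpleDateFormat oldFmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS z");
        DateTimeFormatter newFmt =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS z");

        int n = 10_000;
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) oldFmt.parse(input);      // legacy parser
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) ZonedDateTime.parse(input, newFmt); // java.time
        long t2 = System.nanoTime();

        System.out.printf("old: %.1f ms, new: %.1f ms%n",
            (t1 - t0) / 1e6, (t2 - t1) / 1e6);
    }
}
```

The 'z' token is what pulls in ZoneTextPrinterParser on the java.time side, which is the suspected source of the withZone slowdown in the numbers above.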

@srowen (Member, Author) commented Jan 8, 2019

@MaxGekk I instead tried benchmarking old/new format, and yyyy-MM-dd'T'HH:mm:ss.SSS vs yyyy-MM-dd'T'HH:mm:ss.SSSXXX with input like 2011-12-03T10:15:30.000 and 2011-12-03T10:15:30.000+01:00:

Java 8:

DateParsingBenchmark.newFormat_noZone    avgt    5  0.538 ± 0.066  ms/op
DateParsingBenchmark.newFormat_withZone  avgt    5  0.693 ± 0.024  ms/op
DateParsingBenchmark.oldFormat_noZone    avgt    5  0.745 ± 0.024  ms/op
DateParsingBenchmark.oldFormat_withZone  avgt    5  0.760 ± 0.020  ms/op

Java 11:

DateParsingBenchmark.newFormat_noZone    avgt    5  0.579 ± 0.006  ms/op
DateParsingBenchmark.newFormat_withZone  avgt    5  0.690 ± 0.026  ms/op
DateParsingBenchmark.oldFormat_noZone    avgt    5  0.876 ± 0.009  ms/op
DateParsingBenchmark.oldFormat_withZone  avgt    5  0.893 ± 0.036  ms/op

No real difference (times are smaller here because I ran a shorter test).

Is that what you're interested in testing?

If so, are we back to favoring removing the legacy parser? I'm OK with that.

@MaxGekk (Member) commented Jan 8, 2019

Is that what you're interested in testing?

Yes, thank you.

If so are we back to favoring removing the old legacy parser?

I would continue with the PR. @hvanhovell WDYT?

@srowen (Member, Author) commented Jan 9, 2019

@MaxGekk I reopened this as #23495 with a rebase.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

Per discussion in apache#23391 (comment) this proposes to just remove the old pre-Spark-3 time parsing behavior.

This is a rebase of apache#23411

## How was this patch tested?

Existing tests.

Closes apache#23495 from srowen/SPARK-26503.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>