
[SPARK-26503][CORE] Get rid of spark.sql.legacy.timeParser.enabled #23411

Closed

Conversation

@srowen (Member) commented Dec 30, 2018

What changes were proposed in this pull request?

Per discussion in #23391 (comment) this proposes to just remove the old pre-Spark-3 time parsing behavior.

How was this patch tested?

Existing tests.

@HyukjinKwon (Member)

The time parser change is kind of a breaking change, but I'm fine with removing it considering that we are going for 3.0. The legacy parser makes the code difficult to maintain and blocks some features like date type inference.

@SparkQA commented Dec 30, 2018

Test build #100545 has finished for PR 23411 at commit c9a94f4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member) commented Dec 30, 2018

For the failed test, there is a ticket: https://issues.apache.org/jira/browse/SPARK-26374 . I think we need to change the test to use the Java 8 time API for checking results.
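As I understand the suggestion, checking results via the Java 8 time API would mean comparing parsed values as `java.time.Instant` instead of strings produced by the legacy formatter. A minimal sketch; the pattern and input values here are illustrative, not the actual test's:

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;

public class Java8ResultCheck {
    public static void main(String[] args) {
        // Parse with the java.time API and compare Instants directly,
        // rather than comparing strings formatted by SimpleDateFormat.
        DateTimeFormatter fmt =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX");
        Instant actual = Instant.from(fmt.parse("2011-12-03T10:15:30.000+01:00"));
        Instant expected = Instant.parse("2011-12-03T09:15:30Z"); // same moment in UTC
        if (!actual.equals(expected)) {
            throw new AssertionError("parsed instant mismatch: " + actual);
        }
        System.out.println(actual);
    }
}
```

Comparing `Instant` values sidesteps formatting differences between the two APIs entirely.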

@srowen (Member, Author) commented Dec 30, 2018

@HyukjinKwon yeah, this was suggested by @hvanhovell ... it's unclear whether we need a safety-valve flag, given this is a major release, and I don't know whether people would have relied on the previous behavior.

@MaxGekk yeah, I saw that; part of my concern is that the test isn't 'compatible' with the new change yet, which needs to be addressed.

@srowen (Member, Author) commented Dec 30, 2018

Hm, this is a tough one, @MaxGekk. For the test case that failed most recently, the randomly generated timestamp that causes the problem is -61688070376409L. I believe it's Jackson that ultimately serializes the timestamp to a formatted date, and it produces {"index":2,"col":"0015-03-08T09:00:45.591-07:52"}

That's consistent with java.time formatting, sort of:

scala> java.time.Instant.ofEpochMilli(-61688070376409L)
res0: java.time.Instant = 0015-03-08T16:53:43.591Z

After accounting for the time zone difference, there's a sub-minute discrepancy.
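For what it's worth, the odd -07:52 offset looks like java.time falling back to the zone's local mean time (LMT) for instants that predate standardized time zones, which would also explain a sub-minute remainder. A minimal sketch, assuming the run used America/Los_Angeles, whose tzdata LMT offset is -07:52:58:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class LmtOffsetDemo {
    public static void main(String[] args) {
        Instant ts = Instant.ofEpochMilli(-61688070376409L);
        // For instants before the zone's first recorded transition
        // (pre-1883 for US zones), java.time uses the local mean time offset.
        ZoneOffset offset = ZoneId.of("America/Los_Angeles").getRules().getOffset(ts);
        System.out.println(offset); // -07:52:58
    }
}
```

The seconds component of that LMT offset is exactly the kind of remainder that gets lost when a formatter truncates the offset to minutes, as in the -07:52 above.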

That's also the difference in the Spark test failures:

|2    |0015-03-10 08:53:43.591|
...
|2    |0015-03-10 08:52:45.591|

... but there's a nearly 2 day difference with the formatting from java.time!

Is this what you're getting at with #23391, and am I just catching up?

I suppose we can't merge this until this difference is really fixed; until that happens, the new functionality doesn't quite work for old dates, right?
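If I understand the gap correctly, the near-two-day difference matches the legacy hybrid Julian/Gregorian calendar (java.util.Date and SimpleDateFormat switch to the Julian calendar before the 1582 cutover) versus the proleptic Gregorian calendar that java.time uses for all dates. A sketch pushing the same millis through both APIs:

```java
import java.text.SimpleDateFormat;
import java.time.Instant;
import java.util.Date;
import java.util.TimeZone;

public class CalendarGapDemo {
    public static void main(String[] args) {
        long millis = -61688070376409L;

        // java.time: proleptic Gregorian calendar for all instants
        System.out.println(Instant.ofEpochMilli(millis)); // 0015-03-08T16:53:43.591Z

        // Legacy API: Julian calendar for dates before the 1582 cutover,
        // which lands two days later for year 15 AD
        SimpleDateFormat legacy = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
        legacy.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println(legacy.format(new Date(millis))); // 0015-03-10T16:53:43.591
    }
}
```

So the two-day part of the discrepancy is a calendar-system difference, separate from the LMT offset issue above.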

@MaxGekk (Member) commented Jan 1, 2019

The stringToTime method in DateTimeUtils is used only by the legacy parser. It can be removed too.

@MaxGekk (Member) commented Jan 1, 2019

@srowen (Member, Author) commented Jan 3, 2019

Hm, that's not great, but is correctness more important than speed? Probably so. But if the correctness issue only affects old dates, maybe it's best to leave the old parser in. Should I just close this, and/or add a note about why one might want the old parser?

@MaxGekk (Member) commented Jan 3, 2019

Hm that's not great, but is correctness more important than speed?

It would be nice to re-run those JMH benchmarks and confirm the numbers on a recent JDK 8 build.

@srowen (Member, Author) commented Jan 4, 2019

@MaxGekk I ran it locally with Java 8:

# VM version: JDK 1.8.0_192, Java HotSpot(TM) 64-Bit Server VM, 25.192-b12
...
Benchmark                                Mode  Cnt     Score   Error  Units
DateParsingBenchmark.newFormat_noZone    avgt    5    51.168 ± 0.398  ms/op
DateParsingBenchmark.newFormat_withZone  avgt    5  1375.479 ± 8.168  ms/op
DateParsingBenchmark.oldFormat_noZone    avgt    5    76.494 ± 0.662  ms/op
DateParsingBenchmark.oldFormat_withZone  avgt    5    84.934 ± 1.238  ms/op

and Java 11:

# VM version: JDK 11.0.1, OpenJDK 64-Bit Server VM, 11.0.1+13
...
Benchmark                                Mode  Cnt   Score   Error  Units
DateParsingBenchmark.newFormat_noZone    avgt    5  48.196 ± 0.446  ms/op
DateParsingBenchmark.newFormat_withZone  avgt    5  84.605 ± 1.056  ms/op
DateParsingBenchmark.oldFormat_noZone    avgt    5  84.288 ± 0.911  ms/op
DateParsingBenchmark.oldFormat_withZone  avgt    5  98.591 ± 0.602  ms/op

Still a big difference unfortunately. I'd say let's leave in the old parser for the foreseeable future, at least.

@srowen (Member, Author) commented Jan 4, 2019

Given the potential perf issue here, I do think it's wise to keep the flag as a safety valve for 3.0.

@srowen srowen closed this Jan 4, 2019
@srowen srowen deleted the SPARK-26503 branch January 4, 2019 21:39
@MaxGekk (Member) commented Jan 7, 2019

Still a big difference unfortunately.

@srowen This is for the case when a zone is defined via a zone id, right? I would guess ZoneTextPrinterParser should not be used if the pattern contains a zone offset like XXX, as in the JSON/CSV datasources. For our default pattern with zone offsets, yyyy-MM-dd'T'HH:mm:ss.SSSXXX, there should not be a big difference.
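A minimal sketch of the offset-only case: with XXX in the pattern, parsing resolves a numeric offset and never goes through zone-name lookup (the input value here is illustrative):

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

public class OffsetPatternDemo {
    public static void main(String[] args) {
        // An XXX pattern matches numeric offsets like +01:00,
        // unlike a 'z' pattern, which matches zone names like "CET"
        DateTimeFormatter fmt =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX");
        OffsetDateTime odt =
            OffsetDateTime.parse("2011-12-03T10:15:30.000+01:00", fmt);
        System.out.println(odt.toInstant()); // 2011-12-03T09:15:30Z
    }
}
```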

@srowen (Member, Author) commented Jan 7, 2019

I see; I'm just running the test at https://stackoverflow.com/questions/34374464/extremely-slow-parsing-of-time-zone-with-the-new-java-time-api, which uses time zone names like "CET" parsed with the 'z' pattern. Let me respin the benchmark with "XXX" and a suitable time zone.
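For reference, a rough non-JMH stand-in for that benchmark, comparing the legacy and java.time parsers on a zone-name ('z') pattern; the iteration count and input are arbitrary, zone-name parsing is locale-sensitive, and timing a loop like this is far less rigorous than JMH:

```java
import java.text.SimpleDateFormat;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class ZoneNameParseBench {
    public static void main(String[] args) throws Exception {
        String input = "2011-12-03T10:15:30.000 CET";
        SimpleDateFormat oldFmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS z");
        DateTimeFormatter newFmt =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS z");

        int n = 10_000;
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) oldFmt.parse(input);      // legacy parser
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) ZonedDateTime.parse(input, newFmt); // java.time
        long t2 = System.nanoTime();

        System.out.printf("old: %.1f ms, new: %.1f ms%n",
            (t1 - t0) / 1e6, (t2 - t1) / 1e6);
    }
}
```

The 'z' token is what pulls in ZoneTextPrinterParser on the java.time side, which is the suspected source of the withZone slowdown in the numbers above.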

@srowen (Member, Author) commented Jan 8, 2019

@MaxGekk I instead tried benchmarking old/new format, and yyyy-MM-dd'T'HH:mm:ss.SSS vs yyyy-MM-dd'T'HH:mm:ss.SSSXXX with input like 2011-12-03T10:15:30.000 and 2011-12-03T10:15:30.000+01:00:

Java 8:

DateParsingBenchmark.newFormat_noZone    avgt    5  0.538 ± 0.066  ms/op
DateParsingBenchmark.newFormat_withZone  avgt    5  0.693 ± 0.024  ms/op
DateParsingBenchmark.oldFormat_noZone    avgt    5  0.745 ± 0.024  ms/op
DateParsingBenchmark.oldFormat_withZone  avgt    5  0.760 ± 0.020  ms/op

Java 11:

DateParsingBenchmark.newFormat_noZone    avgt    5  0.579 ± 0.006  ms/op
DateParsingBenchmark.newFormat_withZone  avgt    5  0.690 ± 0.026  ms/op
DateParsingBenchmark.oldFormat_noZone    avgt    5  0.876 ± 0.009  ms/op
DateParsingBenchmark.oldFormat_withZone  avgt    5  0.893 ± 0.036  ms/op

No real difference (times are smaller here because I ran a shorter test).

Is that what you're interested in testing?

If so, are we back to favoring removing the legacy parser? I'm OK with that.

@MaxGekk (Member) commented Jan 8, 2019

Is that what you're interested in testing?

Yes, thank you.

If so are we back to favoring removing the old legacy parser?

I would continue with the PR. @hvanhovell WDYT?

@srowen (Member, Author) commented Jan 9, 2019

@MaxGekk I reopened this as #23495 with a rebase.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

Per discussion in apache#23391 (comment) this proposes to just remove the old pre-Spark-3 time parsing behavior.

This is a rebase of apache#23411

## How was this patch tested?

Existing tests.

Closes apache#23495 from srowen/SPARK-26503.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>