
[SPARK-26178][SQL] Use java.time API for parsing timestamps and dates from CSV #23150

Closed
wants to merge 32 commits

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Nov 26, 2018

What changes were proposed in this pull request?

In the PR, I propose to use the java.time API for parsing timestamps and dates from CSV content with microseconds precision. The SQL config spark.sql.legacy.timeParser.enabled allows switching back to the previous behaviour of using java.text.SimpleDateFormat/FastDateFormat for parsing/generating timestamps/dates.
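As a rough illustration of the precision gain (this is not Spark's actual code; the `MicrosParseDemo` class and `toMicros` helper are made up for this sketch), java.time can carry fractional seconds down to microseconds, whereas `SimpleDateFormat` is limited to milliseconds:

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoField;

public class MicrosParseDemo {
    // Parses a timestamp string and converts it to microseconds since the epoch
    // (the resolution Spark's TimestampType stores internally).
    static long toMicros(String timestamp) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS");
        LocalDateTime ldt = LocalDateTime.parse(timestamp, fmt);
        return ldt.toEpochSecond(ZoneOffset.UTC) * 1_000_000L
                + ldt.get(ChronoField.MICRO_OF_SECOND);
    }

    public static void main(String[] args) {
        // The last three fraction digits would be lost with SimpleDateFormat.
        System.out.println(toMicros("2018-11-26 12:30:45.123456"));
    }
}
```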

How was this patch tested?

It was tested by UnivocityParserSuite, CsvExpressionsSuite, CsvFunctionsSuite and CsvSuite.

@SparkQA

SparkQA commented Nov 27, 2018

Test build #99296 has finished for PR 23150 at commit f287b77.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 27, 2018

Test build #99316 has finished for PR 23150 at commit 52074f7.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Nov 27, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented Nov 27, 2018

Test build #99327 has finished for PR 23150 at commit 52074f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Nov 28, 2018

@srowen @HyukjinKwon @viirya @mgaido91 May I ask you to look at the PR? The changes are related to another PR which you have already reviewed.

@srowen
Member

srowen commented Nov 28, 2018

The code looks good at a glance. So the flag lets people select the old behavior; that keeps it pretty safe. Are there any other behavior changes with the new code, besides being able to parse microseconds?

@MaxGekk
Member Author

MaxGekk commented Nov 28, 2018

Are there any other behavior changes with the new code, besides being able to parse microseconds?

The main one is that the new parser doesn't have the fallback to DateTimeUtils.stringToTime if it cannot parse its input. Though I am not sure about this fallback ... if a user wants to parse input which doesn't fit the specified pattern (like "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"), why should we guess and try another pattern instead of forcing him/her to set an appropriate one? I guess this fallback was added for backward compatibility with previous versions. If so, should we still keep compatibility with such versions?
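The strictness difference can be sketched as follows (the `StrictParseDemo` class and `parses` helper are hypothetical, for illustration only): a `DateTimeFormatter` either matches the given pattern or throws, with no second attempt against another pattern:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class StrictParseDemo {
    // Returns true only if the input matches the pattern exactly;
    // unlike the legacy path, no fallback parsing is attempted.
    static boolean parses(String pattern, String input) {
        try {
            LocalDate.parse(input, DateTimeFormatter.ofPattern(pattern));
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("yyyy-MM-dd", "2018-11-28")); // matches the pattern
        System.out.println(parses("yyyy-MM-dd", "2018/11/28")); // rejected, no guessing
    }
}
```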

@MaxGekk
Member Author

MaxGekk commented Nov 28, 2018

I had to correct the timestamp/date patterns in a few tests to follow ISO 8601 (see Patterns for Formatting and Parsing), like hh -> HH and mm -> MM.

@srowen
Member

srowen commented Nov 28, 2018

Looks like that kind of fallback logic was there for compatibility with Spark 1.x and 2.0; see some comments about 'backwards compatibility' in for example https://github.com/apache/spark/blame/af8250e12490c77f02587275eff9aa225e5dcdba/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala

I think we can remove this fallback and document it as a behavior change.

@MaxGekk
Member Author

MaxGekk commented Nov 28, 2018

Some tests didn't pass with the new changes until I set the time zone explicitly. The tests use the same functions for checking correctness as the code they are supposed to test. I think this needs more investigation; I guess the old (current) implementation doesn't take time zones into account in some cases.

import org.apache.spark.sql.types._

-object CSVInferSchema {
+class CSVInferSchema(val options: CSVOptions) extends Serializable {
Contributor

Since we get the CSVOptions in the constructor, shall we remove it as a parameter of the several methods? It is pretty confusing which one is used right now...

@MaxGekk
Member Author

MaxGekk commented Dec 1, 2018

they pass right? is there another test you were unable to add?

For now everything has passed. I ran all tests locally in different time zones (set via the JVM parameter -Duser.timezone).

I updated the migration guide since I enabled new parser by default.

@@ -1107,7 +1111,7 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te

test("SPARK-18699 put malformed records in a `columnNameOfCorruptRecord` field") {
Seq(false, true).foreach { multiLine =>
-      val schema = new StructType().add("a", IntegerType).add("b", TimestampType)
+      val schema = new StructType().add("a", IntegerType).add("b", DateType)
Member Author

I changed the type because the supposedly valid date "1983-08-04" cannot be parsed with the default timestamp pattern.

@@ -622,10 +623,11 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
val options = Map(
"header" -> "true",
"inferSchema" -> "false",
"dateFormat" -> "dd/MM/yyyy hh:mm")
"dateFormat" -> "dd/MM/yyyy HH:mm")
Member Author

According to ISO 8601:

h       clock-hour-of-am-pm (1-12)  number            12
H       hour-of-day (0-23)          number            0

but the real data is not in the allowed range:

date
26/08/2015 18:00
27/10/2014 18:30
28/01/2016 20:00
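The difference between the two pattern letters can be shown with plain java.time (the `HourPatternDemo` class and `tryParse` helper are made up for this sketch):

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class HourPatternDemo {
    // Parses a time with the given pattern, returning null if the value is rejected.
    static LocalTime tryParse(String pattern, String input) {
        try {
            return LocalTime.parse(input, DateTimeFormatter.ofPattern(pattern));
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // 'H' is hour-of-day (0-23), so 18:00 parses fine.
        System.out.println(tryParse("HH:mm", "18:00"));
        // 'h' is clock-hour-of-am-pm (1-12): 18 is out of range and parsing fails.
        System.out.println(tryParse("hh:mm", "18:00"));
    }
}
```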

"timestampFormat" -> "yyyy-MM-dd HH:mm:ss",
"dateFormat" -> "yyyy-MM-dd"), false, "UTC")
parser = new UnivocityParser(StructType(Seq.empty), timestampsOptions)
val expected = 1420070400 * DateTimeUtils.MICROS_PER_SECOND
Member Author

I set the number of seconds since the epoch in UTC because DateTimeUtils.stringToTime(timestamp).getTime depends on the current local time zone of the JVM.
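The zone dependence is easy to demonstrate with java.time directly (the `EpochZoneDemo` class and `epochSeconds` helper are hypothetical, for illustration): the same local timestamp string maps to different instants depending on the zone it is resolved in, which is why the expected value in the test has to be pinned to UTC:

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class EpochZoneDemo {
    // Resolves a zone-less timestamp string in an explicit zone and
    // returns seconds since the epoch.
    static long epochSeconds(String ts, ZoneId zone) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
        return LocalDateTime.parse(ts, fmt).atZone(zone).toEpochSecond();
    }

    public static void main(String[] args) {
        // Pinned to UTC, the instant is fixed regardless of the JVM's default zone...
        System.out.println(epochSeconds("2015-01-01 00:00:00", ZoneOffset.UTC));
        // ...but in another zone the same string denotes a different instant.
        System.out.println(epochSeconds("2015-01-01 00:00:00", ZoneId.of("America/Los_Angeles")));
    }
}
```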

@SparkQA

SparkQA commented Dec 1, 2018

Test build #99556 has finished for PR 23150 at commit 00509d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 2, 2018

Test build #99575 has finished for PR 23150 at commit f8097b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Dec 2, 2018

@srowen I think this PR is ready.

Contributor

@mgaido91 mgaido91 left a comment

LGTM apart from 2 minor comments

@SparkQA

SparkQA commented Dec 3, 2018

Test build #4449 has finished for PR 23150 at commit 60c0974.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 3, 2018

Test build #99608 has finished for PR 23150 at commit 60c0974.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Dec 4, 2018

Merged to master

@@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide

- Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0.

- Since Spark 3.0, the CSV datasource uses the java.time API for parsing and generating CSV content. The new formatting implementation supports date/timestamp patterns conforming to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`.
Member

@HyukjinKwon HyukjinKwon Dec 17, 2018

@MaxGekk, can you check if this legacy configuration works or not?

I checked it as below:

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
index 2b8d22dde92..08795972fb7 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
@@ -26,6 +26,7 @@ import scala.util.Try

 import org.apache.commons.lang3.time.FastDateFormat

+import org.apache.spark.internal.Logging
 import org.apache.spark.sql.internal.SQLConf

 sealed trait TimestampFormatter {
@@ -112,11 +113,13 @@ class LegacyFallbackTimestampFormatter(
   }
 }

-object TimestampFormatter {
+object TimestampFormatter extends Logging {
   def apply(format: String, timeZone: TimeZone, locale: Locale): TimestampFormatter = {
     if (SQLConf.get.legacyTimeParserEnabled) {
+      logError("LegacyFallbackTimestampFormatter is being used")
       new LegacyFallbackTimestampFormatter(format, timeZone, locale)
     } else {
+      logError("Iso8601TimestampFormatter is being used")
       new Iso8601TimestampFormatter(format, timeZone, locale)
     }
   }
$ ./bin/spark-shell --conf spark.sql.legacy.timeParser.enabled=true
scala> spark.conf.get("spark.sql.legacy.timeParser.enabled")
res0: String = true

scala> Seq("2010|10|10").toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")

scala> spark.read.option("inferSchema", "true").option("header", "false").option("timestampFormat", "yyyy|MM|dd").csv("/tmp/foo").printSchema()
18/12/17 12:11:47 ERROR TimestampFormatter: Iso8601TimestampFormatter is being used
root
 |-- _c0: timestamp (nullable = true)

Member

adding @cloud-fan

Contributor

I'm surprised it doesn't work, as this pattern of using SQLConf appears in many places.

Can you create a ticket for it? Is this only a problem when setting the conf via spark-shell?

Member Author

Definitely the flag switches behavior, since I used it in a test recently: https://github.com/apache/spark/pull/23196/files#diff-fde14032b0e6ef8086461edf79a27c5dR1454

Member

Yeah, I saw the test, but weirdly it doesn't work in the shell. Do you mind if I check it as I did? Something is weird. I want to be very sure whether it's an issue or something I did wrong myself.

Member Author

I see the same, but it is interesting that:

scala> spark.range(1).map(_ => org.apache.spark.sql.internal.SQLConf.get.legacyTimeParserEnabled).show
+-----+
|value|
+-----+
| true|
+-----+

It seems that when an instance of CSVInferSchema is created, SQL configs haven't been set on the executor side yet.

Member

Ah, good that I wasn't doing something stupid alone. Let's file a JIRA.


Member

We removed this conf in the follow-up PRs?

Member

#23495 is the PR that removed the conf.

asfgit pushed a commit that referenced this pull request Dec 24, 2018
…by DateTimeFormatter in comments

## What changes were proposed in this pull request?

The PRs #23150 and #23196 switched JSON and CSV datasources on new formatter for dates/timestamps which is based on `DateTimeFormatter`. In this PR, I replaced `SimpleDateFormat` by `DateTimeFormatter` to reflect the changes.

Closes #23374 from MaxGekk/java-time-docs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
…by DateTimeFormatter in comments

## What changes were proposed in this pull request?

The PRs apache#23150 and apache#23196 switched JSON and CSV datasources on new formatter for dates/timestamps which is based on `DateTimeFormatter`. In this PR, I replaced `SimpleDateFormat` by `DateTimeFormatter` to reflect the changes.

Closes apache#23374 from MaxGekk/java-time-docs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… from CSV

## What changes were proposed in this pull request?

In the PR, I propose to use **java.time API** for parsing timestamps and dates from CSV content with microseconds precision. The SQL config `spark.sql.legacy.timeParser.enabled` allow to switch back to previous behaviour with using `java.text.SimpleDateFormat`/`FastDateFormat` for parsing/generating timestamps/dates.

## How was this patch tested?

It was tested by `UnivocityParserSuite`, `CsvExpressionsSuite`, `CsvFunctionsSuite` and `CsvSuite`.

Closes apache#23150 from MaxGekk/time-parser.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…by DateTimeFormatter in comments

## What changes were proposed in this pull request?

The PRs apache#23150 and apache#23196 switched JSON and CSV datasources on new formatter for dates/timestamps which is based on `DateTimeFormatter`. In this PR, I replaced `SimpleDateFormat` by `DateTimeFormatter` to reflect the changes.

Closes apache#23374 from MaxGekk/java-time-docs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@MaxGekk MaxGekk deleted the time-parser branch August 17, 2019 13:35