
[SPARK-26178][SQL] Use java.time API for parsing timestamps and dates from CSV #23150

Closed
wants to merge 32 commits

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Nov 26, 2018

What changes were proposed in this pull request?

In the PR, I propose to use the java.time API for parsing timestamps and dates from CSV content with microseconds precision. The SQL config spark.sql.legacy.timeParser.enabled allows switching back to the previous behaviour of using java.text.SimpleDateFormat/FastDateFormat for parsing/generating timestamps/dates.
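As a rough illustration of the precision gain (this is not Spark's actual code; the `MicrosParseDemo` class and `toMicros` helper are made up for this sketch), java.time can carry fractional seconds down to microseconds, whereas `SimpleDateFormat` is limited to milliseconds:

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoField;

public class MicrosParseDemo {
    // Parses a timestamp string and converts it to microseconds since the epoch
    // (the resolution Spark's TimestampType stores internally).
    static long toMicros(String timestamp) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS");
        LocalDateTime ldt = LocalDateTime.parse(timestamp, fmt);
        return ldt.toEpochSecond(ZoneOffset.UTC) * 1_000_000L
                + ldt.get(ChronoField.MICRO_OF_SECOND);
    }

    public static void main(String[] args) {
        // The last three fraction digits would be lost with SimpleDateFormat.
        System.out.println(toMicros("2018-11-26 12:30:45.123456"));
    }
}
```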

How was this patch tested?

It was tested by UnivocityParserSuite, CsvExpressionsSuite, CsvFunctionsSuite and CsvSuite.

@SparkQA

SparkQA commented Nov 27, 2018

Test build #99296 has finished for PR 23150 at commit f287b77.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 27, 2018

Test build #99316 has finished for PR 23150 at commit 52074f7.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Nov 27, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented Nov 27, 2018

Test build #99327 has finished for PR 23150 at commit 52074f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Nov 28, 2018

@srowen @HyukjinKwon @viirya @mgaido91 May I ask you to look at the PR? The changes are related to another PR which you have already reviewed.

@srowen
Member

srowen commented Nov 28, 2018

The code looks good at a glance. So the flag lets people select the old behavior; that keeps it pretty safe. Are there any other behavior changes with the new code, besides being able to parse microseconds?

@MaxGekk
Member Author

MaxGekk commented Nov 28, 2018

Are there any other behavior changes with the new code, besides being able to parse microseconds?

The main one is that the new parser doesn't have the fallback to DateTimeUtils.stringToTime if it cannot parse its input. Though I am not sure about this fallback ... if a user wants to parse input which doesn't fit the specified pattern (like "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"), why should we guess and try another pattern instead of forcing him/her to set an appropriate one? I guess this fallback was added for backward compatibility with previous versions. If so, should we still keep compatibility with such versions?
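The strictness difference can be sketched as follows (the `StrictParseDemo` class and `parses` helper are hypothetical, for illustration only): a `DateTimeFormatter` either matches the given pattern or throws, with no second attempt against another pattern:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class StrictParseDemo {
    // Returns true only if the input matches the pattern exactly;
    // unlike the legacy path, no fallback parsing is attempted.
    static boolean parses(String pattern, String input) {
        try {
            LocalDate.parse(input, DateTimeFormatter.ofPattern(pattern));
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("yyyy-MM-dd", "2018-11-28")); // matches the pattern
        System.out.println(parses("yyyy-MM-dd", "2018/11/28")); // rejected, no guessing
    }
}
```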

@MaxGekk
Member Author

MaxGekk commented Nov 28, 2018

I had to correct the timestamp/date patterns in a few tests to follow ISO 8601 (see Patterns for Formatting and Parsing), like hh -> HH and mm -> MM.

@srowen
Member

srowen commented Nov 28, 2018

Looks like that kind of fallback logic was there for compatibility with Spark 1.x and 2.0; see some comments about 'backwards compatibility' in for example https://github.com/apache/spark/blame/af8250e12490c77f02587275eff9aa225e5dcdba/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala

I think we can remove this fallback and document it as a behavior change.

@MaxGekk
Member Author

MaxGekk commented Nov 28, 2018

Some tests didn't pass with the new changes until I set the time zone explicitly. The tests use the same functions for checking correctness as the code they are supposed to test. I think this needs more investigation; I guess the old (current) implementation doesn't take time zones into account in some cases.

import org.apache.spark.sql.types._

-object CSVInferSchema {
+class CSVInferSchema(val options: CSVOptions) extends Serializable {
Contributor

Since we get the CSVOptions in the constructor, shall we remove it as a parameter of the several methods? It is pretty confusing which one is used right now...

@MaxGekk
Member Author

MaxGekk commented Dec 1, 2018

they pass right? is there another test you were unable to add?

For now everything has passed. I ran all tests locally in different time zones (set via the JVM parameter -Duser.timezone).

I updated the migration guide since I enabled new parser by default.

@@ -1107,7 +1111,7 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te

test("SPARK-18699 put malformed records in a `columnNameOfCorruptRecord` field") {
Seq(false, true).foreach { multiLine =>
-      val schema = new StructType().add("a", IntegerType).add("b", TimestampType)
+      val schema = new StructType().add("a", IntegerType).add("b", DateType)
Member Author

I changed the type because the supposedly valid date "1983-08-04" cannot be parsed with the default timestamp pattern.

@@ -622,10 +623,11 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
val options = Map(
"header" -> "true",
"inferSchema" -> "false",
"dateFormat" -> "dd/MM/yyyy hh:mm")
"dateFormat" -> "dd/MM/yyyy HH:mm")
Member Author

According to ISO 8601:

h       clock-hour-of-am-pm (1-12)  number            12
H       hour-of-day (0-23)          number            0

but the real data is not in the allowed range:

date
26/08/2015 18:00
27/10/2014 18:30
28/01/2016 20:00
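The difference between the two pattern letters can be shown with plain java.time (the `HourPatternDemo` class and `tryParse` helper are made up for this sketch):

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class HourPatternDemo {
    // Parses a time with the given pattern, returning null if the value is rejected.
    static LocalTime tryParse(String pattern, String input) {
        try {
            return LocalTime.parse(input, DateTimeFormatter.ofPattern(pattern));
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // 'H' is hour-of-day (0-23), so 18:00 parses fine.
        System.out.println(tryParse("HH:mm", "18:00"));
        // 'h' is clock-hour-of-am-pm (1-12): 18 is out of range and parsing fails.
        System.out.println(tryParse("hh:mm", "18:00"));
    }
}
```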

"timestampFormat" -> "yyyy-MM-dd HH:mm:ss",
"dateFormat" -> "yyyy-MM-dd"), false, "UTC")
parser = new UnivocityParser(StructType(Seq.empty), timestampsOptions)
val expected = 1420070400 * DateTimeUtils.MICROS_PER_SECOND
Member Author

I set the number of seconds since the epoch in UTC because DateTimeUtils.stringToTime(timestamp).getTime depends on the current local time zone of the JVM.
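The zone dependence is easy to demonstrate with java.time directly (the `EpochZoneDemo` class and `epochSeconds` helper are hypothetical, for illustration): the same local timestamp string maps to different instants depending on the zone it is resolved in, which is why the expected value in the test has to be pinned to UTC:

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class EpochZoneDemo {
    // Resolves a zone-less timestamp string in an explicit zone and
    // returns seconds since the epoch.
    static long epochSeconds(String ts, ZoneId zone) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
        return LocalDateTime.parse(ts, fmt).atZone(zone).toEpochSecond();
    }

    public static void main(String[] args) {
        // Pinned to UTC, the instant is fixed regardless of the JVM's default zone...
        System.out.println(epochSeconds("2015-01-01 00:00:00", ZoneOffset.UTC));
        // ...but in another zone the same string denotes a different instant.
        System.out.println(epochSeconds("2015-01-01 00:00:00", ZoneId.of("America/Los_Angeles")));
    }
}
```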

@SparkQA

SparkQA commented Dec 1, 2018

Test build #99556 has finished for PR 23150 at commit 00509d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 2, 2018

Test build #99575 has finished for PR 23150 at commit f8097b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Dec 2, 2018

@srowen I think this PR is ready.

Contributor

@mgaido91 mgaido91 left a comment

LGTM apart from 2 minor comments

@SparkQA

SparkQA commented Dec 3, 2018

Test build #4449 has finished for PR 23150 at commit 60c0974.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 3, 2018

Test build #99608 has finished for PR 23150 at commit 60c0974.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Dec 4, 2018

Merged to master

@@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide

- Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0.

- Since Spark 3.0, the CSV datasource uses the java.time API for parsing and generating CSV content. The new formatting implementation supports date/timestamp patterns conforming to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`.
Member

@HyukjinKwon HyukjinKwon Dec 17, 2018

@MaxGekk, can you check if this legacy configuration works or not?

I checked it as below:

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
index 2b8d22dde92..08795972fb7 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
@@ -26,6 +26,7 @@ import scala.util.Try

 import org.apache.commons.lang3.time.FastDateFormat

+import org.apache.spark.internal.Logging
 import org.apache.spark.sql.internal.SQLConf

 sealed trait TimestampFormatter {
@@ -112,11 +113,13 @@ class LegacyFallbackTimestampFormatter(
   }
 }

-object TimestampFormatter {
+object TimestampFormatter extends Logging {
   def apply(format: String, timeZone: TimeZone, locale: Locale): TimestampFormatter = {
     if (SQLConf.get.legacyTimeParserEnabled) {
+      logError("LegacyFallbackTimestampFormatter is being used")
       new LegacyFallbackTimestampFormatter(format, timeZone, locale)
     } else {
+      logError("Iso8601TimestampFormatter is being used")
       new Iso8601TimestampFormatter(format, timeZone, locale)
     }
   }
$ ./bin/spark-shell --conf spark.sql.legacy.timeParser.enabled=true
scala> spark.conf.get("spark.sql.legacy.timeParser.enabled")
res0: String = true

scala> Seq("2010|10|10").toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")

scala> spark.read.option("inferSchema", "true").option("header", "false").option("timestampFormat", "yyyy|MM|dd").csv("/tmp/foo").printSchema()
18/12/17 12:11:47 ERROR TimestampFormatter: Iso8601TimestampFormatter is being used
root
 |-- _c0: timestamp (nullable = true)

Member

adding @cloud-fan

Contributor

I'm surprised it doesn't work, as this pattern of using SQLConf appears in many places.

Can you create a ticket for it? Is this only a problem when setting the conf via spark-shell?

Member Author

Definitely the flag switches behavior, since I used it in a test recently: https://github.com/apache/spark/pull/23196/files#diff-fde14032b0e6ef8086461edf79a27c5dR1454

Member

Yeah, I saw the test, but weirdly it doesn't work in the shell. Do you mind if I check it as I did? Something is weird. I want to be very sure whether it's an issue or something I did wrong myself.

Member Author

I see the same, but it is interesting that:

scala> spark.range(1).map(_ => org.apache.spark.sql.internal.SQLConf.get.legacyTimeParserEnabled).show
+-----+
|value|
+-----+
| true|
+-----+

It seems that when an instance of CSVInferSchema is created, SQL configs haven't been set on the executor side yet.

Member

Ah, good that I wasn't doing something stupid alone. Let's file a JIRA.


Member

We removed this conf in the follow-up PRs?

Member

#23495 is the PR that removed the conf.

asfgit pushed a commit that referenced this pull request Dec 24, 2018
…by DateTimeFormatter in comments

## What changes were proposed in this pull request?

The PRs #23150 and #23196 switched JSON and CSV datasources on new formatter for dates/timestamps which is based on `DateTimeFormatter`. In this PR, I replaced `SimpleDateFormat` by `DateTimeFormatter` to reflect the changes.

Closes #23374 from MaxGekk/java-time-docs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
…by DateTimeFormatter in comments

## What changes were proposed in this pull request?

The PRs apache#23150 and apache#23196 switched JSON and CSV datasources on new formatter for dates/timestamps which is based on `DateTimeFormatter`. In this PR, I replaced `SimpleDateFormat` by `DateTimeFormatter` to reflect the changes.

Closes apache#23374 from MaxGekk/java-time-docs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… from CSV

## What changes were proposed in this pull request?

In the PR, I propose to use **java.time API** for parsing timestamps and dates from CSV content with microseconds precision. The SQL config `spark.sql.legacy.timeParser.enabled` allow to switch back to previous behaviour with using `java.text.SimpleDateFormat`/`FastDateFormat` for parsing/generating timestamps/dates.

## How was this patch tested?

It was tested by `UnivocityParserSuite`, `CsvExpressionsSuite`, `CsvFunctionsSuite` and `CsvSuite`.

Closes apache#23150 from MaxGekk/time-parser.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…by DateTimeFormatter in comments

## What changes were proposed in this pull request?

The PRs apache#23150 and apache#23196 switched JSON and CSV datasources on new formatter for dates/timestamps which is based on `DateTimeFormatter`. In this PR, I replaced `SimpleDateFormat` by `DateTimeFormatter` to reflect the changes.

Closes apache#23374 from MaxGekk/java-time-docs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@MaxGekk MaxGekk deleted the time-parser branch August 17, 2019 13:35