[SPARK-26178][SQL] Use java.time API for parsing timestamps and dates from CSV #23150
Conversation
Test build #99296 has finished for PR 23150 at commit
Test build #99316 has finished for PR 23150 at commit
jenkins, retest this, please
Test build #99327 has finished for PR 23150 at commit
@srowen @HyukjinKwon @viirya @mgaido91 May I ask you to look at the PR? The changes are related to another PR which you have already reviewed.
The code looks good at a glance. So the flag lets people select the old behavior; that keeps it pretty safe. Are there any other behavior changes with the new code, besides being able to parse microseconds?
The main one is that the new parser doesn't have the fallback to
I had to correct timestamp/date patterns in a few tests to follow ISO 8601 (see Patterns for Formatting and Parsing), like
Looks like that kind of fallback logic was there for compatibility with Spark 1.x and 2.0; see some comments about 'backwards compatibility' in, for example, https://github.com/apache/spark/blame/af8250e12490c77f02587275eff9aa225e5dcdba/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala I think we can remove this fallback and document it as a behavior change.
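For context on what dropping the fallback means in practice: the legacy `SimpleDateFormat` path is lenient where java.time fails fast. A minimal Java sketch of that difference (an illustration with my own class name, not Spark's actual code):

```java
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Calendar;

public class StrictVsLenient {
    public static void main(String[] args) throws Exception {
        // Legacy SimpleDateFormat is lenient by default: out-of-range fields
        // are rolled over instead of rejected (month 13 -> January of next year).
        SimpleDateFormat legacy = new SimpleDateFormat("yyyy-MM-dd");
        Calendar cal = Calendar.getInstance();
        cal.setTime(legacy.parse("2018-13-01"));
        if (cal.get(Calendar.YEAR) != 2019) throw new AssertionError("expected rollover to 2019");

        // java.time rejects the same malformed input outright.
        boolean rejected = false;
        try {
            LocalDate.parse("2018-13-01", DateTimeFormatter.ofPattern("yyyy-MM-dd"));
        } catch (DateTimeParseException e) {
            rejected = true;
        }
        if (!rejected) throw new AssertionError("expected java.time to reject the input");
        System.out.println("legacy rolled over; java.time rejected");
    }
}
```

This is only the leniency aspect; Spark's removed fallback additionally tried a secondary parse path when the pattern failed.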
Some tests didn't pass with the new changes until I set the time zone explicitly. The tests use the same functions for checking correctness as the code they are supposed to test. I think this needs more investigation; I guess the old (current) implementation doesn't take time zones into account in some cases.
     import org.apache.spark.sql.types._

    -object CSVInferSchema {
    +class CSVInferSchema(val options: CSVOptions) extends Serializable {
Since we get the CSVOptions in the constructor, shall we remove it as a parameter of the several methods? It is pretty confusing which one is used right now...
For now all tests pass. I ran all tests locally in different time zones (set via a JVM parameter). I updated the migration guide since I enabled the new parser by default.
    @@ -1107,7 +1111,7 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
      test("SPARK-18699 put malformed records in a `columnNameOfCorruptRecord` field") {
        Seq(false, true).foreach { multiLine =>
    -     val schema = new StructType().add("a", IntegerType).add("b", TimestampType)
    +     val schema = new StructType().add("a", IntegerType).add("b", DateType)
I changed the type because the supposedly valid date "1983-08-04" cannot be parsed with the default timestamp pattern.
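A hedged Java illustration of that comment (class name and the timestamp pattern are my own stand-ins, not Spark's exact defaults): a bare date parses against a date pattern but is rejected by a pattern that also demands a time part.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class DateNotTimestamp {
    public static void main(String[] args) {
        String v = "1983-08-04";
        // Parses cleanly as a date...
        LocalDate d = LocalDate.parse(v, DateTimeFormatter.ofPattern("yyyy-MM-dd"));
        if (!d.equals(LocalDate.of(1983, 8, 4))) throw new AssertionError(d.toString());

        // ...but fails against a timestamp-style pattern that requires a time part.
        boolean rejected = false;
        try {
            LocalDateTime.parse(v, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss"));
        } catch (DateTimeParseException e) {
            rejected = true;
        }
        if (!rejected) throw new AssertionError("expected parse failure");
        System.out.println("ok");
    }
}
```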
    @@ -622,10 +623,11 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
        val options = Map(
          "header" -> "true",
          "inferSchema" -> "false",
    -     "dateFormat" -> "dd/MM/yyyy hh:mm")
    +     "dateFormat" -> "dd/MM/yyyy HH:mm")
According to ISO 8601 (pattern letters):

    h   clock-hour-of-am-pm (1-12)   number   12
    H   hour-of-day (0-23)           number   0

but the real data is not in the allowed range:

    date
    26/08/2015 18:00
    27/10/2014 18:30
    28/01/2016 20:00
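The pattern letters quoted above are java.time `DateTimeFormatter` letters. A small Java check (class name is mine) of why a value like 18:30 parses with `HH` but not with `hh`:

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class HourPatterns {
    public static void main(String[] args) {
        String value = "18:30";

        // 'H' is hour-of-day (0-23): 18 is in range, so parsing succeeds.
        LocalTime t = LocalTime.parse(value, DateTimeFormatter.ofPattern("HH:mm"));
        if (!t.equals(LocalTime.of(18, 30))) throw new AssertionError(t.toString());

        // 'h' is clock-hour-of-am-pm (1-12): 18 is out of range
        // (and there is no am/pm marker), so java.time rejects it.
        boolean rejected = false;
        try {
            LocalTime.parse(value, DateTimeFormatter.ofPattern("hh:mm"));
        } catch (DateTimeParseException e) {
            rejected = true;
        }
        if (!rejected) throw new AssertionError("expected parse failure for hh:mm");
        System.out.println("ok");
    }
}
```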
"timestampFormat" -> "yyyy-MM-dd HH:mm:ss", | ||
"dateFormat" -> "yyyy-MM-dd"), false, "UTC") | ||
parser = new UnivocityParser(StructType(Seq.empty), timestampsOptions) | ||
val expected = 1420070400 * DateTimeUtils.MICROS_PER_SECOND |
I set the number of seconds since the epoch in UTC because DateTimeUtils.stringToTime(timestamp).getTime depends on the current local time zone in the JVM.
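The same idea can be sketched with plain java.time (an illustration, not the test's actual code): pinning the zone explicitly makes the epoch value independent of the JVM default time zone, and 2015-01-01 00:00:00 in UTC corresponds to exactly 1420070400 seconds, the constant used in the test above.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class EpochUtc {
    public static void main(String[] args) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
        // Attach an explicit UTC zone; the result no longer depends
        // on the JVM's default time zone.
        long seconds = LocalDateTime.parse("2015-01-01 00:00:00", fmt)
            .atZone(ZoneOffset.UTC)
            .toEpochSecond();
        if (seconds != 1420070400L) throw new AssertionError(seconds);
        // Spark stores timestamps internally as microseconds since the epoch.
        long micros = seconds * 1_000_000L;
        System.out.println(micros);
    }
}
```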
Test build #99556 has finished for PR 23150 at commit
Test build #99575 has finished for PR 23150 at commit
@srowen I think this PR is ready.
LGTM apart from 2 minor comments (both on sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala, since resolved).
Test build #4449 has finished for PR 23150 at commit
Test build #99608 has finished for PR 23150 at commit
Merged to master
    @@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide

      - Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0.

    + - Since Spark 3.0, CSV datasource uses java.time API for parsing and generating CSV content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`.
@MaxGekk, can you check if this legacy configuration works or not? I checked it as below:

    diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
    index 2b8d22dde92..08795972fb7 100644
    --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
    +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
    @@ -26,6 +26,7 @@ import scala.util.Try
     import org.apache.commons.lang3.time.FastDateFormat
    +import org.apache.spark.internal.Logging
     import org.apache.spark.sql.internal.SQLConf
     sealed trait TimestampFormatter {
    @@ -112,11 +113,13 @@ class LegacyFallbackTimestampFormatter(
       }
     }
    -object TimestampFormatter {
    +object TimestampFormatter extends Logging {
       def apply(format: String, timeZone: TimeZone, locale: Locale): TimestampFormatter = {
         if (SQLConf.get.legacyTimeParserEnabled) {
    +      logError("LegacyFallbackTimestampFormatter is being used")
           new LegacyFallbackTimestampFormatter(format, timeZone, locale)
         } else {
    +      logError("Iso8601TimestampFormatter is being used")
           new Iso8601TimestampFormatter(format, timeZone, locale)
         }
       }

    $ ./bin/spark-shell --conf spark.sql.legacy.timeParser.enabled=true

    scala> spark.conf.get("spark.sql.legacy.timeParser.enabled")
    res0: String = true

    scala> Seq("2010|10|10").toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")

    scala> spark.read.option("inferSchema", "true").option("header", "false").option("timestampFormat", "yyyy|MM|dd").csv("/tmp/foo").printSchema()
    18/12/17 12:11:47 ERROR TimestampFormatter: Iso8601TimestampFormatter is being used
    root
     |-- _c0: timestamp (nullable = true)
adding @cloud-fan
I'm surprised it doesn't work, as this pattern of using SQLConf appears in many places.
Can you create a ticket for it? Is this only a problem when setting conf via spark shell?
The flag definitely switches behavior, since I used it in a test recently: https://github.com/apache/spark/pull/23196/files#diff-fde14032b0e6ef8086461edf79a27c5dR1454
Yea, I saw the test, but weirdly it doesn't work in the shell. Do you mind checking it as I did? Something is weird. I want to be very sure whether it's an issue or something I did wrong myself.
I see the same, but it is interesting that:

    scala> spark.range(1).map(_ => org.apache.spark.sql.internal.SQLConf.get.legacyTimeParserEnabled).show
    +-----+
    |value|
    +-----+
    | true|
    +-----+

It seems that when an instance of CSVInferSchema is created, SQL configs haven't been set on the executor side yet.
Ah, good that I wasn't doing something stupid alone. Let's file a JIRA.
Here is the ticket: https://issues.apache.org/jira/browse/SPARK-26384
We removed this conf in the follow-up PRs?
#23495 is the PR that removed the conf.
…by DateTimeFormatter in comments

## What changes were proposed in this pull request?

The PRs #23150 and #23196 switched the JSON and CSV datasources to the new formatter for dates/timestamps, which is based on `DateTimeFormatter`. In this PR, I replaced `SimpleDateFormat` by `DateTimeFormatter` to reflect the changes.

Closes #23374 from MaxGekk/java-time-docs.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… from CSV

## What changes were proposed in this pull request?

In the PR, I propose to use the **java.time API** for parsing timestamps and dates from CSV content with microseconds precision. The SQL config `spark.sql.legacy.timeParser.enabled` allows switching back to the previous behaviour of using `java.text.SimpleDateFormat`/`FastDateFormat` for parsing/generating timestamps/dates.

## How was this patch tested?

It was tested by `UnivocityParserSuite`, `CsvExpressionsSuite`, `CsvFunctionsSuite` and `CsvSuite`.

Closes apache#23150 from MaxGekk/time-parser.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
What changes were proposed in this pull request?

In the PR, I propose to use the java.time API for parsing timestamps and dates from CSV content with microseconds precision. The SQL config `spark.sql.legacy.timeParser.enabled` allows switching back to the previous behaviour of using `java.text.SimpleDateFormat`/`FastDateFormat` for parsing/generating timestamps/dates.

How was this patch tested?

It was tested by `UnivocityParserSuite`, `CsvExpressionsSuite`, `CsvFunctionsSuite` and `CsvSuite`.
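As a closing illustration of the microsecond precision this PR enables (a hedged Java sketch with my own class name, not Spark code): `DateTimeFormatter` can capture six fractional-second digits, which the millisecond-only `SimpleDateFormat` could not represent.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class MicrosParse {
    public static void main(String[] args) {
        // 'SSSSSS' captures six fractional digits; SimpleDateFormat's 'S'
        // only models milliseconds, so this precision was previously lost.
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS");
        LocalDateTime ts = LocalDateTime.parse("2018-12-02 11:22:33.123456", fmt);
        if (ts.getNano() != 123_456_000) throw new AssertionError(ts.getNano());

        // Convert to microseconds since the epoch (UTC), Spark's internal unit.
        long micros = ts.toInstant(ZoneOffset.UTC).getEpochSecond() * 1_000_000L
            + ts.getNano() / 1_000L;
        System.out.println(micros);
    }
}
```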