[SPARK-37326][SQL] Support TimestampNTZ in CSV data source #34596
Conversation
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #145220 has finished for PR 34596 at commit
Let me fix the tests first.
@gengliangwang @MaxGekk @cloud-fan could you review this PR? Thanks.
Test build #145262 has finished for PR 34596 at commit
I would like to discuss how to handle the following case:
This case is ambiguous: the default legacy timestamp pattern requires a timezone but the value does not have one. However, the code works just fine when the pattern does not have a timezone; in that case we would write the value successfully. To resolve the ambiguity, I identified 2 options:
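The ambiguity comes from how java.time treats zone sections in a pattern. A minimal Java sketch of the underlying behaviour (the value and patterns here are illustrative, not taken from the PR):

```java
import java.time.DateTimeException;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class ZonePatternDemo {
    public static void main(String[] args) {
        // A timestamp value with no timezone component.
        LocalDateTime ntz = LocalDateTime.of(2021, 11, 15, 12, 30, 0);

        // A mandatory offset section (XXX) cannot be formatted from a value
        // that carries no offset: formatting fails with a DateTimeException.
        DateTimeFormatter mandatory = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ssXXX");
        try {
            mandatory.format(ntz);
        } catch (DateTimeException e) {
            System.out.println("cannot format: no offset in the value");
        }

        // An optional section [XXX] is skipped when the offset field is
        // unsupported by the value, so the same value formats fine.
        DateTimeFormatter optional = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss[XXX]");
        System.out.println(optional.format(ntz)); // 2021-11-15 12:30:00
    }
}
```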
@gengliangwang Do you have any preferences here? Thanks.
@@ -38,6 +39,13 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    legacyFormat = FAST_DATE_FORMAT,
    isParsing = true)

  private val timestampNTZFormatter = TimestampFormatter(
    options.timestampFormatInRead,
I think we should fail earlier if the pattern string contains timezone? It doesn't make sense to have timezone in NTZ string.
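The suggested early failure could be a simple scan of the pattern for zone/offset letters outside quoted literals. A hypothetical sketch in Java (the method name and letter set are illustrative, not Spark's actual check):

```java
public class NtzPatternCheck {
    // Hypothetical check: does a datetime pattern contain zone/offset
    // pattern letters (V, z, O, X, x, Z) outside of quoted literals?
    // Such letters make no sense in a pattern for timestamps without timezone.
    static boolean containsZoneLetters(String pattern) {
        boolean inQuotes = false;
        for (char c : pattern.toCharArray()) {
            if (c == '\'') {
                inQuotes = !inQuotes; // naive toggle; does not handle '' escapes
            } else if (!inQuotes && "VzOXxZ".indexOf(c) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsZoneLetters("yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]")); // true
        System.out.println(containsZoneLetters("yyyy-MM-dd'T'HH:mm:ss[.SSS]"));      // false
    }
}
```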
moreover, maybe we should add a new CSV option to define pattern for timestamp ntz.
Updated.
cc @MaxGekk @cloud-fan @gengliangwang for review.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
  }

  private def tryParseTimestampNTZ(field: String): DataType = {
    if ((allCatch opt !timestampNTZFormatter.isTimeZoneSet(field)).getOrElse(false) &&
nit: let's add a one-line comment here.
Added, thanks.
@@ -164,6 +164,20 @@ class CSVOptions(
      s"${DateFormatter.defaultPattern}'T'HH:mm:ss[.SSS][XXX]"
    })

  val timestampNTZFormatInRead: Option[String] = parameters.get("timestampNTZFormat").orElse {
    if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) {
Let's ignore the legacy behavior in TimestampNTZ.
For the casting between string and TimestampNTZ, we didn't respect it either.
    }
  }

  val timestampNTZFormatInWrite: String = parameters.getOrElse("timestampNTZFormat",
    if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) {
ditto
      val parsed = formatter.parse(s)
      val parsedZoneId = parsed.query(TemporalQueries.zone())
      parsedZoneId != null
    } catch checkParsedDiff(s, legacyFormatter.isTimeZoneSet)
ditto
    try {
      val (_, zoneIdOpt, _) = parseTimestampString(UTF8String.fromString(s))
      zoneIdOpt.isDefined
    } catch checkParsedDiff(s, legacyFormatter.isTimeZoneSet)
ditto
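Both snippets above decide whether the input string actually carried a zone. In plain java.time, the same question can be asked of any parsed result via `TemporalQueries.zone()` — a minimal sketch, with the ISO formatter standing in for Spark's configured one:

```java
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;
import java.time.temporal.TemporalQueries;

public class ZoneCheck {
    // Parse with a formatter that accepts an optional offset, then ask the
    // parsed result whether a zone/offset was actually present in the input.
    static boolean hasZone(String s) {
        TemporalAccessor parsed = DateTimeFormatter.ISO_DATE_TIME.parse(s);
        return parsed.query(TemporalQueries.zone()) != null;
    }

    public static void main(String[] args) {
        System.out.println(hasZone("2021-11-15T12:30:00"));        // false
        System.out.println(hasZone("2021-11-15T12:30:00+01:00"));  // true
    }
}
```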
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
Thanks for the work
 * refer to `parseTimestampString` for the allowed formats.
 */
def stringToTimestampWithoutTimeZone(s: UTF8String): Option[Long] = {
  stringToTimestampWithoutTimeZone(s, false)
I used `false` to preserve the original behaviour in this method and the one below.
@gengliangwang @cloud-fan I updated the code, could you do the penultimate review? Thanks.
Test build #145686 has finished for PR 34596 at commit
Test build #145731 has finished for PR 34596 at commit
      // We can only parse the value as TimestampNTZType if it does not have zone-offset or
      // time-zone component and can be parsed with the timestamp formatter.
      // Otherwise, it is likely to be a timestamp with timezone.
      if ((allCatch opt timestampNTZFormatter.parseWithoutTimeZone(field, true)).isDefined) {
Maybe we should skip the parsing if `SQLConf.get.timestampType` is set to `TIMESTAMP_LTZ`, since parsing is a non-trivial op?
Could you elaborate a bit more? Thanks.
My understanding was that the config indicated whether the output of parsing should be treated as TimestampNTZ or TimestampLTZ.
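For context, the inference fallback being discussed can be sketched in plain Java: try the NTZ pattern first, and treat a parse failure as "likely a zoned timestamp" (the pattern and returned type names here are illustrative, not Spark's actual code path):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class InferSketch {
    static final DateTimeFormatter NTZ_FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss[.SSS]");

    // Hypothetical per-field inference step: if the NTZ parse succeeds, the
    // field is a candidate for TimestampNTZType; on failure, fall through to
    // the next (zoned) type in the chain.
    static String inferTimestampType(String field) {
        try {
            LocalDateTime.parse(field, NTZ_FMT);
            return "TimestampNTZType";
        } catch (DateTimeParseException e) {
            return "TimestampType";
        }
    }

    public static void main(String[] args) {
        System.out.println(inferTimestampType("2021-11-15T12:30:00"));       // TimestampNTZType
        System.out.println(inferTimestampType("2021-11-15T12:30:00.123"));   // TimestampNTZType
        System.out.println(inferTimestampType("2021-11-15T12:30:00+01:00")); // TimestampType
    }
}
```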
    options.zoneId,
    legacyFormat = FAST_DATE_FORMAT,
    isParsing = true,
    forTimestampNTZ = true)
this part I'd defer to @MaxGekk to review.
Looks good otherwise
@@ -164,6 +164,10 @@ class CSVOptions(
      s"${DateFormatter.defaultPattern}'T'HH:mm:ss[.SSS][XXX]"
    })

  val timestampNTZFormatInRead: Option[String] = parameters.get("timestampNTZFormat")
  val timestampNTZFormatInWrite: String = parameters.getOrElse("timestampNTZFormat",
    s"${DateFormatter.defaultPattern}'T'HH:mm:ss[.SSS]")
I wonder what's the reason to have the optional field `[.SSS]` in write. How should the CSV writer decide whether to write milliseconds or not? Another question: why is the precision in milliseconds and not in microseconds?
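To illustrate the precision question with plain java.time (the value and patterns are illustrative): `.SSS` truncates to milliseconds on write, `.SSSSSS` keeps microseconds, and an optional `[.SSS]` mostly matters on the read side.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class FractionDemo {
    public static void main(String[] args) {
        // 123456000 nanoseconds = 123456 microseconds = 123.456 milliseconds.
        LocalDateTime t = LocalDateTime.of(2020, 12, 12, 12, 12, 12, 123_456_000);

        DateTimeFormatter millis = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");
        DateTimeFormatter micros = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS");
        System.out.println(millis.format(t)); // 2020-12-12 12:12:12.123    (info lost)
        System.out.println(micros.format(t)); // 2020-12-12 12:12:12.123456 (full precision)

        // The optional section accepts input both with and without a fraction.
        DateTimeFormatter optional = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss[.SSS]");
        System.out.println(LocalDateTime.parse("2020-12-12 12:12:12", optional));
        System.out.println(LocalDateTime.parse("2020-12-12 12:12:12.123", optional));
    }
}
```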
It is the same reason as for timestampFormat above; I just copy-pasted it for timestampNTZFormat and removed the zone component.
It is the same reason as for timestampFormat above ...
The option was added when Spark's timestamp type had milliseconds precision. Now the precision is microseconds, and I don't see any reason to lose info on write by default.
I think we can change it for both LTZ and NTZ, if the precision loss here is a problem.
I would rather change it separately; it sounds more like a general problem with timestamps.
        StructType(StructField("a", TimestampNTZType) :: Nil),
        Map.empty[String, String]) as "value")
      .selectExpr("value.a")
    checkAnswer(fromCsvDF, Row(null))
How about a positive test for the functions `from_csv`/`to_csv` and `schema_of_csv`?
There is a test right above that verifies the happy path for those functions. I only added a new test because it was missing from the original patch.
      val path = s"${dir.getCanonicalPath}/csv"

      val exp = spark.sql("select timestamp_ntz'2020-12-12 12:12:12' as col0")
      exp.write.format("csv").option("timestampNTZFormat", "yyyy-MM-dd HH:mm:ss").save(path)
Could you test max precision with the pattern `.SSSSSS`, please.
Updated.
@@ -1012,6 +1012,196 @@ abstract class CSVSuite
  }
}

test("SPARK-37326: Use different pattern to write and infer TIMESTAMP_NTZ values") {
  withTempDir { dir =>
    val path = s"${dir.getCanonicalPath}/csv"
Just write directly to `dir`
Oh, I could not because the directory already exists by the time the method is called. `withTempDir` creates a directory, but it needs to be non-existent for Spark to write into.
There is `withTempPath` which doesn't create any dirs (spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/SQLHelper.scala, lines 62 to 66 in 0ba1d38):

/**
 * Generates a temporary path without creating the actual file/directory, then pass it to `f`. If
 * a file/directory is created there by `f`, it will be delete after `f` returns.
 */
protected def withTempPath(f: File => Unit): Unit = {
+1 to use withTempPath
It seems there is a mixture of `withTempDir` and `withTempPath` in this file, arguably being used with the same purpose. I will open a PR to fix the tests I added but I am not going to update other occurrences.
        .option("timestampNTZFormat", "yyyy-MM-dd HH:mm:ss")
        .load(path)

      checkAnswer(res, exp)
`checkAnswer` doesn't check the type. Could you check that the inferred type is correct?
Wouldn't that fail the answer since the values would be different? Yes, I can check the type explicitly, thanks.
@@ -1012,6 +1012,196 @@ abstract class CSVSuite
  }
}

test("SPARK-37326: Use different pattern to write and infer TIMESTAMP_NTZ values") {
The test title says about different patterns but the patterns are the same in write and in inferring, in fact.
Could you elaborate please? The code does use a different timestamp format compared to the default one.
The test title confuses because it can be read as the patterns in write and infer are different in the test.
Let me rewrite this.
        .option("header", "true")
        .load(path)

      for (policy <- Seq("exception", "corrected", "legacy")) {
Where is `policy` used?
Oh, good point. I forgot to change it. Now it is updated.
I addressed the comments, can you review again?
Test build #145778 has finished for PR 34596 at commit
+1, LGTM. Merging to master.
 * The return type is [[Option]] in order to distinguish between 0L and null. Please
 * refer to `parseTimestampString` for the allowed formats.
 */
def stringToTimestampWithoutTimeZone(s: UTF8String): Option[Long] = {
def stringToTimestampWithoutTimeZone(s: UTF8String, failOnError: Boolean): Option[Long] = {
The name `failOnError` is misleading, as this method never fails. How about `allowTimeZone`?
Respectfully, I have changed this exact code and variable 3 times by now. Yes, I will update 😞.
@@ -66,10 +68,23 @@ sealed trait TimestampFormatter extends Serializable {
  @throws(classOf[DateTimeParseException])
  @throws(classOf[DateTimeException])
  @throws(classOf[IllegalStateException])
  def parseWithoutTimeZone(s: String): Long =
  def parseWithoutTimeZone(s: String, failOnError: Boolean): Long =
ditto, `allowTimeZone`?
Another question is, instead of adding a new parameter, can the caller side pick one of `parse` and `parseWithoutTimeZone`?
I am not sure we can, because the `parseWithoutTimeZone` method is used with different options depending on the context. Unfortunately, this is an artifact of the problem that we discussed earlier: some functions are not supposed to fail while parsing the value and others are.
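The contract the proposed `allowTimeZone` flag would carry can be sketched in plain Java (a hypothetical sketch with an ISO formatter, not Spark's actual implementation): when the flag is false, input carrying a zone/offset is rejected; when true, the zone is parsed and dropped.

```java
import java.time.DateTimeException;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;
import java.time.temporal.TemporalQueries;

public class AllowTimeZoneSketch {
    // Hypothetical analogue of parseWithoutTimeZone(s, allowTimeZone).
    static LocalDateTime parseWithoutTimeZone(String s, boolean allowTimeZone) {
        TemporalAccessor parsed = DateTimeFormatter.ISO_DATE_TIME.parse(s);
        if (!allowTimeZone && parsed.query(TemporalQueries.zone()) != null) {
            throw new DateTimeException("Zone is not allowed: " + s);
        }
        // Extract only the local date-time part, discarding any zone/offset.
        return LocalDateTime.from(parsed);
    }

    public static void main(String[] args) {
        System.out.println(parseWithoutTimeZone("2021-11-15T12:30:00", false));
        System.out.println(parseWithoutTimeZone("2021-11-15T12:30:00+01:00", true));
    }
}
```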
@@ -2489,10 +2693,6 @@ abstract class CSVSuite
  }

test("SPARK-36536: use casting when datetime pattern is not set") {
  def isLegacy: Boolean = {
I think we should still test the legacy behavior for LTZ?
Oh, this is for the TimestampNTZ change - the isLegacy flag was only applied to TimestampNTZ; the TimestampLTZ code is unchanged.
### What changes were proposed in this pull request?
This PR fixes an issue where the test added in SPARK-37326 (#34596) fails with Java 11.
https://github.com/apache/spark/runs/4381645820?check_suite_focus=true#step:9:11681
The reason is that the error message of `DateTimeFormatter` was changed as of Java 9.
https://bugs.openjdk.java.net/browse/JDK-8085887
http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/28df1af8e872

### Why are the changes needed?
To keep the build stable.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tests run with the following command succeed.
```
JAVA_HOME=/path/to/java11/ build/sbt -Phive -Phive-thriftserver "testOnly org.apache.spark.sql.execution.datasources.csv.CSVv*Suite"
```
Closes #34771 from sarutak/followup-SPARK-37326.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…support in CSV data source

### What changes were proposed in this pull request?
This is a follow-up PR to #34596. There were a few comments and suggestions raised after the PR was merged, so I addressed them in this follow-up:
- Instead of using `failOnError`, which was confusing as no error was thrown in the method, we use `allowTimeZone`, which has the opposite meaning of `failOnError` and is far more descriptive.
- I updated a few test names to resolve ambiguity.
- I changed the tests to use `withTempPath` as was suggested in the original PR.

### Why are the changes needed?
Code cleanup and clarifications.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit and integration tests.

Closes #34777 from sadikovi/timestamp-ntz-csv-follow-up.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?

This PR adds support for TimestampNTZ type in the CSV data source.

Most of the functionality has already been added; this patch verifies that writes + reads work for TimestampNTZ type and adds schema inference depending on the timestamp value format written. The following applies:
- In case of a mixture of `TIMESTAMP_NTZ` and `TIMESTAMP_LTZ` values, use `TIMESTAMP_LTZ`.
- In case of only `TIMESTAMP_NTZ` values, resolve using the default timestamp type configured with `spark.sql.timestampType`.

In addition, I introduced a new CSV option `timestampNTZFormat`, which is similar to `timestampFormat` but allows configuring the read/write pattern for `TIMESTAMP_NTZ` types. It is basically a copy of the timestamp pattern but without the timezone component.

The schema inference works in the following way: if the value cannot be parsed as a timestamp without timezone, the `parseWithoutTimeZone` method throws `QueryExecutionErrors.cannotParseStringAsDataTypeError`, which is a `RuntimeException`.

Why are the changes needed?

The current CSV source could write values as TimestampNTZ into a file but could not preserve this type when reading the file back; this PR fixes the issue.

Does this PR introduce any user-facing change?

Previously, the CSV data source would infer timestamp values as `TimestampType` when reading a CSV file. Now, the data source infers the timestamp value type based on the format (with or without timezone) and the default timestamp type based on `spark.sql.timestampType`.

A new CSV option `timestampNTZFormat` is added to control the way values are formatted during writes or parsed during reads.

Now if the timestamp cannot be parsed as a timestamp without timezone, e.g. it contains a zone-offset or zone-id component, `parseWithoutTimeZone` throws a `RuntimeException`, signalling the inference code to try the next type.

How was this patch tested?

I extended `CSVSuite` with a few unit tests to verify that the write-read roundtrip works for `TIMESTAMP_NTZ` and `TIMESTAMP_LTZ` values.