[SPARK-26248][SQL] Infer date type from CSV #23202
Conversation
Test build #99581 has finished for PR 23202 at commit
@@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends Serializable {
       compatibleType(typeSoFar, tryParseDecimal(field)).getOrElse(StringType)
     case DoubleType => tryParseDouble(field)
     case TimestampType => tryParseTimestamp(field)
+    case DateType => tryParseDate(field)
The problem here is that it looks a bit odd that we try the date type later. IIRC the root cause is related to the date parsing library. Couldn't we try date first if we switch the parsing library? I thought that was in progress.
I mean, IIRC, if the pattern is, for instance, yyyy-MM-dd, then 2010-10-10 and also 2018-12-02T21:04:00.123567 are parsed as dates, because the current parsing library only checks whether a prefix of the string matches and ignores the rest of it.
So, if we try date first, it will work for the default patterns, but if we use some weird patterns, it won't work again.
I was thinking we can fix it if we use DateTimeFormatter, which does an exact match IIRC.
Just in case, I did an exact match here as well, see https://github.com/apache/spark/pull/23202/files#diff-17719da188b2c15129f848f654a0e6feR174 . If the date parser didn't consume all of the input (pos.getIndex != field.length), it fails. If I move it up in the type inference pipeline, it should work.
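The prefix-vs-exact-match behaviour under discussion can be reproduced outside Spark with plain JDK classes. A minimal sketch (the input string and patterns here are illustrative, not taken from Spark's code): the legacy `SimpleDateFormat` + `ParsePosition` path succeeds on a matching prefix, while `java.time.DateTimeFormatter` rejects input with trailing text.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class ExactMatchDemo {
    public static void main(String[] args) {
        String field = "2018-12-02T21:04:00.123567";

        // Legacy SimpleDateFormat stops at the first character that does not
        // match the pattern and still reports success for the matched prefix.
        ParsePosition pos = new ParsePosition(0);
        java.util.Date d = new SimpleDateFormat("yyyy-MM-dd").parse(field, pos);
        System.out.println(d != null);                        // true: "2018-12-02" matched
        System.out.println(pos.getIndex() == field.length()); // false: trailing text ignored
        // Rejecting the value when pos.getIndex() != field.length() is the
        // "exact match" check mentioned above.

        // java.time.DateTimeFormatter fails unless the whole input is consumed.
        try {
            LocalDate.parse(field, DateTimeFormatter.ofPattern("yyyy-MM-dd"));
            System.out.println("parsed");
        } catch (DateTimeParseException e) {
            System.out.println("rejected");                   // unparsed text at index 10
        }
    }
}
```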
I see. Can we try date first above? I was wondering if there was a reason to try date later.
Done. Please, have a look at the changes.
Test build #99607 has finished for PR 23202 at commit
I don't know this part well but the change looks reasonable.
@HyukjinKwon @srowen Is there anything in the PR that worries you?
I'd defer to @HyukjinKwon; looks OK in broad strokes, but he would know much more about the CSV parsing.
Similar discussion is going on at #23201 (comment). Let me keep tracking them. Sorry for the late response, @MaxGekk
I have rebased this branch on master and, as a consequence of that, I also changed the order of type inference here. For now
Test build #100194 has finished for PR 23202 at commit
retest this please
Test build #100198 has started for PR 23202 at commit
Test build #100201 has finished for PR 23202 at commit
Test build #100203 has finished for PR 23202 at commit
thanks, merging to master!
It works for default values but doesn't work when, for instance, other patterns are set. (I wrongly made examples and removed them .. see #23202 (comment))
That's why CSV hasn't introduced the date type yet: the pattern can be arbitrarily set. How did we handle the exact match problem here, @MaxGekk? Doesn't it cause such a problem when legacy is on? Also, how do we define the precedence between
What I mean by exact match is, for instance, we use these two call sites:
- spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala, line 95 (at 8a27952)
- spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala, line 157 (at 8a27952)
How do we handle the case below when legacy is on?
It's going to introduce some arbitrary behaviours to end users.
Let's not add date type inference support until we get rid of the legacy conf. Introducing exact match itself is a breaking change .. Since date type inference support is an additional improvement, we can do it later during minor releases as well.
@@ -104,6 +108,7 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
       compatibleType(typeSoFar, tryParseDecimal(field)).getOrElse(StringType)
     case DoubleType => tryParseDouble(field)
     case TimestampType => tryParseTimestamp(field)
+    case DateType => tryParseDate(field)
Another problem is that the order here matters when types are being merged. For instance, if the date type is inferred first and then a timestamp is found, it won't detect the timestamp type anymore.
IIRC we decided to follow the order in partition inference and infer timestamp first?
Here's what I mean:
Seq("2010|10|10", "2010_10_10")
.toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")
spark.read
.option("inferSchema", "true")
.option("header", "false")
.option("dateFormat", "yyyy|MM|dd")
.option("timestampFormat", "yyyy_MM_dd").csv("/tmp/foo").printSchema()
root
|-- _c0: string (nullable = true)
Seq("2010_10_10", "2010|10|10")
.toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")
spark.read
.option("inferSchema", "true")
.option("header", "false")
.option("dateFormat", "yyyy|MM|dd")
.option("timestampFormat", "yyyy_MM_dd").csv("/tmp/foo").printSchema()
root
|-- _c0: date (nullable = true)
Seq("2010_10_10")
.toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")
spark.read
.option("inferSchema", "true")
.option("header", "false")
.option("timestampFormat", "yyyy_MM_dd").csv("/tmp/foo").printSchema()
root
|-- _c0: timestamp (nullable = true)
Seq("2010|10|10")
.toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")
spark.read
.option("inferSchema", "true")
.option("header", "false")
.option("dateFormat", "yyyy|MM|dd").csv("/tmp/foo").printSchema()
root
|-- _c0: date (nullable = true)
Ah, I see your point. So the order here is not only about how we infer the type for a single token, but also about how we merge types.
This is super weird, as the order has a different meaning depending on the context:
- for a single token, the case that appears first has higher priority; here timestamp is preferred over date.
- for the type merge, the case that appears last has higher priority: once a type is inferred as date, we can't go back to timestamp anymore.
If the specified date and timestamp formats are not compatible, the timestamp and date types should be incompatible and we should fall back to string.
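The asymmetry can be sketched with a toy model of the inference chain. This is a simplified stand-in, not Spark's actual `CSVInferSchema` code; the patterns mirror the `dateFormat`/`timestampFormat` options from the examples above, and `LocalDate` is used for both parses purely for brevity. With the same two values, row order changes the inferred column type:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;

public class MergeOrderDemo {
    enum Inferred { START, TIMESTAMP, DATE, STRING }

    // Toy stand-ins for timestampFormat=yyyy_MM_dd and dateFormat=yyyy|MM|dd.
    static final DateTimeFormatter TS_FMT   = DateTimeFormatter.ofPattern("yyyy_MM_dd");
    static final DateTimeFormatter DATE_FMT = DateTimeFormatter.ofPattern("yyyy|MM|dd");

    static boolean matches(String s, DateTimeFormatter fmt) {
        try { LocalDate.parse(s, fmt); return true; }
        catch (DateTimeParseException e) { return false; }
    }

    static Inferred inferField(Inferred soFar, String field) {
        switch (soFar) {
            case START:
            case TIMESTAMP:
                // For a fresh (or timestamp-typed) column, timestamp is tried before date.
                if (matches(field, TS_FMT)) return Inferred.TIMESTAMP;
                if (matches(field, DATE_FMT)) return Inferred.DATE;
                return Inferred.STRING;
            case DATE:
                // Once the column is DATE, only the date pattern is retried,
                // so it can never be promoted back to TIMESTAMP.
                return matches(field, DATE_FMT) ? Inferred.DATE : Inferred.STRING;
            default:
                return Inferred.STRING;
        }
    }

    static Inferred inferColumn(List<String> rows) {
        Inferred t = Inferred.START;
        for (String row : rows) t = inferField(t, row);
        return t;
    }

    public static void main(String[] args) {
        // date-formatted row first: DATE, then the timestamp row kills it -> STRING
        System.out.println(inferColumn(List.of("2010|10|10", "2010_10_10"))); // STRING
        // timestamp-formatted row first: TIMESTAMP, then demoted to DATE
        System.out.println(inferColumn(List.of("2010_10_10", "2010|10|10"))); // DATE
    }
}
```

This reproduces the surprising results from the CSV snippets above: the same pair of values yields `string` or `date` depending only on which row comes first.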
Because of this, I'm +1 for reverting. We should think of a better way to do it. Sorry for not realizing the tricky stuff here.
It's okay .. sorry for rushing comments. I realised my comments are hard to read now.
@MaxGekk and @cloud-fan, if I am not completely wrong here, I would like to revert this .. but let me wait for your input for a while.
@HyukjinKwon can you elaborate on what's broken after this patch? If the legacy stuff is your only concern, can we disable this feature when legacy is on?
Give me a min .. let me summarise with examples ..
Ah, thank you @cloud-fan. Let me leave a comment summarising it here first anyway, so that we can see what the problem is and what should be addressed when this PR is proposed again.
Problem 1. #23202 (comment) - I left some examples there. If there are multiple rows, and the first row is inferred as date type in the same partition,

Problem 2. If legacy is on, we have ambiguity about date/timestamp pattern matching, because they can be arbitrarily set by users. We are able to do this only when

Questions: How do we define the precedence between
I have reverted this as discussed with @cloud-fan above.
## What changes were proposed in this pull request?

The `CSVInferSchema` class is extended to support inferring of `DateType` from CSV input. The attempt to infer `DateType` is performed after inferring `TimestampType`.

## How was this patch tested?

Added a new test for inferring date types from CSV. It was also tested by existing suites like `CSVInferSchemaSuite`, `CsvExpressionsSuite`, `CsvFunctionsSuite` and `CsvSuite`.

Closes apache#23202 from MaxGekk/csv-date-inferring.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>