
[SPARK-26248][SQL] Infer date type from CSV #23202

Closed · wants to merge 9 commits

Conversation

@MaxGekk (Member) commented Dec 2, 2018

What changes were proposed in this pull request?

The CSVInferSchema class is extended to support inferring DateType from CSV input. The attempt to infer DateType is performed after inferring TimestampType.
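The proposed per-token order can be sketched in plain Java. This is a minimal illustration, not Spark's implementation: the patterns below are hypothetical defaults, whereas the real CSVInferSchema takes the patterns from CSVOptions and tries several other types (integer, decimal, double, boolean) first.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class InferOrderSketch {
    // Minimal sketch of the proposed order, with hypothetical patterns:
    // try timestamp first, then date, then fall back to string.
    static String inferFieldType(String field) {
        try {
            LocalDateTime.parse(field, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss"));
            return "timestamp"; // timestamp is attempted first
        } catch (DateTimeParseException ignored) { }
        try {
            LocalDate.parse(field, DateTimeFormatter.ofPattern("yyyy-MM-dd"));
            return "date";      // date is attempted after timestamp
        } catch (DateTimeParseException ignored) { }
        return "string";        // anything else stays a string
    }

    public static void main(String[] args) {
        System.out.println(inferFieldType("2018-12-02T21:04:00")); // timestamp
        System.out.println(inferFieldType("2018-12-02"));          // date
        System.out.println(inferFieldType("abc"));                 // string
    }
}
```

Because java.time parsing rejects unparsed trailing text, a date-only pattern cannot accidentally swallow a timestamp string here; the review discussion below turns on exactly that point.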

How was this patch tested?

Added a new test for inferring date types from CSV. It was also tested by existing suites such as CSVInferSchemaSuite, CsvExpressionsSuite, CsvFunctionsSuite and CsvSuite.

@SparkQA commented Dec 3, 2018

Test build #99581 has finished for PR 23202 at commit fa915fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends Serializable {
compatibleType(typeSoFar, tryParseDecimal(field)).getOrElse(StringType)
case DoubleType => tryParseDouble(field)
case TimestampType => tryParseTimestamp(field)
case DateType => tryParseDate(field)
Member commented:

The problem here is that it looks a bit odd that we try the date type later. IIRC the root cause is related to the date parsing library. Couldn't we try date first if we switch the parsing library? I thought that's in progress.

@HyukjinKwon (Member) commented Dec 3, 2018

I mean, IIRC, if the pattern is, for instance, yyyy-MM-dd, then both 2010-10-10 and 2018-12-02T21:04:00.123567 are parsed as dates, because the current parsing library only checks whether the start of the string matches and ignores the rest.

So, if we try date first, it will work for the default patterns, but with some unusual patterns it wouldn't work.

I was thinking we can fix it if we use DateTimeFormatter, which does an exact match IIRC.
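The difference being described can be reproduced with the Java standard library directly. This is an illustrative sketch: SimpleDateFormat (which FastDateFormat is compatible with) stops at the first unparseable character and silently ignores the rest, while java.time's LocalDate.parse rejects unparsed trailing text.

```java
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class ExactMatchDemo {
    public static void main(String[] args) throws Exception {
        // Lenient prefix match: SimpleDateFormat parses "2018-12-02" and
        // simply ignores the trailing "T21:04:00.123567".
        SimpleDateFormat lenient = new SimpleDateFormat("yyyy-MM-dd");
        System.out.println(lenient.parse("2018-12-02T21:04:00.123567"));
        // parses "successfully", so a timestamp string is mistaken for a date

        // Exact match: DateTimeFormatter-based parsing rejects trailing text.
        DateTimeFormatter exact = DateTimeFormatter.ofPattern("yyyy-MM-dd");
        try {
            LocalDate.parse("2018-12-02T21:04:00.123567", exact);
        } catch (DateTimeParseException e) {
            System.out.println("rejected: unparsed trailing text");
        }
        System.out.println(LocalDate.parse("2018-12-02", exact)); // 2018-12-02
    }
}
```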

@MaxGekk (Member Author) commented:

Just in case, I did an exact match here as well, see https://github.com/apache/spark/pull/23202/files#diff-17719da188b2c15129f848f654a0e6feR174 . If the date parser doesn't consume all input (pos.getIndex != field.length), parsing fails. If I move it up in the type inference pipeline, it should work.
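The pos.getIndex != field.length check mentioned above can be sketched as follows. The helper name parsesAsDateExactly is hypothetical, not the PR's actual method; the point is that a ParsePosition makes even a lenient SimpleDateFormat behave like an exact-match parser.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;

public class ParsePositionCheck {
    // Hypothetical helper mirroring the idea in the PR: accept the value
    // only if the parser consumed the entire input string.
    static boolean parsesAsDateExactly(String field, String pattern) {
        ParsePosition pos = new ParsePosition(0);
        Date d = new SimpleDateFormat(pattern).parse(field, pos);
        return d != null && pos.getIndex() == field.length();
    }

    public static void main(String[] args) {
        System.out.println(parsesAsDateExactly("2010-10-10", "yyyy-MM-dd")); // true
        // The lenient parser accepts this as a prefix match, but the length
        // check rejects it: pos.getIndex() stops at 10, short of the input.
        System.out.println(parsesAsDateExactly("2018-12-02T21:04:00", "yyyy-MM-dd")); // false
    }
}
```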

Member commented:

I see. Can we try date first above? I was wondering if there was a reason not to try date first.

@MaxGekk (Member Author) commented:

Done. Please, have a look at the changes.

@SparkQA commented Dec 3, 2018

Test build #99607 has finished for PR 23202 at commit a6723f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment:

I don't know this part well but the change looks reasonable.

@MaxGekk (Member Author) commented Dec 6, 2018

@HyukjinKwon @srowen Is there anything in the PR that worries you?

@srowen (Member) commented Dec 6, 2018

I'd defer to @HyukjinKwon ; looks OK in broad strokes but he would know much more about the CSV parsing.

@HyukjinKwon (Member) commented:

A similar discussion is going on at #23201 (comment). Let me keep tracking them. Sorry for the late response, @MaxGekk

@MaxGekk (Member Author) commented Dec 15, 2018

I have rebased this branch on master, and as a consequence CSVInferSchema now uses the new date/timestamp parser for type inference. Can we continue with this PR, since it uses the new Date/TimeFormatter introduced by #23150 and probably will not be affected by #23196?

Also, I changed the order of type inference here: now TimestampType is inferred before DateType. /cc @cloud-fan

@SparkQA commented Dec 15, 2018

Test build #100194 has finished for PR 23202 at commit 0ec5c76.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

retest this please

@SparkQA commented Dec 16, 2018

Test build #100198 has started for PR 23202 at commit 0ec5c76.

@SparkQA commented Dec 16, 2018

Test build #100201 has finished for PR 23202 at commit ba4a9dc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 16, 2018

Test build #100203 has finished for PR 23202 at commit beb6912.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

thanks, merging to master!

@asfgit closed this in 5217f7b Dec 17, 2018
@HyukjinKwon (Member) commented Dec 17, 2018

It works for default values but doesn't work when, for instance, other patterns are set.

(I wrongly made examples and removed .. see #23202 (comment))

@HyukjinKwon (Member) commented Dec 17, 2018

That's why CSV hasn't introduced the date type yet: the pattern can be arbitrarily set. How did we handle the exact-match problem here, @MaxGekk? Doesn't it cause such a problem when legacy is on?

Also, how do we define the precedence between dateFormat and timestampFormat? (For instance, if the patterns are the same, does it become timestamp or date?)

@HyukjinKwon (Member) commented:

What I mean by exact match is, for instance, that we use FastDateFormat (which is compatible with SimpleDateFormat):

val format = FastDateFormat.getInstance(pattern, timeZone, locale)

val format = FastDateFormat.getInstance(pattern, locale)

How do we handle the case below when legacy is on?

scala> import org.apache.commons.lang3.time.FastDateFormat
import org.apache.commons.lang3.time.FastDateFormat

scala> FastDateFormat.getInstance("yyyy-MM").parse("2010-10-10")
res19: java.util.Date = Fri Oct 01 00:00:00 SGT 2010

scala> FastDateFormat.getInstance("yyyy-MM-dd").parse("2010-10-10")
res20: java.util.Date = Sun Oct 10 00:00:00 SGT 2010

It's going to introduce some arbitrary behaviours to end users.
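The same ambiguity can be shown with the Java standard library alone. This is an illustrative sketch using SimpleDateFormat (the lenient behaviour FastDateFormat is compatible with) against a java.time exact-match parse; the exact printed Date depends on the system time zone.

```java
import java.text.SimpleDateFormat;
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class LegacyAmbiguityDemo {
    public static void main(String[] args) throws Exception {
        // Legacy-style lenient parsing: pattern yyyy-MM happily accepts the
        // full date string and silently drops the trailing "-10".
        System.out.println(new SimpleDateFormat("yyyy-MM").parse("2010-10-10"));
        // e.g. Fri Oct 01 00:00:00 ... 2010 — the day component is lost

        // Exact matching with java.time rejects the unparsed tail instead,
        // so yyyy-MM and yyyy-MM-dd inputs are distinguishable.
        try {
            YearMonth.parse("2010-10-10", DateTimeFormatter.ofPattern("yyyy-MM"));
        } catch (DateTimeParseException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```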

@HyukjinKwon (Member) commented:

Let's not add date type inference support until we get rid of the legacy conf. Introducing exact match itself is a breaking change. Since date type inference support is an additional improvement, we can do it later during minor releases as well.

@@ -104,6 +108,7 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
compatibleType(typeSoFar, tryParseDecimal(field)).getOrElse(StringType)
case DoubleType => tryParseDouble(field)
case TimestampType => tryParseTimestamp(field)
case DateType => tryParseDate(field)
Member commented:

Another problem is that the order matters when types are merged. For instance, if date type is inferred first and a timestamp is then found, it won't detect the timestamp type anymore.

Contributor commented:

IIRC we decided to follow the order in partition inference and infer timestamp first?

Member commented:

Here's what I mean:

Seq("2010|10|10", "2010_10_10")
  .toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")
spark.read
  .option("inferSchema", "true")
  .option("header", "false")
  .option("dateFormat", "yyyy|MM|dd")
  .option("timestampFormat", "yyyy_MM_dd").csv("/tmp/foo").printSchema()

root
 |-- _c0: string (nullable = true)

Seq("2010_10_10", "2010|10|10")
  .toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")
spark.read
  .option("inferSchema", "true")
  .option("header", "false")
  .option("dateFormat", "yyyy|MM|dd")
  .option("timestampFormat", "yyyy_MM_dd").csv("/tmp/foo").printSchema()

root
 |-- _c0: date (nullable = true)

Seq("2010_10_10")
  .toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")
spark.read
  .option("inferSchema", "true")
  .option("header", "false")
  .option("timestampFormat", "yyyy_MM_dd").csv("/tmp/foo").printSchema()

root
 |-- _c0: timestamp (nullable = true)

Seq("2010|10|10")
  .toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")
spark.read
  .option("inferSchema", "true")
  .option("header", "false")
  .option("dateFormat", "yyyy|MM|dd").csv("/tmp/foo").printSchema()

root
 |-- _c0: date (nullable = true)

Contributor commented:

Ah, I see your point. So the order here determines not only how we infer the type of a single token, but also how we merge types.

This is super weird, as the order has different meanings depending on the context:

  1. for a single token, the case that appears first has higher priority; here timestamp is preferred over date.
  2. for type merging, the case that appears last has higher priority; once a type is inferred as date, we can't go back to timestamp anymore.

If the specified date and timestamp formats are not compatible, the timestamp and date types should be incompatible and we should fall back to string.

Contributor commented:

Because of this, I'm +1 for reverting. We should think of a better way to do it. Sorry for not realizing the tricky stuff here.

Member commented:

It's okay .. sorry for the rushed comments. I realise my comments are hard to read now.

@HyukjinKwon (Member) commented:

@MaxGekk and @cloud-fan, if I am not completely wrong here, I would like to revert this .. but let me wait for your input for a while.

@cloud-fan (Contributor) commented:

@HyukjinKwon can you elaborate on what's broken after this patch? If the legacy stuff is your only concern, can we disable this feature when legacy is on?

@HyukjinKwon (Member) commented:

Give me a min .. let me summarise with examples ..

@HyukjinKwon (Member) commented:

Ah, thank you @cloud-fan. Let me leave a comment summarising it here anyway, so that we can see what the problem is and what should be addressed when this PR is proposed again.

@HyukjinKwon (Member) commented Dec 17, 2018

Problem 1.

#23202 (comment) - I left some examples there.

If there are multiple rows and the first row in a partition is inferred as date type, it will not be able to infer timestamp afterward.

Problem 2.

#23202 (comment)

If legacy is on, we have ambiguity in date/timestamp pattern matching, because the patterns can be arbitrarily set by users. It does not do an exact match, which means it's not going to distinguish yyyy-MM and yyyy-MM-dd for an input such as, for instance, 2010-10-10.

We are able to do this only when spark.sql.legacy.timeParser.enabled is disabled (the default); however, I was thinking it's going to introduce complexity.
I was thinking we could do it later, when we remove spark.sql.legacy.timeParser.enabled. Date type inference isn't super important IMHO because we infer timestamps.
I would like to talk about this further if anyone thinks differently. If the change isn't complicated, then I thought it should also be okay to go ahead.

Questions:

How do we define the precedence between dateFormat and timestampFormat? (For instance, if the patterns are the same, does it become timestamp or date?)

@HyukjinKwon (Member) commented:

I have reverted this as discussed with @cloud-fan above.

holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
## What changes were proposed in this pull request?

The `CSVInferSchema` class is extended to support inferring of `DateType` from CSV input. The attempt to infer `DateType` is performed after inferring `TimestampType`.

## How was this patch tested?

Added new test for inferring date types from CSV. It was also tested by existing suites like `CSVInferSchemaSuite`, `CsvExpressionsSuite`, `CsvFunctionsSuite` and `CsvSuite`.

Closes apache#23202 from MaxGekk/csv-date-inferring.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
@MaxGekk MaxGekk deleted the csv-date-inferring branch August 17, 2019 13:35