
[SPARK-26248][SQL] Infer date type from CSV #23202

Closed · wants to merge 9 commits

Conversation

@MaxGekk (Member) commented Dec 2, 2018

What changes were proposed in this pull request?

The CSVInferSchema class is extended to support inferring DateType from CSV input. The attempt to infer DateType is performed after inferring TimestampType.
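The proposed per-token order can be sketched in plain Java. This is a minimal illustration, not Spark's implementation: the patterns below are hypothetical defaults, whereas the real CSVInferSchema takes the patterns from CSVOptions and tries several other types (integer, decimal, double, boolean) first.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class InferOrderSketch {
    // Minimal sketch of the proposed order, with hypothetical patterns:
    // try timestamp first, then date, then fall back to string.
    static String inferFieldType(String field) {
        try {
            LocalDateTime.parse(field, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss"));
            return "timestamp"; // timestamp is attempted first
        } catch (DateTimeParseException ignored) { }
        try {
            LocalDate.parse(field, DateTimeFormatter.ofPattern("yyyy-MM-dd"));
            return "date";      // date is attempted after timestamp
        } catch (DateTimeParseException ignored) { }
        return "string";        // anything else stays a string
    }

    public static void main(String[] args) {
        System.out.println(inferFieldType("2018-12-02T21:04:00")); // timestamp
        System.out.println(inferFieldType("2018-12-02"));          // date
        System.out.println(inferFieldType("abc"));                 // string
    }
}
```

Because java.time parsing rejects unparsed trailing text, a date-only pattern cannot accidentally swallow a timestamp string here; the review discussion below turns on exactly that point.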

How was this patch tested?

Added a new test for inferring date types from CSV. It was also tested by existing suites such as CSVInferSchemaSuite, CsvExpressionsSuite, CsvFunctionsSuite and CsvSuite.

@SparkQA commented Dec 3, 2018

Test build #99581 has finished for PR 23202 at commit fa915fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends Serializable {
compatibleType(typeSoFar, tryParseDecimal(field)).getOrElse(StringType)
case DoubleType => tryParseDouble(field)
case TimestampType => tryParseTimestamp(field)
case DateType => tryParseDate(field)
Member commented:

The problem here is that it looks a bit odd that we try the date type later. IIRC the root cause is related to the date parsing library. Couldn't we try date first if we switch the parsing library? I thought that's in progress.

@HyukjinKwon (Member) commented Dec 3, 2018

I mean, IIRC, if the pattern is, for instance, yyyy-MM-dd, then both 2010-10-10 and 2018-12-02T21:04:00.123567 are parsed as dates, because the current parsing library only checks whether the start of the string matches and ignores the rest.

So, if we try date first, it will work for the default patterns, but with some unusual patterns it wouldn't work.

I was thinking we can fix it if we use DateTimeFormatter, which does an exact match IIRC.
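The difference being described can be reproduced with the Java standard library directly. This is an illustrative sketch: SimpleDateFormat (which FastDateFormat is compatible with) stops at the first unparseable character and silently ignores the rest, while java.time's LocalDate.parse rejects unparsed trailing text.

```java
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class ExactMatchDemo {
    public static void main(String[] args) throws Exception {
        // Lenient prefix match: SimpleDateFormat parses "2018-12-02" and
        // simply ignores the trailing "T21:04:00.123567".
        SimpleDateFormat lenient = new SimpleDateFormat("yyyy-MM-dd");
        System.out.println(lenient.parse("2018-12-02T21:04:00.123567"));
        // parses "successfully", so a timestamp string is mistaken for a date

        // Exact match: DateTimeFormatter-based parsing rejects trailing text.
        DateTimeFormatter exact = DateTimeFormatter.ofPattern("yyyy-MM-dd");
        try {
            LocalDate.parse("2018-12-02T21:04:00.123567", exact);
        } catch (DateTimeParseException e) {
            System.out.println("rejected: unparsed trailing text");
        }
        System.out.println(LocalDate.parse("2018-12-02", exact)); // 2018-12-02
    }
}
```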

@MaxGekk (Member Author) commented:

Just in case, I did an exact match here as well, see https://github.com/apache/spark/pull/23202/files#diff-17719da188b2c15129f848f654a0e6feR174 . If the date parser doesn't consume all input (pos.getIndex != field.length), parsing fails. If I move it up in the type inference pipeline, it should work.
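The pos.getIndex != field.length check mentioned above can be sketched as follows. The helper name parsesAsDateExactly is hypothetical, not the PR's actual method; the point is that a ParsePosition makes even a lenient SimpleDateFormat behave like an exact-match parser.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;

public class ParsePositionCheck {
    // Hypothetical helper mirroring the idea in the PR: accept the value
    // only if the parser consumed the entire input string.
    static boolean parsesAsDateExactly(String field, String pattern) {
        ParsePosition pos = new ParsePosition(0);
        Date d = new SimpleDateFormat(pattern).parse(field, pos);
        return d != null && pos.getIndex() == field.length();
    }

    public static void main(String[] args) {
        System.out.println(parsesAsDateExactly("2010-10-10", "yyyy-MM-dd")); // true
        // The lenient parser accepts this as a prefix match, but the length
        // check rejects it: pos.getIndex() stops at 10, short of the input.
        System.out.println(parsesAsDateExactly("2018-12-02T21:04:00", "yyyy-MM-dd")); // false
    }
}
```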

Member commented:

I see. Can we try date first above? I was wondering if there was a reason not to try date first.

@MaxGekk (Member Author) commented:

Done. Please, have a look at the changes.

@SparkQA commented Dec 3, 2018

Test build #99607 has finished for PR 23202 at commit a6723f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment:

I don't know this part well but the change looks reasonable.

@MaxGekk (Member Author) commented Dec 6, 2018

@HyukjinKwon @srowen Is there anything in the PR that worries you?

@srowen (Member) commented Dec 6, 2018

I'd defer to @HyukjinKwon ; looks OK in broad strokes but he would know much more about the CSV parsing.

@HyukjinKwon (Member) commented:

A similar discussion is going on at #23201 (comment). Let me keep tracking them. Sorry for the late response, @MaxGekk

@MaxGekk (Member Author) commented Dec 15, 2018

I have rebased this branch on master, and as a consequence CSVInferSchema now uses the new date/timestamp parser for type inference. Can we continue with this PR, since it uses the new Date/TimeFormatter introduced by #23150 and probably will not be affected by #23196?

Also, I changed the order of type inference here: now TimestampType is inferred before DateType. /cc @cloud-fan

@SparkQA commented Dec 15, 2018

Test build #100194 has finished for PR 23202 at commit 0ec5c76.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

retest this please

@SparkQA commented Dec 16, 2018

Test build #100198 has started for PR 23202 at commit 0ec5c76.

@SparkQA commented Dec 16, 2018

Test build #100201 has finished for PR 23202 at commit ba4a9dc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 16, 2018

Test build #100203 has finished for PR 23202 at commit beb6912.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

thanks, merging to master!

@asfgit closed this in 5217f7b Dec 17, 2018
@HyukjinKwon (Member) commented Dec 17, 2018

It works for default values but doesn't work when, for instance, other patterns are set.

(I wrongly made examples and removed .. see #23202 (comment))

@HyukjinKwon (Member) commented Dec 17, 2018

That's why CSV hasn't introduced the date type yet: the pattern can be arbitrarily set. How did we handle the exact-match problem here, @MaxGekk? Doesn't it cause such a problem when legacy is on?

Also, how do we define the precedence between dateFormat and timestampFormat? (For instance, if the patterns are the same, does it become timestamp or date?)

@HyukjinKwon (Member) commented:

What I mean by exact match is, for instance, that we use FastDateFormat (which is compatible with SimpleDateFormat):

val format = FastDateFormat.getInstance(pattern, timeZone, locale)

val format = FastDateFormat.getInstance(pattern, locale)

How do we handle the case below when legacy is on?

scala> import org.apache.commons.lang3.time.FastDateFormat
import org.apache.commons.lang3.time.FastDateFormat

scala> FastDateFormat.getInstance("yyyy-MM").parse("2010-10-10")
res19: java.util.Date = Fri Oct 01 00:00:00 SGT 2010

scala> FastDateFormat.getInstance("yyyy-MM-dd").parse("2010-10-10")
res20: java.util.Date = Sun Oct 10 00:00:00 SGT 2010

It's going to introduce some arbitrary behaviours to end users.
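The same ambiguity can be shown with the Java standard library alone. This is an illustrative sketch using SimpleDateFormat (the lenient behaviour FastDateFormat is compatible with) against a java.time exact-match parse; the exact printed Date depends on the system time zone.

```java
import java.text.SimpleDateFormat;
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class LegacyAmbiguityDemo {
    public static void main(String[] args) throws Exception {
        // Legacy-style lenient parsing: pattern yyyy-MM happily accepts the
        // full date string and silently drops the trailing "-10".
        System.out.println(new SimpleDateFormat("yyyy-MM").parse("2010-10-10"));
        // e.g. Fri Oct 01 00:00:00 ... 2010 — the day component is lost

        // Exact matching with java.time rejects the unparsed tail instead,
        // so yyyy-MM and yyyy-MM-dd inputs are distinguishable.
        try {
            YearMonth.parse("2010-10-10", DateTimeFormatter.ofPattern("yyyy-MM"));
        } catch (DateTimeParseException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```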

@HyukjinKwon (Member) commented:

Let's not add date type inference support until we get rid of the legacy conf. Introducing exact match itself is a breaking change. Since date type inference support is an additional improvement, we can do it later during minor releases as well.

@@ -104,6 +108,7 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
compatibleType(typeSoFar, tryParseDecimal(field)).getOrElse(StringType)
case DoubleType => tryParseDouble(field)
case TimestampType => tryParseTimestamp(field)
case DateType => tryParseDate(field)
Member commented:

Another problem is that the order matters when types are merged. For instance, if date type is inferred first and a timestamp is then found, it won't detect the timestamp type anymore.

Contributor commented:

IIRC we decided to follow the order in partition inference and infer timestamp first?

Member commented:

Here's what I mean:

Seq("2010|10|10", "2010_10_10")
  .toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")
spark.read
  .option("inferSchema", "true")
  .option("header", "false")
  .option("dateFormat", "yyyy|MM|dd")
  .option("timestampFormat", "yyyy_MM_dd").csv("/tmp/foo").printSchema()

root
 |-- _c0: string (nullable = true)

Seq("2010_10_10", "2010|10|10")
  .toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")
spark.read
  .option("inferSchema", "true")
  .option("header", "false")
  .option("dateFormat", "yyyy|MM|dd")
  .option("timestampFormat", "yyyy_MM_dd").csv("/tmp/foo").printSchema()

root
 |-- _c0: date (nullable = true)

Seq("2010_10_10")
  .toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")
spark.read
  .option("inferSchema", "true")
  .option("header", "false")
  .option("timestampFormat", "yyyy_MM_dd").csv("/tmp/foo").printSchema()

root
 |-- _c0: timestamp (nullable = true)

Seq("2010|10|10")
  .toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")
spark.read
  .option("inferSchema", "true")
  .option("header", "false")
  .option("dateFormat", "yyyy|MM|dd").csv("/tmp/foo").printSchema()

root
 |-- _c0: date (nullable = true)

Contributor commented:

Ah, I see your point. So the order here determines not only how we infer the type of a single token, but also how we merge types.

This is super weird, as the order has different meanings depending on the context:

  1. for a single token, the case that appears first has higher priority; here timestamp is preferred over date.
  2. for type merging, the case that appears last has higher priority; once a type is inferred as date, we can't go back to timestamp anymore.

If the specified date and timestamp formats are not compatible, the timestamp and date types should be incompatible and we should fall back to string.

Contributor commented:

Because of this, I'm +1 for reverting. We should think of a better way to do it. Sorry for not realizing the tricky stuff here.

Member commented:

It's okay .. sorry for the rushed comments. I realise my comments are hard to read now.

@HyukjinKwon (Member) commented:

@MaxGekk and @cloud-fan, if I am not completely wrong here, I would like to revert this .. but let me wait for your input for a while.

@cloud-fan (Contributor) commented:

@HyukjinKwon can you elaborate on what's broken after this patch? If the legacy stuff is your only concern, can we disable this feature when legacy is on?

@HyukjinKwon (Member) commented:

Give me a min .. let me summarise with examples ..

@HyukjinKwon (Member) commented:

Ah, thank you @cloud-fan. Let me leave a comment summarising it here anyway, so that we can see what the problem is and what should be addressed when this PR is proposed again.

@HyukjinKwon (Member) commented Dec 17, 2018

Problem 1.

#23202 (comment) - I left some examples there.

If there are multiple rows and the first row in a partition is inferred as date type, it will not be able to infer timestamp afterward.

Problem 2.

#23202 (comment)

If legacy is on, we have ambiguity in date/timestamp pattern matching, because the patterns can be arbitrarily set by users. It does not do an exact match, which means it's not going to distinguish yyyy-MM and yyyy-MM-dd for an input such as, for instance, 2010-10-10.

We are able to do this only when spark.sql.legacy.timeParser.enabled is disabled (the default); however, I was thinking it's going to introduce complexity.
I was thinking we could do it later, when we remove spark.sql.legacy.timeParser.enabled. Date type inference isn't super important IMHO because we infer timestamps.
I would like to talk about this further if anyone thinks differently. If the change isn't complicated, then I thought it should also be okay to go ahead.

Questions:

How do we define the precedence between dateFormat and timestampFormat? (For instance, if the patterns are the same, does it become timestamp or date?)

@HyukjinKwon (Member) commented:

I have reverted this as discussed with @cloud-fan above.

holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
## What changes were proposed in this pull request?

The `CSVInferSchema` class is extended to support inferring of `DateType` from CSV input. The attempt to infer `DateType` is performed after inferring `TimestampType`.

## How was this patch tested?

Added new test for inferring date types from CSV. It was also tested by existing suites like `CSVInferSchemaSuite`, `CsvExpressionsSuite`, `CsvFunctionsSuite` and `CsvSuite`.

Closes apache#23202 from MaxGekk/csv-date-inferring.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
@MaxGekk MaxGekk deleted the csv-date-inferring branch August 17, 2019 13:35