[SPARK-26248][SQL] Infer date type from CSV #23202
```diff
@@ -22,16 +22,20 @@ import scala.util.control.Exception.allCatch

 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.catalyst.analysis.TypeCoercion
 import org.apache.spark.sql.catalyst.expressions.ExprUtils
-import org.apache.spark.sql.catalyst.util.TimestampFormatter
+import org.apache.spark.sql.catalyst.util.{DateFormatter, TimestampFormatter}
 import org.apache.spark.sql.types._

 class CSVInferSchema(val options: CSVOptions) extends Serializable {

   @transient
-  private lazy val timestampParser = TimestampFormatter(
+  private lazy val timestampFormatter = TimestampFormatter(
     options.timestampFormat,
     options.timeZone,
     options.locale)
+  @transient
+  private lazy val dateFormatter = DateFormatter(
+    options.dateFormat,
+    options.locale)

   private val decimalParser = {
     ExprUtils.getDecimalParser(options.locale)
```
```diff
@@ -104,6 +108,7 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
         compatibleType(typeSoFar, tryParseDecimal(field)).getOrElse(StringType)
       case DoubleType => tryParseDouble(field)
       case TimestampType => tryParseTimestamp(field)
+      case DateType => tryParseDate(field)
       case BooleanType => tryParseBoolean(field)
       case StringType => StringType
       case other: DataType =>
```

Review thread on `case DateType => tryParseDate(field)`:

"Another problem is that the order here matters when types are being merged. For instance, if the date type is inferred first and a timestamp is found afterwards, the timestamp type will no longer be detected."

"IIRC we decided to follow the order used in partition inference and infer timestamp first?"

"Here's what I mean:

```scala
Seq("2010|10|10", "2010_10_10")
  .toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")
spark.read
  .option("inferSchema", "true")
  .option("header", "false")
  .option("dateFormat", "yyyy|MM|dd")
  .option("timestampFormat", "yyyy_MM_dd").csv("/tmp/foo").printSchema()

Seq("2010_10_10", "2010|10|10")
  .toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")
spark.read
  .option("inferSchema", "true")
  .option("header", "false")
  .option("dateFormat", "yyyy|MM|dd")
  .option("timestampFormat", "yyyy_MM_dd").csv("/tmp/foo").printSchema()

Seq("2010_10_10")
  .toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")
spark.read
  .option("inferSchema", "true")
  .option("header", "false")
  .option("timestampFormat", "yyyy_MM_dd").csv("/tmp/foo").printSchema()

Seq("2010|10|10")
  .toDF.repartition(1).write.mode("overwrite").text("/tmp/foo")
spark.read
  .option("inferSchema", "true")
  .option("header", "false")
  .option("dateFormat", "yyyy|MM|dd").csv("/tmp/foo").printSchema()
```
"

"Ah, I see your point. So the order here controls not only how we infer the type of a single token, but also how we merge types. This is super weird, as the order has a different meaning depending on the context. If the specified date and timestamp formats are not compatible, the timestamp and date types should be incompatible and we should fall back to string."

"Because of this, I'm +1 for reverting. We should think of a better way to do it. Sorry for not realizing the tricky stuff here."

"It's okay, sorry for rushing my comments. I realize they are hard to read now."
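The order-dependence described in the thread above can be reduced to a small standalone sketch. This is hypothetical Java mirroring the diff's dispatch, not Spark code: `inferField`, `inferColumn`, and the `"start"` sentinel are invented for illustration. Once the running type for a column is date, a later timestamp token is only re-tested as a date and the column falls back to string; with the rows in the opposite order, the timestamp parser's fallback to date kicks in and the column ends up as date.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class MergeOrderDemo {
    static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    static final DateTimeFormatter DATE = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    static boolean isTimestamp(String f) {
        try { LocalDateTime.parse(f, TS); return true; } catch (Exception e) { return false; }
    }

    static boolean isDate(String f) {
        try { LocalDate.parse(f, DATE); return true; } catch (Exception e) { return false; }
    }

    // Mirrors the diff's per-field dispatch: the running type decides
    // which parser runs, and falling through models
    // tryParseTimestamp -> tryParseDate.
    public static String inferField(String typeSoFar, String field) {
        switch (typeSoFar) {
            case "start":
            case "timestamp":
                if (isTimestamp(field)) return "timestamp";
                // fall through: timestamp parsing failed, try date next
            case "date":
                if (isDate(field)) return "date";
                return "string";
            default:
                return "string";
        }
    }

    public static String inferColumn(String... fields) {
        String t = "start";
        for (String f : fields) t = inferField(t, f);
        return t;
    }

    public static void main(String[] args) {
        // Date row first: the later timestamp is only re-tested as a date,
        // so the column degrades to string.
        System.out.println(inferColumn("2010-10-10", "2010-10-10 12:00:00")); // string
        // Timestamp row first: the fallback to date still fires.
        System.out.println(inferColumn("2010-10-10 12:00:00", "2010-10-10")); // date
    }
}
```

The two calls return different types for the same set of values, which is exactly the asymmetry the `Seq(...)` repro above demonstrates through `spark.read`.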
```diff
@@ -159,9 +164,16 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
   }

   private def tryParseTimestamp(field: String): DataType = {
     // This case infers a custom `dataFormat` is set.
-    if ((allCatch opt timestampParser.parse(field)).isDefined) {
+    if ((allCatch opt timestampFormatter.parse(field)).isDefined) {
       TimestampType
     } else {
-      tryParseBoolean(field)
+      tryParseDate(field)
     }
   }

+  private def tryParseDate(field: String): DataType = {
+    if ((allCatch opt dateFormatter.parse(field)).isDefined) {
+      DateType
+    } else {
+      tryParseBoolean(field)
+    }
+  }
```
Review thread on `tryParseDate`:

"The problem here is that it looks a bit odd that we try the date type later. IIRC the root cause is related to the date parsing library. Couldn't we try date first if we switch the parsing library? I thought that was in progress."

"I mean, IIRC, if the pattern is, for instance, `yyyy-MM-dd`, then both 2010-10-10 and 2018-12-02T21:04:00.123567 are parsed as dates, because the current parsing library only checks whether the start of the string matches and ignores the rest. So if we try date first, it will work for the default patterns, but with some weird patterns it wouldn't work. I was thinking we can fix it if we use `DateTimeFormatter`, which does an exact match IIRC."

"Just in case, I did an exact match here as well, see https://github.com/apache/spark/pull/23202/files#diff-17719da188b2c15129f848f654a0e6feR174 . If the date parser didn't consume all of the input (`pos.getIndex != field.length`), it fails. If I move it up in the type inference pipeline, it should work."

"I see. Can we try date first above? I was wondering if there was a reason to try date first."

"Done. Please have a look at the changes."
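The lenient prefix-matching behaviour discussed above can be reproduced with plain `java.text.SimpleDateFormat`. This is a standalone sketch, not Spark code; `parsesExactly` is an invented helper that applies the same `pos.getIndex != field.length` check the reviewer describes.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;

public class ExactMatchDemo {
    // Returns true only if the whole input was consumed by the pattern,
    // mirroring the `pos.getIndex != field.length` check from the thread.
    public static boolean parsesExactly(String pattern, String field) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern);
        fmt.setLenient(false);
        ParsePosition pos = new ParsePosition(0);
        return fmt.parse(field, pos) != null && pos.getIndex() == field.length();
    }

    public static void main(String[] args) {
        // Prefix match: SimpleDateFormat stops after "2018-12-02" and
        // ignores the trailing time component, so a plain parse "succeeds".
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        ParsePosition pos = new ParsePosition(0);
        String ts = "2018-12-02T21:04:00.123567";
        System.out.println(fmt.parse(ts, pos) != null);         // true  (prefix matched)
        System.out.println(pos.getIndex() == ts.length());      // false (input not consumed)

        // With the exact-match check, the timestamp is rejected as a date
        // while a real date still passes.
        System.out.println(parsesExactly("yyyy-MM-dd", ts));            // false
        System.out.println(parsesExactly("yyyy-MM-dd", "2010-10-10"));  // true
    }
}
```

This is why a timestamp string can be mistaken for a date when the date pattern is tried first: without the full-consumption check, any string whose prefix matches the date pattern parses "successfully".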