-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-19228][SQL] Migrate on Java 8 time from FastDateFormat for meet the ISO8601 #21363
Conversation
…t the ISO8601 and parsing dates in csv correctly. Add support for inferring DateType and custom "dateFormat" option.
Previous pull request #20140 is closed |
@HyukjinKwon, @gatorsmile please take a look |
ok to test |
Test build #90870 has finished for PR 21363 at commit
|
@sergey-rubtsov, would we be able to add a configuration to control this behaviour? Sounds we should better have a configuration to control this behaviour for now since the date / timestamp parsing logic is affected by the library change. |
val castedDate = | ||
parser.makeConverter("_1", DateType, nullable = true, options = dateOptions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would keep this line as was.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
@@ -119,7 +119,6 @@ class CSVOptions( | |||
val positiveInf = parameters.getOrElse("positiveInf", "Inf") | |||
val negativeInf = parameters.getOrElse("negativeInf", "-Inf") | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds unrelated change.
def dateTimeToMicroseconds(localDateTime: LocalDateTime, timeZone: TimeZone): Long = { | ||
val microOfSecond = localDateTime.getLong(ChronoField.MICRO_OF_SECOND) | ||
val epochSecond = localDateTime.atZone(timeZone.toZoneId).toInstant.getEpochSecond | ||
epochSecond * 1000000L + microOfSecond |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1000000L
-> MICROS_PER_SECOND
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks
@@ -140,14 +141,23 @@ private[csv] object CSVInferSchema { | |||
private def tryParseDouble(field: String, options: CSVOptions): DataType = { | |||
if ((allCatch opt field.toDouble).isDefined || isInfOrNan(field, options)) { | |||
DoubleType | |||
} else { | |||
tryParseDate(field, options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a behavior change? Previously timestamp type, now date type/timestamp type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, by mistake we have identical "timestampFormat" and "dateFormat" options.
Let it be "yyyy-MM-dd"
'TimestampType' (8 bytes) is larger than 'DateType' (4 bytes)
So if they can overlap, we need to try parse it as date firstly, because both of these types are suitable, but you need to try to use a more compact by default and it will be correct inferring of type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment, DateType here is ignored at all, I'm not sure that it was conceived when the type was created
test("Timestamp field types are inferred correctly via custom data format") { | ||
var options = new CSVOptions(Map("timestampFormat" -> "yyyy-mm"), "GMT") | ||
test("Timestamp field types are inferred correctly via custom date format") { | ||
var options = new CSVOptions(Map("timestampFormat" -> "yyyy-MM"), "GMT") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need to change this? The format string change is also a behavior change for users. We might need a config for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"yyyy-mm" means years and minutes, this is a date format, not a time format
"yyyy-MM" means years and months, but I do not insist on this change
@@ -90,6 +90,7 @@ private[csv] object CSVInferSchema { | |||
// DecimalTypes have different precisions and scales, so we try to find the common type. | |||
findTightestCommonType(typeSoFar, tryParseDecimal(field, options)).getOrElse(StringType) | |||
case DoubleType => tryParseDouble(field, options) | |||
case DateType => tryParseDate(field, options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this also is a behavior change. Shall we document it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do it, but where exactly it should be documented?
@HyukjinKwon please, clarify, how to add a configuration to control this behaviour? Do you mean to keep backward compatibility? |
@sergey-rubtsov we have to keep backward compatibility. If a user upgrades, with your change a running application may break because of data not being anymore timestamp but date. We can add a new entry in Moreover, we have to mention in the migration guide all the behavioral changes we introduce. |
ok to test |
Test build #91599 has finished for PR 21363 at commit
|
Can one of the admins verify this patch? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we can probably use Java 8 Date classes now instead of 3rd party ones, but can we do so without a behavior change? is it actually required?
Looks difficult because the behaviours themselves are different. One possibility is a fallback and the other possibility is configuration. |
I am going to suggest to close this since it's being active more then few weeks. It should be good to fix. Let me leave some cc's who might be interested in this just FYI. Feel free to take over this when you guys find some time or are interested in this. Adding @xuanyuanking, @mgaido91, @viirya, @MaxGekk, @softmanu who I could think of for now. Feel free to ignore my cc if you guys are busy or having more important fixes you guys are working on. |
@HyukjinKwon Great thanks for ping me, I'll try to work on this and cc all reviewer in this PR. |
@xuanyuanking Are you still working on this? If not, I could continue. |
@MaxGekk Sorry for the late, something inserted in the my scheduler, I plan to start this PR in this weekend, if its too late please just take it, sorry for the late again. |
Here is the PR: #23150 which allows to switch on java.time API (in CSV so far). |
@MaxGekk now that your change is merge, can this proceed, @xuanyuanking ? or is it obsolete? |
I'm going to close this since @sergey-rubtsov is inactive. Most of feasible changes extracted from here are merged. One last part is date type inference which's going to need some more discussions (IMHO). |
What changes were proposed in this pull request?
Add support for inferring DateType and custom "dateFormat" option.
Fix an old bug with parse string to SQL's timestamp value in microseconds accuracy.
Add a type-widening rule in findTightestCommonType between DateType and TimestampType.
How was this patch tested?
Fix some tests to accord with an internationally agreed way to represent dates.
Add an end-to-end test case and unit tests.