Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-19228][SQL] Migrate on Java 8 time from FastDateFormat for meet the ISO8601 #21363

Closed
wants to merge 1 commit into from

Conversation

sergey-rubtsov
Copy link

What changes were proposed in this pull request?

Add support for inferring DateType and custom "dateFormat" option.
Fix an old bug with parse string to SQL's timestamp value in microseconds accuracy.
Add a type-widening rule in findTightestCommonType between DateType and TimestampType.

How was this patch tested?

Fix some tests to accord with an internationally agreed way to represent dates.
Add an end-to-end test case and unit tests.

…t the ISO8601 and parsing dates in csv correctly. Add support for inferring DateType and custom "dateFormat" option.
@sergey-rubtsov
Copy link
Author

Previous pull request #20140 is closed

@sergey-rubtsov
Copy link
Author

sergey-rubtsov commented May 18, 2018

@HyukjinKwon, @gatorsmile please take a look

@sergey-rubtsov
Copy link
Author

cc @viirya @mgaido91 @dongjoon-hyun

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented May 21, 2018

Test build #90870 has finished for PR 21363 at commit 65179a2.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented May 21, 2018

@sergey-rubtsov, would we be able to add a configuration to control this behaviour? Sounds we should better have a configuration to control this behaviour for now since the date / timestamp parsing logic is affected by the library change.

val castedDate =
parser.makeConverter("_1", DateType, nullable = true, options = dateOptions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep this line as was.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -119,7 +119,6 @@ class CSVOptions(
val positiveInf = parameters.getOrElse("positiveInf", "Inf")
val negativeInf = parameters.getOrElse("negativeInf", "-Inf")


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds unrelated change.

def dateTimeToMicroseconds(localDateTime: LocalDateTime, timeZone: TimeZone): Long = {
val microOfSecond = localDateTime.getLong(ChronoField.MICRO_OF_SECOND)
val epochSecond = localDateTime.atZone(timeZone.toZoneId).toInstant.getEpochSecond
epochSecond * 1000000L + microOfSecond
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1000000L -> MICROS_PER_SECOND?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

@@ -140,14 +141,23 @@ private[csv] object CSVInferSchema {
private def tryParseDouble(field: String, options: CSVOptions): DataType = {
if ((allCatch opt field.toDouble).isDefined || isInfOrNan(field, options)) {
DoubleType
} else {
tryParseDate(field, options)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a behavior change? Previously timestamp type, now date type/timestamp type?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, by mistake we have identical "timestampFormat" and "dateFormat" options.
Let it be "yyyy-MM-dd"
'TimestampType' (8 bytes) is larger than 'DateType' (4 bytes)
So if they can overlap, we need to try parse it as date firstly, because both of these types are suitable, but you need to try to use a more compact by default and it will be correct inferring of type

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, DateType here is ignored at all, I'm not sure that it was conceived when the type was created

test("Timestamp field types are inferred correctly via custom data format") {
var options = new CSVOptions(Map("timestampFormat" -> "yyyy-mm"), "GMT")
test("Timestamp field types are inferred correctly via custom date format") {
var options = new CSVOptions(Map("timestampFormat" -> "yyyy-MM"), "GMT")
Copy link
Member

@viirya viirya May 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need to change this? The format string change is also a behavior change for users. We might need a config for that.

Copy link
Author

@sergey-rubtsov sergey-rubtsov May 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"yyyy-mm" means years and minutes, this is a date format, not a time format
"yyyy-MM" means years and months, but I do not insist on this change

@@ -90,6 +90,7 @@ private[csv] object CSVInferSchema {
// DecimalTypes have different precisions and scales, so we try to find the common type.
findTightestCommonType(typeSoFar, tryParseDecimal(field, options)).getOrElse(StringType)
case DoubleType => tryParseDouble(field, options)
case DateType => tryParseDate(field, options)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also is a behavior change. Shall we document it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do it, but where exactly it should be documented?

@sergey-rubtsov
Copy link
Author

@HyukjinKwon please, clarify, how to add a configuration to control this behaviour? Do you mean to keep backward compatibility?
But it seems, now, we don't handle a Spark "DataType" at all.
I can change the behaviour in next way:
Hanlde "DataType" if option "dateFormat" was set by customer, if it is not set, ignore it at all, is it okay?

@mgaido91
Copy link
Contributor

@sergey-rubtsov we have to keep backward compatibility. If a user upgrades, with your change a running application may break because of data not being anymore timestamp but date. We can add a new entry in SQLConf for that.

Moreover, we have to mention in the migration guide all the behavioral changes we introduce.

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Jun 9, 2018

Test build #91599 has finished for PR 21363 at commit 65179a2.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we can probably use Java 8 Date classes now instead of 3rd party ones, but can we do so without a behavior change? is it actually required?

@HyukjinKwon
Copy link
Member

Looks difficult because the behaviours themselves are different. One possibility is a fallback and the other possibility is configuration.

@HyukjinKwon
Copy link
Member

I am going to suggest to close this since it's being active more then few weeks. It should be good to fix. Let me leave some cc's who might be interested in this just FYI. Feel free to take over this when you guys find some time or are interested in this.

Adding @xuanyuanking, @mgaido91, @viirya, @MaxGekk, @softmanu who I could think of for now. Feel free to ignore my cc if you guys are busy or having more important fixes you guys are working on.

@xuanyuanking
Copy link
Member

@HyukjinKwon Great thanks for ping me, I'll try to work on this and cc all reviewer in this PR.

@MaxGekk
Copy link
Member

MaxGekk commented Nov 18, 2018

@xuanyuanking Are you still working on this? If not, I could continue.

@xuanyuanking
Copy link
Member

@MaxGekk Sorry for the late, something inserted in the my scheduler, I plan to start this PR in this weekend, if its too late please just take it, sorry for the late again.

@MaxGekk
Copy link
Member

MaxGekk commented Nov 26, 2018

Here is the PR: #23150 which allows to switch on java.time API (in CSV so far).

@srowen
Copy link
Member

srowen commented Dec 4, 2018

@MaxGekk now that your change is merge, can this proceed, @xuanyuanking ? or is it obsolete?

@MaxGekk
Copy link
Member

MaxGekk commented Dec 4, 2018

@srowen There is another PR with related changes - inferring DateType from CSV: #23202

@HyukjinKwon
Copy link
Member

I'm going to close this since @sergey-rubtsov is inactive.

Most of feasible changes extracted from here are merged. One last part is date type inference which's going to need some more discussions (IMHO).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants