[SPARK-29486][SQL] CalendarInterval should have 3 fields: months, days and microseconds #26134
Conversation
throw new IllegalArgumentException(s"Doesn't support month or year interval: $interval")
if (cal.months > 0 || cal.days > 0) {
  throw new IllegalArgumentException(
    s"Doesn't support day, week, month or year interval: $interval")
This will fail a current valid usage: Trigger.ProcessingTime("10 day")
Maybe we should keep converting day intervals, using a 24 hours = 1 day assumption
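A minimal sketch of that assumption (class and method names are made up for illustration; this is not Spark's Trigger code):

```java
// Hypothetical sketch of the "24 hours = 1 day" assumption for processing-time
// triggers, given a (months, days, microseconds) interval.
public class TriggerDaysSketch {
    static final long MICROS_PER_HOUR = 60L * 60 * 1000 * 1000;
    static final long MICROS_PER_DAY = 24L * MICROS_PER_HOUR;

    // Convert an interval to trigger microseconds.
    static long triggerMicros(int months, int days, long micros) {
        if (months != 0) {
            // Month/year intervals stay unsupported: their length is not fixed.
            throw new IllegalArgumentException("Doesn't support month or year interval");
        }
        // Treat each day as exactly 24 hours so "10 days" keeps working.
        return days * MICROS_PER_DAY + micros;
    }

    public static void main(String[] args) {
        System.out.println(triggerMicros(0, 10, 0)); // 864000000000
    }
}
```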
assert(timestampAddInterval(ts3, 36, 0, 123000, TimeZoneGMT.toZoneId) === ts5)
}

test("timestamp add days") {
A test case to demonstrate the daylight-saving case.
Let's create a ticket in https://issues.apache.org/jira/secure/Dashboard.jspa and put it in the title. You can follow other PRs: https://github.com/apache/spark/pulls
Sorry, my mistake. The ticket ID is attached now.
ok to test
Test build #112141 has finished for PR 26134 at commit
retest this please
microseconds += toLong(m.group(6)) * MICROS_PER_MINUTE;
microseconds += toLong(m.group(7)) * MICROS_PER_SECOND;
microseconds += toLong(m.group(8)) * MICROS_PER_MILLI;
microseconds += toLong(m.group(9));
return new CalendarInterval((int) months, microseconds);
return new CalendarInterval((int) months, (int) days, microseconds);
I think it is better to catch int overflow with Math.toIntExact instead of silently ignoring it.
or maybe we can change this to long
The valid range of dates/timestamps is [0..10000). So we should support days up to 10000 * 366 = 3660000. Even int is much more than that, but it is OK.
@@ -195,7 +194,7 @@ public static CalendarInterval fromDayTimeString(String s, String from, String t
} else {
  try {
    int sign = m.group(1) != null && m.group(1).equals("-") ? -1 : 1;
long days = m.group(2) == null ? 0 : toLongWithRange("day", m.group(3),
int days = m.group(2) == null ? 0 : (int) toLongWithRange("day", m.group(3),
Use Math.toIntExact here and in other similar places.
Can we use Math.toIntExact for only the days field? As you know, there are long -> int casts in the months field as well; should we cover both of them?
This is OK here: we call toLongWithRange with Integer.MAX_VALUE as the upper bound, so it's safe to cast to int.
@@ -332,31 +331,35 @@ public static long parseSecondNano(String secondNano) throws IllegalArgumentExce
}

public final int months;
public final int days;
public final long microseconds;

public long milliseconds() {
The method will now return different values, because you exclude days from microseconds.
@@ -132,16 +132,17 @@ private void writeUnalignedBytes(

public final void write(int ordinal, CalendarInterval input) {
  // grow the global buffer before writing data.
  grow(16);
  grow(24);

// Write the months and microseconds fields of Interval to the variable length portion.
and the days field
thx
// Write the months and microseconds fields of Interval to the variable length portion.
Platform.putLong(getBuffer(), cursor(), input.months);
Platform.putLong(getBuffer(), cursor() + 8, input.microseconds);
Platform.putLong(getBuffer(), cursor() + 8, input.days);
Just in case: why do we put a long if days is an int?
private val backedSequenceImpl = new IntegralSequenceImpl[T](dt)
private val microsPerMonth = 28 * CalendarInterval.MICROS_PER_DAY
private val microsPerDay = 24 * CalendarInterval.MICROS_PER_HOUR
private val microsPerMonth = 28 * microsPerDay
Not related to the PR, but it is interesting why it is 28 days per month here.
The code uses microsPerMonth to estimate the array length first. Using the minimum number of days in one month makes sure the array is long enough (fewer days means a smaller step in microseconds, which means a longer estimated array length).
Not related to this PR, but it would be great if we could add a comment explaining this in the code.
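A simplified sketch of the estimate being discussed (assumed constants and formula, not the actual Sequence code): underestimating the step overestimates the element count, so the preallocated array can never be too short.

```java
// Why a 28-day (minimum-length) month is safe for the length estimate.
public class SequenceLengthSketch {
    static final long MICROS_PER_DAY = 24L * 60 * 60 * 1000 * 1000;
    static final long MICROS_PER_MONTH = 28 * MICROS_PER_DAY; // minimum month length

    // Estimated number of elements for a timestamp sequence with a
    // (stepMonths, stepMicros) interval step.
    static long estimatedLength(long startMicros, long stopMicros,
                                int stepMonths, long stepMicros) {
        long stepInMicros = stepMicros + stepMonths * MICROS_PER_MONTH;
        return (stopMicros - startMicros) / stepInMicros + 1;
    }

    public static void main(String[] args) {
        // Monthly steps across one 365-day year: the estimate (14) is never
        // smaller than the actual number of generated elements.
        System.out.println(estimatedLength(0L, 365 * MICROS_PER_DAY, 1, 0L));
    }
}
```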
@@ -238,7 +238,7 @@ from interval_arithmetic
-- !query 17 schema
struct<tsval:timestamp,CAST(tsval - interval 14 weeks 1 days 11 hours 22 minutes 33 seconds 123 milliseconds 456 microseconds AS TIMESTAMP):timestamp,CAST(tsval - interval -14 weeks -1 days -11 hours -22 minutes -33 seconds -123 milliseconds -456 microseconds AS TIMESTAMP):timestamp,CAST(tsval + interval 14 weeks 1 days 11 hours 22 minutes 33 seconds 123 milliseconds 456 microseconds AS TIMESTAMP):timestamp,CAST(tsval + interval -14 weeks -1 days -11 hours -22 minutes -33 seconds -123 milliseconds -456 microseconds AS TIMESTAMP):timestamp,CAST(tsval + (- interval 14 weeks 1 days 11 hours 22 minutes 33 seconds 123 milliseconds 456 microseconds) AS TIMESTAMP):timestamp,CAST(tsval + interval 14 weeks 1 days 11 hours 22 minutes 33 seconds 123 milliseconds 456 microseconds AS TIMESTAMP):timestamp>
-- !query 17 output
2012-01-01 00:00:00 2011-09-23 13:37:26.876544 2012-04-09 12:22:33.123456 2012-04-09 12:22:33.123456 2011-09-23 13:37:26.876544 2011-09-23 13:37:26.876544 2012-04-09 12:22:33.123456
2012-01-01 00:00:00 2011-09-23 12:37:26.876544 2012-04-09 11:22:33.123456 2012-04-09 11:22:33.123456 2011-09-23 12:37:26.876544 2011-09-23 12:37:26.876544 2012-04-09 11:22:33.123456
Why is there a difference of 1 hour?
As you know, it's caused by daylight saving.
}

@Override
public int hashCode() {
  return 31 * months + (int) microseconds;
  return 31 * (31 * months + days) + (int) microseconds;
Not related to the PR but I am wondering why the most significant bits of microseconds are completely excluded from the hash.
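The exclusion comes from the narrowing cast, which keeps only the low 32 bits; folding the high word in would look like Long.hashCode. A small illustration with a made-up value:

```java
// What (int) microseconds drops versus Long.hashCode; illustration only.
public class HashBitsDemo {
    static final long MICROS = (1L << 40) + 7;

    // (int) microseconds keeps only the low 32 bits: the 2^40 part vanishes.
    static final int TRUNCATED = (int) MICROS;      // 7

    // Long.hashCode mixes the high word back in: (int) (v ^ (v >>> 32)).
    static final int MIXED = Long.hashCode(MICROS); // 263

    public static void main(String[] args) {
        System.out.println(TRUNCATED + " " + MIXED);
    }
}
```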
Yes, the actual duration of the daylight-saving day can be 25 hours, but logically any day has 24 hours. As you point out, the daylight saving change in PST will happen at
At the second time:
Could you elaborate on how your changes resolve this situation?
One day is always 24 hours, but a one-day interval may not be. Daylight saving doesn't change the number of hours in a day, but the timezone offset of a region. For example, California is in UTC-7 right now, but will be in UTC-8 after 3 Nov 2019. So adding one day to
However, I'd like to know if it's allowed to set the timezone to a region name that has an undetermined timezone offset. If it's not allowed, then a one-day interval is always 24 hours in SQL.
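The offset change can be seen with plain java.time (a standalone illustration of the point above, not Spark code):

```java
import java.time.Duration;
import java.time.Period;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Calendar "one day" vs physical "24 hours" across the 2019-11-03 DST change.
public class DstAddDemo {
    static final ZoneId LA = ZoneId.of("America/Los_Angeles");
    static final ZonedDateTime START =
        ZonedDateTime.of(2019, 11, 2, 12, 0, 0, 0, LA);           // UTC-7

    // "One day later" keeps the local wall-clock time across the DST change...
    static final ZonedDateTime PLUS_ONE_DAY = START.plus(Period.ofDays(1));

    // ...while 24 physical hours later lands one local hour earlier.
    static final ZonedDateTime PLUS_24H = START.plus(Duration.ofHours(24));

    public static void main(String[] args) {
        System.out.println(PLUS_ONE_DAY.getHour()); // 12
        System.out.println(PLUS_24H.getHour());     // 11
        System.out.println(Duration.between(START, PLUS_ONE_DAY).toHours()); // 25
    }
}
```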
As I can understand, new
and to get
Allowed by whom (the SQL standard?)? For example, the Java 8 time API allows getting 0, 1, or 2 zone offsets depending on the local timestamp, see https://docs.oracle.com/javase/8/docs/api/java/time/zone/ZoneRules.html#getValidOffsets-java.time.LocalDateTime- :
The SQL standard defines time zones as TIMEZONE_HOUR:TIMEZONE_MINUTE. Due to this, there is no issue of converting a zone id (region name) to offsets.
Test build #112154 has finished for PR 26134 at commit
Since SQL always has determined zone offsets, a day always has 24 hours, and interval hours are in
Current implementation of
Taking into account that local hours of timestamps are always in the range of
@MaxGekk @cloud-fan thanks for your review, here are my answers to the questions:
My solution won't fix this, but the original code doesn't either. Actually, I don't think this matters.
The first case is adding
The second case is in SS: if we do a window agg like this:
Since this function considers the timezone, in the current implementation we may have a window like this:
@@ -365,12 +368,12 @@ public boolean equals(Object other) {
if (other == null || !(other instanceof CalendarInterval)) return false;

CalendarInterval o = (CalendarInterval) other;
return this.months == o.months && this.microseconds == o.microseconds;
return this.months == o.months && this.days == o.days && this.microseconds == o.microseconds;
Is this line changed manually? We'd better use the IDE to generate equals and hashCode.
I spent more time thinking about it. There are 3 kinds of timestamps in the SQL world:
However, things are different in the Java world. The Java time API uses
@MaxGekk do you think Spark violates the SQL standard if
// about a month length in microseconds
val intervalStepInMicros = stepMicros + stepMonths * microsPerMonth
// about a month length in days and a day length in microseconds
// We choose a minimum days(28) in one month to make sure the estimated
can we move this comment to https://github.com/apache/spark/pull/26134/files#diff-9853dcf5ce3d2ac1e94d473197ff5768R2617 ?
@@ -87,6 +87,7 @@ object IntervalUtils {
// Returns total number of seconds with microseconds fractional part in the given interval.
def getEpoch(interval: CalendarInterval): Decimal = {
  var result = interval.microseconds
  result += DateTimeUtils.MICROS_PER_DAY * interval.days
  result += MICROS_PER_YEAR * (interval.months / MONTHS_PER_YEAR)
  result += MICROS_PER_MONTH * (interval.months % MONTHS_PER_YEAR)
Not related to this PR, but isn't it simply result += MICROS_PER_MONTH * interval.months?
It could be, if MICROS_PER_YEAR were equal to 12 * MICROS_PER_MONTH. I strictly followed the PostgreSQL implementation: https://github.com/postgres/postgres/blob/bffe1bd68457e43925c362d8728ce3b25bdf1c94/src/backend/utils/adt/timestamp.c#L5016-L5022
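The two formulas differ because a year is not 12 "standard" months long. A sketch with constants assumed here to mirror the PostgreSQL convention (365.25-day year, 30-day month):

```java
// Why year-then-month differs from a flat per-month conversion.
public class EpochMonthsDemo {
    static final long MICROS_PER_DAY = 24L * 60 * 60 * 1000 * 1000;
    static final long MICROS_PER_MONTH = 30L * MICROS_PER_DAY;          // 30-day month
    static final long MICROS_PER_YEAR = 36525L * MICROS_PER_DAY / 100;  // 365.25-day year

    static final int MONTHS = 14; // i.e. 1 year 2 months

    // Year-then-month, as in getEpoch above:
    static final long SPLIT =
        MICROS_PER_YEAR * (MONTHS / 12) + MICROS_PER_MONTH * (MONTHS % 12);
    // The "simpler" flat version:
    static final long FLAT = MICROS_PER_MONTH * MONTHS;

    public static void main(String[] args) {
        // They disagree by 5.25 days per whole year in the interval,
        // because MICROS_PER_YEAR != 12 * MICROS_PER_MONTH (365.25 vs 360 days).
        System.out.println((SPLIT - FLAT) / (double) MICROS_PER_DAY); // 5.25
    }
}
```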
Force-pushed from 336f0db to 4e97bc0
Test build #112813 has finished for PR 26134 at commit
Test build #112814 has finished for PR 26134 at commit
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
Test build #113013 has finished for PR 26134 at commit
micros = Math.addExact(micros, Math.multiplyExact(hours, MICROS_PER_HOUR))
micros = Math.addExact(micros, Math.multiplyExact(minutes, MICROS_PER_MINUTE))
micros = Math.addExact(micros, Math.multiplyExact(seconds, DateTimeUtils.MICROS_PER_SECOND))
new CalendarInterval(0, sign * micros)
new CalendarInterval(0, sign * days.toInt, sign * micros)
nit: we can cast to int at the beginning
val days = if (m.group(2) == null) {
0
} else {
toLongWithRange("day", m.group(3), 0, Integer.MAX_VALUE).toInt
}
sure, thx
StringType),
"interval 1 years 3 months -3 days")
"interval 1 years 3 months 1 weeks 2 days -3 hours")
@MaxGekk @dongjoon-hyun this is another instance we can think of: I feel 9 days is more readable than 1 week 2 days.
LGTM except a few minor comments, thanks for your work and patience @LinhongLiu!
@@ -2702,7 +2715,10 @@ object Sequence {
| $arr[$i] = ($elemType) ($t / ${scale}L);
| $i += 1;
| $t = org.apache.spark.sql.catalyst.util.DateTimeUtils.timestampAddInterval(
| $startMicros, $i * $stepMonths, $i * $stepMicros, $zid);
| $startMicros,
| new org.apache.spark.unsafe.types.CalendarInterval(
Sorry I missed this place. Creating an object every time has a performance penalty. Let's go back to the previous version. Sorry for the back and forth!
No problem. I reverted the changes.
Test build #113034 has finished for PR 26134 at commit
Force-pushed from fb1591e to 2dd8e43
Test build #113061 has finished for PR 26134 at commit
Force-pushed from 2dd8e43 to 2f90189
Test build #113073 has finished for PR 26134 at commit
@cloud-fan tests all pass, could you please check this PR again?
thanks, merging to master!
Thanks @cloud-fan @xuanyuanking @MaxGekk for your reviews.
public final long microseconds;

public long milliseconds() {
  return this.microseconds / MICROS_PER_MILLI;
}

public CalendarInterval(int months, long microseconds) {
public CalendarInterval(int months, int days, long microseconds) {
@LinhongLiu Could you send out a follow-up PR to document why we need days and how we use it in this file? This is definitely worth documenting. Otherwise, everyone reading this class may need to git blame and go through the long discussion in this PR.
Sure, will do.
What changes were proposed in this pull request?
Follow-up of #26134 to document the reason for adding the days field and explain how we use it.
Why are the changes needed?
Only a comment.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
No test needed.
Closes #26701 from LinhongLiu/spark-29486-followup.
Authored-by: Liu,Linhong <liulinhong@baidu.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
What changes were proposed in this pull request?
The current CalendarInterval has 2 fields: months and microseconds. This PR changes it
to 3 fields: months, days and microseconds. This is because one logical day interval may
have a different number of microseconds (daylight saving).
Why are the changes needed?
One logical day interval may have a different number of microseconds (daylight saving).
For example, in the PST timezone, there are 25 hours from 2019-11-02 12:00:00 to
2019-11-03 12:00:00.
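That 25-hour day can be checked directly with java.time (illustration only, not Spark code):

```java
import java.time.Duration;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// The "25-hour day" from the description above.
public class DstDayLengthDemo {
    static final ZoneId PST = ZoneId.of("America/Los_Angeles");
    static final ZonedDateTime FROM = ZonedDateTime.of(2019, 11, 2, 12, 0, 0, 0, PST);
    static final ZonedDateTime TO = ZonedDateTime.of(2019, 11, 3, 12, 0, 0, 0, PST);

    public static void main(String[] args) {
        // DST ends on 2019-11-03, so the same wall-clock time "next day"
        // is 25 real hours away.
        System.out.println(Duration.between(FROM, TO).toHours()); // 25
    }
}
```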
Does this PR introduce any user-facing change?
no
How was this patch tested?
Unit tests and newly added test cases.