
[SPARK-29486][SQL] CalendarInterval should have 3 fields: months, days and microseconds #26134

Closed · wants to merge 22 commits

Conversation

LinhongLiu
Contributor

What changes were proposed in this pull request?

The current CalendarInterval has 2 fields: months and microseconds. This PR changes it
to 3 fields: months, days and microseconds, because one logical day interval may
contain a different number of microseconds (daylight saving time).

Why are the changes needed?

One logical day interval may contain a different number of microseconds (daylight saving time).
For example, in the PST timezone, there are 25 hours from 2019-11-02 12:00:00 to
2019-11-03 12:00:00.
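This behaviour can be reproduced with plain java.time, independent of Spark (a minimal sketch; America/Los_Angeles stands in for PST):

```java
import java.time.Duration;
import java.time.ZoneId;
import java.time.ZonedDateTime;

ZoneId pst = ZoneId.of("America/Los_Angeles");
ZonedDateTime start = ZonedDateTime.of(2019, 11, 2, 12, 0, 0, 0, pst);

// One *calendar* day keeps the local wall-clock time: 2019-11-03 12:00.
ZonedDateTime plusOneDay = start.plusDays(1);

// A fixed 24-hour duration lands an hour earlier, because DST ends on
// 2019-11-03 and that local day lasts 25 hours.
ZonedDateTime plus24Hours = start.plus(Duration.ofHours(24));

System.out.println(plusOneDay.toLocalTime());                      // 12:00
System.out.println(plus24Hours.toLocalTime());                     // 11:00
System.out.println(Duration.between(start, plusOneDay).toHours()); // 25
```

This is exactly the distinction the separate days field is meant to preserve: "1 day" as a calendar adjustment vs "24 hours" as a physical duration.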

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing unit tests and newly added test cases

-      throw new IllegalArgumentException(s"Doesn't support month or year interval: $interval")
+      if (cal.months > 0 || cal.days > 0) {
+        throw new IllegalArgumentException(
+          s"Doesn't support day, week, month or year interval: $interval")
Contributor Author

This will fail a currently valid usage: Trigger.ProcessingTime("10 day").
Maybe we should keep converting day intervals using the assumption that 1 day = 24 hours.
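A minimal sketch of that fallback (the variable names and flow here are hypothetical, not Spark's actual trigger code): a day-only interval is converted to a fixed duration under the 24-hours-per-day assumption, while month intervals are still rejected.

```java
// Hypothetical conversion of a "10 day" interval under 1 day = 24 hours.
long MICROS_PER_HOUR = 3_600_000_000L;
long MICROS_PER_DAY = 24 * MICROS_PER_HOUR;

int months = 0;           // Trigger.ProcessingTime("10 day") has no months
int days = 10;
long microseconds = 0;

if (months != 0) {
    // Month intervals have no fixed duration, so they stay unsupported.
    throw new IllegalArgumentException("Doesn't support month or year interval");
}
long triggerMicros = days * MICROS_PER_DAY + microseconds;
System.out.println(triggerMicros); // 864000000000
```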

assert(timestampAddInterval(ts3, 36, 0, 123000, TimeZoneGMT.toZoneId) === ts5)
}

test("timestamp add days") {
Contributor Author

test case to demonstrate the daylight saving case

@LinhongLiu LinhongLiu changed the title Calendarinterval CalendarInterval should have 3 fields: months, days and microseconds Oct 16, 2019
@LinhongLiu
Contributor Author

cc @xuanyuanking @cloud-fan

@cloud-fan
Contributor

Let's create a ticket in https://issues.apache.org/jira/secure/Dashboard.jspa and put it in the title. You can follow other PRs: https://github.com/apache/spark/pulls

@LinhongLiu LinhongLiu changed the title CalendarInterval should have 3 fields: months, days and microseconds [SPARK-29486][SQL] CalendarInterval should have 3 fields: months, days and microseconds Oct 16, 2019
@LinhongLiu
Contributor Author

Let's create a ticket in https://issues.apache.org/jira/secure/Dashboard.jspa and put it in the title. You can follow other PRs: https://github.com/apache/spark/pulls

Sorry, my mistake. The ticket id is attached now.

@cloud-fan
Contributor

ok to test

@SparkQA

SparkQA commented Oct 16, 2019

Test build #112141 has finished for PR 26134 at commit 3d954d6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member

retest this please

microseconds += toLong(m.group(6)) * MICROS_PER_MINUTE;
microseconds += toLong(m.group(7)) * MICROS_PER_SECOND;
microseconds += toLong(m.group(8)) * MICROS_PER_MILLI;
microseconds += toLong(m.group(9));
-    return new CalendarInterval((int) months, microseconds);
+    return new CalendarInterval((int) months, (int) days, microseconds);
Member

I think it is better to catch int overflow with Math.toIntExact instead of silently ignoring it.

Contributor Author

or maybe we can change this to long

Member

The valid range of dates/timestamps is [0..10000). So, we should support days up to 10000 * 366 = 3660000. Even int is much more than needed, but it is ok.
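A quick sketch of the two behaviours under discussion: a plain (int) cast silently wraps on overflow, while Math.toIntExact throws; and the maximum day count above fits comfortably in an int.

```java
long maxDays = 10_000L * 366;       // 3_660_000, well below Integer.MAX_VALUE
int ok = Math.toIntExact(maxDays);  // fits, no exception

long tooMany = 5_000_000_000L;
int wrapped = (int) tooMany;        // silently wraps to a wrong value
boolean threw = false;
try {
    Math.toIntExact(tooMany);       // reports the overflow instead
} catch (ArithmeticException e) {
    threw = true;
}
System.out.println(ok);      // 3660000
System.out.println(wrapped); // 705032704
System.out.println(threw);   // true
```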

@@ -195,7 +194,7 @@ public static CalendarInterval fromDayTimeString(String s, String from, String t
} else {
try {
int sign = m.group(1) != null && m.group(1).equals("-") ? -1 : 1;
-    long days = m.group(2) == null ? 0 : toLongWithRange("day", m.group(3),
+    int days = m.group(2) == null ? 0 : (int) toLongWithRange("day", m.group(3),
Member

Use Math.toIntExact here and in other similar places.

Contributor Author

Can we use Math.toIntExact for only the days field? As you know, there are long -> int casts in the months field as well; should we cover both of them?

Contributor

This is OK here; we call toLongWithRange with Integer.MAX_VALUE, so it's safe to cast to int.

@@ -332,31 +331,35 @@ public static long parseSecondNano(String secondNano) throws IllegalArgumentExce
}

public final int months;
public final int days;
public final long microseconds;

public long milliseconds() {
Member

The method will return different values because you exclude days from microseconds

@@ -132,16 +132,17 @@ private void writeUnalignedBytes(

public final void write(int ordinal, CalendarInterval input) {
// grow the global buffer before writing data.
-    grow(16);
+    grow(24);

// Write the months and microseconds fields of Interval to the variable length portion.
Member

and the days field

Contributor Author

thx


// Write the months and microseconds fields of Interval to the variable length portion.
Platform.putLong(getBuffer(), cursor(), input.months);
-    Platform.putLong(getBuffer(), cursor() + 8, input.microseconds);
+    Platform.putLong(getBuffer(), cursor() + 8, input.days);
Member

Just in case, why do we put a long if days is an int?


private val backedSequenceImpl = new IntegralSequenceImpl[T](dt)
-  private val microsPerMonth = 28 * CalendarInterval.MICROS_PER_DAY
+  private val microsPerDay = 24 * CalendarInterval.MICROS_PER_HOUR
+  private val microsPerMonth = 28 * microsPerDay
Member

Not related to the PR, but it is interesting why there are 28 days per month here.

Contributor Author

The code uses microsPerMonth to estimate the array length first. Using the minimum number of days in one month makes sure the array is long enough (fewer days means smaller steps in micros, which means a longer estimated array length).
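A rough illustration of that estimate (the constants and span below are assumed for the sketch, not taken from Spark): dividing the same time span by a smaller assumed month length yields a larger, and therefore safe, length estimate.

```java
long MICROS_PER_DAY = 24L * 3_600_000_000L;
long spanMicros = 366L * MICROS_PER_DAY;   // span of about one year

// Step of 1 month, estimated with different assumed month lengths.
long estWith28Days = spanMicros / (28 * MICROS_PER_DAY); // over-allocates
long estWith31Days = spanMicros / (31 * MICROS_PER_DAY); // could fall short

System.out.println(estWith28Days); // 13
System.out.println(estWith31Days); // 11
```

Over-allocating a few slots is harmless; under-allocating would truncate the generated sequence.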

Contributor

Not related to this PR, but it would be great if we could add a comment explaining this in the code.

@@ -238,7 +238,7 @@ from interval_arithmetic
-- !query 17 schema
struct<tsval:timestamp,CAST(tsval - interval 14 weeks 1 days 11 hours 22 minutes 33 seconds 123 milliseconds 456 microseconds AS TIMESTAMP):timestamp,CAST(tsval - interval -14 weeks -1 days -11 hours -22 minutes -33 seconds -123 milliseconds -456 microseconds AS TIMESTAMP):timestamp,CAST(tsval + interval 14 weeks 1 days 11 hours 22 minutes 33 seconds 123 milliseconds 456 microseconds AS TIMESTAMP):timestamp,CAST(tsval + interval -14 weeks -1 days -11 hours -22 minutes -33 seconds -123 milliseconds -456 microseconds AS TIMESTAMP):timestamp,CAST(tsval + (- interval 14 weeks 1 days 11 hours 22 minutes 33 seconds 123 milliseconds 456 microseconds) AS TIMESTAMP):timestamp,CAST(tsval + interval 14 weeks 1 days 11 hours 22 minutes 33 seconds 123 milliseconds 456 microseconds AS TIMESTAMP):timestamp>
-- !query 17 output
2012-01-01 00:00:00 2011-09-23 13:37:26.876544 2012-04-09 12:22:33.123456 2012-04-09 12:22:33.123456 2011-09-23 13:37:26.876544 2011-09-23 13:37:26.876544 2012-04-09 12:22:33.123456
2012-01-01 00:00:00 2011-09-23 12:37:26.876544 2012-04-09 11:22:33.123456 2012-04-09 11:22:33.123456 2011-09-23 12:37:26.876544 2011-09-23 12:37:26.876544 2012-04-09 11:22:33.123456
Member

Why is there a difference of 1 hour?

Contributor Author

As you know, it's caused by daylight saving time.

}

@Override
public int hashCode() {
-    return 31 * months + (int) microseconds;
+    return 31 * (31 * months + days) + (int) microseconds;
Member

Not related to the PR but I am wondering why the most significant bits of microseconds are completely excluded from the hash.
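That is just the effect of the narrowing (int) cast: the upper 32 bits of microseconds never reach the hash. A tiny illustration:

```java
long a = 42L;
long b = (1L << 40) + 42L;   // differs from a only in the high bits

// The narrowing cast keeps only the low 32 bits, so both contribute the
// same value to the hash even though the longs differ.
System.out.println((int) a); // 42
System.out.println((int) b); // 42
```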

@MaxGekk
Member

MaxGekk commented Oct 16, 2019

Why are the changes needed?
One logical day interval may have different number of microseconds (daylight saving).
For example, in PST timezone, there will be 25 hours from 2019-11-2 12:00:00 to
2019-11-3 12:00:00

Yes, the actual duration of the daylight saving day can be 25 hours, but logically any day has 24 hours. The Spark INTERVAL type does not define a duration; it is an adjustment of local date-time. After your changes, it adjusts logical (local) months and days, and for some reason the physical time duration of a day. It looks inconsistent, doesn't it?

As you point out, the daylight saving change in PST will happen at 2019-11-03 02:00:00. So, the next timestamp with granularity of 1 microsecond is 2019-11-03 01:00:01. Even more, the local timestamp 2019-11-03 02:00:00 will be shown on the local clock twice.
At the first time:

timestamp'2019-11-03 02:00:00' + interval 1 microsecond = timestamp'2019-11-03 01:00:00.000001'

At the second time:

timestamp'2019-11-03 02:00:00' + interval 1 microsecond = timestamp'2019-11-03 02:00:00.000001'

Could you elaborate how your changes resolve this situation?

@cloud-fan
Contributor

One day is always 24 hours, but one day interval may not be. Daylight saving doesn't change the number of hours in a day, but the timezone offset of a region. For example, California is in UTC-7 right now, but will be in UTC-8 after 3 Nov 2019. So adding one day to 3 Nov 2019 at local time may be different from adding 24 hours.

However, I'd like to know if it's allowed to set the timezone to a region name that has an undetermined timezone offset. If it's not allowed, then a one day interval is always 24 hours in SQL.

@MaxGekk
Member

MaxGekk commented Oct 16, 2019

... but one day interval may not.

As I understand it, the new microseconds field means the actual time duration inside a day, correct? Back to the example, it means:

timestamp'2019-11-03 01:59:59 PST' + interval 2 seconds = timestamp'2019-11-03 01:00:01 PST'

and to get timestamp'2019-11-03 02:00:01 PST', we should add another interval:

timestamp'2019-11-03 01:59:59 PST' + interval 1 hour 2 seconds = timestamp'2019-11-03 02:00:01 PST'

However, I'd like to know if it's allowed to set timezone to a region name that has undetermined timezone offset.

Allowed by whom (the SQL standard?)? For example, the Java 8 time API allows getting 0, 1 or 2 zone offsets depending on the local timestamp, see https://docs.oracle.com/javase/8/docs/api/java/time/zone/ZoneRules.html#getValidOffsets-java.time.LocalDateTime- :

The mapping from a local date-time to an offset is not straightforward. There are three cases:
 -   Normal, with one valid offset. For the vast majority of the year, the normal case applies, where there is a single valid offset for the local date-time.
 -   Gap, with zero valid offsets. This is when clocks jump forward typically due to the spring daylight savings change from "winter" to "summer". In a gap there are local date-time values with no valid offset.
 -   Overlap, with two valid offsets. This is when clocks are set back typically due to the autumn daylight savings change from "summer" to "winter". In an overlap there are local date-time values with two valid offsets.
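These three cases can be checked directly with java.time (a standalone sketch using the 2019 US DST dates for America/Los_Angeles):

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.zone.ZoneRules;

ZoneRules rules = ZoneId.of("America/Los_Angeles").getRules();

// Normal: a single valid offset for most of the year.
int normal = rules.getValidOffsets(LocalDateTime.of(2019, 6, 1, 12, 0)).size();

// Gap: clocks jumped forward on 2019-03-10, so 02:30 never existed.
int gap = rules.getValidOffsets(LocalDateTime.of(2019, 3, 10, 2, 30)).size();

// Overlap: clocks fell back on 2019-11-03, so 01:30 occurred twice.
int overlap = rules.getValidOffsets(LocalDateTime.of(2019, 11, 3, 1, 30)).size();

System.out.println(normal);  // 1
System.out.println(gap);     // 0
System.out.println(overlap); // 2
```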

If it's not allowed then one day interval is always 24 hours in SQL.

The SQL standard defines time zones as TIMEZONE_HOUR:TIMEZONE_MINUTE. Due to this, there is no issue of converting zone id (region name) to offsets.

@SparkQA

SparkQA commented Oct 16, 2019

Test build #112154 has finished for PR 26134 at commit 3d954d6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member

MaxGekk commented Oct 16, 2019

Since SQL always has determinate zone offsets, a day always has 24 hours, and interval hours are in 0..23:
[screenshot of the SQL standard's interval definition]

The current implementation of CalendarIntervalType is a correct union of the SQL year-month INTERVAL and day-time INTERVAL. Also, there is no issue with adding intervals to timestamps on a daylight saving day if we assume that the time part of intervals is just a "logical" adjustment.

Taking into account that local hours of timestamps are always in the range [0, 23], @LinhongLiu could you show a concrete example where the old (current) implementation returns wrong results and your changes fix that? It would be nice to put such an example in the description of this PR.

@LinhongLiu
Contributor Author

LinhongLiu commented Oct 16, 2019

@MaxGekk @cloud-fan thanks for your review; here are my answers to the questions:

local timestamp 2019-11-03 02:00:00 will be showed on local clock twice

My solution won't fix this, but neither does the original code. Actually, I don't think this matters.

could you show concrete example when old (current) implementation returns wrong results, and your changes fix that

The first case: adding 1 day to 2019-11-02 12:00:00 PST in the current code results in 2019-11-03 11:00:00 PST. But as you said, this doesn't produce wrong results.

The second case is in Structured Streaming: if we do a window aggregation like this:

df.groupBy(window($"time", "1 day")).sum

Since this function takes the timezone into account, in the current implementation we may get windows like this:
2019-11-01 00:00:00 - 2019-11-01 23:59:59
2019-11-02 00:00:00 - 2019-11-02 23:59:59
2019-11-03 00:00:00 - 2019-11-03 22:59:59
2019-11-03 23:00:00 - 2019-11-04 22:59:59
I don't think users want to see aggregations over this kind of window.
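The shifted windows above can be reproduced with java.time: fixed 24-hour steps from a local midnight drift by an hour once the 25-hour DST day passes (a standalone sketch, not the Spark windowing code):

```java
import java.time.Duration;
import java.time.ZoneId;
import java.time.ZonedDateTime;

ZoneId pst = ZoneId.of("America/Los_Angeles");
ZonedDateTime w0 = ZonedDateTime.of(2019, 11, 2, 0, 0, 0, 0, pst);

// Before DST ends, a 24-hour step still lands on a local midnight.
ZonedDateTime w1 = w0.plus(Duration.ofHours(24));
// After the 25-hour day of 2019-11-03, the boundary drifts to 23:00.
ZonedDateTime w2 = w1.plus(Duration.ofHours(24));

System.out.println(w1.toLocalDateTime()); // 2019-11-03T00:00
System.out.println(w2.toLocalDateTime()); // 2019-11-03T23:00
```

Stepping with a calendar-day interval (the days field) instead of a fixed 24-hour duration keeps every boundary at local midnight.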

@@ -365,12 +368,12 @@ public boolean equals(Object other) {
if (other == null || !(other instanceof CalendarInterval)) return false;

CalendarInterval o = (CalendarInterval) other;
-    return this.months == o.months && this.microseconds == o.microseconds;
+    return this.months == o.months && this.days == o.days && this.microseconds == o.microseconds;
Member

Was this line changed manually? We'd better use the IDE to generate equals and hashCode.

@cloud-fan
Contributor

I spent more time thinking about it. There are 3 kinds of timestamps in the SQL world:

  1. TIMESTAMP WITHOUT TIMEZONE: It's timezone-agnostic, and is simply a combination of the datetime fields YEAR, MONTH, DAY, HOUR, MINUTE and SECOND. So adding one day is the same as adding 24 hours.
  2. TIMESTAMP WITH TIMEZONE: It's similar to the above one but has a timezone associated with each value. Since SQL spec requires a timezone offset (TIMEZONE_HOUR:TIMEZONE_MINUTE), there is no ambiguity and adding 1 day is the same as adding 24 hours.
  3. TIMESTAMP WITH LOCAL TIMEZONE (the Spark timestamp): It's similar to the first one but has a local timezone associated with all the values. Since SQL spec requires a timezone offset, the local timezone should also be a specific offset, and has no ambiguity as well.

However, things are different in the Java world. The Java time API uses ZoneId, which can be a region name with an indeterminate zone offset (see the doc).

@MaxGekk do you think Spark violates the SQL standard if ZoneId is used to represent the session local timezone? I see many datetime operations in DateTimeUtils are using ZoneId.

// about a month length in microseconds
val intervalStepInMicros = stepMicros + stepMonths * microsPerMonth
// about a month length in days and a day length in microseconds
// We choose a minimum days(28) in one month to make sure the estimated

@@ -87,6 +87,7 @@ object IntervalUtils {
// Returns total number of seconds with microseconds fractional part in the given interval.
def getEpoch(interval: CalendarInterval): Decimal = {
var result = interval.microseconds
result += DateTimeUtils.MICROS_PER_DAY * interval.days
result += MICROS_PER_YEAR * (interval.months / MONTHS_PER_YEAR)
result += MICROS_PER_MONTH * (interval.months % MONTHS_PER_YEAR)
Contributor

not related to this PR, but isn't it simply result += MICROS_PER_MONTH * interval.months?

Member

It could be if MICROS_PER_YEAR = 12 * MICROS_PER_MONTH. I strictly followed the PostgreSQL implementation: https://github.com/postgres/postgres/blob/bffe1bd68457e43925c362d8728ce3b25bdf1c94/src/backend/utils/adt/timestamp.c#L5016-L5022
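To see why the year/month split matters, take PostgreSQL-style constants (assumed here for illustration: a Julian year of 365.25 days and a 30-day month; not necessarily Spark's actual values):

```java
long MICROS_PER_DAY = 24L * 3_600_000_000L;
long MICROS_PER_YEAR = (long) (365.25 * MICROS_PER_DAY);
long MICROS_PER_MONTH = 30L * MICROS_PER_DAY;

int months = 14; // i.e. 1 year 2 months
long split = MICROS_PER_YEAR * (months / 12) + MICROS_PER_MONTH * (months % 12);
long flat  = MICROS_PER_MONTH * months;

// The flat formula counts 12 * 30 = 360 days per year instead of 365.25,
// so the two results differ by 5.25 days' worth of microseconds.
System.out.println(split - flat); // 453600000000
```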

@SparkQA

SparkQA commented Oct 29, 2019

Test build #112813 has finished for PR 26134 at commit 336f0db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 29, 2019

Test build #112814 has finished for PR 26134 at commit 4e97bc0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 31, 2019

Test build #113013 has finished for PR 26134 at commit 0e87e2d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait HasRelativeError extends Params
  • class _ImputerParams(HasInputCol, HasInputCols, HasOutputCol, HasOutputCols, HasRelativeError):
  • class _OneHotEncoderParams(HasInputCol, HasInputCols, HasOutputCol, HasOutputCols,
  • class _RobustScalerParams(HasInputCol, HasOutputCol, HasRelativeError):
  • class HasRelativeError(Params):
  • class ResolveCoalesceHints(conf: SQLConf) extends Rule[LogicalPlan]
  • case class AlterTableRecoverPartitionsStatement(
  • case class DropNamespaceStatement(
  • case class LoadDataStatement(
  • case class ShowColumnsStatement(
  • case class DropNamespace(
  • case class DropNamespaceExec(

micros = Math.addExact(micros, Math.multiplyExact(hours, MICROS_PER_HOUR))
micros = Math.addExact(micros, Math.multiplyExact(minutes, MICROS_PER_MINUTE))
micros = Math.addExact(micros, Math.multiplyExact(seconds, DateTimeUtils.MICROS_PER_SECOND))
-    new CalendarInterval(0, sign * micros)
+    new CalendarInterval(0, sign * days.toInt, sign * micros)
Contributor

nit: we can cast to int at the beginning

    val days = if (m.group(2) == null) {
        0
      } else {
        toLongWithRange("day", m.group(3), 0, Integer.MAX_VALUE).toInt
      }

Contributor Author

sure, thx

StringType),
"interval 1 years 3 months -3 days")
"interval 1 years 3 months 1 weeks 2 days -3 hours")
Contributor

@MaxGekk @dongjoon-hyun this is another instance we can think about: I feel "9 days" is more readable than "1 weeks 2 days".

@cloud-fan
Contributor

LGTM except a few minor comments, thanks for your work and patience @LinhongLiu !

@@ -2702,7 +2715,10 @@ object Sequence {
| $arr[$i] = ($elemType) ($t / ${scale}L);
| $i += 1;
| $t = org.apache.spark.sql.catalyst.util.DateTimeUtils.timestampAddInterval(
| $startMicros, $i * $stepMonths, $i * $stepMicros, $zid);
| $startMicros,
| new org.apache.spark.unsafe.types.CalendarInterval(
Contributor

Sorry I missed this place. There is a performance penalty if we create an object every time. Let's go back to the previous version. Sorry for the back and forth!

Contributor Author

np. I reverted the changes.

@SparkQA

SparkQA commented Oct 31, 2019

Test build #113034 has finished for PR 26134 at commit fb1591e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 1, 2019

Test build #113061 has finished for PR 26134 at commit 2dd8e43.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 1, 2019

Test build #113073 has finished for PR 26134 at commit 2f90189.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LinhongLiu
Contributor Author

@cloud-fan all tests pass now, could you please check this PR again?

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in a4382f7 Nov 1, 2019
@LinhongLiu
Contributor Author

thanks @cloud-fan @xuanyuanking @MaxGekk for your reviews.

public final long microseconds;

public long milliseconds() {
return this.microseconds / MICROS_PER_MILLI;
}

-  public CalendarInterval(int months, long microseconds) {
+  public CalendarInterval(int months, int days, long microseconds) {
Member

@LinhongLiu Could you send out a follow-up PR to document why we need days and how we use it in this file? This is definitely worth documenting. Otherwise, everyone reading this class may need to git blame and go through the long discussion in this PR.

Contributor Author

@LinhongLiu Could you send out a follow-up PR to document why we need days and how we use it in this file? This is definitely worth documenting. Otherwise, everyone reading this class may need to git blame and go through the long discussion in this PR.

Sure, will do.

srowen pushed a commit that referenced this pull request Nov 30, 2019
### What changes were proposed in this pull request?
Follow up of #26134 to document the reason for adding the days field and explain how we use it

### Why are the changes needed?
only comment

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
no tests needed

Closes #26701 from LinhongLiu/spark-29486-followup.

Authored-by: Liu,Linhong <liulinhong@baidu.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019