Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comparison of DataType::Interval and Interval**Array #5200

Open
alamb opened this issue Dec 11, 2023 · 4 comments
Open

Comparison of DataType::Interval and Interval**Array #5200

alamb opened this issue Dec 11, 2023 · 4 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Dec 11, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Data of type DateTime::Interval (spans of time measured in calendar units like days and months) are tricky to compare because the absolute size (in number of seconds) is not a fixed quantity. For example the 1 month is 28 days for February but 1 month is 31 days in December.

This makes the seemingly simple operation of comparing two intervals quite complicated in practice. For example is 1 month more or less than 30 days? The answer depends on what month you are talking about.

Arrow also includes a type DateTime::Duration that is a fixed width of time and suffers from many fewer challenges in comparisons, but may not be as intuitive for humans to use;

#5180 from @berkaysynnada and @ozankabak noted that while in general it is impossible to determine if any arbitrary Interval is greater/equal/less than another arbitrary Interval it is possible to to order some intervals. For example, 10 days is always less than 30 days, even though it is not possible to know if 1 month is less than 30 days without additional information not encoded in the Interval

Describe the solution you'd like
I am not sure.

I filed this ticket to 1. Try and describe the challenge beter and 2. Have a discussion about what, if any, additional semantics are needed

Describe alternatives you've considered

Nothing / Improve Documentation

The current handling of intervals is now documented after #5192 so I hope there is less confusion about the semantics. If downstream crates want to compare intervals they can do so by implementing their own kernels

New kernels

One possibility would be to add new comparison kernels that were specific to intervals and implemented partial order interval comparisons as proposed in #5180

A comparison returns true if it holds certainly. Otherwise, it returns false. Going back to our example, we return false for both 1 month < 30 days and 1 month > 30 days. It is impossible to impose a total order on intervals, but we can impose a consistent partial order with this logic.

Something like

let arr1 = IntervalMonthDayNanoArray::from(...);
let arr2 = IntervalMonthDayNanoArray::from(...);

// compare arr1 and arr2 using interval specific logic / partial order
let res = interval_partial_lt()

This would have the benefit of implementing interval specific semantics and would be very clear it is different than other comparison kernels

Forced cast to Duration

One way to define a deterministic comparsison is to covert months to 30 day increments. That would result in a well defined and fast comparisons, but may lead to intuitive results

Additional context
Here is a related discussion in DataFusion: apache/datafusion#8468

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Dec 11, 2023
@tustvold
Copy link
Contributor

I think we should establish what other systems do, such as postgres, and go from there

@berkaysynnada
Copy link
Contributor

AFAIK Postgre defines a month as exactly 30 days. While this simplification facilitates certain implementations, it can potentially introduce silent bugs. In our DataFusion fork, we have implemented a PartialOrd with semantics as described in this comment. In short, our implementation treats '1 month < 30 days' as false and '1 month >= 30 days' as false. However, '1 month + 1 day > 1 month' is true, and '1 month > 27 days' is also true. The fix also addresses problematic cases involving negative-signed LSD blocks.

If you're interested in how such an implementation could be integrated into DataFusion, we can help with the process.

@tustvold
Copy link
Contributor

AFAIK Postgre defines a month as exactly 30 days

This was what I had understood as well. What do you think of updating the cast kernel to allow casting interval expressions with non-zero days/months to durations, using the same assumptions as postgres makes?

This would have a number of quite compelling advantages in my opinion:

  • The conversion is explicit and user-visible
  • The ordering behaviour for durations is easy to understand
  • No additional kernels required

My main hesitancy with defining our own custom partial ordering is that I am not very confident of all the nuances involved in performing such an operation correctly, how to communicate these to users, and am concerned about adding further edge-cases for people to be wary of. Whilst I agree the postgres approach is making some pretty crude assumptions, it is easy to understand and comes with prior precedent...

@alamb
Copy link
Contributor Author

alamb commented Dec 12, 2023

What do you think of updating the cast kernel to allow casting interval expressions with non-zero days/months to durations, using the same assumptions as postgres makes?

This makes sense to me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

3 participants