Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support timestamps and steps of less than a day for range/generate_series #12400

Merged
merged 5 commits into from
Sep 12, 2024

Conversation

Omega359
Copy link
Contributor

@Omega359 Omega359 commented Sep 9, 2024

Which issue does this PR close?

Closes #11822

Rationale for this change

Add support for timestamps and steps less than a day for timestamps.

What changes are included in this PR?

Code, tests, doc.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) labels Sep 9, 2024
@Omega359 Omega359 marked this pull request as ready for review September 9, 2024 19:20
@@ -109,8 +121,11 @@ impl ScalarUDFImpl for Range {
match args[0].data_type() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can probably factor our this block to avoid make a modifications twice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is so much duplication in these UDF's. I think it's best to push the cleanup of that to a separate PR and keep this one as small as possible.

return exec_err!(
"arguments length must be 3 for {}",
if include_upper {
"generate_series"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"generate_series"
"GENERATE_SERIES"

if include_upper {
"generate_series"
} else {
"range"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"range"
"RANGE"

@@ -394,3 +412,136 @@ fn gen_range_date(args: &[ArrayRef], include_upper: bool) -> Result<ArrayRef> {

Ok(arr)
}

fn gen_range_timestamp(args: &[ArrayRef], include_upper: bool) -> Result<ArrayRef> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn gen_range_timestamp(args: &[ArrayRef], include_upper: bool) -> Result<ArrayRef> {
fn gen_range_timestamp(args: &[ArrayRef], include_upper_bound: bool) -> Result<ArrayRef> {


// values are timestamps
let values_builder = if let Some(start_tz_str) = start_tz_opt {
TimestampNanosecondBuilder::new().with_timezone(&**start_tz_str)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that looks somewhat weird, do we really need so many indirections here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it was written, yes, it was required. I changed it up to use map_or_else to accomplish the same thing.

_ => {
internal_err!(
"Unexpected argument type for {} : {}",
if include_upper {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this error block is third time showing up, perhaps we can factor it out

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Omega359 great work
Please check some code comments

For tests its also good to have examples like

series of date2 less than date1 with positive interval

fractional intervals - which might be tricky

@Omega359
Copy link
Contributor Author

Omega359 commented Sep 9, 2024

Thanks @comphead for the review, much appreciated. I'll go through the proposed changes and comments tonight or tomorrow and push updates.

@Omega359
Copy link
Contributor Author

For tests its also good to have examples like

series of date2 less than date1 with positive interval

I think that may already be in there for range and generate_series

query ?
select range(TIMESTAMP '1993-03-01', TIMESTAMP '1989-04-01', INTERVAL '1' YEAR)
----
[]

query ?
select generate_series(TIMESTAMP '1993-03-01', TIMESTAMP '1989-04-01', INTERVAL '1' YEAR)
----
[]

fractional intervals - which might be tricky

I added this one - I think this may be along the lines of what you are thinking of:

query ?
select generate_series(arrow_cast('2021-01-01T00:00:00', 'Timestamp(Nanosecond, Some("-05:00"))'), arrow_cast('2021-01-01T06:00:00', 'Timestamp(Nanosecond, Some("-05:00"))'), INTERVAL '1 HOUR 30 MINUTE -5500000000 NANOSECOND');
----

@alamb
Copy link
Contributor

alamb commented Sep 10, 2024

I plan to review this as well tomorrow -- thank you @Omega359

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Omega359 for this PR and @dmitrybugakov and @comphead for the reviews. I think this PR is good to go from my perspective (I left some optional nit picky comments but I don't think they are required)

I also ran the reproducer from @Abdullahsab3 on #11822 locally on this branch and it appears to work 👌

> select UNNEST(generate_series('2023-01-01 10:00:00'::timestamp,   '2023-01-31 23:59:00'::timestamp, '1 hour'::interval)) as date_time;
+---------------------+
| date_time           |
+---------------------+
| 2023-01-01T10:00:00 |
| 2023-01-01T11:00:00 |
| 2023-01-01T12:00:00 |
| 2023-01-01T13:00:00 |
| 2023-01-01T14:00:00 |
| 2023-01-01T15:00:00 |
| 2023-01-01T16:00:00 |
| 2023-01-01T17:00:00 |
| 2023-01-01T18:00:00 |
| 2023-01-01T19:00:00 |
| 2023-01-01T20:00:00 |
| 2023-01-01T21:00:00 |
| 2023-01-01T22:00:00 |
| 2023-01-01T23:00:00 |
| 2023-01-02T00:00:00 |
| 2023-01-02T01:00:00 |
| 2023-01-02T02:00:00 |
| 2023-01-02T03:00:00 |
| 2023-01-02T04:00:00 |
| 2023-01-02T05:00:00 |
| 2023-01-02T06:00:00 |
| 2023-01-02T07:00:00 |
| 2023-01-02T08:00:00 |
| 2023-01-02T09:00:00 |
| 2023-01-02T10:00:00 |
| 2023-01-02T11:00:00 |
| 2023-01-02T12:00:00 |
| 2023-01-02T13:00:00 |
| 2023-01-02T14:00:00 |
| 2023-01-02T15:00:00 |
| 2023-01-02T16:00:00 |
| 2023-01-02T17:00:00 |
| 2023-01-02T18:00:00 |
| 2023-01-02T19:00:00 |
| 2023-01-02T20:00:00 |
| 2023-01-02T21:00:00 |
| 2023-01-02T22:00:00 |
| 2023-01-02T23:00:00 |
| 2023-01-03T00:00:00 |
| 2023-01-03T01:00:00 |
| .                   |
| .                   |
| .                   |
+---------------------+
734 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 0.014 seconds.

make_scalar_function(|args| gen_range_timestamp(args, false))(args)
}
dt => {
exec_err!("unsupported type for RANGE. Expected Int64, Date32 or Timestamp, got: {dt}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ for a much better error

datafusion/functions-nested/src/range.rs Outdated Show resolved Hide resolved
Ok(arr)
}

fn cast_timestamp_arg(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

You could potentially use as_timestamp_nanosecond_array (the datafusion version) from: https://docs.rs/datafusion/latest/datafusion/common/cast/index.html

And then call timezone to get the timezone from the array

I personally found the use of the word cast somewhat confusing as function doesn't actually call the arrow cast kernel, but instead does the rust as_any stuff (which is like a cast in C or Java)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My programming roots are showing through :) I can take a look at that

[2021-01-01T00:00:00, 2021-01-01T01:00:00, 2021-01-01T02:00:00, 2021-01-01T03:00:00, 2021-01-01T04:00:00, 2021-01-01T05:00:00, 2021-01-01T06:00:00, 2021-01-01T07:00:00, 2021-01-01T08:00:00, 2021-01-01T09:00:00, 2021-01-01T10:00:00, 2021-01-01T11:00:00, 2021-01-01T12:00:00, 2021-01-01T13:00:00, 2021-01-01T14:00:00, 2021-01-01T15:00:00]

query ?
select generate_series('2021-01-01T00:00:00EST'::timestamp, '2021-01-01T15:00:00-12:00'::timestamp, INTERVAL '1' HOUR);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it worth testing when the end timestamp is not an exact increment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a test for -5.5sec (in nanoseconds) somewhere in that file.

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@alamb
Copy link
Contributor

alamb commented Sep 12, 2024

🚀 - thanks again @comphead @Omega359 @dmitrybugakov and @Abdullahsab3

@alamb alamb merged commit 1f06308 into apache:main Sep 12, 2024
25 checks passed
@Omega359 Omega359 deleted the range_gen_series_11822 branch October 6, 2024 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support of timestamps and steps of less than a day for generate_series
4 participants