
SQL: Use timestamp_floor when granularity is not safe. #13206

Merged

gianm merged 6 commits into apache:master from fix-sql-granularity-planning on Oct 17, 2022

Conversation

@gianm (Contributor) commented Oct 11, 2022

Fixes #13182.

PR #12944 added a check at the execution layer to avoid materializing excessive amounts of time-granular buckets. This patch modifies the SQL planner to avoid generating queries that would throw such errors by switching certain plans to use the timestamp_floor function instead of granularities. This applies to both the Timeseries query type and the GroupBy timestampResultFieldGranularity feature.

The patch also goes one step further: we switch to timestamp_floor not just in the ETERNITY + non-ALL case, but also if the estimated number of time-granular buckets exceeds 100,000.

Finally, the patch modifies the timestampResultFieldGranularity field to consistently be a String rather than a Granularity. This ensures that it can be round-trip serialized and deserialized, which is useful when trying to execute the results of "EXPLAIN PLAN FOR" with GroupBy queries that use the timestampResultFieldGranularity feature.
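
To make the check concrete, here is a minimal sketch of the kind of safety test described above, assuming Joda-Time types. The method name echoes the canUseQueryGranularity call visible in the review hunks below, but the signature, constant handling, and structure here are illustrative, not the patch's actual code:

```java
import org.joda.time.Interval;
import org.joda.time.Period;

// Hedged sketch of the planner's safety check; illustrative, not the
// patch's actual code. An unbounded (ETERNITY-like) interval with any
// non-ALL granularity falls out naturally: its estimated bucket count
// dwarfs the threshold.
public final class GranularitySafetySketch
{
  private static final long MAX_TIME_GRAINS = 100_000;

  /**
   * True if planning with a native granularity is safe; false if the
   * planner should switch to a timestamp_floor virtual column instead.
   */
  public static boolean canUseQueryGranularity(Interval queryInterval, Period granularityPeriod)
  {
    // Estimate from the first bucket's length, since period-based
    // granularities (P1M, P1Y, ...) have no fixed duration.
    final long firstBucketMillis =
        granularityPeriod.toDurationFrom(queryInterval.getStart()).getMillis();
    if (firstBucketMillis <= 0) {
      return false;
    }
    return queryInterval.toDurationMillis() / firstBucketMillis <= MAX_TIME_GRAINS;
  }
}
```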

@gianm (Contributor, Author) commented Oct 11, 2022

There are two better solutions that I didn't pursue in this patch because they are more complex:

  1. Improving the query execution logic so the granularity feature doesn't incur overhead for buckets that aren't needed. The idea here would be to let the cursor drive the bucketing, rather than letting the bucketing drive the cursor.
  2. Improving the performance of timestamp_floor so granularity is not needed. The SQL planner would always generate GroupBy with timestamp_floor: no timeseries queries, and no GroupBy-with-granularity.

Personally I think path (2) is the best one for the future. That being said, there is a need to have these queries execute properly today, hence the present patch.

@rohangarg (Member) left a comment

Thanks a lot for the changes! 👍 Mostly LGTM, some comments.

Regarding the long-term solution: I agree that making timestamp_floor faster could be the way to go. My only doubt is that supporting various time-grain-level operations, such as limit and ranking, might need a new construct (like a new Sequence). That can be discussed further when we decide to do it.

```java
theContext.put(GroupByQuery.CTX_TIMESTAMP_RESULT_FIELD_INDEX, timestampDimensionIndexInDimensions);
theContext.put(GroupByQuery.CTX_TIMESTAMP_RESULT_FIELD_GRANULARITY, queryGranularity);

if (canUseQueryGranularity(dataSource, filtration, queryGranularity)) {
```
@rohangarg (Member) commented on this code:

Maybe we can use this on L1272 with an `if (granularity == null)` check? Because that path sets the queryGranularity to skip other grouping columns but may not set the context.

@gianm (Contributor, Author) replied:

Given the code structure, I think it's OK, but I rearranged it a bit anyway to hopefully make the logic clearer.

@rohangarg (Member) replied:

Thanks! There are some CI failures, though, which look legit.

```java
// Validate the interval against MAX_TIME_GRAINS_NON_DRUID_TABLE.
// Estimate based on the size of the first bucket, to avoid computing them all. (That's what we're
// trying to avoid!)
final Interval firstBucket = queryGranularity.bucket(filtrationInterval.getStart());
```
@rohangarg (Member) commented on this code:

Maybe the Granularity object should have a duration method as well, but that can be a follow-up later.

@gianm (Contributor, Author) replied:

This logic is here because not all Granularities have a fixed duration. There is DurationGranularity, which does, but also PeriodGranularity, which does not. For example, P1Y changes durations on leap years and P1D changes durations for daylight savings time.
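
As a quick illustration of that point, using Joda-Time (which Druid uses); the date and zone below are just an example:

```java
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.Duration;
import org.joda.time.Period;

public class PeriodLengthExample
{
  public static void main(String[] args)
  {
    // P1D spans 25 hours on the day US daylight saving time ends.
    DateTimeZone zone = DateTimeZone.forID("America/Los_Angeles");
    DateTime dayStart = new DateTime(2022, 11, 6, 0, 0, zone); // DST ended Nov 6, 2022
    Duration oneDay = Period.days(1).toDurationFrom(dayStart);
    System.out.println(oneDay.getStandardHours()); // prints 25, not 24
  }
}
```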

@rohangarg (Member) replied:

Oh, that's a good point. I missed those cases.

@gianm (Contributor, Author) commented Oct 11, 2022

Thanks for the review @rohangarg. I pushed up some changes based on your comments.

@paul-rogers (Contributor) commented:

> The idea here would be to let the cursor drive the bucketing, rather than letting the bucketing drive the cursor.

This seems like a good longer-term solution. Can we go further? Any reason to create buckets that will be empty? If we're doing time-based grouping, and data is time ordered, we can create buckets on the fly as we read the data. This is the classic streaming grouping solution: one doesn't normally enumerate all possible customers, say, to group sales by customer.

There would have to be code to handle empty groups, if we want to fill empty time slots. But this can be done by enumerating the time buckets as we return results: if the next result is greater than the current time bucket, return zeros. For this, the set of buckets need not be materialized, just enumerated.

> Improving the performance of timestamp_floor...

For fixed-length intervals (week or less), the time floor should be a simple mod-then-subtract. For variable-length intervals, the logic is more complex. Can we split the implementations for those two cases? Week-or-less is super fast; month-or-more pays the extra compute cost?
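
A minimal sketch of that fixed-length fast path; the function name and origin handling are illustrative, not Druid's timestamp_floor implementation:

```java
// Hedged sketch of the mod-then-subtract fast path for fixed-length buckets.
// Illustrative only; not Druid's timestamp_floor implementation.
static long floorFixed(long timestampMillis, long bucketMillis, long originMillis)
{
  // Math.floorMod keeps the result correct for timestamps before the origin.
  return timestampMillis - Math.floorMod(timestampMillis - originMillis, bucketMillis);
}

// Example: floorFixed(t, 3_600_000L, 0L) floors t to the start of its UTC hour.
```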

@gianm (Contributor, Author) commented Oct 16, 2022

> Can we go further? Any reason to create buckets that will be empty? If we're doing time-based grouping, and data is time ordered, we can create buckets on the fly as we read the data.

That's exactly what I meant by "let the cursor drive the bucketing". I meant only generate buckets based on timestamps we actually see. (Unless the user specifically requests that empty buckets be zero-filled.)

> For this, the set of buckets need not be materialized, just enumerated.

In general, today, we don't materialize the buckets; we just enumerate them. The issue is that even with this approach, the time it takes to enumerate buckets can be prohibitively large. (People raise bugs saying that queries "hang". They don't actually hang, but it seems that way to the user, due to the large number of buckets being enumerated.) By letting the cursor drive the bucketing, we can avoid this completely for the case where we aren't zero-filling. The zero-fill case still poses an issue, but we could address that some other way, such as a limit on the number of zero-filled buckets.
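
A rough sketch of cursor-driven bucketing under the no-zero-fill assumption; every helper here (readTimestamp, accumulate, emitBucket, bucketStartOf) is a hypothetical stand-in, not a Druid API:

```java
// Rough sketch of "the cursor drives the bucketing": buckets come only from
// timestamps actually seen, so empty buckets are never enumerated. Assumes
// rows arrive in __time order. All helpers are hypothetical stand-ins.
void aggregateByTime(Cursor cursor)
{
  long currentBucket = 0;
  boolean open = false;
  while (!cursor.isDone()) {
    final long t = readTimestamp(cursor);
    final long bucket = bucketStartOf(t); // floor of t to the query grain
    if (!open || bucket != currentBucket) {
      if (open) {
        emitBucket(currentBucket); // flush the finished bucket's aggregates
      }
      currentBucket = bucket;
      open = true;
    }
    accumulate(cursor); // add this row to the current bucket's aggregates
    cursor.advance();
  }
  if (open) {
    emitBucket(currentBucket);
  }
}
```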

> For fixed-length intervals (week or less), the time floor should be a simple mod-then-subtract. For variable-length intervals, the logic is more complex. Can we split the implementations for those two cases? Week-or-less is super fast; month-or-more pays the extra compute cost?

Ah. In practice the main perf hit isn't from the evaluation speed of the timestamp_floor function itself; it already has logic to fast-path common granularities in the way you mention. The bigger issue is that we don't take advantage of the fact that segments are sorted by __time and that timestamp_floor is monotonic. Two things we should do to get performance matching the way we handle granularity:

  • the group-by engine should do a streaming aggregation when the first group-by dimension is timestamp_floor(__time, ...)
  • the computation of timestamp_floor should take advantage of monotonicity: when we encounter a timestamp, we can compute the start and end of its bucket. The start can then be reused as the return value of timestamp_floor for any value of __time up until the end; there is no need to actually compute the function for each row (see the sketch below).
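
A minimal sketch of that second point. It uses the Granularity.bucket method seen in the review hunk above, but the caching wrapper itself is hypothetical:

```java
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.Interval;

// Illustrative sketch of the monotonicity shortcut: compute a bucket once,
// then reuse its start for every row whose __time falls inside it. Uses
// Druid's Granularity.bucket (as in the review hunk above); the caching
// wrapper itself is hypothetical.
class CachedTimestampFloor
{
  private final Granularity granularity;
  private long bucketStart = 0;
  private long bucketEnd = 0; // empty [0, 0) range forces a compute on first use

  CachedTimestampFloor(Granularity granularity)
  {
    this.granularity = granularity;
  }

  long floor(long timestampMillis)
  {
    if (timestampMillis < bucketStart || timestampMillis >= bucketEnd) {
      final Interval bucket =
          granularity.bucket(new DateTime(timestampMillis, DateTimeZone.UTC));
      bucketStart = bucket.getStartMillis();
      bucketEnd = bucket.getEndMillis();
    }
    return bucketStart; // reused until __time crosses bucketEnd
  }
}
```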

@rohangarg (Member) left a comment

LGTM, modulo the question in my comment below.

```java
.build()
),
NullHandling.sqlCompatible()
? ImmutableList.of(new Object[]{946684800000L, "", 1L}, new Object[]{946771200000L, "10.1", 1L})
```
@rohangarg (Member) commented on this code:

Is this because the data in the test table is actually "" and not null? So in SQL-compatible mode, null and "" are considered different, and hence the "" row appears in the join output, whereas in non-compatible mode "" is treated as null and the join ignores it?

@gianm (Contributor, Author) replied:

Yes, that's exactly what is happening.

gianm merged commit 6aca617 into apache:master on Oct 17, 2022
gianm deleted the fix-sql-granularity-planning branch on October 17, 2022 at 15:22
kfaraz added this to the 25.0 milestone on Nov 22, 2022
gianm added a commit that referenced this pull request Mar 7, 2023:

This function is notorious for causing memory exhaustion and excessive CPU usage; so much so that it was valuable to work around it in the SQL planner in #13206. Hopefully, a warning comment will encourage developers to stay away and come up with solutions that do not involve computing all possible buckets.

Commits with the same message also landed in gianm/druid (Mar 7, 2023) and 317brian/druid (Mar 10, 2023), both referencing this pull request.
keytouch pushed a commit to keytouch/druid that referenced this pull request Feb 12, 2025. Its message repeats this PR's description, followed by the squashed commit subjects:

* SQL: Use timestamp_floor when granularity is not safe.
* Fix test, address PR comments.
* Fix ControllerImpl.
* Fix test.
* Fix unused import.

Successfully merging this pull request may close these issues: The Broker's Java threads hang on certain SQL queries (#13182).