
SQL: Use timestamp_floor when granularity is not safe. #13206

Merged

gianm merged 6 commits into apache:master from fix-sql-granularity-planning on Oct 17, 2022

Conversation

@gianm (Contributor) commented Oct 11, 2022

Fixes #13182.

PR #12944 added a check at the execution layer to avoid materializing excessive amounts of time-granular buckets. This patch modifies the SQL planner to avoid generating queries that would throw such errors by switching certain plans to use the timestamp_floor function instead of granularities. This applies to both the Timeseries query type and the GroupBy timestampResultFieldGranularity feature.

The patch also goes one step further: we switch to timestamp_floor not just in the ETERNITY + non-ALL case, but also if the estimated number of time-granular buckets exceeds 100,000.

Finally, the patch modifies the timestampResultFieldGranularity field to consistently be a String rather than a Granularity. This ensures that it can be round-trip serialized and deserialized, which is useful when trying to execute the results of "EXPLAIN PLAN FOR" with GroupBy queries that use the timestampResultFieldGranularity feature.
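
To make the check concrete, here is a minimal sketch of the kind of safety test described above, assuming Joda-Time types. The method name echoes the canUseQueryGranularity call visible in the review hunks below, but the signature, constant handling, and structure here are illustrative, not the patch's actual code:

```java
import org.joda.time.Interval;
import org.joda.time.Period;

// Hedged sketch of the planner's safety check; illustrative, not the
// patch's actual code. An unbounded (ETERNITY-like) interval with any
// non-ALL granularity falls out naturally: its estimated bucket count
// dwarfs the threshold.
public final class GranularitySafetySketch
{
  private static final long MAX_TIME_GRAINS = 100_000;

  /**
   * True if planning with a native granularity is safe; false if the
   * planner should switch to a timestamp_floor virtual column instead.
   */
  public static boolean canUseQueryGranularity(Interval queryInterval, Period granularityPeriod)
  {
    // Estimate from the first bucket's length, since period-based
    // granularities (P1M, P1Y, ...) have no fixed duration.
    final long firstBucketMillis =
        granularityPeriod.toDurationFrom(queryInterval.getStart()).getMillis();
    if (firstBucketMillis <= 0) {
      return false;
    }
    return queryInterval.toDurationMillis() / firstBucketMillis <= MAX_TIME_GRAINS;
  }
}
```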

@gianm (Contributor, Author) commented Oct 11, 2022

There are two better solutions that I didn't pursue in this patch because they are more complex:

  1. Improving the query execution logic so the granularity feature doesn't incur overhead for buckets that aren't needed. The idea here would be to let the cursor drive the bucketing, rather than letting the bucketing drive the cursor.
  2. Improving the performance of timestamp_floor so granularity is not needed. The SQL planner would always generate GroupBy with timestamp_floor: no timeseries queries, and no GroupBy-with-granularity.

Personally I think path (2) is the best one for the future. That being said, there is a need to have these queries execute properly today, hence the present patch.

@rohangarg (Member) left a comment

Thanks a lot for the changes! 👍 Mostly LGTM, some comments.

Regarding the long-term solution: I agree that making timestamp_floor faster could be the way to go. My only doubt is that supporting various time-grain-level operations, such as limit and ranking, might need a new construct (like a new Sequence). That can be discussed further when we decide to do it.

```java
theContext.put(GroupByQuery.CTX_TIMESTAMP_RESULT_FIELD_INDEX, timestampDimensionIndexInDimensions);
theContext.put(GroupByQuery.CTX_TIMESTAMP_RESULT_FIELD_GRANULARITY, queryGranularity);

if (canUseQueryGranularity(dataSource, filtration, queryGranularity)) {
```
@rohangarg (Member) commented on this code:

Maybe we can use this on L1272 with an `if (granularity == null)` check? Because that path sets the queryGranularity to skip other grouping columns but may not set the context.

@gianm (Contributor, Author) replied:

Given the code structure, I think it's OK, but I rearranged it a bit anyway to hopefully make the logic clearer.

@rohangarg (Member) replied:

Thanks! There are some CI failures, though, which look legit.

```java
// Validate the interval against MAX_TIME_GRAINS_NON_DRUID_TABLE.
// Estimate based on the size of the first bucket, to avoid computing them all. (That's what we're
// trying to avoid!)
final Interval firstBucket = queryGranularity.bucket(filtrationInterval.getStart());
```
@rohangarg (Member) commented on this code:

Maybe the Granularity object should have a duration method as well, but that can be a follow-up later.

@gianm (Contributor, Author) replied:

This logic is here because not all Granularities have a fixed duration. There is DurationGranularity, which does, but also PeriodGranularity, which does not. For example, P1Y changes durations on leap years and P1D changes durations for daylight savings time.
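
As a quick illustration of that point, using Joda-Time (which Druid uses); the date and zone below are just an example:

```java
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.Duration;
import org.joda.time.Period;

public class PeriodLengthExample
{
  public static void main(String[] args)
  {
    // P1D spans 25 hours on the day US daylight saving time ends.
    DateTimeZone zone = DateTimeZone.forID("America/Los_Angeles");
    DateTime dayStart = new DateTime(2022, 11, 6, 0, 0, zone); // DST ended Nov 6, 2022
    Duration oneDay = Period.days(1).toDurationFrom(dayStart);
    System.out.println(oneDay.getStandardHours()); // prints 25, not 24
  }
}
```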

@rohangarg (Member) replied:

Oh, that's a good point. I missed those cases.

@gianm (Contributor, Author) commented Oct 11, 2022

Thanks for the review @rohangarg. I pushed up some changes based on your comments.

@paul-rogers (Contributor) commented:

> The idea here would be to let the cursor drive the bucketing, rather than letting the bucketing drive the cursor.

This seems like a good longer-term solution. Can we go further? Any reason to create buckets that will be empty? If we're doing time-based grouping, and data is time ordered, we can create buckets on the fly as we read the data. This is the classic streaming grouping solution: one doesn't normally enumerate all possible customers, say, to group sales by customer.

There would have to be code to handle empty groups, if we want to fill empty time slots. But this can be done by enumerating the time buckets as we return results: if the next result is greater than the current time bucket, return zeros. For this, the set of buckets need not be materialized, just enumerated.

> Improving the performance of timestamp_floor...

For fixed-length intervals (week or less), the time floor should be a simple mod-then-subtract. For variable-length intervals, the logic is more complex. Can we split the implementations for those two cases? Week-or-less is super fast; month-or-more pays the extra compute cost?
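
A minimal sketch of that fixed-length fast path; the function name and origin handling are illustrative, not Druid's timestamp_floor implementation:

```java
// Hedged sketch of the mod-then-subtract fast path for fixed-length buckets.
// Illustrative only; not Druid's timestamp_floor implementation.
static long floorFixed(long timestampMillis, long bucketMillis, long originMillis)
{
  // Math.floorMod keeps the result correct for timestamps before the origin.
  return timestampMillis - Math.floorMod(timestampMillis - originMillis, bucketMillis);
}

// Example: floorFixed(t, 3_600_000L, 0L) floors t to the start of its UTC hour.
```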

@gianm (Contributor, Author) commented Oct 16, 2022

> Can we go further? Any reason to create buckets that will be empty? If we're doing time-based grouping, and data is time ordered, we can create buckets on the fly as we read the data.

That's exactly what I meant by "let the cursor drive the bucketing". I meant only generate buckets based on timestamps we actually see. (Unless the user specifically requests that empty buckets be zero-filled.)

> For this, the set of buckets need not be materialized, just enumerated.

In general, today, we don't materialize the buckets; we just enumerate them. The issue is that even with this approach, the time it takes to enumerate buckets can be prohibitively large. (People raise bugs saying that queries "hang". They don't actually hang, but it seems that way to the user, due to the large number of buckets being enumerated.) By letting the cursor drive the bucketing, we can avoid this completely for the case where we aren't zero-filling. The zero-fill case still poses an issue, but we could address that some other way, such as a limit on the number of zero-filled buckets.
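
A rough sketch of cursor-driven bucketing under the no-zero-fill assumption; every helper here (readTimestamp, accumulate, emitBucket, bucketStartOf) is a hypothetical stand-in, not a Druid API:

```java
// Rough sketch of "the cursor drives the bucketing": buckets come only from
// timestamps actually seen, so empty buckets are never enumerated. Assumes
// rows arrive in __time order. All helpers are hypothetical stand-ins.
void aggregateByTime(Cursor cursor)
{
  long currentBucket = 0;
  boolean open = false;
  while (!cursor.isDone()) {
    final long t = readTimestamp(cursor);
    final long bucket = bucketStartOf(t); // floor of t to the query grain
    if (!open || bucket != currentBucket) {
      if (open) {
        emitBucket(currentBucket); // flush the finished bucket's aggregates
      }
      currentBucket = bucket;
      open = true;
    }
    accumulate(cursor); // add this row to the current bucket's aggregates
    cursor.advance();
  }
  if (open) {
    emitBucket(currentBucket);
  }
}
```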

> For fixed-length intervals (week or less), the time floor should be a simple mod-then-subtract. For variable-length intervals, the logic is more complex. Can we split the implementations for those two cases? Week-or-less is super fast; month-or-more pays the extra compute cost?

Ah. In practice the main perf hit isn't from the evaluation speed of the timestamp_floor function itself; it already has logic to fast-path common granularities in the way you mention. The bigger issue is that we don't take advantage of the fact that segments are sorted by __time and that timestamp_floor is monotonic. Two things we should do to get performance matching the way we handle granularity:

  • the group-by engine should do a streaming aggregation when the first group-by dimension is timestamp_floor(__time, ...)
  • the computation of timestamp_floor should take advantage of monotonicity: when we encounter a timestamp, we can compute the start and end of its bucket. The start can then be reused as the return value of timestamp_floor for any value of __time up until the end; there is no need to actually compute the function for each row (see the sketch below).
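
A minimal sketch of that second point. It uses the Granularity.bucket method seen in the review hunk above, but the caching wrapper itself is hypothetical:

```java
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.Interval;

// Illustrative sketch of the monotonicity shortcut: compute a bucket once,
// then reuse its start for every row whose __time falls inside it. Uses
// Druid's Granularity.bucket (as in the review hunk above); the caching
// wrapper itself is hypothetical.
class CachedTimestampFloor
{
  private final Granularity granularity;
  private long bucketStart = 0;
  private long bucketEnd = 0; // empty [0, 0) range forces a compute on first use

  CachedTimestampFloor(Granularity granularity)
  {
    this.granularity = granularity;
  }

  long floor(long timestampMillis)
  {
    if (timestampMillis < bucketStart || timestampMillis >= bucketEnd) {
      final Interval bucket =
          granularity.bucket(new DateTime(timestampMillis, DateTimeZone.UTC));
      bucketStart = bucket.getStartMillis();
      bucketEnd = bucket.getEndMillis();
    }
    return bucketStart; // reused until __time crosses bucketEnd
  }
}
```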

@rohangarg (Member) left a comment

LGTM, modulo the question in my comment below.

```java
.build()
),
NullHandling.sqlCompatible()
? ImmutableList.of(new Object[]{946684800000L, "", 1L}, new Object[]{946771200000L, "10.1", 1L})
```
@rohangarg (Member) commented on this code:

Is this because the data in the test table is actually "" and not null? So in SQL-compatible mode, null and "" are considered different, and hence the "" row appears in the join output, whereas in non-compatible mode "" is treated as null and the join ignores it?

@gianm (Contributor, Author) replied:

Yes, that's exactly what is happening.

gianm merged commit 6aca617 into apache:master on Oct 17, 2022
gianm deleted the fix-sql-granularity-planning branch on October 17, 2022 at 15:22
kfaraz added this to the 25.0 milestone on Nov 22, 2022
gianm added a commit that referenced this pull request Mar 7, 2023:

This function is notorious for causing memory exhaustion and excessive CPU usage; so much so that it was valuable to work around it in the SQL planner in #13206. Hopefully, a warning comment will encourage developers to stay away and come up with solutions that do not involve computing all possible buckets.

Commits with the same message also landed in gianm/druid (Mar 7, 2023) and 317brian/druid (Mar 10, 2023), both referencing this pull request.
keytouch pushed a commit to keytouch/druid that referenced this pull request Feb 12, 2025. Its message repeats this PR's description, followed by the squashed commit subjects:

* SQL: Use timestamp_floor when granularity is not safe.
* Fix test, address PR comments.
* Fix ControllerImpl.
* Fix test.
* Fix unused import.

Successfully merging this pull request may close these issues: The Broker's Java threads hang on certain SQL queries (#13182).