Disallow timeseries queries with ETERNITY interval and non-ALL granularity #12944

rohangarg · 2022-08-23T08:20:48Z

A user might construct a timeseries query on inline data by mistake which contains no filters on time intervals (hence the query is on ETERNITY interval) and also specifies a non-ALL granularity. Currently, those queries will start generating a large number of time grains while generating the results which can lead to a very high chance for heap OOM or a very very long query runtime.
This change disallows such queries from execution to avoid the problems mentioned above. Semantically as well, there is less point in running a time granularity query over the ETERNITY interval.

An augmentation to the solution could be to find the time-interval of the inline data by traversing it and then using that interval in the RowBasedStorageAdapater. But doing that change today will either require recompuation of inline data or eager materialization of it in the adapater, both of which approaches have their own short-comings. So, for now this change is refrained - it can be considered in future if needed.

This PR has:

been self-reviewed.
- using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
added documentation for new or modified features or behaviors.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added or updated version, license, or notice information in licenses.yaml
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
added integration tests.
been tested in a test Druid cluster.

…arity

clintropolis

An augmentation to the solution could be to find the time-interval of the inline data by traversing it and then using that interval ...

Could we make an issue for this to track it? Reason being this seems sort of sad, but agree its better than running out of memory or whatever

rohangarg · 2022-08-24T09:30:21Z

Could we make an issue for this to track it? Reason being this seems sort of sad, but agree its better than running out of memory or whatever

I think the current check would be needed even if we evaluate the exact time interval for data in RowBaseStorageAdapter since there can be data with a custom time column, empty data or a user degenerate case which can lead to a large time interval. Thinking about this again, do you think it would be better if we cap the number of time-grains possible to create? the interval and granularity could be used to find the estimated number of intervals.
Regardless, I will also open an issue for tracking exact time interval for inline data since that'll be generally helpful.

rohangarg · 2022-08-24T10:23:03Z

One more idea around this is to give a lazy support for descending time-grain iterators. The OOM I have seen are because the reversing of iterators materializes them which explodes the heap memory usage. With that change, the queries on large intervals with time-granularity will run slowly but will have much lesser chances of OOMing.

abhishekagarwal87 · 2022-09-06T14:55:44Z

@rohangarg - shall we merge this PR for now?

PR apache#12944 added a check at the execution layer to avoid materializing excessive amounts of time-granular buckets. This patch modifies the SQL planner to avoid generating queries that would throw such errors, by switching certain plans to use the timestamp_floor function instead of granularities. This applies both to the Timeseries query type, and the GroupBy timestampResultFieldGranularity feature. The patch also goes one step further: we switch to timestamp_floor not just in the ETERNITY + non-ALL case, but also if the estimated number of time-granular buckets exceeds 100,000. Finally, the patch modifies the timestampResultFieldGranularity field to consistently be a String rather than a Granularity. This ensures that it can be round-trip serialized and deserialized, which is useful when trying to execute the results of "EXPLAIN PLAN FOR" with GroupBy queries that use the timestampResultFieldGranularity feature.

* SQL: Use timestamp_floor when granularity is not safe. PR #12944 added a check at the execution layer to avoid materializing excessive amounts of time-granular buckets. This patch modifies the SQL planner to avoid generating queries that would throw such errors, by switching certain plans to use the timestamp_floor function instead of granularities. This applies both to the Timeseries query type, and the GroupBy timestampResultFieldGranularity feature. The patch also goes one step further: we switch to timestamp_floor not just in the ETERNITY + non-ALL case, but also if the estimated number of time-granular buckets exceeds 100,000. Finally, the patch modifies the timestampResultFieldGranularity field to consistently be a String rather than a Granularity. This ensures that it can be round-trip serialized and deserialized, which is useful when trying to execute the results of "EXPLAIN PLAN FOR" with GroupBy queries that use the timestampResultFieldGranularity feature. * Fix test, address PR comments. * Fix ControllerImpl. * Fix test. * Fix unused import.

* SQL: Use timestamp_floor when granularity is not safe. PR apache#12944 added a check at the execution layer to avoid materializing excessive amounts of time-granular buckets. This patch modifies the SQL planner to avoid generating queries that would throw such errors, by switching certain plans to use the timestamp_floor function instead of granularities. This applies both to the Timeseries query type, and the GroupBy timestampResultFieldGranularity feature. The patch also goes one step further: we switch to timestamp_floor not just in the ETERNITY + non-ALL case, but also if the estimated number of time-granular buckets exceeds 100,000. Finally, the patch modifies the timestampResultFieldGranularity field to consistently be a String rather than a Granularity. This ensures that it can be round-trip serialized and deserialized, which is useful when trying to execute the results of "EXPLAIN PLAN FOR" with GroupBy queries that use the timestampResultFieldGranularity feature. * Fix test, address PR comments. * Fix ControllerImpl. * Fix test. * Fix unused import.

Disallow timeseries queries with ETERNITY interval and non-ALL granul…

ea344c1

…arity

rohangarg added Bug Area - Querying Area - SQL labels Aug 23, 2022

clintropolis approved these changes Aug 24, 2022

View reviewed changes

abhishekagarwal87 merged commit 2f156b3 into apache:master Sep 7, 2022

This was referenced Oct 11, 2022

SQL: Use timestamp_floor when granularity is not safe. #13206

Merged

The Broker's Java threads hang on certain SQL queries #13182

Closed

Coordinator crashes trying to compact a datasource ingested through MSQ with PARTITIONED BY ALL #13208

Closed

kfaraz added this to the 25.0 milestone Nov 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Disallow timeseries queries with ETERNITY interval and non-ALL granularity #12944

Disallow timeseries queries with ETERNITY interval and non-ALL granularity #12944

rohangarg commented Aug 23, 2022

clintropolis left a comment

rohangarg commented Aug 24, 2022 •

edited

Loading

rohangarg commented Aug 24, 2022

abhishekagarwal87 commented Sep 6, 2022

Disallow timeseries queries with ETERNITY interval and non-ALL granularity #12944

Disallow timeseries queries with ETERNITY interval and non-ALL granularity #12944

Conversation

rohangarg commented Aug 23, 2022

clintropolis left a comment

Choose a reason for hiding this comment

rohangarg commented Aug 24, 2022 • edited Loading

rohangarg commented Aug 24, 2022

abhishekagarwal87 commented Sep 6, 2022

rohangarg commented Aug 24, 2022 •

edited

Loading