query rejection #6005

erlan-z · 2024-06-11T03:23:39Z

What this PR does:
This PR introduces a new functionality to reject queries based on certain attributes to protect our services from resource-exhausting queries that can lead to service disruptions, such as OOM kills or availability drop for other queries. This change provides a mechanism to block heavy queries before they reach our services, ensuring operational stability and resource efficiency.

Changes Introduced

Query Rejection Mechanism:
Rejects queries based on query api_type, regex, time window, time range, step size, user-agent regex, dashboard UID, and panel ID.
Configured through runtime config, allowing per-tenant management.
Query Priority Mechanism:
Query attributes from the existing query_priority functionality were utilized and extended to support the new properties.
Documentation Generator Fix:
Fixed doc-generator to remove duplicate documentation on repeated fields on slices.

Usage details
Both the query priority and query rejection mechanisms use the same config object (query_attribute) to determine whether a query matches the specified criteria. The query_attribute config defines requirements for a query and results in a match or no match based on its properties. The query_attribute does not itself reject or prioritize queries; rather, the action (query rejection or priority) is determined by the config in which it is used.

All provided properties in query_attribute use an AND operation to decide if a query matches. Matching is only done on fields provided in the config.

Example config:

    query_rejection:
      enabled: true
      query_attributes:
        - regex: .*ALERT.*
          query_step_limit:
            min: 6s
            max: 20s
          dashboard_uid: "dash123"

Example query:
curl 'localhost:8005/prometheus/api/v1/query?query=someCustomALERTquery&time=1718383304&end=1718386904&step=7s' -H "User-Agent: other" -H "X-Dashboard-Uid: dash123"

In this example, queries containing the string "ALERT" with a step between 6s and 20s, and a dashboard UID of "dash123" will be rejected. If any criteria specified in query_attribute properties are not met, such as a step of 30s, the query will not be rejected. In the above example, undefined properties (user_agent, time_window, etc.) are ignored, and the query is not checked against them.

Which issue(s) this PR fixes:
Fixes #

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

docs/configuration/config-file-reference.md

friedrichg

@erlan-z This is pretty cool. Thank you for doing this!

Things that I would love to see:

use warnExperimentalUse

cortex/pkg/util/log/experimental.go

Line 18 in 6dd64fc

func WarnExperimentalUse(feature string) {
Document the experimental feature on docs/configuration/v1-guarantees.md
Integration tests

docs/configuration/config-file-reference.md

pkg/querier/tripperware/query_attribute_matcher.go

pkg/querier/tripperware/roundtrip.go

docs/configuration/config-file-reference.md

pkg/querier/tripperware/query_attribute_matcher.go

docs/configuration/config-file-reference.md

erlan-z · 2024-06-25T22:13:32Z

Documented experimental feature on docs/configuration/v1-guarantees.md
Added integration tests.
I didn't use warnExperimentalUse because this feature is enabled or disabled per tenant, not for the entire system. Additionally, it can change at runtime, so I decided not to add the experimental log or metric. Please let me know if you still think we should add it.

pkg/querier/tripperware/query_attribute_matcher.go

pkg/util/validation/limits.go

integration/e2ecortex/client.go

erlan-z · 2024-06-26T18:47:57Z

For series/label/label_values, at least one match should match the regex.

I wonder why that's the case. How useful is it to match only one?

Since we go through the matchers and send a separate query for each of them, the whole query will fail if any one matcher is heavy. I think if we want to reject/prioritize heavy queries that match the regex, then matching a single one should be enough.

yeya24 · 2024-06-26T18:52:50Z

Let's say we know of a very heavy matcher, {cluster="us-west-2"}. Do we reject the query if users have this matcher in the request? I think it is only expensive if they use this matcher alone; it should be fine as long as they add more matchers.

yeya24

Thanks for addressing all the comments! The change looks good to me

alanprot · 2024-07-02T17:56:22Z

docs/configuration/config-file-reference.md

+time_range_limit:
+  # Query time range should be above or equal to this value to match. If set to
+  # 0, it won't be checked.
+  [min: <int> | default = 0]
+
+  # Query time range should be below or equal to this value to match. If set to
+  # 0, it won't be checked.
+  [max: <int> | default = 0]


Can we document the syntax here a bit? I suspect it's something like -1d, 1h, etc.

Should we incorporate this config here and deprecate it? These configs seem to overlap.

# Limit the query time range (end - start time of range query parameter and max
# - min of data fetched time range). This limit is enforced in the
# query-frontend and ruler (on the received query). 0 to disable.
# CLI flag: -store.max-query-length
[max_query_length: <duration> | default = 0s]

Should we deprecate this one?

Same as:

# Maximum duration into the future you can query. 0 to disable.
# CLI flag: -querier.max-query-into-future
[max_query_into_future: <duration> | default = 10m]

updated time-range docs

alanprot · 2024-07-02T17:59:19Z

docs/configuration/config-file-reference.md

+query_step_limit:
+  # Query step should be above or equal to this value to match. If set to 0, it
+  # won't be checked.
+  [min: <int> | default = 0]
+
+  # Query step should be below or equal to this value to match. If set to 0, it
+  # won't be checked.
+  [max: <int> | default = 0]


Should we incorporate this config here and deprecate it?

# Max number of steps allowed for every subquery expression in query. Number of
# steps is calculated using subquery range / step. A value > 0 enables it.
# CLI flag: -querier.max-subquery-steps
[max_subquery_steps: <int> | default = 0]

I'm trying to not have 2 ways of setting query limits.

alanprot · 2024-07-02T18:27:35Z

I think this LGTM, but I'm afraid of making the Cortex config even more complex. I wonder whether these kinds of limits would live better in another component, such as the auth gateway.

That said, I'm OK with the change, but I think we should at least revisit all the other query limits that we already have in Cortex and try to unify/simplify them here, so we have a one-stop place to define query limits. This can be done in a follow-up PR.

Ex of current query limits:

# Maximum duration into the future you can query. 0 to disable.
# CLI flag: -querier.max-query-into-future
[max_query_into_future: <duration> | default = 10m]


# Limit the query time range (end - start time of range query parameter and max
# - min of data fetched time range). This limit is enforced in the
# query-frontend and ruler (on the received query). 0 to disable.
# CLI flag: -store.max-query-length
[max_query_length: <duration> | default = 0s]

Thoughts @friedrichg @CharlieTLe ?

erlan-z · 2024-07-02T19:37:59Z

Thanks for the review.
Regarding incorporating all those limits, I initially envisioned query_rejection as a mechanism to protect our services from heavy queries, to be used by operators during events. The other mentioned configurations are intended to set predefined limits on queries. However, there seems to be some overlap, and having these configurations spread across different locations complicates things. Technically, they are all limits on queries and can be combined and managed under query_rejection.
I'm unsure which approach is best: differentiating these concepts by having configurations in separate places or unifying them under query_rejection as a single location for managing all query-related limits.

The main argument I had against placing it in the auth gateway is that the auth gateway is primarily for authentication and shouldn't be aware of the request content. It shouldn't be parsing the query and used to limit queries based on their characteristics.

friedrichg

@alanprot
Combining flags/options is a great idea. I support that.

I think query priority was very interesting when it got merged. This is just very related to it. Moving things over to auth-gateway would require moving that too. We don't have many "facilities" in auth-gateway like limits per tenant, so it would be hard work. Maybe even importing cortex there to use things would be a prerequisite. Or creating a common library.

On the other hand, I mentioned the rejection feature to a couple of people and they are very interested, so I think it is a bad idea to block this feature for too long.

I have only a minor nit on the integration tests. Otherwise LGTM

integration/query_frontend_test.go

alanprot · 2024-07-03T21:33:05Z

Ok.. Ship it!

I think I would still try to deprecate the overlapping flags in a follow-up PR.

pull-request-size bot added the size/XXL label Jun 11, 2024

erlan-z force-pushed the query-rejection branch 2 times, most recently from 83ac5d3 to c95e7d1 Compare June 13, 2024 00:37

friedrichg reviewed Jun 13, 2024

docs/configuration/config-file-reference.md

erlan-z force-pushed the query-rejection branch 2 times, most recently from 3e36d7a to b0300fd Compare June 14, 2024 01:31

erlan-z marked this pull request as ready for review June 14, 2024 01:58

erlan-z requested a review from friedrichg June 14, 2024 01:58

erlan-z force-pushed the query-rejection branch from 01812dc to 6bfecc2 Compare June 18, 2024 18:46

friedrichg reviewed Jun 19, 2024

yeya24 reviewed Jun 19, 2024

pkg/querier/tripperware/roundtrip.go

yeya24 reviewed Jun 19, 2024

docs/configuration/config-file-reference.md

yeya24 reviewed Jun 20, 2024

pkg/querier/tripperware/query_attribute_matcher.go

pkg/querier/tripperware/query_attribute_matcher.go

erlan-z force-pushed the query-rejection branch 2 times, most recently from 5220716 to 64268bd Compare June 21, 2024 18:29

CharlieTLe reviewed Jun 21, 2024

docs/configuration/config-file-reference.md

erlan-z force-pushed the query-rejection branch from 64268bd to b4737ef Compare June 21, 2024 23:31

erlan-z requested review from friedrichg, yeya24 and CharlieTLe June 25, 2024 22:18

erlan-z force-pushed the query-rejection branch 3 times, most recently from 2b94124 to 4b4f1c1 Compare June 26, 2024 03:27

yeya24 reviewed Jun 26, 2024

erlan-z requested a review from yeya24 June 26, 2024 18:33

erlan-z force-pushed the query-rejection branch from 5461a54 to 6d10723 Compare June 28, 2024 14:10

yeya24 approved these changes Jun 28, 2024

alanprot reviewed Jul 2, 2024

friedrichg reviewed Jul 3, 2024

integration/query_frontend_test.go

erlan-z added 13 commits July 3, 2024 10:22

query rejection

12de492

- query rejection configurations are added. It uses QueryAttributes which is used by priority queue - added tests. priority queue - priority queue was changed to include step, agent, dashboard, panel configs. Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>

doc-generator

c820096

- add check to eliminate duplicate root blocks in case struct was used several times. query rejection generate docs Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>

query rejection

6088633

small fixes. Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>

query rejection

dccf2e4

added changelog Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>

query rejection - update docs

63cfa4c

Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>

query rejection - add time range attribute

211b153

Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>

query rejection - address comments

cb8ad44

Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>

query rejection - add integration test, fix query_step_limit

0dc3f1a

Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>

query rejection - fix imports and improve doc.

bf3e1cb

Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>

query rejection - address comments.

739cf23

Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>

query rejection - remove step limit check for subqueries.

0f6a230

Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>

query rejection - add API type

572c8ca

Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>

query rejection - address comments

c0d9115

Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>

erlan-z force-pushed the query-rejection branch from 6d10723 to c0d9115 Compare July 3, 2024 18:19

erlan-z requested review from friedrichg and alanprot July 3, 2024 18:55

alanprot approved these changes Jul 3, 2024

friedrichg approved these changes Jul 4, 2024

friedrichg merged commit defc3c3 into cortexproject:master Jul 4, 2024
16 checks passed

harry671003 mentioned this pull request Jul 31, 2024

Blocking certain queries in Thanos thanos-io/thanos#7579

Open

erlan-z mentioned this pull request Aug 7, 2024

Query Rejection for Metadata Queries Bug #6143

Closed
