Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

query rejection #6005

Merged
merged 13 commits into from
Jul 4, 2024
Merged

query rejection #6005

merged 13 commits into from
Jul 4, 2024

Conversation

erlan-z
Copy link
Contributor

@erlan-z erlan-z commented Jun 11, 2024

What this PR does:
This PR introduces a new functionality to reject queries based on certain attributes to protect our services from resource-exhausting queries that can lead to service disruptions, such as OOM kills or availability drop for other queries. This change provides a mechanism to block heavy queries before they reach our services, ensuring operational stability and resource efficiency.

Changes Introduced

  • Query Rejection Mechanism:
    Rejects queries based on query api_type, regex, time window, time range, step size, user-agent regex, dashboard UID, and panel ID.
    Configured through runtime config, allowing per-tenant management.
  • Query Priority Mechanism:
    Query attributes from the existing query_priority functionality were utilized and extended to support the new properties.
  • Documentation Generator Fix:
    Fixed doc-generator to remove duplicate documentation on repeated fields on slices.

Usage details
Both query priority and query rejection mechanisms utilize the same config object (query_attribute) to determine if a query matches specified criteria. The query_attribute config defines requirements for a query and results in a match or no match based on it's properties. The query_attribute does not directly reject or prioritize queries; rather, the action (query rejection or priority) is determined by the config where it was used.

All provided properties in query_attribute use an AND operation to decide if a query matches. Matching is only done on fields provided in the config.

Example config:

    query_rejection:
      enabled: true
      query_attributes:
        - regex: .*ALERT.*
          query_step_limit:
            min: 6s
            max: 20s
          dashboard_uid: "dash123"

Example query:
curl 'localhost:8005/prometheus/api/v1/query?query=someCustomALERTquery&time=1718383304&end=1718386904&step=7s' -H "User-Agent: other" -H "X-Dashboard-Uid: dash123"

In this example, queries containing the string "ALERT" with a step between 6s and 20s, and a dashboard UID of "dash123" will be rejected. If any criteria specified in query_attribute properties are not met, such as a step of 30s, the query will not be rejected. In the above example, undefined properties (user_agent, time_window, etc.) are ignored, and the query is not checked against them.

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@erlan-z erlan-z force-pushed the query-rejection branch 2 times, most recently from 83ac5d3 to c95e7d1 Compare June 13, 2024 00:37
@erlan-z erlan-z force-pushed the query-rejection branch 2 times, most recently from 3e36d7a to b0300fd Compare June 14, 2024 01:31
@erlan-z erlan-z marked this pull request as ready for review June 14, 2024 01:58
@erlan-z erlan-z requested a review from friedrichg June 14, 2024 01:58
Copy link
Member

@friedrichg friedrichg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erlan-z This is pretty cool. Thank you for doing this!

Things that I would love to see:

  • use warnExperimentalUse
    func WarnExperimentalUse(feature string) {
  • Document the experimental feature on docs/configuration/v1-guarantees.md
  • Integration tests

docs/configuration/config-file-reference.md Show resolved Hide resolved
docs/configuration/config-file-reference.md Outdated Show resolved Hide resolved
docs/configuration/config-file-reference.md Outdated Show resolved Hide resolved
docs/configuration/config-file-reference.md Outdated Show resolved Hide resolved
pkg/querier/tripperware/query_attribute_matcher.go Outdated Show resolved Hide resolved
pkg/querier/tripperware/query_attribute_matcher.go Outdated Show resolved Hide resolved
pkg/querier/tripperware/query_attribute_matcher.go Outdated Show resolved Hide resolved
@erlan-z erlan-z force-pushed the query-rejection branch 2 times, most recently from 5220716 to 64268bd Compare June 21, 2024 18:29
@erlan-z
Copy link
Contributor Author

erlan-z commented Jun 25, 2024

@erlan-z This is pretty cool. Thank you for doing this!

Things that I would love to see:

  • use warnExperimentalUse
    func WarnExperimentalUse(feature string) {
  • Document the experimental feature on docs/configuration/v1-guarantees.md
  • Integration tests
  • Documented experimental feature on docs/configuration/v1-guarantees.md
  • Added integration tests.
  • I didn't use warnExperimentalUse because this feature is enabled or disabled per tenant, not for the entire system. Additionally, it can change at runtime, so I decided not to add the experimental log or metric. Please let me know if you still think we should add it.

@erlan-z erlan-z force-pushed the query-rejection branch 3 times, most recently from 2b94124 to 4b4f1c1 Compare June 26, 2024 03:27
pkg/querier/tripperware/query_attribute_matcher.go Outdated Show resolved Hide resolved
pkg/querier/tripperware/query_attribute_matcher.go Outdated Show resolved Hide resolved
pkg/querier/tripperware/query_attribute_matcher.go Outdated Show resolved Hide resolved
pkg/util/validation/limits.go Outdated Show resolved Hide resolved
integration/e2ecortex/client.go Outdated Show resolved Hide resolved
@erlan-z erlan-z requested a review from yeya24 June 26, 2024 18:33
@erlan-z
Copy link
Contributor Author

erlan-z commented Jun 26, 2024

For series/label/label_values, at least one match should match the regex.

I wonder why that's the case. How useful it is to match only one?

As we just go through that matchers and send separate query for each of them, if there is heavy matcher, then whole query will fail. I think if we want to reject/prioritize that heavy queries that match regex, than matching single one should be enough.

@yeya24
Copy link
Contributor

yeya24 commented Jun 26, 2024

As we just go through that matchers and send separate query for each of them, if there is heavy matcher, then whole query will fail. I think if we want to reject/prioritize that heavy queries that match regex, than matching single one should be enough.

Let's say we know a very heavy matcher {cluster="us-west-2"}. Do we reject the query if users have this matcher in the request? I think it is only expensive if they use this matcher along but should be fine as long as they add more matchers.

Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing all the comments! The change looks good to me

Comment on lines 5394 to 5410
time_range_limit:
# Query time range should be above or equal to this value to match. If set to
# 0, it won't be checked.
[min: <int> | default = 0]

# Query time range should be below or equal to this value to match. If set to
# 0, it won't be checked.
[max: <int> | default = 0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we document a bit the syntax here? i suspect its something like -1d, 1h etc

Copy link
Member

@alanprot alanprot Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we incorporate this config here and deprecate it? Those configs seems to overlap?

# Limit the query time range (end - start time of range query parameter and max
# - min of data fetched time range). This limit is enforced in the
# query-frontend and ruler (on the received query). 0 to disable.
# CLI flag: -store.max-query-length
[max_query_length: <duration> | default = 0s]

Should we deprecated this one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as:

# Maximum duration into the future you can query. 0 to disable.
# CLI flag: -querier.max-query-into-future
[max_query_into_future: <duration> | default = 10m]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated time-range docs

Comment on lines +5406 to +5422
query_step_limit:
# Query step should be above or equal to this value to match. If set to 0, it
# won't be checked.
[min: <int> | default = 0]

# Query step should be below or equal to this value to match. If set to 0, it
# won't be checked.
[max: <int> | default = 0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we incorporate this config here and deprecate it?

# Max number of steps allowed for every subquery expression in query. Number of
# steps is calculated using subquery range / step. A value > 0 enables it.
# CLI flag: -querier.max-subquery-steps
[max_subquery_steps: <int> | default = 0]

I'm trying to not have 2 ways of setting query limits.

@alanprot
Copy link
Member

alanprot commented Jul 2, 2024

I think this LGTM but im afraid of making the cortex config even more complex. I wonder if those kinds of limits would not live better in another component like the auth gateway.

Said that I'm ok with the change but i think we should at least revisit all the other query limits that we already have on cortex and try to unify/simplify them here - so we have a one stop place to define query limits. This can be done in a follow up PR.

Ex of current query limits:

# Maximum duration into the future you can query. 0 to disable.
# CLI flag: -querier.max-query-into-future
[max_query_into_future: <duration> | default = 10m]

# Maximum duration into the future you can query. 0 to disable.
# CLI flag: -querier.max-query-into-future
[max_query_into_future: <duration> | default = 10m]


# Limit the query time range (end - start time of range query parameter and max
# - min of data fetched time range). This limit is enforced in the
# query-frontend and ruler (on the received query). 0 to disable.
# CLI flag: -store.max-query-length
[max_query_length: <duration> | default = 0s]

Thoughts @friedrichg @CharlieTLe ?

@erlan-z
Copy link
Contributor Author

erlan-z commented Jul 2, 2024

I think this LGTM but im afraid of making the cortex config even more complex. I wonder if those kinds of limits would not live better in another component like the auth gateway.

Said that I'm ok with the change but i think we should at least revisit all the other query limits that we already have on cortex and try to unify/simplify them here - so we have a one stop place to define query limits. This can be done in a follow up PR.

Ex of current query limits:

# Maximum duration into the future you can query. 0 to disable.
# CLI flag: -querier.max-query-into-future
[max_query_into_future: <duration> | default = 10m]

# Maximum duration into the future you can query. 0 to disable.
# CLI flag: -querier.max-query-into-future
[max_query_into_future: <duration> | default = 10m]


# Limit the query time range (end - start time of range query parameter and max
# - min of data fetched time range). This limit is enforced in the
# query-frontend and ruler (on the received query). 0 to disable.
# CLI flag: -store.max-query-length
[max_query_length: <duration> | default = 0s]

Thoughts @friedrichg @CharlieTLe ?

Thanks for the review.
Regarding incorporating all those limits, I initially envisioned query_rejection as a mechanism to protect our services from heavy queries, to be used by operators during events. The other mentioned configurations are intended to set predefined limits on queries. However, there seems to be some overlap, and having these configurations spread across different locations complicates things. Technically, they are all limits on queries and can be combined and managed under query_rejection.
I'm unsure which approach is best: differentiating these concepts by having configurations in separate places or unifying them under query_rejection as a single location for managing all query-related limits.

The main argument I had against placing it in the auth gateway is that the auth gateway is primarily for authentication and shouldn't be aware of the request content. It shouldn't be parsing the query and used to limit queries based on their characteristics.

Copy link
Member

@friedrichg friedrichg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alanprot
Combining flags/options is a great idea. I support that.

I think query priority was very interesting when it got merged. This is just very related to it. Moving things over to auth-gateway would require moving that too. We don't have many "facilities" in auth-gateway like limits per tenant, so it would be hard work. Maybe even importing cortex there to use things would be a prerequisite. Or creating a common library.

On the other hand, I mentioned the rejection feature to a couple people and they are very interested so I think is a bad idea to block this feature too much.

I have only a minor nit on the integration tests. Otherwise LGTM

integration/query_frontend_test.go Outdated Show resolved Hide resolved
erlan-z added 13 commits July 3, 2024 10:22
     - query rejection configurations are added. It uses QueryAttributes which is used by priority queue
     - added tests.

     priority queue
     - priority queue was changed to include step, agent, dashboard, panel configs.

Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>
  - add check to eliminate duplicate root blocks in case struct was used several times.
query rejection
     generate docs

Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>
     small fixes.

Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>
     added changelog

Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>
Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>
Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>
Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>
Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>
Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>
Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>
Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>
Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>
Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>
@erlan-z erlan-z force-pushed the query-rejection branch from 6d10723 to c0d9115 Compare July 3, 2024 18:19
@erlan-z erlan-z requested review from friedrichg and alanprot July 3, 2024 18:55
@alanprot
Copy link
Member

alanprot commented Jul 3, 2024

Ok.. Ship it!

I think I would still try to deprecate the overlapping flags in a following PR

@friedrichg friedrichg merged commit defc3c3 into cortexproject:master Jul 4, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants