query rejection #6005
@erlan-z This is pretty cool. Thank you for doing this!
Things that I would love to see:
- use warnExperimentalUse
  cortex/pkg/util/log/experimental.go
  Line 18 in 6dd64fc
  func WarnExperimentalUse(feature string) {
- Document the experimental feature on docs/configuration/v1-guarantees.md
- Integration tests
Since we just iterate over the matchers and send a separate query for each of them, the whole query will fail if any one matcher is heavy. If we want to reject or prioritize heavy queries that match the regex, I think matching a single matcher should be enough.
Let's say we know a very heavy matcher
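A minimal sketch of the "matching a single one is enough" idea, for illustration only. The `anyMatches` helper, the `heavyMatcher` pattern, and the matcher strings are hypothetical and are not taken from the PR's code:

```go
package main

import (
	"fmt"
	"regexp"
)

// heavyMatcher is a hypothetical pattern standing in for a matcher that is
// known to be expensive.
var heavyMatcher = regexp.MustCompile("heavy_metric")

// anyMatches reports whether at least one of the matchers matches re.
// Under this rule a single heavy matcher is enough to flag the whole request.
func anyMatches(re *regexp.Regexp, matchers []string) bool {
	for _, m := range matchers {
		if re.MatchString(m) {
			return true
		}
	}
	return false
}

func main() {
	mixed := []string{`{job="api"}`, `heavy_metric{env="prod"}`}
	light := []string{`{job="api"}`}

	fmt.Println(anyMatches(heavyMatcher, mixed)) // true
	fmt.Println(anyMatches(heavyMatcher, light)) // false
}
```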
Thanks for addressing all the comments! The change looks good to me
time_range_limit:
  # Query time range should be above or equal to this value to match. If set to
  # 0, it won't be checked.
  [min: <int> | default = 0]

  # Query time range should be below or equal to this value to match. If set to
  # 0, it won't be checked.
  [max: <int> | default = 0]
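The min/max semantics described in these comments can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `TimeRangeLimit`, `exampleLimit`, and `exampleEnd` are hypothetical names, the bounds are modeled as `time.Duration` values, and the time range is taken to be `end - start`.

```go
package main

import (
	"fmt"
	"time"
)

// TimeRangeLimit is a hypothetical model of time_range_limit.
// A zero Min or Max means that bound is not checked.
type TimeRangeLimit struct {
	Min, Max time.Duration
}

// Matches reports whether the query time range (end - start) lies within
// the configured bounds. Both bounds are inclusive.
func (l TimeRangeLimit) Matches(start, end time.Time) bool {
	r := end.Sub(start)
	if l.Min != 0 && r < l.Min {
		return false
	}
	if l.Max != 0 && r > l.Max {
		return false
	}
	return true
}

// Example values used below; both are made up for the sketch.
var (
	exampleLimit = TimeRangeLimit{Min: time.Hour, Max: 24 * time.Hour}
	exampleEnd   = time.Unix(1718386904, 0)
)

func main() {
	// A 2h range falls inside [1h, 24h].
	fmt.Println(exampleLimit.Matches(exampleEnd.Add(-2*time.Hour), exampleEnd)) // true
	// A 30m range is below the minimum.
	fmt.Println(exampleLimit.Matches(exampleEnd.Add(-30*time.Minute), exampleEnd)) // false
	// With both bounds at 0 nothing is checked, so any range matches.
	fmt.Println(TimeRangeLimit{}.Matches(exampleEnd.Add(-720*time.Hour), exampleEnd)) // true
}
```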
Can we document the syntax here a bit? I suspect it's something like -1d, 1h, etc.
Should we incorporate this config here and deprecate it? Those configs seem to overlap.
# Limit the query time range (end - start time of range query parameter and max
# - min of data fetched time range). This limit is enforced in the
# query-frontend and ruler (on the received query). 0 to disable.
# CLI flag: -store.max-query-length
[max_query_length: <duration> | default = 0s]
Should we deprecate this one?
Same as:
# Maximum duration into the future you can query. 0 to disable.
# CLI flag: -querier.max-query-into-future
[max_query_into_future: <duration> | default = 10m]
updated time-range docs
query_step_limit:
  # Query step should be above or equal to this value to match. If set to 0, it
  # won't be checked.
  [min: <int> | default = 0]

  # Query step should be below or equal to this value to match. If set to 0, it
  # won't be checked.
  [max: <int> | default = 0]
Should we incorporate this config here and deprecate it?
# Max number of steps allowed for every subquery expression in query. Number of
# steps is calculated using subquery range / step. A value > 0 enables it.
# CLI flag: -querier.max-subquery-steps
[max_subquery_steps: <int> | default = 0]
I'm trying to not have 2 ways of setting query limits.
I think this LGTM, but I'm afraid of making the Cortex config even more complex. I wonder whether these kinds of limits would live better in another component, such as the auth gateway. That said, I'm OK with the change, but I think we should at least revisit all the other query limits that we already have in Cortex and try to unify/simplify them here, so we have a one-stop place to define query limits. This can be done in a follow-up PR. Examples of current query limits:
Thoughts @friedrichg @CharlieTLe?
Thanks for the review. The main argument I had against placing it in the auth gateway is that the auth gateway is primarily for authentication and shouldn't be aware of the request content. It shouldn't parse the query or be used to limit queries based on their characteristics.
@alanprot
Combining flags/options is a great idea. I support that.
I think query priority was very interesting when it got merged. This is just very related to it. Moving things over to auth-gateway would require moving that too. We don't have many "facilities" in auth-gateway like limits per tenant, so it would be hard work. Maybe even importing cortex there to use things would be a prerequisite. Or creating a common library.
On the other hand, I mentioned the rejection feature to a couple of people and they are very interested, so I think it is a bad idea to block this feature too much.
I have only a minor nit on the integration tests. Otherwise LGTM
- query rejection configurations are added. It uses QueryAttributes which is used by priority queue - added tests. priority queue - priority queue was changed to include step, agent, dashboard, panel configs. Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>
- add check to eliminate duplicate root blocks in case struct was used several times. query rejection generate docs Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>
small fixes. Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>
added changelog Signed-off-by: Erlan Zholdubai uulu <erlanz@amazon.com>
Ok.. Ship it! I think I would still try to deprecate the overlapping flags in a follow-up PR.
What this PR does:
This PR introduces new functionality to reject queries based on certain attributes, protecting our services from resource-exhausting queries that can lead to service disruptions such as OOM kills or availability drops for other queries. This change provides a mechanism to block heavy queries before they reach our services, ensuring operational stability and resource efficiency.
Changes Introduced
Rejects queries based on query api_type, regex, time window, time range, step size, user-agent regex, dashboard UID, and panel ID.
Configured through runtime config, allowing per-tenant management.
Query attributes from the existing query_priority functionality were utilized and extended to support the new properties.
Fixed doc-generator to remove duplicate documentation on repeated fields on slices.
Usage details
Both the query priority and query rejection mechanisms use the same config object (query_attribute) to determine whether a query matches the specified criteria. The query_attribute config defines requirements for a query and results in a match or no match based on its properties. The query_attribute does not directly reject or prioritize queries; rather, the action (query rejection or priority) is determined by the config where it is used.
All provided properties in query_attribute use an AND operation to decide if a query matches. Matching is only done on fields provided in the config.
Example config:
Example query:
curl 'localhost:8005/prometheus/api/v1/query?query=someCustomALERTquery&time=1718383304&end=1718386904&step=7s' -H "User-Agent: other" -H "X-Dashboard-Uid: dash123"
In this example, queries containing the string "ALERT" with a step between 6s and 20s, and a dashboard UID of "dash123" will be rejected. If any criteria specified in query_attribute properties are not met, such as a step of 30s, the query will not be rejected. In the above example, undefined properties (user_agent, time_window, etc.) are ignored, and the query is not checked against them.
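The AND semantics described above can be sketched in Go. This is only an illustration of the matching rule, not the code from this PR: the `QueryAttribute`, `StepLimit`, and `Query` types, the `exampleAttribute` helper, and the unanchored `ALERT` pattern are assumptions chosen to mirror the example, and the sketch does not address how an attribute with no properties set should behave.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// StepLimit is a hypothetical stand-in for query_step_limit.
// A zero bound means that bound is not checked.
type StepLimit struct {
	Min, Max time.Duration
}

// QueryAttribute is a hypothetical, trimmed-down model of query_attribute.
// A nil Regex or empty DashboardUID means the property is not configured.
type QueryAttribute struct {
	Regex        *regexp.Regexp
	StepLimit    StepLimit
	DashboardUID string
}

// Query holds the request properties this sketch matches against.
type Query struct {
	Expr         string
	Step         time.Duration
	DashboardUID string
}

// Matches reports whether q satisfies every configured property (logical AND).
// Properties that are not configured are skipped.
func (a QueryAttribute) Matches(q Query) bool {
	if a.Regex != nil && !a.Regex.MatchString(q.Expr) {
		return false
	}
	if a.StepLimit.Min != 0 && q.Step < a.StepLimit.Min {
		return false
	}
	if a.StepLimit.Max != 0 && q.Step > a.StepLimit.Max {
		return false
	}
	if a.DashboardUID != "" && a.DashboardUID != q.DashboardUID {
		return false
	}
	return true
}

// exampleAttribute mirrors the example in the description: queries containing
// "ALERT", with a step between 6s and 20s, from dashboard "dash123".
func exampleAttribute() QueryAttribute {
	return QueryAttribute{
		Regex:        regexp.MustCompile("ALERT"),
		StepLimit:    StepLimit{Min: 6 * time.Second, Max: 20 * time.Second},
		DashboardUID: "dash123",
	}
}

func main() {
	attr := exampleAttribute()

	q := Query{Expr: "someCustomALERTquery", Step: 7 * time.Second, DashboardUID: "dash123"}
	fmt.Println(attr.Matches(q)) // true: every configured property matches

	q.Step = 30 * time.Second
	fmt.Println(attr.Matches(q)) // false: the step is above the 20s maximum
}
```

In a rejection config a match would mean the query is rejected, while in a priority config the same match would select a priority instead.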
Which issue(s) this PR fixes:
Fixes #
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]