Monitor and protect distributor from OOMKilled due to too many in progress requests #5917
Conversation
…istributor to one ingester. Created inflight request limit per ingester client. Signed-off-by: Alex Le <leqiyue@amazon.com>
How does this relate to ? Or how is it better?
# Max inflight push requests that this distributor can handle. This limit is
# per-distributor, not per-tenant. Additional requests will be rejected. 0 =
# unlimited.
# CLI flag: -distributor.instance-limits.max-inflight-push-requests
[max_inflight_push_requests: <int> | default = 0]
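For context, a compact, hypothetical sketch of how a distributor-wide gate like the one quoted above could be shaped (names and types are assumptions, not the actual Cortex code): a single counter guards every incoming push, regardless of which ingesters it fans out to.

```go
package distributor

import (
	"errors"
	"sync/atomic"
)

var errTooManyInflightPushRequests = errors.New("too many inflight push requests in distributor")

type instanceLimits struct {
	maxInflightPushRequests int64 // 0 = unlimited
	inflightPushRequests    atomic.Int64
}

// startPush reserves a slot for one push request. It returns a release
// function on success, or an error once the distributor-wide limit is hit.
func (l *instanceLimits) startPush() (func(), error) {
	n := l.inflightPushRequests.Add(1)
	if l.maxInflightPushRequests > 0 && n > l.maxInflightPushRequests {
		l.inflightPushRequests.Add(-1)
		return nil, errTooManyInflightPushRequests
	}
	return func() { l.inflightPushRequests.Add(-1) }, nil
}
```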
Signed-off-by: Alex Le <leqiyue@amazon.com>
@friedrichg For example, if RF is set to 3, the distributor sends the same time series to 3 ingesters. According to the DoBatch logic, the distributor can accept a new inflight request once 2 out of 3 requests have succeeded. However, if one of the ingesters is unhealthy, the request to the bad ingester hangs there until it times out, while the distributor keeps accepting new inflight requests. In this case, the distributor's own inflight request count does not grow, because 2 out of 3 ingesters returned 2xx and the distributor considers the request completed. Meanwhile, the number of requests from the distributor to that bad ingester keeps increasing, because requests to it take much longer to complete. With the new config introduced here, the number of requests from the distributor to that bad ingester can be capped, making them fail fast. Moreover, this new config can also prevent a distributor from getting overloaded if the network connection between this distributor and the ingesters is impaired.
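A minimal sketch of the per-ingester-client idea described above (the wrapper type, function signature, and gauge name are assumptions for illustration, not the PR's exact code): each client keeps its own inflight counter, exports it per ingester so the limit can be tuned from the metric, and fails fast once the limit is reached.

```go
package ingesterclient

import (
	"context"
	"errors"
	"sync/atomic"

	"github.com/prometheus/client_golang/prometheus"
)

var errTooManyInflight = errors.New("too many inflight push requests to this ingester")

// Per-ingester gauge, labelled by ingester address; registered once at startup.
var inflightGauge = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "cortex_distributor_ingester_client_inflight_push_requests", // illustrative name
	Help: "Inflight push requests from this distributor to each ingester.",
}, []string{"ingester"})

type limitedClient struct {
	addr     string
	limit    int64 // max inflight push requests per ingester client; 0 = unlimited
	inflight atomic.Int64
}

// Push wraps the real gRPC call. It fails fast once the per-client limit is
// reached, so a single slow or unhealthy ingester cannot pile requests up in
// the distributor's memory.
func (c *limitedClient) Push(ctx context.Context, do func(context.Context) error) error {
	n := c.inflight.Add(1)
	inflightGauge.WithLabelValues(c.addr).Set(float64(n))
	defer func() {
		inflightGauge.WithLabelValues(c.addr).Set(float64(c.inflight.Add(-1)))
	}()

	if c.limit > 0 && n > c.limit {
		return errTooManyInflight
	}
	return do(ctx)
}
```

This is what distinguishes it from the distributor-wide gate: the per-client counter keeps growing for the one bad ingester even while the distributor-level inflight count stays flat, so only the requests headed to that ingester are rejected.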
…r_inflight_push_requests Signed-off-by: Alex Le <leqiyue@amazon.com>
Signed-off-by: Alex Le <leqiyue@amazon.com>
Signed-off-by: Alex Le <leqiyue@amazon.com>
It's a great idea! Thanks for explaining. I have just a minor nit.
LGTM
Are we cleaning up the metric when we close the client? Otherwise we will have dangling metrics during deployments or any other ingester replacement (as the metric is per IP).
Will create a new PR to address this.
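Continuing the hypothetical sketch above, one possible shape of that cleanup: delete the client's label series when the client is closed, so the per-ingester metric does not dangle after a deployment or ingester replacement.

```go
// Close removes this client's series from the per-ingester gauge before the
// client goes away (e.g. because the ingester was replaced during a rollout).
func (c *limitedClient) Close() error {
	inflightGauge.DeleteLabelValues(c.addr)
	// ...close the underlying gRPC connection here in real code.
	return nil
}
```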
What this PR does:
The distributor can be overloaded if push requests to ingesters get slower. This causes the distributor to use more memory to hold the time series objects from those slow requests, and the distributor can get OOMKilled when large requests keep piling up.
To protect the distributor from this, max_inflight_push_requests is introduced for the ingester client. This is a hard limit on how many requests one distributor can send through one ingester client; requests exceeding this limit are dropped. cortex_ingester_client_request_count is added to monitor how many requests were sent through each ingester client on each distributor, which helps set max_inflight_push_requests to a proper value.
Which issue(s) this PR fixes:
NA
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]