Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Monitor and protect distributor from OOMKilled due to too many in progress requests #5917

Merged

Conversation

alexqyle
Copy link
Contributor

@alexqyle alexqyle commented May 2, 2024

What this PR does:
Distributor could be overloaded if push requests to ingester getting slower. This would cause distributor uses more memory to hold timeseries objects from those slow requests and distributor could get OOMKilled when large requests kept piling up.

To protect distributor from this,

  • New config max_inflight_push_requests is introduced for ingester client. This is hard limit on how many request one distributor could send through one ingester client. Requests exceed this limit would be dropped.
  • New metric cortex_ingester_client_request_count is created to keep monitoring how many requests were sent through each ingester client on each distributor. This could help to set max_inflight_push_requests with proper value.

Which issue(s) this PR fixes:
NA

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

…istributor to one ingester. Created inflight request limit per ingester client.

Signed-off-by: Alex Le <leqiyue@amazon.com>
Copy link
Member

@friedrichg friedrichg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this relate to ? or how is it better?

  # Max inflight push requests that this distributor can handle. This limit is
  # per-distributor, not per-tenant. Additional requests will be rejected. 0 =
  # unlimited.
  # CLI flag: -distributor.instance-limits.max-inflight-push-requests
  [max_inflight_push_requests: <int> | default = 0]

Signed-off-by: Alex Le <leqiyue@amazon.com>
@pull-request-size pull-request-size bot added size/M and removed size/S labels May 2, 2024
@alexqyle
Copy link
Contributor Author

alexqyle commented May 3, 2024

How does this relate to ? or how is it better?

  # Max inflight push requests that this distributor can handle. This limit is
  # per-distributor, not per-tenant. Additional requests will be rejected. 0 =
  # unlimited.
  # CLI flag: -distributor.instance-limits.max-inflight-push-requests
  [max_inflight_push_requests: <int> | default = 0]

@friedrichg max_inflight_push_requests is controlling request coming into distributor. The new config is controlling request sending from distributor to ingester. With replication factor > 1 set, distributor would send time series to more than ingesters.

For example, if RF set o 3, distributor would send same time series to 3 ingesters. According to DoBatch logic, distributor could accept new inflight request when 2 out of 3 requests are succeeded. However, if one of the ingester is unhealthy, the request to bad ingester would hanging there until timeout. At same time, distributor is accepting new inflight requests. In this case, inflight request to distributor does not change since 2 out of 3 ingesters returned 2xx, distributor would consider this inflight request completed. While request number from distributor to ingester increased, because one ingester is bad. Request to that bad ingester would take longer to complete. With new config introduced, number of request from distributor to that bad ingester could have a limit and make it fail fast.

Moreover, this new config could also prevent one distributor got overloaded if network connect is impaired between this distributor and ingesters.

alexqyle added 3 commits May 2, 2024 18:01
…r_inflight_push_requests

Signed-off-by: Alex Le <leqiyue@amazon.com>
Signed-off-by: Alex Le <leqiyue@amazon.com>
Signed-off-by: Alex Le <leqiyue@amazon.com>
@pull-request-size pull-request-size bot added size/L and removed size/M labels May 4, 2024
@alexqyle alexqyle marked this pull request as ready for review May 4, 2024 00:12
Copy link
Member

@friedrichg friedrichg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a great idea!. Thanks for explaining. I have just a minor nit

pkg/ingester/client/client.go Outdated Show resolved Hide resolved
Signed-off-by: Alex Le <leqiyue@amazon.com>
Copy link
Member

@friedrichg friedrichg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@friedrichg friedrichg merged commit ba306c3 into cortexproject:master May 7, 2024
16 checks passed
@alanprot
Copy link
Member

alanprot commented May 7, 2024

Are we cleaning the metric when we close the client ? Otherwise we will have dangling metrics during deployment or any other ingested replacement .. (as the metric is per ip)

@alexqyle
Copy link
Contributor Author

alexqyle commented May 7, 2024

Are we cleaning the metric when we close the client ? Otherwise we will have dangling metrics during deployment or any other ingested replacement .. (as the metric is per ip)

Will create a new PR to address this

@alexqyle alexqyle deleted the ingester-client-inflight-limit branch May 7, 2024 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants