Monitor and protect distributor from OOMKilled due to too many in progress requests #5917
Conversation
…istributor to one ingester. Created inflight request limit per ingester client. Signed-off-by: Alex Le <leqiyue@amazon.com>
How does this relate to ? Or how is it better?
# Max inflight push requests that this distributor can handle. This limit is
# per-distributor, not per-tenant. Additional requests will be rejected. 0 =
# unlimited.
# CLI flag: -distributor.instance-limits.max-inflight-push-requests
[max_inflight_push_requests: <int> | default = 0]
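For context, a compact, hypothetical sketch of how a distributor-wide gate like the one quoted above could be shaped (names and types are assumptions, not the actual Cortex code): a single counter guards every incoming push, regardless of which ingesters it fans out to.

```go
package distributor

import (
	"errors"
	"sync/atomic"
)

var errTooManyInflightPushRequests = errors.New("too many inflight push requests in distributor")

type instanceLimits struct {
	maxInflightPushRequests int64 // 0 = unlimited
	inflightPushRequests    atomic.Int64
}

// startPush reserves a slot for one push request. It returns a release
// function on success, or an error once the distributor-wide limit is hit.
func (l *instanceLimits) startPush() (func(), error) {
	n := l.inflightPushRequests.Add(1)
	if l.maxInflightPushRequests > 0 && n > l.maxInflightPushRequests {
		l.inflightPushRequests.Add(-1)
		return nil, errTooManyInflightPushRequests
	}
	return func() { l.inflightPushRequests.Add(-1) }, nil
}
```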
Signed-off-by: Alex Le <leqiyue@amazon.com>
@friedrichg For example, if RF is set to 3, the distributor sends the same time series to 3 ingesters. According to the DoBatch logic, the distributor can accept a new inflight request once 2 out of 3 requests have succeeded. However, if one of the ingesters is unhealthy, the request to the bad ingester hangs there until it times out, while the distributor keeps accepting new inflight requests. In this case, the distributor's own inflight request count does not grow, because 2 out of 3 ingesters returned 2xx and the distributor considers the request completed. Meanwhile, the number of requests from the distributor to that bad ingester keeps increasing, because requests to it take much longer to complete. With the new config introduced here, the number of requests from the distributor to that bad ingester can be capped, making them fail fast. Moreover, this new config can also prevent a distributor from getting overloaded if the network connection between this distributor and the ingesters is impaired.
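A minimal sketch of the per-ingester-client idea described above (the wrapper type, function signature, and gauge name are assumptions for illustration, not the PR's exact code): each client keeps its own inflight counter, exports it per ingester so the limit can be tuned from the metric, and fails fast once the limit is reached.

```go
package ingesterclient

import (
	"context"
	"errors"
	"sync/atomic"

	"github.com/prometheus/client_golang/prometheus"
)

var errTooManyInflight = errors.New("too many inflight push requests to this ingester")

// Per-ingester gauge, labelled by ingester address; registered once at startup.
var inflightGauge = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "cortex_distributor_ingester_client_inflight_push_requests", // illustrative name
	Help: "Inflight push requests from this distributor to each ingester.",
}, []string{"ingester"})

type limitedClient struct {
	addr     string
	limit    int64 // max inflight push requests per ingester client; 0 = unlimited
	inflight atomic.Int64
}

// Push wraps the real gRPC call. It fails fast once the per-client limit is
// reached, so a single slow or unhealthy ingester cannot pile requests up in
// the distributor's memory.
func (c *limitedClient) Push(ctx context.Context, do func(context.Context) error) error {
	n := c.inflight.Add(1)
	inflightGauge.WithLabelValues(c.addr).Set(float64(n))
	defer func() {
		inflightGauge.WithLabelValues(c.addr).Set(float64(c.inflight.Add(-1)))
	}()

	if c.limit > 0 && n > c.limit {
		return errTooManyInflight
	}
	return do(ctx)
}
```

This is what distinguishes it from the distributor-wide gate: the per-client counter keeps growing for the one bad ingester even while the distributor-level inflight count stays flat, so only the requests headed to that ingester are rejected.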
…r_inflight_push_requests Signed-off-by: Alex Le <leqiyue@amazon.com>
Signed-off-by: Alex Le <leqiyue@amazon.com>
Signed-off-by: Alex Le <leqiyue@amazon.com>
It's a great idea! Thanks for explaining. I have just a minor nit.
LGTM
Are we cleaning up the metric when we close the client? Otherwise we will have dangling metrics during deployments or any other ingester replacement (as the metric is per IP).
Will create a new PR to address this.
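Continuing the hypothetical sketch above, one possible shape of that cleanup: delete the client's label series when the client is closed, so the per-ingester metric does not dangle after a deployment or ingester replacement.

```go
// Close removes this client's series from the per-ingester gauge before the
// client goes away (e.g. because the ingester was replaced during a rollout).
func (c *limitedClient) Close() error {
	inflightGauge.DeleteLabelValues(c.addr)
	// ...close the underlying gRPC connection here in real code.
	return nil
}
```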
What this PR does:
The distributor can be overloaded if push requests to ingesters get slower. This causes the distributor to use more memory to hold the time series objects from those slow requests, and the distributor can get OOMKilled when large requests keep piling up.
To protect the distributor from this, max_inflight_push_requests is introduced for the ingester client. This is a hard limit on how many requests one distributor can send through one ingester client; requests exceeding this limit are dropped. cortex_ingester_client_request_count is added to monitor how many requests were sent through each ingester client on each distributor, which helps set max_inflight_push_requests to a proper value.
Which issue(s) this PR fixes:
NA
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]