
fix(server): Add read and write timeouts #2412

Merged: 5 commits into kubernetes:main from fix/add-server-timeouts on Jun 17, 2024

Conversation

@Pokom (Contributor) commented Jun 5, 2024

What this PR does / why we need it:

There are a few documented scenarios where kube-state-metrics will lock up (#995, #1028). I believe a much simpler way to ensure kube-state-metrics doesn't lock up and require a restart before it can serve /metrics requests again is to add default read and write timeouts and to make them configurable. At Grafana, we've experienced a few scenarios where kube-state-metrics running in larger clusters falls behind and starts getting scraped multiple times. When this occurs, kube-state-metrics becomes completely unresponsive and requires a restart. This is fairly easy to reproduce (I'll provide a script in an issue) and causes other critical workloads (KEDA, VPA) to fail in weird ways.

Adds two flags:

  • server-read-timeout
  • server-write-timeout

Updates the metrics HTTP server to set the ReadTimeout and WriteTimeout to the configured values.

How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality)

Does not change.

Fixes #2413
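
In net/http terms, the change amounts to setting two fields on the http.Server that serves /metrics. A minimal, self-contained sketch of the idea, using the 60-second defaults that this PR eventually settled on; the stub handler and wiring are illustrative, not the PR's actual code:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	// Stub standing in for the real kube-state-metrics handler.
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("# metric families would be rendered here\n"))
	})

	srv := &http.Server{
		Addr:         ":8080",
		Handler:      mux,
		ReadTimeout:  60 * time.Second, // max time to read the entire request, including the body
		WriteTimeout: 60 * time.Second, // max time to write the response once the headers are read
	}
	// With these deadlines set, a stalled scrape can no longer hold a
	// connection open indefinitely; the server closes it instead.
	log.Fatal(srv.ListenAndServe())
}
```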

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 5, 2024
linux-foundation-easycla bot commented Jun 5, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 5, 2024
@k8s-ci-robot (Contributor) commented:

Welcome @Pokom!

It looks like this is your first PR to kubernetes/kube-state-metrics 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kube-state-metrics has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 5, 2024
@Pokom Pokom changed the title from "fix(server): Add read and write timeouts" to "[WIP] fix(server): Add read and write timeouts" on Jun 5, 2024
@Pokom Pokom marked this pull request as ready for review June 6, 2024 17:11
@Pokom Pokom changed the title from "[WIP] fix(server): Add read and write timeouts" to "fix(server): Add read and write timeouts" on Jun 6, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 6, 2024
@@ -146,6 +151,11 @@ func (o *Options) AddFlags(cmd *cobra.Command) {
o.cmd.Flags().Var(&o.Namespaces, "namespaces", fmt.Sprintf("Comma-separated list of namespaces to be enabled. Defaults to %q", &DefaultNamespaces))
o.cmd.Flags().Var(&o.NamespacesDenylist, "namespaces-denylist", "Comma-separated list of namespaces not to be enabled. If namespaces and namespaces-denylist are both set, only namespaces that are excluded in namespaces-denylist will be used.")
o.cmd.Flags().Var(&o.Resources, "resources", fmt.Sprintf("Comma-separated list of Resources to be enabled. Defaults to %q", &DefaultResources))

o.cmd.Flags().DurationVar(&o.ServerReadTimeout, "server-read-timeout", 30*time.Second, "The maximum duration for reading the entire request, including the body.")
@mrueg (Member) commented:

Can we move the durations into separate variables at the top of the file? I feel like it's easier to spot them when they are not hiding inside the flag setting.

@Pokom (Contributor, Author) replied:

Just to make sure I'm following: are you suggesting moving the default (30*time.Second) into a variable placed at the top? That seems entirely reasonable to me.

@Pokom (Contributor, Author) commented:

@mrueg I pushed a new commit that moves the default values into separate variables at the top of the file. Let me know how they look 👍🏻

I've also made the defaults 60 seconds, as I think this aligns more closely with default scrape intervals from Prometheus. I'm wary of setting them to 10s, which is the default timeout value for Prometheus scrape configs.

Given how long kube-state-metrics has existed without these values, this could be a problematic change for folks that have scrape intervals over 60s. I'm not sure how you'd like to handle that; alternatively, we could make the default value 0, which would effectively mean there is no timeout.
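
For context, the shape of the refactor being discussed looks roughly like this. The variable, type, and function names here are hypothetical; only the DurationVar call style, the flag names, and the 60-second defaults come from this thread:

```go
package options

import (
	"time"

	"github.com/spf13/cobra"
)

// Defaults hoisted to the top of the file so they are easy to spot,
// per review feedback; 60s roughly tracks Prometheus' default scrape interval.
var (
	defaultServerReadTimeout  = 60 * time.Second
	defaultServerWriteTimeout = 60 * time.Second
)

// Options holds only the fields relevant to this sketch.
type Options struct {
	ServerReadTimeout  time.Duration
	ServerWriteTimeout time.Duration
}

// addTimeoutFlags registers the two new flags against the defaults above.
func addTimeoutFlags(cmd *cobra.Command, o *Options) {
	cmd.Flags().DurationVar(&o.ServerReadTimeout, "server-read-timeout",
		defaultServerReadTimeout, "The maximum duration for reading the entire request, including the body.")
	cmd.Flags().DurationVar(&o.ServerWriteTimeout, "server-write-timeout",
		defaultServerWriteTimeout, "The maximum duration before timing out writes of the response.")
}
```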

@mrueg (Member) commented Jun 7, 2024:

Should we set timeouts for the telemetry server as well? https://github.com/kubernetes/kube-state-metrics/pull/2412/files#diff-dc8fe36bd1edf1b4cd03bacbd94b02cd4226717fe7e1a6474c534f5b5db30227R315

How did you choose the duration for each timeout setting? Should we provide any additional suggestions?

@Pokom (Contributor, Author) commented Jun 7, 2024:

> Should we set timeouts for the telemetry server as well? https://github.com/kubernetes/kube-state-metrics/pull/2412/files#diff-dc8fe36bd1edf1b4cd03bacbd94b02cd4226717fe7e1a6474c534f5b5db30227R315

I'm sure it would be a good idea to set timeouts for the telemetry server as well, but I don't have enough experience with that type of scraping to provide meaningful defaults 😅 I can follow up with an issue to set those in a future PR. How does that sound?

> How did you choose the duration for each timeout setting? Should we provide any additional suggestions?

These were picked somewhat arbitrarily, but the intent is to align them with sane scrape times. When running in our dev cluster with dozens of nodes, I aligned both with our scrape interval and timeout (15s). As far as guidance goes, I think the minimum value should be the scrape timeout. The safest option for the defaults would be to follow the default scrape_interval from Prometheus, which is 60s.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 7, 2024
@Pokom Pokom requested a review from mrueg June 7, 2024 17:15
@mrueg (Member) commented Jun 7, 2024:

/ok-to-test

What happens if a request takes longer than a timeout? Does the server send any log output? Is it visible on the telemetry server metrics?

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jun 7, 2024
@mrueg (Member) commented Jun 7, 2024:

> Should we set timeouts for the telemetry server as well? https://github.com/kubernetes/kube-state-metrics/pull/2412/files#diff-dc8fe36bd1edf1b4cd03bacbd94b02cd4226717fe7e1a6474c534f5b5db30227R315
>
> I'm sure it would be a good idea to set timeouts for the telemetry server as well, but I don't have enough experience with that type of scraping to provide meaningful defaults 😅 I can follow up with an issue to set those in a future PR. How does that sound?

That works for me. I just want to make sure we don't end up with --server-prefixed arguments where it's unclear which server they apply to.

> How did you choose the duration for each timeout setting? Should we provide any additional suggestions?
>
> These were picked somewhat arbitrarily, but the intent is to align them with sane scrape times. When running in our dev cluster with dozens of nodes, I aligned both with our scrape interval and timeout (15s). As far as guidance goes, I think the minimum value should be the scrape timeout. The safest option for the defaults would be to follow the default scrape_interval from Prometheus, which is 60s.

That makes sense to me, thanks for updating!

@Pokom (Contributor, Author) commented Jun 7, 2024:

> What happens if a request takes longer than a timeout? Does the server send any log output? Is it visible on the telemetry server metrics?

I'm not sure if anything is logged on the kube-state-metrics side; that's a good question. On the scraper side, it entirely depends on how the client chooses to handle a cancelled context. Ideally you would see something similar to:

2024/06/07 14:15:34 error: Get "http://localhost:8080/metrics": context deadline exceeded
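
To illustrate the behavior being asked about, here is a self-contained sketch (not kube-state-metrics code) of how net/http's WriteTimeout behaves: the server logs nothing by default when the deadline fires, it simply closes the connection, so the error surfaces on the client's side:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	// A deliberately slow handler standing in for a stuck /metrics render.
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(3 * time.Second)
		fmt.Fprintln(w, "too late: the write deadline has already passed")
	})

	srv := &http.Server{
		Addr:         "localhost:8080",
		Handler:      mux,
		WriteTimeout: 1 * time.Second, // connection is closed once this deadline passes
	}
	go srv.ListenAndServe()
	time.Sleep(100 * time.Millisecond) // crude wait for the listener to come up

	// The server emits no log when the deadline fires; the client just sees
	// a transport-level failure (EOF / connection reset / its own context
	// deadline), so visibility lives on the scraper's side.
	if _, err := http.Get("http://localhost:8080/metrics"); err != nil {
		fmt.Println("scrape failed:", err)
	}
}
```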

@Pokom Pokom requested a review from mrueg June 11, 2024 10:53
@mrueg (Member) commented Jun 13, 2024:

/hold

for others to review.

/lgtm

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 13, 2024
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 13, 2024
@dgrisonnet (Member) commented:

/assign @dgrisonnet @richabanker
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 13, 2024
@mrueg mrueg added this to the v2.13.0 milestone Jun 14, 2024
@richabanker (Contributor) commented:

/lgtm

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrueg, Pokom, richabanker

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mrueg (Member) commented Jun 17, 2024:

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 17, 2024
@k8s-ci-robot k8s-ci-robot merged commit 124117f into kubernetes:main Jun 17, 2024
12 checks passed
@Pokom Pokom deleted the fix/add-server-timeouts branch June 17, 2024 17:10
Labels
approved, cncf-cla: yes, lgtm, ok-to-test, size/M, triage/accepted
Development

Successfully merging this pull request may close these issues:

Scrapes hang when deadlocked (#2413)
5 participants