
fix(server): Add read and write timeouts #2412

Merged: 5 commits into kubernetes:main from fix/add-server-timeouts on Jun 17, 2024

Conversation

@Pokom (Contributor) commented Jun 5, 2024

What this PR does / why we need it:

There are a few documented scenarios where kube-state-metrics will lock up (#995, #1028). I believe a much simpler way to ensure kube-state-metrics doesn't lock up and require a restart before it can serve /metrics requests again is to add default read and write timeouts and to make them configurable. At Grafana, we've experienced a few scenarios where kube-state-metrics running in larger clusters falls behind and starts getting scraped multiple times. When this occurs, kube-state-metrics becomes completely unresponsive and requires a restart. This is fairly easy to reproduce (I'll provide a script in an issue) and causes other critical workloads (KEDA, VPA) to fail in weird ways.

Adds two flags:

  • server-read-timeout
  • server-write-timeout

Updates the metrics HTTP server to set the ReadTimeout and WriteTimeout to the configured values.

How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality)

Does not change.

Fixes #2413
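
In net/http terms, the change amounts to setting two fields on the http.Server that serves /metrics. A minimal, self-contained sketch of the idea, using the 60-second defaults that this PR eventually settled on; the stub handler and wiring are illustrative, not the PR's actual code:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	// Stub standing in for the real kube-state-metrics handler.
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("# metric families would be rendered here\n"))
	})

	srv := &http.Server{
		Addr:         ":8080",
		Handler:      mux,
		ReadTimeout:  60 * time.Second, // max time to read the entire request, including the body
		WriteTimeout: 60 * time.Second, // max time to write the response once the headers are read
	}
	// With these deadlines set, a stalled scrape can no longer hold a
	// connection open indefinitely; the server closes it instead.
	log.Fatal(srv.ListenAndServe())
}
```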

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 5, 2024
linux-foundation-easycla bot commented Jun 5, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 5, 2024
@k8s-ci-robot (Contributor) commented:

Welcome @Pokom!

It looks like this is your first PR to kubernetes/kube-state-metrics 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kube-state-metrics has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 5, 2024
@Pokom Pokom changed the title from "fix(server): Add read and write timeouts" to "[WIP] fix(server): Add read and write timeouts" on Jun 5, 2024
@Pokom Pokom marked this pull request as ready for review June 6, 2024 17:11
@Pokom Pokom changed the title from "[WIP] fix(server): Add read and write timeouts" to "fix(server): Add read and write timeouts" on Jun 6, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 6, 2024
@@ -146,6 +151,11 @@ func (o *Options) AddFlags(cmd *cobra.Command) {
o.cmd.Flags().Var(&o.Namespaces, "namespaces", fmt.Sprintf("Comma-separated list of namespaces to be enabled. Defaults to %q", &DefaultNamespaces))
o.cmd.Flags().Var(&o.NamespacesDenylist, "namespaces-denylist", "Comma-separated list of namespaces not to be enabled. If namespaces and namespaces-denylist are both set, only namespaces that are excluded in namespaces-denylist will be used.")
o.cmd.Flags().Var(&o.Resources, "resources", fmt.Sprintf("Comma-separated list of Resources to be enabled. Defaults to %q", &DefaultResources))

o.cmd.Flags().DurationVar(&o.ServerReadTimeout, "server-read-timeout", 30*time.Second, "The maximum duration for reading the entire request, including the body.")
@mrueg (Member) commented:

Can we move the durations into separate variables at the top of the file? I feel like it's easier to spot them when they are not hiding inside the flag setting.

@Pokom (Contributor, Author) replied:

Just to make sure I'm following: are you suggesting moving the default (30*time.Second) into a variable placed at the top? That seems entirely reasonable to me.

@Pokom (Contributor, Author) commented:

@mrueg I pushed a new commit that moves the default values into separate variables at the top of the file. Let me know how they look 👍🏻

I've also made the defaults 60 seconds, as I think this aligns more closely with default scrape intervals from Prometheus. I'm wary of setting them to 10s, which is the default timeout value for Prometheus scrape configs.

Given how long kube-state-metrics has existed without these values, this could be a problematic change for folks that have scrape intervals over 60s. I'm not sure how you'd like to handle that; alternatively, we could make the default value 0, which would effectively mean there is no timeout.
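
For context, the shape of the refactor being discussed looks roughly like this. The variable, type, and function names here are hypothetical; only the DurationVar call style, the flag names, and the 60-second defaults come from this thread:

```go
package options

import (
	"time"

	"github.com/spf13/cobra"
)

// Defaults hoisted to the top of the file so they are easy to spot,
// per review feedback; 60s roughly tracks Prometheus' default scrape interval.
var (
	defaultServerReadTimeout  = 60 * time.Second
	defaultServerWriteTimeout = 60 * time.Second
)

// Options holds only the fields relevant to this sketch.
type Options struct {
	ServerReadTimeout  time.Duration
	ServerWriteTimeout time.Duration
}

// addTimeoutFlags registers the two new flags against the defaults above.
func addTimeoutFlags(cmd *cobra.Command, o *Options) {
	cmd.Flags().DurationVar(&o.ServerReadTimeout, "server-read-timeout",
		defaultServerReadTimeout, "The maximum duration for reading the entire request, including the body.")
	cmd.Flags().DurationVar(&o.ServerWriteTimeout, "server-write-timeout",
		defaultServerWriteTimeout, "The maximum duration before timing out writes of the response.")
}
```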

@mrueg (Member) commented Jun 7, 2024:

Should we set timeouts for the telemetry server as well? https://github.com/kubernetes/kube-state-metrics/pull/2412/files#diff-dc8fe36bd1edf1b4cd03bacbd94b02cd4226717fe7e1a6474c534f5b5db30227R315

How did you choose the duration for each timeout setting? Should we provide any additional suggestions?

@Pokom (Contributor, Author) commented Jun 7, 2024:

> Should we set timeouts for the telemetry server as well? https://github.com/kubernetes/kube-state-metrics/pull/2412/files#diff-dc8fe36bd1edf1b4cd03bacbd94b02cd4226717fe7e1a6474c534f5b5db30227R315

I'm sure it would be a good idea to set timeouts for the telemetry server as well, but I don't have enough experience with that type of scraping to provide meaningful defaults 😅 I can follow up with an issue to set those in a future PR. How does that sound?

> How did you choose the duration for each timeout setting? Should we provide any additional suggestions?

These were picked somewhat arbitrarily, but the intent is to align them with sane scrape times. When running in our dev cluster with dozens of nodes, I aligned both with our scrape interval and timeout (15s). As far as guidance goes, I think the minimum value should be the scrape timeout. The safest option for the defaults would be to follow the default scrape_interval from Prometheus, which is 60s.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 7, 2024
@Pokom Pokom requested a review from mrueg June 7, 2024 17:15
@mrueg (Member) commented Jun 7, 2024:

/ok-to-test

What happens if a request takes longer than a timeout? Does the server send any log output? Is it visible on the telemetry server metrics?

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jun 7, 2024
@mrueg (Member) commented Jun 7, 2024:

> Should we set timeouts for the telemetry server as well? https://github.com/kubernetes/kube-state-metrics/pull/2412/files#diff-dc8fe36bd1edf1b4cd03bacbd94b02cd4226717fe7e1a6474c534f5b5db30227R315
>
> I'm sure it would be a good idea to set timeouts for the telemetry server as well, but I don't have enough experience with that type of scraping to provide meaningful defaults 😅 I can follow up with an issue to set those in a future PR. How does that sound?

That works for me. I just want to make sure we don't end up with --server-prefixed arguments where it's unclear which server they apply to.

> How did you choose the duration for each timeout setting? Should we provide any additional suggestions?
>
> These were picked somewhat arbitrarily, but the intent is to align them with sane scrape times. When running in our dev cluster with dozens of nodes, I aligned both with our scrape interval and timeout (15s). As far as guidance goes, I think the minimum value should be the scrape timeout. The safest option for the defaults would be to follow the default scrape_interval from Prometheus, which is 60s.

That makes sense to me, thanks for updating!

@Pokom (Contributor, Author) commented Jun 7, 2024:

> What happens if a request takes longer than a timeout? Does the server send any log output? Is it visible on the telemetry server metrics?

I'm not sure if anything is logged on the kube-state-metrics side; that's a good question. On the scraper side, it entirely depends on how the client chooses to handle a cancelled context. Ideally you would see something similar to:

2024/06/07 14:15:34 error: Get "http://localhost:8080/metrics": context deadline exceeded
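
To illustrate the behavior being asked about, here is a self-contained sketch (not kube-state-metrics code) of how net/http's WriteTimeout behaves: the server logs nothing by default when the deadline fires, it simply closes the connection, so the error surfaces on the client's side:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	// A deliberately slow handler standing in for a stuck /metrics render.
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(3 * time.Second)
		fmt.Fprintln(w, "too late: the write deadline has already passed")
	})

	srv := &http.Server{
		Addr:         "localhost:8080",
		Handler:      mux,
		WriteTimeout: 1 * time.Second, // connection is closed once this deadline passes
	}
	go srv.ListenAndServe()
	time.Sleep(100 * time.Millisecond) // crude wait for the listener to come up

	// The server emits no log when the deadline fires; the client just sees
	// a transport-level failure (EOF / connection reset / its own context
	// deadline), so visibility lives on the scraper's side.
	if _, err := http.Get("http://localhost:8080/metrics"); err != nil {
		fmt.Println("scrape failed:", err)
	}
}
```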

@Pokom Pokom requested a review from mrueg June 11, 2024 10:53
@mrueg (Member) commented Jun 13, 2024:

/hold

for others to review.

/lgtm

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 13, 2024
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 13, 2024
@dgrisonnet (Member) commented:

/assign @dgrisonnet @richabanker
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 13, 2024
@mrueg mrueg added this to the v2.13.0 milestone Jun 14, 2024
@richabanker (Contributor) commented:

/lgtm

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrueg, Pokom, richabanker

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mrueg (Member) commented Jun 17, 2024:

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 17, 2024
@k8s-ci-robot k8s-ci-robot merged commit 124117f into kubernetes:main Jun 17, 2024
12 checks passed
@Pokom Pokom deleted the fix/add-server-timeouts branch June 17, 2024 17:10
Labels
approved, cncf-cla: yes, lgtm, ok-to-test, size/M, triage/accepted
Development

Successfully merging this pull request may close these issues:

Scrapes hang when deadlocked (#2413)
5 participants