fix(server): Add read and write timeouts
There are a few documented scenarios where `kube-state-metrics` will
lock up (#995, #1028). I believe a much simpler way to keep
`kube-state-metrics` from locking up and needing a restart before it can
serve `/metrics` requests again is to add default read and write timeouts
and to make them configurable. At Grafana, we've experienced a few
scenarios where `kube-state-metrics` running in larger clusters falls
behind and starts getting scraped multiple times. When this occurs,
`kube-state-metrics` becomes completely unresponsive and requires a
restart. This is fairly easy to reproduce (I'll provide a script in
an issue) and causes other critical workloads (KEDA, VPA) to fail in
unexpected ways.

Adds two flags:
- `server-read-timeout`
- `server-write-timeout`

Updates the metrics HTTP server to set `ReadTimeout` and
`WriteTimeout` to the configured values.
Pokom committed Jun 5, 2024
1 parent 7995d5f commit e97933b
Showing 3 changed files with 9 additions and 1 deletion.
2 changes: 2 additions & 0 deletions docs/developer/cli-arguments.md
@@ -67,6 +67,8 @@ Flags:
--pod-namespace string Name of the namespace of the pod specified by --pod. When set, it is expected that --pod and --pod-namespace are both set. Most likely this should be passed via the downward API. This is used for auto-detecting sharding. If set, this has preference over statically configured sharding. This is experimental, it may be removed without notice.
--port int Port to expose metrics on. (default 8080)
--resources string Comma-separated list of Resources to be enabled. Defaults to "certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments"
--server-read-timeout duration The maximum duration for reading the entire request, including the body. (default 30s)
--server-write-timeout duration The maximum duration before timing out writes of the response. (default 1m0s)
--shard int32 The instances shard nominal (zero indexed) within the total number of shards. (default 0)
--skip_headers If true, avoid header prefixes in the log messages
--skip_log_headers If true, avoid headers when opening log files (no effect when -logtostderr=true)
3 changes: 2 additions & 1 deletion pkg/app/server.go
@@ -326,6 +326,8 @@ func RunKubeStateMetrics(ctx context.Context, opts *options.Options) error {
metricsServer := http.Server{
Handler: metricsMux,
ReadHeaderTimeout: 5 * time.Second,
ReadTimeout: opts.ServerReadTimeout,
WriteTimeout: opts.ServerWriteTimeout,
}
metricsFlags := web.FlagConfig{
WebListenAddresses: &[]string{metricsServerListenAddress},
@@ -401,7 +403,6 @@ func buildMetricsServer(m *metricshandler.MetricsHandler, durationObserver prome
mux.Handle("/debug/pprof/trace", http.HandlerFunc(pprof.Trace))

mux.Handle(metricsPath, promhttp.InstrumentHandlerDuration(durationObserver, m))

// Add healthzPath
mux.HandleFunc(healthzPath, func(w http.ResponseWriter, _ *http.Request) {
w.WriteHeader(http.StatusOK)
5 changes: 5 additions & 0 deletions pkg/options/options.go
@@ -21,6 +21,7 @@ import (
"fmt"
"os"
"strings"
"time"

"github.com/prometheus/common/version"
"github.com/spf13/cobra"
@@ -55,6 +56,8 @@ type Options struct {
TelemetryPort int `yaml:"telemetry_port"`
TotalShards int `yaml:"total_shards"`
UseAPIServerCache bool `yaml:"use_api_server_cache"`
ServerReadTimeout time.Duration `yaml:"server_read_timeout"`
ServerWriteTimeout time.Duration `yaml:"server_write_timeout"`

Config string

@@ -146,6 +149,8 @@ func (o *Options) AddFlags(cmd *cobra.Command) {
o.cmd.Flags().Var(&o.Namespaces, "namespaces", fmt.Sprintf("Comma-separated list of namespaces to be enabled. Defaults to %q", &DefaultNamespaces))
o.cmd.Flags().Var(&o.NamespacesDenylist, "namespaces-denylist", "Comma-separated list of namespaces not to be enabled. If namespaces and namespaces-denylist are both set, only namespaces that are excluded in namespaces-denylist will be used.")
o.cmd.Flags().Var(&o.Resources, "resources", fmt.Sprintf("Comma-separated list of Resources to be enabled. Defaults to %q", &DefaultResources))
o.cmd.Flags().DurationVar(&o.ServerReadTimeout, "server-read-timeout", 30*time.Second, "The maximum duration for reading the entire request, including the body.")
o.cmd.Flags().DurationVar(&o.ServerWriteTimeout, "server-write-timeout", 60*time.Second, "The maximum duration before timing out writes of the response.")
}

// Parse parses the flag definitions from the argument list.
