
cpu utilisation hitting cpu limit #10878

Closed
shubrajp opened this issue Mar 23, 2022 · 7 comments
Labels: area/prometheus, bug (unexpected problem or unintended behavior)

Comments

shubrajp commented Mar 23, 2022

Relevant telegraf.conf

[[inputs.prometheus]]
        metric_version = 2
        monitor_kubernetes_pods = true

Logs from Telegraf

2022-03-22T10:30:48Z I! Starting Telegraf 1.21.4
2022-03-22T10:30:48Z I! Using config file: /etc/telegraf/telegraf.conf
2022-03-22T10:30:48Z I! Loaded inputs: internal prometheus
2022-03-22T10:30:48Z I! Loaded aggregators:
2022-03-22T10:30:48Z I! Loaded processors:
2022-03-22T10:30:48Z I! Loaded outputs: prometheus_client
2022-03-22T10:30:48Z I! Tags enabled: host=[...]
2022-03-22T10:30:48Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"[...]", Flush Interval:1m10s
2022-03-22T10:30:48Z D! [agent] Initializing plugins
2022-03-22T10:30:48Z D! [agent] Connecting outputs
2022-03-22T10:30:48Z D! [agent] Attempting connection to [outputs.prometheus_client]
2022-03-22T10:30:48Z I! [outputs.prometheus_client] Listening on http://[::]:9273/metrics
2022-03-22T10:30:48Z D! [agent] Successfully connected to outputs.prometheus_client
2022-03-22T10:30:48Z D! [agent] Starting service inputs

System info

Telegraf 1.21.4

Docker

resources:
  limits:
    cpu: 100m
    memory: 1Gi
  requests:
    cpu: 10m
    memory: 40Mi

Steps to reproduce

  1. Enable prometheus input plugin with monitor_kubernetes_pods set to true
  2. Let it run for about an hour
  3. Keep monitoring cpu usage using kubectl top pod | grep telegraf

Expected behavior

cpu util % to remain nearly constant

Actual behavior

cpu util % spikes up after some time
(screenshot: graph of the Telegraf pod's CPU usage showing the spike)

Additional info

Profile (after the CPU increase):
(pprof) top
Showing nodes accounting for 4590ms, 92.17% of 4980ms total
Dropped 29 nodes (cum <= 24.90ms)
Showing top 10 nodes out of 33
flat flat% sum% cum cum%
1090ms 21.89% 21.89% 1090ms 21.89% runtime.futex
860ms 17.27% 39.16% 860ms 17.27% runtime.lock2
630ms 12.65% 51.81% 630ms 12.65% runtime.unlock2
560ms 11.24% 63.05% 2320ms 46.59% runtime.chanrecv
470ms 9.44% 72.49% 500ms 10.04% sync.(*Mutex).Unlock (inline)
450ms 9.04% 81.53% 970ms 19.48% context.(*cancelCtx).Done
200ms 4.02% 85.54% 3740ms 75.10% github.com/influxdata/telegraf/plugins/inputs/prometheus.(*Prometheus).watchPod
120ms 2.41% 87.95% 120ms 2.41% runtime.memclrNoHeapPointers
110ms 2.21% 90.16% 230ms 4.62% runtime.typedmemclr
100ms 2.01% 92.17% 310ms 6.22% runtime.selectnbrecv

I've looked at the profile multiple times after CPU spikes; watchPod is consistently present.

Possible culprit:
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/prometheus/kubernetes.go#L108
There is no exit from the infinite for loop; it never hits the "<-ctx.Done()" case.
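
To illustrate what I think is happening, here is a simplified, self-contained sketch (not the actual Telegraf code; "events" stands in for watcher.ResultChan()). Once the API server drops the watch, the result channel is closed, every receive returns immediately, and the select loop spins at full speed without ever hitting ctx.Done():

// Simplified illustration only, not Telegraf's implementation.
package main

import (
	"context"
	"fmt"
	"time"
)

func watchLoop(ctx context.Context, events <-chan string) {
	spins := 0
	for {
		select {
		case <-ctx.Done():
			fmt.Println("context cancelled after", spins, "empty receives")
			return
		case ev, ok := <-events:
			if !ok {
				// Channel closed (the server ended the watch). Without
				// re-establishing the watch or returning, this case is
				// selected on every iteration and the goroutine burns a
				// full core until the pod hits its CPU limit.
				spins++
				continue
			}
			fmt.Println("event:", ev)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	events := make(chan string, 1)
	events <- "pod added"
	close(events) // simulate the API server ending the watch after ~30 minutes

	watchLoop(ctx, events)
}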

Related Issue:
#10148
Here watchPod stops monitoring pod changes (new pod creation / pod deletion / etc.) after 30 minutes.

shubrajp added the bug label on Mar 23, 2022
powersj (Contributor) commented Mar 23, 2022

Hi,

  • Is that all there is in your Telegraf config, or do you have other inputs and outputs?
  • Do you know how many pods existed before and after the spike?
  • Are there a number of pods coming and going over this hour?
  • You specify Telegraf should have cpu: 100m, which, if I understand correctly, is 1/10th of a CPU. If a bunch of pods come online and Telegraf is trying to go through them all, I can understand why there might be such a spike.

If requests were stacking over time I would expect a possible trend upwards and maybe not a direct spike.

While I do not doubt there are possible issues with the prometheus input, we need an actionable, specific issue to make any changes here.

Thanks

powersj added the waiting for response label on Mar 23, 2022
shubrajp (Author) commented Apr 3, 2022

Hi powersj...

  1. All input / output plugins:
    [agent]
    interval = "60s"
    round_interval = true
    metric_batch_size = 75000
    metric_buffer_limit = 150000
    collection_jitter = "5s"
    flush_interval = "70s"
    flush_jitter = "10s"
    precision = ""
    debug = true
    hostname = ""
    omit_hostname = false

    [[outputs.prometheus_client]]
    listen = ":9273"
    metric_version = 2
    collectors_exclude = ["gocollector", "process"]
    export_timestamp = true
    
    [[inputs.internal]]
    collect_memstats = true
    name_prefix = "telegraf."

    [[inputs.prometheus]]
    metric_version = 2
    monitor_kubernetes_pods = true
  2. The number of pods is the same before and after the spike.

  3. There is no change in the pods; the same pods are running throughout.

  4. I've tried modifying the CPU limit, but the behaviour is consistent. Even if I make the CPU limit 500m, I see the spike go up to 500m.

We have tried it with different numbers of pods and different CPU request and limit values, but no luck.
After some time the CPU util % increases and stays at that level.

We have narrowed it down to watchPod() as the culprit.

	watcher, err := client.CoreV1().Pods(p.PodNamespace).Watch(ctx, metav1.ListOptions{
		LabelSelector: p.KubernetesLabelSelector,
		FieldSelector: p.KubernetesFieldSelector,
	})

If I add a big enough timeout (say, a week) under ListOptions, I don't see this spike.
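
Concretely, the local change I tested looks roughly like this (a sketch layered on the excerpt above, not a proposed patch; metav1.ListOptions has a TimeoutSeconds field that bounds how long the API server keeps the watch open):

	// Sketch of the workaround: ask the server to keep the watch open for a week
	// instead of its much shorter default, so the result channel isn't closed.
	watchTimeout := int64(7 * 24 * 60 * 60) // one week, in seconds
	watcher, err := client.CoreV1().Pods(p.PodNamespace).Watch(ctx, metav1.ListOptions{
		LabelSelector:  p.KubernetesLabelSelector,
		FieldSelector:  p.KubernetesFieldSelector,
		TimeoutSeconds: &watchTimeout,
	})

This is only a workaround on my side to confirm where the problem is, not a proper fix.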

telegraf-tiger bot removed the waiting for response label on Apr 3, 2022
shubrajp (Author) commented Apr 3, 2022

Also these two bugs are related:
#10928
#10148

watchPod() was introduced here:
#8937

MyaLongmire (Contributor) commented:

@shubrajp will you please test out PR #10932?

shubrajp (Author) commented Apr 4, 2022

@MyaLongmire
It's working fine; it hasn't spiked for the last 10 hours.

$ kubectl get pods | grep telegraf
telegraf-57b556f444-dmd6c               1/1     Running   0          10h
$ kubectl top pod | grep telegraf
telegraf-57b556f444-dmd6c               16m          89Mi

MyaLongmire (Contributor) commented:

@shubrajp Yay! Thank you for testing it out and reporting back to us :)

powersj (Contributor) commented Apr 8, 2022

Fixed with #10932

powersj closed this as completed on Apr 8, 2022