
cpu utilisation hitting cpu limit #10878

Closed
shubrajp opened this issue Mar 23, 2022 · 7 comments
Labels: area/prometheus, bug (unexpected problem or unintended behavior)

Comments

shubrajp commented Mar 23, 2022

Relevant telegraf.conf

[[inputs.prometheus]]
        metric_version = 2
        monitor_kubernetes_pods = true

Logs from Telegraf

2022-03-22T10:30:48Z I! Starting Telegraf 1.21.4
2022-03-22T10:30:48Z I! Using config file: /etc/telegraf/telegraf.conf
2022-03-22T10:30:48Z I! Loaded inputs: internal prometheus
2022-03-22T10:30:48Z I! Loaded aggregators:
2022-03-22T10:30:48Z I! Loaded processors:
2022-03-22T10:30:48Z I! Loaded outputs: prometheus_client
2022-03-22T10:30:48Z I! Tags enabled: host=[...]
2022-03-22T10:30:48Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"[...]", Flush Interval:1m10s
2022-03-22T10:30:48Z D! [agent] Initializing plugins
2022-03-22T10:30:48Z D! [agent] Connecting outputs
2022-03-22T10:30:48Z D! [agent] Attempting connection to [outputs.prometheus_client]
2022-03-22T10:30:48Z I! [outputs.prometheus_client] Listening on http://[::]:9273/metrics
2022-03-22T10:30:48Z D! [agent] Successfully connected to outputs.prometheus_client
2022-03-22T10:30:48Z D! [agent] Starting service inputs

System info

Telegraf 1.21.4

Docker

resources:
  limits:
    cpu: 100m
    memory: 1Gi
  requests:
    cpu: 10m
    memory: 40Mi

Steps to reproduce

  1. Enable prometheus input plugin with monitor_kubernetes_pods set to true
  2. Let it run for about an hour
  3. Keep monitoring cpu usage using kubectl top pod | grep telegraf

Expected behavior

cpu util % to remain nearly constant

Actual behavior

cpu util % spikes up after some time
(screenshot: graph of the Telegraf pod's CPU usage showing the spike)

Additional info

Profile (after the CPU increase):
(pprof) top
Showing nodes accounting for 4590ms, 92.17% of 4980ms total
Dropped 29 nodes (cum <= 24.90ms)
Showing top 10 nodes out of 33
flat flat% sum% cum cum%
1090ms 21.89% 21.89% 1090ms 21.89% runtime.futex
860ms 17.27% 39.16% 860ms 17.27% runtime.lock2
630ms 12.65% 51.81% 630ms 12.65% runtime.unlock2
560ms 11.24% 63.05% 2320ms 46.59% runtime.chanrecv
470ms 9.44% 72.49% 500ms 10.04% sync.(*Mutex).Unlock (inline)
450ms 9.04% 81.53% 970ms 19.48% context.(*cancelCtx).Done
200ms 4.02% 85.54% 3740ms 75.10% github.com/influxdata/telegraf/plugins/inputs/prometheus.(*Prometheus).watchPod
120ms 2.41% 87.95% 120ms 2.41% runtime.memclrNoHeapPointers
110ms 2.21% 90.16% 230ms 4.62% runtime.typedmemclr
100ms 2.01% 92.17% 310ms 6.22% runtime.selectnbrecv

I've looked at the profile multiple times after CPU spikes; watchPod is consistently present.

Possible culprit:
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/prometheus/kubernetes.go#L108
There is no exit from the infinite for loop; it never hits the "<-ctx.Done()" case.
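
To illustrate what I think is happening, here is a simplified, self-contained sketch (not the actual Telegraf code; "events" stands in for watcher.ResultChan()). Once the API server drops the watch, the result channel is closed, every receive returns immediately, and the select loop spins at full speed without ever hitting ctx.Done():

// Simplified illustration only, not Telegraf's implementation.
package main

import (
	"context"
	"fmt"
	"time"
)

func watchLoop(ctx context.Context, events <-chan string) {
	spins := 0
	for {
		select {
		case <-ctx.Done():
			fmt.Println("context cancelled after", spins, "empty receives")
			return
		case ev, ok := <-events:
			if !ok {
				// Channel closed (the server ended the watch). Without
				// re-establishing the watch or returning, this case is
				// selected on every iteration and the goroutine burns a
				// full core until the pod hits its CPU limit.
				spins++
				continue
			}
			fmt.Println("event:", ev)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	events := make(chan string, 1)
	events <- "pod added"
	close(events) // simulate the API server ending the watch after ~30 minutes

	watchLoop(ctx, events)
}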

Related Issue:
#10148
Here watchPod stops monitoring pod changes (new pod creation / pod deletion / etc.) after 30 minutes.

shubrajp added the bug label on Mar 23, 2022
powersj (Contributor) commented Mar 23, 2022

Hi,

  • Is that all there is in your Telegraf config, or do you have other inputs and outputs?
  • Do you know how many pods existed before and after the spike?
  • Are there a number of pods coming and going over this hour?
  • You specify Telegraf should have cpu: 100m, which, if I understand correctly, is 1/10th of a CPU. If a bunch of pods come online and Telegraf is trying to go through them all, I can understand why there might be such a spike.

If requests were stacking over time I would expect a possible trend upwards and maybe not a direct spike.

While I do not doubt there are possible issues with the prometheus input, we need an actionable, specific issue to make any changes here.

Thanks

powersj added the waiting for response label on Mar 23, 2022
shubrajp (Author) commented Apr 3, 2022

Hi powersj...

  1. All input / output plugins:
    [agent]
    interval = "60s"
    round_interval = true
    metric_batch_size = 75000
    metric_buffer_limit = 150000
    collection_jitter = "5s"
    flush_interval = "70s"
    flush_jitter = "10s"
    precision = ""
    debug = true
    hostname = ""
    omit_hostname = false

    [[outputs.prometheus_client]]
    listen = ":9273"
    metric_version = 2
    collectors_exclude = ["gocollector", "process"]
    export_timestamp = true
    
    [[inputs.internal]]
    collect_memstats = true
    name_prefix = "telegraf."

    [[inputs.prometheus]]
    metric_version = 2
    monitor_kubernetes_pods = true
  2. The number of pods is the same before and after the spike.

  3. There is no change in the pods; the same pods are running throughout.

  4. I've tried modifying the CPU limit, but the behaviour is consistent. Even if I make the CPU limit 500m, I see the spike go up to 500m.

We have tried it with different numbers of pods and different CPU request and limit values, but no luck.
After some time the CPU util % increases and stays at that level.

We have narrowed it down to watchPod() as the culprit.

	watcher, err := client.CoreV1().Pods(p.PodNamespace).Watch(ctx, metav1.ListOptions{
		LabelSelector: p.KubernetesLabelSelector,
		FieldSelector: p.KubernetesFieldSelector,
	})

If I add a big enough timeout (say, a week) under ListOptions, I don't see this spike.
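
Concretely, the local change I tested looks roughly like this (a sketch layered on the excerpt above, not a proposed patch; metav1.ListOptions has a TimeoutSeconds field that bounds how long the API server keeps the watch open):

	// Sketch of the workaround: ask the server to keep the watch open for a week
	// instead of its much shorter default, so the result channel isn't closed.
	watchTimeout := int64(7 * 24 * 60 * 60) // one week, in seconds
	watcher, err := client.CoreV1().Pods(p.PodNamespace).Watch(ctx, metav1.ListOptions{
		LabelSelector:  p.KubernetesLabelSelector,
		FieldSelector:  p.KubernetesFieldSelector,
		TimeoutSeconds: &watchTimeout,
	})

This is only a workaround on my side to confirm where the problem is, not a proper fix.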

telegraf-tiger bot removed the waiting for response label on Apr 3, 2022
shubrajp (Author) commented Apr 3, 2022

Also these two bugs are related:
#10928
#10148

watchPod() was introduced here:
#8937

MyaLongmire (Contributor) commented:

@shubrajp will you please test out PR #10932?

shubrajp (Author) commented Apr 4, 2022

@MyaLongmire
It's working fine; it hasn't spiked for the last 10 hours.

$ kubectl get pods | grep telegraf
telegraf-57b556f444-dmd6c               1/1     Running   0          10h
$ kubectl top pod | grep telegraf
telegraf-57b556f444-dmd6c               16m          89Mi

MyaLongmire (Contributor) commented:

@shubrajp Yay! Thank you for testing it out and reporting back to us :)

powersj (Contributor) commented Apr 8, 2022

Fixed with #10932

powersj closed this as completed on Apr 8, 2022