Procstat plugin causes higher cpu with telegraf 15.1 release #7884
Comments
inputs.logparser hasn't received any changes between 1.14.5 and 1.15.1, so it seems unlikely that there would be any performance regressions here. Is this result seen in combination with procstat, or do you also see this performance change when only logparser is running? |
Yep, definitely in conjunction with inputs.procstat. I just went back to telegraf-1.14.4-1.x86_64.rpm, and that also has higher CPU but not as bad as in 1.15.1. In 1.15.1 the CPU jumps to more than 100%. Let me do some more analysis. I will post it by tomorrow. |
On 1.14.4-1, with both logparser and procstat enabled, the CPU jumps to
With procstat enabled but logparser disabled:
With procstat disabled and only logparser enabled, here is the top output:
When I install telegraf-1.15.1-1.x86_64.rpm and enable procstat, look at the CPU jump. The logparser plugin is disabled at this time:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
Now with only logparser enabled but procstat disabled:
In summary, both versions 1.14 and 1.15 have issues. I think I had another ticket for netstat and I will update that separately. |
We might need to compare to 1.13 for a baseline, if 1.14 also had issues. Would be good to see for each version. inputs.internal might help with capturing some runtime stats per plugin. No pressure, but if you get a chance to fill this out it might be helpful. Ideally with some kind of average cpu measurement after startup and Telegraf has run for one full gather/collection. You may also want to consider submitting a profile for the latest version.
|
I will work on this today. I saw the profiling page. I will collect traces. |
Does inputs.internal have CPU stats? Is that the CPU used by Telegraf, or is it procstat? |
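For reference, the inputs.internal suggestion could be tried with something like the following. This is a minimal sketch, not a configuration posted in this thread: the `collect_memstats` value and the `namepass` filter are choices made here for illustration, and `internal_gather` is assumed to be the measurement that carries the per-input `gather_time_ns` field. As far as I know, inputs.internal reports gather and write statistics rather than CPU usage, so it can show which input is slow but not how much CPU Telegraf itself consumes.

```toml
# Sketch only: expose Telegraf's own per-plugin runtime statistics.
[[inputs.internal]]
  # Go runtime memory statistics are not needed for this comparison.
  collect_memstats = false
  # Keep only the per-input gather statistics
  # (measurement name is an assumption, check your output).
  namepass = ["internal_gather"]
```

Comparing `gather_time_ns` for the procstat input across versions, with the rest of the configuration unchanged, would give one number per version for the table requested in this thread.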
Hi Steven,
Can I send the output to you via this email? I don't want to post the traces on GitHub. Can you suggest how to send the traces?
Thanks,
Ketan
On Monday, July 27, 2020, 02:46:05 PM MDT, Steven Soroka <notifications@github.com> wrote:
We might need to compare to 1.13 for a baseline, if 1.14 also had issues. Would be good to see for each version.
inputs.internal might help with capturing some runtime stats per plugin.
No pressure, but if you get a chance to fill this out it might be helpful. Ideally with some kind of average cpu measurement after startup and Telegraf has run for one full gather/collection.
You may also want to consider submitting a profile for the latest version.
avg cpu usage
+---------+---------+----------------+---------------+-------------------------+
| Version | Neither | Just Logparser | Just Procstat | Both Logparser+procstat |
+---------+---------+----------------+---------------+-------------------------+
| 1.13 | | | | |
| 1.15 | | | | |
+---------+---------+----------------+---------------+-------------------------+
|
@ssoroka any update? |
@ssoroka any updates? |
@ssoroka any updates? |
Hey @Kedesai, I see you, I'm just behind in replies. You can email me any traces (email removed; trace received) btw, enable inputs.cpu for cpu stats. Looks like the increase is related largely to procstat. How many processes do you typically have running on the Telegraf machine? We've got a couple issues open for procstat performance and it's something we're actively working on. The fix is complicated and requires some upstream changes, so it's been slow to resolve. I'll check the traces for anything else obvious. Thanks for your patience! |
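The inputs.cpu suggestion could look roughly like the block below. This is a sketch only: the option values are assumptions chosen for illustration, not settings taken from this thread. Note that inputs.cpu reports host-wide CPU usage, so it shows the overall effect on the machine rather than the share used by the Telegraf process alone.

```toml
# Sketch only: collect host-wide CPU usage for before/after comparison.
[[inputs.cpu]]
  # Report one aggregate series instead of one per core.
  percpu = false
  totalcpu = true
  # Percentages are enough here; skip the raw CPU time counters.
  collect_cpu_time = false
  # Also report the sum of all non-idle states as a single field.
  report_active = true
```

Running the same configuration once per Telegraf version, with procstat enabled and then disabled, keeps the comparison limited to the one variable under discussion.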
We are having similar issues. When upgrading some of our machines from Telegraf 1.14.5 to 1.15.2, Telegraf's CPU usage rose on one machine from ~9% to ~25%, and on another from ~25% to ~110% (of a single core). After investigating further, we found that disabling the procstat input configs reduced the CPU usage again. This especially applies to procstat + nginx and procstat + php-fpm, as nginx and php-fpm have a lot of processes. Downgrading to 1.14.5 also solved the problem. Configs:
### php7.1-fpm.conf
[[inputs.phpfpm]]
urls = ["/var/run/php7.1-fpm.sock:fpm-status"]
[inputs.phpfpm.tags]
_influxdb_database = "php"
[[inputs.procstat]]
pattern = "fpm"
fielddrop = ["pid"]
[inputs.procstat.tags]
_influxdb_database = "php"
[[aggregators.basicstats]]
period = "10s"
drop_original = true
stats = ["sum"]
namepass = ["procstat"]
[aggregators.basicstats.tagpass]
process_name = ["php-fpm7.1"]
### nginx.conf
[[inputs.nginx]]
urls = ["http://localhost/nginx_status"]
[inputs.nginx.tags]
_influxdb_database = "nginx"
[[inputs.procstat]]
pattern = "nginx"
fielddrop = ["pid"]
[inputs.procstat.tags]
_influxdb_database = "nginx"
[[aggregators.basicstats]]
period = "10s"
drop_original = true
stats = ["sum"]
namepass = ["procstat"]
[aggregators.basicstats.tagpass]
process_name = ["nginx"] |
I can confirm the same behaviour, went from 1.13.4 -> 1.15.0 and procstat collector gather_time_ns went from sub 50ms to larger than 250ms. I went back to 1.14.5 and the performance was sub 50ms again.
|
One of my team built Telegraf 1.15.3 with gopsutil v2.20.5, CPU usage of procstat increased 5x, same as the official builds. They then built Telegraf 1.15.3 with gopsutil v2.20.4, CPU usage for procstat went back to what it was in 1.13.4 and 1.14.5 with the official builds. |
Relevant telegraf.conf:
System info:
Telegraf version 1.15.1 and RHEL version 7.3
Docker
Steps to reproduce:
3.) With logparser and procstat enabled, the CPU is very high, at unacceptable levels. Even with those two disabled, the CPU is at 5%, which is still very high for this kind of monitoring.
Expected behavior:
The CPU just jumped to more than 120%. The expected CPU usage for Telegraf should be no more than 1% or 2%.
Actual behavior:
Additional info: