Procstat plugin causes higher cpu with telegraf 15.1 release #7884

Closed
Kedesai opened this issue Jul 23, 2020 · 15 comments · Fixed by #8210
Labels
area/procstat · performance (problems with decreased performance or enhancements that improve performance)

Comments

@Kedesai

Kedesai commented Jul 23, 2020

Relevant telegraf.conf:

# [[inputs.logparser]]
#   files = ["/location/www/some/logs/*_log.*"]

#   from_beginning = false
#   watch_method = "poll"

# [inputs.logparser.grok]
#   patterns = ["%{COMBINED_LOG_FORMAT}"]
#   measurement = "web1_some_apache_access_log"
#   timezone = "America/New_York"

System info:

Telegraf 1.15.1 on RHEL 7.3

Docker

Steps to reproduce:

  1. Install Telegraf 1.15.1.
  2. Start Telegraf.
  3. With logparser and procstat enabled, CPU usage is unacceptably high. Even with those two disabled, CPU sits at 5%, which is still very high for this kind of monitoring.

Expected behavior:

CPU usage jumped to more than 120%. The expected CPU usage for Telegraf is no more than 1% or 2%.

Actual behavior:

Additional info:

@ssoroka
Contributor

ssoroka commented Jul 23, 2020

inputs.logparser hasn't received any changes between 1.14.5 and 1.15.1, so it seems unlikely that there would be any performance regressions here. Is this result seen in combination with procstat, or do you also see this performance change when only logparser is running?

@Kedesai
Author

Kedesai commented Jul 23, 2020

Yep, definitely in conjunction with inputs.procstat. I just went back to telegraf-1.14.4-1.x86_64.rpm, and that also has high CPU usage, but not as bad as 1.15.1, where the CPU jumps to more than 100%. Let me do some more analysis; I will post it by tomorrow.

@Kedesai
Author

Kedesai commented Jul 23, 2020

On 1.14.4-1, with both logparser and procstat enabled, the CPU jumps to:
29600 telegraf 20 0 1406324 67704 19468 S 121.5 0.4 0:10.84 telegraf

With procstat enabled but logparser disabled:
24897 telegraf 20 0 1406580 57956 19000 S 39.3 0.4 0:08.24 telegraf

With procstat disabled and only logparser enabled, here is the top output:
31033 telegraf 20 0 1002596 69788 18716 S 111.9 0.4 0:06.35 telegraf

After installing telegraf-1.15.1-1.x86_64.rpm and enabling procstat, look at the CPU jump. The logparser plugin is disabled at this time:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22195 telegraf 20 0 1482836 52504 19500 S 125.2 0.3 0:22.87 telegraf

Now with only logparser enabled but procstat disabled:
30903 telegraf 20 0 1431440 77736 19644 S 116.5 0.5 14:12.95 telegraf

In summary, both 1.14 and 1.15 have issues. I think I had another ticket for netstat; I will update that separately.

@ssoroka
Contributor

ssoroka commented Jul 27, 2020

We might need to compare to 1.13 for a baseline, if 1.14 also had issues. Would be good to see for each version.

inputs.internal might help with capturing some runtime stats per plugin.
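
For reference, a minimal sketch of enabling it in telegraf.conf, assuming the plugin's stock options. Its `internal_gather` measurement is tagged by input plugin and carries a `gather_time_ns` field, so procstat's gather time can be compared across versions. `collect_memstats` is optional and only adds Go runtime memory stats.

```toml
# Enable Telegraf's self-monitoring input (sketch; adjust to your setup).
[[inputs.internal]]
  ## Also collect Go runtime memory statistics (internal_memstats).
  ## Optional; not needed for the per-plugin gather timings.
  collect_memstats = true
```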

No pressure, but if you get a chance to fill this out it might be helpful, ideally with some kind of average CPU measurement taken after startup, once Telegraf has run for one full gather/collection.

You may also want to consider submitting a profile for the latest version.

avg cpu usage
+---------+---------+----------------+---------------+-------------------------+
| Version | Neither | Just Logparser | Just Procstat | Both Logparser+procstat |
+---------+---------+----------------+---------------+-------------------------+
|    1.13 |         |                |               |                         |
|    1.15 |         |                |               |                         |
+---------+---------+----------------+---------------+-------------------------+

@Kedesai
Author

Kedesai commented Jul 30, 2020

I will work on this today. I saw the profiling page. I will collect traces.

@Kedesai
Author

Kedesai commented Jul 30, 2020

Does inputs.internal include CPU stats, i.e. the CPU used by Telegraf itself, or does that come from procstat?

@Kedesai
Author

Kedesai commented Jul 30, 2020

Here is the diff between 1.13 and 1.15 (image attachment).

@Kedesai
Author

Kedesai commented Jul 30, 2020 via email

@Kedesai
Author

Kedesai commented Jul 31, 2020

@ssoroka any update?

@Kedesai
Author

Kedesai commented Aug 5, 2020

@ssoroka any updates?

@Kedesai
Author

Kedesai commented Aug 7, 2020

@ssoroka any updates?

@ssoroka
Contributor

ssoroka commented Aug 17, 2020

Hey @Kedesai, I see you, I'm just behind in replies. You can email me any traces (email removed; trace received)

btw, enable inputs.cpu for cpu stats.
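
A minimal sketch of that config, assuming the plugin's standard options. Note that, as far as I know, inputs.cpu reports host-wide CPU usage rather than the CPU used by the Telegraf process itself, so it is best read alongside per-process output such as top.

```toml
# Host-wide CPU usage (sketch; not specific to the Telegraf process).
[[inputs.cpu]]
  ## Report per-CPU stats as well as the host-wide total.
  percpu = true
  totalcpu = true
  ## Leave raw CPU time counters off; usage percentages are enough here.
  collect_cpu_time = false
  ## Skip the aggregate "active" field (sum of all non-idle states).
  report_active = false
```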

Looks like the increase is related largely to procstat. How many processes do you typically have running on the Telegraf machine? We've got a couple issues open for procstat performance and it's something we're actively working on. The fix is complicated and requires some upstream changes, so it's been slow to resolve. I'll check the traces for anything else obvious.

Thanks for your patience!

@ssoroka changed the title from "Logparser plugin causes higher cpu with telegraf 15.1 release" to "Procstat plugin causes higher cpu with telegraf 15.1 release" on Aug 17, 2020
@ssoroka added the performance label (problems with decreased performance or enhancements that improve performance) on Aug 19, 2020
@fxedel
Contributor

fxedel commented Sep 2, 2020

We are having similar issues. When upgrading some of our machines from Telegraf 1.14.5 to 1.15.2, telegraf's CPU usage rose on one machine from ~9% to ~25%, on another from ~25% to ~110% (of a single core).

After further investigation, we found that disabling the procstat input configs reduced the CPU usage again. This especially applies to procstat + nginx and procstat + php-fpm, as nginx and php-fpm have a lot of processes.

Downgrading to 1.14.5 also solved the problem.

Configs:

### php7.1-fpm.conf

[[inputs.phpfpm]]
  urls = ["/var/run/php7.1-fpm.sock:fpm-status"]

  [inputs.phpfpm.tags]
    _influxdb_database = "php"

[[inputs.procstat]]
  pattern = "fpm"
  fielddrop = ["pid"]
  [inputs.procstat.tags]
    _influxdb_database = "php"

[[aggregators.basicstats]]
  period = "10s"
  drop_original = true
  stats = ["sum"]
  namepass = ["procstat"]
  [aggregators.basicstats.tagpass]
    process_name = ["php-fpm7.1"]
### nginx.conf

[[inputs.nginx]]
  urls = ["http://localhost/nginx_status"]

  [inputs.nginx.tags]
    _influxdb_database = "nginx"

[[inputs.procstat]]
  pattern = "nginx"
  fielddrop = ["pid"]
  [inputs.procstat.tags]
    _influxdb_database = "nginx"

[[aggregators.basicstats]]
  period = "10s"
  drop_original = true
  stats = ["sum"]
  namepass = ["procstat"]
  [aggregators.basicstats.tagpass]
    process_name = ["nginx"]

@BradMeier

BradMeier commented Sep 24, 2020

I can confirm the same behaviour: after going from 1.13.4 to 1.15.0, the procstat collector's gather_time_ns went from under 50 ms to more than 250 ms. I went back to 1.14.5 and the gather time was under 50 ms again.

[[inputs.procstat]]
  interval = "1s"
  fieldpass = ["cpu_time_user","cpu_time_system","cpu_usage", "memory*","pid","num_threads","voluntary_context_switches","involuntary_context_switches", "read_*", "write_*"]
  pattern = '^(\S*python\S* )?/home/USER/conf/*'

@BradMeier

BradMeier commented Sep 24, 2020

One of my teammates built Telegraf 1.15.3 with gopsutil v2.20.5, and procstat's CPU usage increased 5x, the same as with the official builds. They then built Telegraf 1.15.3 with gopsutil v2.20.4, and procstat's CPU usage went back to what it was in the official 1.13.4 and 1.14.5 builds.
