Procstat plugin causes higher cpu with telegraf 15.1 release #7884

Kedesai · 2020-07-23T18:04:05Z

Relevant telegraf.conf:

# [[inputs.logparser]]
#   files = ["/location/www/some/logs/*_log.*"]

#   from_beginning = false
#   watch_method = "poll"

# [inputs.logparser.grok]
#   patterns = ["%{COMBINED_LOG_FORMAT}"]
#   measurement = "web1_some_apache_access_log"
#   timezone = "America/New_York"

System info:

Telegraf version 1.15.1 on RHEL 7.3

Docker

Steps to reproduce:

1.) Install Telegraf 1.15.1.
2.) Start Telegraf.
3.) With logparser and procstat enabled, CPU usage is very high, at unacceptable levels. Even with those two disabled, CPU usage is at 5%, which is still very high for this kind of monitoring.

Expected behavior:

CPU usage jumped to more than 120%. The expected CPU usage for Telegraf should be no more than 1% or 2%.

Actual behavior:

Additional info:


ssoroka · 2020-07-23T18:34:18Z

inputs.logparser hasn't received any changes between 1.14.5 and 1.15.1, so it seems unlikely that there would be any performance regressions here. Is this result seen in combination with procstat, or do you also see this performance change when only logparser is running?

Kedesai · 2020-07-23T18:58:24Z

Yep, definitely in conjunction with inputs.procstat. I just went back to telegraf-1.14.4-1.x86_64.rpm, and that also has higher CPU usage, but not as bad as in 1.15.1. In 1.15.1 the CPU jumps to more than 100%. Let me do some more analysis; I will post it by tomorrow.

Kedesai · 2020-07-23T20:34:44Z

On 1.14.4-1, if I have both logparser and procstat enabled, the CPU jumps to:
29600 telegraf 20 0 1406324 67704 19468 S 121.5 0.4 0:10.84 telegraf

When I have procstat enabled but logparser disabled:
24897 telegraf 20 0 1406580 57956 19000 S 39.3 0.4 0:08.24 telegraf

When I have procstat disabled and only logparser enabled, here is the top output:
31033 telegraf 20 0 1002596 69788 18716 S 111.9 0.4 0:06.35 telegraf

When I install telegraf-1.15.1-1.x86_64.rpm and enable procstat, look at the CPU jump. The logparser plugin is disabled at this time:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22195 telegraf 20 0 1482836 52504 19500 S 125.2 0.3 0:22.87 telegraf

Now with only logparser enabled but procstat disabled:
30903 telegraf 20 0 1431440 77736 19644 S 116.5 0.5 14:12.95 telegraf

In summary, both versions 1.14 and 1.15 have issues. I think I had another ticket for netstat, and I will update that separately.

ssoroka · 2020-07-27T20:45:48Z

We might need to compare to 1.13 for a baseline, if 1.14 also had issues. Would be good to see for each version.

inputs.internal might help with capturing some runtime stats per plugin.

No pressure, but if you get a chance to fill this out it might be helpful. Ideally with some kind of average CPU measurement taken after startup, once Telegraf has run for one full gather/collection.

You may also want to consider submitting a profile for the latest version.

avg cpu usage
+---------+---------+----------------+---------------+-------------------------+
| Version | Neither | Just Logparser | Just Procstat | Both Logparser+procstat |
+---------+---------+----------------+---------------+-------------------------+
|    1.13 |         |                |               |                         |
|    1.15 |         |                |               |                         |
+---------+---------+----------------+---------------+-------------------------+
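
One way to fill in a cell of this table is to average the %CPU column over several `top` samples taken after the first collection interval. The following is a minimal Python sketch, assuming the default `top` column layout that appears elsewhere in this thread (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND). The function names are illustrative and not part of Telegraf.

```python
import statistics
from typing import Iterable, Optional


def cpu_percent_from_top_line(line: str) -> Optional[float]:
    """Return the %CPU value from one process row of `top` output.

    Assumes the default column order, where %CPU is the 9th field.
    Returns None for headers, blank lines, and anything else that
    does not look like a process row.
    """
    fields = line.split()
    if len(fields) < 12 or not fields[0].isdigit():
        return None
    try:
        return float(fields[8])
    except ValueError:
        return None


def average_cpu(lines: Iterable[str]) -> float:
    """Average %CPU over every process row found in `lines`."""
    samples = [
        value
        for value in (cpu_percent_from_top_line(line) for line in lines)
        if value is not None
    ]
    if not samples:
        raise ValueError("no process rows found in input")
    return statistics.mean(samples)


if __name__ == "__main__":
    # A single row copied from this thread, used only to show the parsing.
    row = "22195 telegraf 20 0 1482836 52504 19500 S 125.2 0.3 0:22.87 telegraf"
    print(cpu_percent_from_top_line(row))
```

Samples for one process could be collected with `top` in batch mode (for example `top -b -n 10 -d 5 -p <pid>`), saved to a file, and passed line by line to `average_cpu`. A single snapshot taken shortly after startup is likely to overstate or understate steady-state usage, which is why an average over several intervals is suggested here.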

Kedesai · 2020-07-30T16:59:10Z

I will work on this today. I saw the profiling page. I will collect traces.

Kedesai · 2020-07-30T17:01:47Z

Does inputs.internal have CPU stats, i.e. the CPU used by Telegraf itself? Or is that procstat?

Kedesai · 2020-07-30T21:16:04Z

Here is the diff between 1.13 and 1.15

Kedesai · 2020-07-30T21:27:47Z

Hi Steven,

Can I send the output to you via email? I don't want to post the traces on GitHub. Can you suggest how to send the traces?

Thanks,
Ketan

Kedesai · 2020-07-31T15:11:57Z

@ssoroka any update?

Kedesai · 2020-08-05T21:23:39Z

@ssoroka any updates?

Kedesai · 2020-08-07T19:41:30Z

@ssoroka any updates?

ssoroka · 2020-08-17T18:33:01Z

Hey @Kedesai, I see you, I'm just behind in replies. You can email me any traces (email removed; trace received)

btw, enable inputs.cpu for cpu stats.

Looks like the increase is related largely to procstat. How many processes do you typically have running on the Telegraf machine? We've got a couple issues open for procstat performance and it's something we're actively working on. The fix is complicated and requires some upstream changes, so it's been slow to resolve. I'll check the traces for anything else obvious.

Thanks for your patience!
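
To answer the process-count question on a Linux host, the sketch below counts how many processes exist and how many have a command line matching a given pattern. It is a rough illustration only: it scans `/proc` directly and is not how Telegraf or gopsutil implement the lookup. The function name and the `proc_root` parameter are hypothetical.

```python
import os
import re
from typing import Tuple


def count_matching_processes(pattern: str, proc_root: str = "/proc") -> Tuple[int, int]:
    """Return (total_processes, matching_processes).

    A process counts as matching when `pattern` (a regular expression)
    is found anywhere in its full command line, similar in spirit to
    `pgrep -f <pattern>`.
    """
    regex = re.compile(pattern)
    total = 0
    matching = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited or is not readable
        total += 1
        # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\0", b" ").decode("utf-8", "replace").strip()
        if regex.search(cmdline):
            matching += 1
    return total, matching


if __name__ == "__main__":
    total, matching = count_matching_processes("telegraf")
    print(f"{matching} of {total} processes match")
```

Reporting both numbers alongside the CPU figures would help show whether the cost tracks the total number of processes on the machine or only the number that match the procstat pattern.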

fxedel · 2020-09-02T12:47:08Z

We are having similar issues. When upgrading some of our machines from Telegraf 1.14.5 to 1.15.2, telegraf's CPU usage rose on one machine from ~9% to ~25%, on another from ~25% to ~110% (of a single core).

After further investigating, we found out that disabling procstat input configs reduced the CPU usage again. This especially applies to procstat + nginx and procstat + php-fpm, as nginx and php-fpm have a lot of processes.

Downgrading to 1.14.5 also solved the problem.

Configs:

### php7.1-fpm.conf

[[inputs.phpfpm]]
  urls = ["/var/run/php7.1-fpm.sock:fpm-status"]

  [inputs.phpfpm.tags]
    _influxdb_database = "php"

[[inputs.procstat]]
  pattern = "fpm"
  fielddrop = ["pid"]
  [inputs.procstat.tags]
    _influxdb_database = "php"

[[aggregators.basicstats]]
  period = "10s"
  drop_original = true
  stats = ["sum"]
  namepass = ["procstat"]
  [aggregators.basicstats.tagpass]
    process_name = ["php-fpm7.1"]

### nginx.conf

[[inputs.nginx]]
  urls = ["http://localhost/nginx_status"]

  [inputs.nginx.tags]
    _influxdb_database = "nginx"

[[inputs.procstat]]
  pattern = "nginx"
  fielddrop = ["pid"]
  [inputs.procstat.tags]
    _influxdb_database = "nginx"

[[aggregators.basicstats]]
  period = "10s"
  drop_original = true
  stats = ["sum"]
  namepass = ["procstat"]
  [aggregators.basicstats.tagpass]
    process_name = ["nginx"]

BradMeier · 2020-09-24T10:23:27Z

I can confirm the same behaviour. After going from 1.13.4 to 1.15.0, the procstat collector's gather_time_ns went from under 50 ms to more than 250 ms. I went back to 1.14.5 and it was under 50 ms again.

[[inputs.procstat]]
  interval = "1s"
  fieldpass = ["cpu_time_user","cpu_time_system","cpu_usage", "memory*","pid","num_threads","voluntary_context_switches","involuntary_context_switches", "read_*", "write_*"]
  pattern = '^(\S*python\S* )?/home/USER/conf/*'

BradMeier · 2020-09-24T12:12:16Z

One of my team built Telegraf 1.15.3 with gopsutil v2.20.5, CPU usage of procstat increased 5x, same as the official builds. They then built Telegraf 1.15.3 with gopsutil v2.20.4, CPU usage for procstat went back to what it was in 1.13.4 and 1.14.5 with the official builds.

ssoroka added the area/procstat label Aug 17, 2020

ssoroka changed the title ~~Logparser plugin causes higher cpu with telegraf 15.1 release~~ Procstat plugin causes higher cpu with telegraf 15.1 release Aug 17, 2020

ssoroka added the performance label (problems with decreased performance or enhancements that improve performance) Aug 19, 2020

ssoroka mentioned this issue Aug 19, 2020

Procstat input with exe or pattern options are slow on Linux #7642

Closed

reimda mentioned this issue Oct 1, 2020

Update gopsutil: fix procstat performance regression #8210

Merged

ssoroka closed this as completed in #8210 Oct 1, 2020
