Telegraf stops publishing metrics to InfluxDB; All plugins take too long to collect #3629
Comments
Does the problem persist if you switch to using TCP? I have seen this a few times, but it's pretty rare for it to cause an issue. I have used TCP to send to InfluxDB, so I just wanted to check. Also, can you turn off the logparser input plugin and check if that causes any improvement? The logparser uses some heavy regexes which burn considerable CPU. Also, if you are sending lots of data via statsd, you should increase the number of allowed pending messages.
Just had another crash, same server, less than an hour after restarting Telegraf (logs attached).
@agnivade thanks for the suggestions! As a last resort for debugging I can try TCP, but the overhead of TCP is seriously undesirable in our setup; we require a UDP-based solution. After the next crash, I'll disable logparser and see if that causes any improvement. (It's one of the most critical plugins for our usage, so I hope that's not the problem!) I'll also increase the pending-messages limit.
This may be related to another reported issue.
TCP should not add overhead if you do batching properly. Yes, keep us posted on your changes.
It sounds like the Accumulator is not being read from anymore and its internal channel is full. In the past I have seen this caused by a stuck output, though I don't see that in your stack trace. I think increasing the pending messages will only delay this message from appearing once processing is stuck. Can you try disabling the aggregator and see if it helps? Maybe I didn't actually solve #2914.
@danielnelson I suspected the aggregator might be the issue also. I disabled it on our 3rd server and haven't seen a problem on that server since. I just lost metrics from server1 and disabled it there also; here's the log and stack trace from that server. Luckily I am not actually using the aggregator, so I really should have it disabled anyway. (I turned it on because it seemed useful, but when I started building dashboards I never actually needed its output.)
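For readers hitting the same wall: one way to sanity-check the "internal channel is full" theory is to look at the goroutine states in a SIGQUIT dump (see the command sketch a few comments below). A minimal sketch, assuming the dump ends up in telegraf's log file (the path is an assumption):

```bash
# Goroutines blocked writing into a full channel show up in the "chan send" state
# in a Go stack trace; a large count here supports the full-accumulator theory.
grep -c '\[chan send' /var/log/telegraf/telegraf.log
```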
I am having this same issue, and have seen multiple related reports, but haven't found anything that has helped resolve it.
@sujeeth961991 so you're not using the aggregator plugin? If you post a stack trace by sending telegraf a SIGQUIT when it is stuck, does it look similar to mine? I disabled the aggregator 5 days ago and haven't seen a single issue since, making me think that in my case that may be the cause.
It seems like we're experiencing this too; the aggregator plugin seems to be the problem, because there were no issues before I enabled it. The telegraf process seems to be running but it doesn't report any metrics, and there are no errors in the logs; after a restart it started posting to InfluxDB again. I'll post a stack trace next time it happens.
@jgitlin-bt I had restarted my telegraf agent. I will post a stack trace next time this happens.
I had the exact same problem today, where telegraf died on one server over the holiday shutdown and I had to restart it. What type of debugging data should I collect next time it happens, before restarting?
@sgreszcz Kill Telegraf with SIGQUIT and it should print a stack trace. Are you also using an aggregator?
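For anyone collecting debug data, a minimal sketch of that procedure (the service name, pgrep usage and log path are assumptions for a typical install; on FreeBSD the rc script may redirect stderr elsewhere):

```bash
kill -QUIT "$(pgrep -o telegraf)"         # Go runtime dumps all goroutines, then exits
tail -n 300 /var/log/telegraf/telegraf.log  # the goroutine dump should be near the end
service telegraf start                      # SIGQUIT terminates the process, so restart it
```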
Since disabling the aggregator I have had 6 days with no lockups, which is a record. Over the past 30 days I never made it more than 5 days without at least one server deadlocking. I'm going to start re-enabling other inputs I disabled (like disk, network, etc.) and see if things still look OK. Updates to follow.
@danielnelson I'm not using the aggregator, just collectd (socket_listener), docker, and internal. I have the same configuration on six servers collecting the same data (but from different devices), and only one of my telegraf instances has blown out so far with the "took longer to collect than collection interval (10s)" errors. It seems to be only the local inputs that are complaining, not the collectd forwarder using the socket_listener input.
Loaded inputs: inputs.linux_sysctl_fs inputs.disk inputs.diskio inputs.swap inputs.system inputs.docker inputs.net inputs.socket_listener inputs.cpu inputs.kernel inputs.mem inputs.processes inputs.internal
That's interesting @sgreszcz, because for me every single input stops working, even the internal Telegraf one when I have it enabled. The host stops publishing metrics altogether. We may have two separate issues, I am not sure. (I don't use collectd or docker, but do use the statsd input.)
I wouldn't expect it to complain. I suspect that even though there is no log message, you cannot send items to the socket_listener and have them delivered to an output.
I should have been a bit clearer: although socket_listener wasn't complaining in the logs, I wasn't getting any metrics of any type from that Telegraf process to my central InfluxDB server.
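One way to verify that end to end is to hand-feed a single point to the listener and check whether it ever reaches InfluxDB. A sketch, assuming a socket_listener on UDP port 8094 with the influx data format (both assumptions; use whatever service_address is actually configured):

```bash
# If this probe point never shows up in InfluxDB even though nothing is logged,
# points are being accepted by the listener but not delivered to the output.
echo "sl_probe,source=$(hostname) value=1i" | nc -u -w1 127.0.0.1 8094
```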
@danielnelson I tried to find a way to reliably make it hang; this script does it in a matter of seconds. By the way, for some reason when decreasing the cardinality to 3 tags with 5 different values each, telegraf works fine.
@epsylonix Thank you, I was able to reproduce the deadlock with this script. When using the code in #3656 the deadlock does not occur for me, so that seems promising, but I will wait for @jgitlin-bt to report back on his long-term testing. Could you take a look at the pull request and see if it works for you as well?
Apologies for the delay; I need to build a FreeBSD version and have been busy with other sprint work. I'll try to get that build tested early next week.
@danielegozzi I tested with the fix in #3656. However, the issue ("took longer to collect than collection interval") is not fixed.
@adityask2 please post a stack trace by sending telegraf a SIGQUIT. Note that "plugin took longer to collect than collection interval" seems to be a symptom of this issue, not the issue itself. Are you using an aggregator? Disabling the aggregator resolved the issue for me so far (I've had >2 weeks with no missing datapoints). I built the patched version from #3656 and re-enabled the aggregator; so far (less than 24 hours) so good, but I'll need more time to be 100% sure.
@epsylonix what's in that script?
@jgitlin-bt it's a simple bash script; the link to the gist is in my previous post. It sends UDP datagrams using the influx line protocol with some random data. When you run it, it seems that a lot of traffic is not the only condition for this deadlock: with 3 tags each having 5 distinct values it runs fine, but one additional tag leads to a deadlock within seconds. I haven't had time to build a branch with this fix yet, but I have a current stable build running on a test server; it doesn't process much data, yet it does stop reporting metrics occasionally with no errors in the logs, which is very similar to the issue this script reproduces.
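For readers who can't reach the gist, a rough stand-in for that kind of load generator (host, port, measurement and tag names are assumptions, not the original script):

```bash
#!/usr/bin/env bash
# Flood a UDP listener with influx line protocol using four tags of moderate
# cardinality, which per the comment above was enough to trigger the deadlock.
HOST=127.0.0.1
PORT=8094      # assumed UDP input port (socket_listener / statsd-style listener)
VALUES=6       # values per tag; 3 tags x 5 values reportedly did NOT deadlock
while true; do
  line="loadtest,t1=v$((RANDOM % VALUES)),t2=v$((RANDOM % VALUES)),t3=v$((RANDOM % VALUES)),t4=v$((RANDOM % VALUES)) value=$RANDOM"
  echo "$line" > "/dev/udp/$HOST/$PORT"
done
```

Point it at whatever UDP input the affected telegraf instance is listening on and watch for the collection-interval warnings to start.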
Thanks! I missed the link in your post; I see it now. Tag cardinality may also be related: when I bumped up the StatsD input data in November, the new data had hundreds of distinct values for a new tag.
@jgitlin-bt I'm attaching the stack trace and conf file. Can you please tell me which build you used so that I can re-test with it?
Interesting, @adityask2, so you aren't using an aggregator. I suspect you have a separate issue from mine, as for me the issue is clearly related to the aggregator. Do you see the "took longer to collect than collection interval" messages in your log file?
I am running the patched build from #3656.
@jgitlin-bt Yes, I'm seeing "took longer to collect than collection interval" in the log. There seem to be multiple issues reported on this; however, I will file another.
@adityask2 interesting, your issue has the same symptoms but unlike mine is not tied to an aggregator and not fixed by #3656 -- good luck! @danielnelson I am pleased to report that I have had no issues using #3656, and that (or disabling the aggregator) appears to have resolved my issue. Thank you so much!
Bug report
After a seemingly random amount of time, Telegraf stops publishing metrics to InfluxDB over UDP. I have been experiencing this issue since Nov 2016 on both Telegraf 1.3.x and 1.4.4 on FreeBSD, on three separate servers. In the telegraf log, all collectors start to fail with "took longer to collect than collection interval" errors.
I can't see anything unusual or interesting published from the Telegraf internal metrics.
This same issue has been reported in #3318, #2183, #2919, #2780 and #2870, but all those issues are either abandoned by the reporter or confused with several separate issues; I am opening a new issue for my specific problem, but if it's a duplicate (#3318 seems to be the closest) then please feel free to close it.
Relevant telegraf.conf:
telegraf.conf
System info:
Telegraf v1.4.4 (git: unknown unknown)
running on FreeBSD 10.3-RELEASE-p24

Steps to reproduce:
1. service telegraf start
2. Wait a $random time period

Expected behavior:
Telegraf publishes metrics to InfluxDB server over UDP
Actual behavior:
Telegraf stops publishing metrics seemingly randomly, and all input plugins start to fail with the "took longer to collect than collection interval" error.
Additional info:
Full logs and stack trace
Earlier occurrence
Grafana snapshot of Internal Telegraf metrics