Telegraf 1.20.3 to 1.21.2 failing to startup - mqtt.output fails #10180
Comments
Also noting that commenting out the Also noting that other processes are quite happy using both the broker service and DNS. |
Have you tried changing I think telegraf doesn't "crash", but exits with error code if it can't connect to mqtt server. Perhaps telegraf could be more robust in detecting existing scheme(tcp://) in servers and periodically retry failed connection. |
@TotallyInformation could you confirm fixing the server option solves this issue? |
Hi, sorry for the delay. I've just tried that change and it hasn't worked. I'm getting the same problem. Currently running Telegraf v1.20.4. Journal Log:
Also noting that the examples in the default telegraf conf file include the I also checked my mosquitto config just to make sure that it was allowing anonymous connections which it is. |
Now it’s a totally different problem. Telegraf is able to connect to the server, but now the “identifier” is rejected by the mqtt server. |
Well it still isn't working and it WAS working. No config changes had been made until I discovered it wasn't working. The broker accepts anonymous connections and can be contacted locally and also hasn't been changed. Many of the example settings for urls also include the tcp prefix. |
With the config option |
Still getting the same problem. Telegraf service still crashing if I try to turn on MQTT outputs. Telegraf Version: Telegraf 1.21.2 (git: HEAD 30d981d) Config tried:
Also tried without the client ID and with a different prefix. Mosquitto is accepting anonymous connections and doesn't require TLS. Log
|
I am also using mosquitto to try to reproduce. I first verified that I can connect to the mqtt server using telnet:
Can you please try that command and ensure that you are able to connect? If so, can you also share the mosquitto configuration? I used this Telegraf config:
[[inputs.file]]
files = ["log.json"]
data_format = "json"
[[outputs.mqtt]]
servers = ["192.168.100.189:1883"]
topic_prefix = "telegraf"
client_id = "telegraf"
data_format = "json"
and got the following output:
./telegraf --config config.toml --debug --once
2022-01-06T17:46:44Z I! Starting Telegraf 1.22.0-c353bace
2022-01-06T17:46:44Z I! Loaded inputs: file
2022-01-06T17:46:44Z I! Loaded aggregators:
2022-01-06T17:46:44Z I! Loaded processors:
2022-01-06T17:46:44Z I! Loaded outputs: mqtt
2022-01-06T17:46:44Z I! Tags enabled: host=ryzen
2022-01-06T17:46:44Z D! [agent] Initializing plugins
2022-01-06T17:46:44Z D! [agent] Connecting outputs
2022-01-06T17:46:44Z D! [agent] Attempting connection to [outputs.mqtt]
2022-01-06T17:46:44Z D! [agent] Successfully connected to outputs.mqtt
2022-01-06T17:46:44Z D! [agent] Starting service inputs
2022-01-06T17:46:44Z D! [agent] Stopping service inputs
2022-01-06T17:46:44Z D! [agent] Input channel closed
2022-01-06T17:46:44Z I! [agent] Hang on, flushing any cached metrics before shutdown
2022-01-06T17:46:44Z D! [outputs.mqtt] Wrote batch of 1 metrics in 103.961µs
2022-01-06T17:46:44Z D! [outputs.mqtt] Buffer fullness: 0 / 10000 metrics
2022-01-06T17:46:44Z I! [agent] Stopping running outputs
2022-01-06T17:46:44Z D! [agent] Stopped Successfully
I have also run Telegraf on the system itself using:
And that was also successful. |
OK, so here is the bottom line. There were 2 issues.
Changing both of those things now lets me re-enable the MQTT output. Finally. Oh, and I still think that a simple config error in Telegraf should NOT cause it to completely fall over. Certainly it should stop that specific output but the rest of Telegraf should continue to work. Without that, Telegraf isn't very useful as a system monitoring tool as it is far too fragile and subject to crashes when other services change. Which is why I'm not closing the issue. |
Yeah, I was just able to reproduce after updating to mosquitto 2.0.14!
For 1 - We should at least update the docs to specify that it is not required. I can follow up with that.
For 2 - It looks like this was identified as part of #9803, and the docs specifically call out that the value must be set for version v2.0.12. I will also update the docs to make this more visible and to specify that it applies not just to v2.0.12 but to later versions as well, since I saw it with 2.0.14.
For your edit about Telegraf not falling over: I would not classify this as a config error. In this situation, a connection to an output could not be established. As part of the initialization process, Telegraf attempts to connect to all outputs, and if there are any failures during that step, Telegraf stops. This is absolutely the intended behavior. I do think we could do better with retries in some situations, but Telegraf should not start running if it cannot connect to its outputs in the first place. |
There is some confusion regarding the servers string as well as the need to use the keep_alive param with later versions of mosquitto. This tries to make the README read a bit more clearly and not need two different sections that detail the various parameters. Finally, it gets the README and the configuration in sync with each other. Fixes: influxdata#10180
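For readers landing here with the same symptom, a minimal sketch of an MQTT output section that reflects the two points discussed in this thread: the server is given without a scheme prefix, as in the maintainer's test config, and keep_alive is set to a non-zero value for newer mosquitto releases. The address is the one from the test config and the value 30 is only an illustrative assumption; neither is a recommendation from the maintainers.

```toml
[[outputs.mqtt]]
  # Broker address copied from the maintainer's test config; replace with your own.
  servers = ["192.168.100.189:1883"]
  topic_prefix = "telegraf"
  client_id = "telegraf"
  data_format = "json"
  # Assumed value for illustration: any non-zero number of seconds.
  # The thread reports that mosquitto 2.0.12 and later need this to be set.
  keep_alive = 30
```

Running Telegraf with --debug --once against such a config, as shown earlier in the thread, is a quick way to check whether the output connects.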
Thanks for all that.
As an IT Enterprise Architect, I'd have to disagree with that approach I'm afraid. If an output dependency is updated and it stops Telegraf from producing output on that channel, then absolutely Telegraf should complain. But as many of us are using Telegraf to consolidate and report on system utilisation and performance, having the whole thing stop dead because someone updated another service is not what I'd expect or want. After all, I do have other channels configured as well. I had thought about recommending it for some of our systems but this would mean that I'd still have to implement something else to keep an eye on Telegraf as well.
Of course, one tremendous improvement that could be made to the MQTT output (and I've no idea how easy or feasible this would be) would be to allow specification of a Last Will and Testament message so that the Broker itself could indicate that it has stopped receiving messages from Telegraf.
Anyway, I'm not an expert on logging and monitoring systems so doubtless I've missed some subtlety. I'm currently just using Telegraf for my home automation systems and so I now have to go and build a separate monitoring workflow to see if Telegraf is still doing what I asked it to. Not ideal but not the end of the world in this situation. Still, thanks a lot for responding. |
I agree with @TotallyInformation that telegraf should not halt if one of the outputs could not be connected while there are still others. I believe there are multiple open issues to discuss that. |
@TotallyInformation : Thank you for the hint. With |
See also #6694 |
Relevant telegraf.conf
System info
Telegraf 1.20.3 and 1.20.4, Debian GNU/Linux 10 (buster) 4.19.0-18-amd64, mosquitto version 2.0.12
Docker
No response
Steps to reproduce
systemd attempts to start the Telegraf service
Expected behavior
Telegraf should start up.
Telegraf used to start up before recent updates.
Actual behavior
Telegraf fails to start with the following errors in the log:
Note that Telegraf is attempting (and failing) to do a DNS lookup to 192.168.1.1 - this is the correct DNS server for this network and works for everything else. However, it should not be doing a DNS lookup anyway since the Mosquitto server is specified using an IP address.
What's more is that failure to connect to the broker should NOT crash Telegraf.
Additional info
Note that I also tried changing the broker server to
["tcp://127.0.0.1:1883"]
and the error was the same.