inputs.mqtt/mqtt_consumer: allow connection errors on start #10694

Closed
greg-mcnamara opened this issue Feb 21, 2022 · 26 comments · Fixed by #15486
Labels: area/mqtt; feature request (Requests for new plugin and for new features to existing plugins); help wanted (Request for community participation, code, contribution); size/m (2-4 day effort)

Comments

@greg-mcnamara

Feature Request

Proposal:

Telegraf should not crash when a single input fails to connect to its source. Ideally it would continue to retry the connection for that input, or permanently fail but continue running so that other inputs and outputs continue to work normally. There seem to be several bug reports related to this, including #3167 and #10078.

Current behavior:

A single telegraf mqtt_consumer input that fails to connect to an mqtt broker causes the entire telegraf service to shut down.

Example log (after the final log entry the telegraf service exits and logging ceases):

2022-02-20T23:53:35Z I! Starting Telegraf 1.21.4
2022-02-20T23:53:35Z I! Using config file: /etc/telegraf/telegraf.conf
2022-02-21T12:53:35+13:00 I! Loaded inputs: mqtt_consumer (2x)
2022-02-21T12:53:35+13:00 I! Loaded aggregators:
2022-02-21T12:53:35+13:00 I! Loaded processors:
2022-02-21T12:53:35+13:00 I! Loaded outputs: influxdb_v2 (2x)
2022-02-21T12:53:35+13:00 I! Tags enabled: host=telegraf-c9fc696bc-xb4r8
2022-02-21T12:53:35+13:00 I! [agent] Config: Interval:10s, Quiet:false, Hostname:"telegraf-c9fc696bc-xb4r8", Flush Interval:10s
2022-02-21T12:53:35+13:00 D! [agent] Initializing plugins
2022-02-21T12:53:35+13:00 D! [agent] Connecting outputs
2022-02-21T12:53:35+13:00 D! [agent] Attempting connection to [outputs.influxdb_v2]
2022-02-21T12:53:35+13:00 D! [agent] Successfully connected to outputs.influxdb_v2
2022-02-21T12:53:35+13:00 D! [agent] Attempting connection to [outputs.influxdb_v2]
2022-02-21T12:53:35+13:00 D! [agent] Successfully connected to outputs.influxdb_v2
2022-02-21T12:53:35+13:00 D! [agent] Starting service inputs
2022-02-21T12:57:45+13:00 E! [telegraf] Error running agent: starting input inputs.mqtt_consumer: network Error : EOF

Desired behavior:

Each input would be responsible for its own data source connection and not affect other inputs/outputs when the connection fails.

Use case:

The software is not usable in production without this functionality.

greg-mcnamara added the feature request label Feb 21, 2022
@powersj
Contributor

powersj commented Feb 22, 2022

Hi,

Unfortunately, this is the intended behavior of Telegraf, but I do see room for improvement on a per-plugin basis. First, let's consider your error message:

Error running agent: starting input inputs.mqtt_consumer: network Error : EOF

Think about what happens when a user mistypes their username/password or sets the wrong hostname/IP address for a service to collect from in their config. It is a lot less clear to users that something is wrong if Telegraf keeps on going. In your case (if I ignore the timestamp), is that network error due to a config error or an actual network issue? Failing prevents a false sense that everything is working.

In terms of improvements, I do think we should add some retries around some error conditions like #10078 tries to call out.

Given you are working with mqtt, it looks like you lost connection, we tried to connect and bailed? I am of the opinion that we should have some sort of exponential backoff retry logic in cases like these, but we should ultimately fail if after t time things do not clear up.

Thoughts?

@greg-mcnamara
Author

Thanks @powersj. It turns out the problem was caused by an incorrect mqtt server URL (it needed ssl:// instead of tcp:// for MQTTS), but my main concern was that the whole telegraf service (in my case the Kubernetes pod's container) crashed and had to restart. Would it have crashed if I'd had other inputs configured that were successfully connected? I think each input should fail after a connection timeout and possibly some retries, but that should not cause the whole service to fail. Does that sound reasonable? Sorry, I'm a telegraf and influxdb newbie and just learning as I go; I hope I'm not making incorrect assumptions about how it does or should work.

@ryanpjbyrne

ryanpjbyrne commented Mar 11, 2022

Experiencing the same issue with mqtt_consumer. When an MQTT broker is not available, the whole telegraf service will fail and stop.

This problem seems to occur from v1.19.0 onward. As a workaround I have downgraded to v1.18.3 for the time being.

Just to add my two cents, some kind of flag to indicate that an input has to be healthy would be ideal to allow the user to pick and choose (with a default in place) which inputs matter.

@observeralone

As in issue #11289, which I submitted, I would like to add a configuration option that lets the user decide whether to ignore an input that fails to initialize. For details, please refer to that issue.

I hope to get a clear answer from you, @srebhan: do you want to do this, and if so, how? If it would take too long, we'll try to fix it ourselves.

Thank you.

@haoel

haoel commented Jun 14, 2022

I have the same issue here: if telegraf cannot connect to mongodb or dockerd, the whole telegraf process crashes.

As we are using one telegraf to monitor a number of things, one input error makes the other inputs stop working even if they were initialized correctly. I think this behavior does not make sense.

If this is a double-edged sword, I hope we could have an option to let users decide how to configure it.

zhao-kun pushed a commit to megaease/telegraf that referenced this issue Jun 14, 2022
* add ignore_init_fail_input option for ignore initialization failed Input influxdata#11289 influxdata#10694

* rename option ignore_init_fail_input to ignore_error_inputs
@srebhan
Member

srebhan commented Jun 21, 2022

@haoel please create an issue for MongoDB and a separate one for Docker with a description of the failure. We should fix the two plugins.

srebhan removed the area/mqtt label Jun 22, 2022
powersj changed the title from "Prevent telegraf from crashing when an input connection fails" to "inputs.mqtt/mqtt_consumer: allow connection errors on start" Apr 4, 2023
powersj added the area/mqtt, size/m (2-4 day effort), and help wanted labels Apr 4, 2023
@pkkrusty

pkkrusty commented Jun 13, 2023

I guess this is not resolved yet? Seems crazy that one bad input would prevent all other collectors from functioning. In my case, I have multiple MQTT brokers, and if one drops off the network, none work because Telegraf can't handle the failed connection on startup.

Should note that if Telegraf has connection on startup, everything is fine. If the connection to the mqtt broker then drops out, Telegraf doesn't care and keeps on chugging.

As it should.

Telegraf is rarely used in isolation, and anyone who is ingesting data is likely doing something with that data, and has other methods of noticing if there's a problem. One failed input shouldn't take down the whole system.

@simonsmart99

I would like to add another use case that hopefully supports a request to change the error handling behavior within an input.

I have a device on TTN which intermittently changes the payload of a specific topic. In the example below, the payload sometimes includes status detail within the path "uplink_message.decoded_payload" and sometimes does not; however, the location data (which I need) is always included in the payload.

[[inputs.mqtt_consumer]]
  servers = ["tcp://eu1.cloud.thethings.network:1883"]
  topics = ["v3/loratech-test@ttn/devices/+/up", "v3/+/devices/+/location/solved"]
  connection_timeout = "30s"
  username = "myapplicationname"
  password = "myttntoken"
  json_string_fields = ["uplink_message_frm_payload"] 
  data_format = "json_v2"

  [[inputs.mqtt_consumer.json_v2]]
      # Create Measurement for TTN Data
      measurement_name = "ttn_location"
      [[inputs.mqtt_consumer.json_v2.field]]
          path = "end_device_ids.device_id"
          type = "string"
      [[inputs.mqtt_consumer.json_v2.field]]
          path = "uplink_message.locations.frm-payload.latitude"
          type = "float"
      [[inputs.mqtt_consumer.json_v2.field]]
          path = "uplink_message.locations.frm-payload.longitude"
          type = "float"

  [[inputs.mqtt_consumer.json_v2]]
      # Create Measurement for TTN Data
      measurement_name = "ttn_status"
      [[inputs.mqtt_consumer.json_v2.field]]
          path = "uplink_message.decoded_payload.ALARM_status"
          type = "string"
      [[inputs.mqtt_consumer.json_v2.field]]
          path = "uplink_message.decoded_payload.BatV"
          type = "float"
      [[inputs.mqtt_consumer.json_v2.field]]
          path = "uplink_message.decoded_payload.MD"
          type = "string"

When the status detail is not included the entire input fails with an error code indicated on --debug. Although the location data is valid, the measurement is not parsed to the influxdb output.

A note here. I am very new to this, so could well be approaching this in the wrong way. Any advice is very welcome.

@mprasil

mprasil commented Sep 30, 2023

+1 from me for some initial retry with exponential backoff. The failure mode I've observed was that the configuration in telegraf was 100% valid; it just took a little bit longer for the MQTT service to start up after boot, and the telegraf service on the same machine errored out in the meantime.

@CubicEarth

Is there any solution or workaround for this?

My telegraf tries to connect to an MQTT server, and it also runs ping tests to check connectivity to a number of devices, and then writes everything to InfluxDB.

I had my MQTT server go down. Unfortunately this caused telegraf to continually restart and never run or complete any of the ping tests.

I would be happy if I could just make telegraf try to connect to the MQTT server every 30 seconds or something, and in the meantime continue to run the ping tests as normal.

@srebhan
Member

srebhan commented Jun 11, 2024

@CubicEarth and all others, please test the binary in PR #15486, available as soon as CI finishes the tests, and set startup_error_behavior = "retry" in your plugin configuration! Let me know if this fixes your issue!

@CubicEarth

CubicEarth commented Jun 11, 2024 via email

@srebhan
Member

srebhan commented Jun 11, 2024

@mprasil or @simonsmart99 or anyone else reading this, can you please test?!?

@mprasil

mprasil commented Jun 11, 2024

@srebhan testing it with a live system is a bit more involved, but I put together a little test configuration:

# cat telegraf.toml
[agent]
        hostname = "test"
[[outputs.file]]
        files = ["stdout"]
[[inputs.mqtt_consumer]]
        data_format = "value"
        servers = ["tcp://localhost:1883"]
        topics = ["test/topic/#"]
        startup_error_behavior = "retry"

and then ran the downloaded binary (while having mqtt off) with telegraf --config telegraf.toml, which seems to work as expected:

2024-06-11T21:51:47Z I! Loading config: ./telegraf.toml
2024-06-11T21:51:47Z I! Starting Telegraf 1.32.0-427e6ab1 brought to you by InfluxData the makers of InfluxDB
2024-06-11T21:51:47Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-06-11T21:51:47Z I! Loaded inputs: mqtt_consumer
2024-06-11T21:51:47Z I! Loaded aggregators:
2024-06-11T21:51:47Z I! Loaded processors:
2024-06-11T21:51:47Z I! Loaded secretstores:
2024-06-11T21:51:47Z I! Loaded outputs: file
2024-06-11T21:51:47Z I! Tags enabled: host=test
2024-06-11T21:51:47Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"test", Flush Interval:10s
2024-06-11T21:51:47Z I! [inputs.mqtt_consumer] Startup failed: network Error : dial tcp 127.0.0.1:1883: connect: connection refused; retrying...
2024-06-11T21:51:50Z E! [inputs.mqtt_consumer] Error in plugin: not connected
==== Here I started MQTT =====
2024-06-11T21:52:00Z I! [inputs.mqtt_consumer] Connected [tcp://localhost:1883]
mqtt_consumer,host=test,topic=test/topic value=42i 1718142723237521771
mqtt_consumer,host=test,topic=test/topic value=42i 1718142728656537543
mqtt_consumer,host=test,topic=test/topic value=42i 1718142735883563827

So it looks like it works exactly as expected: it disables the plugin while MQTT is unreachable, logging the Error in plugin: not connected message every collection interval. Then, once MQTT is up, it connects and starts collecting metrics. This would be exactly what I need.

@mprasil

mprasil commented Jun 11, 2024

The only issue I've observed is that once it connects it never tries to reconnect again should connection be dropped again. So if I start with MQTT up, then stop MQTT and start it again, telegraf will forever print Error in plugin: not connected and will never reconnect.

@powersj
Contributor

powersj commented Jun 11, 2024

> The only issue I've observed is that once it connects it never tries to reconnect again should connection be dropped again. So if I start with MQTT up, then stop MQTT and start it again, telegraf will forever print Error in plugin: not connected and will never reconnect.

Can you enable debug logging in your agent config and set client_trace = true in your MQTT config, please, and see what the mqtt client says it is doing? We had something similar recently in #15429.

@mprasil

mprasil commented Jun 12, 2024

I could not enable client_trace:

2024-06-12T08:20:12Z I! Loading config: ./telegraf.toml
2024-06-12T08:20:12Z E! error loading config file ./telegraf.toml: plugin inputs.mqtt_consumer: line 5: configuration specified the fields ["client_trace"], but they were not used. This is either a typo or this config option does not exist in this version.

I ran the test with debug enabled:

❯ ./telegraf --debug --config ./telegraf.toml
2024-06-12T08:23:49Z I! Loading config: ./telegraf.toml
2024-06-12T08:23:49Z I! Starting Telegraf 1.32.0-427e6ab1 brought to you by InfluxData the makers of InfluxDB
2024-06-12T08:23:49Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-06-12T08:23:49Z I! Loaded inputs: mqtt_consumer
2024-06-12T08:23:49Z I! Loaded aggregators:
2024-06-12T08:23:49Z I! Loaded processors:
2024-06-12T08:23:49Z I! Loaded secretstores:
2024-06-12T08:23:49Z I! Loaded outputs: file
2024-06-12T08:23:49Z I! Tags enabled: host=test
2024-06-12T08:23:49Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"test", Flush Interval:10s
2024-06-12T08:23:49Z D! [agent] Initializing plugins
2024-06-12T08:23:49Z D! [agent] Connecting outputs
2024-06-12T08:23:49Z D! [agent] Attempting connection to [outputs.file]
2024-06-12T08:23:49Z D! [agent] Successfully connected to outputs.file
2024-06-12T08:23:49Z D! [agent] Starting service inputs
2024-06-12T08:23:49Z I! [inputs.mqtt_consumer] Connected [tcp://localhost:1883]
2024-06-12T08:23:57Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: EOF
2024-06-12T08:23:57Z D! [inputs.mqtt_consumer]  Disconnected [tcp://localhost:1883]
2024-06-12T08:23:59Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T08:24:00Z D! [inputs.mqtt_consumer]  Connecting [tcp://localhost:1883]
2024-06-12T08:24:00Z E! [inputs.mqtt_consumer] Error in plugin: network Error : dial tcp 127.0.0.1:1883: connect: connection refused
2024-06-12T08:24:09Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T08:24:10Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T08:24:19Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T08:24:20Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T08:24:29Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T08:24:30Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T08:24:39Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T08:24:40Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T08:24:49Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T08:24:50Z E! [inputs.mqtt_consumer] Error in plugin: not connected
^C2024-06-12T08:24:57Z D! [agent] Stopping service inputs

On the MQTT side I can see the first connection:

1718180176: mosquitto version 2.0.18 starting
1718180176: Config loaded from /mosquitto/config/mosquitto.conf.
1718180176: Starting in local only mode. Connections will only be possible from clients running on this machine.
1718180176: Create a configuration file which defines a listener to allow remote access.
1718180176: For more details see https://mosquitto.org/documentation/authentication-methods/
1718180176: Opening ipv4 listen socket on port 1883.
1718180176: Opening ipv6 listen socket on port 1883.
1718180176: mosquitto version 2.0.18 running
1718180629: New connection from 127.0.0.1:56060 on port 1883.
1718180629: New client connected from 127.0.0.1:56060 as Telegraf-Consumer-MHLTP (p2, c1, k60).
^C1718180637: mosquitto version 2.0.18 terminating

But when I start it next time, it does not see any clients connecting:

1718180649: mosquitto version 2.0.18 starting
1718180649: Config loaded from /mosquitto/config/mosquitto.conf.
1718180649: Starting in local only mode. Connections will only be possible from clients running on this machine.
1718180649: Create a configuration file which defines a listener to allow remote access.
1718180649: For more details see https://mosquitto.org/documentation/authentication-methods/
1718180649: Opening ipv4 listen socket on port 1883.
1718180649: Opening ipv6 listen socket on port 1883.
1718180649: mosquitto version 2.0.18 running

If you want to test it yourself, the config I used is up in my previous comment and the MQTT is just locally running eclipse-mosquitto container with host networking for simplicity:

docker run -it --net=host --rm eclipse-mosquitto

@srebhan
Member

srebhan commented Jun 12, 2024

@mprasil I guess the issue is that the connection loss detection has some insane defaults... It depends on the "keep-alive" interval and the "ping timeout" which are set to 60 seconds and 10 seconds respectively. So the time until we reconnect will sum up the two plus (in the worst case) your interval setting.

I added two parameters to the config, keep_alive and ping_timeout, for tuning the values... Could you please retest with the knowledge above and/or modified parameters?

@mprasil

mprasil commented Jun 12, 2024

I ran the test again with the same config as before:

2024-06-12T16:28:07Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:28:16Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: EOF
2024-06-12T16:28:16Z D! [inputs.mqtt_consumer]  Disconnected [tcp://localhost:1883]
2024-06-12T16:28:17Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:28:20Z D! [inputs.mqtt_consumer]  Connecting [tcp://localhost:1883]
2024-06-12T16:28:20Z E! [inputs.mqtt_consumer] Error in plugin: network Error : dial tcp 127.0.0.1:1883: connect: connection refused
2024-06-12T16:28:27Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:28:30Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:28:37Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:28:40Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:28:47Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:28:50Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:28:57Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:29:00Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:29:07Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:29:10Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:29:17Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:29:20Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:29:27Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:29:30Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:29:37Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:29:40Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:29:47Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:29:50Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:29:57Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:30:00Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:30:07Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:30:10Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:30:17Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:30:20Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:30:27Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:30:30Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:30:37Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:30:40Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:30:47Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:30:50Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:30:57Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:31:00Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:31:07Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:31:10Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:31:17Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:31:20Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:31:27Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:31:30Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:31:37Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:31:40Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:31:47Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:31:50Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:31:57Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:32:00Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:32:07Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:32:10Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:32:17Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:32:20Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:32:27Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:32:30Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:32:37Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:32:40Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:32:47Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:32:50Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:32:57Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:33:00Z E! [inputs.mqtt_consumer] Error in plugin: not connected

It's been a couple of minutes and telegraf still has not reconnected.

@srebhan
Member

srebhan commented Jun 12, 2024

@mprasil did you really download and run the latest version from the PR?

@mprasil

mprasil commented Jun 12, 2024

Just downloaded the latest version from #15486 and it does indeed seem to reconnect. However, I managed to crash it after a couple of rounds of reconnections:

2024-06-12T20:11:11Z I! Loading config: ./telegraf.toml
2024-06-12T20:11:11Z I! Starting Telegraf 1.32.0-c13996f5 brought to you by InfluxData the makers of InfluxDB
2024-06-12T20:11:11Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-06-12T20:11:11Z I! Loaded inputs: mqtt_consumer
2024-06-12T20:11:11Z I! Loaded aggregators:
2024-06-12T20:11:11Z I! Loaded processors:
2024-06-12T20:11:11Z I! Loaded secretstores:
2024-06-12T20:11:11Z I! Loaded outputs: file
2024-06-12T20:11:11Z I! Tags enabled: host=test
2024-06-12T20:11:11Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"test", Flush Interval:10s
2024-06-12T20:11:11Z D! [agent] Initializing plugins
2024-06-12T20:11:11Z D! [agent] Connecting outputs
2024-06-12T20:11:11Z D! [agent] Attempting connection to [outputs.file]
2024-06-12T20:11:11Z D! [agent] Successfully connected to outputs.file
2024-06-12T20:11:11Z D! [agent] Starting service inputs
2024-06-12T20:11:11Z I! [inputs.mqtt_consumer] Startup failed: network Error : dial tcp 127.0.0.1:1883: connect: connection refused; retrying...
2024-06-12T20:11:20Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T20:11:21Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T20:11:30Z I! [inputs.mqtt_consumer] Connected [tcp://localhost:1883]
2024-06-12T20:11:30Z D! [inputs.mqtt_consumer]  Successfully connected after 2 attempts
2024-06-12T20:11:31Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T20:11:36Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: EOF
2024-06-12T20:11:36Z D! [inputs.mqtt_consumer]  Disconnected [tcp://localhost:1883]
2024-06-12T20:11:40Z D! [inputs.mqtt_consumer]  Connecting [tcp://localhost:1883]
2024-06-12T20:11:40Z E! [inputs.mqtt_consumer] Error in plugin: network Error : dial tcp 127.0.0.1:1883: connect: connection refused
2024-06-12T20:11:41Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T20:11:50Z D! [inputs.mqtt_consumer]  Connecting [tcp://localhost:1883]
2024-06-12T20:11:50Z I! [inputs.mqtt_consumer] Connected [tcp://localhost:1883]
2024-06-12T20:11:51Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T20:11:54Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: EOF
2024-06-12T20:11:54Z D! [inputs.mqtt_consumer]  Disconnected [tcp://localhost:1883]
2024-06-12T20:12:00Z D! [inputs.mqtt_consumer]  Connecting [tcp://localhost:1883]
2024-06-12T20:12:00Z E! FATAL: [inputs.mqtt_consumer] panicked: runtime error: invalid memory address or nil pointer dereference, Stack:
goroutine 189 [running]:
github.com/influxdata/telegraf/agent.panicRecover(0xc0022aa120)
        /go/src/github.com/influxdata/telegraf/agent/agent.go:1202 +0x70
panic({0x747dc40?, 0xe7471a0?})
        /usr/local/go/src/runtime/panic.go:770 +0x132
github.com/influxdata/telegraf/plugins/inputs/mqtt_consumer.(*MQTTConsumer).connect(0xc001eb9608)
        /go/src/github.com/influxdata/telegraf/plugins/inputs/mqtt_consumer/mqtt_consumer.go:186 +0x271
github.com/influxdata/telegraf/plugins/inputs/mqtt_consumer.(*MQTTConsumer).Gather(0xc001eb9608, {0x7ea3040?, 0x4e23fa?})
        /go/src/github.com/influxdata/telegraf/plugins/inputs/mqtt_consumer/mqtt_consumer.go:336 +0xcd
github.com/influxdata/telegraf/models.(*RunningInput).Gather(0xc0022aa120, {0x9523f80, 0xc00238c8e0})
        /go/src/github.com/influxdata/telegraf/models/running_input.go:228 +0x271
github.com/influxdata/telegraf/agent.(*Agent).gatherOnce.func1()
        /go/src/github.com/influxdata/telegraf/agent/agent.go:583 +0x5e
created by github.com/influxdata/telegraf/agent.(*Agent).gatherOnce in goroutine 64
        /go/src/github.com/influxdata/telegraf/agent/agent.go:581 +0xf7

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc001c929b0?)
        /usr/local/go/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc000f405d0?)
        /usr/local/go/src/sync/waitgroup.go:116 +0x48
github.com/influxdata/telegraf/agent.(*Agent).Run(0xc000f405d0, {0x94ead68, 0xc001b00ff0})
        /go/src/github.com/influxdata/telegraf/agent/agent.go:197 +0xa2c
main.(*Telegraf).runAgent(0xc0020ae000, {0x94ead68, 0xc001b00ff0}, 0x0?)
        /go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:443 +0x174c
main.(*Telegraf).reloadLoop(0xc0020ae000)
        /go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:189 +0x265
main.(*Telegraf).Run(0xc0020ae000)
        /go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf_posix.go:19 +0xbe
main.runApp.func1(0xc001c15b80)
        /go/src/github.com/influxdata/telegraf/cmd/telegraf/main.go:251 +0xcf0
github.com/urfave/cli/v2.(*Command).Run(0xc0020afb80, 0xc001c15b80, {0xc00024c040, 0x4, 0x4})
        /go/pkg/mod
2024-06-12T20:12:00Z E! PLEASE REPORT THIS PANIC ON GITHUB with stack trace, configuration, and OS information: https://github.com/influxdata/telegraf/issues/new/choose

It is kind of random, sometimes it happens on first try, sometimes it takes multiple tries.

@srebhan
Member

srebhan commented Jun 12, 2024

Thanks for all your testing @mprasil!
Update pushed. Please download the latest binary once CI has finished the build and retest!

@mprasil

mprasil commented Jun 13, 2024

Yeah, all good with the latest version. I've tortured telegraf with a disconnection every couple of seconds and it kept reconnecting as it should, with no crashes observed. Thank you @srebhan, I'm looking forward to running this in prod at some stage.

@pkkrusty

pkkrusty commented Sep 9, 2024

This commit was merged in June and included in the 1.32 milestone. I'm on 1.31.3. Is there an estimated timeline for 1.32 push?

@srebhan
Member

srebhan commented Sep 11, 2024

Well the release was on Monday... :-D

@pkkrusty

Haha just saw it hit when I updated my raspberry this morning. Thanks! Implemented the line in my conf file and it's looking good so far.
