Multiple Output resilience broken #6694

Closed
lephisto opened this issue Nov 20, 2019 · 9 comments · Fixed by #6806
Assignees: danielnelson
Labels: bug (unexpected problem or unintended behavior), panic (issue that results in panics from Telegraf)
Milestone: 1.13.1

Comments

@lephisto

Relevant telegraf.conf:

    [[outputs.influxdb]]
    urls = ["http://127.0.0.1:8086"]
    database = "telegraf"
    namedrop = ["_test"]
    [outputs.influxdb.tagdrop]
    influxdb_database = [""]

    [[outputs.influxdb]]
    urls = ["http://192.168.x.y:8086/"]
    database = "telegraf"
    namedrop = ["_test"]
    metric_buffer_limit = 100000
    [outputs.influxdb.tagdrop]
    influxdb_database = [""]

    [[outputs.influxdb]]
    urls = ["http://127.0.0.1:8086"]
    database = "pmi"
    tagexclude = ["influxdb_database"]
    [outputs.influxdb.tagpass]
    influxdb_database = ["pmi"]

    [[outputs.influxdb]]
    urls = ["http://192.168.x.y:8086/"]
    database = "pmi"
    metric_buffer_limit = 100000
    tagexclude = ["influxdb_database"]
    [outputs.influxdb.tagpass]
    influxdb_database = ["pmi"]

    [[outputs.influxdb]]
    urls = ["http://127.0.0.1:8086"]
    database = "telegraf"
    tagexclude = ["influxdb_database"]
    [outputs.influxdb.tagpass]
    influxdb_database = ["idc"]

    [[outputs.influxdb]]
    urls = ["http://192.168.x.y:8086/"]
    database = "telegraf"
    tagexclude = ["influxdb_database"]
    [outputs.influxdb.tagpass]
    influxdb_database = ["idc"]

System info:

telegraf 1.12.1-1.12.4 (earlier versions are probably affected as well; only tested with 1.12.1 and 1.12.4)

Steps to reproduce:

  1. Create multiple outputs for the same metric stream
  2. Start ingesting metrics into Telegraf
  3. Take one of the outputs away ungracefully (e.g. drop its traffic with a firewall, pull the plug, etc.)

Expected behavior:

I would expect all other outputs to continue functioning unaffected, and the output that went away to queue up metrics until its buffer limit is reached.
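
For reference, the buffering I am referring to is configured via metric_buffer_limit, globally in the [agent] section or per output; a minimal sketch with illustrative values:

    [agent]
    # Default cap on unwritten metrics kept per output while it is unreachable
    metric_buffer_limit = 10000

    [[outputs.influxdb]]
    urls = ["http://192.168.x.y:8086/"]
    # Per-output override: queue up to 100000 metrics for this output alone
    metric_buffer_limit = 100000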

Actual behavior:

If one output goes away ungracefully (in my case a VPN tunnel going down; a firewall drop does the same) and requests run into timeouts, Telegraf starts dropping metrics for all outputs. If the lost output comes back, Telegraf panics and gets restarted by systemd. All buffered metrics are lost for all outputs. This happens with both the influxdb and influxdb_v2 outputs.

Additional info:

Nov 18 11:01:00 telegraf[11714]: 2019-11-18T10:01:00Z D! [outputs.influxdb] Wrote batch of 51 metrics in 61.231974ms
Nov 18 11:01:00 telegraf[11714]: 2019-11-18T10:01:00Z D! [outputs.influxdb] Buffer fullness: 1 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [outputs.influxdb] when writing to [http://192.168.x.y:8086/]: Post http://192.168.x.y:8086/write?db=telegraf: net/http: request canceled while waiting for connection
(Client.Timeout exceeded while awaiting headers)
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z D! [outputs.influxdb] Buffer fullness: 787 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [agent] Error writing to outputs.influxdb: could not write any address
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [outputs.influxdb] when writing to [http://192.168.x.y:8086/]: Post http://192.168.x.y:8086/write?db=telegraf: net/http: request canceled while waiting for connection
(Client.Timeout exceeded while awaiting headers)
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z D! [outputs.influxdb] Buffer fullness: 946 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [agent] Error writing to outputs.influxdb: could not write any address
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [outputs.influxdb] when writing to [http://192.168.x.y:8086/]: Post http://192.168.x.y:8086/write?db=pmi: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z D! [outputs.influxdb] Buffer fullness: 1000 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [agent] Error writing to outputs.influxdb: could not write any address
Nov 18 11:01:10 telegraf[11714]: 2019-11-18T10:01:10Z D! [outputs.influxdb] Buffer fullness: 0 / 100000 metrics
Nov 18 11:01:10 telegraf[11714]: 2019-11-18T10:01:10Z D! [outputs.influxdb] Wrote batch of 23 metrics in 96.375817ms
Nov 18 11:01:10 telegraf[11714]: 2019-11-18T10:01:10Z D! [outputs.influxdb] Buffer fullness: 17 / 100000 metrics

This keeps happening and data is not written to any of the outputs; I see a gap in the graphs.

This happens when the output comes back:

Nov 18 11:03:40 telegraf[11714]: panic: channel is full
Nov 18 11:03:40 telegraf[11714]: goroutine 10357 [running]:
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/agent.(*trackingAccumulator).onDelivery(0xc000292780, 0x2c11e80, 0xc002bff860)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/agent/accumulator.go:167 +0x7a
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/metric.(*trackingData).notify(...)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/metric/tracking.go:73
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/metric.(*trackingMetric).decr(0xc00174ea60)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/metric/tracking.go:163 +0x9e
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/metric.(*trackingMetric).Accept(0xc00174ea60)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/metric/tracking.go:144 +0x3a
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/internal/models.(*Buffer).metricWritten(0xc0001c2fa0, 0x2c72240, 0xc00174ea60)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/internal/models/buffer.go:93 +0x72
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/internal/models.(*Buffer).Accept(0xc0001c2fa0, 0xc002092000, 0x30, 0x30)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/internal/models/buffer.go:179 +0xa6
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/internal/models.(*RunningOutput).Write(0xc0001aa280, 0x0, 0xc000560660)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/internal/models/running_output.go:190 +0xf7
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/agent.(*Agent).flushOnce.func1(0xc001755b00, 0xc0016d7bc0)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/agent/agent.go:597 +0x27
Nov 18 11:03:40 telegraf[11714]: created by github.com/influxdata/telegraf/agent.(*Agent).flushOnce
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/agent/agent.go:596 +0xc8
Nov 18 11:03:40 systemd[1]: telegraf.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Nov 18 11:03:40 systemd[1]: telegraf.service: Unit entered failed state.
Nov 18 11:03:40 systemd[1]: telegraf.service: Failed with result 'exit-code'.
Nov 18 11:03:40 systemd[1]: telegraf.service: Service hold-off time over, scheduling restart.

@danielnelson added the bug (unexpected problem or unintended behavior), panic (issue that results in panics from Telegraf), and ready labels on Dec 3, 2019
@danielnelson self-assigned this on Dec 3, 2019
@danielnelson (Contributor)

> Expected behavior:
> I would expect all other outputs to continue functioning unaffected, and the output that went away to queue up metrics until its buffer limit is reached.

The pausing behavior here is intended. The input-side max_undelivered_messages setting controls how many messages Telegraf will take from a queue at once without acknowledged delivery. Messages must be delivered on all outputs that handle them; if one of these outputs is down, the messages it is handling will remain undelivered and the input will pause.
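
As an illustration (not part of the original comment), max_undelivered_messages lives on queue-consumer inputs such as mqtt_consumer or kafka_consumer; the broker URL and topic below are placeholders:

    [[inputs.mqtt_consumer]]
    servers = ["tcp://127.0.0.1:1883"]
    topics = ["telegraf/#"]
    # Maximum number of messages read from the broker that may be in flight
    # (not yet accepted by every output that handles them). Once this many
    # messages are undelivered, the input pauses until the outputs catch up.
    max_undelivered_messages = 1000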

Of course, the panic/crash is a bug; I believe I have fixed it in #6806.

@danielnelson added this to the 1.13.1 milestone on Dec 17, 2019
@aurimasplu

@danielnelson and what if I am OK with not delivering metrics to a faulty output? I have multiple outputs, some of which I do not control. Sometimes they go down, and I do not care if they miss data, as long as my own output destinations receive data successfully.
max_undelivered_messages seems to be related to successful delivery and will not help if one of the outputs is down.

@VY81 commented Jul 24, 2020

I have the same need as @aurimasplu

@danielnelson could you please suggest a workaround to ignore max_undelivered_messages? I am using Telegraf 1.14.4.

@Hipska (Contributor) commented Jul 25, 2020

Same here, but I think there’s another ticket for that already.

@kyjanond commented Feb 3, 2021

@Hipska can you put a reference to that ticket please?

@chintanp mentioned this issue on Jan 8, 2022
@chintanp commented Jan 8, 2022

I am using a fork of Telegraf with a postgres output (https://github.com/phemmer/telegraf/tree/postgres) on a Raspberry Pi to push to a local TimescaleDB and a remote one. It has worked fine for a long time, with low latency. Today I observed that if the remote DB goes down or access to it is lost, data does not go to any of the other locally accessible outputs. When I removed the problematic output from the conf, things started working fine. @phemmer any ideas about this? This issue seems similar to the one reported here: #6694. However, it seems your source already includes the fix for this issue.

@TotallyInformation

There is certainly some kind of issue here. I think most people would expect Telegraf to be robust to output issues when there is more than one output, and even to recover gracefully when the only output goes away and comes back (which I think you've already fixed?).

In my case, it was an issue with the MQTT output (#10180): instead of only stopping that output, it crashed Telegraf completely.

My argument is that Telegraf is often used for system monitoring; if your monitoring app crashes because of an external issue, that's a major problem. This is, after all, why we may configure more than one output channel.

My view is that no output channel should ever cause Telegraf to crash. Complain loudly, yes; crash, no.

Without this, Telegraf unfortunately cannot be used as the main or only monitoring system. And if I have to put in a second monitoring system to monitor the first, then it won't happen. That would be a shame, because I like Telegraf and would like to recommend it.

@Hipska (Contributor) commented Jan 12, 2022

I fully agree with you that Telegraf should not crash when a recoverable problem occurs on one of its outputs.

On the other hand, Telegraf is NOT a monitoring solution. It is a data collection agent (mostly for time series data) which can obviously also be used to collect monitoring data. It does not handle or keep state, does not send alerts, does no remediation on the collected data, and so on. You would always need a 'real' monitoring tool to keep an eye on whether Telegraf is still running and whether the metrics are being collected and stored, or to take actions (alerts/remediation) based on the values of the collected data.

@opobla commented Mar 21, 2024

Hi! Some years later, I have stumbled upon the same problem. I collect a lot of information to several outputs. One of these outputs is an InfluxDB instance reachable over a network connection (satellite) that is not always available. Having this data in InfluxDB in real time when the link is up is a plus, but it is not critical, since the data is also collected and transferred to another online location.

I understand the rationale behind keeping all outputs synced, but I think that sync could be optional and configurable.
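
To make that concrete, here is a purely hypothetical sketch of what such an opt-out might look like; the best_effort key below does not exist in Telegraf and is only meant to illustrate the idea:

    [[outputs.influxdb]]
    urls = ["http://satellite-host:8086"]
    database = "telegraf"
    # HYPOTHETICAL, not a real Telegraf option: metrics routed to this output
    # would not count as undelivered for the inputs, so an outage on the
    # satellite link would never pause collection for the other outputs.
    best_effort = true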
