Multiple Output resilience broken #6694

Closed
lephisto opened this issue Nov 20, 2019 · 9 comments · Fixed by #6806
Assignees: danielnelson
Labels: bug (unexpected problem or unintended behavior), panic (issue that results in panics from Telegraf)
Milestone: 1.13.1

Comments

@lephisto

Relevant telegraf.conf:

    [[outputs.influxdb]]
    urls = ["http://127.0.0.1:8086"]
    database = "telegraf"
    namedrop = ["_test"]
    [outputs.influxdb.tagdrop]
    influxdb_database = [""]

    [[outputs.influxdb]]
    urls = ["http://192.168.x.y:8086/"]
    database = "telegraf"
    namedrop = ["_test"]
    metric_buffer_limit = 100000
    [outputs.influxdb.tagdrop]
    influxdb_database = [""]

    [[outputs.influxdb]]
    urls = ["http://127.0.0.1:8086"]
    database = "pmi"
    tagexclude = ["influxdb_database"]
    [outputs.influxdb.tagpass]
    influxdb_database = ["pmi"]

    [[outputs.influxdb]]
    urls = ["http://192.168.x.y:8086/"]
    database = "pmi"
    metric_buffer_limit = 100000
    tagexclude = ["influxdb_database"]
    [outputs.influxdb.tagpass]
    influxdb_database = ["pmi"]

    [[outputs.influxdb]]
    urls = ["http://127.0.0.1:8086"]
    database = "telegraf"
    tagexclude = ["influxdb_database"]
    [outputs.influxdb.tagpass]
    influxdb_database = ["idc"]

    [[outputs.influxdb]]
    urls = ["http://192.168.x.y:8086/"]
    database = "telegraf"
    tagexclude = ["influxdb_database"]
    [outputs.influxdb.tagpass]
    influxdb_database = ["idc"]

System info:

telegraf 1.12.1-1.12.4 (earlier versions are probably affected as well; only tested with 1.12.1 and 1.12.4)

Steps to reproduce:

  1. Create multiple outputs for the same metric stream
  2. Start ingesting metrics into Telegraf
  3. Take one of the outputs away ungracefully (e.g. drop its traffic with a firewall, pull the plug, etc.)

Expected behavior:

I would expect all other outputs to continue functioning unaffected, and the output that went away to queue up metrics until its buffer limit is reached.
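
For reference, the buffering I am referring to is configured via metric_buffer_limit, globally in the [agent] section or per output; a minimal sketch with illustrative values:

    [agent]
    # Default cap on unwritten metrics kept per output while it is unreachable
    metric_buffer_limit = 10000

    [[outputs.influxdb]]
    urls = ["http://192.168.x.y:8086/"]
    # Per-output override: queue up to 100000 metrics for this output alone
    metric_buffer_limit = 100000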

Actual behavior:

If one output goes away ungracefully (in my case a VPN tunnel going down; a firewall drop does the same) and requests run into timeouts, Telegraf starts dropping metrics for all outputs. If the lost output comes back, Telegraf panics and gets restarted by systemd. All buffered metrics are lost for all outputs. This happens with both the influxdb and influxdb_v2 outputs.

Additional info:

Nov 18 11:01:00 telegraf[11714]: 2019-11-18T10:01:00Z D! [outputs.influxdb] Wrote batch of 51 metrics in 61.231974ms
Nov 18 11:01:00 telegraf[11714]: 2019-11-18T10:01:00Z D! [outputs.influxdb] Buffer fullness: 1 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [outputs.influxdb] when writing to [http://192.168.x.y:8086/]: Post http://192.168.x.y:8086/write?db=telegraf: net/http: request canceled while waiting for connection
(Client.Timeout exceeded while awaiting headers)
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z D! [outputs.influxdb] Buffer fullness: 787 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [agent] Error writing to outputs.influxdb: could not write any address
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [outputs.influxdb] when writing to [http://192.168.x.y:8086/]: Post http://192.168.x.y:8086/write?db=telegraf: net/http: request canceled while waiting for connection
(Client.Timeout exceeded while awaiting headers)
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z D! [outputs.influxdb] Buffer fullness: 946 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [agent] Error writing to outputs.influxdb: could not write any address
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [outputs.influxdb] when writing to [http://192.168.x.y:8086/]: Post http://192.168.x.y:8086/write?db=pmi: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z D! [outputs.influxdb] Buffer fullness: 1000 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [agent] Error writing to outputs.influxdb: could not write any address
Nov 18 11:01:10 telegraf[11714]: 2019-11-18T10:01:10Z D! [outputs.influxdb] Buffer fullness: 0 / 100000 metrics
Nov 18 11:01:10 telegraf[11714]: 2019-11-18T10:01:10Z D! [outputs.influxdb] Wrote batch of 23 metrics in 96.375817ms
Nov 18 11:01:10 telegraf[11714]: 2019-11-18T10:01:10Z D! [outputs.influxdb] Buffer fullness: 17 / 100000 metrics

This keeps happening and data is not written to any of the outputs; I see a gap in the graphs.

This happens when the output comes back:

Nov 18 11:03:40 telegraf[11714]: panic: channel is full
Nov 18 11:03:40 telegraf[11714]: goroutine 10357 [running]:
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/agent.(*trackingAccumulator).onDelivery(0xc000292780, 0x2c11e80, 0xc002bff860)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/agent/accumulator.go:167 +0x7a
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/metric.(*trackingData).notify(...)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/metric/tracking.go:73
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/metric.(*trackingMetric).decr(0xc00174ea60)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/metric/tracking.go:163 +0x9e
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/metric.(*trackingMetric).Accept(0xc00174ea60)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/metric/tracking.go:144 +0x3a
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/internal/models.(*Buffer).metricWritten(0xc0001c2fa0, 0x2c72240, 0xc00174ea60)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/internal/models/buffer.go:93 +0x72
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/internal/models.(*Buffer).Accept(0xc0001c2fa0, 0xc002092000, 0x30, 0x30)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/internal/models/buffer.go:179 +0xa6
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/internal/models.(*RunningOutput).Write(0xc0001aa280, 0x0, 0xc000560660)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/internal/models/running_output.go:190 +0xf7
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/agent.(*Agent).flushOnce.func1(0xc001755b00, 0xc0016d7bc0)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/agent/agent.go:597 +0x27
Nov 18 11:03:40 telegraf[11714]: created by github.com/influxdata/telegraf/agent.(*Agent).flushOnce
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/agent/agent.go:596 +0xc8
Nov 18 11:03:40 systemd[1]: telegraf.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Nov 18 11:03:40 systemd[1]: telegraf.service: Unit entered failed state.
Nov 18 11:03:40 systemd[1]: telegraf.service: Failed with result 'exit-code'.
Nov 18 11:03:40 systemd[1]: telegraf.service: Service hold-off time over, scheduling restart.

@danielnelson added the bug (unexpected problem or unintended behavior), panic (issue that results in panics from Telegraf), and ready labels on Dec 3, 2019
@danielnelson self-assigned this on Dec 3, 2019
@danielnelson (Contributor)

> Expected behavior:
> I would expect all other outputs to continue functioning unaffected, and the output that went away to queue up metrics until its buffer limit is reached.

The pausing behavior here is intended. The input-side max_undelivered_messages setting controls how many messages Telegraf will take from a queue at once without acknowledged delivery. Messages must be delivered on all outputs that handle them; if one of these outputs is down, the messages it is handling will remain undelivered and the input will pause.
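
As an illustration (not part of the original comment), max_undelivered_messages lives on queue-consumer inputs such as mqtt_consumer or kafka_consumer; the broker URL and topic below are placeholders:

    [[inputs.mqtt_consumer]]
    servers = ["tcp://127.0.0.1:1883"]
    topics = ["telegraf/#"]
    # Maximum number of messages read from the broker that may be in flight
    # (not yet accepted by every output that handles them). Once this many
    # messages are undelivered, the input pauses until the outputs catch up.
    max_undelivered_messages = 1000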

Of course, the panic/crash is a bug; I believe I have fixed it in #6806.

@danielnelson added this to the 1.13.1 milestone on Dec 17, 2019
@aurimasplu

@danielnelson and what if I am OK with not delivering metrics to a faulty output? I have multiple outputs, some of which I do not control. Sometimes they go down, and I do not care if they miss data, as long as my own output destinations receive data successfully.
max_undelivered_messages seems to be related to successful delivery and will not help if one of the outputs is down.

@VY81 commented Jul 24, 2020

I have the same need as @aurimasplu

@danielnelson could you please suggest a workaround to ignore max_undelivered_messages? I am using Telegraf 1.14.4.

@Hipska (Contributor) commented Jul 25, 2020

Same here, but I think there’s another ticket for that already.

@kyjanond commented Feb 3, 2021

@Hipska can you put a reference to that ticket please?

@chintanp mentioned this issue on Jan 8, 2022
@chintanp commented Jan 8, 2022

I am using a fork of Telegraf with a postgres output (https://github.com/phemmer/telegraf/tree/postgres) on a Raspberry Pi to push to a local TimescaleDB and a remote one. It has worked fine for a long time, with low latency. Today I observed that if the remote DB goes down or access to it is lost, data does not go to any of the other locally accessible outputs. When I removed the problematic output from the conf, things started working fine. @phemmer any ideas about this? This issue seems similar to the one reported here: #6694. However, it seems your source already includes the fix for this issue.

@TotallyInformation

There is certainly some kind of issue here. I think most people would expect Telegraf to be robust to output issues when there is more than one output, and even to recover gracefully when the only output goes away and comes back (which I think you've already fixed?).

In my case, it was an issue with the MQTT output (#10180): instead of only stopping that output, it crashed Telegraf completely.

My argument is that Telegraf is often used for system monitoring; if your monitoring app crashes because of an external issue, that's a major problem. This is, after all, why we may configure more than one output channel.

My view is that no output channel should ever cause Telegraf to crash. Complain loudly, yes; crash, no.

Without this, Telegraf unfortunately cannot be used as the main or only monitoring system. And if I have to put in a second monitoring system to monitor the first, then it won't happen. That would be a shame, because I like Telegraf and would like to recommend it.

@Hipska (Contributor) commented Jan 12, 2022

I fully agree with you that Telegraf should not crash when a recoverable problem occurs on one of its outputs.

On the other hand, Telegraf is NOT a monitoring solution. It is a data collection agent (mostly for time series data) which can obviously also be used to collect monitoring data. It does not handle or keep state, does not send alerts, does no remediation on the collected data, and so on. You would always need a 'real' monitoring tool to keep an eye on whether Telegraf is still running and whether the metrics are being collected and stored, or to take actions (alerts/remediation) based on the values of the collected data.

@opobla commented Mar 21, 2024

Hi! Some years later, I have stumbled upon the same problem. I collect a lot of information to several outputs. One of these outputs is an InfluxDB instance reachable over a network connection (satellite) that is not always available. Having this data in InfluxDB in real time when the link is up is a plus, but it is not critical, since the data is also collected and transferred to another online location.

I understand the rationale behind keeping all outputs synced, but I think that sync could be optional and configurable.
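
To make that concrete, here is a purely hypothetical sketch of what such an opt-out might look like; the best_effort key below does not exist in Telegraf and is only meant to illustrate the idea:

    [[outputs.influxdb]]
    urls = ["http://satellite-host:8086"]
    database = "telegraf"
    # HYPOTHETICAL, not a real Telegraf option: metrics routed to this output
    # would not count as undelivered for the inputs, so an outage on the
    # satellite link would never pause collection for the other outputs.
    best_effort = true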
