Users should be able to designate the interval and number of retries used when loading their config from a URL while the endpoint is down.
Current behavior:
Right now, when loading config from a URL fails because the remote endpoint is down, Telegraf retries three times at 10s intervals. The current solution offers no environment variables or flags to change these settings (based on #8803).
Desired behavior:
Users need some way to configure the interval and number-of-retries settings that determine how the config is loaded from a URL.
Use case:
From @schmorgs:
Planning to use Telegraf in production across a large number of servers across the globe, and there are many points where breakages could happen, especially in countries where there is very low bandwidth and old infrastructure. Along with that comes many standards and versions of OS, etc, hence our approach to manage config centrally so that we don't have to navigate the variety of ways of reaching an endpoint.
So if Telegraf starts up and there happens to be a breakage somewhere (network connectivity, web server down, etc.), the agent will die. On RHEL7/8 and Windows, we can utilise systemd/SCM to configure infinite retries on the agent so that even if it does die, it will be restarted.
But RHEL6 doesn't have systemd and so we would end up writing some sort of watcher daemon as well which seems a bit overkill if the agent could handle (at least) this condition.
This matters because Telegraf will be our primary monitoring agent, so we want to make it as available and robust as possible. We would still implement external controls such as systemd restarts to provide an extra layer of resilience, but the more the agent can do in this area, the better.
In some cases, the chance of the agent being unable to get its config would be fairly small, as the agent only pulls config on startup. But we want the agent to pull its config periodically so that it can be configured centrally and pick up changes automatically. I understand this is part of a longer-term strategy for Telegraf, but in the meantime we HUP the agent periodically as a workaround, so the agent now relies constantly on the HTTP endpoint and is therefore more likely to encounter a problem.
Whether a switch, environment variable, config file on the server, etc, I'm happy to see whichever approach works best.
It's quite a normal requirement considering a power outage at home. The modem and router need time to connect to the Internet, and the telegraf service with a URL config just quickly tries several times and then fails completely.
powersj added a commit to powersj/telegraf that referenced this issue on May 17, 2024
This introduces a new cli option to allow the user to set the number of
retry attempts to something other than 3. It also allows the user to set
the attempt count to -1 to infinitely retry.
fixes: influxdata#8854
Feature Request
Related: #7338