Add integration for service discovery & kv config stores (dynamic config) #272

pauldix · 2015-10-16T11:48:24Z

If there's some standard service discovery to connect to like Consul, it would be cool to have Telegraf connect to that and automatically start collecting data for services that Telegraf supports.

So when a new MySQL server comes on, Telegraf will automatically start collecting data from it.

Just an idea. Users could also get this by just having Telegraf part of their deploys when they create new servers.

sparrc · 2015-10-16T17:42:53Z

👍

rvrignaud · 2015-10-19T15:07:53Z

Hello @pauldix,
What you suggest is, I think, what I tried to explain here: #193 (comment)
Prometheus supports a wide range of discovery mechanisms (Consul included). I'm personally interested in Kubernetes discovery.

titilambert · 2016-03-04T22:20:32Z

@pauldix @rvrignaud see the PR about etcd here: #651

chris-zen · 2016-03-05T15:21:36Z

Hi @titilambert, your PR is really useful for updating the telegraf configuration dynamically, such as changing input and output configurations from time to time. But for service discovery in a system such as AWS, mesos or kubernetes, where things scale dynamically, something like the service discovery features implemented in prometheus would be really great.

@rvrignaud's explanation is here, and the prometheus documentation shows the different possibilities supported.

Having this feature would definitely make me move to influxdb, while still using the prometheus instrumentation library.

titilambert · 2016-03-05T18:11:29Z

@chris-zen that's very interesting!
I agree with you, and I would love to see that, but this kind of service discovery is more for scheduled (polling) monitoring systems like Prometheus, isn't it? I don't know if a decentralized (pushing) system like Telegraf is suited to this...

What do you think?

chris-zen · 2016-03-16T18:11:17Z

Yes, I agree that it is especially important for polling. But telegraf already supports polling inputs such as the one for prometheus. Right now the prometheus input only allows static config, but it would be very useful to support service discovery too. My understanding is that telegraf is quite versatile and allows both pull and push models, but the pull model without service discovery is worthless in such dynamic environments.

sparrc · 2016-04-30T21:13:16Z

Just dropping this here for reference on what I think is a good service discovery model (from prometheus): https://prometheus.io/blog/2015/06/01/advanced-service-discovery/. Same as mentioned above but I think this blog post is a little more approachable than their documentation.

I think that the "file-based" custom service discovery will be easy to implement. Doing DNS-SRV, Consul, etc. will take a bit more work, but is certainly doable.

I'm imagining some sort of plugin system for these, where notifications on config changes and additions could be sent down a channel, and whenever Telegraf detects one of these it would apply and reload the configuration.
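The channel idea in this comment can be sketched roughly as follows. This is a minimal illustration in Python rather than Telegraf's Go, using a `queue.Queue` in place of a Go channel; `ConfigChange`, `run_reloader` and the `reload` callback are hypothetical names, not anything from the Telegraf codebase.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigChange:
    """Hypothetical notification a discovery plugin would emit."""
    source: str  # which discovery plugin saw the change, e.g. "file"
    kind: str    # "added", "changed" or "removed"
    name: str    # identifier of the affected config or target


def run_reloader(changes, reload, stop):
    """Drain the channel and call reload() once per burst of notifications."""
    while not stop.is_set():
        try:
            first = changes.get(timeout=0.05)
        except queue.Empty:
            continue
        batch = [first]
        # Coalesce anything already queued so one burst causes one reload.
        while True:
            try:
                batch.append(changes.get_nowait())
            except queue.Empty:
                break
        reload(batch)
```

Each discovery plugin would only need to `put` a `ConfigChange` on the shared queue; batching is one possible design choice so that several simultaneous file changes do not trigger several reloads.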

sparrc · 2016-05-10T20:49:47Z

My preference would be to start with a simple file & directory service discovery. This would be an inotify goroutine that would basically send a service reload (SIGHUP) to the process when it detects a change in any config file, or when any config file is added to or removed from a config directory.

This could be extended using https://github.com/docker/libkv or something similar that would launch a goroutine that would overwrite the on-disk config file(s) when it detects a change (basically a very simple version of confd)

This would solve some of the issues that I have (and that @johnrengelman and @balboah raised) with integrating with a kv-store. In essence, we wouldn't be dependent on a kv-store, and we wouldn't have any confusion over the currently-loaded config, because the config would always also be on-disk.
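A rough sketch of the file & directory watcher described in this comment, assuming polling instead of inotify so that it needs only the Python standard library. The function names and the `.conf` suffix are assumptions made for illustration; the comment proposes an inotify goroutine inside Telegraf, which this does not reproduce.

```python
import os


def snapshot(config_dir, suffix=".conf"):
    """Map each config file in config_dir to its (mtime_ns, size)."""
    state = {}
    for entry in os.scandir(config_dir):
        if entry.is_file() and entry.name.endswith(suffix):
            st = entry.stat()
            state[entry.name] = (st.st_mtime_ns, st.st_size)
    return state


def diff(old, new):
    """Return sorted lists of added, removed and changed file names."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(n for n in old.keys() & new.keys() if old[n] != new[n])
    return added, removed, changed


def watch_once(config_dir, previous, on_change):
    """Take a new snapshot and call on_change if anything differs.

    Returns the new snapshot so the caller can pass it to the next poll.
    """
    current = snapshot(config_dir)
    added, removed, changed = diff(previous, current)
    if added or removed or changed:
        on_change(added, removed, changed)
    return current
```

Calling `watch_once` in a loop with a sleep gives the watcher; `on_change` is where a reload would be requested, for example by sending SIGHUP to the process as the comment suggests.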

sparrc · 2016-05-10T20:53:46Z

curious what others think of this design, I'm biased but this is my view:

pros:

no problem if the kv-store (etcd, consul, zookeeper, etc) goes down
no ambiguity around the current config telegraf is using
allows for simple testing and prototyping
kv-store integration is independent and not "tightly coupled" with telegraf

cons:

requires write to disk before new config is loaded
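The kv-store side of this design, where a separate component mirrors the store onto disk, might look something like the sketch below. It assumes the store's contents have already been fetched into a plain `{filename: contents}` dict; `write_atomically` and `sync_to_disk` are hypothetical helpers, and no real etcd, Consul or libkv API is used.

```python
import os
import tempfile


def write_atomically(path, text):
    """Write text via a temp file plus rename so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise


def sync_to_disk(kv_snapshot, config_dir):
    """Mirror {filename: contents} from a kv store into config_dir.

    Returns the names written or deleted. Unchanged files are left alone
    so that a file watcher does not see spurious changes.
    """
    touched = []
    for name, text in sorted(kv_snapshot.items()):
        path = os.path.join(config_dir, name)
        try:
            with open(path) as f:
                if f.read() == text:
                    continue
        except FileNotFoundError:
            pass
        write_atomically(path, text)
        touched.append(name)
    # Remove config files that no longer exist in the store.
    for name in sorted(os.listdir(config_dir)):
        if name.endswith(".conf") and name not in kv_snapshot:
            os.remove(os.path.join(config_dir, name))
            touched.append(name)
    return touched
```

If the kv-store is unreachable, nothing calls `sync_to_disk` and the last written files simply stay in place, which matches the first two "pros" listed above.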

pauldix · 2016-05-10T23:04:26Z

I like it. That was one thing that used to be tricky with Redis. You could make commands to alter the running config, but then if you restarted your server without updating the on disk config then you're hosed.

File write isn't a big deal. Not like they're going to be updating the config multiple times a second, minute, or even hour.

sofixa · 2016-05-11T06:58:02Z

@pauldix You might be updating your config multiple times per hour and up if you are in a highly dynamic environment, like an AWS Autoscaling Group or a Docker Swarm/Kubernetes/fleetd/LXD container thingie.
But even then, @sparrc 's proposed implementation sounds very good, combining flexibility with resiliency (you aren't depending on your KV/network always being up).
+1

panda87 · 2016-07-05T06:17:46Z

Hi guys, any updates with this monitoring methodology?
My company is starting to use mesos and marathon as a scheduler, and we find monitoring services (mysql, es, etc.) very difficult with the current telegraf monitoring architecture. It seems that the only way right now is to use Prometheus, as you mentioned above, because of its support for dynamic SD monitoring.

@sparrc can you please share the current state of the design?

Thanks

toni-moreno · 2016-07-16T05:10:44Z

Hi everybody, I'm new to this discussion and I would like to add my point of view.

Everybody knows how important it now is for our agents to be able to get their configuration, and discover configuration changes, from a centralized configuration system.

As I have read in this thread (and in others, such as #651), there are different ways to get remote configuration:

https://github.com/docker/libkv ( for etcd or other KV store backends)
https://github.com/spf13/viper ( for remote config storage)

Anyway, the most important thing (IMHO) is to add the ability to easily manage changes on all our distributed agents. I think that when no solution is available yet, the easiest way should be the best. So yesterday I made a really simple proposal in #1496 that could be coded in a few lines (with the same behaviour if you switch to the https://github.com/spf13/viper library).

Once this simple feature is added, we can continue the discussion on other, more sophisticated ways to get configuration and on integration with known centralized systems (like etcd and others).

I vote for adding a simple centralized way first and an integrated solution afterwards. Both will cover the same functionality in different scenarios.

What do you think?

sparrc · 2016-07-16T11:05:27Z

@toni-moreno the most simple way to manage it is via files. Although the http getting might be simple for your scenario, I can imagine ways in which it can get complicated (just see the httpjson plugin for examples). Like I said, this feature needs to first be coded as a file watcher and then we can develop plugins around changing the on-disk file(s).

blaggacao · 2016-07-29T18:58:32Z

There is one commonly used abstraction pattern available; the only thing that would be needed is hot config reloading:

https://github.com/kelseyhightower/confd/ is a single binary which watches any (many) kind(s) of backend(s) and templates the configuration file upon detected changes.

I'm about to implement something for rancher catalogue items. influxdata/influxdata-docker#9 is related.

The pattern is rather simple to manage with sidekicks and shared volumes.

One step further:

Integrate confd into the telegraf service and use its integrated command execution interface to signal a config reload to the telegraf process.

@sparrc I think this is almost a no brainer, as only the signalling to the telegraf process would need some extra thought, the rest is taken care of.
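To make the templating pattern concrete, here is a small sketch of the idea: values read from a backend are rendered into a config file, which is then written to disk. It uses Python's `string.Template` rather than confd's own template syntax, and the key layout under `/services/mysql/`, the function name, and the choice of the mysql input with a `servers` list are all assumptions made for illustration.

```python
from string import Template

# Hypothetical template; the mysql input and its "servers" option are
# used here only as an example of a rendered config fragment.
MYSQL_TEMPLATE = Template(
    "[[inputs.mysql]]\n"
    "  servers = [$servers]\n"
)


def render_mysql_config(backend_values):
    """Render a config fragment from {key: "host:port"} pairs read from a backend."""
    addresses = [backend_values[key] for key in sorted(backend_values)]
    quoted = ", ".join('"tcp(%s)/"' % address for address in addresses)
    return MYSQL_TEMPLATE.substitute(servers=quoted)
```

Sorting the keys keeps the rendered output stable, so the file only changes on disk when the set of backends actually changes.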

sparrc · 2016-07-29T19:45:37Z

the signaling would simply be the file changing on disk, there is no need for confd to directly signal to Telegraf as far as I understand it.

blaggacao · 2016-07-29T20:41:59Z

Absolutely right.

panda87 · 2016-09-01T21:17:56Z

@sparrc Hi sparrc, any new updates on this?

3fr61n · 2016-09-07T09:06:02Z

Hi guys, very interesting discussion. I totally agree with keeping telegraf 'separate' from etcd/viper/etc. However, it somehow needs to track any file changes made by those apps and be able to apply those changes 'on-the-fly'.

Does anyone know if this is going to be the way to go, and how it is going to be implemented?

sparrc · 2016-09-07T09:18:42Z

@3fr61n, yes, the initial implementation will be a file/directory watcher that will be able to dynamically reload the configuration any time that the file(s) change.

I'm not sure the "how" yet, maybe this: https://github.com/fsnotify/fsnotify

danielnelson · 2017-09-30T01:18:05Z

@abraithwaite Can you take a look at the kubernetes_services option we added to the prometheus input and see if it works for your use case? It is only on the master branch, but you can use the nightly builds.

abraithwaite · 2017-10-01T21:07:52Z

Unfortunately not. The value that prometheus provides with Kubernetes is that you configure metrics collection via the service (with kubernetes annotations) and not through the metrics collection agent.

This enables users to configure everything they need without having to set up something outside the scope of their own services.

I can provide examples if needed, just lemme know.

danielnelson · 2017-10-02T23:04:15Z

@abraithwaite can you link me to the Kubernetes documentation for the method you are using?

abraithwaite · 2017-10-02T23:10:51Z

Haven't seen any official documentation, actually. Just pieced it together from code, examples and blog posts:

https://github.com/prometheus/prometheus/blob/master/documentation/examples/prometheus-kubernetes.yml
https://coreos.com/blog/prometheus-and-kubernetes-up-and-running.html
prometheus/prometheus#2989
prometheus/prometheus#2009
https://movio.co/en/blog/prometheus-service-discovery-kubernetes/

abraithwaite · 2017-10-02T23:13:42Z

FWIW, I don't use prometheus with Kubernetes but the concept is extremely valuable and I'd still love to see it here.

I looked at the telegraf code though and I'm certain you'd need to add service discovery as a first class configuration method.

danielnelson · 2017-10-03T00:08:00Z

Just to clarify, the kubernetes_services option allows you to use the Kubernetes DNS cluster add-on to find and scrape prometheus endpoints without needing to update your Telegraf configuration file when a service is started/stopped.

abraithwaite · 2017-10-03T00:13:06Z

Right, I understand that. It still requires an explicit dependency between the service and telegraf, instead of an implicit one.

When using annotations, there is no PR a user has to make to update the telegraf config in order to start getting metrics from their service collected.
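The annotation-driven approach described in these comments can be illustrated with a small sketch. The `prometheus.io/scrape`, `prometheus.io/port`, `prometheus.io/path` and `prometheus.io/scheme` annotation names follow the convention used in the Prometheus example configuration linked earlier in the thread; the input dict shape, the function name and the default port are assumptions, and no Kubernetes client library is involved.

```python
def scrape_targets(services):
    """Build scrape URLs for services that opt in through annotations.

    `services` is a list of dicts shaped like
    {"name": ..., "cluster_ip": ..., "annotations": {...}}.
    This shape is hypothetical, not a Kubernetes client type.
    """
    targets = []
    for svc in services:
        annotations = svc.get("annotations", {})
        # A service is only collected if its owner opted in.
        if annotations.get("prometheus.io/scrape") != "true":
            continue
        port = annotations.get("prometheus.io/port", "80")  # assumed default
        path = annotations.get("prometheus.io/path", "/metrics")
        scheme = annotations.get("prometheus.io/scheme", "http")
        targets.append(f"{scheme}://{svc['cluster_ip']}:{port}{path}")
    return targets
```

The point of the pattern is visible in the input: everything needed to collect from a service lives in that service's own metadata, so the collector's configuration never has to be edited when a service is added.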

tmedford · 2018-05-21T05:49:09Z

I can agree that "Prometheus kubernetes discovery using annotations is pure gold. I would love to have this in telegraf." We use this to have prometheus dynamically find new targets. Would love to move back to telegraf for collection of metrics and uptime if this was supported.

narayanprabhu · 2018-09-04T22:35:25Z

Hi,
I'm pretty new to the TICK stack and getting used to it. We are trying to set up the TICK stack as the monitoring platform for our organization. One question that keeps coming up is how we manage the configurations: for instance, if we need to monitor one more service/process on a server, we would have to change the config on that server and restart telegraf. While doing some research I found this page, and I think I'm posting my concern in the right place. Do we have a working model to manage configuration centrally?

voiprodrigo · 2018-09-05T01:27:47Z

@narayanprabhu I use Puppet to ease that kind of pain. It knows all the services that are “ensured” on each server, and that makes it easier to deploy a matching Telegraf config.

Sent with GitHawk

narayanprabhu · 2018-09-05T01:54:37Z

@voiprodrigo Yes, puppet is a good option; unfortunately my organization does not have that solution. They mainly rely on SCCM for Windows deployment and Ansible for Linux. This thread says that there is a UI option being built for chronograf to manage agent configs. Is that option still being built? Is it coming up anytime soon?

And there is something about etcd where we can have one config consumed by other telegraf agents. Is this an option that would help with my use case? Does it work on Windows as well?

Jaeyo · 2018-10-08T13:45:25Z

@danielnelson any update?

danielnelson · 2018-10-08T19:16:08Z

Work is on hold right now (for the first item here), but I'm tempted to break this issue up into several issues:

Loading config data from a configuration store (zookeeper/etcd/consul/etc), prototype code for a plugin config loading system here: https://github.com/danielnelson/tgconfig
General purpose discovery: still needs more thought into what precisely it will be.
Prometheus endpoint discovery: this will be done in "add scraping for Prometheus endpoint in Kubernetes" (#3901), with additional work if needed, but at least in the mid-term it should be done in the prometheus input.

Jaeyo · 2019-02-02T05:27:39Z

In the influxdb 2.0 alpha version, there is a telegraf config generation UI, and telegraf is guided to take its config from influxdb. But influxdb seems to have no config editing feature yet.
So here's my question: does telegraf have any plan to synchronize its config from influxdb 2.0?

rdxmb · 2023-06-22T06:24:56Z

Hello, so reloading the (file) config without a restart has not been implemented yet? It's a pity. @blaggacao didn't you call it "almost a no brainer"?

I'd like to use telegraf with a sidecar creating the configs...

EDIT: seems like there is a --watch-config, will try that immediately

sparrc added the enhancement label Oct 16, 2015

sparrc removed the enhancement label Jan 20, 2016

sparrc mentioned this issue Apr 28, 2016

Monitoring dynamic microservices by service (service oriented monitoring) #1111

Closed

sparrc mentioned this issue May 10, 2016

[WIP] Etcd integration for configuration #651

Closed

sparrc mentioned this issue May 10, 2016

Add options to reload config or scan a config.d directory #1175

Closed

sparrc mentioned this issue Jul 14, 2016

[Feature Request] Download config file from remote configuration manager server . #1496

Closed

sparrc added the Difficulty/Large label Jul 20, 2016

sparrc changed the title ~~Add integration to service discovery~~ Add integration for service discovery & kv config stores (dynamic config) Jul 29, 2016

sparrc mentioned this issue Jul 29, 2016

[Feature Request] K/V telegraf configuration #1558

Closed

sparrc mentioned this issue Aug 3, 2016

Unable to drop a measurement created by telegraf #1579

Closed

danielnelson mentioned this issue Oct 2, 2017

Dynamically reconfigure an agent(telegraf), Telegraf UI on level of Graphs UI. #3294

Closed

danielnelson mentioned this issue Oct 21, 2017

telegraf dropped/purged/truncated its output buffer on SIGHUP #2679

Closed

danielnelson mentioned this issue Nov 7, 2017

Panic decoding toml array when expecting string #3444

Closed

danielnelson mentioned this issue Nov 15, 2017

Marathon input plugin #2369

Closed

danielnelson mentioned this issue Dec 6, 2017

Log connect error only in wavefront output #3549

Merged

danielnelson mentioned this issue Jan 5, 2018

Config file does not support valid comments within arrays #3642

Closed

danielnelson added the area/configuration label Jan 5, 2018

sorenmat mentioned this issue Mar 19, 2018

add scraping for Prometheus endpoint in Kubernetes #3901

Closed

russorat added this to the 2.0.0 milestone Jun 27, 2018

Jaeyo mentioned this issue Feb 2, 2019

have any plan for config sync with influxdb 2.0? #5367

Closed

danielnelson mentioned this issue Feb 12, 2019

Config plugin for dynamic config update #5409

Closed

This was referenced May 29, 2020

Config with etcd #193

Closed

Centralized Telegraf Manager #7478

Closed

sjwang90 removed this from the 2.0.0 milestone Jul 27, 2021

danielnelson removed their assignment Sep 1, 2021
