
Add integration for service discovery & kv config stores (dynamic config) #272

Open

pauldix opened this issue Oct 16, 2015 · 45 comments

@pauldix
Member

pauldix commented Oct 16, 2015

If there's some standard service discovery to connect to like Consul, it would be cool to have Telegraf connect to that and automatically start collecting data for services that Telegraf supports.

So when a new MySQL server comes online, Telegraf would automatically start collecting data from it.

Just an idea. Users could also get this by making Telegraf part of their deploys when they create new servers.

@sparrc
Contributor

sparrc commented Oct 16, 2015

👍

@rvrignaud

Hello @pauldix,
What you suggest is, I think, what I tried to explain here: #193 (comment)
Prometheus supports a wide range of discovery mechanisms (Consul included). I'm personally interested in Kubernetes discovery.

@titilambert
Contributor

@pauldix @rvrignaud see the PR about etcd here: #651

@chris-zen

Hi @titilambert, your PR is really useful for updating the Telegraf configuration dynamically, such as changing input and output configurations from time to time. But for service discovery in a system such as AWS, Mesos, or Kubernetes, where things scale dynamically, something like the service discovery features implemented in Prometheus would be really great.

@rvrignaud's explanation is here, and the Prometheus documentation shows the different possibilities supported.

Having this feature would definitely make me move to InfluxDB while still using the Prometheus instrumentation library.

@titilambert
Contributor

@chris-zen that's very interesting!
I agree with you, I would love to see that, but this kind of service discovery is more for scheduled (polling) monitoring systems (like Prometheus), isn't it? I don't know whether a decentralized (pushing) system like Telegraf is suited to this...

What do you think?

@chris-zen

Yes, I agree that it is especially important for polling. But Telegraf already supports polling inputs, such as the one for Prometheus. Right now the prometheus input only allows static configuration, but it would be very useful for it to support service discovery too. My understanding is that Telegraf is quite versatile and allows both pull and push models, but without service discovery the pull model is of little use in such dynamic environments.

@sparrc
Contributor

sparrc commented Apr 30, 2016

Just dropping this here for reference on what I think is a good service discovery model (from Prometheus): https://prometheus.io/blog/2015/06/01/advanced-service-discovery/. It's the same as mentioned above, but I think this blog post is a little more approachable than their documentation.

I think that the "file-based" custom service discovery will be easy to implement. Doing DNS-SRV, Consul, etc. will take a bit more work, but it's certainly doable.

I'm imagining some sort of plugin system for these, where notifications of config changes and additions could be sent down a channel, and whenever Telegraf detects one it would apply and reload the configuration.
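A minimal Go sketch of what such a plugin system could look like; the `Discoverer` interface, `ConfigEvent` type, and the reload hook are hypothetical names for illustration, not existing Telegraf APIs:

```go
package main

import (
	"fmt"
)

// ConfigEvent is a hypothetical notification emitted by a discovery plugin
// whenever it detects that the set of monitored services has changed.
type ConfigEvent struct {
	Source string // e.g. "consul", "file", "dns-srv"
	TOML   string // new or updated plugin configuration, as TOML text
}

// Discoverer is a hypothetical plugin interface: implementations (Consul,
// file-based, DNS-SRV, ...) push events down a shared channel.
type Discoverer interface {
	Start(events chan<- ConfigEvent) error
	Stop()
}

func main() {
	events := make(chan ConfigEvent)

	// Stand-in for a real discoverer: emit one fake change and stop.
	go func() {
		events <- ConfigEvent{
			Source: "consul",
			TOML:   "[[inputs.mysql]]\n  servers = [\"tcp(10.0.0.5:3306)/\"]",
		}
		close(events)
	}()

	// The agent loop: whenever any discoverer reports a change,
	// apply it and reload the running configuration.
	for ev := range events {
		fmt.Printf("config change from %s, reloading:\n%s\n", ev.Source, ev.TOML)
		// reloadConfig(ev) would re-parse and apply the new plugin config here.
	}
}
```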

@sparrc
Contributor

sparrc commented May 10, 2016

My preference would be to start with simple file & directory service discovery. This would be an inotify goroutine that basically sends a service reload (SIGHUP) to the process when it detects a change in any config file, or when a config file is added to or removed from a config directory.

This could be extended using https://github.com/docker/libkv or something similar, launching a goroutine that overwrites the on-disk config file(s) when it detects a change (basically a very simple version of confd).

This would solve some of the issues that I have (and that @johnrengelman and @balboah raised) with integrating with a kv-store. In essence, we wouldn't be dependent on a kv-store, and we wouldn't have any confusion over the currently loaded config, because the config would always also be on disk.
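A rough sketch of that second piece, assuming the docker/libkv Watch API and a Consul backend; the key name and file path are only illustrative:

```go
package main

import (
	"log"
	"os"

	"github.com/docker/libkv"
	"github.com/docker/libkv/store"
	"github.com/docker/libkv/store/consul"
)

func main() {
	// Register the Consul backend with libkv before creating the store.
	consul.Register()

	kv, err := libkv.NewStore(store.CONSUL, []string{"127.0.0.1:8500"}, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Watch a single key holding the full Telegraf config (illustrative key).
	stopCh := make(chan struct{})
	events, err := kv.Watch("telegraf/telegraf.conf", stopCh)
	if err != nil {
		log.Fatal(err)
	}

	// Every change is written back to disk; Telegraf itself only ever reads
	// the on-disk file (and gets a reload signal from the file watcher).
	for pair := range events {
		if err := os.WriteFile("/etc/telegraf/telegraf.conf", pair.Value, 0644); err != nil {
			log.Println("write config:", err)
		}
	}
}
```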

@sparrc
Contributor

sparrc commented May 10, 2016

Curious what others think of this design; I'm biased, but this is my view:

pros:

  • no problem if the kv-store (etcd, consul, zookeeper, etc) goes down
  • no ambiguity around the current config telegraf is using
  • allows for simple testing and prototyping
  • kv-store integration is independent and not "tightly coupled" with telegraf

cons:

  • requires a write to disk before the new config is loaded

@pauldix
Member Author

pauldix commented May 10, 2016

I like it. That was one thing that used to be tricky with Redis: you could issue commands to alter the running config, but if you then restarted the server without updating the on-disk config, you were hosed.

The file write isn't a big deal. It's not like they're going to be updating the config multiple times a second, a minute, or even an hour.

@sofixa

sofixa commented May 11, 2016

@pauldix You might be updating your config multiple times per hour or more if you are in a highly dynamic environment, like an AWS Auto Scaling group or a Docker Swarm/Kubernetes/fleetd/LXD container setup.
But even then, @sparrc's proposed implementation sounds very good, combining flexibility with resilience (you aren't depending on your KV store/network always being up).
+1

@panda87

panda87 commented Jul 5, 2016

Hi guys, any updates on this monitoring methodology?
My company is starting to implement Mesos and Marathon as a scheduler, and we find monitoring services (MySQL, ES, etc.) very difficult with the current Telegraf monitoring architecture. It seems the only way right now is to use Prometheus, as you mentioned above, because of its support for dynamic SD monitoring.

@sparrc can you please share the current state of the design?

Thanks

@toni-moreno
Contributor

Hi everybody, I'm new to this discussion and I would like to add my point of view.

Everybody knows how important it is to give our agents the ability to get their configuration, and to discover configuration changes, from a centralized configuration system.

As I have read in this thread (and others, e.g. #651), there are different ways to get remote configuration:

https://github.com/docker/libkv (for etcd or other KV store backends)
https://github.com/spf13/viper (for remote config storage)

Anyway, the most important thing (IMHO) is adding the ability to easily manage changes across all our distributed agents. I think that when no solution is available yet, the easiest way is the best. So yesterday I made a really simple proposal in #1496 that could be coded in a few lines (with the same behaviour if you switch to the https://github.com/spf13/viper library).

Once this simple feature is added, we can continue the discussion on other, more sophisticated ways to get configurations and to integrate with known centralized systems (like etcd and others).

I vote for adding a simple centralized way first, and an integrated solution afterwards. Both will cover the same functionality in different scenarios.

What do you think?

@sparrc
Contributor

sparrc commented Jul 16, 2016

@toni-moreno the simplest way to manage it is via files. Although fetching config over HTTP might be simple for your scenario, I can imagine ways in which it could get complicated (just see the httpjson plugin for examples). Like I said, this feature first needs to be coded as a file watcher, and then we can develop plugins around changing the on-disk file(s).

@sparrc changed the title from "Add integration to service discovery" to "Add integration for service discovery & kv config stores (dynamic config)" on Jul 29, 2016
@blaggacao

blaggacao commented Jul 29, 2016

There is one commonly used abstraction pattern available; the only thing that would be needed is hot config reloading:

https://github.com/kelseyhightower/confd/ is a single binary which watches any (or many) kind(s) of backend(s) and re-renders the configuration file from a template when it detects changes.

I'm about to implement something for Rancher catalogue items. influxdata/influxdata-docker#9 is related.

The pattern is rather simple to manage with sidekicks and shared volumes; a sketch follows.
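For illustration, a confd template resource for this pattern could look roughly like the following; the backend key, paths, and reload command are assumptions, not something confd or Telegraf ships for you:

```toml
# /etc/confd/conf.d/telegraf.toml  (confd template resource, illustrative)
[template]
src        = "telegraf.conf.tmpl"          # template under /etc/confd/templates/
dest       = "/etc/telegraf/telegraf.conf" # file shared with the telegraf container
keys       = ["/services/mysql"]           # backend keys to watch (e.g. in Consul/etcd)
reload_cmd = "pkill -HUP telegraf"         # nudge telegraf to re-read its config
```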


One step further:

  • Integrate confd into the Telegraf service and use its built-in command execution to signal a config reload to the Telegraf process.

@sparrc I think this is almost a no-brainer, as only the signalling to the Telegraf process would need some extra thought; the rest is taken care of.

@sparrc
Contributor

sparrc commented Jul 29, 2016

The signalling would simply be the file changing on disk; there is no need for confd to signal Telegraf directly, as far as I understand it.

@blaggacao

Absolutely right.

@panda87

panda87 commented Sep 1, 2016

@sparrc Hi sparrc, any new updates on this?

@3fr61n

3fr61n commented Sep 7, 2016

Hi guys, very interesting discussion. I totally agree with keeping Telegraf 'separate' from etcd/viper/etc.; however, it needs to somehow track any file changes made by those tools and be able to apply those changes on the fly.

Does anyone know if this is going to be the way to go, and how it is going to be implemented?

@sparrc
Contributor

sparrc commented Sep 7, 2016

@3fr61n, yes, the initial implementation will be a file/directory watcher that will be able to dynamically reload the configuration any time the file(s) change.

I'm not sure about the "how" yet, maybe this: https://github.com/fsnotify/fsnotify
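A minimal sketch of that idea using fsnotify: watch the config file and directory, and signal a reload when anything changes. The paths and the reload-via-SIGHUP behaviour are assumptions for illustration (this is not Telegraf's actual implementation):

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"

	"github.com/fsnotify/fsnotify"
)

func main() {
	// Stand-in for the agent's reload handling: react to SIGHUP.
	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP)
	go func() {
		for range hup {
			log.Println("SIGHUP received: re-reading configuration here")
		}
	}()

	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Watch the main config file and the config directory (illustrative paths).
	for _, p := range []string{"/etc/telegraf/telegraf.conf", "/etc/telegraf/telegraf.d"} {
		if err := watcher.Add(p); err != nil {
			log.Fatal(err)
		}
	}

	for {
		select {
		case ev, ok := <-watcher.Events:
			if !ok {
				return
			}
			if ev.Op&(fsnotify.Write|fsnotify.Create|fsnotify.Remove|fsnotify.Rename) != 0 {
				log.Printf("config change detected: %s, signalling reload", ev.Name)
				syscall.Kill(os.Getpid(), syscall.SIGHUP) // Unix-only in this sketch
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return
			}
			log.Println("watch error:", err)
		}
	}
}
```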

@danielnelson
Contributor

@abraithwaite Can you take a look at the kubernetes_services option we added to the prometheus input and see if it works for your use case? It is only on the master branch, but you can use the nightly builds.

@abraithwaite

Unfortunately not. The value that Prometheus provides with Kubernetes is that you configure metrics collection via the service (with Kubernetes annotations) rather than through the metrics collection agent.

This lets users configure everything they need without having to set up something outside the scope of their own services.

I can provide examples if needed, just let me know.

@danielnelson
Contributor

@abraithwaite Can you link me to the Kubernetes documentation for the method you are using?

@abraithwaite

FWIW, I don't use Prometheus with Kubernetes, but the concept is extremely valuable and I'd still love to see it here.

I looked at the Telegraf code, though, and I'm certain you'd need to add service discovery as a first-class configuration method.

@danielnelson
Contributor

Just to clarify: the kubernetes_services option allows you to use the Kubernetes DNS cluster add-on to find and scrape Prometheus endpoints without needing to update your Telegraf configuration file when a service is started or stopped.
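For reference, a hedged sketch of how that option is configured in the prometheus input; the service name, namespace, port, and path are illustrative:

```toml
[[inputs.prometheus]]
  ## Scrape every endpoint resolved by cluster DNS for these services
  ## (values are illustrative).
  kubernetes_services = ["http://my-service.my-namespace:9100/metrics"]
```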

@abraithwaite

Right, I understand that. It still requires an explicit dependency between the service and Telegraf, instead of an implicit one.

When using annotations, there is no PR a user has to make to update the Telegraf config in order to start having their service's metrics collected.
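For context, the annotation convention being referred to is the usual Prometheus one: the scrape settings live on the pod itself rather than in the collector's config. Roughly (annotation names per the common prometheus.io convention, values illustrative):

```yaml
# Pod template metadata (illustrative): the service owner opts in to scraping
# without ever touching the collector's configuration.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9100"
    prometheus.io/path: "/metrics"
```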

@tmedford

I agree that Prometheus Kubernetes discovery using annotations is pure gold and that it would be great to have in Telegraf. We use it to have Prometheus dynamically find new targets, and we would love to move back to Telegraf for collecting metrics and uptime if this were supported.

@russorat added this to the 2.0.0 milestone Jun 27, 2018
@narayanprabhu

Hi,
I'm pretty new to the TICK stack and still getting used to it. We are trying to set up the TICK stack as the monitoring platform for our organization. One question that keeps coming up is how we manage configurations: for instance, if we need to monitor one service/process on a server, we would have to change the config on that server and restart Telegraf. After doing some research I found this page, and I think I'm posting my concern in the right place. Is there a working model for managing configuration centrally?

@voiprodrigo
Contributor

@narayanprabhu I use Puppet to ease that kind of pain. It knows all the services that are “ensured” on each server, and that makes it easier to deploy a matching Telegraf config.


@narayanprabhu

@voiprodrigo Yes, Puppet is a good option; unfortunately my organization does not have it. They mainly rely on SCCM for Windows deployment and Ansible for Linux. This thread says a UI is being built for Chronograf to manage agent configs; is that still being worked on, and is it coming anytime soon?

And there is something about etcd where we can have one config consumed by the other Telegraf agents; is that an option that would help my use case? Is it something that works on Windows as well?

@Jaeyo
Contributor

Jaeyo commented Oct 8, 2018

@danielnelson any update?

@danielnelson
Contributor

Work is on hold right now (for the first item here), but I'm tempted to break this issue up into several issues:

  1. Loading config data from a configuration store (zookeeper/etcd/consul/etc.); prototype code for a plugin config-loading system is here: https://github.com/danielnelson/tgconfig
  2. General-purpose discovery: this still needs more thought about what precisely it will be.
  3. Prometheus endpoint discovery: this will be done in "add scraping for Prometheus endpoint in Kubernetes" #3901, plus additional work if needed, but at least in the mid-term it should be done in the prometheus input.

@Jaeyo
Contributor

Jaeyo commented Feb 2, 2019

The InfluxDB 2.0 alpha has a Telegraf config generation UI, and Telegraf is guided to take its config from InfluxDB, but InfluxDB seems to have no config-editing feature yet.
So, here's the question: does Telegraf have any plan to synchronize its config from InfluxDB 2.0?

This was referenced May 29, 2020
@sjwang90 removed this from the 2.0.0 milestone Jul 27, 2021
@danielnelson removed their assignment Sep 1, 2021
@rdxmb
Contributor

rdxmb commented Jun 22, 2023

Hello, so reloading the (file) config without a restart has not been implemented yet? It's a pity. @blaggacao didn't you mention it was "almost a no-brainer"?

I'd like to use Telegraf with a sidecar creating the configs...

EDIT: it seems there is a --watch-config flag, I will try that immediately.
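For anyone landing here later, a hedged example of using that flag (mode names as documented in recent Telegraf releases, if I recall correctly; adjust paths to your setup):

```sh
# Reload when the config changes on disk; the flag takes a mode,
# "notify" (filesystem notifications) or "poll". Paths are illustrative.
telegraf --config /etc/telegraf/telegraf.conf \
         --config-directory /etc/telegraf/telegraf.d \
         --watch-config notify
```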
