[WIP] Etcd integration for configuration #651

Closed
wants to merge 1 commit into from

titilambert
Contributor

Rebased PR #465

Hello!
I've started a prototype of what etcd integration with Telegraf could look like.

Here is an example:
1. Create a myconf.conf file, which will be stored in etcd, with the following content:

[tags]
  dc = "us-east-1"

[agent]
  interval = "10s"
  round_interval = true
  flush_interval = "10s"
  flush_jitter = "0s"
  debug = false
  hostname = ""


[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "telegraf"
  precision = "s"


[[inputs.cpu]]
  percpu = true
  totalcpu = true
  drop = ["cpu_time*"]

2. Send this file to etcd under the label mylabel:

./telegraf  -etcd http://127.0.0.1:2379 -etcdwritelabel mylabel -etcdwriteconfig myconf.conf

3. Check that the data was actually written to etcd:

./etcdctl get /telegraf/labels/mylabel

4. Now any Telegraf agent can load this config using the label mylabel:

./telegraf  -etcd http://127.0.0.1:2379 -etcdreadlabels mylabel
Config read with label mylabel
2015/12/27 20:58:34 Database creation failed: Get http://localhost:8086/query?db=&q=CREATE+DATABASE+IF+NOT+EXISTS+telegraf: dial tcp 127.0.0.1:8086: getsockopt: connection refused
2015/12/27 20:58:34 Starting Telegraf (version v0.2.4-16-ga0bb7db)
2015/12/27 20:58:34 Loaded outputs: influxdb
2015/12/27 20:58:34 Loaded plugins: cpu
2015/12/27 20:58:34 Tags enabled: dc=us-east-1 host=osselait
2015/12/27 20:58:34 Agent Config: Interval:{10s}, Debug:false, Hostname:"osselait", Flush Interval:{10s}

Notes:

  • DO NOT forget to change your etcd server URL
  • Tested with etcd 2.2.2

Agent config reading order:

  1. /telegraf/main key
  2. /telegraf/hosts/HOSTNAME key
  3. /telegraf/labels/LABEL1 key
  4. /telegraf/labels/LABEL2 key
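
The reading order above can be sketched in Go as a small helper that builds the list of etcd keys for one agent. This is only an illustration: `configKeys` and its parameters are hypothetical names, not code from this PR, and only the key layout (`main`, `hosts/HOSTNAME`, `labels/LABEL`) is taken from the list.

```go
package main

import (
	"fmt"
	"path"
)

// configKeys returns the etcd keys an agent would read, in the order listed
// above: the main key, then the host key, then one key per label.
// The function name and signature are hypothetical (not from the PR).
func configKeys(root, hostname string, labels []string) []string {
	keys := []string{
		path.Join(root, "main"),
		path.Join(root, "hosts", hostname),
	}
	for _, label := range labels {
		keys = append(keys, path.Join(root, "labels", label))
	}
	return keys
}

func main() {
	// "osselait" is the hostname shown in the log output above.
	for i, key := range configKeys("/telegraf", "osselait", []string{"LABEL1", "LABEL2"}) {
		fmt.Printf("%d. %s\n", i+1, key)
	}
}
```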

Features:

  • Main config file can be loaded in etcd
  • Each agent automatically tries to find its own key in etcd (/telegraf/hosts/HOSTNAME)
  • Labels can be configured in config files, so labels can themselves come from etcd
  • An etcd config watcher that reloads Telegraf when a change is detected in etcd
  • You can write the configuration of ALL your Telegraf agents in a folder, then send it to etcd

Missing:

  • Add an option to select the root folder name in etcd (default "/telegraf") DONE
  • Handle multiple etcd servers DONE
  • Handle update/set/delete in etcd DONE
  • Documentation DONE

titilambert force-pushed the etcd branch 2 times, most recently from 192aa91 to 9d7acb2 (February 5, 2016 06:03)
@titilambert
Contributor Author

New features:

Agent specific config

Each agent starts by reading the value of the key /telegraf/hosts/HOSTNAME.conf in etcd to get its default config. So you can now set your labels in this file and start your agent like this:

./telegraf -etcd http://127.0.0.1:2379 

Configuration folder

You can now set your configuration in a folder like this:

testdata/
├── hosts
│   └── localhost.conf
├── labels
│   ├── influx.conf
│   ├── network2.conf
│   └── network.conf
└── main.conf

(An example is available here: https://github.com/titilambert/telegraf/tree/etcd/internal/etcd/testdata/test1)

Then you can send your configuration folder to etcd:

./telegraf -etcd http://127.0.0.1:2379  -etcdwriteconfigdir testdata/

Then you can start all your telegraf agents like this:

./telegraf  -etcd http://127.0.0.1:2379 -etcdreadlabels=influx,network
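
As a rough illustration of how the folder layout above could map onto etcd keys, here is a minimal Go sketch. `etcdKeyFor` is a hypothetical helper, and dropping the `.conf` suffix is an assumption made for this sketch; the PR may keep the suffix in the key name.

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
	"strings"
)

// etcdKeyFor maps a config file path, relative to the config folder, to an
// etcd key under root. Dropping the ".conf" suffix is an assumption made
// for this sketch, not necessarily what the PR does.
func etcdKeyFor(root, relPath string) string {
	rel := filepath.ToSlash(relPath) // normalise OS-specific separators
	rel = strings.TrimSuffix(rel, ".conf")
	return path.Join(root, rel)
}

func main() {
	// The files from the testdata/ tree shown above.
	files := []string{
		"main.conf",
		"hosts/localhost.conf",
		"labels/influx.conf",
		"labels/network.conf",
		"labels/network2.conf",
	}
	for _, f := range files {
		fmt.Printf("%s -> %s\n", f, etcdKeyFor("/telegraf", f))
	}
}
```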

@titilambert
Contributor Author

@sparrc rebased ! (I don't give up ;) )

@sparrc
Contributor

sparrc commented Feb 13, 2016

great! Sorry I haven't had time to get a full review on this one, I've been slammed by some other features

@@ -19,6 +19,7 @@ import (
"github.com/influxdata/telegraf/plugins/serializers"

"github.com/influxdata/config"
"github.com/naoina/toml"
Contributor

Can you remove the naoina/toml dependency here? I believe there is a function in influxdata/config that you can use to load and parse the toml file

titilambert force-pushed the etcd branch 2 times, most recently from d32fab5 to 1d3f26d (February 20, 2016 04:36)
@titilambert
Contributor Author

@sparrc changes done! I added 2 tests to get more coverage.
And rebased :)

titilambert force-pushed the etcd branch 2 times, most recently from 7849b11 to c05fba3 (February 20, 2016 04:58)
@titilambert
Contributor Author

@sparrc What do you think about those features?

BTW, it's still missing:

  • Add an option to select the root folder name in etcd (default "/telegraf") DONE
  • Handle multiple etcd servers DONE
  • Handle update/set/delete in etcd
  • Documentation DONE

titilambert force-pushed the etcd branch 2 times, most recently from e4f4f63 to 39818d6 (February 28, 2016 01:50)
@titilambert
Contributor Author

Added option to select the root folder name in etcd (default "/telegraf")

@titilambert
Contributor Author

Rebased with the new toml lib

@titilambert
Contributor Author

Rebased !
@sparrc What about creating a new command, telegrafctl?
Instead of:

./telegraf -etcd http://127.0.0.1:2379  -etcdwriteconfigdir testdata/

we would use:

./telegrafctl -etcd http://127.0.0.1:2379  -etcdwriteconfigdir testdata/

@sparrc
Contributor

sparrc commented Mar 3, 2016

why do you want to do that? is there a precedent?

@titilambert
Contributor Author

It's just to separate the daemon binary from the utility binary. I don't know if it's a good idea; it simply follows what etcd, Kubernetes, and others do.

@titilambert
Contributor Author

@sparrc could you just confirm that this PR is in scope of Telegraf ? :)

@sparrc
Contributor

sparrc commented Mar 16, 2016

yes, it is :)

@titilambert
Contributor Author

@sparrc cool :)

@titilambert
Contributor Author

@sparrc Rebased !
I also added a parameter to erase the config in etcd.
I think everything is here now. You can begin reviewing it; I'm waiting for your feedback :)
Thanks !

@balboah

balboah commented Apr 6, 2016

Why not just use https://github.com/kelseyhightower/confd together with a kill HUP signal?

@titilambert
Contributor Author

@balboah This adds a single point of failure, doesn't it?
And it's not really convenient when you're running Telegraf inside a container...

@sparrc
Contributor

sparrc commented Apr 6, 2016

@titilambert As I've looked through this PR more, and looked into etcd and configuration management options, I feel like this is going to get messy.

Yes, it's nice that we could have etcd directly baked into telegraf in some ways. But it's also complicated and it ignores all the other options that there are out there for achieving this (consul, redis, vault, etc)

As @balboah suggests, this seems more like a configuration management issue, so why not use configuration management tools to solve it? Why should telegraf become a configuration management tool on top of all the other things it does?

@sparrc
Contributor

sparrc commented Apr 6, 2016

ps: do you have any examples of a project similar to telegraf that integrates directly with etcd? Or are there any libraries that we could use to generically get conf files? (like a library version of confd?)

@balboah

balboah commented Apr 11, 2016

@titilambert not sure what you mean with single point of failure. In my case confd runs inside the container, monitors the etcd cluster for changes and updates the config file + sends the kill signal when there is some change.
As long as telegraf is good at handling that config file reloading (like re-connecting if the influxdb hosts change), all is good imo

@titilambert
Contributor Author

@balboah Single point of failure: what happens if confd crashes? We don't have any fallback for confd. I can't see how running confd inside a container solves this issue. Docker can still restart confd, but then the single point of failure is Docker. The only single point of failure should be Telegraf.

@sparrc I agree with you, this PR limits Telegraf to etcd. But, imo, confd seems to be a single point of failure.

> do you have any examples of a project similar to telegraf that integrates directly with etcd?

Yes: Kubernetes, fleet, locksmith, vulcand, calico, flannel, ...

> Or are there any libraries that we could use to generically get conf files? (like a library version of confd?)

Good question!

@sparrc
Contributor

sparrc commented Apr 11, 2016

@titilambert but what if your etcd server goes down? This could be a problem if etcd is integrated directly into telegraf. If you decouple the two services (telegraf and config management) then Telegraf is completely unaffected by any status or change in etcd.

@titilambert
Contributor Author

@sparrc etcd can run as a cluster (with at least 3 nodes) which means etcd isn't a single point of failure.
With confd I just can't see how we can eliminate this issue (because you need to run several confd daemons on the same machine)

@balboah

balboah commented Apr 13, 2016

sorry maybe I still don't understand, but I fail to see how integrating basically the same use case as confd into telegraf solves the availability issue differently? The code would be pretty much the same, the number of processes would be the same, how is the "single point of failure" different?
To clarify: confd talks with etcd nodes and writes a config file. Telegraf would do the same, or update configuration in its memory. Sure confd process could die because of bugs, but so could telegraf?

Also confd already supports toml templating that you're introducing, has plugins for different statements and supports more sources than etcd.

@titilambert
Contributor Author

@balboah With this PR, Telegraf never needs a config file on disk. It loads the config directly from etcd into memory, which eliminates the write-config step.
Of course, Telegraf can crash in both cases, but without confd there is one fewer step.

@sparrc I think you're right! Maybe we should use https://github.com/spf13/viper ? We could use it to read config only from remote sources, or from both remote sources and local files. Which would be the better choice?

@myontop

myontop commented Apr 16, 2016

When do you plan to release a version with etcd integration?

@johnrengelman
Contributor

I'll chime in with my 2 cents: I don't think telegraf should go down the road of supporting config backends directly. Once etcd is added, there will be requests for consul, zookeeper, etc.
It becomes a maintenance nightmare.
A better approach is to have best practices for how to integrate with these services externally.
As for the SPOF argument, by using something like confd to write out a config file and having telegraf simply load that file, you are actually reducing the failure modes. In this scenario, if etcd or confd fails, there is still a configuration file for telegraf to load, which will allow it to start up.

If integrated directly and etcd is down, then telegraf can't run because the configuration is coming from there.

It also protects telegraf from any changes in the APIs of these tools. You don't want to have to release a new version of telegraf due to a compatibility issue with etcd.

@titilambert
Contributor Author

titilambert commented May 4, 2016

@johnrengelman what do you think about https://github.com/spf13/viper ?
I understand the use case:

> In this scenario, if etcd or confd fails, then there is still a configuration file for telegraf to load which will allow it to start up.

but I'm not sure about using confd+telegraf in Docker environments...
For example, in a Kubernetes environment, this means adding a new container inside the Telegraf pod, which doubles your number of containers just for configuration. You would rather spend those resources/containers on your own applications.

@gunnaraasen
Member

Not sure if this has been mentioned before. Docker's libkv is another potential option for supporting multiple distributed config stores.

@sparrc
Contributor

sparrc commented May 5, 2016

that library looks fantastic, thanks @gunnaraasen

@titilambert
Contributor Author

@sparrc what do you think? Should I rewrite the PR with https://github.com/docker/libkv or https://github.com/spf13/viper ?

@sparrc
Contributor

sparrc commented May 10, 2016

closing this because I prefer to have this conversation in #272

6 participants