Automated monitoring of an etcd cluster? #2383

Telmo · 2015-02-26T18:05:45Z

I've read through all the tools using etcd and googled different permutations of "etcd monitoring", and I am still unable to find a way to monitor an etcd cluster. By monitoring I mean a programmatic way of checking the global health of a cluster, so that we are alerted if "nodes fall off the map" or etcd stops working altogether.

Is there such a tool? If not, what are people using to be alerted to possible issues with etcd?

kelseyhightower · 2015-02-27T06:07:28Z

@Telmo Thanks for reaching out. I'm going to propose a simple option; please let me know if this is what you are looking for.

If you care about the health of a specific node, then one option would be to attempt to write to a "dummy" key, for example /monitoring/healthcheck. If the write succeeds, the node can be considered healthy.
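
A minimal sketch of that write probe, assuming the node exposes the v2 keys API under /v2/keys on its client URL. The function names and the timestamp value are illustrative, not part of etcd:

```python
import time
import urllib.error
import urllib.parse
import urllib.request


def build_probe_request(base_url, key="/monitoring/healthcheck"):
    """Build a PUT request that writes the current timestamp to `key`."""
    # Assumption: the v2 keys API is served under /v2/keys on the client URL.
    url = base_url.rstrip("/") + "/v2/keys" + key
    body = urllib.parse.urlencode({"value": str(int(time.time()))}).encode()
    return urllib.request.Request(url, data=body, method="PUT")


def node_is_writable(base_url, timeout=2.0):
    """Return True if the write succeeds, False on any HTTP or network error."""
    try:
        request = build_probe_request(base_url)
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # HTTP errors, refused connections, and timeouts all count as unhealthy.
        return False
```

Calling `node_is_writable("http://127.0.0.1:2379")` (a placeholder address) would then return a boolean that a monitoring check can turn into an alert.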

We have a new health endpoint which you can test with the following command:

etcdctl cluster-health -h
NAME:
   cluster-health - check the health of the etcd cluster

USAGE:
   command cluster-health [arguments...]

We have not documented the raw API call for this, but it might provide a better approach for monitoring the health of the entire cluster.

Also, how should this work? Would you like a single endpoint that you can hit with a Nagios check?

Telmo · 2015-02-27T19:53:45Z

A single endpoint would definitely be a plus; however, I am aware that if a node drops out of the cluster, the cluster may not know that the node is no longer there.

What's the HTTP endpoint for cluster-health? I'd like to query it directly from something like a Go program.

Basically, we are a BIG company, and we need to make sure our etcd cluster can be monitored so we can be proactive/reactive with regard to nodes being down or cluster issues.

kelseyhightower · 2015-02-27T23:40:02Z

@Telmo I'll produce a monitoring guide early next week for your review.

kelseyhightower · 2015-03-03T14:11:56Z

@Telmo I did not forget about you. I'm setting up our etcd integration environment now and I'm using Datadog for metrics and monitoring. Once I've got my Datadog integration done I'll share it with you.

Telmo · 2015-03-03T19:59:02Z

Perfect, thank you very much for this.

Telmo · 2015-03-12T13:57:32Z

@kelseyhightower any updates? Looking forward to the guide.

damm · 2015-05-08T02:03:16Z

+1

ajardan · 2015-05-22T18:40:54Z

+1

damm · 2015-05-22T19:00:20Z

@Telmo there is a plugin (or two) for Sensu that helps with this.

Sometimes /health goes false and then returns true 2s later, so I'm nervous about the cluster health monitoring, although I want it. (I just don't want false positives triggering alerts.)

https://github.com/sensu/sensu-community-plugins/tree/master/plugins/etcd

Being able to monitor whether etcd is up is fairly critical. Having metrics is useful. A check for /health would be good, but it does return false occasionally, so it will need a lot of retries.
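
One way to add those retries is to alert only after several consecutive unhealthy readings. The sketch below is an illustration, not an etcd or Sensu feature; `probe` stands for any zero-argument callable that returns True when /health reports healthy, and the attempt count and delay are arbitrary defaults:

```python
import time


def confirmed_unhealthy(probe, attempts=5, delay=2.0, sleep=time.sleep):
    """Return True only if `probe()` reports unhealthy on every attempt."""
    for attempt in range(attempts):
        if probe():
            return False  # a single healthy reading clears the suspicion
        if attempt < attempts - 1:
            sleep(delay)  # wait before re-checking a failed reading
    return True
```

With a 2-second delay and 5 attempts, a blip like the one described above (false, then true 2s later) would be cleared on the second reading and would not alert.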

xiang90 · 2015-08-04T03:50:07Z

@Telmo

Now etcd supports metrics reporting: https://github.com/coreos/etcd/blob/master/Documentation/metrics.md

You can curl the /health endpoint of each member to get its health information.
Or you can use the etcdctl cluster-health command with the forever option to monitor the cluster's health continuously.

Thanks.
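
For anyone who wants to poll /health from a program rather than with curl, here is a rough sketch. It assumes the endpoint returns a JSON body with a "health" field (either the string "true" or a boolean), and it treats a majority of healthy members as a rough proxy for a working quorum. All function names are illustrative:

```python
import json
import urllib.error
import urllib.request


def parse_health(body):
    """Interpret a /health response body; anything unparseable is unhealthy."""
    try:
        value = json.loads(body).get("health")
    except (ValueError, AttributeError):
        return False
    return value is True or str(value).lower() == "true"


def member_is_healthy(base_url, timeout=2.0):
    """Fetch <base_url>/health and return True if the member reports healthy."""
    try:
        url = base_url.rstrip("/") + "/health"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_health(resp.read())
    except (urllib.error.URLError, OSError):
        return False


def cluster_report(member_urls, check=member_is_healthy):
    """Return (per-member health, whether a majority of members is healthy)."""
    status = {url: check(url) for url in member_urls}
    healthy = sum(status.values())
    return status, healthy > len(member_urls) // 2
```

`cluster_report(["http://10.0.0.1:2379", "http://10.0.0.2:2379", "http://10.0.0.3:2379"])` (placeholder addresses) would return the per-member map plus a single boolean suitable for one alert.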

hmatinho · 2015-08-04T23:36:22Z

cluster-health only shows the health of members, not proxies. It would be nice to check whether proxies are also up.

xiang90 · 2015-08-05T00:06:46Z

@hmatinho A proxy is not part of the actual etcd cluster, and you might have a large number of proxies. I do not think it is the responsibility of the etcd cluster to report the status of proxies.

stvnwrgs · 2015-10-15T10:07:41Z

I am having a similar issue. Using etcdctl cluster-health works great if your health check runs on the machines or has access to them. If you use something external that needs a URL, it will only work if you run etcd without TLS.

For example: I want to use the GCE network load balancer with URL-based health checks, but it is not possible to provide a CA, cert, or key.

I have only one idea at the moment: set up nginx as a proxy for a special route on a different port that just gives you the health back without TLS.

Does somebody have a better idea?

It would be cool if you could enable a second HTTP endpoint that has restricted access and serves without TLS.
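
As an alternative to the nginx idea, the same relay can be sketched as a small Python process. This swaps nginx for Python's standard http.server and is only an illustration: the port, file paths, function names, and the choice of 503 for "unhealthy" are assumptions, and it presumes /health returns JSON with a "health" field.

```python
import json
import ssl
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_tls_context(ca_file, cert_file, key_file):
    """Client TLS context that trusts `ca_file` and presents a client cert."""
    ctx = ssl.create_default_context(cafile=ca_file)
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


def fetch_health(etcd_url, ctx, timeout=2.0):
    """Return True if etcd's /health reports healthy, False on any failure."""
    try:
        url = etcd_url.rstrip("/") + "/health"
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return str(json.loads(resp.read()).get("health")).lower() == "true"
    except (urllib.error.URLError, OSError, ValueError, AttributeError):
        return False


def health_response(ok):
    """Map a health boolean to an HTTP status code and JSON body."""
    body = json.dumps({"health": "true" if ok else "false"}).encode()
    return (200 if ok else 503), body


def make_handler(is_healthy):
    """Build a handler class that answers GET /health using `is_healthy()`."""
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            status, body = health_response(is_healthy())
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep load-balancer polling out of stderr

    return HealthHandler


def serve(etcd_url, ca_file, cert_file, key_file, port=8080):
    """Serve plain-HTTP /health on `port`, backed by a TLS check against etcd."""
    ctx = make_tls_context(ca_file, cert_file, key_file)
    handler = make_handler(lambda: fetch_health(etcd_url, ctx))
    HTTPServer(("0.0.0.0", port), handler).serve_forever()
```

Running `serve("https://127.0.0.1:2379", "ca.pem", "client.pem", "client-key.pem")` (placeholder paths) on each etcd machine would give a URL-based health check something to poll. Returning 503 when unhealthy is a deliberate choice here, since URL checks typically look at the status code rather than the body. The port serves without TLS, so it should be firewalled to the health checker only.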

damm · 2015-10-15T17:53:42Z

Be aware that ipaddress:2379/health should work fine @stvnwrgs

xiang90 added the etcdctl label Jul 25, 2015

xiang90 added this to the v2.2.0 milestone Jul 25, 2015

xiang90 self-assigned this Jul 25, 2015

xiang90 mentioned this issue Jul 30, 2015

etcdctl: cluster-health supports forever flag #3197

Merged

xiang90 closed this as completed in #3197 Aug 4, 2015
