Automated monitoring of an etcd cluster? #2383

Closed
Telmo opened this issue Feb 26, 2015 · 14 comments

@Telmo

Telmo commented Feb 26, 2015

I've read through all the tools using etcd and googled different permutations of "etcd monitoring", and I am still unable to find a way to monitor an etcd cluster. By monitoring I mean a programmatic way of checking the global health of a cluster, so that we are alerted if "nodes fall off the map" or etcd stops working altogether.

Is there such a tool? If not what are people using to be alerted of possible issues with etcd?

@kelseyhightower
Contributor

@Telmo Thanks for reaching out. I'm going to propose a simple option, please let me know if this is what you are looking for.

  • if you care about the health of a specific node, then one option would be to attempt to write to a "dummy" key. For example /monitoring/healthcheck. If that works then the node can be considered healthy.
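As a rough illustration of the "dummy" key idea, here is a minimal Go sketch that writes to /monitoring/healthcheck through the etcd v2 keys API. The client URL http://127.0.0.1:2379, the 60 second TTL, the timeout, the exit codes, and the function names newProbeRequest and probeWrite are all assumptions made for this example, not something the project prescribes.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"os"
	"strings"
	"time"
)

// healthKey is the "dummy" key suggested above.
const healthKey = "/monitoring/healthcheck"

// newProbeRequest builds a PUT against the etcd v2 keys API of one member.
// base is an assumed client URL such as "http://127.0.0.1:2379".
func newProbeRequest(base string) (*http.Request, error) {
	form := url.Values{}
	form.Set("value", time.Now().UTC().Format(time.RFC3339))
	form.Set("ttl", "60") // assumption: let the key expire so old probes do not linger
	req, err := http.NewRequest(http.MethodPut,
		strings.TrimRight(base, "/")+"/v2/keys"+healthKey,
		strings.NewReader(form.Encode()))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	return req, nil
}

// probeWrite returns nil when the member accepted the write.
func probeWrite(client *http.Client, base string) error {
	req, err := newProbeRequest(base)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// The v2 API answers 201 when the key is created and 200 when it is updated.
	if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusCreated {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	return nil
}

func main() {
	base := "http://127.0.0.1:2379" // assumption: default local client URL
	if len(os.Args) > 1 {
		base = os.Args[1]
	}
	client := &http.Client{Timeout: 3 * time.Second}
	if err := probeWrite(client, base); err != nil {
		fmt.Fprintln(os.Stderr, "CRITICAL:", err)
		os.Exit(2) // Nagios-style "critical" exit code
	}
	fmt.Println("OK: write accepted")
}
```

Because a write has to be committed by the cluster, a successful probe says more than "the process is listening", but it also means one slow commit can fail the check, so a generous timeout is probably wise.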

We have a new health endpoint which you can test with the following command:

etcdctl cluster-health -h
NAME:
   cluster-health - check the health of the etcd cluster

USAGE:
   command cluster-health [arguments...]

We have not documented the raw API call for this, but it might provide a better approach for monitoring the health of the entire cluster.

Also, how should this work? Would you like a single endpoint that you can hit with a Nagios check?

@Telmo
Author

Telmo commented Feb 27, 2015

A single endpoint would definitely be a plus. However, I am aware that if a node drops from the cluster, the cluster may not be aware that the node is no longer there.

What's the HTTP endpoint for cluster-health? I'd like to query it directly from something like a Go program.

Basically, we are a BIG company, and we need to make sure our etcd cluster can be monitored so we can be proactive/reactive with regard to nodes being down or cluster issues.

@kelseyhightower
Contributor

@Telmo I'll produce a monitoring guide early next week for your review.

@kelseyhightower
Contributor

@Telmo I did not forget about you. I'm setting up our etcd integration environment now and I'm using Datadog for metrics and monitoring. Once I've got my Datadog integration done I'll share it with you.

@Telmo
Author

Telmo commented Mar 3, 2015

Perfect, thank you very much for this.

@Telmo
Author

Telmo commented Mar 12, 2015

@kelseyhightower any updates? Looking forward to the guide.

@damm

damm commented May 8, 2015

+1

@ajardan

ajardan commented May 22, 2015

+1

@damm

damm commented May 22, 2015

@Telmo there is a plugin (or two) for Sensu that helps with this.

Sometimes /health goes false and then returns true 2s later, so I'm nervous about cluster health monitoring, although I want it. (I just don't want false positives triggering alerts.)

https://github.com/sensu/sensu-community-plugins/tree/master/plugins/etcd

Being able to monitor whether etcd is up is fairly critical. Having metrics is useful. A check for /health would be good, but it does return false occasionally, so it will need a lot of retries.
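One way to absorb those short-lived false readings is to alert only when several probes in a row fail. The sketch below is a generic retry wrapper, not part of any Sensu plugin; the name probeWithRetries, the attempt count, and the delay are assumptions picked for illustration, and the flaky probe in main is a stand-in for a real /health check.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnhealthy stands in for whatever error a real health probe would return.
var errUnhealthy = errors.New("health check reported unhealthy")

// probeWithRetries calls probe up to attempts times, sleeping delay between
// tries. It returns nil as soon as one attempt succeeds, and an error only
// when every attempt has failed.
func probeWithRetries(probe func() error, attempts int, delay time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = probe(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("all %d attempts failed, last error: %w", attempts, err)
}

func main() {
	// Hypothetical probe that fails twice and then recovers, imitating the
	// "false, then true 2s later" behaviour described above.
	calls := 0
	flaky := func() error {
		calls++
		if calls < 3 {
			return errUnhealthy
		}
		return nil
	}
	if err := probeWithRetries(flaky, 5, 10*time.Millisecond); err != nil {
		fmt.Println("ALERT:", err)
		return
	}
	fmt.Printf("OK after %d probe(s)\n", calls)
}
```

With a real probe, the delay would need to be long enough to ride out a blip of a couple of seconds, which trades slower detection for fewer false alerts.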

@xiang90
Contributor

xiang90 commented Aug 4, 2015

@Telmo

Now etcd supports metrics reporting: https://github.com/coreos/etcd/blob/master/Documentation/metrics.md

You can curl the /health endpoint of each member to get the health information.
Or you can use the etcdctl cluster-health command with the forever option to monitor the cluster's health.
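For anyone who wants to do the per-member /health check from a Go program, a minimal sketch could look like the following. The member URLs are placeholders, and the response shape {"health": "true"} is an assumption based on what etcd 2.x returned at the time; verify it against your version before relying on it.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// parseHealth decodes a /health response body.
// Assumption: the body looks like {"health": "true"}, with a string value.
func parseHealth(body []byte) (bool, error) {
	var h struct {
		Health string `json:"health"`
	}
	if err := json.Unmarshal(body, &h); err != nil {
		return false, err
	}
	return h.Health == "true", nil
}

// memberHealthy queries /health on a single member's client URL.
func memberHealthy(client *http.Client, base string) (bool, error) {
	resp, err := client.Get(strings.TrimRight(base, "/") + "/health")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	// Any non-200 status is treated as unhealthy without parsing the body.
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<16))
	if err != nil {
		return false, err
	}
	return parseHealth(body)
}

func main() {
	members := os.Args[1:]
	if len(members) == 0 {
		members = []string{"http://127.0.0.1:2379"} // placeholder member URL
	}
	client := &http.Client{Timeout: 3 * time.Second}
	unhealthy := 0
	for _, m := range members {
		ok, err := memberHealthy(client, m)
		if err != nil || !ok {
			unhealthy++
			fmt.Printf("%s: UNHEALTHY (%v)\n", m, err)
			continue
		}
		fmt.Printf("%s: healthy\n", m)
	}
	if unhealthy > 0 {
		os.Exit(2)
	}
}
```

Listing every member explicitly means a member that has dropped out still shows up as a failed probe, rather than silently disappearing from the results.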

Thanks.

@hmatinho

hmatinho commented Aug 4, 2015

cluster-health only shows the health of members, not proxies. It would be nice to check whether proxies are also up.

@xiang90
Contributor

xiang90 commented Aug 5, 2015

@hmatinho A proxy is not part of the actual etcd cluster, and you might have a large number of proxies. I do not think it is the responsibility of the etcd cluster to report the status of proxies.

@stvnwrgs

I am having a similar issue. Using etcdctl cluster-health works great if your health check runs on the machines or has access to them. If you use something external that needs a URL, it will only work if you run etcd without TLS.

For example, I want to use the GCE network load balancer with URL-based health checks, but it is not possible to provide a CA, cert, or key.

I have only one idea at the moment: set up nginx as a proxy for a special route on a different port that just returns the health status without TLS.

Does anybody have a better idea?

It would be cool if you could enable a second HTTP endpoint that has restricted access and serves without TLS.
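To make the proxy idea concrete, here is a sketch of a small Go shim used in place of nginx: it queries etcd's /health over TLS with a client certificate and re-serves the result over plain HTTP on another port. This is a substitute technique, not an etcd feature. The flag names, file names, port :8080, and the functions healthShim and tlsClient are all hypothetical choices for this example.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"flag"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

// healthShim returns a handler that asks the TLS-protected member for /health
// and mirrors the status code and body to the plain-HTTP caller.
func healthShim(client *http.Client, upstream string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		resp, err := client.Get(upstream + "/health")
		if err != nil {
			http.Error(w, "etcd unreachable", http.StatusServiceUnavailable)
			return
		}
		defer resp.Body.Close()
		w.Header().Set("Content-Type", resp.Header.Get("Content-Type"))
		w.WriteHeader(resp.StatusCode)
		io.Copy(w, resp.Body)
	})
}

// tlsClient builds an HTTP client that trusts the given CA and presents the
// given client certificate.
func tlsClient(caFile, certFile, keyFile string) (*http.Client, error) {
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no CA certificates found in " + caFile)
	}
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	return &http.Client{
		Timeout: 3 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{
			RootCAs:      pool,
			Certificates: []tls.Certificate{cert},
		}},
	}, nil
}

func main() {
	// All flag names and defaults below are hypothetical.
	ca := flag.String("ca", "ca.pem", "CA certificate file")
	cert := flag.String("cert", "client.pem", "client certificate file")
	key := flag.String("key", "client-key.pem", "client key file")
	upstream := flag.String("etcd", "https://127.0.0.1:2379", "etcd client URL")
	listen := flag.String("listen", ":8080", "plain-HTTP listen address")
	flag.Parse()

	client, err := tlsClient(*ca, *cert, *key)
	if err != nil {
		log.Fatal(err)
	}
	mux := http.NewServeMux()
	mux.Handle("/health", healthShim(client, *upstream)) // only this route is exposed
	log.Fatal(http.ListenAndServe(*listen, mux))
}
```

Since only /health is routed, the unencrypted port exposes nothing but the health status, though access to it should probably still be limited to the load balancer's address range.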

@damm

damm commented Oct 15, 2015

Be aware that ipaddress:2379/health should work fine @stvnwrgs
