
reporting node health #27

Closed
flavianmissi opened this issue Dec 2, 2018 · 9 comments

@flavianmissi

How is node health reported? Looked around in the code and couldn't see anything obvious.

Centrifugo's API has no healthcheck endpoint. While investigating how to create one, I noticed that health would need to be reported by each node. Is that correct?

To start simple, we could just ping the configured engine and report "ok" or "not ok".
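Something like this minimal sketch is what I have in mind (the `Engine` interface, `Ping` method, and handler are hypothetical names, not the actual centrifuge API):

```go
package health

import "net/http"

// Engine is a hypothetical stand-in for whatever the configured
// engine (memory, Redis, ...) would expose for a liveness ping.
type Engine interface {
	Ping() error
}

// Handler reports "ok" while the engine answers the ping and
// 503 "not ok" as soon as it stops doing so.
func Handler(engine Engine) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if err := engine.Ping(); err != nil {
			http.Error(w, "not ok", http.StatusServiceUnavailable)
			return
		}
		w.Write([]byte("ok"))
	}
}
```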

I'm willing to send a PR, but would probably need a little guidance to get me started.

Great project btw.
Thanks!

@FZambia
Member

FZambia commented Dec 2, 2018

Hello, could you provide a bit more background on why you need this? Is this for Kubernetes or something similar?

Also, do you need this per node or per full setup (i.e. switching to another Centrifugo setup on a balancer, for example)?

I'm not sure at the moment which information I should provide, as I don't yet understand what you mean by node health.

@flavianmissi
Author

flavianmissi commented Dec 3, 2018

Hey, yes of course.

> Is this for Kubernetes or something similar?

Exactly. I deployed Centrifugo into multiple Kubernetes clusters and I'm currently using the admin / endpoint as a healthcheck, but it's not suitable for all my environments and it isn't a very trustworthy check either.

> Also, do you need this per node or per full setup (i.e. switching to another Centrifugo setup on a balancer, for example)?

I was envisioning something more like per container (so per node, I suppose?): if a container lost access to Redis for any reason, the healthcheck would start failing, and Kubernetes (or any similar tool) could detect this via the healthcheck and restart the problematic container.
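For reference, this is the kind of probe I'd wire up on the Kubernetes side (the /health path is hypothetical at this point; 8000 is Centrifugo's default port):

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3
```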

Hope this is clearer now - thanks for the quick answer.

@FZambia
Member

FZambia commented Dec 5, 2018

My first idea is adding a health status change here - this is the place where a node periodically sends information about itself to other nodes. If this operation fails, the engine is not working.

But this won't work properly when Redis sharding is used, because at the moment we try to deliver a control message using each Redis shard in order, so some shards can return an error.

It looks like if we want to check all shards, this feature requires a new Health method in the engine that returns health status taking into account all shards and any engine-specific details.
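A rough sketch of what I mean (the shard type and field names here are hypothetical, just to illustrate the shape):

```go
package engine

import "fmt"

// shard is a hypothetical handle to a single Redis shard.
type shard interface {
	Ping() error
}

// RedisEngine sketch: an engine backed by several application-level shards.
type RedisEngine struct {
	shards []shard
}

// Health reports unhealthy as soon as any shard fails a ping, so one
// broken shard marks the whole engine as down.
func (e *RedisEngine) Health() error {
	for i, s := range e.shards {
		if err := s.Ping(); err != nil {
			return fmt.Errorf("shard %d unhealthy: %v", i, err)
		}
	}
	return nil
}
```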

@flavianmissi
Author

> But this won't work properly when Redis sharding is used, because at the moment we try to deliver a control message using each Redis shard in order, so some shards can return an error.

Is this really something this healthcheck should be concerned about, though? On a cluster installation (a Redis cluster in this case), node failures are acceptable as long as the whole cluster is still operational.

> It looks like if we want to check all shards, this feature requires a new Health method in the engine that returns health status taking into account all shards and any engine-specific details.

Checking the health of each and every shard sounds like something Redis should be doing, not centrifuge. There's even a command for that in redis-cli: `redis-cli --cluster check 127.0.0.1:7000`. I think an exposed ping (or health) method in the engine should be enough. Then the Node would expose a status check method.
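Roughly (names hypothetical again):

```go
package node

// pinger is whatever single engine-level ping the engine exposes.
type pinger interface {
	Ping() error
}

// Node sketch: the node delegates its status check to one engine-level
// ping instead of knowing about individual shards.
type Node struct {
	engine pinger
}

func (n *Node) Health() error {
	return n.engine.Ping()
}
```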

Thoughts?

@FZambia
Member

FZambia commented Dec 18, 2018

> Is this really something this healthcheck should be concerned about, though? On a cluster installation (a Redis cluster in this case), node failures are acceptable as long as the whole cluster is still operational.

Centrifuge uses application-level sharding (not Redis Cluster), because PUB/SUB does not scale well in Redis Cluster - see redis/redis#2672 and the Redis mailing list. So we have to check each shard manually. Does that make sense?

Yep, the Node must have a Health method too, of course. Btw, take a look at the related PR I opened recently - it waits for this method to be implemented in this lib.

@flavianmissi
Author

> Centrifuge uses application-level sharding (not Redis Cluster), because PUB/SUB does not scale well in Redis Cluster - see redis/redis#2672 and the Redis mailing list. So we have to check each shard manually. Does that make sense?

Oh that explains it, it does make sense, thank you.

> Yep, the Node must have a Health method too, of course. Btw, take a look at the related PR I opened recently - it waits for this method to be implemented in this lib.

Great news! I see the PR is merged now - do you have any kind of timeline for when this will be generally available?

Thanks for making it happen btw 🎉

@FZambia
Member

FZambia commented Jan 7, 2019

It's already available in Centrifugo v2.1.0, but it just provides a health endpoint without doing extra checks inside the engine - i.e. here is the healthcheck handler. If you need more than that, we still have to implement it inside this lib first.
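A quick usage sketch (assuming the boolean health option described in the v2 docs; worth double-checking the exact name there):

```sh
# config.json: {"health": true} enables the endpoint, then:
curl -i http://localhost:8000/health
# expect HTTP 200 while the node is up
```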

@flavianmissi
Author

This satisfies my current needs, thank you!

@FZambia
Member

FZambia commented Jan 8, 2019

OK, great! I'll close this issue then, because my current needs are also satisfied with that simple health checker. If you need more from it, feel free to reopen!

@FZambia FZambia closed this as completed Jan 8, 2019