[RFC] Election is never based on service health #3249

Closed
jamessewell opened this issue Sep 21, 2017 · 5 comments
Labels
  • Focus:Gossip Protocol (Tasks related to fundamental gossip algorithm behavior)
  • Focus:Supervisor ProcessManagement (Related to how the Supervisor manages service processes)
  • Focus:Supervisor (Related to the Habitat Supervisor (core/hab-sup) component)
  • Stale
  • Type:Additional Discussion

Comments

Contributor

jamessewell commented Sep 21, 2017

This is a placeholder issue based on discussions with @adamhjk and @reset

At the moment, as far as I can tell, elections will happen (at least once 3246 is fixed) under the following circumstances:

  • leader is departed
  • leader is stopped (gracefully / ungracefully)
  • leader is on the wrong side of a netsplit

All of these amount to the Supervisor's health becoming poor on the leader (I think alive=false gets set?).

Given that Habitat aims to provide services, and given that it is being used for high availability, I think it also needs to be able to trigger an election based on the health of the leader's service (if configured).

At the moment there is a health_check hook, which seems to be intended for use by monitoring and alerting solutions. Its exit code can take several values, which are mapped as follows:

  • 0 => HealthCheck::Ok
  • 1 => HealthCheck::Warning
  • 2 => HealthCheck::Critical
  • 3 => HealthCheck::Unknown
  • _ => HealthCheck::Unknown
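
The mapping above can be written as a single match expression. This is a minimal sketch for illustration only: the `HealthCheck` variant names come from the list above, but the function name `health_from_exit_code` and the derives are assumptions, not copied from the Habitat source.

```rust
/// Illustrative stand-in for the health states listed above
/// (not copied from the Habitat source).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HealthCheck {
    Ok,
    Warning,
    Critical,
    Unknown,
}

/// Map a health_check hook exit code to a health state.
/// `health_from_exit_code` is a hypothetical name used for this sketch.
pub fn health_from_exit_code(code: i32) -> HealthCheck {
    match code {
        0 => HealthCheck::Ok,
        1 => HealthCheck::Warning,
        2 => HealthCheck::Critical,
        // 3 and every other value both end up as Unknown.
        _ => HealthCheck::Unknown,
    }
}
```

Note that every exit code is already accounted for, so none of them is free to carry a "depart" meaning without changing the existing behavior.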

It would be great if one of the following could happen:

  • a new hook is added that allows a leader (or any node?) to be forcefully departed on failure
  • the health_check hook is changed so that a particular return value forcefully departs a node (danger: what if the hook inadvertently passes this return value through from another binary?)
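
To make the second option concrete, here is a rough sketch of how a reserved exit code might be handled. Everything in it is hypothetical: the issue does not propose a specific exit code, so `FORCE_DEPART_EXIT_CODE`, `HookAction`, `action_for_exit_code`, and the `depart_enabled` opt-in flag are all assumptions made for illustration.

```rust
/// Hypothetical exit code reserved for "force depart".
/// The value 4 is an assumption; the issue does not name one.
pub const FORCE_DEPART_EXIT_CODE: i32 = 4;

/// What the Supervisor would do with a health_check result in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HookAction {
    /// Report health as usual (the existing behavior).
    ReportOnly,
    /// Depart this node so that an election can take place.
    ForceDepart,
}

/// Decide what to do with a health_check exit code.
/// `depart_enabled` is an assumed opt-in setting: unless the operator turns it
/// on, the reserved code is treated like any other exit code.
pub fn action_for_exit_code(code: i32, depart_enabled: bool) -> HookAction {
    if depart_enabled && code == FORCE_DEPART_EXIT_CODE {
        HookAction::ForceDepart
    } else {
        HookAction::ReportOnly
    }
}
```

An opt-in flag narrows the danger noted above but does not remove it: with the flag enabled, a wrapped binary that happens to exit with the reserved code would still depart the node.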

As @reset pointed out, when reasoning about this we need to remember that health is transient and that electing a new leader is a major event:

  • what happens when a leader returns to health before promotion?
  • what happens when a leader returns to health during promotion?

To manage this, I think there needs to be at least some sort of holdoff period (or a number of failed checks?) before the forced departure, to allow the leader time to return to the cluster.

It would be ideal (but would add complexity I suppose) if this could be configured. This would allow users to align the holdoff time with their desired mean time to recovery / recovery point objective.
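
A holdoff of this kind could be sketched as a small tracker that only signals a departure after the leader has been unhealthy for both a configured duration and a configured number of consecutive checks. This is a sketch under stated assumptions, not an existing Habitat feature: `HoldoffPolicy`, `FailureTracker`, and their fields are hypothetical names, and the current time is passed in explicitly so the logic can be exercised without waiting.

```rust
use std::time::{Duration, Instant};

/// Hypothetical, user-configurable policy; neither field exists in Habitat today.
#[derive(Debug, Clone, Copy)]
pub struct HoldoffPolicy {
    /// How long the leader's service must stay unhealthy before it is departed.
    pub holdoff: Duration,
    /// How many consecutive failed checks are required as well.
    pub min_failed_checks: u32,
}

/// Tracks the current run of failed health checks on the leader.
#[derive(Debug, Default)]
pub struct FailureTracker {
    first_failure: Option<Instant>,
    failed_checks: u32,
}

impl FailureTracker {
    /// Record one health check result taken at `now`.
    /// Returns true when the policy says the leader should be force departed.
    pub fn record(&mut self, healthy: bool, now: Instant, policy: &HoldoffPolicy) -> bool {
        if healthy {
            // Health is transient: a recovery before the holdoff expires
            // resets the run and no election is triggered.
            self.first_failure = None;
            self.failed_checks = 0;
            return false;
        }
        // Remember when this run of failures started.
        let started = *self.first_failure.get_or_insert(now);
        self.failed_checks += 1;
        self.failed_checks >= policy.min_failed_checks
            && now.saturating_duration_since(started) >= policy.holdoff
    }
}
```

With a 30 second holdoff and a minimum of three failed checks, for example, failures at 0 s, 10 s, and 20 s would not trigger a departure, while a fourth failure at 30 s would. The sketch only covers the "returns to health before promotion" case; what should happen when the leader recovers during promotion is still an open question.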

@eeyun eeyun changed the title Election is never based on service health [RFC] Election is never based on service health Nov 7, 2017
Contributor

eeyun commented Jan 30, 2018

Ok, I'm pinging our primary stakeholders on this. Ping! @adamhjk @reset @cm @baumanj.

I think we all agree that this feature makes sense, but since it is a significant amount of work to implement, we'd prefer to get more comments and thoughts shared here in a durable medium. As you get the time, please leave a comment with your thoughts!

@christophermaier christophermaier self-assigned this Jun 5, 2018
@christophermaier christophermaier removed their assignment Jun 8, 2018
@christophermaier christophermaier added Focus:Gossip Protocol Tasks related to fundamental gossip algorithm behavior Focus:Supervisor ProcessManagement Related to how the Supervisor manages service processes E-less-easy labels Nov 30, 2018

stale bot commented Apr 2, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

@stale stale bot added the Stale label Apr 2, 2020
@christophermaier
Contributor

Still needed.

@stale stale bot removed the Stale label May 5, 2020
@christophermaier christophermaier added Focus:Supervisor Related to the Habitat Supervisor (core/hab-sup) component and removed A-supervisor labels Jul 24, 2020

stale bot commented Oct 1, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

@stale stale bot added the Stale label Oct 1, 2022

stale bot commented May 21, 2023

This issue has been automatically closed after being stale for 400 days. We still value your input and contribution. Please re-open the issue if desired and leave a comment with details.

@stale stale bot closed this as completed May 21, 2023
6 participants