[RFC] Election is never based on service health #3249

jamessewell · 2017-09-21T01:58:25Z

This is a placeholder issue based on discussions with @adamhjk and @reset

At the moment, as far as I can tell, elections will (at least once #3246 is fixed) happen under the following circumstances:

leader is departed
leader is stopped (gracefully / ungracefully)
leader is on the wrong side of a netsplit

All of these amount to the Supervisor's health becoming poor on the leader (I think alive=false gets set?).

Given that Habitat is aiming to provide services, and given that it's being used for high availability, I think it also needs to be able to trigger an election based on the leader's service health (if configured).

At the moment there is a health_check hook, which seems to be intended for use by monitoring and alerting solutions. It can return several values, which are mapped as follows:

0 => HealthCheck::Ok
1 => HealthCheck::Warning
2 => HealthCheck::Critical
3 => HealthCheck::Unknown
_ => HealthCheck::Unknown

It would be great if one of the following could happen:

a new hook was added, which allowed a leader (or any node?) to be forcefully departed on failure
the health_check hook was changed to allow a return value to forcefully depart a node (danger: the hook could inadvertently pass this return value through from another binary)

As @reset pointed out when reasoning about this, it needs to be remembered that health is transient and that electing a new leader is a major event:

what happens when a leader returns to health before promotion?
what happens when a leader returns to health during promotion?

To manage this, I think there needs to be at least some sort of holdoff period (or number of failed checks?) before the forced departure, to give the leader time to return to health.

It would be ideal (but would add complexity I suppose) if this could be configured. This would allow users to align the holdoff time with their desired mean time to recovery / recovery point objective.
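A minimal sketch of the holdoff idea, to make the discussion concrete. Everything here other than the exit-code mapping quoted above is an assumption for illustration: `HoldoffConfig`, `LeaderHealthTracker`, and their fields are hypothetical names and are not part of the Habitat Supervisor. The sketch also assumes that only `Critical` counts as a failure, and that both a minimum number of consecutive failed checks and a minimum unhealthy duration must be met before a forced departure is requested.

```rust
use std::time::{Duration, Instant};

/// Health states, mirroring the exit-code mapping quoted in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HealthCheck {
    Ok,
    Warning,
    Critical,
    Unknown,
}

impl HealthCheck {
    /// Maps a health_check hook exit code to a state; codes other than 0, 1, 2 are Unknown.
    pub fn from_exit_code(code: i32) -> Self {
        match code {
            0 => HealthCheck::Ok,
            1 => HealthCheck::Warning,
            2 => HealthCheck::Critical,
            _ => HealthCheck::Unknown,
        }
    }
}

/// Hypothetical user-facing holdoff settings (not an existing Habitat option).
#[derive(Debug, Clone, Copy)]
pub struct HoldoffConfig {
    /// Consecutive Critical results required before departure is considered.
    pub min_failed_checks: u32,
    /// Minimum time the leader must have been continuously Critical.
    pub min_unhealthy_for: Duration,
}

/// Hypothetical tracker that decides when a leader has been unhealthy long enough.
#[derive(Debug)]
pub struct LeaderHealthTracker {
    config: HoldoffConfig,
    failing_since: Option<Instant>,
    failed_checks: u32,
}

impl LeaderHealthTracker {
    pub fn new(config: HoldoffConfig) -> Self {
        LeaderHealthTracker { config, failing_since: None, failed_checks: 0 }
    }

    /// Records one check result observed at `now`.
    /// Returns true once both holdoff thresholds have been reached.
    pub fn record(&mut self, result: HealthCheck, now: Instant) -> bool {
        if result != HealthCheck::Critical {
            // Health is transient: any non-Critical result resets the holdoff.
            self.failing_since = None;
            self.failed_checks = 0;
            return false;
        }
        // Remember when the current run of failures started.
        let since = *self.failing_since.get_or_insert(now);
        self.failed_checks += 1;
        self.failed_checks >= self.config.min_failed_checks
            && now.duration_since(since) >= self.config.min_unhealthy_for
    }
}
```

With `min_failed_checks: 3` and `min_unhealthy_for` set to 30 seconds, a leader that reports Critical every 10 seconds would only be flagged on the fourth consecutive failure, and a single Ok result in between restarts the count. This covers the "returns to health before promotion" case; what should happen when the leader recovers during promotion is still an open question.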

eeyun · 2018-01-30T18:17:59Z

OK, I'm pinging our primary stakeholders on this. Ping! @adamhjk @reset @cm @baumanj

I think we all agree that this feature makes sense, but since it is a significant amount of work to implement, we'd prefer to get some more comments and thoughts shared here in a durable medium. As you get the time, please leave some comments with your thoughts!

stale · 2020-04-02T23:10:54Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

christophermaier · 2020-05-05T14:33:40Z

Still needed.

stale · 2022-10-01T03:27:02Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

stale · 2023-05-21T20:14:52Z

This issue has been automatically closed after being stale for 400 days. We still value your input and contribution. Please re-open the issue if desired and leave a comment with details.

christophermaier added A-supervisor Type:Additional Discussion labels Sep 21, 2017

eeyun changed the title ~~Election is never based on service health~~ [RFC] Election is never based on service health Nov 7, 2017

raskchanky added the V-sup label Apr 17, 2018

christophermaier self-assigned this Jun 5, 2018

christophermaier added this to the 1.0 Supervisor Stability milestone Jun 8, 2018

christophermaier removed their assignment Jun 8, 2018

jamessewell mentioned this issue Jul 16, 2018

Incorporate health into gossip #5326

Open

christophermaier added Focus:Gossip Protocol, Focus:Supervisor ProcessManagement, E-less-easy labels Nov 30, 2018

dmccown modified the milestones: 1.0 Supervisor (Planning), 1.0 Supervisor Dec 11, 2018

stale bot added the Stale label Apr 2, 2020

stale bot removed the Stale label May 5, 2020

davidMcneil mentioned this issue May 8, 2020

Improve service rolling update #7576

Closed

christophermaier added Focus:Supervisor and removed A-supervisor labels Jul 24, 2020

rahulgoel1 removed V-sup labels Jul 23, 2021

stale bot added the Stale label Oct 1, 2022

stale bot closed this as completed May 21, 2023

jamessewell commented Sep 21, 2017 · edited by stidhamlisa
