-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
metrics: emit nomad.nomad.autopilot.healthy
on followers
#13219
Comments
Seems like a reasonable thing to do. This code was mostly lifted directly out of Consul long ago and hasn't been touched much since. We've got an open issue #9570 for updating our autopilot implementation tentatively scheduled for Nomad 1.4.0. I'll make a note of this issue over there. |
@tgross added to the 1.4 milestone after reading your comment. Is the engineering effort on this relatively small? Seems nice, but not a major improvement. Just want to make sure we dont commit to something big. Can remove the milestone if it is big. |
Just a heads up that the updated raft-autopilot is shipping in Nomad 1.4.0 beta (and backports) but we've run out of time to make additional feature updates and it's not clear to me which of this is going to be automatically covered by the library update vs not. So what I'm going to do is to self-assign this issue and verify which of the reported behaviors we're still short on. If everything "just works" now and we can close the issue, great. Otherwise I'll report back here on what specific things need to be done and come up with a scope of work for how big of a lift that is. (Or if it's trivial, just knock it out real quick 😁) |
Ok, I had a chance to review this issue in a detail following the switch to the raft-autopilot library. The ask here is "just" to emit the autopilot state metrics from the followers. But the reason we don't emit the metric from the followers is because that in Nomad we only run autopilot on the leader. The followers don't know the autopilot state, so they can't emit correct metrics for it! Whereas in Consul they run autopilot from all nodes as of Consul 1.12.0 (ref changelog and hashicorp/consul#12617), because:
It's not obvious to me what those features actually were and I don't see any newer autopilot-related changelog entries in Consul either, so that might be for upcoming stuff? (I'll follow up with the Matt who wrote #12617 but I'm not going to ping him here.) So unless we intend to add whatever those features are, I don't see any solid reason to do the work of running autopilot on the followers too. (Note that if we did, we would need to disable reconciliation on non-leaders hashicorp/raft-autopilot#16) I'll keep this open till I get a chance to follow-up with Matt (he's OOO), but otherwise I think we can close this out. |
nomad.nomad.autopilot.healthy
on followers
nomad.nomad.autopilot.healthy
on followersnomad.nomad.autopilot.healthy
on followers
Ok, I had a chance to chat with Matt sooner than expected, and those autopilot features were intended to support the new Consul dataplane work (see CSL-166 for internal folks). So nothing that helps us on Nomad directly. However, Matt pointed out that not having autopilot running on the followers and getting autopilot state updates from the leader means that the metrics even on the leader are often wrong, because a previous leader will still have the metrics in memory and will flush them when asked, and the new leader will have stale data. So it probably makes sense for us to implement running autopilot on the followers in the non-reconciling mode, which would give us the metrics on all the servers. hashicorp/consul@a553982 has the work that Consul did to implement this and it feels fairly small. I'll try to pick it up in the 1.4.x period. |
Proposal
Align Nomad to Consul behaviour, specifically make Nomad behave as Consul does in regards to
autopilot.healthy
metricFurther explanation:
<NOMAD_ADDR>:4646/v1/metrics
<CONSUL_ADDR>:8500/v1/agent/metrics
consul.autopilot.healthy
consul.autopilot.healthy
nomad.nomad.autopilot.healthy
nomad.nomad.autopilot.healthy
Documentation:
Nomad endpoint:
nomad.nomad.autopilot.healthy
nomad.nomad.autopilot.healthy
Consul endpoint:
consul.autopilot.healthy
consul.autopilot.healthy
Expected behaviour if this feature will be implemented
The code in question that has us not setting this value at all if not the leader:
nomad/nomad/autopilot.go
Line 76 in 0af4762
How can this behaviour be reproduced
Build a nomad & consul cluster:
Example Consul sever configuration
Example Nomad server configuration
Tests:
Nomad cluster:
<NOMAD_ADDR>:4646/v1/metrics
nomad.nomad.autopilot.healthy
nomad.nomad.autopilot.healthy
Consul Cluster:
<CONSUL_ADDR>:8500/v1/agent/metrics
consul.autopilot.healthy
consul.autopilot.healthy
Workaround for Nomad:
nomad.nomad.autopilot.healthy
The text was updated successfully, but these errors were encountered: