metrics: emit `nomad.nomad.autopilot.healthy` on followers #13219

ionhashicorp · 2022-06-03T14:00:54Z

Proposal

Align Nomad with Consul's behaviour: Nomad followers should emit the autopilot.healthy metric, as Consul followers already do.

Further explanation:

Both Consul and Nomad expose an HTTP endpoint for querying metrics:
Nomad metrics endpoint: <NOMAD_ADDR>:4646/v1/metrics
Consul metrics endpoint: <CONSUL_ADDR>:8500/v1/agent/metrics
The Consul leader returns the metric consul.autopilot.healthy
Consul followers return the metric consul.autopilot.healthy
The Nomad leader returns the metric nomad.nomad.autopilot.healthy
Nomad followers do NOT return the metric nomad.nomad.autopilot.healthy

Documentation:

Nomad endpoint:

endpoint: <NOMAD_ADDR>:4646/v1/metrics
NOMAD LEADER will return metric nomad.nomad.autopilot.healthy
NOMAD FOLLOWERS will NOT return metric nomad.nomad.autopilot.healthy

Consul endpoint:

endpoint: <CONSUL_ADDR>:8500/v1/agent/metrics
CONSUL LEADER will return metric consul.autopilot.healthy
CONSUL FOLLOWER will return metric consul.autopilot.healthy

Expected behaviour if this feature is implemented

Nomad followers return the metric nomad.nomad.autopilot.healthy

The code in question, which does not set this value at all unless the server is the leader:

nomad/nomad/autopilot.go

Line 76 in 0af4762

func (d *AutopilotDelegate) NotifyHealth(health autopilot.OperatorHealthReply) {

How can this behaviour be reproduced

Build a nomad & consul cluster:

nomad cluster: 3 servers (1 region)
consul cluster: 3 servers (1 datacenter, no region concept in consul)

Example Consul server configuration

# consul server config
datacenter = "dc1"
data_dir   = "/opt/consul"

bind_addr   = "{{ GetInterfaceIP \"ens5\" }}"
client_addr = "0.0.0.0"

server           = true
raft_protocol    = 3
bootstrap_expect = 3

retry_join     = ["tag_key=Project tag_value=consul"]
retry_max      = 5
retry_interval = "15s"

# Consul UI
ui_config {
  enabled = true
}

# service mesh
connect {
  enabled = true
}

addresses {
  grpc = "127.0.0.1"
}

ports {
  grpc = 8502
}

telemetry {
  prometheus_retention_time = "72h"
  disable_hostname = true
}

Example Nomad server configuration

# nomad server config
region = "global"
datacenter = "dc1"
data_dir = "/opt/nomad"

bind_addr = "{{ GetInterfaceIP \"ens5\" }}"

server {
  enabled = true
  raft_protocol = 3
  bootstrap_expect = 3
  
  server_join {
    retry_join = ["tag_key=Project tag_value=nomad"]
    retry_max = 5
    retry_interval = "15s"
  }
}

consul {
  address = "127.0.0.1:8500"
}


telemetry {
  collection_interval = "1s"
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}

Tests:

Nomad cluster:

endpoint: <NOMAD_ADDR>:4646/v1/metrics
results:
- NOMAD LEADER will return metric nomad.nomad.autopilot.healthy
- NOMAD FOLLOWERS will NOT return metric nomad.nomad.autopilot.healthy
example queries:

# AWS
# Query NOMAD LEADER
$ PRIVATE_IP=$(curl --silent http://169.254.169.254/latest/meta-data/local-ipv4)
$ curl -sS $PRIVATE_IP:4646/v1/metrics | jq . | grep healthy -A2
      "Name": "nomad.nomad.autopilot.healthy",
      "Value": 1
    },
$

# AWS
# Query NOMAD FOLLOWER (no result)
$ PRIVATE_IP=$(curl --silent http://169.254.169.254/latest/meta-data/local-ipv4)
$ curl -sS $PRIVATE_IP:4646/v1/metrics | jq . | grep healthy -A2
$

Consul Cluster:

endpoint: <CONSUL_ADDR>:8500/v1/agent/metrics
results:
- CONSUL LEADER will return metric consul.autopilot.healthy
- CONSUL FOLLOWER will return metric consul.autopilot.healthy
example queries:

# AWS
# Query CONSUL LEADER
$ PRIVATE_IP=$(curl --silent http://169.254.169.254/latest/meta-data/local-ipv4)
$ curl -sS $PRIVATE_IP:8500/v1/agent/metrics | jq . | grep healthy -A1
      "Name": "consul.autopilot.healthy",
      "Value": 1,

# AWS
# Query CONSUL FOLLOWER
$ PRIVATE_IP=$(curl --silent http://169.254.169.254/latest/meta-data/local-ipv4)
$ curl -sS $PRIVATE_IP:8500/v1/agent/metrics | jq . | grep healthy -A1
      "Name": "consul.autopilot.healthy",
      "Value": 1,
$
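The manual curl and grep checks above can also be scripted. The following is a minimal sketch, assuming the metrics JSON carries a top-level `Gauges` list whose entries have `Name` and `Value` keys, as the output above suggests. The helper name `find_gauge` and both sample payloads are hypothetical and only mimic the shape of the real responses.

```python
import json
from typing import Optional


def find_gauge(metrics_json: str, name: str) -> Optional[float]:
    """Return the value of the named gauge, or None if the agent did not emit it."""
    payload = json.loads(metrics_json)
    # "Gauges" may be missing or null, so fall back to an empty list.
    for gauge in payload.get("Gauges") or []:
        if gauge.get("Name") == name:
            return gauge.get("Value")
    return None


# Hypothetical, trimmed payloads shaped like the responses shown above.
leader_payload = json.dumps(
    {"Gauges": [{"Name": "nomad.nomad.autopilot.healthy", "Value": 1}]}
)
follower_payload = json.dumps(
    {"Gauges": [{"Name": "nomad.runtime.num_goroutines", "Value": 120}]}
)

print(find_gauge(leader_payload, "nomad.nomad.autopilot.healthy"))    # gauge present
print(find_gauge(follower_payload, "nomad.nomad.autopilot.healthy"))  # gauge absent
```

Returning `None` for a missing gauge, rather than 0, keeps "not emitted" distinguishable from "emitted as unhealthy", which is the distinction this issue is about.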

Workaround for Nomad:

query only the NOMAD LEADER for metric nomad.nomad.autopilot.healthy
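One way to automate this workaround is to ask any server for the current leader and then read the gauge from that server only. The sketch below assumes plain HTTP without ACL tokens or TLS, that `/v1/status/leader` returns the leader's RPC address as a JSON string, and that the leader's HTTP API listens on the same host on port 4646. The function names and the injectable `fetch` parameter are hypothetical, added so the logic can be exercised without a live cluster.

```python
import json
import urllib.request
from typing import Callable, Optional


def http_get(url: str) -> str:
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def leader_autopilot_healthy(
    nomad_addr: str,
    fetch: Callable[[str], str] = http_get,
    http_port: int = 4646,
) -> Optional[bool]:
    """Ask any server who the leader is, then read the gauge from the leader."""
    # The status endpoint answers with a JSON string such as "10.0.0.5:4647".
    leader_rpc = json.loads(fetch(f"http://{nomad_addr}:{http_port}/v1/status/leader"))
    if not leader_rpc:
        return None  # no leader is known right now
    leader_host = leader_rpc.rsplit(":", 1)[0]
    # Assumption: the leader serves HTTP on the same host at http_port.
    metrics = json.loads(fetch(f"http://{leader_host}:{http_port}/v1/metrics"))
    for gauge in metrics.get("Gauges") or []:
        if gauge.get("Name") == "nomad.nomad.autopilot.healthy":
            return gauge.get("Value") == 1
    return None  # the leader did not emit the gauge
```

Calling `leader_autopilot_healthy("<NOMAD_ADDR>")` against any server in the region should then give the same answer regardless of which server was contacted first.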


tgross · 2022-06-06T14:50:39Z

Seems like a reasonable thing to do. This code was mostly lifted directly out of Consul long ago and hasn't been touched much since. We've got an open issue #9570 for updating our autopilot implementation tentatively scheduled for Nomad 1.4.0. I'll make a note of this issue over there.

mikenomitch · 2022-06-17T18:09:19Z

@tgross added to the 1.4 milestone after reading your comment.

Is the engineering effort on this relatively small? Seems nice, but not a major improvement. Just want to make sure we don't commit to something big. Can remove the milestone if it is big.

tgross · 2022-09-13T16:20:58Z

Just a heads up that the updated raft-autopilot is shipping in Nomad 1.4.0 beta (and backports) but we've run out of time to make additional feature updates and it's not clear to me which of this is going to be automatically covered by the library update vs not. So what I'm going to do is to self-assign this issue and verify which of the reported behaviors we're still short on. If everything "just works" now and we can close the issue, great. Otherwise I'll report back here on what specific things need to be done and come up with a scope of work for how big of a lift that is. (Or if it's trivial, just knock it out real quick 😁)

tgross · 2022-09-27T19:55:54Z

Ok, I had a chance to review this issue in detail following the switch to the raft-autopilot library. The ask here is "just" to emit the autopilot state metrics from the followers. But the reason we don't emit the metric from the followers is that in Nomad we only run autopilot on the leader. The followers don't know the autopilot state, so they can't emit correct metrics for it!

Whereas in Consul they run autopilot from all nodes as of Consul 1.12.0 (ref changelog and hashicorp/consul#12617), because:

For some upcoming features we need autopilot on all servers to continually track the state of all servers. This PR pulls in a raft-autopilot update and enables that functionality.

It's not obvious to me what those features actually were and I don't see any newer autopilot-related changelog entries in Consul either, so that might be for upcoming stuff? (I'll follow up with Matt, who wrote #12617, but I'm not going to ping him here.) So unless we intend to add whatever those features are, I don't see any solid reason to do the work of running autopilot on the followers too. (Note that if we did, we would need to disable reconciliation on non-leaders; see hashicorp/raft-autopilot#16.)

I'll keep this open till I get a chance to follow up with Matt (he's OOO), but otherwise I think we can close this out.

tgross · 2022-09-27T20:39:35Z

Ok, I had a chance to chat with Matt sooner than expected, and those autopilot features were intended to support the new Consul dataplane work (see CSL-166 for internal folks). So nothing that helps us on Nomad directly.

However, Matt pointed out that not having autopilot running on the followers and getting autopilot state updates from the leader means that the metrics even on the leader are often wrong, because a previous leader will still have the metrics in memory and will flush them when asked, and the new leader will have stale data. So it probably makes sense for us to implement running autopilot on the followers in the non-reconciling mode, which would give us the metrics on all the servers. hashicorp/consul@a553982 has the work that Consul did to implement this and it feels fairly small. I'll try to pick it up in the 1.4.x period.
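The stale-metric problem on a previous leader can be illustrated with a toy model. This is not Nomad's code: the `Server` class, the `notify_health` guard, and the gauge name are hypothetical stand-ins for a per-server in-memory metrics sink that only the current leader writes to. It covers only the previous-leader half of the problem described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Server:
    name: str
    # Stand-in for the server's in-memory metrics sink.
    gauges: Dict[str, float] = field(default_factory=dict)

    def notify_health(self, is_leader: bool, healthy: bool) -> None:
        # Models the leader-only guard: followers never write the gauge.
        if is_leader:
            self.gauges["autopilot.healthy"] = 1.0 if healthy else 0.0


def run_round(servers: List[Server], leader: str, healthy: bool) -> None:
    """Deliver one health notification to every server."""
    for s in servers:
        s.notify_health(is_leader=(s.name == leader), healthy=healthy)


servers = [Server("a"), Server("b"), Server("c")]
run_round(servers, leader="a", healthy=True)   # "a" leads a healthy cluster
run_round(servers, leader="b", healthy=False)  # leadership moves, cluster degrades

for s in servers:
    print(s.name, s.gauges.get("autopilot.healthy"))
# a 1.0   (stale: the old leader still reports its last value)
# b 0.0
# c None  (never led, so it never emitted the gauge)
```

In this model, updating the gauge on every server would remove both the stale value on "a" and the missing value on "c", which matches the motivation for running autopilot on followers in non-reconciling mode.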

ionhashicorp added the type/enhancement label Jun 3, 2022

tgross added the theme/autopilot label Jun 6, 2022

tgross added the stage/accepted label Jun 6, 2022

tgross mentioned this issue Jun 6, 2022

Adopt raft-autopilot library #9570

Closed

mikenomitch added this to the 1.4.0 milestone Jun 17, 2022

tgross self-assigned this Sep 13, 2022

tgross modified the milestones: 1.4.0, 1.4.x Sep 22, 2022

tgross changed the title ~~Nomad FOLLOWERS return nomad.nomad.autopilot.healthy~~ emit nomad.nomad.autopilot.healthy on followers Sep 27, 2022

tgross changed the title ~~emit nomad.nomad.autopilot.healthy on followers~~ metrics: emit nomad.nomad.autopilot.healthy on followers Sep 27, 2022

tgross modified the milestones: 1.4.x, 1.5.0 Nov 30, 2022

This was referenced Dec 16, 2022

disable scheduling until initial snapshot is restored #15560

Open

expand Nomad's own Consul health checks and/or tags to include autopilot health #15561

Open

tgross modified the milestones: 1.5.0, 1.5.x Jan 20, 2023

tgross removed this from the 1.5.x milestone May 17, 2023

mikenomitch added the hcc/cst label May 22, 2023

tgross removed their assignment Jun 20, 2023
