metrics: emit nomad.nomad.autopilot.healthy on followers #13219

Open
ionhashicorp opened this issue Jun 3, 2022 · 5 comments
Labels
hcc/cst Admin - internal stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/autopilot type/enhancement

Comments


ionhashicorp commented Jun 3, 2022

Proposal

Align Nomad with Consul's behaviour: specifically, make Nomad behave as Consul does with regard to the autopilot.healthy metric.

Further explanation:

  • both Consul and Nomad expose an endpoint for querying metrics
  • Nomad metrics endpoint: <NOMAD_ADDR>:4646/v1/metrics
  • Consul metrics endpoint: <CONSUL_ADDR>:8500/v1/agent/metrics
  • Consul LEADER returns metric consul.autopilot.healthy
  • Consul FOLLOWERS return metric consul.autopilot.healthy
  • Nomad LEADER returns metric nomad.nomad.autopilot.healthy
  • Nomad FOLLOWER does NOT return metric nomad.nomad.autopilot.healthy

Documentation:

Nomad endpoint:

  • endpoint: <NOMAD_ADDR>:4646/v1/metrics
  • NOMAD LEADER will return metric nomad.nomad.autopilot.healthy
  • NOMAD FOLLOWERS will NOT return metric nomad.nomad.autopilot.healthy

Consul endpoint:

  • endpoint: <CONSUL_ADDR>:8500/v1/agent/metrics
  • CONSUL LEADER will return metric consul.autopilot.healthy
  • CONSUL FOLLOWER will return metric consul.autopilot.healthy

Expected behaviour if this feature is implemented

  • NOMAD FOLLOWERS to return metric nomad.nomad.autopilot.healthy

The code in question that has us not setting this value at all if not the leader:

How can this behaviour be reproduced

Build a Nomad cluster and a Consul cluster:

  • nomad cluster: 3 servers (1 region)
  • consul cluster: 3 servers (1 datacenter, no region concept in consul)

Example Consul server configuration

# consul server config
datacenter = "dc1"
data_dir   = "/opt/consul"

bind_addr   = "{{ GetInterfaceIP \"ens5\" }}"
client_addr = "0.0.0.0"

server           = true
raft_protocol    = 3
bootstrap_expect = 3

retry_join     = ["tag_key=Project tag_value=consul"]
retry_max      = 5
retry_interval = "15s"

# Consul UI
ui_config {
  enabled = true
}

# service mesh
connect {
  enabled = true
}

addresses {
  grpc = "127.0.0.1"
}

ports {
  grpc = 8502
}

telemetry {
  prometheus_retention_time = "72h"
  disable_hostname = true
}

Example Nomad server configuration

# nomad server config
region = "global"
datacenter = "dc1"
data_dir = "/opt/nomad"

bind_addr = "{{ GetInterfaceIP \"ens5\" }}"

server {
  enabled = true
  raft_protocol = 3
  bootstrap_expect = 3
  
  server_join {
    retry_join = ["tag_key=Project tag_value=nomad"]
    retry_max = 5
    retry_interval = "15s"
  }
}

consul {
  address = "127.0.0.1:8500"
}


telemetry {
  collection_interval = "1s"
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}

Tests:

Nomad cluster:

  • endpoint: <NOMAD_ADDR>:4646/v1/metrics
  • results:
    • NOMAD LEADER will return metric nomad.nomad.autopilot.healthy
    • NOMAD FOLLOWERS will NOT return metric nomad.nomad.autopilot.healthy
  • example queries:
# AWS
# Query NOMAD LEADER
$ PRIVATE_IP=$(curl --silent http://169.254.169.254/latest/meta-data/local-ipv4)
$ curl -sS $PRIVATE_IP:4646/v1/metrics | jq | grep healthy -A2
      "Name": "nomad.nomad.autopilot.healthy",
      "Value": 1
    },
$
# AWS
# Query NOMAD FOLLOWER (no result)
$ PRIVATE_IP=$(curl --silent http://169.254.169.254/latest/meta-data/local-ipv4)
$ curl -sS $PRIVATE_IP:4646/v1/metrics | jq | grep healthy -A2
$

Consul Cluster:

  • endpoint: <CONSUL_ADDR>:8500/v1/agent/metrics
  • results:
    • CONSUL LEADER will return metric consul.autopilot.healthy
    • CONSUL FOLLOWER will return metric consul.autopilot.healthy
  • example queries:
# AWS
# Query CONSUL LEADER
$ PRIVATE_IP=$(curl --silent http://169.254.169.254/latest/meta-data/local-ipv4)
$ curl -sS $PRIVATE_IP:8500/v1/agent/metrics | jq | grep healthy -A1
      "Name": "consul.autopilot.healthy",
      "Value": 1,
# AWS
# Query CONSUL FOLLOWER
$ PRIVATE_IP=$(curl --silent http://169.254.169.254/latest/meta-data/local-ipv4)
$ curl -sS $PRIVATE_IP:8500/v1/agent/metrics | jq | grep healthy -A1
      "Name": "consul.autopilot.healthy",
      "Value": 1,
$
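
The per-server checks above could be folded into a single pass over the cluster. This is only a minimal sketch: `has_gauge` and `check_servers` are hypothetical helper names, the server addresses in the usage comment are placeholders, and it assumes the servers are reachable over plain HTTP. The match is a simple text search for the gauge's "Name" field, not a full JSON parse.

```shell
#!/bin/sh

# has_gauge NAME
# Reads a metrics JSON document on stdin and succeeds if it contains
# a "Name" field with exactly the given metric name.
has_gauge() {
  # Escape the dots so they match literally instead of "any character".
  pattern=$(printf '%s' "$1" | sed 's/\./\\./g')
  grep -q "\"Name\": *\"${pattern}\""
}

# check_servers URL_PATH METRIC ADDR...
# Queries every host:port given and reports whether the metric is present.
check_servers() {
  path=$1
  metric=$2
  shift 2
  for addr in "$@"; do
    body=$(curl -sS "http://${addr}${path}") || {
      echo "${addr}: unreachable"
      continue
    }
    if printf '%s' "$body" | has_gauge "$metric"; then
      echo "${addr}: ${metric} present"
    else
      echo "${addr}: ${metric} MISSING"
    fi
  done
}

# Example usage (placeholder addresses):
#   check_servers /v1/metrics nomad.nomad.autopilot.healthy \
#     10.0.0.1:4646 10.0.0.2:4646 10.0.0.3:4646
#   check_servers /v1/agent/metrics consul.autopilot.healthy \
#     10.0.1.1:8500 10.0.1.2:8500 10.0.1.3:8500
```

Given the behaviour reported in this issue, the Nomad run would be expected to report the metric as present only on the leader, while the Consul run would report it on every server.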

Workaround for Nomad:

  • query only the NOMAD LEADER for metric nomad.nomad.autopilot.healthy
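
A rough sketch of how that workaround might be scripted. It relies on Nomad's /v1/status/leader endpoint, which returns the leader's RPC address as a JSON string. The helper names `leader_http_addr` and `autopilot_healthy_from_leader` are hypothetical, and the sketch assumes the leader's HTTP API listens on the same IP at the default port 4646 over plain HTTP and that jq is installed.

```shell
#!/bin/sh

# leader_http_addr [HTTP_PORT]
# Reads the /v1/status/leader response on stdin (a JSON string such as
# "10.0.0.1:4647") and prints the same host with the HTTP port instead
# of the RPC port. The port defaults to 4646 (an assumption).
leader_http_addr() {
  http_port=${1:-4646}
  # Drop the JSON quotes and newline, then swap the trailing port.
  tr -d '"\n' | sed "s/:[0-9]*\$/:${http_port}/"
}

# autopilot_healthy_from_leader ADDR
# ADDR is the host:port of any reachable Nomad server. Prints the value
# of nomad.nomad.autopilot.healthy as reported by the current leader.
autopilot_healthy_from_leader() {
  nomad_addr=$1
  leader=$(curl -sS "http://${nomad_addr}/v1/status/leader" | leader_http_addr)
  # An empty result means the leader lookup failed.
  [ -n "$leader" ] || return 1
  curl -sS "http://${leader}/v1/metrics" |
    jq '.Gauges[] | select(.Name == "nomad.nomad.autopilot.healthy") | .Value'
}

# Example usage (placeholder address):
#   autopilot_healthy_from_leader 10.0.0.2:4646
```

Because the leader is looked up on every call, the script keeps working across leader elections, at the cost of one extra request per check.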
@tgross tgross added the stage/accepted Confirmed, and intend to work on. No timeline committment though. label Jun 6, 2022
Member

tgross commented Jun 6, 2022

Seems like a reasonable thing to do. This code was mostly lifted directly out of Consul long ago and hasn't been touched much since. We've got an open issue #9570 for updating our autopilot implementation tentatively scheduled for Nomad 1.4.0. I'll make a note of this issue over there.

@mikenomitch mikenomitch added this to the 1.4.0 milestone Jun 17, 2022
Contributor

mikenomitch commented Jun 17, 2022

@tgross added to the 1.4 milestone after reading your comment.

Is the engineering effort on this relatively small? Seems nice, but not a major improvement. Just want to make sure we don't commit to something big. Can remove the milestone if it is big.

Member

tgross commented Sep 13, 2022

Just a heads up that the updated raft-autopilot is shipping in Nomad 1.4.0 beta (and backports) but we've run out of time to make additional feature updates and it's not clear to me which of this is going to be automatically covered by the library update vs not. So what I'm going to do is to self-assign this issue and verify which of the reported behaviors we're still short on. If everything "just works" now and we can close the issue, great. Otherwise I'll report back here on what specific things need to be done and come up with a scope of work for how big of a lift that is. (Or if it's trivial, just knock it out real quick 😁)

@tgross tgross self-assigned this Sep 13, 2022
@tgross tgross modified the milestones: 1.4.0, 1.4.x Sep 22, 2022
Member

tgross commented Sep 27, 2022

Ok, I had a chance to review this issue in detail following the switch to the raft-autopilot library. The ask here is "just" to emit the autopilot state metrics from the followers. But the reason we don't emit the metric from the followers is that in Nomad we only run autopilot on the leader. The followers don't know the autopilot state, so they can't emit correct metrics for it!

Whereas in Consul they run autopilot from all nodes as of Consul 1.12.0 (ref changelog and hashicorp/consul#12617), because:

For some upcoming features we need autopilot on all servers to continually track the state of all servers. This PR pulls in a raft-autopilot update and enables that functionality.

It's not obvious to me what those features actually were and I don't see any newer autopilot-related changelog entries in Consul either, so that might be for upcoming stuff? (I'll follow up with Matt, who wrote #12617, but I'm not going to ping him here.) So unless we intend to add whatever those features are, I don't see any solid reason to do the work of running autopilot on the followers too. (Note that if we did, we would need to disable reconciliation on non-leaders; see hashicorp/raft-autopilot#16.)

I'll keep this open till I get a chance to follow up with Matt (he's OOO), but otherwise I think we can close this out.

@tgross tgross changed the title from "Nomad FOLLOWERS return nomad.nomad.autopilot.healthy" to "emit nomad.nomad.autopilot.healthy on followers" Sep 27, 2022
@tgross tgross changed the title from "emit nomad.nomad.autopilot.healthy on followers" to "metrics: emit nomad.nomad.autopilot.healthy on followers" Sep 27, 2022
Member

tgross commented Sep 27, 2022

Ok, I had a chance to chat with Matt sooner than expected, and those autopilot features were intended to support the new Consul dataplane work (see CSL-166 for internal folks). So nothing that helps us on Nomad directly.

However, Matt pointed out that not having autopilot running on the followers and getting autopilot state updates from the leader means that the metrics even on the leader are often wrong, because a previous leader will still have the metrics in memory and will flush them when asked, and the new leader will have stale data. So it probably makes sense for us to implement running autopilot on the followers in the non-reconciling mode, which would give us the metrics on all the servers. hashicorp/consul@a553982 has the work that Consul did to implement this and it feels fairly small. I'll try to pick it up in the 1.4.x period.

@tgross tgross modified the milestones: 1.4.x, 1.5.0 Nov 30, 2022
@tgross tgross modified the milestones: 1.5.0, 1.5.x Jan 20, 2023
@tgross tgross removed this from the 1.5.x milestone May 17, 2023
@mikenomitch mikenomitch added the hcc/cst Admin - internal label May 22, 2023
@tgross tgross removed their assignment Jun 20, 2023
Projects
Status: Needs Roadmapping