node_scheduling_eligibility metric is not correct #13549

Netlims · 2022-07-03T22:30:35Z

Hi,
I'm trying to sample some metrics from the active nodes in the cluster, and I'm using node_scheduling_eligibility as a filter for that.
I marked two nodes as ineligible, but their metrics still report node_scheduling_eligibility as eligible.
nomad node status:

ID        DC   Name             Class      Drain  Eligibility  Status
e8b66956  dc1  EC2AMAZ-LRR95F6  BE-SERVER  false  ineligible   ready
0a481d04  dc1  EC2AMAZ-LRR95F6  BE-SERVER  false  ineligible   ready
0227cdc9  dc1  ip-10-0-1-213    FE-SERVER  false  eligible     ready
e888efe7  dc1  EC2AMAZ-2SVI9N8  BE-SERVER  false  eligible     ready
6461eb74  dc1  EC2AMAZ-14C7A8A  BE-SERVER  false  eligible     ready
655d03e8  dc1  EC2AMAZ-GD8BC2D  BE-SERVER  false  ineligible   ready

From one of the ineligible nodes:
localhost:4646/v1/metrics?format=prometheus:

# HELP nomad_client_host_cpu_total nomad_client_host_cpu_total
# TYPE nomad_client_host_cpu_total gauge
nomad_client_host_cpu_total{cpu="cpu0",datacenter="dc1",host="EC2AMAZ-GD8BC2D",node_class="BE-SERVER",node_id="655d03e8-7414-7226-998c-e4ec9960b35a",node_scheduling_eligibility="eligible",node_status="ready"} 100
nomad_client_host_cpu_total{cpu="cpu1",datacenter="dc1",host="EC2AMAZ-GD8BC2D",node_class="BE-SERVER",node_id="655d03e8-7414-7226-998c-e4ec9960b35a",node_scheduling_eligibility="eligible",node_status="ready"} 100
nomad_client_host_cpu_total{cpu="cpu2",datacenter="dc1",host="EC2AMAZ-GD8BC2D",node_class="BE-SERVER",node_id="655d03e8-7414-7226-998c-e4ec9960b35a",node_scheduling_eligibility="eligible",node_status="ready"} 100
nomad_client_host_cpu_total{cpu="cpu3",datacenter="dc1",host="EC2AMAZ-GD8BC2D",node_class="BE-SERVER",node_id="655d03e8-7414-7226-998c-e4ec9960b35a",node_scheduling_eligibility="eligible",node_status="ready"} 100

The issue also appears when the metrics are not formatted for Prometheus.
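
To check which eligibility value a client is reporting, the label can be pulled out of the exposition text. Below is a minimal Python sketch, assuming label values contain no escaped quotes (true for the output above). `parse_labels` and `eligibility_by_node` are hypothetical helper names, not part of Nomad or any Prometheus client library.

```python
import re

# Matches key="value" pairs inside the {...} label set of one sample line.
# Assumption: label values contain no escaped quotes.
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_labels(sample_line):
    """Return the labels of one Prometheus sample line as a dict."""
    start = sample_line.find("{")
    end = sample_line.rfind("}")
    if start == -1 or end == -1:
        return {}
    return dict(LABEL_RE.findall(sample_line[start + 1:end]))


def eligibility_by_node(metrics_text):
    """Map node_id -> node_scheduling_eligibility as reported in the metrics text."""
    result = {}
    for line in metrics_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        labels = parse_labels(line)
        if "node_id" in labels and "node_scheduling_eligibility" in labels:
            result[labels["node_id"]] = labels["node_scheduling_eligibility"]
    return result
```

Feeding it the body of `/v1/metrics?format=prometheus` from a node gives one entry per node ID, which can then be compared by hand against the Eligibility column of `nomad node status`.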

Here is the telemetry block from the agent:

telemetry {
    disable_hostname = "false"
    publish_allocation_metrics = "true"
    publish_node_metrics = "true"
    prometheus_metrics = "true"
}

I'm using Nomad 1.2.6

Please tell me if you need anything else.
Thank you


tgross · 2022-07-05T20:20:54Z

Hi @Netlims!

I've verified this is broken just as you've said. This code was originally introduced in #6130 and was revised slightly in #8925, but honestly, now that I'm looking at it with some distance, I'm not sure it ever worked architecturally. Scheduling ineligibility is set in the server's view of the node, but we never push that information back to the client on its next heartbeat. We do push a tiny bit of data about the cluster back on each heartbeat in constructNodeServerInfoResponse, so it's at least possible to add this information there.

In any case this is a clear bug without a good workaround. That metric is misleading without a fix to send the ineligibility data back to the client (or removing the incorrect label). I'll mark this for roadmapping. Thanks for opening the issue @Netlims!
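
The mismatch described above can be illustrated with a toy model. This is a sketch only, not Nomad's actual code: every class and function name here is hypothetical, and the heartbeat response is reduced to a plain dict. It shows why a label derived from the client's local copy of the node stays at its old value when the response does not carry eligibility.

```python
from dataclasses import dataclass


@dataclass
class ServerNodeRecord:
    """The server's view of a node (hypothetical, heavily simplified)."""
    node_id: str
    eligibility: str = "eligible"


@dataclass
class Client:
    """A client agent that labels its metrics from its local copy of the node."""
    node_id: str
    local_eligibility: str = "eligible"

    def metric_label(self):
        return self.local_eligibility


def heartbeat_response(record, include_eligibility):
    """Build the data the server returns on a heartbeat.

    include_eligibility=False models the behaviour described in this issue;
    True models the possible fix of adding the field to the response.
    """
    response = {"node_id": record.node_id}
    if include_eligibility:
        response["eligibility"] = record.eligibility
    return response


def apply_heartbeat(client, response):
    # The client can only update what the server actually sends back.
    if "eligibility" in response:
        client.local_eligibility = response["eligibility"]


record = ServerNodeRecord("655d03e8", eligibility="ineligible")
client = Client("655d03e8")

apply_heartbeat(client, heartbeat_response(record, include_eligibility=False))
stale_label = client.metric_label()   # still the client's old local value

apply_heartbeat(client, heartbeat_response(record, include_eligibility=True))
fixed_label = client.metric_label()   # now follows the server's view
```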

Fuco1 · 2022-08-07T14:07:41Z

I can confirm it used to work, because we had some Grafana reports based on it. I was wondering since when, and why, they suddenly show all CPUs as eligible :)

Edit: we were using influx + telegraf for metric collection.

Edit2: please don't remove this tag as it's really useful for automated cluster scaling.

Vadim-Che · 2023-02-09T07:36:43Z

Hi.
While this metric is not working, is there any other way to monitor the eligibility of the cluster's nodes? Right now it looks like we cannot rely on any Nomad metrics: all nodes show as OK even if they are not reachable.
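
One possible stopgap, which does not fix the metric label itself, is to read eligibility from the server's view through the HTTP API, the same data `nomad node status` prints. Below is a minimal Python sketch, assuming the `/v1/nodes` list entries carry `ID`, `Name` and `SchedulingEligibility` fields and that the agent listens on `localhost:4646` without ACLs. The helper names are hypothetical.

```python
import json
import urllib.request


def ineligible_nodes(nodes):
    """Return (ID, Name) pairs for nodes the server considers ineligible."""
    return [
        (n["ID"], n["Name"])
        for n in nodes
        if n.get("SchedulingEligibility") == "ineligible"
    ]


def fetch_nodes(base_url="http://localhost:4646"):
    """Fetch the node list from a Nomad agent's HTTP API.

    The address is an assumption; ACL-enabled clusters also need a token header.
    """
    with urllib.request.urlopen(f"{base_url}/v1/nodes", timeout=5) as resp:
        return json.load(resp)
```

Calling `ineligible_nodes(fetch_nodes())` on a schedule and exporting the result to your monitoring system would give a server-side signal that can be joined to the client metrics by node ID.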

kholisrag · 2023-03-06T02:18:07Z

Any ETA on a fix for this? It would be good for nomad-autoscaler metrics filtering.

rostow · 2023-04-05T12:20:20Z

Same here, it seems not to be working for us either. We are using Prometheus as our APM. Would be awesome to have it working for autoscaling purposes as @kholisrag mentioned.

Netlims added the type/bug label Jul 3, 2022

Netlims changed the title ~~node_scheduling_eligibility is not correct~~ node_scheduling_eligibility metric is not correct Jul 3, 2022

tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jul 5, 2022

tgross self-assigned this Jul 5, 2022

tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Jul 5, 2022

tgross added the theme/metrics label Jul 5, 2022

tgross added the stage/accepted label (Confirmed, and intend to work on. No timeline commitment though.) Jul 5, 2022

tgross moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Jul 5, 2022

tgross removed their assignment Jul 5, 2022

hsmade mentioned this issue Aug 7, 2024

allocatable_memory wrong hashicorp/nomad-autoscaler#910

Open
