node_scheduling_eligibility metric is not correct #13549

Open
Netlims opened this issue Jul 3, 2022 · 5 comments
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/metrics type/bug

Comments


Netlims commented Jul 3, 2022

Hi,
I'm trying to sample some metrics from the active nodes in the cluster, and I'm using the node_scheduling_eligibility label as a filter.
I tried marking two nodes as ineligible, but their metrics still report node_scheduling_eligibility as eligible.
nomad node status:

ID        DC   Name             Class      Drain  Eligibility  Status
e8b66956  dc1  EC2AMAZ-LRR95F6  BE-SERVER  false  ineligible   ready
0a481d04  dc1  EC2AMAZ-LRR95F6  BE-SERVER  false  ineligible   ready
0227cdc9  dc1  ip-10-0-1-213    FE-SERVER  false  eligible     ready
e888efe7  dc1  EC2AMAZ-2SVI9N8  BE-SERVER  false  eligible     ready
6461eb74  dc1  EC2AMAZ-14C7A8A  BE-SERVER  false  eligible     ready
655d03e8  dc1  EC2AMAZ-GD8BC2D  BE-SERVER  false  ineligible   ready

From one of the ineligible nodes:
localhost:4646/v1/metrics?format=prometheus:

# HELP nomad_client_host_cpu_total nomad_client_host_cpu_total
# TYPE nomad_client_host_cpu_total gauge
nomad_client_host_cpu_total{cpu="cpu0",datacenter="dc1",host="EC2AMAZ-GD8BC2D",node_class="BE-SERVER",node_id="655d03e8-7414-7226-998c-e4ec9960b35a",node_scheduling_eligibility="eligible",node_status="ready"} 100
nomad_client_host_cpu_total{cpu="cpu1",datacenter="dc1",host="EC2AMAZ-GD8BC2D",node_class="BE-SERVER",node_id="655d03e8-7414-7226-998c-e4ec9960b35a",node_scheduling_eligibility="eligible",node_status="ready"} 100
nomad_client_host_cpu_total{cpu="cpu2",datacenter="dc1",host="EC2AMAZ-GD8BC2D",node_class="BE-SERVER",node_id="655d03e8-7414-7226-998c-e4ec9960b35a",node_scheduling_eligibility="eligible",node_status="ready"} 100
nomad_client_host_cpu_total{cpu="cpu3",datacenter="dc1",host="EC2AMAZ-GD8BC2D",node_class="BE-SERVER",node_id="655d03e8-7414-7226-998c-e4ec9960b35a",node_scheduling_eligibility="eligible",node_status="ready"} 100

The issue also appears when the metrics are not formatted for Prometheus.
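
A minimal sketch of how the mismatch could be checked from a scrape, assuming the Prometheus text format shown above. The helper names (`parse_labels`, `reported_eligibility`) are illustrative and not part of Nomad; the sample line is copied from the output in this report.

```python
import re

# Matches key="value" pairs inside the {...} label set of a sample line.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_labels(line):
    """Return the label dict of one Prometheus text-format sample line."""
    start, end = line.find("{"), line.rfind("}")
    if start == -1 or end == -1:
        return {}
    return dict(LABEL_RE.findall(line[start + 1:end]))


def reported_eligibility(metrics_text):
    """Map node_id -> set of node_scheduling_eligibility values in a scrape."""
    seen = {}
    for line in metrics_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        labels = parse_labels(line)
        if "node_id" in labels and "node_scheduling_eligibility" in labels:
            seen.setdefault(labels["node_id"], set()).add(
                labels["node_scheduling_eligibility"]
            )
    return seen


# One sample line taken from the scrape in this report.
sample = '''# TYPE nomad_client_host_cpu_total gauge
nomad_client_host_cpu_total{cpu="cpu0",datacenter="dc1",host="EC2AMAZ-GD8BC2D",node_class="BE-SERVER",node_id="655d03e8-7414-7226-998c-e4ec9960b35a",node_scheduling_eligibility="eligible",node_status="ready"} 100
'''

result = reported_eligibility(sample)
print(result)
```

Comparing this mapping against the Eligibility column of `nomad node status` makes the discrepancy for node 655d03e8 explicit.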

Here is the telemetry block from the agent:

telemetry {
    disable_hostname = "false"
    publish_allocation_metrics = "true"
    publish_node_metrics = "true"
    prometheus_metrics = "true"
}

I'm using Nomad 1.2.6

Please tell me if you need anything else.
Thank you

@Netlims Netlims changed the title node_scheduling_eligibility is not correct node_scheduling_eligibility metric is not correct Jul 3, 2022
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jul 5, 2022
@tgross tgross self-assigned this Jul 5, 2022
@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Jul 5, 2022
Member

tgross commented Jul 5, 2022

Hi @Netlims!

I've verified this is broken just as you've said. This code was originally introduced in #6130 and was revised slightly in #8925, but honestly, now that I'm looking at it with some distance, I'm not sure it ever worked architecturally. Scheduling ineligibility is set in the server's view of the node, but we never push that information back to the client on its next heartbeat. We do push a tiny bit of data back about the cluster on each heartbeat in constructNodeServerInfoResponse, so it's at least possible to add this information there.

In any case this is a clear bug without a good workaround. That metric is misleading without a fix to send the ineligibility data back to the client (or removing the incorrect label). I'll mark this for roadmapping. Thanks for opening the issue @Netlims!
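
To illustrate the gap described above, here is a toy model in Python. It is not Nomad's code: the `Server` and `Client` classes, the `include_eligibility` flag, and the placeholder server address are all hypothetical, and the real heartbeat response carries different fields. It only shows why a label stamped by the client stays stale unless the heartbeat response carries the server's value.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    # Server-side view of the cluster: node_id -> eligibility string.
    eligibility: dict = field(default_factory=dict)

    def set_eligibility(self, node_id, value):
        self.eligibility[node_id] = value

    def heartbeat(self, node_id, include_eligibility):
        # The response always carries some cluster info (placeholder address);
        # eligibility is included only when the hypothetical flag is set.
        resp = {"servers": ["10.0.0.1:4647"]}
        if include_eligibility:
            resp["eligibility"] = self.eligibility.get(node_id, "eligible")
        return resp


@dataclass
class Client:
    node_id: str
    eligibility: str = "eligible"  # value the client stamps onto metric labels

    def on_heartbeat(self, resp):
        # The local copy changes only if the server actually sent a value.
        if "eligibility" in resp:
            self.eligibility = resp["eligibility"]

    def metric_labels(self):
        return {
            "node_id": self.node_id,
            "node_scheduling_eligibility": self.eligibility,
        }


server = Server()
client = Client("655d03e8")
server.set_eligibility("655d03e8", "ineligible")

# Heartbeat without eligibility: the client's label stays stale.
client.on_heartbeat(server.heartbeat("655d03e8", include_eligibility=False))
stale = client.metric_labels()["node_scheduling_eligibility"]

# Heartbeat with eligibility: the label now matches the server's view.
client.on_heartbeat(server.heartbeat("655d03e8", include_eligibility=True))
fresh = client.metric_labels()["node_scheduling_eligibility"]
```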

@tgross tgross added the stage/accepted Confirmed, and intend to work on. No timeline commitment though. label Jul 5, 2022
@tgross tgross moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Jul 5, 2022
@tgross tgross removed their assignment Jul 5, 2022
Contributor

Fuco1 commented Aug 7, 2022

I can confirm it worked, because we had some Grafana reports based on this. I was wondering since when, and why, they suddenly show all CPUs as eligible :)

Edit: we were using influx + telegraf for metric collection.

Edit2: please don't remove this tag as it's really useful for automated cluster scaling.

@Vadim-Che

Hi.
While this metric is not working, is there any other way to monitor the eligibility of the cluster's nodes? Right now it looks like we cannot rely on any Nomad metrics: all nodes are reported as OK even if they are not reachable.
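
One possible stopgap, sketched here under assumptions rather than as a maintainer recommendation: since the server's view of the node is the one that is correct, eligibility can be read from the server-backed `/v1/nodes` HTTP endpoint instead of the client metric labels. The sketch assumes the API is reachable at `http://localhost:4646` without TLS or ACLs. The helper names are illustrative, and the sample data uses the shortened IDs from the `nomad node status` output above, whereas the API returns full UUIDs.

```python
import json
import urllib.request


def fetch_nodes(addr="http://localhost:4646"):
    """Fetch the node list from the /v1/nodes endpoint (server's view)."""
    with urllib.request.urlopen(f"{addr}/v1/nodes", timeout=5) as resp:
        return json.load(resp)


def schedulable_node_ids(nodes):
    """Return IDs of nodes that are both eligible and in the ready state."""
    return [
        n["ID"]
        for n in nodes
        if n["SchedulingEligibility"] == "eligible" and n["Status"] == "ready"
    ]


# Sample data mirroring three rows of the node status table in this report
# (IDs shortened as in the CLI output).
sample_nodes = [
    {"ID": "655d03e8", "SchedulingEligibility": "ineligible", "Status": "ready"},
    {"ID": "0227cdc9", "SchedulingEligibility": "eligible", "Status": "ready"},
    {"ID": "e888efe7", "SchedulingEligibility": "eligible", "Status": "ready"},
]

eligible_ids = schedulable_node_ids(sample_nodes)
print(eligible_ids)
```

Against a live cluster, `schedulable_node_ids(fetch_nodes())` would replace the sample data; with ACLs enabled, the request would also need a token header.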


kholisrag commented Mar 6, 2023

Any ETA for a fix? This would be useful for nomad-autoscaler metrics filtering.


rostow commented Apr 5, 2023

Same here, it does not seem to be working for us either. We are using Prometheus as our APM. It would be awesome to have it working for autoscaling purposes, as @kholisrag mentioned.

Projects
Status: Needs Roadmapping