Missing allocation resource use metrics #5928

Closed

damoxc opened this issue Jul 5, 2019 · 20 comments

Comments

@damoxc
Contributor

damoxc commented Jul 5, 2019

Nomad version

Nomad v0.9.3 (c5e8b66c3789e4e7f9a83b4e188e9a937eea43ce)

Operating system and Environment details

Server: Ubuntu 18.04
Client: Windows 2019
Driver: Docker

Issue

There don't appear to be any allocation resource usage metrics exposed after adding the following to both the server and client configurations:

telemetry {
  collection_interval        = "1s"
  disable_hostname           = true
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}

I've tried querying both the server and the clients. There don't appear to be any nomad.client metrics exposed on the server, and the clients expose only basic totals, none of the allocation metrics listed at https://www.nomadproject.io/docs/telemetry/metrics.html#allocation-metrics.
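
As a side note, since prometheus_metrics is enabled, the same endpoint can also be spot-checked in Prometheus text format (a hedged one-liner; the address is a placeholder, and allocation gauges would be expected to show up as nomad_client_allocs_* series):

$ curl --silent "http://x.x.x.x:4646/v1/metrics?format=prometheus" | grep "nomad_client_allocs"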

Server Metrics

$ curl --silent  https://x.x.x.x:4646/v1/metrics | jq '.Gauges[].Name' | uniq
"nomad.nomad.autopilot.failure_tolerance"
"nomad.nomad.autopilot.healthy"
"nomad.nomad.blocked_evals.total_blocked"
"nomad.nomad.blocked_evals.total_escaped"
"nomad.nomad.blocked_evals.total_quota_limit"
"nomad.nomad.broker._core.ready"
"nomad.nomad.broker._core.unacked"
"nomad.nomad.broker.service.ready"
"nomad.nomad.broker.service.unacked"
"nomad.nomad.broker.total_blocked"
"nomad.nomad.broker.total_ready"
"nomad.nomad.broker.total_unacked"
"nomad.nomad.broker.total_waiting"
"nomad.nomad.heartbeat.active"
"nomad.nomad.job_summary.complete"
"nomad.nomad.job_summary.failed"
"nomad.nomad.job_summary.lost"
"nomad.nomad.job_summary.queued"
"nomad.nomad.job_summary.running"
"nomad.nomad.job_summary.starting"
"nomad.nomad.plan.queue_depth"
"nomad.nomad.vault.distributed_tokens_revoking"
"nomad.nomad.vault.token_ttl"
"nomad.runtime.alloc_bytes"
"nomad.runtime.free_count"
"nomad.runtime.heap_objects"
"nomad.runtime.malloc_count"
"nomad.runtime.num_goroutines"
"nomad.runtime.sys_bytes"
"nomad.runtime.total_gc_pause_ns"
"nomad.runtime.total_gc_runs"

Client Metrics

$ curl --silent http://x.x.x.x:4646/v1/metrics | jq '.Gauges[].Name'
"nomad.client.allocated.cpu"
"nomad.client.allocated.disk"
"nomad.client.allocated.memory"
"nomad.client.allocated.network"
"nomad.client.allocations.blocked"
"nomad.client.allocations.migrating"
"nomad.client.allocations.pending"
"nomad.client.allocations.running"
"nomad.client.allocations.terminal"
"nomad.client.unallocated.cpu"
"nomad.client.unallocated.disk"
"nomad.client.unallocated.memory"
"nomad.client.unallocated.network"
"nomad.runtime.alloc_bytes"
"nomad.runtime.free_count"
"nomad.runtime.heap_objects"
"nomad.runtime.malloc_count"
"nomad.runtime.num_goroutines"
"nomad.runtime.sys_bytes"
"nomad.runtime.total_gc_pause_ns"
"nomad.runtime.total_gc_runs"
@cgbaker
Contributor

cgbaker commented Jul 5, 2019

Are there any jobs registered with running allocations on the client that you're talking to?

For example:

$ http localhost:7646/v1/metrics | jq  '.Gauges[]| select(.Name | startswith("nomad.client.allocs.cpu")) | .Name, .Labels.alloc_id'
"nomad.client.allocs.cpu.system"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.throttled_periods"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.throttled_time"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.total_percent"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.total_ticks"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.user"
"ed6849f8-877e-4c42-f36a-1653206d4266"

@damoxc
Contributor Author

damoxc commented Jul 6, 2019

@cgbaker
Running the following to hit all 6 clients yields no results:

$ for clientAddr in x x x x x x; do curl --silent http://x.x.x.${clientAddr}:4646/v1/metrics | jq  '.Gauges[]| select(.Name | startswith("nomad.client.allocs.cpu")) | .Name, .Labels.alloc_id'; done;

There are certainly running allocations on all of the clients that are up:

$ nomad node status -allocs
ID        DC              Name        Class   Drain  Eligibility  Status  Running Allocs
9e144753  europe-west2-b  nomad-g42v  <none>  false  eligible     ready   6
30a68744  europe-west2-c  nomad-gmk2  <none>  false  eligible     ready   25
f6f5e1dd  europe-west2-c  nomad-d48d  <none>  false  eligible     ready   27
92def6f4  europe-west2-b  nomad-rfz4  <none>  false  eligible     ready   24
5f7c3e2e  europe-west2-b  nomad-g42v  <none>  false  eligible     down    0
6963fbfe  europe-west2-a  nomad-bw8h  <none>  false  eligible     ready   24
444905c6  europe-west2-a  nomad-0g3x  <none>  false  eligible     ready   27

@cgbaker
Contributor

cgbaker commented Jul 15, 2019

@damoxc, there is a reported issue where, under load on the client node, all of the nomad.client.* metrics go missing due to (as I recall) a blocking call to a Windows API that never returns. I did some trivial testing on a single-node Windows cluster and the allocation metrics were returned, so I'm working under the theory that some circumstantial difference is needed to reproduce this issue. Any insight you have into reproducing it would be appreciated.
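
A quick way to tell whether the whole client stats collector is blocked or only allocation collection is affected is to count the client gauges (a minimal sketch using the same curl and jq approach as above; the address is a placeholder):

$ curl --silent http://x.x.x.x:4646/v1/metrics | jq '[.Gauges[].Name | select(startswith("nomad.client"))] | length'
$ curl --silent http://x.x.x.x:4646/v1/metrics | jq '[.Gauges[].Name | select(startswith("nomad.client.allocs"))] | length'

If the first count is zero, no client metrics are being collected at all (consistent with the blocked-API theory); if only the second is zero, just the per-allocation collection is affected.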

@damoxc
Contributor Author

damoxc commented Jul 16, 2019

@cgbaker are you able to share the configuration files for your simple example so I could compare them to what I have?

@cgbaker
Contributor

cgbaker commented Jul 17, 2019

They weren't saved before the node was torn down, but the telemetry section was the same as you posted above. I will spin up a cluster and try again.

@damoxc
Contributor Author

damoxc commented Jul 18, 2019

I've added the same telemetry block to our production cluster, which is configured nearly identically, and it is exhibiting the same problem, so at least the behaviour is consistent and not just something random with our dev cluster.

@cgbaker
Contributor

cgbaker commented Jul 18, 2019

And to be clear, the prod cluster has the same configuration:

  • ubuntu servers
  • windows clients with the telemetry stanza
  • docker tasks

@damoxc
Contributor Author

damoxc commented Jul 18, 2019

Yes. We also have:

  1. Raft encryption
  2. ACL tokens
  3. TLS
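
With TLS and ACLs enabled, the metrics queries above need the CA certificate (and possibly a token header, depending on the ACL policy). A minimal sketch, with the CA path and token as placeholders:

$ curl --silent --cacert /etc/nomad.d/tls/ca.pem -H "X-Nomad-Token: <acl-token>" https://x.x.x.x:4646/v1/metrics | jq '.Gauges[].Name'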

@mgeggie

mgeggie commented Sep 18, 2019

We're also seeing this affecting our Nomad deployment, with Ubuntu Linux 16.04 clients and Nomad v0.9.2.

@endocrimes
Contributor

@damoxc @mgeggie Could you share any example job files that you're not seeing metrics for?

@mgeggie

mgeggie commented Sep 18, 2019

@endocrimes We've actually had jobs start out successfully reporting metrics, only to later have all allocation usage metrics stop on a nomad-client server. Here's an example job we use for deploying traefik:

job "traefik" {
  datacenters = ["use1"]
  type        = "service"
  group "traefik" {

    count = 3

    update {
      max_parallel     = 1
      min_healthy_time = "20s"
      healthy_deadline = "2m"
      auto_revert      = true
      stagger          = "30s"
    }

#     migrate {
#       max_parallel = 1
#       health_check = "checks"
#       min_healthy_time = "20s"
#       healthy_deadline = "1m"
#     }

    task "traefik" {
      driver = "docker"
      kill_timeout = "35s"

      env {
        CONSUL_HTTP_ADDR = "http://169.254.1.1:8500"
      }

      config {
        image        = "traefik:v1.7.11"
        args = [
          "--api",
          "--ping",
          "--ping.entrypoint=http",
          "--consulcatalog.endpoint=169.254.1.1:8500",
          "--metrics.prometheus.entrypoint=traefik",
          "--traefikLog.format=common",
          "--accessLog.format=common",
          "--lifecycle.requestacceptgracetimeout=20s",
          "--lifecycle.gracetimeout=10s",
        ]
        port_map {
          http  = 80
          webui = 8080
        }
      }

      resources {
        cpu = 1000 # MHz
        memory = 2048 # MB
        network {
          mbits = 500
          port "http" {
            static = 80
          }
          port "webui" { }
        }
      }

      service {
        name = "traefik"
        port = "webui"
        check {
          name     = "ping"
          type     = "http"
          port     = "http"
          path     = "/ping"
          interval = "5s"
          timeout  = "2s"
        }
      }

    }
  }
}

In fact, for this job we have 3 instances, each on its own Nomad client; two are currently reporting metrics and one is not. None of the allocations on the affected Nomad client are reporting allocation utilization metrics.
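
One way to narrow this down is to compare the alloc IDs that appear in the affected client's metrics against the allocations Nomad believes are running on that node. A hedged sketch (client address and node ID are placeholders); any running alloc ID missing from the first list is one whose stats collection has stopped:

$ curl --silent http://<client-addr>:4646/v1/metrics | jq -r '.Gauges[] | select(.Name == "nomad.client.allocs.cpu.total_percent") | .Labels.alloc_id' | sort -u
$ curl --silent http://<client-addr>:4646/v1/node/<node-id>/allocations | jq -r '.[] | select(.ClientStatus == "running") | .ID' | sort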

@endocrimes
Contributor

@mgeggie That's very useful - thank you!

Could you possibly send us any client logs from the bad clients to nomad-oss-debug@hashicorp.com? The lower-level the logging the better, but for this one anything is useful.
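
To capture lower-level logs, the client's log verbosity can be raised with log_level = "DEBUG" in the agent configuration, or on the command line when starting the agent (a minimal sketch; the config path is a placeholder):

$ nomad agent -config=/etc/nomad.d -log-level=DEBUG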

preetapan assigned endocrimes and unassigned cgbaker Sep 19, 2019
@mgeggie

mgeggie commented Sep 19, 2019

@endocrimes I sent logs over to nomad-oss-debug@hashicorp.com.

@endocrimes
Contributor

@mgeggie Thanks! I think #6349 should help with a fair amount of your case, as it looks like you're getting blocked on collecting host disk stats for the broader client allocation metrics. I'm still unsure about the individual per-allocation metrics, though.

@mgeggie

mgeggie commented Sep 23, 2019

Thanks @endocrimes. We've seen that host disk collection error since starting our Nomad cluster a few months back. I'll check out #6349 to see about resolving the issue.

Also note: we restarted the Nomad process on the afflicted Nomad client, and the allocation stats that had previously been reporting and then stopped did not recover after the restart.

We're about to upgrade our Nomad cluster to 0.9.5. I'll report back on the status of our afflicted nodes once that's complete.

@mgeggie

mgeggie commented Sep 26, 2019

Hi @endocrimes, just an update on our upgrade: after upgrading our troublesome Nomad client to 0.9.5 and rebooting the server, allocation resource use metrics are once again being produced.

@stale

stale bot commented Feb 6, 2020

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

@mgeggie

mgeggie commented Feb 6, 2020

Hey @tgross, I only received the waiting-reply label and notification, and the issue was closed within the same hour. I'm still curious how our cluster entered a state where it stopped producing allocation metrics. I can report that in the 3 months since upgrading to Nomad 0.9.5 we haven't had any further issues with losing allocation metrics.

@tgross
Member

tgross commented Feb 6, 2020

Hi @mgeggie, sorry about the confusion there. I saw the notification from the bot and it looked to me like the issue had been resolved with the upgrade?

I know we updated the prometheus and libcontainer clients in 0.9.4 so the issue may have been upstream, but we also improved CPU utilization for busy clusters in that same release and have seen Nomad CPU utilization impact metrics collection for other users. I've recently added extra testing for both host and allocation metrics collection in our end-to-end test suite and I've got an open ticket to upgrade our go-psutils library in the 0.11 release cycle so that we get improvements from upstream on collecting host information. Hope that gives you some context!

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 13, 2022