Missing allocation resource use metrics #5928

Closed

damoxc opened this issue Jul 5, 2019 · 20 comments

Comments

@damoxc
Contributor

damoxc commented Jul 5, 2019

Nomad version

Nomad v0.9.3 (c5e8b66c3789e4e7f9a83b4e188e9a937eea43ce)

Operating system and Environment details

Server: Ubuntu 18.04
Client: Windows 2019
Driver: Docker

Issue

There don't appear to be any allocation resource usage metrics exposed after adding the following to both the server and client configurations:

telemetry {
  collection_interval        = "1s"
  disable_hostname           = true
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}

I've tried querying both the server and the clients. There don't appear to be any nomad.client metrics exposed on the server, and the clients expose only basic totals, none of the allocation metrics listed at https://www.nomadproject.io/docs/telemetry/metrics.html#allocation-metrics.
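
As a side note, since prometheus_metrics is enabled, the same endpoint can also be spot-checked in Prometheus text format (a hedged one-liner; the address is a placeholder, and allocation gauges would be expected to show up as nomad_client_allocs_* series):

$ curl --silent "http://x.x.x.x:4646/v1/metrics?format=prometheus" | grep "nomad_client_allocs"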

Server Metrics

$ curl --silent  https://x.x.x.x:4646/v1/metrics | jq '.Gauges[].Name' | uniq
"nomad.nomad.autopilot.failure_tolerance"
"nomad.nomad.autopilot.healthy"
"nomad.nomad.blocked_evals.total_blocked"
"nomad.nomad.blocked_evals.total_escaped"
"nomad.nomad.blocked_evals.total_quota_limit"
"nomad.nomad.broker._core.ready"
"nomad.nomad.broker._core.unacked"
"nomad.nomad.broker.service.ready"
"nomad.nomad.broker.service.unacked"
"nomad.nomad.broker.total_blocked"
"nomad.nomad.broker.total_ready"
"nomad.nomad.broker.total_unacked"
"nomad.nomad.broker.total_waiting"
"nomad.nomad.heartbeat.active"
"nomad.nomad.job_summary.complete"
"nomad.nomad.job_summary.failed"
"nomad.nomad.job_summary.lost"
"nomad.nomad.job_summary.queued"
"nomad.nomad.job_summary.running"
"nomad.nomad.job_summary.starting"
"nomad.nomad.plan.queue_depth"
"nomad.nomad.vault.distributed_tokens_revoking"
"nomad.nomad.vault.token_ttl"
"nomad.runtime.alloc_bytes"
"nomad.runtime.free_count"
"nomad.runtime.heap_objects"
"nomad.runtime.malloc_count"
"nomad.runtime.num_goroutines"
"nomad.runtime.sys_bytes"
"nomad.runtime.total_gc_pause_ns"
"nomad.runtime.total_gc_runs"

Client Metrics

$ curl --silent http://x.x.x.x:4646/v1/metrics | jq '.Gauges[].Name'
"nomad.client.allocated.cpu"
"nomad.client.allocated.disk"
"nomad.client.allocated.memory"
"nomad.client.allocated.network"
"nomad.client.allocations.blocked"
"nomad.client.allocations.migrating"
"nomad.client.allocations.pending"
"nomad.client.allocations.running"
"nomad.client.allocations.terminal"
"nomad.client.unallocated.cpu"
"nomad.client.unallocated.disk"
"nomad.client.unallocated.memory"
"nomad.client.unallocated.network"
"nomad.runtime.alloc_bytes"
"nomad.runtime.free_count"
"nomad.runtime.heap_objects"
"nomad.runtime.malloc_count"
"nomad.runtime.num_goroutines"
"nomad.runtime.sys_bytes"
"nomad.runtime.total_gc_pause_ns"
"nomad.runtime.total_gc_runs"
@cgbaker
Contributor

cgbaker commented Jul 5, 2019

Are there any jobs registered with running allocations on the client that you're talking to?

For example:

$ http localhost:7646/v1/metrics | jq  '.Gauges[]| select(.Name | startswith("nomad.client.allocs.cpu")) | .Name, .Labels.alloc_id'
"nomad.client.allocs.cpu.system"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.throttled_periods"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.throttled_time"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.total_percent"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.total_ticks"
"ed6849f8-877e-4c42-f36a-1653206d4266"
"nomad.client.allocs.cpu.user"
"ed6849f8-877e-4c42-f36a-1653206d4266"

@damoxc
Contributor Author

damoxc commented Jul 6, 2019

@cgbaker
Running the following to hit all 6 clients yields no results:

$ for clientAddr in x x x x x x; do curl --silent http://x.x.x.${clientAddr}:4646/v1/metrics | jq  '.Gauges[]| select(.Name | startswith("nomad.client.allocs.cpu")) | .Name, .Labels.alloc_id'; done;

There are certainly running allocations on all of the clients that are up:

$ nomad node status -allocs
ID        DC              Name        Class   Drain  Eligibility  Status  Running Allocs
9e144753  europe-west2-b  nomad-g42v  <none>  false  eligible     ready   6
30a68744  europe-west2-c  nomad-gmk2  <none>  false  eligible     ready   25
f6f5e1dd  europe-west2-c  nomad-d48d  <none>  false  eligible     ready   27
92def6f4  europe-west2-b  nomad-rfz4  <none>  false  eligible     ready   24
5f7c3e2e  europe-west2-b  nomad-g42v  <none>  false  eligible     down    0
6963fbfe  europe-west2-a  nomad-bw8h  <none>  false  eligible     ready   24
444905c6  europe-west2-a  nomad-0g3x  <none>  false  eligible     ready   27

@cgbaker
Contributor

cgbaker commented Jul 15, 2019

@damoxc, there is a reported issue where, under load on the client node, all of the nomad.client.* metrics go missing due to (as I recall) a blocking call to a Windows API that never returns. I did some trivial testing on a single-node Windows cluster and the allocation metrics were returned, so I'm working under the theory that some circumstantial difference is needed to reproduce this issue. Any insight you have into reproducing it would be appreciated.
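
A quick way to tell whether the whole client stats collector is blocked or only allocation collection is affected is to count the client gauges (a minimal sketch using the same curl and jq approach as above; the address is a placeholder):

$ curl --silent http://x.x.x.x:4646/v1/metrics | jq '[.Gauges[].Name | select(startswith("nomad.client"))] | length'
$ curl --silent http://x.x.x.x:4646/v1/metrics | jq '[.Gauges[].Name | select(startswith("nomad.client.allocs"))] | length'

If the first count is zero, no client metrics are being collected at all (consistent with the blocked-API theory); if only the second is zero, just the per-allocation collection is affected.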

@damoxc
Contributor Author

damoxc commented Jul 16, 2019

@cgbaker are you able to share the configuration files for your simple example so I could compare them to what I have?

@cgbaker
Contributor

cgbaker commented Jul 17, 2019

They weren't saved before the node was torn down, but the telemetry section was the same as you posted above. I will spin up a cluster and try again.

@damoxc
Contributor Author

damoxc commented Jul 18, 2019

I've added the same telemetry block to our production cluster, which is configured nearly identically, and it is exhibiting the same problem, so at least the behaviour is consistent and not just something random with our dev cluster.

@cgbaker
Contributor

cgbaker commented Jul 18, 2019

And to be clear, the prod cluster has the same configuration:

  • ubuntu servers
  • windows clients with the telemetry stanza
  • docker tasks

@damoxc
Contributor Author

damoxc commented Jul 18, 2019

Yes. We also have:

  1. Raft encryption
  2. ACL tokens
  3. TLS
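
With TLS and ACLs enabled, the metrics queries above need the CA certificate (and possibly a token header, depending on the ACL policy). A minimal sketch, with the CA path and token as placeholders:

$ curl --silent --cacert /etc/nomad.d/tls/ca.pem -H "X-Nomad-Token: <acl-token>" https://x.x.x.x:4646/v1/metrics | jq '.Gauges[].Name'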

@mgeggie

mgeggie commented Sep 18, 2019

We're also seeing this affecting our Nomad deployment, with Ubuntu Linux 16.04 clients and Nomad v0.9.2.

@endocrimes
Contributor

@damoxc @mgeggie Could you share any example job files that you're not seeing metrics for?

@mgeggie

mgeggie commented Sep 18, 2019

@endocrimes We've actually had jobs start out successfully reporting metrics, only to later have all allocation usage metrics stop on a nomad-client server. Here's an example job we use for deploying traefik:

job "traefik" {
  datacenters = ["use1"]
  type        = "service"
  group "traefik" {

    count = 3

    update {
      max_parallel     = 1
      min_healthy_time = "20s"
      healthy_deadline = "2m"
      auto_revert      = true
      stagger          = "30s"
    }

#     migrate {
#       max_parallel = 1
#       health_check = "checks"
#       min_healthy_time = "20s"
#       healthy_deadline = "1m"
#     }

    task "traefik" {
      driver = "docker"
      kill_timeout = "35s"

      env {
        CONSUL_HTTP_ADDR = "http://169.254.1.1:8500"
      }

      config {
        image        = "traefik:v1.7.11"
        args = [
          "--api",
          "--ping",
          "--ping.entrypoint=http",
          "--consulcatalog.endpoint=169.254.1.1:8500",
          "--metrics.prometheus.entrypoint=traefik",
          "--traefikLog.format=common",
          "--accessLog.format=common",
          "--lifecycle.requestacceptgracetimeout=20s",
          "--lifecycle.gracetimeout=10s",
        ]
        port_map {
          http  = 80
          webui = 8080
        }
      }

      resources {
        cpu = 1000 # MHz
        memory = 2048 # MB
        network {
          mbits = 500
          port "http" {
            static = 80
          }
          port "webui" { }
        }
      }

      service {
        name = "traefik"
        port = "webui"
        check {
          name     = "ping"
          type     = "http"
          port     = "http"
          path     = "/ping"
          interval = "5s"
          timeout  = "2s"
        }
      }

    }
  }
}

In fact, for this job we have 3 instances, each on its own Nomad client; two are currently reporting metrics and one is not. None of the allocations on the affected Nomad client are reporting allocation utilization metrics.
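
One way to narrow this down is to compare the alloc IDs that appear in the affected client's metrics against the allocations Nomad believes are running on that node. A hedged sketch (client address and node ID are placeholders); any running alloc ID missing from the first list is one whose stats collection has stopped:

$ curl --silent http://<client-addr>:4646/v1/metrics | jq -r '.Gauges[] | select(.Name == "nomad.client.allocs.cpu.total_percent") | .Labels.alloc_id' | sort -u
$ curl --silent http://<client-addr>:4646/v1/node/<node-id>/allocations | jq -r '.[] | select(.ClientStatus == "running") | .ID' | sort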

@endocrimes
Contributor

@mgeggie That's very useful - thank you!

Could you possibly send us any client logs from the bad clients to nomad-oss-debug@hashicorp.com? The lower-level the logging the better, but for this one anything is useful.
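
To capture lower-level logs, the client's log verbosity can be raised with log_level = "DEBUG" in the agent configuration, or on the command line when starting the agent (a minimal sketch; the config path is a placeholder):

$ nomad agent -config=/etc/nomad.d -log-level=DEBUG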

preetapan assigned endocrimes and unassigned cgbaker Sep 19, 2019
@mgeggie

mgeggie commented Sep 19, 2019

@endocrimes I sent logs over to nomad-oss-debug@hashicorp.com.

@endocrimes
Contributor

@mgeggie Thanks! I think #6349 should help with a fair amount of your case, as it looks like you're getting blocked on collecting host disk stats for the broader client allocation metrics. I'm still unsure about the individual per-allocation metrics, though.

@mgeggie

mgeggie commented Sep 23, 2019

Thanks @endocrimes. We've seen that host disk collection error since starting our Nomad cluster a few months back. I'll check out #6349 to see about resolving the issue.

Also note: we restarted the Nomad process on the afflicted Nomad client, and the allocation stats that had previously been reporting and then stopped did not recover after the restart.

We're about to upgrade our Nomad cluster to 0.9.5. I'll report back on the status of our afflicted nodes once that's complete.

@mgeggie

mgeggie commented Sep 26, 2019

Hi @endocrimes, just an update on our upgrade: after upgrading our troublesome Nomad client to 0.9.5 and rebooting the server, allocation resource use metrics are once again being produced.

@stale

stale bot commented Feb 6, 2020

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

@mgeggie

mgeggie commented Feb 6, 2020

Hey @tgross, I only received the waiting-reply label and notification, and the issue was closed within the same hour. I'm still curious how our cluster entered a state where it stopped producing allocation metrics. I can report that in the 3 months since upgrading to Nomad 0.9.5 we haven't had any further issues with losing allocation metrics.

@tgross
Member

tgross commented Feb 6, 2020

Hi @mgeggie, sorry about the confusion there. I saw the notification from the bot and it looked to me like the issue had been resolved with the upgrade?

I know we updated the prometheus and libcontainer clients in 0.9.4 so the issue may have been upstream, but we also improved CPU utilization for busy clusters in that same release and have seen Nomad CPU utilization impact metrics collection for other users. I've recently added extra testing for both host and allocation metrics collection in our end-to-end test suite and I've got an open ticket to upgrade our go-psutils library in the 0.11 release cycle so that we get improvements from upstream on collecting host information. Hope that gives you some context!

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 13, 2022