
Negative metrics in 0.9.0 for service tasks. #5570

Closed · the-maldridge opened this issue Apr 16, 2019 · 16 comments · Fixed by #5637
the-maldridge commented Apr 16, 2019

Nomad version

Nomad v0.9.0

Operating system and Environment details

Alpine Linux AMD64

Issue

The unallocated client CPU metric appears to be affected by service/batch tasks and is reporting negative values on clients running these tasks.

Reproduction steps

Install Nomad 0.9.0, enable telemetry, submit a service job, and observe the bug.
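A minimal client telemetry stanza for this kind of setup might look like the following sketch (the statsd sink and address are assumptions; any supported sink should behave the same):

telemetry {
  statsd_address             = "127.0.0.1:8125" # assumed statsd sink; datadog_address etc. also work
  publish_node_metrics       = true             # emits the nomad.client.allocated.* / unallocated.* gauges
  publish_allocation_metrics = true             # per-allocation resource usage metrics
}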

preetapan self-assigned this Apr 18, 2019
@preetapan (Contributor)

@the-maldridge Can you provide a job spec file that triggers this behavior, along with the output of /v1/metrics from the client where you are seeing it? I tried an example redis job on a test cluster; the output is below, and I currently don't see negative values. This could be an edge case triggered by specific resource stanza requirements, so having that would help us debug further.
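For reference, the example redis job here is presumably close to the stock `nomad job init` spec; a minimal sketch (image tag and resource values are the stock defaults, not necessarily the exact spec used) would be:

job "example" {
  datacenters = ["dc1"]
  type        = "service"

  group "cache" {
    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }

      resources {
        cpu    = 500 # MHz
        memory = 256 # MB

        network {
          mbits = 10
          port "db" {}
        }
      }
    }
  }
}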

What I see when running a redis service job on a node:

{
  "Counters": [],
  "Gauges": [
    {
      "Labels": {
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work"
      },
      "Name": "nomad.client.allocated.cpu",
      "Value": 1000
    },
    {
      "Labels": {
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work"
      },
      "Name": "nomad.client.allocated.disk",
      "Value": 600
    },
    {
      "Labels": {
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work",
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674"
      },
      "Name": "nomad.client.allocated.memory",
      "Value": 512
    },
    {
      "Labels": {
        "node_class": "none",
        "device": "wlp58s0",
        "host": "preetha-work",
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1"
      },
      "Name": "nomad.client.allocated.network",
      "Value": 10
    },
    {
      "Labels": {
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work"
      },
      "Name": "nomad.client.allocations.blocked",
      "Value": 0
    },
    {
      "Labels": {
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work"
      },
      "Name": "nomad.client.allocations.migrating",
      "Value": 0
    },
    {
      "Labels": {
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work"
      },
      "Name": "nomad.client.allocations.pending",
      "Value": 2
    },
    {
      "Labels": {
        "node_class": "none",
        "host": "preetha-work",
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1"
      },
      "Name": "nomad.client.allocations.running",
      "Value": 0
    },
    {
      "Labels": {
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work",
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674"
      },
      "Name": "nomad.client.allocations.terminal",
      "Value": 0
    },
    {
      "Labels": {
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work"
      },
      "Name": "nomad.client.unallocated.cpu",
      "Value": 14200
    },
    {
      "Labels": {
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work"
      },
      "Name": "nomad.client.unallocated.disk",
      "Value": 287081
    },
    {
      "Labels": {
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work"
      },
      "Name": "nomad.client.unallocated.memory",
      "Value": 7356
    },
    {
      "Labels": {
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1",
        "node_class": "none",
        "device": "wlp58s0",
        "host": "preetha-work"
      },
      "Name": "nomad.client.unallocated.network",
      "Value": 990
    },
    {
      "Labels": {
        "host": "preetha-work"
      },
      "Name": "nomad.runtime.alloc_bytes",
      "Value": 4902936
    },
    {
      "Labels": {
        "host": "preetha-work"
      },
      "Name": "nomad.runtime.free_count",
      "Value": 3381831
    },
    {
      "Labels": {
        "host": "preetha-work"
      },
      "Name": "nomad.runtime.heap_objects",
      "Value": 43694
    },
    {
      "Labels": {
        "host": "preetha-work"
      },
      "Name": "nomad.runtime.malloc_count",
      "Value": 3425525
    },
    {
      "Labels": {
        "host": "preetha-work"
      },
      "Name": "nomad.runtime.num_goroutines",
      "Value": 163
    },
    {
      "Labels": {
        "host": "preetha-work"
      },
      "Name": "nomad.runtime.sys_bytes",
      "Value": 72284410
    },
    {
      "Labels": {
        "host": "preetha-work"
      },
      "Name": "nomad.runtime.total_gc_pause_ns",
      "Value": 20731992
    },
    {
      "Labels": {
        "host": "preetha-work"
      },
      "Name": "nomad.runtime.total_gc_runs",
      "Value": 113
    }
  ],
  "Points": [],
  "Samples": [
    {
      "Count": 1,
      "Labels": {
        "host": "preetha-work"
      },
      "Max": 53488,
      "Mean": 53488,
      "Min": 53488,
      "Name": "nomad.runtime.gc_pause_ns",
      "Rate": 5348.8,
      "Stddev": 0,
      "Sum": 53488
    }
  ],
  "Timestamp": "2019-04-18 17:33:20 +0000 UTC"
}
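Reading these gauges together (an inference from the numbers above, not from the client code), unallocated appears to be the node total minus the allocated sum:

  allocated.cpu    + unallocated.cpu    = 1000 + 14200 = 15200 MHz of node CPU
  allocated.memory + unallocated.memory =  512 +  7356 =  7868 MB of node memory

So a negative unallocated value would mean the summed allocated resources exceed the node's computed total.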

@stevenscg

I am seeing this as well for unallocated CPU, memory, etc. Clients with no batch jobs running on them do not emit negative telemetry values; clients with a mix of batch and service jobs do. Nomad v0.9.0, CentOS 7, Dogstatsd telemetry handler.

The green trace here is the host with no batch jobs:

[metrics graph screenshot]

@preetapan (Contributor)

@stevenscg - Any other info you can provide? What drivers are these tasks using? Based on my investigation so far, this looks fairly driver-specific; I don't see it with docker/raw_exec running a shell script.

stevenscg commented Apr 22, 2019

All of our jobs use the docker driver.

Client telemetry config looks like this:

telemetry {
  datadog_address = "127.0.0.1:8125"
  publish_node_metrics = true
}

Server telemetry config looks like this:

telemetry {
  datadog_address = "127.0.0.1:8125"
}

Also, the green trace shown above sits at a positive value of around 1900, which is what I would expect for this host and cluster.

@preetapan (Contributor)

@stevenscg Thanks. Can you also post the JSON response from curl against /v1/metrics on one of the negative-value nodes (red or purple above)?

@stevenscg

@preetapan Emailed the metrics response to nomad-oss-debug.

cgbaker (Contributor) commented May 7, 2019

@the-maldridge, @stevenscg: a fix for this has been merged to master and should be part of the upcoming 0.9.2. A Linux build with this fix is attached if you are interested in testing it out.

855caf72328e02ef7b20292872bd892f3eb5d5c5.tar.gz

@the-maldridge (Author)

@cgbaker I'd be happy to test. If I pull from master at that commit, are most other things stable?

cgbaker (Contributor) commented May 9, 2019

Fuco1 (Contributor) commented Sep 29, 2019

I'm still seeing negative metrics emitted to statsd on 0.9.5.

The server runs only one job (I've removed the configuration templates, as they contain some sensitive info).

job "traefik" {
    datacenters = ["dc1"]
    type = "service"

    constraint {
        attribute = "${node.unique.name}"
        value = "gateway"
    }

    group "server" {
        count = 1

        ephemeral_disk {
            size = 3000
        }

        task "traefik" {
            driver = "docker"

            vault {
                policies = ["cert"]
            }

            config {
                image = "traefik:1.7.12"

                volumes = [
                    "local/traefik.toml:/etc/traefik/traefik.toml",
                    "secrets/certs:/certs"
                ]

                port_map {
                    https = 443
                    dashboard = 8080
                }
            }

            resources {
                network {
                    port "https" {
                        static = 443
                    }
                    port "dashboard" {
                        static = 8080
                    }
                }

                memory = 3500
            }

            service {
                name = "traefik"
                port = "https"
                check {
                    name     = "Traefik TCP Alive"
                    type     = "tcp"
                    interval = "10s"
                    timeout  = "2s"
                }
            }
        }
    }
}

Fuco1 (Contributor) commented Sep 30, 2019

As of this morning the metric is now a positive 122 MB unallocated, so at some point something in the Nomad agent "fixed" itself.

If there are any additional logs I can provide, tell me where to find them.

stale bot commented Dec 29, 2019

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

Fuco1 (Contributor) commented Dec 29, 2019

Hey @stale, I'm still seeing the issue.

stale bot commented Mar 28, 2020

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

stale bot commented Apr 28, 2020

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

stale bot closed this as completed Apr 28, 2020
github-actions bot commented Nov 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 8, 2022