
Negative metrics in 0.9.0 for service tasks. #5570

Closed · the-maldridge opened this issue Apr 16, 2019 · 16 comments · Fixed by #5637
the-maldridge commented Apr 16, 2019

Nomad version

Nomad v0.9.0

Operating system and Environment details

Alpine Linux AMD64

Issue

The unallocated client CPU metric appears to be affected by service/batch tasks and is reporting negative values on clients running these tasks.

Reproduction steps

Install Nomad 0.9.0, enable telemetry, submit a service job, and observe the bug.
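A minimal client telemetry stanza for this kind of setup might look like the following sketch (the statsd sink and address are assumptions; any supported sink should behave the same):

telemetry {
  statsd_address             = "127.0.0.1:8125" # assumed statsd sink; datadog_address etc. also work
  publish_node_metrics       = true             # emits the nomad.client.allocated.* / unallocated.* gauges
  publish_allocation_metrics = true             # per-allocation resource usage metrics
}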

preetapan self-assigned this Apr 18, 2019
@preetapan (Contributor)

@the-maldridge Can you provide a job spec file that triggers this behavior, along with the output of /v1/metrics from the client where you are seeing it? I tried an example redis job on a test cluster; the output is below, and I currently don't see negative values. This could be an edge case triggered by specific resource stanza requirements, so having that would help us debug further.
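For reference, the example redis job here is presumably close to the stock `nomad job init` spec; a minimal sketch (image tag and resource values are the stock defaults, not necessarily the exact spec used) would be:

job "example" {
  datacenters = ["dc1"]
  type        = "service"

  group "cache" {
    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }

      resources {
        cpu    = 500 # MHz
        memory = 256 # MB

        network {
          mbits = 10
          port "db" {}
        }
      }
    }
  }
}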

What I see when running a redis service job on a node:

{
  "Counters": [],
  "Gauges": [
    {
      "Labels": {
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work"
      },
      "Name": "nomad.client.allocated.cpu",
      "Value": 1000
    },
    {
      "Labels": {
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work"
      },
      "Name": "nomad.client.allocated.disk",
      "Value": 600
    },
    {
      "Labels": {
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work",
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674"
      },
      "Name": "nomad.client.allocated.memory",
      "Value": 512
    },
    {
      "Labels": {
        "node_class": "none",
        "device": "wlp58s0",
        "host": "preetha-work",
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1"
      },
      "Name": "nomad.client.allocated.network",
      "Value": 10
    },
    {
      "Labels": {
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work"
      },
      "Name": "nomad.client.allocations.blocked",
      "Value": 0
    },
    {
      "Labels": {
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work"
      },
      "Name": "nomad.client.allocations.migrating",
      "Value": 0
    },
    {
      "Labels": {
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work"
      },
      "Name": "nomad.client.allocations.pending",
      "Value": 2
    },
    {
      "Labels": {
        "node_class": "none",
        "host": "preetha-work",
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1"
      },
      "Name": "nomad.client.allocations.running",
      "Value": 0
    },
    {
      "Labels": {
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work",
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674"
      },
      "Name": "nomad.client.allocations.terminal",
      "Value": 0
    },
    {
      "Labels": {
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work"
      },
      "Name": "nomad.client.unallocated.cpu",
      "Value": 14200
    },
    {
      "Labels": {
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work"
      },
      "Name": "nomad.client.unallocated.disk",
      "Value": 287081
    },
    {
      "Labels": {
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1",
        "node_class": "none",
        "host": "preetha-work"
      },
      "Name": "nomad.client.unallocated.memory",
      "Value": 7356
    },
    {
      "Labels": {
        "node_id": "ed81d50a-429d-4677-432f-439279c9d674",
        "datacenter": "dc1",
        "node_class": "none",
        "device": "wlp58s0",
        "host": "preetha-work"
      },
      "Name": "nomad.client.unallocated.network",
      "Value": 990
    },
    {
      "Labels": {
        "host": "preetha-work"
      },
      "Name": "nomad.runtime.alloc_bytes",
      "Value": 4902936
    },
    {
      "Labels": {
        "host": "preetha-work"
      },
      "Name": "nomad.runtime.free_count",
      "Value": 3381831
    },
    {
      "Labels": {
        "host": "preetha-work"
      },
      "Name": "nomad.runtime.heap_objects",
      "Value": 43694
    },
    {
      "Labels": {
        "host": "preetha-work"
      },
      "Name": "nomad.runtime.malloc_count",
      "Value": 3425525
    },
    {
      "Labels": {
        "host": "preetha-work"
      },
      "Name": "nomad.runtime.num_goroutines",
      "Value": 163
    },
    {
      "Labels": {
        "host": "preetha-work"
      },
      "Name": "nomad.runtime.sys_bytes",
      "Value": 72284410
    },
    {
      "Labels": {
        "host": "preetha-work"
      },
      "Name": "nomad.runtime.total_gc_pause_ns",
      "Value": 20731992
    },
    {
      "Labels": {
        "host": "preetha-work"
      },
      "Name": "nomad.runtime.total_gc_runs",
      "Value": 113
    }
  ],
  "Points": [],
  "Samples": [
    {
      "Count": 1,
      "Labels": {
        "host": "preetha-work"
      },
      "Max": 53488,
      "Mean": 53488,
      "Min": 53488,
      "Name": "nomad.runtime.gc_pause_ns",
      "Rate": 5348.8,
      "Stddev": 0,
      "Sum": 53488
    }
  ],
  "Timestamp": "2019-04-18 17:33:20 +0000 UTC"
}
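Reading these gauges together (an inference from the numbers above, not from the client code), unallocated appears to be the node total minus the allocated sum:

  allocated.cpu    + unallocated.cpu    = 1000 + 14200 = 15200 MHz of node CPU
  allocated.memory + unallocated.memory =  512 +  7356 =  7868 MB of node memory

So a negative unallocated value would mean the summed allocated resources exceed the node's computed total.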

@stevenscg

I am seeing this as well for unallocated CPU, memory, etc. Clients with no batch jobs running on them do not emit negative telemetry values; clients with a mix of batch and service jobs do. Nomad v0.9.0, CentOS 7, Dogstatsd telemetry handler.

The green trace here is the host with no batch jobs:

[metrics graph screenshot]

@preetapan (Contributor)

@stevenscg - Any other info you can provide? What drivers are these tasks using? Based on my investigation so far, this looks fairly driver-specific; I don't see it with docker/raw_exec running a shell script.

stevenscg commented Apr 22, 2019

All of our jobs use the docker driver.

Client telemetry config looks like this:

telemetry {
  datadog_address = "127.0.0.1:8125"
  publish_node_metrics = true
}

Server telemetry config looks like this:

telemetry {
  datadog_address = "127.0.0.1:8125"
}

Also, the green trace shown above sits at a positive value of around 1900, which is what I would expect for this host and cluster.

@preetapan (Contributor)

@stevenscg Thanks. Can you also post the JSON response from curl against /v1/metrics on one of the negative-value nodes (red or purple above)?

@stevenscg

@preetapan Emailed the metrics response to nomad-oss-debug.

cgbaker (Contributor) commented May 7, 2019

@the-maldridge, @stevenscg: a fix for this has been merged to master and should be part of the upcoming 0.9.2. A Linux build with this fix is attached if you are interested in testing it out.

855caf72328e02ef7b20292872bd892f3eb5d5c5.tar.gz

@the-maldridge (Author)

@cgbaker I'd be happy to test. If I pull from master at that commit, are most other things stable?

cgbaker (Contributor) commented May 9, 2019

Fuco1 (Contributor) commented Sep 29, 2019

I'm still seeing negative metrics emitted to statsd on 0.9.5.

The server runs only one job (I've removed the configuration templates, as they contain some sensitive info).

job "traefik" {
    datacenters = ["dc1"]
    type = "service"

    constraint {
        attribute = "${node.unique.name}"
        value = "gateway"
    }

    group "server" {
        count = 1

        ephemeral_disk {
            size = 3000
        }

        task "traefik" {
            driver = "docker"

            vault {
                policies = ["cert"]
            }

            config {
                image = "traefik:1.7.12"

                volumes = [
                    "local/traefik.toml:/etc/traefik/traefik.toml",
                    "secrets/certs:/certs"
                ]

                port_map {
                    https = 443
                    dashboard = 8080
                }
            }

            resources {
                network {
                    port "https" {
                        static = 443
                    }
                    port "dashboard" {
                        static = 8080
                    }
                }

                memory = 3500
            }

            service {
                name = "traefik"
                port = "https"
                check {
                    name     = "Traefik TCP Alive"
                    type     = "tcp"
                    interval = "10s"
                    timeout  = "2s"
                }
            }
        }
    }
}

Fuco1 (Contributor) commented Sep 30, 2019

As of this morning the metric is now a positive 122 MB unallocated, so at some point something in the Nomad agent "fixed" itself.

If there are any additional logs I can provide, tell me where to find them.

stale bot commented Dec 29, 2019

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

Fuco1 (Contributor) commented Dec 29, 2019

Hey @stale, I'm still seeing the issue.

stale bot commented Mar 28, 2020

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

stale bot commented Apr 28, 2020

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

stale bot closed this as completed Apr 28, 2020
github-actions bot commented Nov 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 8, 2022