Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect CPU Usage Metrics #3228

Closed
epipho opened this issue Sep 16, 2017 · 3 comments
Closed

Incorrect CPU Usage Metrics #3228

epipho opened this issue Sep 16, 2017 · 3 comments

Comments

@epipho
Copy link

epipho commented Sep 16, 2017

Nomad version

Output from nomad version
Nomad v0.6.3

Operating system and Environment details

AWS r4.large instance (2 cores, fingerprint as 3000Mhz each)
CoreOS Container Linux 1465
Docker 17.05.0-ce

Issue

Allocation Resource Utilization for allocations is incorrect, reporting using way more CPU than is available on the host.

Both nomad CLI and the API agree on the (incorrect) number,

Reproduction steps

  • Create job that uses up a whole core (job below is just a while True loop)
  • Run nomad node-status -self to view resource allocation/utilization
  • Note that node stats are correct (1 core at 100% use)
  • Note that docker stats on launched container is correct
  • Note that allocation resource utilization is way off (20k+Mhz of only 6k), CPU Percent at 750% in api.

Output from nomad node-status -self

Truncated to resources section. Host Utilization seems fine, Allocation utilization is way off.

Allocated Resources
CPU            Memory          Disk            IOPS
3000/6000 MHz  128 MiB/15 GiB  100 MiB/14 GiB  0/0

Allocation Resource Utilization
CPU             Memory
22368/6000 MHz  104 KiB/15 GiB

Host Resource Utilization
CPU            Memory          Disk
3121/6000 MHz  835 MiB/15 GiB  2.1 GiB/17 GiB

Output from nomad-alloc-status

Truncated just to resources usage.

Task "spinner_service" is "running"
Task Resources
CPU             Memory           Disk     IOPS  Addresses
22494/3000 MHz  104 KiB/128 MiB  100 MiB  0

Output from docker stats

CONTAINER           CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
153a                100.32%             392KiB / 128MiB     0.30%               1.15kB / 0B         0B / 0B             1

Output from v1/client/allocation//stats

{
  "ResourceUsage": {
    "CpuStats": {
      "Measured": [
        "Throttled Periods",
        "Throttled Time",
        "Percent"
      ],
      "Percent": 750.7910181818181,
      "SystemMode": 0,
      "ThrottledPeriods": 0,
      "ThrottledTime": 0,
      "TotalTicks": 22523.730545454542,
      "UserMode": 1498.419630437774
    },
    "MemoryStats": {
      "Cache": 24576,
      "KernelMaxUsage": 0,
      "KernelUsage": 0,
      "MaxUsage": 1933312,
      "Measured": [
        "RSS",
        "Cache",
        "Swap",
        "Max Usage"
      ],
      "RSS": 106496,
      "Swap": 0
    }
  },
  "Tasks": {
    "spinner_service": {
      "Pids": null,
      "ResourceUsage": {
        "CpuStats": {
          "Measured": [
            "Throttled Periods",
            "Throttled Time",
            "Percent"
          ],
          "Percent": 750.7910181818181,
          "SystemMode": 0,
          "ThrottledPeriods": 0,
          "ThrottledTime": 0,
          "TotalTicks": 22523.730545454542,
          "UserMode": 1498.419630437774
        },
        "MemoryStats": {
          "Cache": 24576,
          "KernelMaxUsage": 0,
          "KernelUsage": 0,
          "MaxUsage": 1933312,
          "Measured": [
            "RSS",
            "Cache",
            "Swap",
            "Max Usage"
          ],
          "RSS": 106496,
          "Swap": 0
        }
      },
      "Timestamp": 1505532852018018000
    }
  },
  "Timestamp": 1505532852018018000
}

Job file (if appropriate)

{
  "Region": "us-east-1",
  "ID": "spinner",
  "ParentID": "",
  "Name": "spinner",
  "Type": "service",
  "Priority": 10,
  "AllAtOnce": false,
  "Datacenters": [
    "dc1"
  ],
  "Constraints": null,
  "TaskGroups": [
    {
      "Name": "spinner_3",
      "Count": 1,
      "Constraints": null,
      "Tasks": [
        {
          "Name": "spinner_service",
          "Driver": "docker",
          "User": "",
          "Config": {
            "args": [
              "/bin/sh",
              "-c",
              "while true; do true; done"
            ],
            "dns_servers": [
              "172.17.0.1"
            ],
            "force_pull": true,
            "ignore_mem_limit": false,
            "image": "debian:jessie",
            "logging": [
              {
                "type": "journald"
              }
            ],
            "volumes": null
          },
          "Constraints": [
            {
              "LTarget": "${node.class}",
              "RTarget": "worker",
              "Operand": "set_contains"
            }         
          ],
          "Env": {
            "SERVICE_INSTANCE": "${NOMAD_ALLOC_ID}",
            "SERVICE_NAME": "spinner"
          },
          "Services": null,
          "Resources": {
            "CPU": 3000,
            "MemoryMB": 128,
            "DiskMB": 0,
            "IOPS": 0,
            "Networks": [
              {
                "Public": false,
                "CIDR": "",
                "ReservedPorts": null,
                "DynamicPorts": null,
                "IP": "",
                "MBits": 1
              }
            ]
          },
          "Meta": null,
          "KillTimeout": 270000000000,
          "LogConfig": {
            "MaxFiles": 5,
            "MaxFileSizeMB": 10
          },
          "Artifacts": null,
          "Vault": null,
          "Templates": null,
          "ID": "spinner.spinner_service"
        }
      ],
      "RestartPolicy": {
        "Interval": 60000000000,
        "Attempts": 2,
        "Delay": 15000000000,
        "Mode": "delay"
      },
      "EphemeralDisk": {
        "Sticky": false,
        "Migrate": false,
        "SizeMB": 100
      },
      "Meta": {
        "INSTANCE": "3"
      },
      "ID": "spinner.spinner_3"
    }
  ],
  "Update": {
    "Stagger": 1,
    "MaxParallel": 1
  },
  "Periodic": null,
  "Meta": null,
  "VaultToken": "",
  "Status": "running",
  "StatusDescription": "",
  "CreateIndex": 381861,
  "ModifyIndex": 384988,
  "JobModifyIndex": 384965
}
@epipho
Copy link
Author

epipho commented Sep 16, 2017

Ok, this was a weird one.

https://github.com/hashicorp/nomad/blob/master/client/driver/docker.go#L1660 uses the length of the per-cpu usage array to calculate the number of cores.

However, the docker stats API is not returning the expected 2 cores, it is returning 15 (see output below). Not sure if this is related to the underlying hardware (I tested it on a few random t2, c4, r4, r3, m4 instances and they all had 15 cores in the output). This causes the percentage calculations to be off, in my case w/ the r4.large, by 7.5x.

The docker stats output has an "online_cpus" field, but since that was not exposed in the docker client lib, I went ahead and changed it to use the runtime.NumCPU() value. This fixed all alloc stats, everything is exactly where it should be (3000/3000Mhz for the alloc, 3000/6000Mhz for the host, etc).

I opened #3229. I checked back in the history and the calculations were added in 693c8f9 so I am not sure if there was a historical reason for it working the way it did.

curl --unix-socket /var/run/docker.sock -XGET "http:/v1.21/containers/spinner_service-f1a5f49b-ed89-525e-594d-a987e04ad7fd/stats?stream=false" | jq

{
  "read": "2017-09-16T04:35:46.018030541Z",
  "preread": "2017-09-16T04:35:45.018038297Z",
  "pids_stats": {
    "current": 1
  },
  "blkio_stats": {
    "io_service_bytes_recursive": [],
    "io_serviced_recursive": [],
    "io_queue_recursive": [],
    "io_service_time_recursive": [],
    "io_wait_time_recursive": [],
    "io_merged_recursive": [],
    "io_time_recursive": [],
    "sectors_recursive": []
  },
  "num_procs": 0,
  "storage_stats": {},
  "cpu_stats": {
    "cpu_usage": {
      "total_usage": 144680312118,
      "percpu_usage": [
        133118158988,
        11562153130,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ],
      "usage_in_kernelmode": 30000000,
      "usage_in_usermode": 144650000000
    },
    "system_cpu_usage": 52853520000000,
    "online_cpus": 2,
    "throttling_data": {
      "periods": 0,
      "throttled_periods": 0,
      "throttled_time": 0
    }
  },
  "precpu_stats": {
    "cpu_usage": {
      "total_usage": 143686594496,
      "percpu_usage": [
        132124441366,
        11562153130,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ],
      "usage_in_kernelmode": 30000000,
      "usage_in_usermode": 143650000000
    },
    "system_cpu_usage": 52851530000000,
    "online_cpus": 2,
    "throttling_data": {
      "periods": 0,
      "throttled_periods": 0,
      "throttled_time": 0
    }
  },
  "memory_stats": {
    "usage": 438272,
    "max_usage": 2322432,
    "stats": {
      "active_anon": 106496,
      "active_file": 0,
      "cache": 0,
      "dirty": 0,
      "hierarchical_memory_limit": 134217728,
      "hierarchical_memsw_limit": 134217728,
      "inactive_anon": 0,
      "inactive_file": 0,
      "mapped_file": 0,
      "pgfault": 578,
      "pgmajfault": 0,
      "pgpgin": 467,
      "pgpgout": 441,
      "rss": 106496,
      "rss_huge": 0,
      "shmem": 0,
      "swap": 0,
      "total_active_anon": 106496,
      "total_active_file": 0,
      "total_cache": 0,
      "total_dirty": 0,
      "total_inactive_anon": 0,
      "total_inactive_file": 0,
      "total_mapped_file": 0,
      "total_pgfault": 578,
      "total_pgmajfault": 0,
      "total_pgpgin": 467,
      "total_pgpgout": 441,
      "total_rss": 106496,
      "total_rss_huge": 0,
      "total_shmem": 0,
      "total_swap": 0,
      "total_unevictable": 0,
      "total_writeback": 0,
      "unevictable": 0,
      "writeback": 0
    },
    "limit": 134217728
  },
  "name": "/spinner_service-f1a5f49b-ed89-525e-594d-a987e04ad7fd",
  "id": "bde452d00a950f3d20a52e6e05afe95654ec7c4e4f61d328a08e4bcd7356cd7b",
  "networks": {
    "eth0": {
      "rx_bytes": 1599,
      "rx_packets": 14,
      "rx_errors": 0,
      "rx_dropped": 0,
      "tx_bytes": 0,
      "tx_packets": 0,
      "tx_errors": 0,
      "tx_dropped": 0
    }
  }
}

@dadgar
Copy link
Contributor

dadgar commented Sep 17, 2017

@epipho Awesome write up and thanks for the PR 👍

@github-actions
Copy link

github-actions bot commented Dec 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 8, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants