nomad.nomad.blocked_evals.job.cpu unexpected values #12848

Closed
louievandyke opened this issue May 2, 2022 · 2 comments · Fixed by #13104

louievandyke commented May 2, 2022

Nomad version

ubuntu@server-1:~$ nomad --version
Nomad v1.2.6 (a6c6b475db5073e33885377b4a5c733e1161020c)

Operating system and Environment details

ubuntu@server-1:~$ cat /etc/*rel*
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.6 LTS"
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

Issue

The metric nomad.nomad.blocked_evals.job.cpu reports incorrect values after a blocked job is resubmitted with a different cpu resource value.

Reproduction steps

  • Run a cluster with one Nomad client.
  • Deploy a job that runs successfully.
  • Deploy another job that deliberately requests more cpu than the cluster can provide (to test autoscaling).
  • This job blocks, and nomad.nomad.blocked_evals.job.cpu initially reports what looks like the correct value.
  • Modify the same jobspec to use a different cpu value (one that should still block), redeploy, and check nomad.nomad.blocked_evals.job.cpu again: the value now looks incorrect.
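
The metric check repeated throughout this report (curl of /v1/metrics piped through jq, gron, and grep) can also be sketched in Python. This is only a minimal illustration: the function name blocked_cpu_gauges and the sample payload are made up for this example, and the payload is hand-written to match the shape of the gron output shown in this report rather than captured from a real server.

```python
import json

METRIC = "nomad.nomad.blocked_evals.job.cpu"


def blocked_cpu_gauges(metrics_json, name=METRIC):
    """Return {(namespace, job): value} for every gauge called `name`."""
    payload = json.loads(metrics_json)
    result = {}
    for gauge in payload.get("Gauges") or []:
        if gauge.get("Name") != name:
            continue
        labels = gauge.get("Labels") or {}
        key = (labels.get("namespace"), labels.get("job"))
        result[key] = gauge.get("Value")
    return result


# Hand-written sample (assumption): one unrelated placeholder gauge plus the
# blocked-eval gauge with the labels and value seen in this report.
sample = json.dumps({
    "Gauges": [
        {"Name": "some.other.gauge", "Labels": {}, "Value": 1},
        {
            "Name": METRIC,
            "Labels": {"job": "example-new", "namespace": "default"},
            "Value": 4000,
        },
    ]
})

print(blocked_cpu_gauges(sample))
```

Against a live cluster, the body returned by "${NOMAD_ADDR}/v1/metrics" would be passed in place of the sample string.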
➜  files git:(main) ✗ nomad job status
ID          Type     Priority  Status   Submit Date
grafana     service  50        running  2022-05-02T11:58:37-07:00
prometheus  service  50        running  2022-05-02T11:58:11-07:00
traefik     system   50        running  2022-05-02T11:57:22-07:00
➜  control git:(main) ✗ cat example.nomad | grep "resources {" -a2
      #     https://www.nomadproject.io/docs/job-specification/resources
      #
      resources {
        cpu    = 2000 # 500 MHz
        memory = 256 # 256MB
➜  files git:(main) ✗ curl "${NOMAD_ADDR}/v1/metrics" | jq . | gron | grep nomad.nomad.blocked_evals.job.cpu -a2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7592    0  7592    0     0  48050      0 --:--:-- --:--:-- --:--:-- 48050
➜  control git:(main) ✗ nomad job run example.nomad
==> 2022-05-02T12:16:14-07:00: Monitoring evaluation "267d0c79"
    2022-05-02T12:16:14-07:00: Evaluation triggered by job "example"
    2022-05-02T12:16:14-07:00: Evaluation within deployment: "b98efc31"
    2022-05-02T12:16:14-07:00: Allocation "c5a89144" created: node "72c5bf27", group "cache"
    2022-05-02T12:16:14-07:00: Evaluation status changed: "pending" -> "complete"
==> 2022-05-02T12:16:14-07:00: Evaluation "267d0c79" finished with status "complete"
==> 2022-05-02T12:16:14-07:00: Monitoring deployment "b98efc31"
  ✓ Deployment "b98efc31" successful

    2022-05-02T12:16:32-07:00
    ID          = b98efc31
    Job ID      = example
    Job Version = 0
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    cache       1        1       1        0          2022-05-02T19:26:31Z
➜  control git:(main) ✗ nomad job status
ID          Type     Priority  Status   Submit Date
example     service  50        running  2022-05-02T12:16:14-07:00
grafana     service  50        running  2022-05-02T11:58:37-07:00
prometheus  service  50        running  2022-05-02T11:58:11-07:00
traefik     system   50        running  2022-05-02T11:57:22-07:00
➜  control git:(main) ✗ nomad alloc status -stats c5a89144
ID                  = c5a89144-02c7-2233-9fe4-c28c840f0ccd
Eval ID             = 267d0c79
Name                = example.cache[0]
Node ID             = 72c5bf27
Node Name           = clients000001
Job ID              = example
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 4m50s ago
Modified            = 4m33s ago
Deployment ID       = b98efc31
Deployment Health   = healthy

Allocation Addresses
Label  Dynamic  Address
*db    yes      10.0.2.6:24603 -> 6379

Task "redis" is "running"
Task Resources
CPU         Memory           Disk     Addresses
1/2000 MHz  6.2 MiB/256 MiB  300 MiB

Memory Stats
Cache  Max Usage  RSS      Swap  Usage
0 B    11 MiB     6.2 MiB  0 B   7.3 MiB

CPU Stats
Percent  Throttled Periods  Throttled Time
0.08%    0                  0

Task Events:
Started At     = 2022-05-02T19:16:20Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2022-05-02T12:16:20-07:00  Started     Task started by client
2022-05-02T12:16:14-07:00  Driver      Downloading image
2022-05-02T12:16:14-07:00  Task Setup  Building Task Directory
2022-05-02T12:16:14-07:00  Received    Task received by client
➜  control git:(main) ✗ curl "${NOMAD_ADDR}/v1/metrics" | jq . | gron | grep nomad.nomad.blocked_evals.job.cpu -a2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7650    0  7650    0     0  47222      0 --:--:-- --:--:-- --:--:-- 47222

There are no blocked evals because there are available resources. Now I'm going to deploy a new job that purposefully has cpu resources elevated beyond the cluster capacity.

➜  control git:(main) ✗ cat example-new.nomad | grep "resources {" -a2
      #     https://www.nomadproject.io/docs/job-specification/resources
      #
      resources {
        cpu    = 4000 # 500 MHz
        memory = 256 # 256MB
➜  control git:(main) ✗ nomad job plan example-new.nomad
+ Job: "example-new"
+ Task Group: "cache" (1 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "cache" (failed to place 1 allocation):
    * Resources exhausted on 1 nodes
    * Class "hashistack" exhausted on 1 nodes
    * Dimension "cpu" exhausted on 1 nodes

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 example-new.nomad

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
➜  control git:(main) ✗ nomad job run example-new.nomad
==> 2022-05-02T12:26:56-07:00: Monitoring evaluation "776532e1"
    2022-05-02T12:26:56-07:00: Evaluation triggered by job "example-new"
    2022-05-02T12:26:56-07:00: Evaluation within deployment: "b981f236"
    2022-05-02T12:26:56-07:00: Evaluation status changed: "pending" -> "complete"
==> 2022-05-02T12:26:56-07:00: Evaluation "776532e1" finished with status "complete" but failed to place all allocations:
    2022-05-02T12:26:56-07:00: Task Group "cache" (failed to place 1 allocation):
      * Resources exhausted on 1 nodes
      * Class "hashistack" exhausted on 1 nodes
      * Dimension "cpu" exhausted on 1 nodes
    2022-05-02T12:26:56-07:00: Evaluation "c73f32ef" waiting for additional capacity to place remainder
==> 2022-05-02T12:26:56-07:00: Monitoring deployment "b981f236"
  ⠸ Deployment "b981f236" in progress...

    2022-05-02T12:27:03-07:00
    ID          = b981f236
    Job ID      = example-new
    Job Version = 0
    Status      = running
    Description = Deployment is running

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    cache       1        0       0        0          N/A
➜  control git:(main) ✗ curl "${NOMAD_ADDR}/v1/metrics" | jq . | gron | grep nomad.nomad.blocked_evals.job.cpu -a2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9009    0  9009    0     0  54271      0 --:--:-- --:--:-- --:--:-- 53946
json.Gauges[1].Labels.job = "example-new";
json.Gauges[1].Labels.namespace = "default";
json.Gauges[1].Name = "nomad.nomad.blocked_evals.job.cpu";
json.Gauges[1].Value = 4000;
json.Gauges[2] = {};

^ This value is correct: json.Gauges[1].Value = 4000.

Now if I modify the same jobspec and change the cpu resource value from 4000 to 3500, I'd expect json.Gauges[1].Value = 3500. However, the value stays at 4000.

➜  control git:(main) ✗ cat example-new.nomad | grep "resources {" -a2
      #     https://www.nomadproject.io/docs/job-specification/resources
      #
      resources {
        cpu    = 3500 # 500 MHz
        memory = 256 # 256MB
➜  control git:(main) ✗ nomad job plan example-new.nomad
+/- Job: "example-new"
+/- Task Group: "cache" (1 create/destroy update)
  +/- Task: "redis" (forces create/destroy update)
    +/- Resources {
      +/- CPU:         "4000" => "3500"
          Cores:       "0"
          DiskMB:      "0"
          IOPS:        "0"
          MemoryMB:    "256"
          MemoryMaxMB: "0"
        }

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "cache" (failed to place 1 allocation):
    * Resources exhausted on 1 nodes
    * Class "hashistack" exhausted on 1 nodes
    * Dimension "cpu" exhausted on 1 nodes

Job Modify Index: 23559
To submit the job with version verification run:

nomad job run -check-index 23559 example-new.nomad

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
➜  control git:(main) ✗ nomad job run example-new.nomad
==> 2022-05-02T12:40:10-07:00: Monitoring evaluation "52f8ab6d"
    2022-05-02T12:40:10-07:00: Evaluation triggered by job "example-new"
    2022-05-02T12:40:10-07:00: Evaluation within deployment: "9f5c480a"
    2022-05-02T12:40:10-07:00: Evaluation status changed: "pending" -> "complete"
==> 2022-05-02T12:40:10-07:00: Evaluation "52f8ab6d" finished with status "complete" but failed to place all allocations:
    2022-05-02T12:40:10-07:00: Task Group "cache" (failed to place 1 allocation):
      * Resources exhausted on 1 nodes
      * Class "hashistack" exhausted on 1 nodes
      * Dimension "cpu" exhausted on 1 nodes
    2022-05-02T12:40:10-07:00: Evaluation "04c5726d" waiting for additional capacity to place remainder
==> 2022-05-02T12:40:10-07:00: Monitoring deployment "9f5c480a"
  ⠼ Deployment "9f5c480a" in progress...

    2022-05-02T12:40:15-07:00
    ID          = 9f5c480a
    Job ID      = example-new
    Job Version = 3
    Status      = running
    Description = Deployment is running

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    cache       1        0       0        0          N/A^C
➜  control git:(main) ✗ curl "${NOMAD_ADDR}/v1/metrics" | jq . | gron | grep nomad.nomad.blocked_evals.job.cpu -a2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9177    0  9177    0     0  53982      0 --:--:-- --:--:-- --:--:-- 53666
json.Gauges[1].Labels.job = "example-new";
json.Gauges[1].Labels.namespace = "default";
json.Gauges[1].Name = "nomad.nomad.blocked_evals.job.cpu";
json.Gauges[1].Value = 4000;
json.Gauges[2] = {};

➜  control git:(main) ✗ curl "${NOMAD_ADDR}/v1/metrics" | jq . | gron | grep nomad.nomad.blocked_evals.job.cpu -a2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16281    0 16281    0     0  66182      0 --:--:-- --:--:-- --:--:-- 66182
json.Gauges[1].Labels.job = "example-new";
json.Gauges[1].Labels.namespace = "default";
json.Gauges[1].Name = "nomad.nomad.blocked_evals.job.cpu";
json.Gauges[1].Value = 4000;
json.Gauges[2] = {};

➜  control git:(main) ✗ nomad job status
ID           Type     Priority  Status   Submit Date
example      service  50        running  2022-05-02T12:16:14-07:00
example-new  service  50        running  2022-05-02T12:40:10-07:00
grafana      service  50        running  2022-05-02T11:58:37-07:00
prometheus   service  50        running  2022-05-02T11:58:11-07:00
traefik      system   50        running  2022-05-02T11:57:22-07:00
➜  control git:(main) ✗ nomad job status example-new
ID            = example-new
Name          = example-new
Submit Date   = 2022-05-02T12:40:10-07:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       1       0         1        0       0         0

Placement Failure
Task Group "cache":
  * Resources exhausted on 1 nodes
  * Class "hashistack" exhausted on 1 nodes
  * Dimension "cpu" exhausted on 1 nodes

Latest Deployment
ID          = 9f5c480a
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
cache       1        0       0        0          N/A

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
97a8bf5f  72c5bf27  cache       1        run      running  36m15s ago  36m3s ago
➜  control git:(main) ✗ nomad alloc status -stats 97a8bf5f
ID                  = 97a8bf5f-1580-2c8a-7e18-0cae6d50a0f2
Eval ID             = 460bbd1e
Name                = example-new.cache[0]
Node ID             = 72c5bf27
Node Name           = clients000001
Job ID              = example-new
Job Version         = 1
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 36m35s ago
Modified            = 36m23s ago
Deployment ID       = bd787e3c
Deployment Health   = healthy

Allocation Addresses
Label  Dynamic  Address
*db    yes      10.0.2.6:31228 -> 6379

Task "redis" is "running"
Task Resources
CPU        Memory           Disk     Addresses
2/500 MHz  6.3 MiB/256 MiB  300 MiB

Memory Stats
Cache  Max Usage  RSS      Swap  Usage
0 B    13 MiB     6.3 MiB  0 B   7.3 MiB

CPU Stats
Percent  Throttled Periods  Throttled Time
0.11%    0                  0

Task Events:
Started At     = 2022-05-02T19:32:13Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2022-05-02T12:32:13-07:00  Started     Task started by client
2022-05-02T12:32:12-07:00  Task Setup  Building Task Directory
2022-05-02T12:32:12-07:00  Received    Task received by client

Expected Result

I expect the metric value to reflect the cpu currently requested by the pending (blocked) job.

Actual Result

The pending job requests 4000 cpu, which exceeds the cluster's capacity, so the job blocks. I then modify the same jobspec to request 3500 cpu, which is still above the available capacity. I expect curling nomad.nomad.blocked_evals.job.cpu to return 3500, but the value is still 4000.
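
The comparison between the jobspec and the gauge can be written down as a small check. This is a rough sketch, not part of Nomad: the helper names requested_cpu and gauge_mismatch are hypothetical, the jobspec text is the fragment printed by grep earlier in this report, and the expected value assumes a single task with count = 1, as in this reproduction.

```python
import re


def requested_cpu(jobspec_text):
    """Return every `cpu = N` value in a jobspec, in order of appearance."""
    pattern = r"^\s*cpu\s*=\s*(\d+)"
    return [int(v) for v in re.findall(pattern, jobspec_text, flags=re.MULTILINE)]


def gauge_mismatch(jobspec_text, observed_gauge):
    """Return (expected, observed, differs) for a single-task, count = 1 job."""
    expected = sum(requested_cpu(jobspec_text))
    return expected, observed_gauge, expected != observed_gauge


# Fragment of example-new.nomad after the change from 4000 to 3500.
spec = """
      resources {
        cpu    = 3500 # 500 MHz
        memory = 256 # 256MB
"""

# 4000 is the gauge value observed after resubmitting the job.
print(gauge_mismatch(spec, 4000))
```

For the values in this report the check flags a mismatch: the jobspec requests 3500 while the gauge still reports 4000.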

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

@louievandyke
Contributor Author

I also see negative values.

➜  control git:(main) ✗ curl "${NOMAD_ADDR}/v1/metrics" | jq . | gron | grep nomad.nomad.blocked_evals.job.cpu -a2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9212    0  9212    0     0  57937      0 --:--:-- --:--:-- --:--:-- 57937
json.Gauges[1].Labels.job = "example-new";
json.Gauges[1].Labels.namespace = "default";
json.Gauges[1].Name = "nomad.nomad.blocked_evals.job.cpu";
json.Gauges[1].Value = 4500;
json.Gauges[2] = {};
➜  control git:(main) ✗ nomad job stop example-new
==> 2022-05-02T14:03:40-07:00: Monitoring evaluation "8f9c02bd"
    2022-05-02T14:03:41-07:00: Evaluation triggered by job "example-new"
    2022-05-02T14:03:41-07:00: Evaluation within deployment: "e0e1045e"
    2022-05-02T14:03:41-07:00: Evaluation status changed: "pending" -> "complete"
==> 2022-05-02T14:03:41-07:00: Evaluation "8f9c02bd" finished with status "complete"
==> 2022-05-02T14:03:41-07:00: Monitoring deployment "e0e1045e"
  ! Deployment "e0e1045e" cancelled

    2022-05-02T14:03:41-07:00
    ID          = e0e1045e
    Job ID      = example-new
    Job Version = 1
    Status      = cancelled
    Description = Cancelled because job is stopped

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    cache       1        0       0        0          N/A
➜  control git:(main) ✗ nomad job status
ID           Type     Priority  Status          Submit Date
example      service  50        running         2022-05-02T13:57:30-07:00
example-new  service  50        dead (stopped)  2022-05-02T13:59:32-07:00
grafana      service  50        running         2022-05-02T11:58:37-07:00
prometheus   service  50        running         2022-05-02T11:58:11-07:00
traefik      system   50        running         2022-05-02T11:57:22-07:00
➜  control git:(main) ✗ curl "${NOMAD_ADDR}/v1/metrics" | jq . | gron | grep nomad.nomad.blocked_evals.job.cpu -a2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  8831    0  8831    0     0  54850      0 --:--:-- --:--:-- --:--:-- 54850
json.Gauges[1].Labels.job = "example-new";
json.Gauges[1].Labels.namespace = "default";
json.Gauges[1].Name = "nomad.nomad.blocked_evals.job.cpu";
json.Gauges[1].Value = -1000;
json.Gauges[2] = {};
➜  control git:(main) ✗ cat example-new.nomad| grep "resources {" -a2
      #     https://www.nomadproject.io/docs/job-specification/resources
      #
      resources {
        cpu    = 5500 # 500 MHz
        memory = 256 # 256MB
        
        
➜  control git:(main) ✗ cat example.nomad| grep "count ="
    count = 6
➜  control git:(main) ✗ cat example-new.nomad| grep "resources {" -a2
      #     https://www.nomadproject.io/docs/job-specification/resources
      #
      resources {
        cpu    = 5500 # 500 MHz
        memory = 256 # 256MB
        
➜  control git:(main) ✗ cat example-new.nomad| grep "count ="
    count = 1
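
One hypothetical bookkeeping model that reproduces these numbers is sketched below. It is not taken from Nomad's source, and the class DeltaGauge and its methods are invented for illustration. It assumes the gauge is maintained by adding the requested cpu when a job's evaluation blocks and subtracting it when the job is unblocked or stopped, that a resubmission of an already-blocked job does not update the gauge, and that the subtraction uses the job's current request. It also assumes example-new was first blocked at 4500 and later resubmitted at 5500, a sequence this report does not show explicitly.

```python
class DeltaGauge:
    """Toy per-job gauge: add on block, subtract on unblock (assumed model)."""

    def __init__(self):
        self.value = {}       # job -> current gauge value
        self.tracked = set()  # jobs that currently have a blocked eval

    def block(self, job, cpu):
        # Hypothetical flaw 1: a job that is already tracked is skipped,
        # so a resubmission with a new cpu value never reaches the gauge.
        if job in self.tracked:
            return
        self.tracked.add(job)
        self.value[job] = self.value.get(job, 0) + cpu

    def unblock(self, job, current_cpu):
        # Hypothetical flaw 2: subtract the job's current request rather
        # than the amount that block() actually added.
        if job in self.tracked:
            self.tracked.discard(job)
            self.value[job] = self.value.get(job, 0) - current_cpu


gauge = DeltaGauge()
gauge.block("example-new", 4500)    # first submission blocks
gauge.block("example-new", 5500)    # resubmission, gauge is not updated
after_resubmit = gauge.value["example-new"]
gauge.unblock("example-new", 5500)  # job stopped
after_stop = gauge.value["example-new"]

print(after_resubmit, after_stop)   # 4500 -1000
```

Under the same assumptions, the 4000 to 3500 case from the original report would leave the gauge at 4000 after the resubmission, which matches the stale value observed there.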

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 22, 2022