quotas(ent): quotas can be incorrectly calculated when nodes fail ranking #11848

Closed
jrasell opened this issue Jan 14, 2022 · 1 comment · Fixed by #11849
jrasell commented Jan 14, 2022

Nomad version

nomad-enterprise at main 316a8e1c78d123a36e861d9850343dd570b3a679

Issue

When performing allocation placement computations, the quota iterator can incorrectly report that a placement breaks the quota limit. This happens intermittently, despite there being confirmed resources available within the quota to accommodate all planned allocations.

Reproduction steps

Run vagrant up within the root directory of the cloned Nomad repository.
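
For reference, a minimal sequence to bring up and connect to the Vagrant machine (assuming a fresh clone of the repository) looks like this:

$ git clone https://github.com/hashicorp/nomad.git
$ cd nomad
$ vagrant up
$ vagrant ssh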

The following configuration files are used to run one Nomad server and three Nomad clients on the single Vagrant machine. The files can be placed in other locations, but subsequent commands will need to be updated to reflect this.

/tmp/server1.hcl
log_level = "TRACE"
data_dir  = "/tmp/nomad-dev-cluster/server1"
name      = "server1"

server {
  enabled          = true
  bootstrap_expect = 1

  default_scheduler_config {
    scheduler_algorithm = "spread"
  }
}
/tmp/client1.hcl
log_level    = "DEBUG"
data_dir     = "/tmp/nomad-dev-cluster/client1"
name         = "client1"
enable_debug = true

client {
  enabled           = true
  node_class        = "one"
  cpu_total_compute = 1000
  memory_total_mb   = 1000

  server_join {
    retry_join = ["127.0.0.1:4647"]
  }
}

ports {
  http = 7646
}
/tmp/client2.hcl
log_level    = "DEBUG"
data_dir     = "/tmp/nomad-dev-cluster/client2"
name         = "client2"
enable_debug = true

client {
  enabled           = true
  node_class        = "two"
  cpu_total_compute = 1000
  memory_total_mb   = 1000

  server_join {
    retry_join = ["127.0.0.1:4647"]
  }
}

ports {
  http = 8646
}
/tmp/client3.hcl
log_level    = "DEBUG"
data_dir     = "/tmp/nomad-dev-cluster/client3"
name         = "client3"
enable_debug = true

client {
  enabled           = true
  node_class        = "two"
  cpu_total_compute = 1000
  memory_total_mb   = 1000

  server_join {
    retry_join = ["127.0.0.1:4647"]
  }
}

ports {
  http = 9646
}

Start each of the following agent processes in a separate SSH terminal running on the Vagrant machine.

$ sudo env NOMAD_LICENSE=<INSERT_LICENSE> nomad agent -config=/tmp/server1.hcl
$ sudo nomad agent -config=/tmp/client1.hcl
$ sudo nomad agent -config=/tmp/client2.hcl
$ sudo nomad agent -config=/tmp/client3.hcl
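
As a sanity check before continuing (not part of the original steps), the cluster state can be confirmed from the Vagrant machine; the single server should be the leader and all three clients should report as ready:

$ nomad server members
$ nomad node status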

The following quota spec should be created and linked with the default namespace.

quota spec
name        = "default-quota"
description = "Limit the shared default namespace"

limit {
  region = "global"
  region_limit {
    cpu        = 3000
    memory     = 3000
    memory_max = 3000
  }
}

Apply the quota spec saved to disk using nomad quota apply <file>, then link it to the default namespace using nomad namespace apply -quota default-quota default.
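
For example, assuming the quota spec above was saved to /tmp/default-quota.hcl, the two commands are:

$ nomad quota apply /tmp/default-quota.hcl
$ nomad namespace apply -quota default-quota default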

The following job fills two of the Nomad clients, which is a prerequisite for this bug. It uses a constraint on node class to ensure it fully fills the two clients of class "two" and leaves the third client empty. Save this file, and register it using the nomad job run <jobspec> command.

filler.nomad
job "filler" {
  datacenters = ["dc1"]
  namespace   = "default"
  constraint {
    attribute = "${node.class}"
    value     = "two"
  }

  group "cache" {
    count = "4"
    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
      }

      resources {
        cpu    = 500
        memory = 500
      }
    }
  }
}
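
Assuming the jobspec above was saved to /tmp/filler.nomad, register it and confirm all four allocations are running before moving on:

$ nomad job run /tmp/filler.nomad
$ nomad job status filler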

The following job only needs to be planned in order to see the bug; the plan may need to be run several times before it is hit: nomad job plan <file>.

quota.nomad
job "quota" {
  datacenters = ["dc1"]
  namespace   = "default"

  update {
    max_parallel      = 1
    min_healthy_time  = "10s"
    healthy_deadline  = "3m"
    progress_deadline = "10m"
    auto_revert       = false
    canary            = 0
  }

  group "cache" {
    count = "3"
    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
      }

      resources {
        cpu    = 290
        memory = 100
      }
    }
  }
}
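
Because the failure is intermittent, running the plan in a simple loop makes it easier to hit; the sketch below assumes the jobspec above was saved to /tmp/quota.nomad:

$ for i in $(seq 1 20); do nomad job plan /tmp/quota.nomad; done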

Expected Result

The plan should complete successfully every time it is triggered.

Actual Result

The plan will fail with the following example output:

+ Job: "quota"
+ Task Group: "cache" (3 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "cache" (failed to place 1 allocation):
    * Resources exhausted on 1 nodes
    * Class "two" exhausted on 1 nodes
    * Dimension "cpu" exhausted on 1 nodes
    * Quota limit hit "cpu exhausted (3160 needed > 3000 limit)"
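
For context, the filler job consumes 4 × 500 = 2000 MHz of CPU and the quota job requests an additional 3 × 290 = 870 MHz, for a total of 2870 MHz, which is within the 3000 MHz limit. The reported 3160 MHz matches 2000 + 4 × 290, which suggests an allocation that failed node ranking is still being counted against the quota.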

@jrasell jrasell self-assigned this Jan 14, 2022
@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Jan 14, 2022
@jrasell jrasell moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jan 14, 2022
jrasell added a commit that referenced this issue Jan 14, 2022
Nomad - Community Issues Triage automation moved this from In Progress to Done Jan 17, 2022
jrasell added a commit that referenced this issue Jan 17, 2022
lgfa29 pushed a commit that referenced this issue Jan 17, 2022
lgfa29 pushed a commit that referenced this issue Jan 17, 2022
lgfa29 pushed a commit that referenced this issue Jan 18, 2022
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 12, 2022