quotas(ent): quotas can be incorrectly calculated when nodes fail ranking #11848

Closed
jrasell opened this issue Jan 14, 2022 · 1 comment · Fixed by #11849
jrasell commented Jan 14, 2022

Nomad version

nomad-enterprise at main 316a8e1c78d123a36e861d9850343dd570b3a679

Issue

When performing allocation placement computations, the quota iterator can incorrectly report that a placement breaks the quota limit. This happens intermittently, despite there being confirmed resources available within the quota to accommodate all planned allocations.

Reproduction steps

Run vagrant up within the root directory of the cloned Nomad repository.
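
For reference, a minimal sequence to bring up and connect to the Vagrant machine (assuming a fresh clone of the repository) looks like this:

$ git clone https://github.com/hashicorp/nomad.git
$ cd nomad
$ vagrant up
$ vagrant ssh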

The following configuration files are used to run one Nomad server and three Nomad clients on the single Vagrant machine. The files can be placed in other locations, but subsequent commands will need to be updated to reflect this.

/tmp/server1.hcl
log_level = "TRACE"
data_dir  = "/tmp/nomad-dev-cluster/server1"
name      = "server1"

server {
  enabled          = true
  bootstrap_expect = 1

  default_scheduler_config {
    scheduler_algorithm = "spread"
  }
}
/tmp/client1.hcl
log_level    = "DEBUG"
data_dir     = "/tmp/nomad-dev-cluster/client1"
name         = "client1"
enable_debug = true

client {
  enabled           = true
  node_class        = "one"
  cpu_total_compute = 1000
  memory_total_mb   = 1000

  server_join {
    retry_join = ["127.0.0.1:4647"]
  }
}

ports {
  http = 7646
}
/tmp/client2.hcl
log_level    = "DEBUG"
data_dir     = "/tmp/nomad-dev-cluster/client2"
name         = "client2"
enable_debug = true

client {
  enabled           = true
  node_class        = "two"
  cpu_total_compute = 1000
  memory_total_mb   = 1000

  server_join {
    retry_join = ["127.0.0.1:4647"]
  }
}

ports {
  http = 8646
}
/tmp/client3.hcl
log_level    = "DEBUG"
data_dir     = "/tmp/nomad-dev-cluster/client3"
name         = "client3"
enable_debug = true

client {
  enabled           = true
  node_class        = "two"
  cpu_total_compute = 1000
  memory_total_mb   = 1000

  server_join {
    retry_join = ["127.0.0.1:4647"]
  }
}

ports {
  http = 9646
}

Start each of the following agent processes in a separate SSH terminal running on the Vagrant machine.

$ sudo env NOMAD_LICENSE=<INSERT_LICENSE> nomad agent -config=/tmp/server1.hcl
$ sudo nomad agent -config=/tmp/client1.hcl
$ sudo nomad agent -config=/tmp/client2.hcl
$ sudo nomad agent -config=/tmp/client3.hcl
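
As a sanity check before continuing (not part of the original steps), the cluster state can be confirmed from the Vagrant machine; the single server should be the leader and all three clients should report as ready:

$ nomad server members
$ nomad node status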

The following quota spec should be created and linked with the default namespace.

quota spec
name        = "default-quota"
description = "Limit the shared default namespace"

limit {
  region = "global"
  region_limit {
    cpu        = 3000
    memory     = 3000
    memory_max = 3000
  }
}

Apply the quota spec saved to disk using nomad quota apply <file>, then link it to the default namespace using nomad namespace apply -quota default-quota default.
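
For example, assuming the quota spec above was saved to /tmp/default-quota.hcl, the two commands are:

$ nomad quota apply /tmp/default-quota.hcl
$ nomad namespace apply -quota default-quota default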

The following job fills two of the Nomad clients, which is a prerequisite for this bug. It uses a constraint on node class to ensure it fully fills the two clients of class "two" and leaves the third client empty. Save this file, and register it using the nomad job run <jobspec> command.

filler.nomad
job "filler" {
  datacenters = ["dc1"]
  namespace   = "default"
  constraint {
    attribute = "${node.class}"
    value     = "two"
  }

  group "cache" {
    count = "4"
    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
      }

      resources {
        cpu    = 500
        memory = 500
      }
    }
  }
}
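
Assuming the jobspec above was saved to /tmp/filler.nomad, register it and confirm all four allocations are running before moving on:

$ nomad job run /tmp/filler.nomad
$ nomad job status filler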

The following job only needs to be planned in order to see the bug; the plan may need to be run several times before it is hit: nomad job plan <file>.

quota.nomad
job "quota" {
  datacenters = ["dc1"]
  namespace   = "default"

  update {
    max_parallel      = 1
    min_healthy_time  = "10s"
    healthy_deadline  = "3m"
    progress_deadline = "10m"
    auto_revert       = false
    canary            = 0
  }

  group "cache" {
    count = "3"
    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
      }

      resources {
        cpu    = 290
        memory = 100
      }
    }
  }
}
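
Because the failure is intermittent, running the plan in a simple loop makes it easier to hit; the sketch below assumes the jobspec above was saved to /tmp/quota.nomad:

$ for i in $(seq 1 20); do nomad job plan /tmp/quota.nomad; done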

Expected Result

The plan should complete successfully every time it is triggered.

Actual Result

The plan will fail with the following example output:

+ Job: "quota"
+ Task Group: "cache" (3 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "cache" (failed to place 1 allocation):
    * Resources exhausted on 1 nodes
    * Class "two" exhausted on 1 nodes
    * Dimension "cpu" exhausted on 1 nodes
    * Quota limit hit "cpu exhausted (3160 needed > 3000 limit)"
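
For context, the filler job consumes 4 × 500 = 2000 MHz of CPU and the quota job requests an additional 3 × 290 = 870 MHz, for a total of 2870 MHz, which is within the 3000 MHz limit. The reported 3160 MHz matches 2000 + 4 × 290, which suggests an allocation that failed node ranking is still being counted against the quota.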

@jrasell jrasell self-assigned this Jan 14, 2022
@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Jan 14, 2022
@jrasell jrasell moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jan 14, 2022
jrasell added a commit that referenced this issue Jan 14, 2022
Nomad - Community Issues Triage automation moved this from In Progress to Done Jan 17, 2022
jrasell added a commit that referenced this issue Jan 17, 2022
lgfa29 pushed a commit that referenced this issue Jan 17, 2022
lgfa29 pushed a commit that referenced this issue Jan 17, 2022
lgfa29 pushed a commit that referenced this issue Jan 18, 2022
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 12, 2022